Text Message Clustering

Authors

  • Daniil Vedmiediev, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”
  • Nataliia Shapoval, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, https://orcid.org/0000-0002-8509-6886

DOI:

https://doi.org/10.18372/1990-5548.78.18255

Keywords:

text message analysis, machine learning, Embedded Word2Vec, Mini Batch K-means, longest common subsequence method, clustering, SMS

Abstract

The paper considers the division of text messages into groups (clustering), which can be useful when building a personalized approach in different systems. To solve this problem, Embedded Word2Vec is proposed. To improve the grouping, mini-batch k-means is suggested as a method with lower computational demands, which meets the practical need for efficient and scalable clustering of large datasets. A metric based on the longest common subsequence is proposed for evaluating the similarity of texts. This metric serves as a means of assessing clustering quality and works directly with the text data. Together, these techniques form a framework for text clustering, with potential applications in fields such as personalized system interactions and information retrieval.
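As an illustration of the kind of similarity measure the abstract describes, the following is a minimal sketch of a longest-common-subsequence score between two messages. The abstract does not give the exact formula, so several choices here are assumptions rather than the authors' definition: the function names `lcs_length` and `lcs_similarity` are hypothetical, messages are compared as lowercase whitespace-separated tokens, and the score is normalized by the length of the longer message.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences.

    Classic dynamic programming that keeps only one previous row,
    so memory is O(len(b)) and time is O(len(a) * len(b)).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Matching elements extend the subsequence found so far.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the best result seen so far.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(text_a, text_b):
    """Similarity in [0, 1] based on the token-level LCS.

    Assumption: tokenization by whitespace and normalization by the
    longer message are illustrative choices, not taken from the paper.
    """
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()
    longest = max(len(tokens_a), len(tokens_b))
    if longest == 0:
        return 1.0  # two empty messages are treated as identical
    return lcs_length(tokens_a, tokens_b) / longest
```

Under these assumptions, two template-like SMS messages that differ only in one token, such as `lcs_similarity("your code is 1234", "your code is 5678")`, share three of four tokens in order. Averaging such pairwise scores within a cluster is one possible way to assess clustering quality directly on the text.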

Author Biographies

Daniil Vedmiediev, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”

Master's degree student

Nataliia Shapoval, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”

Candidate of Science (Engineering)

Associate Professor

References

Frank Lin and William W. Cohen, “A Very Fast Method for Clustering Big Text Datasets,” In: Proceedings of the 2010 conference on ECAI 2010: 19th European Conference on Artificial Intelligence, 2010, pp. 303–308.

Andrew Ng, Michael Jordan, and Yair Weiss, “On spectral clustering: Analysis and an algorithm,” Advances in Neural Information Processing Systems, 14, 2001.

Ulrike von Luxburg, “A Tutorial on Spectral Clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.

Rohan Saha, “Influence of various text embeddings on clustering performance in NLP,” 2023.

A. Abdi, M. Hajsaeedi, and M. Hooshmand, “Longest Common Substring in Longest Common Subsequence's Solution Service: A Novel Hyper-heuristic,” Computational Biology and Chemistry, vol. 105, p. 107882, 2023. https://doi.org/10.1016/j.compbiolchem.2023.107882

Negev Shekel Nosatzki, “Approximating the Longest Common Subsequence problem within a sub-polynomial factor in linear time,” arXiv e-prints, 2021, https://doi.org/10.48550/arXiv.2112.08454

G. Yamini, Dr. B. Renuka Devi, “A New Hybrid Clustering Technique Based on Mini-batch K-means and K-means++ for Analysing Big Data,” International Journal of Recent Research Aspects, 2018.

Carl Allen and Timothy Hospedales, “Analogies Explained: Towards Understanding Word Embeddings,” Proceedings of the 36th International Conference on Machine Learning, PMLR 97:223–231, 2019.

Published

2023-12-27

Section

COMPUTER SCIENCES AND INFORMATION TECHNOLOGIES