Text Message Clustering
DOI: https://doi.org/10.18372/1990-5548.78.18255
Keywords: text message analysis, machine learning, Embedded Word2Vec, Mini Batch K-means, longest common subsequence method, clustering, SMS
Abstract
The clustering of text messages into groups is considered, which can be useful for building a personalized approach in various systems. To solve this problem, an embedded Word2Vec representation of the messages is proposed. To improve the grouping, mini-batch k-means is suggested as a method with lower computational demands. This recommendation aligns with the practical need for efficient and scalable clustering, especially when dealing with large datasets. Furthermore, a proposed metric based on the longest common subsequence is highlighted as a valuable tool for evaluating the similarity of texts. This metric not only serves as a means to assess clustering quality but also underscores the methodological approach of working directly with the text data. The combination of these techniques presents a comprehensive framework for robust and effective text clustering, with potential applications in diverse fields such as personalized system interactions and information retrieval.
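As an illustration of the described pipeline, the following minimal sketch embeds short messages with Word2Vec (via gensim) and groups them with scikit-learn's MiniBatchKMeans. The toy corpus, parameter values, and the mean-of-word-vectors message representation are assumptions for demonstration, not the exact configuration used in the paper.

# Minimal sketch: Word2Vec message embeddings + mini-batch k-means clustering.
# Corpus, parameters, and averaging scheme are illustrative assumptions.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import MiniBatchKMeans

# Toy SMS-like corpus; in practice this would be the full message dataset.
messages = [
    "your verification code is 1234",
    "use code 9876 to log in",
    "meeting moved to 3 pm today",
    "are we still on for lunch today",
]
tokenized = [m.lower().split() for m in messages]

# Train a small Word2Vec model on the tokenized messages.
w2v = Word2Vec(sentences=tokenized, vector_size=50, window=3, min_count=1, epochs=50)

def message_vector(tokens, model):
    """Represent a message as the mean of its word vectors (zero vector if empty)."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

X = np.vstack([message_vector(t, w2v) for t in tokenized])

# Mini-batch k-means updates centroids from small random batches per iteration,
# which keeps memory and compute low on large message collections.
kmeans = MiniBatchKMeans(n_clusters=2, batch_size=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(labels)  # cluster index per message, e.g. verification codes vs. scheduling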
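The similarity measure based on the longest common subsequence can likewise be sketched with a standard dynamic-programming formulation; normalizing by the length of the longer message is an illustrative assumption rather than the paper's exact definition.

# Minimal sketch: longest-common-subsequence (LCS) similarity between two messages.
def lcs_length(a: str, b: str) -> int:
    """Dynamic-programming length of the longest common subsequence of a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcs_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: LCS length normalized by the longer message (assumed)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))

print(lcs_similarity("your code is 1234", "use code 9876"))  # partial overlap of related messages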