Comparative Analysis of Text Vectorization Methods
Keywords: intellectual text analysis, natural language processing, text embeddings, opinion mining, machine learning, Word2Vec, TF-IDF, statistical embeddings, context-based embeddings
The paper examines methods for vectorizing natural-language text features in the context of intellectual text analysis. It analyzes the most common statistical feature-extraction methods and the methods that take context into account. The work describes both types of text embeddings together with their most common variations and implementations. A comparative analysis shows how the type of text analysis task relates to the method that achieves the best metrics. The topology of the neural network used to solve the task and obtain the metrics is described and implemented. The comparison relies on relative time analysis from the theory of algorithms and on the classification metrics accuracy, F1-score, precision, and recall. The classification metrics are taken from neural network models built with each of the described vectorization methods. For the task of sentiment analysis (analyzing the tonality of a text), the statistical vectorization method based on character n-grams performed best.
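As an illustration of the kind of method the abstract refers to, the following is a minimal sketch of TF-IDF vectorization over character n-grams, together with the four classification metrics named above. It is not the authors' implementation: the function names, the n-gram length of 3, the smoothed IDF variant, and the toy corpus are all assumptions made for this example, and the neural network and dataset used in the paper are not reproduced here.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Build TF-IDF vectors over character n-grams for a small corpus.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, which is one
    common variant and not necessarily the one used in the paper.
    """
    tokenized = [char_ngrams(d, n) for d in docs]
    vocab = sorted({g for toks in tokenized for g in toks})
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for toks in tokenized for g in set(toks))
    n_docs = len(docs)
    idf = {g: math.log((1 + n_docs) / (1 + df[g])) + 1.0 for g in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for very short texts
        vectors.append([counts[g] / total * idf[g] for g in vocab])
    return vocab, vectors


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    vocab, vectors = tfidf_vectors(["good film", "bad film"], n=3)
    print(len(vocab), "character trigrams in the vocabulary")
    print(binary_metrics([1, 0, 1, 1], [1, 0, 0, 1]))
```

Character n-grams need no tokenizer or vocabulary of whole words, so misspellings and inflected forms still share most of their features. In a setup like the one the abstract describes, vectors of this kind would be fed to a classifier whose predictions are then scored with the metrics above.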