Comparative Analysis of Text Vectorization Methods

Authors

Victor Sineglazov, Illia Savenko

DOI:

https://doi.org/10.18372/1990-5548.76.17663

Keywords:

intellectual text analysis, natural language processing, text embeddings, opinion mining, machine learning, Word2Vec, TF-IDF, statistical embeddings, context-based embeddings

Abstract

The paper considers methods for vectorizing natural-language text in the context of intellectual text analysis. The most common statistical feature-extraction methods, as well as methods that take context into account, are analyzed. The work describes these two classes of text embeddings together with their most common variants and implementations. A comparative analysis was performed, which showed a relationship between the type of text-analysis task and the method that yields the best metrics. The topology of the neural network used to solve the problem and obtain the metrics is described and implemented. The comparison was carried out using the relative running-time analysis of the theory of algorithms and the classification metrics accuracy, F1-score, precision, and recall; the classification metrics were obtained from a neural network model built with each of the described embedding methods. As a result, in the task of text sentiment analysis, the statistical embedding method based on character n-grams proved to be the best.
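To illustrate the winning approach from the abstract, the sketch below builds character n-gram TF-IDF features and reports the same classification metrics (accuracy, precision, recall, F1). It is a minimal sketch, not the paper's implementation: the toy dataset, the n-gram range (2–4), and the use of a simple logistic-regression classifier in place of the paper's neural network are all illustrative assumptions.

```python
# Sketch: statistical text vectorization via character n-gram TF-IDF,
# followed by a sentiment classifier. The dataset, n-gram range, and
# classifier choice are illustrative, not taken from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

texts = [
    "great product, loved it",
    "terrible, a waste of money",
    "works fine and arrived quickly",
    "broke after one day, very disappointed",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Character n-grams (2- to 4-grams within word boundaries) instead of
# word tokens; this is the statistical embedding the abstract reports
# as best for sentiment analysis.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(texts)

# A linear classifier stands in here for the paper's neural network.
clf = LogisticRegression().fit(X, labels)
pred = clf.predict(X)

# The metrics used in the paper's comparative analysis.
acc = accuracy_score(labels, pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    labels, pred, average="binary")
print(f"accuracy={acc:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Swapping `analyzer="char_wb"` for the default word-level analyzer (or replacing the vectorizer with pretrained Word2Vec/GloVe vectors) reproduces the kind of method-by-method comparison the paper performs.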

Author Biographies

Victor Sineglazov, National Aviation University, Kyiv

Doctor of Engineering Science

Professor

Head of the Department of Aviation Computer-Integrated Complexes 

Faculty of Air Navigation Electronics and Telecommunications

Illia Savenko, National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"

MSc in Computer Science

Artificial Intelligence Department, Institute for Applied System Analysis

References

C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008. https://doi.org/10.1017/CBO9780511809071.

T. Mikolov, Statistical Language Models Based on Neural Networks, Ph.D. thesis, Brno University of Technology, 2012.

T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv:1301.3781 [cs], January 2013.

J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar: Association for Computational Linguistics, October 2014. https://doi.org/10.3115/v1/D14-1162.

T. T. Vu, V. A. Nguyen, and T. B. Le, "Combining Word2Vec and TF-IDF with Supervised Learning for Short Text Classification," in 2020 3rd International Conference on Computational Intelligence (ICCI), pp. 241–245, IEEE, 2020.

M. Lin, S. Liao, and Y. Huang, "Hybrid word2vec and TF-IDF approach for sentiment classification," Journal of Information Science, 45(6), pp. 797–806, 2019.

Published

2023-06-23

Section

COMPUTER SCIENCES AND INFORMATION TECHNOLOGIES