Analysis and selection of methods for keyword extraction from texts: a review of existing approaches and practical application
DOI:
https://doi.org/10.15276/ict.02.2025.48
Keywords:
Natural language processing, keywords, TF-IDF, RAKE, TextRank, BERT, KeyBERT, embeddings, spaCy, ConceptNet
Abstract
The article addresses the problem of automatic keyword extraction from texts, an important stage in natural language processing (NLP). The relevance of the topic is driven by the rapid growth of textual data, which requires systematic organization and analysis. The main approaches to keyword extraction are analyzed: classical statistical methods (TF-IDF, RAKE, TextRank), modern semantic algorithms (BERT, KeyBERT, embeddings with clustering), and third-party tools and APIs (ConceptNet, spaCy, HuggingFace Transformers). It is shown that statistical methods are simple to implement but less accurate than modern models because they do not account for context and semantics. Semantic approaches produce higher-quality results, although they are more resource-intensive. Particular attention is given to practical experiments with Ukrainian texts, which were pre-translated into English so that English-language models could be used. This approach yielded better results, since most libraries are optimized for English corpora; attempts at back-translation, however, revealed difficulties in preserving the original meaning. Experimental studies showed that KeyBERT was the most effective of the considered methods: it combines result relevance, speed, and ease of integration, making it suitable for both scientific research and applied information systems. In conclusion, the use of KeyBERT on English-language texts is justified as the optimal solution for keyword extraction tasks. Prospective directions for development are also outlined: support for multilingual corpora, adaptation to domain-specific texts, and optimization of models for large-scale data processing.
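As an illustration of the contrast the abstract draws between statistical and semantic methods, the following minimal Python sketch ranks keywords with a TF-IDF baseline and with KeyBERT. It is a sketch only, assuming the scikit-learn and keybert packages (and KeyBERT's default sentence-transformers model) are installed; the sample document is illustrative and is not taken from the article's corpus.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from keybert import KeyBERT

    doc = (
        "Natural language processing enables the systematic organization "
        "and analysis of rapidly growing collections of textual data."
    )

    # Statistical baseline: with a one-document corpus the IDF component is
    # uniform, so this reduces to a term-frequency ranking without stop words.
    vectorizer = TfidfVectorizer(stop_words="english")
    weights = vectorizer.fit_transform([doc]).toarray()[0]
    terms = vectorizer.get_feature_names_out()
    print("TF-IDF:", sorted(zip(terms, weights), key=lambda p: -p[1])[:5])

    # Semantic method: KeyBERT embeds candidate n-grams and the document with
    # a sentence-transformers model (all-MiniLM-L6-v2 by default) and ranks
    # candidates by cosine similarity to the document embedding.
    kw_model = KeyBERT()
    print("KeyBERT:", kw_model.extract_keywords(
        doc, keyphrase_ngram_range=(1, 2), stop_words="english", top_n=5
    ))

Because KeyBERT scores candidates against the document embedding rather than raw counts, it captures the contextual relevance that the article credits for its higher-quality results, at the cost of loading a transformer model.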