A TRANSFORMER-BASED ALGORITHM FOR CLASSIFYING LONG TEXTS
Abstract
The article addresses the urgent problem of representing and classifying long text documents using transformers. Transformer-based text representation methods cannot process long sequences efficiently because their self-attention mechanism scales quadratically with sequence length. This limitation leads to high computational complexity and makes such models impractical for long documents. To overcome this drawback, an algorithm based on the SBERT transformer is developed that builds a vector representation of long text documents. The key idea of the algorithm is the application of two different procedures for creating the vector representation: the first is based on segmenting the text and averaging the segment vectors, and the second on concatenating the segment vectors. This combination of procedures preserves important information from long documents. To verify the effectiveness of the algorithm, a computational experiment was conducted on a group of classifiers built on the basis of the proposed algorithm and a group of well-known text vectorization methods, such as TF-IDF, LSA, and BoWC. The results of the experiment showed that transformer-based classifiers generally achieve better classification accuracy than the classical methods. However, this advantage comes at the cost of higher computational complexity and, accordingly, longer training and inference times. On the other hand, classical text vectorization methods such as TF-IDF, LSA, and BoWC demonstrated higher speed, making them preferable when pre-encoding is not possible and real-time operation is required. The proposed algorithm proved highly effective and increased the classification accuracy on the BBC dataset by 0.5% according to the F1 criterion.
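Since the two procedures are described above only verbally, the following Python fragment gives a minimal sketch of how they could be combined: segment vectors produced by an SBERT model are both averaged and concatenated, and the two results are joined into a single document vector. The sentence-transformers library, the all-MiniLM-L6-v2 checkpoint, the 128-word segment length, and the limit of four concatenated segments are illustrative assumptions and are not parameters taken from the article.

from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer


def split_into_segments(text: str, words_per_segment: int = 128) -> List[str]:
    # Split the document into fixed-size word windows (illustrative segmentation rule).
    words = text.split()
    return [" ".join(words[i:i + words_per_segment])
            for i in range(0, len(words), words_per_segment)] or [text]


def encode_long_document(text: str,
                         model: SentenceTransformer,
                         words_per_segment: int = 128,
                         max_segments: int = 4) -> np.ndarray:
    segments = split_into_segments(text, words_per_segment)
    seg_vecs = model.encode(segments)  # shape: (n_segments, dim)

    # Procedure 1: average the segment vectors to summarize the whole document.
    mean_vec = seg_vecs.mean(axis=0)

    # Procedure 2: concatenate a fixed number of leading segment vectors,
    # zero-padding when the document has fewer than max_segments segments.
    dim = seg_vecs.shape[1]
    fixed = np.zeros((max_segments, dim), dtype=seg_vecs.dtype)
    fixed[:min(max_segments, len(seg_vecs))] = seg_vecs[:max_segments]
    concat_vec = fixed.reshape(-1)

    # Join the outputs of both procedures into one document representation.
    return np.concatenate([mean_vec, concat_vec])


if __name__ == "__main__":
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative SBERT checkpoint
    doc_vector = encode_long_document("Some long document text ...", model)
    print(doc_vector.shape)

The resulting vector can then be fed to any conventional classifier (e.g., logistic regression or SVM); the segmentation rule, the number of retained segments, and the way the two representations are fused are design choices that the article's experiments would determine in practice.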
References
Intelligence Review, 2019, Vol. 52, No. 1, pp. 273-292.
2. Minaee S., Kalchbrenner N., Cambria E., Nikzad N., Chenaghlu M., Gao J. Deep learning-based text classification: A comprehensive review, arXiv preprint arXiv:2004.03705, 2020.
3. Mansour A.M., Mohammad J.H., Kravchenko Y.A. Text Vectorization Method Based on Concept Mining Using Clustering Techniques, 2022 VI International Conference on Information Technologies in Engineering Education (Inforino). IEEE, 2022, pp. 1-10.
4. Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A.N., Kaiser Ł., Polosukhin I. Attention
is all you need, Advances in neural information processing systems, 2017, Vol. 30.
5. Mansour A.M., Mohammad J.H., Kravchenko Y.A. Text vectorization using data mining methods,
Izvestia SFedU. Technical science, 2021, No. 2.
6. Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, 2018.
7. Ni P., Li Y., Chang V. Research on text classification based on automatically extracted keywords, International
Journal of Enterprise Information Systems (IJEIS), 2020, Vol. 16, No. 4, pp. 1-16.
8. Grootendorst M., Vanschoren J. Beyond Bag-of-Concepts: Vectors of Locally Aggregated Concepts,
Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer,
2019, pp. 681-696.
9. Qader W.A., Ameen M.M., Ahmed B.I. An overview of bag of words; importance, implementation,
applications, and challenges, 2019 International Engineering Conference (IEC). IEEE, 2019,
pp. 200-204.
10. Kim H.K., Kim H., Cho S. Bag-of-concepts: Comprehending document representation through clustering words in distributed representation, Neurocomputing, 2017, Vol. 266, pp. 336-352.
11. Rücklé A., Eger S., Peyrard M., Gurevych I. Concatenated power mean word embeddings as universal cross-lingual sentence representations, arXiv preprint arXiv:1803.01400, 2018.
12. Beltagy I., Peters M.E., Cohan A. Longformer: The long-document transformer, arXiv preprint arXiv:2004.05150, 2020.
13. Zaheer M., Guruganesh G., Dubey K.A., Ainslie J., Alberti C., Ontanon S., Pham P., Ravula A., Wang Q., Yang L. Big Bird: Transformers for longer sequences, Advances in neural information processing systems, 2020, Vol. 33, pp. 17283-17297.
14. Floridi L., Chiriatti M. GPT-3: Its Nature, Scope, Limits, and Consequences, Minds and Machines, 2020, Vol. 30, No. 4, pp. 681-694.
15. Ethayarajh K. How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings, 2019.
16. Reimers N., Gurevych I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, arXiv preprint arXiv:1908.10084, 2019.
17. Decoding Sentence-BERT | Continuum Labs. Available at: https://training.continuumlabs.ai/knowledge/
vector-databases/decoding-sentence-bert (accessed 06 July 2024).
18. Greene D., Cunningham P. Practical solutions to the problem of diagonal dominance in kernel document
clustering, Proceedings of the 23rd international conference on Machine learning, 2006,
pp. 377-384.
19. Adhikari A., Ram A., Tang R., Lin J. DocBERT: BERT for document classification, arXiv preprint arXiv:1904.08398, 2019.
20. Sabbah T., Selamat A., Selamat M.H., Al-Anzi F.S., Viedma E.H., Krejcar O., Fujita H. Modified frequency-
based term weighting schemes for text classification, Applied Soft Computing, 2017, Vol. 58,
pp. 193-206.
21. Sun Y., Zheng Y., Hao C., Qiu H. NSP-BERT: A Prompt-based Few-Shot Learner Through an Original Pre-training Task Next Sentence Prediction, 2022.
22. Wettig A., Gao T., Zhong Z., Chen D. Should You Mask 15% in Masked Language Modeling?, arXiv
preprint arXiv:2202.08005, 2022.