SOFTWARE SYSTEM FOR ASSESSING THE READABILITY OF DISTORTED TEXTS FOR INFORMATIONAL LANGUAGE MEASUREMENTS
Abstract
Informational measurements of language are needed to build the language models used in optical character recognition, speech recognition, text data compression, error detection and automatic text correction. Since an automatic corrector should approach the quality of a qualified specialist's work, it is necessary to study in depth how experts work with distorted text and to search for objective regularities in how they correct errors. To objectify expert assessments of the information characteristics of language in the problem of correcting distorted texts, a software system for linguistic evaluation of the readability of distorted texts has been developed. The software system is a client-server web application: the client part runs directly in the user's browser, and the server part runs on a remote server. When correcting a distorted text, the expert assesses its readability and marks text fragments with different colors: 1 – corrected without significant effort, 2 – corrected with significant effort, 3 – cannot be unambiguously corrected. The quality of correction by linguistic experts is influenced by factors such as the degree of text distortion, the expert's level of knowledge of the language, the complexity of the text (grammar, vocabulary, style), the expert's familiarity with the subject of the text, and references to various realia (local place names, personalities, media names, specific events, etc.). Using the created system, the results of manual correction of distorted Arabic texts by experts were compared experimentally with those of automatic software correction. Two conditions for manual correction were considered: 1 – no time limit, 2 – correction time limited to 30 minutes per text. Correction accuracy was estimated using the F1 measure.
The effectiveness of the developed system for comparing the results of manual and automatic correction of distorted texts is demonstrated. A significant effect of expert qualifications on the quality of correction was revealed.
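The abstract states that correction accuracy was estimated using the F1 measure but does not say how corrections were counted. The sketch below is one plausible word-level reading, not the authors' procedure. It assumes the reference, distorted and corrected texts are already tokenized and aligned position by position. The function name `correction_f1` and the counting rules are illustrative assumptions: a true positive is a distorted word restored to the reference form, a false positive is a change that does not yield the reference form, and a false negative is a distorted word left unrestored.

```python
def correction_f1(reference, distorted, corrected):
    """Word-level precision, recall and F1 for a text corrector.

    All three arguments are lists of tokens aligned position by position
    (an assumption of this sketch; the abstract does not specify alignment).
    """
    if not (len(reference) == len(distorted) == len(corrected)):
        raise ValueError("token sequences must be aligned and of equal length")

    tp = fp = fn = 0
    for ref, dis, cor in zip(reference, distorted, corrected):
        was_error = dis != ref   # the distortion damaged this word
        changed = cor != dis     # the corrector altered this word
        if was_error and cor == ref:
            tp += 1              # damaged word restored correctly
        if changed and cor != ref:
            fp += 1              # a change that did not produce the reference word
        if was_error and cor != ref:
            fn += 1              # damaged word not restored

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Under these counting rules a miscorrected word contributes to both the false positives and the false negatives, which is a common convention but only one of several possible. The same function can score an expert's manual correction and an automatic corrector's output against the same reference, so the two can be compared on equal terms.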