Bulletin of TUIT: Management and Communication Technologies


This article addresses the tokenization of Uzbek text corpora, which must take into account the spelling conventions of the language. The types of words are analyzed, and a mathematical model of word formation is proposed for complex, paired, repeated, and compound words. Based on these models, a finite state machine is developed to demonstrate the spelling rules of the language, and the rules are implemented as regular expressions.
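
As a rough illustration of how spelling rules can be expressed as regular expressions, the following Python sketch keeps hyphen-joined words (as in paired and repeated words) and words containing apostrophe-like marks together as single tokens. The pattern, the names `PART`, `TOKEN`, `tokenize`, and `is_repeated`, and the sample sentence are assumptions made for this example; they are not the rules or the model defined in the article.

```python
import re

# Hypothetical sketch, not the article's actual rule set.
# Apostrophe-like marks that may appear in Latin-script Uzbek text.
APOS = "'’‘ʻʼ"
LETTER = r"[^\W\d_]"  # any Unicode letter

# A word part: letters, where o/g may carry a trailing mark (o', g')
# and any other letter may be followed by a mark only inside the word.
PART = rf"(?:[oOgG][{APOS}]|{LETTER}(?:[{APOS}](?={LETTER}))?)+"

# A token: hyphen-joined word parts, a run of digits, or one punctuation mark.
TOKEN = re.compile(rf"{PART}(?:-{PART})*|\d+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into tokens, keeping hyphenated words whole."""
    return TOKEN.findall(text)


def is_repeated(token: str) -> bool:
    """True if the token is two identical parts joined by a hyphen."""
    parts = token.lower().split("-")
    return len(parts) == 2 and parts[0] == parts[1]


tokens = tokenize("Ota-ona tez-tez o'zbek tilida gapiradi.")
print(tokens)
print([t for t in tokens if is_repeated(t)])
```

A hyphenated token that is not a repetition is not necessarily a paired word, so a fuller treatment would need the word-formation models and a lexicon rather than surface patterns alone.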




