Abstract
This paper reports on a collaboration between Ghent University-based Assyriologists and computational linguists, who have set up a pilot study to analyse the language of Old Babylonian (OB) letters using Natural Language Processing (NLP) techniques. OB letters make an interesting dataset because (1) they are an invaluable source for everyday vernacular language, and (2) more than 5000 have been recovered, many of which are accessible in transliteration and translation through the series Altbabylonische Briefe and the Cuneiform Digital Library Initiative. Starting from a first batch of letters from OB Sippar, later extended with other Akkadian letters, we aim to develop machine learning approaches for semi-automatic text analysis and annotation of the letters. Here we present a machine learning model for Part-of-Speech (PoS) tag prediction. The input data is Akkadian in transliteration, and the best-performing model is a fine-tuned Multilingual BERT Transformer with word embeddings (weighted average F1: 90.19 %). Compared with the benchmark for PoS tagging on a larger Akkadian corpus (97.67 %), this leaves room for improvement. However, analysis of the results shows that multilingual word embeddings improve model performance, and that enlarging the corpus with a focus on certain classes could considerably improve the macro-average F1 scores.
Funding source: Belspo
Award Identifier / Grant number: B2/212/P2/CUNE-IIIF-ORM
About the authors
Gustav Ryberg Smidt is a PhD student at Ghent University on the CUNE-IIIF-ORM project. He trained in Assyriology at the University of Copenhagen and has worked with digital methods to support his research since his studies. His focus is the Old Babylonian period and the Akkadian language. Currently, his methodology leans heavily on Digital and Computational Humanities, especially computational linguistics.
Prof. Dr. Katrien De Graef (PhD Ghent University 2004) is Associate Professor of Assyriology and History of the Ancient Near East at Ghent University. Her research interests include the socio-economic history of Babylonia and Elam (late 3rd and early 2nd millennium BCE), gender studies, sealing praxis, multilingualism and digital Assyriology.
Prof. Dr. Els Lefever (PhD Ghent University 2012) is Associate Professor of computational semantics at Ghent University. Her research interests include machine learning of natural language and multilingual NLP, with a special interest in emotion, irony and argumentation mining in text, and in language modelling for low-resourced and ancient languages.
- Research ethics: Not applicable.
- Author contributions: Gustav Ryberg Smidt had main responsibility for data collection and writing of the paper, and assisted in data analysis; Katrien De Graef assisted in data collection and writing of the paper; Els Lefever had main responsibility for data analysis and assisted in writing of the paper. All authors have accepted responsibility for the content of the paper.
- Competing interests: The authors state no competing interests.
- Research funding: Belspo (B2/212/P2/CUNE-IIIF-ORM).
- Data availability: The raw data can be reproduced once the platform hosting the data is made public, which will likely be by summer 2026 at the latest.
© 2024 Walter de Gruyter GmbH, Berlin/Boston