research-article

Language models and fusion for authorship attribution

Authors:

Olga Fourkioti,

Symeon Symeonidis,

Avi Arampatzis

Volume 56, Issue 6

https://doi.org/10.1016/j.ipm.2019.102061

Published: 01 November 2019

Abstract

We address the task of authorship attribution, i.e., identifying the author of an unknown document, and propose the use of Part-Of-Speech (POS) tags as features for language modeling. The experiments are carried out on corpora that are atypical for the task, i.e., with documents written by non-professional writers, such as movie reviews or tweets. The former corpus is homogeneous with respect to topic, which makes the task more challenging. The latter corpus places language models in a setting of continuously and rapidly evolving language, unique and noisy writing style, and the limited length of social media messages. While we find that language models based on POS tags are competitive in only one of the corpora (movie reviews), they generally provide efficiency benefits and robustness against data sparsity. Furthermore, we experiment with model fusion, where language models based on different modalities are combined. By linearly combining three language models, based on characters, words, and POS trigrams, respectively, we achieve the best generalization accuracy of 96% on movie reviews, while the combination of language models based on characters and POS trigrams provides 54% accuracy on the Twitter corpus. In fusion, POS language models prove to be essential and effective components.
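
The linear fusion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes one trigram language model per author and per view (characters, words, POS tags), add-one smoothing, length-normalized log-probabilities as scores, and equal fusion weights; the abstract specifies none of these choices. The names `TrigramLM`, `train`, `fused_scores`, and `make_doc`, the toy texts, and the hand-written POS tags are all hypothetical.

```python
import math
from collections import Counter


def trigrams(seq):
    """Overlapping trigrams of a string or a list of tokens, as tuples."""
    return [tuple(seq[i:i + 3]) for i in range(len(seq) - 2)]


class TrigramLM:
    """Trigram model with add-one smoothing (an illustrative assumption)."""

    def __init__(self, sequences):
        self.tri = Counter()   # trigram counts
        self.ctx = Counter()   # counts of the two-symbol contexts
        vocab = set()
        for seq in sequences:
            vocab.update(seq)
            for g in trigrams(seq):
                self.tri[g] += 1
                self.ctx[g[:2]] += 1
        self.v = len(vocab) + 1  # one extra slot for unseen symbols

    def avg_logprob(self, seq):
        """Mean log P(symbol | two previous symbols); 0.0 if seq is too short."""
        grams = trigrams(seq)
        if not grams:
            return 0.0
        return sum(
            math.log((self.tri[g] + 1) / (self.ctx[g[:2]] + self.v))
            for g in grams
        ) / len(grams)


def train(docs_by_author):
    """Build one model per author and view from {author: [{view: seq}, ...]}."""
    models = {}
    for author, docs in docs_by_author.items():
        views = docs[0].keys()
        models[author] = {
            view: TrigramLM([d[view] for d in docs]) for view in views
        }
    return models


def fused_scores(doc, models, weights):
    """Weighted linear combination of per-view scores for each author."""
    return {
        author: sum(w * lms[view].avg_logprob(doc[view])
                    for view, w in weights.items())
        for author, lms in models.items()
    }


def make_doc(text, tags):
    """Bundle the three views of a document; tags are supplied by hand here."""
    return {"char": text, "word": text.split(), "pos": tags}


# Toy training data with hand-written, hypothetical POS tags.
TRAIN = {
    "a": [make_doc("the cat sat on the mat the cat sat",
                   ["DT", "NN", "VBD", "IN", "DT", "NN", "DT", "NN", "VBD"])],
    "b": [make_doc("colourless green ideas sleep furiously tonight",
                   ["JJ", "JJ", "NNS", "VBP", "RB", "NN"])],
}
WEIGHTS = {"char": 1 / 3, "word": 1 / 3, "pos": 1 / 3}  # assumed, not tuned

models = train(TRAIN)
unknown = make_doc("the cat sat", ["DT", "NN", "VBD"])
scores = fused_scores(unknown, models, WEIGHTS)
predicted = max(scores, key=scores.get)  # author with the highest fused score
print(predicted, scores)
```

On this toy input the fused score favors author `a`, because the unknown text reuses that author's trigrams in all three views. In a realistic setting the POS tags would come from a tagger, the weights would be tuned on held-out data, and the per-view scores might need rescaling before they are combined.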

Index Terms

Language models and fusion for authorship attribution
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
  2. Machine learning
    1. Learning paradigms
    2. Machine learning approaches
2. Information systems

Index terms have been assigned to the content through auto-classification.

Published In

Information Processing and Management: an International Journal Volume 56, Issue 6

Nov 2019

457 pages

ISSN: 0306-4573

Elsevier Ltd.

Publisher

Pergamon Press, Inc.

United States

Citations

Cited By

Dong S, Mao J, Ke Q, Pei L (2024) Decoding the writing styles of disciplines. Information Processing and Management: an International Journal 61:4. doi:10.1016/j.ipm.2024.103718. Online publication date: 18-Jul-2024
https://dl.acm.org/doi/10.1016/j.ipm.2024.103718
Goyanes M, de-Marcos L, Domínguez-Díaz A (2024) Automatic gender detection: a methodological procedure and recommendations to computationally infer the gender from names with ChatGPT and gender APIs. Scientometrics 129:11 (6867-6888). doi:10.1007/s11192-024-05149-2. Online publication date: 1-Nov-2024
https://dl.acm.org/doi/10.1007/s11192-024-05149-2
Chanis I, Arampatzis A (2024) Enhancing phishing email detection with stylometric features and classifier stacking. International Journal of Information Security 24:1. doi:10.1007/s10207-024-00928-7. Online publication date: 7-Nov-2024
https://doi.org/10.1007/s10207-024-00928-7
Ouni S, Fkih F, Omri M (2023) A survey of machine learning-based author profiling from texts analysis in social networks. Multimedia Tools and Applications 82:24 (36653-36686). doi:10.1007/s11042-023-14711-8. Online publication date: 18-Mar-2023
https://dl.acm.org/doi/10.1007/s11042-023-14711-8
Muñoz S, Oliva C, Lago-Fernández L, Arroyo D (2022) Advancing the Use of Information Compression Distances in Authorship Attribution. Disinformation in Open Online Media (114-122). doi:10.1007/978-3-031-18253-2_8. Online publication date: 11-Oct-2022
https://dl.acm.org/doi/10.1007/978-3-031-18253-2_8
Sarwar R, Hassan S (2021) UrduAI: Writeprints for Urdu Authorship Identification. ACM Transactions on Asian and Low-Resource Language Information Processing 21:2 (1-18). doi:10.1145/3476467. Online publication date: 31-Oct-2021
https://dl.acm.org/doi/10.1145/3476467
Khatun A, Rahman A, Islam M, Chowdhury H, Tasnim A (n.d.) Authorship Attribution in Bangla Literature (AABL) via Transfer Learning using ULMFiT. ACM Transactions on Asian and Low-Resource Language Information Processing. doi:10.1145/3530691
https://dl.acm.org/doi/10.1145/3530691
