research-article

Language models and fusion for authorship attribution

Published: 01 November 2019

Abstract

We address the task of authorship attribution, i.e., identifying the author of a document of unknown authorship, and propose the use of part-of-speech (POS) tags as features for language modeling. The experiments are carried out on corpora atypical for the task, i.e., with documents written by non-professional writers, such as movie reviews or tweets. The former corpus is homogeneous with respect to topic, which makes the task more challenging. The latter corpus places language models in a setting of continuously and rapidly evolving language, unique and noisy writing style, and the limited length of social media messages. While we find that language models based on POS tags are competitive in only one of the corpora (movie reviews), they generally provide efficiency benefits and robustness against data sparsity. Furthermore, we experiment with model fusion, where language models based on different modalities are combined. By linearly combining three language models, based on characters, words, and POS trigrams, respectively, we achieve the best generalization accuracy of 96% on movie reviews, while the combination of language models based on characters and POS trigrams provides 54% accuracy on the Twitter corpus. In fusion, POS language models prove to be essential and effective components.
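The linear fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the add-one smoothing, the use of trigrams for every view, the per-symbol averaging of log-probabilities, the equal weights, the toy documents, and the hand-written POS tags are all assumptions made for illustration. In practice the POS view would come from a tagger and the weights would be tuned on held-out data.

```python
import math
from collections import Counter


def ngrams(seq, n=3):
    """Return the n-grams of a sequence, left-padded with start symbols."""
    padded = ["<s>"] * (n - 1) + list(seq)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


class TrigramLM:
    """Trigram model over arbitrary symbols with add-one smoothing (assumed)."""

    def __init__(self, sequences):
        self.tri, self.ctx, self.vocab = Counter(), Counter(), set()
        for seq in sequences:
            self.vocab.update(seq)
            for g in ngrams(seq):
                self.tri[g] += 1       # count of the full trigram
                self.ctx[g[:-1]] += 1  # count of its two-symbol context

    def avg_logprob(self, seq):
        """Average log-probability per symbol, so views of different length are comparable."""
        grams = ngrams(seq)
        v = len(self.vocab) + 1  # one extra slot for unseen symbols
        total = sum(math.log((self.tri[g] + 1) / (self.ctx[g[:-1]] + v)) for g in grams)
        return total / max(len(grams), 1)


def views(text, tags):
    """Three views of one document; `tags` are hand-written, hypothetical POS tags."""
    return {"char": list(text), "word": text.split(), "pos": tags.split()}


def fuse(scores, weights):
    """Linear fusion: weighted sum of the per-modality scores for each author."""
    authors = next(iter(scores.values())).keys()
    return {a: sum(weights[m] * scores[m][a] for m in weights) for a in authors}


# Toy training data (invented for this sketch, not taken from the paper's corpora).
train = {
    "A": [views("the film was great fun", "DT NN VBD JJ NN"),
          views("the film was great again", "DT NN VBD JJ RB")],
    "B": [views("worst plot ever seen", "JJS NN RB VBN"),
          views("worst acting ever seen", "JJS NN RB VBN")],
}
WEIGHTS = {"char": 1 / 3, "word": 1 / 3, "pos": 1 / 3}  # assumed equal weights

# One language model per modality and author.
models = {m: {a: TrigramLM([d[m] for d in docs]) for a, docs in train.items()}
          for m in WEIGHTS}

query = views("the film was great", "DT NN VBD JJ")
scores = {m: {a: lm.avg_logprob(query[m]) for a, lm in per_author.items()}
          for m, per_author in models.items()}
fused = fuse(scores, WEIGHTS)
best = max(fused, key=fused.get)  # author with the highest fused score
print(best, fused)
```

Dropping a modality from `WEIGHTS` (for example, keeping only `char` and `pos`) gives the two-model combination mentioned for the Twitter corpus, again only as a sketch.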

          Published In

          Information Processing and Management: an International Journal, Volume 56, Issue 6
          Nov 2019
          457 pages

          Publisher

          Pergamon Press, Inc.

          United States

          Author Tags

          1. Authorship attribution
          2. Language models
          3. Computational linguistics
          4. Text classification
          5. Machine learning
