DOI: 10.1145/3594536.3595171

No Labels? No Problem! Experiments with active learning strategies for multi-class classification in imbalanced low-resource settings

Published: 07 September 2023

Abstract

Labeling textual corpora in their entirety is infeasible in most practical situations, yet it is a common need in public and private organizations today. In contexts with large unlabeled datasets, active learning methods may reduce the manual labeling effort by selecting the samples deemed most informative for the learning process. This paper elaborates on a method for multi-class classification based on state-of-the-art NLP active learning techniques, performing various experiments in low-resource and imbalanced settings. In particular, we use a dataset of Dutch legal documents constructed with two levels of imbalance; we study the performance of task-adapting a pre-trained Dutch language model, BERTje, and of using active learning to fine-tune the model to the task, testing several selection strategies. We find that, on the constructed datasets, an entropy-based strategy slightly improves the F1, precision, and recall convergence rates, and that the improvements are most pronounced on the severely imbalanced dataset. These results show promise for active learning in low-resource imbalanced domains but also leave room for further improvement.
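
To make the selection strategy concrete, the sketch below shows one round of pool-based active learning with entropy-based acquisition, the strategy the abstract reports as most effective. It is a minimal illustration, not the authors' implementation: predict_proba, fine_tune, and ask_oracle are hypothetical placeholders standing in for a BERTje-style classifier and a human annotator; only the entropy ranking itself is spelled out.

import numpy as np

def entropy_acquisition(probs, k):
    """Return indices of the k pool samples with the highest predictive entropy.
    probs: (n_samples, n_classes) array of class probabilities from the current model."""
    eps = 1e-12  # guard against log(0)
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return np.argsort(entropy)[::-1][:k]  # most uncertain samples first

# One active-learning round (hypothetical helpers, for illustration only):
# probs = predict_proba(model, unlabeled_pool)   # class probabilities for each pool text
# picked = entropy_acquisition(probs, k=32)      # query the k most uncertain texts
# labeled_pool += [ask_oracle(unlabeled_pool[i]) for i in picked]
# unlabeled_pool = [x for i, x in enumerate(unlabeled_pool) if i not in set(picked)]
# model = fine_tune(model, labeled_pool)         # fine-tune on the grown labeled set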


Information

Published In

ICAIL '23: Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law
June 2023
499 pages
ISBN: 9798400701979
DOI: 10.1145/3594536
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

  • IAAIL: International Association for Artificial Intelligence and Law

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 September 2023

Author Tags

  1. Active Learning
  2. Dutch Legal Domain
  3. Learning Convergence
  4. Semi-supervised classification
  5. Task Adaptation
  6. Transfer Learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICAIL 2023
Sponsor:
  • IAAIL

Acceptance Rates

Overall Acceptance Rate 69 of 169 submissions, 41%
