Research Article · Free Access · Just Accepted

Interdisciplinary Fairness in Imbalanced Research Proposal Topic Inference: A Hierarchical Transformer-based Method with Selective Interpolation

Online AM: 08 June 2024

Abstract

Topic inference for research proposals aims to identify the most suitable disciplinary division within the discipline system defined by a funding agency; the agency then selects appropriate peer-review experts from its database based on this division. Automated topic inference can reduce human error in manual topic filling, bridge the knowledge gap between funding agencies and project applicants, and improve system efficiency. Existing methods model this task as hierarchical multi-label classification, using generative models to iteratively infer the most appropriate topic information. However, these methods overlook the gap in scale between interdisciplinary and non-interdisciplinary research proposals, which leads the automated inference system to misclassify interdisciplinary proposals as non-interdisciplinary and thus causes unfairness during expert assignment. How can we address this data imbalance under a complex discipline system and thereby resolve the unfairness? In this paper, we implement a topic-label inference system based on a Transformer encoder-decoder architecture. Furthermore, during training we use interpolation techniques to create pseudo-interdisciplinary proposals from non-interdisciplinary ones, guided by non-parametric indicators such as cross-topic probabilities and topic occurrence probabilities; this reduces the system's bias during model training. Finally, we conduct extensive experiments on a real-world dataset to verify the effectiveness of the proposed method. The experimental results demonstrate that our training strategy significantly mitigates the unfairness arising in the topic inference task. To improve the reproducibility of our research, we have released the accompanying code via Dropbox.
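The interpolation idea sketched in the abstract can be illustrated with a minimal, mixup-style example: two single-topic (non-interdisciplinary) proposals are blended into one pseudo-interdisciplinary training example whose label set covers both source topics. This is only an assumed sketch; the function and variable names are illustrative, and the paper's actual selection of proposal pairs relies on non-parametric indicators (cross-topic and topic-occurrence probabilities) not modeled here.

```python
import numpy as np

def interpolate_proposals(emb_a, emb_b, labels_a, labels_b, lam=0.5):
    """Convex combination of two proposals' embeddings, with a merged
    multi-hot topic label that mimics an interdisciplinary proposal."""
    pseudo_emb = lam * emb_a + (1.0 - lam) * emb_b
    # Union of the two label sets: the pseudo-proposal spans both topics.
    pseudo_labels = np.maximum(labels_a, labels_b)
    return pseudo_emb, pseudo_labels

# Two single-topic proposals over a toy 4-topic discipline system.
emb_a, labels_a = np.array([1.0, 0.0, 0.0, 0.0]), np.array([1, 0, 0, 0])
emb_b, labels_b = np.array([0.0, 1.0, 0.0, 0.0]), np.array([0, 1, 0, 0])

emb, labels = interpolate_proposals(emb_a, emb_b, labels_a, labels_b, lam=0.7)
print(emb)     # [0.7 0.3 0.  0. ]
print(labels)  # [1 1 0 0]
```

Training on such synthetic examples alongside the real data is what lets the model see interdisciplinary-looking inputs far more often than the imbalanced corpus alone would provide.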


Published In

ACM Transactions on Knowledge Discovery from Data, Just Accepted
EISSN: 1556-472X
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Online AM: 08 June 2024
    Accepted: 01 June 2024
    Revised: 05 May 2024
    Received: 04 September 2023
