research-article

Multi-Label Annotation and Classification of Arabic Texts Based on Extracted Seed Keyphrases and Bi-Gram Alphabet Feed Forward Neural Networks Model

Published: 25 November 2022

Abstract

Text classification is a fundamental problem in natural language processing. Multi-label classification, in which an instance can be associated with more than one label, is a challenging variant of it. This paper presents a multi-label annotation and classification methodology for Arabic text data that has not previously been labeled as multi-label, with the aim of analyzing and comparing the performance of various multi-label learning approaches. The work has two phases. The first phase automatically annotates hotel reviews with more than one label based on the aspects found in the reviews, using clusters of extracted seed keyphrases. The second phase runs experiments to compare the performance of various multi-label classification learning methods. In this phase, we introduce several models, including a feed-forward network model that learns a vector representation based on the bi-gram alphabet rather than the commonly used bag-of-words model. The bi-gram alphabet vector representation has the advantages of reduced feature dimensions and of not requiring natural language processing tools. The results indicate that the feed-forward neural network with the bi-gram alphabet vector representation is a competitive solution for the multi-label text classification problem: it achieved an accuracy of about 75.2% with a standard deviation of 0.062.
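
To make the idea of a bi-gram alphabet representation concrete, the following is a minimal sketch, not the paper's implementation. It assumes the feature vector simply counts adjacent letter pairs drawn from a fixed alphabet, so the dimension is the square of the alphabet size (784 for a 28-letter alphabet) regardless of vocabulary size. The function name `bigram_alphabet_vector`, the constant `ARABIC_LETTERS`, and the choice to skip pairs containing spaces or other non-alphabet characters are all illustrative assumptions.

```python
from itertools import product

# Assumption: the 28 basic Arabic letters, without hamza or other variants.
ARABIC_LETTERS = "ابتثجحخدذرزسشصضطظعغفقكلمنهوي"


def bigram_alphabet_vector(text, alphabet):
    """Return counts of adjacent character pairs over `alphabet`.

    The vector has len(alphabet) ** 2 entries, one per ordered letter pair.
    Pairs that contain any character outside the alphabet are skipped.
    """
    # Map every ordered letter pair to a fixed position in the vector.
    index = {a + b: i for i, (a, b) in enumerate(product(alphabet, repeat=2))}
    vector = [0] * len(index)
    # Slide over adjacent characters and count the pairs that are in the index.
    for first, second in zip(text, text[1:]):
        position = index.get(first + second)
        if position is not None:
            vector[position] += 1
    return vector


# Hypothetical usage: a two-word review fragment mapped to a 784-dimensional vector.
features = bigram_alphabet_vector("فندق جميل", ARABIC_LETTERS)
```

No tokenizer, stemmer, or vocabulary is needed to build the vector, which is the property the abstract highlights. In a multi-label setting such a vector would typically feed a network with one sigmoid output per label, although the abstract does not specify the output layer.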

Acknowledgment

I thank all the reviewers who took the time to read my paper and provide feedback. Their suggestions were immensely helpful, and I have addressed all the concerns they raised.
I redrafted the required portions, explained some areas in more detail, corrected typographical, grammatical, and language issues, added examples, equations, and pseudo-code to clarify the technique, and included experiments to demonstrate the quality of the automatically generated labels.

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 1
January 2023
340 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3572718

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 November 2022
Online AM: 01 June 2022
Accepted: 16 May 2022
Revised: 21 March 2022
Received: 11 August 2021
Published in TALLIP Volume 22, Issue 1

Author Tags

  1. Multi-label classification
  2. multi-label annotation
  3. sentiment analysis
  4. vector representation
  5. bi-gram alphabet
  6. neural networks
  7. Arabic text

Qualifiers

  • Research-article
  • Refereed
