DOI: 10.1145/3341161.3345332
research-article

Topic enhanced word embedding for toxic content detection in Q&A sites

Published: 15 January 2020

Abstract

Increasingly, users are adopting community question-and-answer (Q&A) sites to exchange information. Detecting and eliminating toxic and divisive content on these Q&A sites is paramount to ensuring a safe and constructive environment for users. Insincere questions, which are founded upon false premises, are one type of toxic content on Q&A sites. In this paper, we proposed a novel deep learning framework that enhances pre-trained word embeddings with topical information for insincere question classification. We evaluated the proposed framework on a large real-world dataset from the Quora Q&A site and showed that the topically enhanced word embedding achieves better results in toxic content classification. An empirical study was also conducted to analyze the topics of the insincere questions on Quora, and we found that topics on "religion", "gender" and "politics" have a higher proportion of insincere questions.
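
The abstract does not say how topical information is combined with the pre-trained embeddings. As a minimal sketch of one plausible reading, the snippet below concatenates each token's pre-trained vector with a per-word topic distribution (for example, one derived from a topic model). The function name, the toy vectors, the zero-vector fallback for unknown words, and the uniform fallback for words without topic information are all illustrative assumptions, not the authors' method.

```python
from typing import Dict, List, Sequence


def topic_enhanced_embeddings(
    tokens: Sequence[str],
    word_vectors: Dict[str, List[float]],
    word_topics: Dict[str, List[float]],
    dim: int,
    num_topics: int,
) -> List[List[float]]:
    """Return one row per token: [pre-trained vector | topic distribution].

    Assumption: enhancement is plain concatenation. Unknown words get a
    zero vector and a uniform topic distribution (both hypothetical choices).
    """
    zero_vec = [0.0] * dim
    uniform = [1.0 / num_topics] * num_topics
    return [
        list(word_vectors.get(tok, zero_vec)) + list(word_topics.get(tok, uniform))
        for tok in tokens
    ]


# Toy data standing in for pre-trained embeddings and topic-model output.
word_vectors = {"why": [0.1, 0.2, 0.3], "politics": [0.4, 0.5, 0.6]}
word_topics = {"politics": [0.9, 0.1]}

matrix = topic_enhanced_embeddings(
    ["why", "politics", "unseen"], word_vectors, word_topics, dim=3, num_topics=2
)
```

Each row has `dim + num_topics` entries and could be fed to a downstream sequence model in place of the plain embedding; whether the paper concatenates, sums, or learns the combination is not stated in the abstract.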


Published In

ASONAM '19: Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
August 2019
1228 pages
ISBN: 9781450368681
DOI: 10.1145/3341161
In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. NLP
  2. sequence model
  3. text classification
  4. toxic content
  5. word embedding

Conference

ASONAM '19

Acceptance Rates

ASONAM '19 Paper Acceptance Rate 41 of 286 submissions, 14%;
Overall Acceptance Rate 116 of 549 submissions, 21%
