research-article

What's in a name?: an unsupervised approach to link users across communities

Authors:

Hsiao-Wuen Hon

WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining

Pages 495 - 504

https://doi.org/10.1145/2433396.2433457

Published: 04 February 2013

Abstract

In this paper, we consider the problem of linking users across multiple online communities. Specifically, we focus on the alias-disambiguation step of this user linking task, which is meant to differentiate users who share the same username. We start by quantitatively analyzing the importance of the alias-disambiguation step through a survey of 153 volunteers and an experimental analysis of a large About.me dataset (75,472 users). The analysis shows that solving alias disambiguation addresses a major part of the user linking problem, measured by the coverage of true pairwise decisions (46.8%). To the best of our knowledge, this is the first study of human behavior with regard to the usage of online usernames. We then cast the alias-disambiguation step as a pairwise classification problem and propose a novel unsupervised approach. The key idea of our approach is to automatically label training instances based on two observations: (a) rare usernames are likely owned by a single natural person, e.g., pennystar88 as a positive instance; (b) common usernames are likely owned by different natural persons, e.g., tank as a negative instance. We propose using the n-gram probabilities of usernames to estimate their rareness or commonness. Moreover, these two observations are verified using a Yahoo! Answers dataset. The empirical evaluations on 53 forums verify (a) the effectiveness of the classifiers trained on the automatically generated data and (b) that the rareness and commonness of usernames can help user linking. We also analyze the cases where the classifiers fail.
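The automatic labeling idea in the abstract can be illustrated with a minimal sketch. It assumes a character-bigram model with add-one smoothing as a stand-in for whatever n-gram model the paper actually uses; the abstract does not specify the model or the cutoffs. The toy corpus, the function names (`train_char_bigram`, `log_prob`, `auto_label`), and both thresholds are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus of names; a real system would need a far larger one.
TOY_CORPUS = ["tank", "tom", "tim", "tanker", "ann", "anna", "sam", "tina"]


def train_char_bigram(corpus):
    """Count character bigrams, using '^' and '$' as start/end markers."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for name in corpus:
        chars = ["^"] + list(name.lower()) + ["$"]
        vocab.update(chars)
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, vocab


def log_prob(name, model):
    """Log-probability of a username under the add-one-smoothed bigram model."""
    bigrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot for characters never seen in training
    chars = ["^"] + list(name.lower()) + ["$"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
        for a, b in zip(chars, chars[1:])
    )


def auto_label(name, model, rare_below, common_above):
    """Label a same-username pair without human annotation.

    Returns 1 (assume same person) for rare usernames, 0 (assume different
    people) for common usernames, and None when the username is in between
    and should not be used as a training instance.
    """
    lp = log_prob(name, model)
    if lp <= rare_below:
        return 1
    if lp >= common_above:
        return 0
    return None


model = train_char_bigram(TOY_CORPUS)
for username in ["pennystar88", "tank", "tanker"]:
    # Thresholds here are assumptions, not values from the paper.
    label = auto_label(username, model, rare_below=-20.0, common_above=-10.0)
    print(username, round(log_prob(username, model), 2), label)
```

In this sketch the unnormalized log-probability also penalizes longer usernames, which is a design choice of the example rather than something the abstract states. The labeled instances would then serve as training data for a pairwise classifier.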

Index Terms

1. Information systems

Published In

WSDM '13: Proceedings of the sixth ACM international conference on Web search and data mining

February 2013

816 pages

ISBN: 9781450318693

DOI: 10.1145/2433396

General Chairs: Stefano Leonardi (Sapienza University of Rome, Italy), Alessandro Panconesi (Sapienza University of Rome, Italy)

Program Chairs: Paolo Ferragina (University of Pisa, Italy), Aristides Gionis (Yahoo! Research, Barcelona, Spain)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WSDM 2013

WSDM 2013: Sixth ACM International Conference on Web Search and Data Mining

February 4 - 8, 2013

Rome, Italy

Acceptance Rates

Overall Acceptance Rate 498 of 2,863 submissions, 17%
