Event Coreference Data (Almost) for Free: Mining Hyperlinks from Online News

Michael Bugert, Iryna Gurevych


Abstract
Cross-document event coreference resolution (CDCR) is the task of identifying which event mentions refer to the same events throughout a collection of documents. Annotating CDCR data is an arduous and expensive process, explaining why existing corpora are small and lack domain coverage. To overcome this bottleneck, we automatically extract event coreference data from hyperlinks in online news: When referring to a significant real-world event, writers often add a hyperlink to another article covering this event. We demonstrate that collecting hyperlinks which point to the same article(s) produces extensive and high-quality CDCR data and create a corpus of 2M documents and 2.7M silver-standard event mentions called HyperCoref. We evaluate a state-of-the-art system on three CDCR corpora and find that models trained on small subsets of HyperCoref are highly competitive, with performance similar to models trained on gold-standard data. With our work, we free CDCR research from depending on costly human-annotated training data and open up possibilities for research beyond English CDCR, as our data extraction approach can be easily adapted to other languages.
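The core idea in the abstract, that hyperlinks pointing to the same article mark coreferring event mentions, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `normalize_url` and `cluster_mentions`, the `(source_doc, anchor_text, target_url)` tuple layout, and the URL normalization rules are assumptions made for illustration only.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Reduce a URL to scheme, lowercased host, and path (hypothetical rule set)."""
    parts = urlsplit(url.strip())
    # Drop query string and fragment, and ignore a trailing slash,
    # so that trivially different links to one article compare equal.
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def cluster_mentions(hyperlinks):
    """Group anchor texts by the article they link to.

    `hyperlinks` is an iterable of (source_doc, anchor_text, target_url)
    tuples. Each resulting group is treated as one silver-standard
    event coreference cluster.
    """
    clusters = defaultdict(list)
    for source_doc, anchor_text, target_url in hyperlinks:
        clusters[normalize_url(target_url)].append((source_doc, anchor_text))
    return dict(clusters)


# Invented example data: two articles link to the same report.
links = [
    ("doc_a", "the earthquake", "https://news.example.com/quake-2021/?ref=home"),
    ("doc_b", "Tuesday's quake", "https://News.Example.com/quake-2021#top"),
    ("doc_c", "the election", "https://news.example.com/election"),
]
clusters = cluster_mentions(links)
```

Here the anchor texts from `doc_a` and `doc_b` land in the same cluster because their links resolve to the same normalized target, while the `doc_c` mention forms a cluster of its own.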
Anthology ID:
2021.emnlp-main.38
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
471–491
URL:
https://aclanthology.org/2021.emnlp-main.38
DOI:
10.18653/v1/2021.emnlp-main.38
Cite (ACL):
Michael Bugert and Iryna Gurevych. 2021. Event Coreference Data (Almost) for Free: Mining Hyperlinks from Online News. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 471–491, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Event Coreference Data (Almost) for Free: Mining Hyperlinks from Online News (Bugert & Gurevych, EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.38.pdf
Video:
https://aclanthology.org/2021.emnlp-main.38.mp4
Code
ukplab/emnlp2021-hypercoref-cdcr
Data
ECB+