Showing 1–5 of 5 results for author: Boyd-Graber, J L

Searching in archive cs.
  1. arXiv:2406.16342  [pdf, other]

    cs.CL

    ADVSCORE: A Metric for the Evaluation and Creation of Adversarial Benchmarks

    Authors: Yoo Yeon Sung, Eve Fleisig, Ishani Mondal, Jordan Lee Boyd-Graber

    Abstract: Adversarial benchmarks validate model abilities by providing samples that fool models but not humans. However, despite the proliferation of datasets that claim to be adversarial, there does not exist an established metric to evaluate how adversarial these datasets are. To address this lacuna, we introduce ADVSCORE, a metric which quantifies how adversarial and discriminative an adversarial dataset…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2401.11185

  2. arXiv:2406.10900  [pdf, other]

    cs.CV cs.CL

    AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models

    Authors: Xiyang Wu, Tianrui Guan, Dianqi Li, Shuaiyi Huang, Xiaoyu Liu, Xijun Wang, Ruiqi Xian, Abhinav Shrivastava, Furong Huang, Jordan Lee Boyd-Graber, Tianyi Zhou, Dinesh Manocha

    Abstract: Large vision-language models (LVLMs) hallucinate: certain context cues in an image may trigger the language module's overconfident and incorrect reasoning on abnormal or hypothetical objects. Though a few benchmarks have been developed to investigate LVLM hallucinations, they mainly rely on hand-crafted corner cases whose fail patterns may hardly generalize, and finetuning on them could undermine…

    Submitted 16 June, 2024; originally announced June 2024.

  3. arXiv:2406.04643  [pdf, other]

    cs.CL

    More Victories, Less Cooperation: Assessing Cicero's Diplomacy Play

    Authors: Wichayaporn Wongkamjan, Feng Gu, Yanze Wang, Ulf Hermjakob, Jonathan May, Brandon M. Stewart, Jonathan K. Kummerfeld, Denis Peskoff, Jordan Lee Boyd-Graber

    Abstract: The boardgame Diplomacy is a challenging setting for communicative and cooperative artificial intelligence. The most prominent communicative Diplomacy AI, Cicero, has excellent strategic abilities, exceeding human players. However, the best Diplomacy players master communication, not just tactics, which is why the game has received attention as an AI challenge. This work seeks to understand the de…

    Submitted 7 June, 2024; originally announced June 2024.

  4. arXiv:2402.11161  [pdf, other]

    cs.CL cs.AI

    PEDANTS (Precise Evaluations of Diverse Answer Nominee Text for Skinflints): Efficient Evaluation Analysis and Benchmarking for Open-Domain Question Answering

    Authors: Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Lee Boyd-Graber

    Abstract: Question answering (QA) can only make progress if we know if an answer is correct, but for many of the most challenging and interesting QA examples, current efficient answer correctness (AC) metrics do not align with human judgments, particularly verbose, free-form answers from large language models (LLMs). There are two challenges: a lack of diverse evaluation data and that models are too big and…

    Submitted 6 July, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: Efficient PEDANTS Classifier for short-form QA in github: https://github.com/zli12321/qa_metrics. arXiv admin note: text overlap with arXiv:2401.13170

  5. arXiv:2312.01308  [pdf, other]

    cs.CL

    Bridging Background Knowledge Gaps in Translation with Automatic Explicitation

    Authors: HyoJung Han, Jordan Lee Boyd-Graber, Marine Carpuat

    Abstract: Translations help people understand content written in another language. However, even correct literal translations do not fulfill that goal when people lack the necessary background to understand them. Professional translators incorporate explicitations to explain the missing context by considering cultural differences between source and target audiences. Despite its potential to help users, NLP…

    Submitted 3 December, 2023; originally announced December 2023.

    Comments: EMNLP 2023