Are Layout-Infused Language Models Robust to Layout Distribution Shifts? A Case Study with Scientific Documents

Chen, Catherine; Shen, Zejiang; Klein, Dan; Stanovsky, Gabriel; Downey, Doug; Lo, Kyle

arXiv:2306.01058 (cs)

[Submitted on 1 Jun 2023]

Abstract: Recent work has shown that infusing layout features into language models (LMs) improves processing of visually-rich documents such as scientific papers. Layout-infused LMs are often evaluated on documents with familiar layout features (e.g., papers from the same publisher), but in practice models encounter documents with unfamiliar distributions of layout features, such as new combinations of text sizes and styles, or new spatial configurations of textual elements. In this work we test whether layout-infused LMs are robust to layout distribution shifts. As a case study we use the task of scientific document structure recovery, segmenting a scientific paper into its structural categories (e.g., "title", "caption", "reference"). To emulate distribution shifts that occur in practice we re-partition the GROTOAP2 dataset. We find that under layout distribution shifts model performance degrades by up to 20 F1. Simple training strategies, such as increasing training diversity, can reduce this degradation by over 35% relative F1; however, models fail to reach in-distribution performance in any tested out-of-distribution conditions. This work highlights the need to consider layout distribution shifts during model evaluation, and presents a methodology for conducting such evaluations.
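As a rough illustration of the kind of evaluation the abstract describes, the Python sketch below holds out whole publishers so that test documents come from layouts unseen in training, and computes how much of an in-distribution/out-of-distribution F1 gap a training strategy closes. The `publisher` field, the toy records, and both function names are hypothetical: the abstract does not specify how GROTOAP2 is re-partitioned, and the gap computation is only one plausible reading of a "relative" reduction in degradation.

```python
# Toy records; the "publisher" field is an assumed grouping key, not GROTOAP2's schema.
DOCS = [
    {"id": 1, "publisher": "A"},
    {"id": 2, "publisher": "A"},
    {"id": 3, "publisher": "B"},
    {"id": 4, "publisher": "C"},
    {"id": 5, "publisher": "C"},
]


def split_by_publisher(docs, held_out):
    """Return (train, ood_test) so held-out publishers never appear in train."""
    held_out = set(held_out)
    train = [d for d in docs if d["publisher"] not in held_out]
    ood_test = [d for d in docs if d["publisher"] in held_out]
    return train, ood_test


def relative_gap_reduction(in_dist_f1, ood_f1_baseline, ood_f1_improved):
    """Fraction of the in-distribution vs. out-of-distribution F1 gap that is closed."""
    baseline_gap = in_dist_f1 - ood_f1_baseline
    improved_gap = in_dist_f1 - ood_f1_improved
    return (baseline_gap - improved_gap) / baseline_gap


train, ood_test = split_by_publisher(DOCS, held_out={"C"})
print([d["id"] for d in train], [d["id"] for d in ood_test])

# Made-up F1 scores for illustration only; these are not results from the paper.
print(relative_gap_reduction(80.0, 60.0, 70.0))
```

Splitting by group rather than by random document keeps every layout from a held-out publisher out of training, which is what makes the test set out-of-distribution with respect to layout.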

Comments:	To appear in ACL Findings 2023
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2306.01058 [cs.CL]
	(or arXiv:2306.01058v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2306.01058

Submission history

From: Catherine Chen
[v1] Thu, 1 Jun 2023 18:01:33 UTC (1,004 KB)
