TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks

Jiang, Dongfu; Li, Yishan; Zhang, Ge; Huang, Wenhao; Lin, Bill Yuchen; Chen, Wenhu

arXiv:2310.00752 (cs)

[Submitted on 1 Oct 2023 (v1), last revised 9 May 2024 (this version, v4)]

Abstract: We present TIGERScore, a Trained metric that follows Instruction Guidance to perform Explainable and Reference-free evaluation over a wide spectrum of text generation tasks. Unlike other automatic evaluation methods that only provide arcane scores, TIGERScore is guided by natural language instructions to provide an error analysis that pinpoints the mistakes in the generated text. Our metric is based on LLaMA-2 and trained on our meticulously curated instruction-tuning dataset, MetricInstruct, which covers 6 text generation tasks and 23 text generation datasets. The dataset consists of 42K quadruples in the form (instruction, input, system output → error analysis). We collected the system outputs from a large variety of models to cover different types of errors. To quantitatively assess our metric, we evaluate its correlation with human ratings on 5 held-in datasets and 2 held-out datasets, and show that TIGERScore achieves the open-source SoTA correlation with human ratings across these datasets and comes close to that of a GPT-4 evaluator. Although it is a reference-free metric, its correlation can even surpass that of the best existing reference-based metrics. To further qualitatively assess the rationales generated by our metric, we conducted a human evaluation of the generated explanations and found that they are 70.8% accurate. Through these experimental results, we believe TIGERScore demonstrates the possibility of building universal explainable metrics to evaluate any text generation task. All resources are released on our project website.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2310.00752 [cs.CL]
	(or arXiv:2310.00752v4 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2310.00752

Submission history

From: Dongfu Jiang
[v1] Sun, 1 Oct 2023 18:01:51 UTC (776 KB)
[v2] Wed, 6 Dec 2023 16:06:08 UTC (983 KB)
[v3] Sat, 9 Dec 2023 22:39:53 UTC (981 KB)
[v4] Thu, 9 May 2024 21:51:30 UTC (1,074 KB)
