Jul 26, 2024 · We introduce OfficeBench, one of the first office automation benchmarks for evaluating current LLM agents' capability to address office tasks in realistic ...
May 17, 2024 · We assess the ability of language agents to perform complex office workflows across multiple applications using customized evaluation ...
OfficeBench is one of the first office automation benchmarks for language agents. We assess the ability of language agents to perform complex office workflows ...
The OfficeBench benchmark operates within a Docker environment pre-installed with office applications such as Word, Excel, calendar, and email clients to ...
Jul 29, 2024 · The OfficeBench paper presents a comprehensive and well-designed benchmark for evaluating language agents in office automation scenarios.
OFFICEBENCH: Benchmarking Language Agents across Multiple Applications for Office Automation New paper alert! Check our latest LLM agent benchmark on the office ...
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation. Zilong Wang, Yuedong Cui, Li Zhong, Zimin Zhang, Da Yin, Bill ...