---
license: apache-2.0
dataset_info:
  features:
  - name: question_id
    dtype: string
  - name: category
    dtype: string
  - name: cluster
    dtype: string
  - name: turns
    list:
    - name: content
      dtype: string
  splits:
  - name: train
    num_bytes: 251691
    num_examples: 500
  download_size: 154022
  dataset_size: 251691
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

## Arena-Hard-Auto

**Arena-Hard-Auto-v0.1** ([See Paper](https://arxiv.org/abs/2406.11939)) is an automatic evaluation tool for instruction-tuned LLMs. It contains 500 challenging user queries sourced from Chatbot Arena. We prompt GPT-4-Turbo as a judge to compare each model's responses against a baseline model (default: GPT-4-0314). Notably, Arena-Hard-Auto has the highest *correlation* with and *separability* on Chatbot Arena among popular open-ended LLM benchmarks (see the paper above). If you are curious how well your model might perform on Chatbot Arena, we recommend trying Arena-Hard-Auto.
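
For a quick look at the data, the prompts can be loaded with the Hugging Face `datasets` library. The snippet below is an illustrative sketch; the field names follow the schema declared in this card's metadata:

```python
from datasets import load_dataset

# Load the 500 Arena-Hard-Auto prompts (single "train" split).
ds = load_dataset("lmarena-ai/arena-hard-auto-v0.1", split="train")

example = ds[0]
print(example["question_id"])          # unique identifier of the query
print(example["category"])             # high-level category label
print(example["cluster"])              # topic cluster the query belongs to
print(example["turns"][0]["content"])  # the user prompt text
```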

Please check out our GitHub repo for instructions on evaluating models with Arena-Hard-Auto and for more information about the benchmark.
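
For illustration only, here is a minimal sketch of the pairwise-judging step, assuming the official `openai` Python client and an `OPENAI_API_KEY` in the environment. The actual judge prompt, scoring scale, and aggregation used by Arena-Hard-Auto live in the GitHub repo and are more elaborate than this toy version:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Toy pairwise judge: ask GPT-4-Turbo which answer is better.

    This is NOT the repo's judge prompt; it only illustrates the idea.
    """
    prompt = (
        "You are an impartial judge. Compare the two assistant answers to the "
        "user question and reply with exactly 'A', 'B', or 'tie'.\n\n"
        f"[Question]\n{question}\n\n[Answer A]\n{answer_a}\n\n[Answer B]\n{answer_b}"
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# In practice, answer_a would come from the baseline (e.g. GPT-4-0314)
# and answer_b from the model under test, for each of the 500 prompts.
```

In the real pipeline, each model's answers are compared against the baseline across all 500 prompts and the verdicts are aggregated into an overall score.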

If you find this dataset useful, feel free to cite us!

```
@article{li2024crowdsourced,
  title={From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline},
  author={Li, Tianle and Chiang, Wei-Lin and Frick, Evan and Dunlap, Lisa and Wu, Tianhao and Zhu, Banghua and Gonzalez, Joseph E and Stoica, Ion},
  journal={arXiv preprint arXiv:2406.11939},
  year={2024}
}
```
