Judge Arena: Benchmarking LLMs as Evaluators
Published November 19, 2024

By kyle (kaikaidai, AtlaAI), Maurice (MauriceBurg, AtlaAI), Roman Engeler (RomanEngeler1805, AtlaAI), Max Bartolo (mbartolo, Cohere), Clémentine Fourrier (clefourrier), Toby Drane (tobydrane, AtlaAI), Mathias Leys (mathias-atla, AtlaAI), Jake Golden (jacksongolden, AtlaAI)

LLM-as-a-Judge has emerged as a popular way to grade natural language outputs from LLM applications, but how do we know which models make the best judges? We're excited to launch Judge Arena, a platform that lets anyone easily compare models as judges side by side. Just run the judges on a test sample and vote for the judge you agree with most. The results are organized into a leaderboard that displays the best judges.

Judge Arena

Crowdsourced, randomized battles have proven effective at benchmarking LLMs. LMSYS's Chatbot Arena has collected over 2M votes and is highly regarded as a field test for identifying the best language models. Since LLM evaluations aim to capture human preferences, direct human feedback is also key to determining which AI judges are most helpful.

How it works

1. Choose your sample for evaluation: let the system randomly generate a 👩 User Input / 🤖 AI Response pair, or input your own custom sample.
2. Two LLM judges will score the response and provide their reasoning for the score.
3. Review both judges' evaluations and vote for the one that best aligns with your judgment. (We recommend reviewing the scores before comparing critiques.)
4. After each vote, you can regenerate judges to get new evaluations of the same sample, or start a 🎲 new round: randomly generate a new sample to be evaluated, or input a new custom sample.

To avoid bias and potential abuse, the model names are only revealed after a vote is submitted.
Selected Models

Judge Arena focuses on the LLM-as-a-Judge approach, and therefore only includes generative models (excluding classifier models that solely output a score). We formalize our selection criteria for AI judges as follows:

- The model should possess the ability to score AND critique other models' outputs effectively.
- The model should be promptable to evaluate in different scoring formats, for different criteria.

We selected 18 state-of-the-art LLMs for our leaderboard. While many are open-source models with public weights, we also included proprietary API models to enable direct comparison between open and closed approaches.

- OpenAI: GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo
- Anthropic: Claude 3.5 Sonnet / Haiku, Claude 3 Opus / Sonnet / Haiku
- Meta: Llama 3.1 Instruct Turbo 405B / 70B / 8B
- Alibaba: Qwen 2.5 Instruct Turbo 7B / 72B, Qwen 2 Instruct 72B
- Google: Gemma 2 9B / 27B
- Mistral: 7B Instruct v0.3, 7B Instruct v0.1

The current list represents the models most commonly used in AI evaluation pipelines. We look forward to adding more models if our leaderboard proves to be useful.

The Leaderboard

The votes collected from the Judge Arena will be compiled and displayed on a dedicated public leaderboard. We calculate an Elo score for each model and will update the leaderboard hourly.

Early Insights

These are only very early results, but here's what we've observed so far:

- Mix of top performers between proprietary and open source: GPT-4 Turbo leads by a narrow margin, but the Llama and Qwen models are extremely competitive, surpassing the majority of proprietary models.
- Smaller models show impressive performance: Qwen 2.5 7B and Llama 3.1 8B are performing remarkably well and competing with much larger models.
As we gather more data, we hope to better understand the relationship between model scale and judging ability.

- Preliminary empirical support for emerging research: the LLM-as-a-Judge literature suggests that Llama models are well suited as base models, demonstrating strong out-of-the-box performance on evaluation benchmarks. Several approaches, including Lynx, Auto-J, and SFR-LLaMA-3.1-Judge, opted to start with Llama models before post-training for evaluation capabilities. Our provisional results align with this trend, with Llama 3.1 70B and 405B ranking 2nd and 3rd, respectively.

As the leaderboard takes shape over the coming weeks, we look forward to sharing further analysis of the results on our blog.

How to contribute

We hope the Judge Arena is a helpful resource for the community. By contributing to this leaderboard, you'll help developers determine which models to use in their evaluation pipelines. We're committed to sharing 20% of the anonymized voting data in the coming months, as we hope developers, researchers, and users will leverage our findings to build more aligned evaluators.

We'd love to hear your feedback! For general feature requests or to submit / suggest new models to add to the arena, please open up a discussion in the community tab or talk to us on Discord. Don't hesitate to let us know if you have questions or suggestions by messaging us on X/Twitter.

Atla currently funds this out of our own pocket. We are looking for API credits (with no strings attached) to support this community effort - please get in touch at support@atla-ai.com if you are interested in collaborating 🤗

Credits

Thanks to all the folks who helped test this arena, and shout out to the LMSYS team for the inspiration. Special mention to Clémentine Fourrier and the Hugging Face team for making this possible!
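For the technically curious: the Elo score mentioned in the Leaderboard section can be computed from pairwise votes with the standard Elo update rule. The sketch below assumes textbook Elo with an illustrative K-factor of 32 and a base rating of 1000; the post does not specify the exact parameters Judge Arena uses.

```python
# Minimal sketch of an Elo update from pairwise judge votes.
# K=32 and the 1000 base rating are illustrative assumptions,
# not Judge Arena's actual configuration.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that judge A beats judge B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float,
               a_won: bool, k: float = 32) -> tuple[float, float]:
    """Return the new ratings for both judges after one vote."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```

For example, when two judges start at the same rating, a single win moves the winner up by K/2 points and the loser down by the same amount; as ratings diverge, upset wins move more points than expected wins.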