Root Signals, a specialist in the evaluation of large language models (LLMs) and quality control of AI applications, recently announced the launch of Root Judge, a model designed to measure the reliability of GenAI applications. The new tool, built on Meta's open-weight Llama-3.3-70B-Instruct model, aims to set a new standard in reliable, customizable, and locally deployable evaluation.

An AI that judges AI: towards automated and reliable evaluation

Root Judge aims to address the challenges posed by LLM hallucinations and the unreliability of model-generated outputs.

Its goal is threefold:

  • Hallucination detection: it identifies, explains, and automatically blocks contextually unsupported claims in retrieval-augmented generation (RAG) pipelines (see the sketch after this list);
  • Pairwise preference judgments: it compares outputs from different models against customizable criteria;
  • Privacy compliance: it supports local deployment, ensuring data confidentiality by keeping sensitive data off external servers.
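To make the hallucination-detection and local-deployment points concrete, here is a minimal sketch of how an open-weights judge model could be invoked on-premises to check a RAG answer against its retrieved context. The Hugging Face model identifier, prompt format, and PASS/FAIL convention are illustrative assumptions, not Root Signals' published interface.

```python
# Minimal sketch of local "LLM-as-a-Judge" hallucination detection in a RAG
# pipeline. The model identifier, prompt format, and PASS/FAIL convention
# below are illustrative assumptions, not Root Signals' published interface.
from transformers import pipeline

# Hypothetical repository name; the actual open-weights release may differ.
JUDGE_MODEL = "root-signals/root-judge"

judge = pipeline("text-generation", model=JUDGE_MODEL, device_map="auto")

def judge_faithfulness(context: str, answer: str) -> str:
    """Ask the judge whether `answer` is fully supported by `context`."""
    prompt = (
        "You are an impartial judge. Decide whether the answer below is "
        "fully supported by the retrieved context. Reply with PASS or FAIL "
        "followed by a one-sentence justification.\n\n"
        f"Context:\n{context}\n\n"
        f"Answer:\n{answer}\n\n"
        "Verdict:"
    )
    out = judge(prompt, max_new_tokens=128, do_sample=False)
    # The pipeline returns the prompt plus the completion; keep only the latter.
    return out[0]["generated_text"][len(prompt):].strip()

verdict = judge_faithfulness(
    context="LUMI is a EuroHPC supercomputer hosted by CSC in Kajaani, Finland.",
    answer="LUMI is located in Stockholm, Sweden.",
)
print(verdict)  # expected: a FAIL verdict, since the answer contradicts the context
```

In a production RAG pipeline, a gating step would then block, flag, or regenerate any answer that fails this check.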

A cutting-edge training setup

Root Judge has been post-trained on a rigorously annotated dataset and optimized using advanced techniques such as Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO).
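For readers unfamiliar with these techniques, the sketch below states the published DPO objective (Rafailov et al., 2023), which trains a policy directly on preference pairs without a separate reward model, and notes how IPO (Azar et al., 2023) modifies it. These are the general formulations from the literature, not Root Signals' specific training recipe.

```latex
% DPO: given a prompt x with preferred output y_w and rejected output y_l,
% the policy \pi_\theta is optimized against a frozen reference \pi_{ref}:
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}} \left[
    \log \sigma\!\Bigl(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \Bigr)
  \right]
% IPO replaces the logistic loss with a squared regression toward a fixed
% margin 1/(2\tau) on the same log-ratio difference, which mitigates
% overfitting when preference data is near-deterministic.
```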

Root Signals, based in Palo Alto and Helsinki, leveraged the EuroHPC JU LUMI supercomputer in Kajaani, Finland, to train its "LLM-as-a-Judge" on 384 AMD Instinct MI250X GPUs.

A model that stands out

Root Judge outperforms both closed models, such as OpenAI's GPT-4o, o1-mini, and o1-preview and Anthropic's Claude 3.5 Sonnet, and similarly sized open-source LLM judges at hallucination detection and at generating explainable outputs. Its applications span all sectors, making it a versatile tool for businesses, developers, and researchers seeking reliable AI solutions tailored to their needs. We now await benchmarks against the newly released GPT-4.5 and Claude 3.7 Sonnet.

Available under an open-weights license, the model is also accessible via Root Signals EvalOps, a platform designed to measure and monitor the behavior of LLMs in production.