
Carlini's novel benchmark uses a dataflow DSL to assess large language models with real-world tasks, offering a fresh approach to evaluating AI capabilities beyond traditional methods.
Nicholas Carlini has just released a new benchmark for large language models (LLMs) on his GitHub, titled "Yet Another Applied LLM Benchmark." This collection of nearly 100 tests is designed to evaluate the practical capabilities of LLMs by simulating real-world tasks and scenarios. The benchmark stands out due to its innovative use of a dataflow domain-specific language (DSL) for creating and evaluating tests.
The DSL is deliberately simple, which makes it easy to add new tests and evaluate model responses. Each test specifies both how a question should be asked and how the answer should be evaluated. Evaluation methods range from running the generated code in a Docker container to asking a vision model to recognize an image.
For practitioners, this benchmark offers a more realistic way to assess LLMs. Traditional benchmarks often focus on academic or theoretical tasks, but Carlini's tests are grounded in real-world use cases. This makes it easier to understand how well an LLM can assist with actual development and problem-solving tasks.
Realistic Test Cases: The benchmark includes a wide range of tests that reflect actual interactions with LLMs, such as the three examples walked through later in this article.
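Dataflow DSL: The core of the benchmark is a dataflow DSL that simplifies test creation and evaluation. A test is written as a pipeline: the prompt and each processing stage are chained with the >> operator, which passes the output of one stage to the next.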
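One way such a >> pipeline could be built in Python is through operator overloading. The sketch below is a minimal illustration under that assumption, not the benchmark's actual code: Stage, Constant, Chain, Upper and Contains are hypothetical names invented for this example.

```python
class Stage:
    """Base class for one pipeline step (hypothetical, not the benchmark's class)."""

    def run(self, value=None):
        raise NotImplementedError

    def __rshift__(self, other):
        # stage >> stage: combine two steps into one pipeline
        return Chain(self, other)

    def __rrshift__(self, prompt):
        # "prompt" >> stage: str does not implement >>, so Python falls back
        # to the right-hand operand's __rrshift__
        return Chain(Constant(prompt), self)


class Constant(Stage):
    """Ignores its input and emits a fixed value (the prompt)."""

    def __init__(self, value):
        self.value = value

    def run(self, value=None):
        return self.value


class Chain(Stage):
    """Runs `first`, then feeds its output to `second`."""

    def __init__(self, first, second):
        self.first = first
        self.second = second

    def run(self, value=None):
        return self.second.run(self.first.run(value))


class Upper(Stage):
    """Toy processing step standing in for a model call."""

    def run(self, value=None):
        return value.upper()


class Contains(Stage):
    """Toy evaluator: does the input contain the needle?"""

    def __init__(self, needle):
        self.needle = needle

    def run(self, value=None):
        return self.needle in value


pipeline = "hello world" >> Upper() >> Contains("HELLO")
result = pipeline.run()
print(result)
```

Nothing executes until run() is called, so a pipeline built this way is a description of the test that a harness can run later, for example once per model.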
Hello World in Python
"Write hello world in python" >> LLMRun() >> PythonRun() >> SubstringEvaluator("hello world")
This test sends the prompt to the LLM, runs the generated Python code, and checks if the output contains "hello world".
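The three steps of this test can be approximated with the standard library. This is a rough sketch only: fake_llm, run_python and contains are hypothetical helpers, the model call is replaced by a canned answer, and the code runs in a local subprocess rather than in the Docker container the benchmark uses.

```python
import subprocess
import sys


def fake_llm(prompt):
    # Stand-in for a real model call; always returns the same canned program.
    return 'print("hello world")'


def run_python(code, timeout=10):
    # Execute the code in a separate interpreter and capture what it prints.
    # Assumption: a local subprocess is acceptable here; it offers no sandboxing.
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.stdout


def contains(output, needle):
    # Substring check on the captured output.
    return needle in output


output = run_python(fake_llm("Write hello world in python"))
passed = contains(output, "hello world")
print(passed)
```

Running untrusted model output directly on the host is risky, which is presumably why the benchmark isolates execution in a container.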
Ambiguous Questions
"In python what __thing__ do I use for ~, kind of like how __add__ is for +?" >> LLMRun() >> (SubstringEvaluator("__inv__") | SubstringEvaluator("__invert__"))
This test checks whether the model can identify the special method behind Python's bitwise NOT operator (~). The | operator combines the two evaluators so that either substring counts as a pass.
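Combining evaluators with | can be sketched in a few lines of Python. Substring and AnyOf below are hypothetical classes written for illustration, assuming | is meant as a logical "or" over evaluators; they are not the benchmark's implementation.

```python
class Substring:
    """Passes when the needle occurs in the text (hypothetical evaluator)."""

    def __init__(self, needle):
        self.needle = needle

    def __call__(self, text):
        return self.needle in text

    def __or__(self, other):
        # a | b: pass if either evaluator passes
        return AnyOf(self, other)


class AnyOf:
    """Passes when at least one of the wrapped evaluators passes."""

    def __init__(self, *checks):
        self.checks = checks

    def __call__(self, text):
        return any(check(text) for check in self.checks)

    def __or__(self, other):
        # Allow longer chains such as a | b | c
        return AnyOf(*self.checks, other)


check = Substring("__inv__") | Substring("__invert__")
print(check("You want the __invert__ method."))
print(check("You want the __neg__ method."))
```

A substring check like this tolerates free-form answers: the model can explain itself at any length as long as one accepted name appears somewhere in the reply.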
Bitmap Image Specification
"Write a C program that draws an american flag to stdout." >> LLMRun() >> CRun() >> VisionLLMRun("What flag is shown in this image?") >> (SubstringEvaluator("United States"))
This test checks whether the model can write a valid C program that prints a bitmap image of the American flag to stdout. A vision model is then asked which flag the image shows, and its answer is checked for the expected country name.
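To give a sense of what the generated program has to get right, here is a minimal sketch of writing an image by hand. It assumes that "bitmap" means the uncompressed 24-bit BMP file format, which the article does not confirm, and it uses Python rather than C to stay consistent with the other examples. make_bmp and stripes are hypothetical names.

```python
import struct


def make_bmp(width, height, pixel_fn):
    """Build an uncompressed 24-bit BMP; pixel_fn(x, y) returns (r, g, b), y=0 is the top row."""
    row_size = (width * 3 + 3) & ~3          # each row is padded to a multiple of 4 bytes
    pixels = bytearray()
    for y in range(height - 1, -1, -1):      # BMP stores rows bottom-up
        row = bytearray()
        for x in range(width):
            r, g, b = pixel_fn(x, y)
            row += bytes((b, g, r))          # channel order on disk is blue, green, red
        row += bytes(row_size - width * 3)   # zero padding
        pixels += row

    # 14-byte file header: magic, file size, two reserved fields, pixel data offset
    file_header = struct.pack("<2sIHHI", b"BM", 54 + len(pixels), 0, 0, 54)
    # 40-byte BITMAPINFOHEADER: size, width, height, planes, bits per pixel,
    # compression (0 = none), image size, x/y resolution, palette fields
    info_header = struct.pack(
        "<IiiHHIIiiII", 40, width, height, 1, 24, 0, len(pixels), 2835, 2835, 0, 0
    )
    return file_header + info_header + bytes(pixels)


def stripes(x, y):
    # Alternate red and white rows, a very rough stand-in for flag stripes.
    return (255, 0, 0) if y % 2 == 0 else (255, 255, 255)


data = make_bmp(3, 2, stripes)
print(len(data))
```

To emit the image on stdout as the test's program must, the bytes would be written with sys.stdout.buffer.write(data) rather than print, since the output is binary.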
Carlini's new benchmark for LLMs is a significant step forward in evaluating the practical capabilities of these models. By focusing on real-world tasks and using a flexible DSL, it provides a more comprehensive and realistic assessment of how well LLMs can assist developers and researchers.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
21 February 2024