@agentica/benchmark
Benchmark program of @agentica, focused on function selecting.
@agentica/core is the simplest Agentic AI library, specialized in LLM function calling. In the function calling process, @agentica/core never fails at the arguments composition step thanks to its validation feedback strategy, so the most important part for @agentica/core users is function selecting.
@agentica/benchmark is the benchmark tool that measures function selecting quality. Here is an example report generated by @agentica/benchmark, measuring the function selecting quality of a "Shopping Mall" scenario. The benchmark scenario measured below is exactly the same as the recorded video, and you can see that every function call succeeded without any error.
In fact, the shopping mall backend below also increased its AX (Agent Experience, the accuracy of function selecting) by repeatedly re-writing the description comments of its API operations and measuring the benchmark result with @agentica/benchmark.
- Benchmark Report:
@wrtnlabs/agentica/test/examples/benchmarks/select
- Swagger Document: https://shopping-be.wrtn.ai/editor 
- Repository: https://github.com/samchon/shopping-backend 
Select Benchmark
Pseudo Code
```typescript
import {
  AgenticaSelectBenchmark,
  IAgenticaSelectBenchmarkScenario,
} from "@agentica/benchmark";
import { Agentica } from "@agentica/core";

const agent = new Agentica({ ... });
const benchmark = new AgenticaSelectBenchmark({
  agent,
  config: {
    repeat: 4,
    simultaneous: 100,
  },
  scenarios: [...] satisfies IAgenticaSelectBenchmarkScenario[],
});
await benchmark.execute();

const docs: Record<string, string> = benchmark.report();
await archiveReport(docs);
```
Call the AgenticaSelectBenchmark.execute() and AgenticaSelectBenchmark.report() functions.
You can run the function selecting benchmark with the AgenticaSelectBenchmark class. Create an instance of it from an Agentica typed instance. You can customize some configurations when creating the AgenticaSelectBenchmark instance: the repeat property is the number of times the benchmark is repeated for each scenario, and the simultaneous property is the number of benchmark executions run in parallel at once.
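To make the `simultaneous` option concrete, here is an illustrative sketch (not code from @agentica/benchmark) of what a parallelism limit of this kind does: tasks run concurrently, but at most `simultaneous` of them are in flight at any moment.

```typescript
// Sketch of a concurrency limiter: run all tasks, keeping at most
// `simultaneous` of them in flight at the same time.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  simultaneous: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0; // index of the next task to start
  const worker = async (): Promise<void> => {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  };
  // Spawn up to `simultaneous` workers that drain the task queue.
  const workers = Array.from(
    { length: Math.min(simultaneous, tasks.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```

With `repeat: 4`, each scenario would contribute four such tasks to the queue, so `simultaneous: 100` caps how many of them hit the LLM provider concurrently.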
When you call the AgenticaSelectBenchmark.execute() function after construction, the benchmark runs. If you have provided many scenarios or configured a large repeat count, it may take a long time. To trace the progress of the benchmark, you can pass a callback function as the argument of the listener parameter.
After the benchmark, you can get its report by calling the AgenticaSelectBenchmark.report() function. The report is returned as a dictionary of markdown report files: each key is a file name with its relative path, and each value is the markdown content.
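The `archiveReport` function in the pseudo code above is not part of the library; it stands for whatever you do with the returned dictionary. A minimal sketch of such a helper, writing each entry to disk and creating subdirectories for the relative paths in the keys, might look like this (the `root` default is arbitrary):

```typescript
import fs from "node:fs/promises";
import path from "node:path";

// Hypothetical helper, not part of @agentica/benchmark: persist the
// report dictionary returned by AgenticaSelectBenchmark.report().
// Keys are relative file paths, values are markdown contents.
async function archiveReport(
  docs: Record<string, string>,
  root: string = "docs/benchmarks/select",
): Promise<void> {
  for (const [file, markdown] of Object.entries(docs)) {
    const location = path.join(root, file);
    // Keys may contain subdirectories, so create them first.
    await fs.mkdir(path.dirname(location), { recursive: true });
    await fs.writeFile(location, markdown, "utf8");
  }
}
```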
- Report Example:
@wrtnlabs/agentica/test/examples/benchmarks/select
Benchmark Scenario
```typescript
/**
 * Scenario of function selection.
 *
 * `IAgenticaSelectBenchmarkScenario` is a data structure which
 * represents a function selection benchmark scenario. Its key
 * properties are {@link text} and {@link expected}.
 *
 * The {@link text} means the conversation text from the user, and
 * {@link expected} describes the operations that should be selected
 * by the `selector` agent through the {@link text} conversation.
 *
 * @author Samchon
 */
export interface IAgenticaSelectBenchmarkScenario<
  Model extends ILlmSchema.Model,
> {
  /**
   * Name of the scenario.
   *
   * It must be unique within the benchmark scenarios.
   */
  name: string;

  /**
   * The prompt text from user.
   */
  text: string;

  /**
   * Expected function selection sequence.
   *
   * Sequence of operations (API operation or class function) that
   * should be selected by the `selector` agent from the user's
   * {@link text} conversation for the LLM (Large Language Model)
   * function selection.
   */
  expected: IAgenticaBenchmarkExpected<Model>;
}
```
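Since the `name` property must be unique within the benchmark scenarios, you might guard your scenario list before constructing the benchmark. This is a hypothetical pre-flight check, not part of @agentica/benchmark, and the `ScenarioLike` shape is a simplification of the interface above:

```typescript
// Simplified stand-in for IAgenticaSelectBenchmarkScenario, keeping
// only the fields this check needs.
interface ScenarioLike {
  name: string; // must be unique within the benchmark scenarios
  text: string; // the user's conversation prompt
}

// Hypothetical guard: reject scenario lists whose names collide
// before handing them to AgenticaSelectBenchmark.
function assertUniqueNames(scenarios: ScenarioLike[]): void {
  const seen = new Set<string>();
  for (const s of scenarios) {
    if (seen.has(s.name))
      throw new Error(`duplicated scenario name: ${s.name}`);
    seen.add(s.name);
  }
}
```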
No more features?
@agentica/benchmark already supports a function calling execution benchmark feature as well. However, as @agentica/benchmark does not provide multi-turn benchmarking, a benchmark scenario must be constructed from only one conversation text, which is not enough to measure the quality of function calling execution.
We're planning a multi-turn benchmark feature for the future, and the function calling execution benchmark will become meaningful at that point. Until then, please be satisfied with the current feature, the function selecting benchmark; for now, only the function selecting benchmark is meaningful.
For reference, by utilizing the validation feedback strategy, @agentica/core never fails at the arguments composition step. So if you have achieved a high-scoring result in the function select benchmark, your application is sufficiently ready to launch.