
@agentica/benchmark

Benchmark program of @agentica for function selecting.

@agentica/core is the simplest Agentic AI library, specialized in LLM Function Calling. In the function calling process, @agentica/core does not fail at the argument composition step thanks to its validation feedback strategy, so the most important part for @agentica/core users is function selecting.

@agentica/benchmark is the benchmark tool measuring function selecting quality. Here is an example report generated by @agentica/benchmark, measuring the function calling's selection quality for the "Shopping Mall" scenario. The benchmark scenario measured below is exactly the same as the recorded video, and you can find that every function calling succeeded without any error.

In fact, the shopping mall backend below has also increased its AX (Agent Experience, the accuracy of function selecting) by repeatedly re-writing the description comments of its API operations and measuring the benchmark result with @agentica/benchmark.


Select Benchmark

test/benchmark/select.ts
import {
  AgenticaSelectBenchmark,
  IAgenticaSelectBenchmarkScenario,
} from "@agentica/benchmark";
import { Agentica } from "@agentica/core";

const agent = new Agentica({ ... });
const benchmark = new AgenticaSelectBenchmark({
  agent,
  config: {
    repeat: 4,
    simultaneous: 100,
  },
  scenarios: [...] satisfies IAgenticaSelectBenchmarkScenario[],
});
await benchmark.execute();

const docs: Record<string, string> = benchmark.report();
await archiveReport(docs);

Call the AgenticaSelectBenchmark.execute() and AgenticaSelectBenchmark.report() functions.

You can measure the function selecting benchmark with the AgenticaSelectBenchmark class. Create its instance with a typed Agentica instance. You can customize some configurations when creating the AgenticaSelectBenchmark instance: the repeat property means the number of times the benchmark is repeated for each scenario, and the simultaneous property means the number of benchmark executions run in parallel.

When you call the AgenticaSelectBenchmark.execute() function after construction, the benchmark will be executed. If you've provided a lot of scenarios or configured huge repeat counts, it may consume a lot of time. To trace the progress of the benchmark, you can pass a callback function as the argument of the listener parameter.
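For example, here is a minimal sketch of progress tracing. It assumes only what is stated above, that execute() accepts a listener callback; the concrete shape of the event passed to the listener is not shown here, so consult the library's type definitions for it.

// A minimal sketch: log every event emitted while the benchmark runs.
// The exact event type received by the listener is an assumption here.
await benchmark.execute((event) => {
  console.log("benchmark progress:", event);
});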

After the benchmark, you can get its report by calling the AgenticaSelectBenchmark.report() function. The report is returned as a dictionary object of markdown reporting files: the key of the dictionary is the file name with its relative location path, and the value is the markdown content.
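Note that the archiveReport() function in the example above is user-defined, not part of @agentica/benchmark. A minimal sketch of such a helper, assuming you simply want to write every markdown file under a docs/benchmark directory, could look like this:

import fs from "node:fs";
import path from "node:path";

// Hypothetical helper: writes each markdown report file to disk,
// using the dictionary key as the relative file path.
async function archiveReport(docs: Record<string, string>): Promise<void> {
  for (const [file, markdown] of Object.entries(docs)) {
    const location: string = path.join("docs", "benchmark", file);
    await fs.promises.mkdir(path.dirname(location), { recursive: true });
    await fs.promises.writeFile(location, markdown, "utf8");
  }
}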

Benchmark Scenario

@agentica/benchmark/IAgenticaSelectBenchmarkScenario
/**
 * Scenario of function selection.
 *
 * `IAgenticaSelectBenchmarkScenario` is a data structure which
 * represents a function selection benchmark scenario. It contains two
 * properties; {@link text} and {@link expected}.
 *
 * The {@link text} means the conversation text from the user, and
 * the other {@link expected} means the expected operations that
 * should be selected by the `selector` agent through the {@link text}
 * conversation.
 *
 * @author Samchon
 */
export interface IAgenticaSelectBenchmarkScenario<
  Model extends ILlmSchema.Model,
> {
  /**
   * Name of the scenario.
   *
   * It must be unique within the benchmark scenarios.
   */
  name: string;

  /**
   * The prompt text from user.
   */
  text: string;

  /**
   * Expected function selection sequence.
   *
   * Sequence of operations (API operation or class function) that
   * should be selected by the `selector` agent from the user's
   * {@link text} conversation for the LLM (Large Language Model)
   * function selection.
   */
  expected: IAgenticaBenchmarkExpected<Model>;
}
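For illustration, here is a hedged sketch of composing scenarios, assuming "chatgpt" as the schema model. The getExpectedSelection() helper and the "orders.create" operation name are hypothetical, standing in for however you construct the IAgenticaBenchmarkExpected value, which is documented by its own type.

import {
  IAgenticaBenchmarkExpected,
  IAgenticaSelectBenchmarkScenario,
} from "@agentica/benchmark";

// Hypothetical helper standing in for the construction of the expected
// selection; see IAgenticaBenchmarkExpected for its actual structure.
declare function getExpectedSelection(
  operationName: string,
): IAgenticaBenchmarkExpected<"chatgpt">;

const scenarios: IAgenticaSelectBenchmarkScenario<"chatgpt">[] = [
  {
    name: "order a laptop",
    text: "I want to order the cheapest laptop in the shopping mall.",
    expected: getExpectedSelection("orders.create"),
  },
];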

No more feature?

@agentica/benchmark already supports a function calling execution benchmark feature. However, as @agentica/benchmark does not provide a multi-turn benchmark, a benchmark scenario must be constructed from only one conversation text, so it is not enough to measure the function calling's quality.

We’re planning to add a multi-turn benchmark feature in the future, and the function calling execution benchmark feature would become meaningful at that time. Until then, please be satisfied with the current feature, the function selecting benchmark; for now, only the function selecting benchmark is meaningful.

For reference, by utilizing the validation feedback strategy, @agentica/core does not fail at the argument composition step. So if you have acquired a high-scoring result in the function selecting benchmark, your application is sufficiently ready to be launched.
