@agentica/benchmark
Benchmark program of @agentica, focused on function selecting.
@agentica/core is the simplest Agentic AI library, specialized in LLM function calling. In the function calling process, @agentica/core never fails at the arguments composition step thanks to its validation feedback strategy, so the most important part for @agentica/core users is function selecting.
@agentica/benchmark is the benchmark tool that measures function selecting quality. Here is an example report generated by @agentica/benchmark, measuring the function selecting quality of a "Shopping Mall" scenario. The benchmark scenario measured below is exactly the same as the recorded video, and you can see that every function call succeeded without any error.
In fact, the shopping mall backend below also increased its AX (Agent Experience, the accuracy of function selecting) by repeatedly re-writing the description comments of its API operations and measuring the benchmark result with @agentica/benchmark.
- Benchmark Report:
@wrtnlabs/agentica/test/examples/benchmarks/select
- Swagger Document: https://shopping-be.wrtn.ai/editor 
- Repository: https://github.com/samchon/shopping-backend 
Select Benchmark
Pseudo Code
```typescript
import {
  AgenticaSelectBenchmark,
  IAgenticaSelectBenchmarkScenario,
} from "@agentica/benchmark";
import { Agentica } from "@agentica/core";

const agent = new Agentica({ ... });
const benchmark = new AgenticaSelectBenchmark({
  agent,
  config: {
    repeat: 4,
    simultaneous: 100,
  },
  scenarios: [...] satisfies IAgenticaSelectBenchmarkScenario[],
});
await benchmark.execute();

const docs: Record<string, string> = benchmark.report();
await archiveReport(docs);
```
Call the AgenticaSelectBenchmark.execute() and AgenticaSelectBenchmark.report() functions.
You can run the function selecting benchmark with the AgenticaSelectBenchmark class. Create an instance of it from an Agentica typed instance. You can customize some configurations when creating the AgenticaSelectBenchmark instance: the repeat property is the number of times the benchmark is repeated for each scenario, and the simultaneous property is the number of benchmark executions run in parallel at once.
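To make the `simultaneous` option concrete, here is an illustrative sketch (not code from @agentica/benchmark) of what a parallelism limit of this kind does: tasks run concurrently, but at most `simultaneous` of them are in flight at any moment.

```typescript
// Sketch of a concurrency limiter: run all tasks, keeping at most
// `simultaneous` of them in flight at the same time.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  simultaneous: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0; // index of the next task to start
  const worker = async (): Promise<void> => {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  };
  // Spawn up to `simultaneous` workers that drain the task queue.
  const workers = Array.from(
    { length: Math.min(simultaneous, tasks.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}
```

With `repeat: 4`, each scenario would contribute four such tasks to the queue, so `simultaneous: 100` caps how many of them hit the LLM provider concurrently.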
When you call the AgenticaSelectBenchmark.execute() function after construction, the benchmark runs. If you have provided many scenarios or configured a large repeat count, it may take a long time. To trace the progress of the benchmark, you can pass a callback function as the argument of the listener parameter.
After the benchmark, you can get its report by calling the AgenticaSelectBenchmark.report() function. The report is returned as a dictionary of markdown report files: each key is a file name with its relative path, and each value is the markdown content.
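The `archiveReport` function in the pseudo code above is not part of the library; it stands for whatever you do with the returned dictionary. A minimal sketch of such a helper, writing each entry to disk and creating subdirectories for the relative paths in the keys, might look like this (the `root` default is arbitrary):

```typescript
import fs from "node:fs/promises";
import path from "node:path";

// Hypothetical helper, not part of @agentica/benchmark: persist the
// report dictionary returned by AgenticaSelectBenchmark.report().
// Keys are relative file paths, values are markdown contents.
async function archiveReport(
  docs: Record<string, string>,
  root: string = "docs/benchmarks/select",
): Promise<void> {
  for (const [file, markdown] of Object.entries(docs)) {
    const location = path.join(root, file);
    // Keys may contain subdirectories, so create them first.
    await fs.mkdir(path.dirname(location), { recursive: true });
    await fs.writeFile(location, markdown, "utf8");
  }
}
```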
- Report Example:
@wrtnlabs/agentica/test/examples/benchmarks/select
Benchmark Scenario
```typescript
/**
 * Scenario of function selection.
 *
 * `IAgenticaSelectBenchmarkScenario` is a data structure which
 * represents a function selection benchmark scenario. Its key
 * properties are {@link text} and {@link expected}.
 *
 * The {@link text} means the conversation text from the user, and
 * {@link expected} describes the operations that should be selected
 * by the `selector` agent through the {@link text} conversation.
 *
 * @author Samchon
 */
export interface IAgenticaSelectBenchmarkScenario<
  Model extends ILlmSchema.Model,
> {
  /**
   * Name of the scenario.
   *
   * It must be unique within the benchmark scenarios.
   */
  name: string;

  /**
   * The prompt text from user.
   */
  text: string;

  /**
   * Expected function selection sequence.
   *
   * Sequence of operations (API operation or class function) that
   * should be selected by the `selector` agent from the user's
   * {@link text} conversation for the LLM (Large Language Model)
   * function selection.
   */
  expected: IAgenticaBenchmarkExpected<Model>;
}
```
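Since the `name` property must be unique within the benchmark scenarios, you might guard your scenario list before constructing the benchmark. This is a hypothetical pre-flight check, not part of @agentica/benchmark, and the `ScenarioLike` shape is a simplification of the interface above:

```typescript
// Simplified stand-in for IAgenticaSelectBenchmarkScenario, keeping
// only the fields this check needs.
interface ScenarioLike {
  name: string; // must be unique within the benchmark scenarios
  text: string; // the user's conversation prompt
}

// Hypothetical guard: reject scenario lists whose names collide
// before handing them to AgenticaSelectBenchmark.
function assertUniqueNames(scenarios: ScenarioLike[]): void {
  const seen = new Set<string>();
  for (const s of scenarios) {
    if (seen.has(s.name))
      throw new Error(`duplicated scenario name: ${s.name}`);
    seen.add(s.name);
  }
}
```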
No more features?
@agentica/benchmark already supports a function calling execution benchmark feature as well. However, as @agentica/benchmark does not provide multi-turn benchmarking, a benchmark scenario must be constructed from only one conversation text, which is not enough to measure the quality of function calling execution.
We're planning a multi-turn benchmark feature for the future, and the function calling execution benchmark will become meaningful at that point. Until then, please be satisfied with the current feature, the function selecting benchmark; for now, only the function selecting benchmark is meaningful.
For reference, by utilizing the validation feedback strategy, @agentica/core never fails at the arguments composition step. So if you have achieved a high-scoring result in the function select benchmark, your application is sufficiently ready to launch.