# Prompt-evaluating library POC
This proof of concept consists of three parts:

- The prompt-evaluating library (source)
  - responsible for loading/saving data
  - running prompts against data
  - processing results according to the use case definition
- Use case definition files (example)
- The web server and UI (currently built using SvelteKit, because I wanted to try it)
  - rendering the UI for editing data and results
  - using the prompt-evaluating library internally to process the data and results
## Use case definition
The use case definition has the following type:

```ts
interface UseCaseDefinition<Response extends Record<string, any> = any> {
  prompt: {
    variables: string[];
  };
  groundTruth: {
    fields: string[];
  };
  parseJson: (rawResponse: string) => JSON;
  processResponse: (json: JSON) => Response;
  evaluateResponse?: (response: Response, groundTruth: Response) => Response;
  sortResults: (results: PromptResults[]) => PromptResults[];
}
```

It needs to define a few functions:
### `parseJson`
The model response is always a string. The first step is to parse the text response into a JSON object containing the ground truth fields.
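As a minimal sketch (not the POC's actual code), a `parseJson` for a hypothetical action-items use case could look like this; how the model wraps its answer (e.g. in a markdown code fence) is an assumption:

```ts
// Minimal sketch: keep only the part between the first "{" and the last "}"
// in case the model wrapped the JSON in extra text, then parse it.
const parseJson = (rawResponse: string): JSON => {
  const start = rawResponse.indexOf("{");
  const end = rawResponse.lastIndexOf("}");
  const jsonText =
    start >= 0 && end > start ? rawResponse.slice(start, end + 1) : rawResponse;
  return JSON.parse(jsonText);
};
```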
### `processResponse`
Convert the JSON record to the actual ground truth values. This function is also used to validate ground truth data rows. The returned shape is up to the user and is specific to the use case.

The function can return additional (computed) fields beyond those defined in the ground truth. These fields can be used by the `evaluateResponse` function and will be computed by the library for every edited row in the ground truth table in the UI.
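A sketch of `processResponse` for the same hypothetical action-items use case; the field names and the computed `countBucket` field are assumptions:

```ts
// Hypothetical response shape for an "action items" use case.
interface ActionItemsResponse {
  hasActionItems: boolean;
  count: number;
  // Computed field, not part of the ground truth; available to evaluateResponse.
  countBucket: "none" | "few" | "many";
}

const processResponse = (json: any): ActionItemsResponse => {
  const items: unknown[] = Array.isArray(json?.actionItems) ? json.actionItems : [];
  return {
    hasActionItems: items.length > 0,
    count: items.length,
    countBucket: items.length === 0 ? "none" : items.length <= 3 ? "few" : "many",
  };
};
```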
### `evaluateResponse`
Compare the response against the ground truth and return a boolean for each field: true means the value matches the ground truth, false means it doesn't. The return value can also be augmented with other fields to help with sorting (true positive, false negative, etc.).
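Continuing the hypothetical example above (the `falseNegative` helper field is an assumption), an `evaluateResponse` sketch could be:

```ts
// One boolean per ground truth field, plus an extra flag to help with sorting.
// ActionItemsResponse is the shape from the processResponse sketch above.
const evaluateResponse = (
  response: ActionItemsResponse,
  groundTruth: ActionItemsResponse,
) => ({
  hasActionItems: response.hasActionItems === groundTruth.hasActionItems,
  count: response.count === groundTruth.count,
  // Extra field: the response missed action items that the ground truth has.
  falseNegative: groundTruth.hasActionItems && !response.hasActionItems,
});
```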
### `sortResults`
This function gets an array of results, one item per prompt. Each item contains all fields from the parsed response, and the values are the counts of data rows that match the ground truth. The function is supposed to sort the prompt results in descending order, from best to worst.

It is used to get the rating (ordering) of the results for all prompts. It's also used to compare a result with a baseline: if the result ranks first and the baseline second, the new result is better than the baseline; otherwise it's worse.
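A `sortResults` sketch that ranks prompts by their total number of field matches, best first; the `PromptResults` shape below is a stand-in, not the library's real type:

```ts
// Stand-in shape: one entry per prompt, mapping each response field to the
// count of data rows whose value matched the ground truth.
type PromptResults = Record<string, number>;

// Total number of matching rows across all fields for one prompt.
const totalMatches = (result: PromptResults) =>
  Object.values(result).reduce((sum, count) => sum + count, 0);

// Descending order: the best prompt ends up first.
const sortResults = (results: PromptResults[]): PromptResults[] =>
  [...results].sort((a, b) => totalMatches(b) - totalMatches(a));
```

Together, these four functions, plus the `prompt.variables` and `groundTruth.fields` lists, are what a use case definition file provides.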
## The prompt-evaluating library
- Saving/loading of data rows, ground truth, and prompt definitions
  - the data can be stored in the file system, a DB, Cloudflare's KV store, Google Sheets, etc.
- Each data row is loaded, the prompt variables are replaced in the user prompt, and then OpenAI is called with the prompt
- The raw OpenAI result is processed with the `parseJson` and `processResponse` functions from the use case definition
- Then all field values are compared with the ground truth, and the count of rows that matched the ground truth field is returned
- This process is repeated for each prompt template
- Then the `sortResults` function from the use case definition is used to sort the results for all prompt templates
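The flow above as a rough sketch; this is not the library's actual code, and the function name, the `callOpenAi` callback, the `{variable}` placeholder syntax, and the row shape are all assumptions:

```ts
// Rough sketch of the evaluation flow; names and shapes are assumptions.
async function evaluateAllPrompts(
  useCase: UseCaseDefinition,
  promptTemplates: Array<{ userPrompt: string }>,
  dataRows: Array<{ variables: Record<string, string>; groundTruth: any }>,
  callOpenAi: (userPrompt: string) => Promise<string>,
) {
  const allResults: any[] = [];
  for (const template of promptTemplates) {
    // Per-field counts of rows that matched the ground truth for this prompt.
    const counts: Record<string, number> = {};
    for (const row of dataRows) {
      // Replace the prompt variables with the values from the data row.
      const userPrompt = useCase.prompt.variables.reduce(
        (prompt, name) => prompt.replaceAll(`{${name}}`, row.variables[name] ?? ""),
        template.userPrompt,
      );
      // Call the model, then parse and process the raw text response.
      const raw = await callOpenAi(userPrompt);
      const response = useCase.processResponse(useCase.parseJson(raw));
      const evaluated = useCase.evaluateResponse?.(response, row.groundTruth) ?? response;
      // For each field that matched the ground truth, increment its row count.
      for (const [field, matched] of Object.entries(evaluated)) {
        if (matched === true) counts[field] = (counts[field] ?? 0) + 1;
      }
    }
    allResults.push(counts);
  }
  // Rank all prompt templates from best to worst.
  return useCase.sortResults(allResults);
}
```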
## UI
The UI contains two editable tables:

- Ground truth
  - contains values for the prompt variables
  - contains the ground truth field values
- Prompts
  - contains the fields for prompt templates
    - model to use, system prompt, user prompt, parameters (temperature etc.)
You can evaluate all prompts at once (the prompt-evaluating library can cache the OpenAI results for faster responses). A table with the sorted results is then displayed, showing all fields from the parsed response and, for each field, the count of rows matching the ground truth.

When you edit a prompt, you can also evaluate only the changed prompt, and the results for all prompts are then sorted again. Each prompt edit/run is saved, so you have a history and can see whether you are making the prompt better or worse over time (not implemented right now).
## The CLI editor
You can run the prompts and display the results in the terminal by using the CLI interface. This can also be used while creating a use case definition, to check that everything works correctly before starting to work in the UI (again, the prompt-evaluating library can cache the OpenAI results for faster responses).
To run a use case using the CLI, run the following command:

```sh
npx tsx ./src/prompt-evaluating-lib/cli.ts <use-case-name>
```

For example:

```sh
npx tsx ./src/prompt-evaluating-lib/cli.ts actionItems
```

## Meta discussion / additional features
- Is this useful?
- Right now everything is written in TypeScript for a fast POC
  - It could make sense to have the use case definition files written in Python
    - Either the whole library could be written in Python, or the Node.js library could call the Python code to process results
  - Having it written in TypeScript has advantages though
    - The whole web server/app infrastructure was ready for free in 5 minutes
    - Easy deployment to serverless or edge (which means the app could run for free on Cloudflare/Vercel for our purposes)
    - Code sharing between server and client
      - the use case definition functions could be used on the client for instant ground truth validation or results sorting
- One possibility is to use the app only for GT/prompt editing and running prompts, and the results could be evaluated manually
  - after manual evaluation, the result could be saved and used to compare against the baseline
- It could be split into multiple parts
  - The core library itself
  - The web server and browser UI (other people could build their own GUIs/servers on top of the core library)
  - The CLI script/result printer (could be part of the core library)
  - The response-processing combinator functions (however, users could use any JSON parsing/validation library they want)
- Could be connected to other APIs/models, not just OpenAI
- The data rows could be fetched from a Google Sheet
  - The results for each data row could be synced back to the Google Sheet
  - Or all of the data editing could stay in Google Sheets, and the app could be used just to run prompts and save the results back to Google Sheets (kind of like a Google Sheets plugin, if something like that is possible)
- The UI could have buttons to import/export data
  - from a JSON file
  - from a CSV file
  - Connect to a Google Sheet (and pick column ranges)
  - Connect to our DB (e.g. action items could be fetched automatically; only rows that have GT annotated would be used in runs)
- Could be used as part of integration tests to detect prompt regressions (or model regressions); a hedged sketch follows below
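As an illustration of that last point only: nothing below is the library's actual API. `runPrompt`, the prompts, and the annotated rows are hypothetical, and comparing total match counts is a simplification of the `sortResults`-based baseline comparison described earlier.

```ts
import { test } from "node:test";
import assert from "node:assert/strict";

// Hypothetical helper: runs one prompt over the annotated rows and returns
// per-field match counts, e.g. { hasActionItems: 18, count: 15 }.
declare function runPrompt(
  useCase: unknown,
  prompt: string,
  rows: unknown[],
): Promise<Record<string, number>>;

// Assumed to exist in the test setup (hypothetical).
declare const useCase: unknown;
declare const baselinePrompt: string;
declare const candidatePrompt: string;
declare const annotatedRows: unknown[];

const totalMatches = (counts: Record<string, number>) =>
  Object.values(counts).reduce((sum, count) => sum + count, 0);

test("edited prompt does not regress against the baseline", async () => {
  const baseline = await runPrompt(useCase, baselinePrompt, annotatedRows);
  const candidate = await runPrompt(useCase, candidatePrompt, annotatedRows);
  // Fail the run if the edited prompt matches fewer ground truth values.
  assert.ok(totalMatches(candidate) >= totalMatches(baseline));
});
```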