# Prompt-evaluating library POC
This proof of concept consists of three parts:

- The prompt-evaluating library (source)
  - responsible for loading/saving data
  - running prompts against data
  - processing results according to the use case definition
- Use case definition files (example)
- The web server and UI (currently built using SvelteKit, because I wanted to try it)
  - rendering the UI for editing data and results
  - using the prompt-evaluating library internally to process the data and results
## Use case definition
The use case definition has the following type:

```ts
interface UseCaseDefinition<Response extends Record<string, any> = any> {
  prompt: {
    variables: string[];
  };
  groundTruth: {
    fields: string[];
  };
  parseJson: (rawResponse: string) => JSON;
  processResponse: (json: JSON) => Response;
  evaluateResponse?: (response: Response, groundTruth: Response) => Response;
  sortResults: (results: PromptResults[]) => PromptResults[];
}
```

It needs to define a few functions:
### `parseJson`
The model response is always a string. The first step is to parse the text response into a JSON object containing the ground truth fields.
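As a minimal sketch (not the POC's actual code), a `parseJson` for a hypothetical action-items use case could look like this; how the model wraps its answer (e.g. in a markdown code fence) is an assumption:

```ts
// Minimal sketch: keep only the part between the first "{" and the last "}"
// in case the model wrapped the JSON in extra text, then parse it.
const parseJson = (rawResponse: string): JSON => {
  const start = rawResponse.indexOf("{");
  const end = rawResponse.lastIndexOf("}");
  const jsonText =
    start >= 0 && end > start ? rawResponse.slice(start, end + 1) : rawResponse;
  return JSON.parse(jsonText);
};
```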
### `processResponse`
Convert the JSON record to the actual ground truth values. This function is also used to validate ground truth data rows. The returned shape is up to the user and is specific to the use case.

The function can return additional (computed) fields beyond those defined in the ground truth. These fields can be used by the `evaluateResponse` function and will be computed by the library for every edited row in the ground truth table in the UI.
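A sketch of `processResponse` for the same hypothetical action-items use case; the field names and the computed `countBucket` field are assumptions:

```ts
// Hypothetical response shape for an "action items" use case.
interface ActionItemsResponse {
  hasActionItems: boolean;
  count: number;
  // Computed field, not part of the ground truth; available to evaluateResponse.
  countBucket: "none" | "few" | "many";
}

const processResponse = (json: any): ActionItemsResponse => {
  const items: unknown[] = Array.isArray(json?.actionItems) ? json.actionItems : [];
  return {
    hasActionItems: items.length > 0,
    count: items.length,
    countBucket: items.length === 0 ? "none" : items.length <= 3 ? "few" : "many",
  };
};
```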
### `evaluateResponse`
Compare the response against the ground truth and return a boolean for each field: true means the value matches the ground truth, false means it doesn't. The return value can also be augmented with other fields to help with sorting (true positive, false negative, etc.).
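Continuing the hypothetical example above (the `falseNegative` helper field is an assumption), an `evaluateResponse` sketch could be:

```ts
// One boolean per ground truth field, plus an extra flag to help with sorting.
// ActionItemsResponse is the shape from the processResponse sketch above.
const evaluateResponse = (
  response: ActionItemsResponse,
  groundTruth: ActionItemsResponse,
) => ({
  hasActionItems: response.hasActionItems === groundTruth.hasActionItems,
  count: response.count === groundTruth.count,
  // Extra field: the response missed action items that the ground truth has.
  falseNegative: groundTruth.hasActionItems && !response.hasActionItems,
});
```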
### `sortResults`
This function gets an array of results, one item per prompt. Each item contains all fields from the parsed response, and the values are the counts of data rows that match the ground truth. The function is supposed to sort the prompt results in descending order, from best to worst.

It is used to get the rating (ordering) of the results for all prompts. It's also used to compare a result with a baseline: if the result ranks first and the baseline second, the new result is better than the baseline; otherwise it's worse.
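A `sortResults` sketch that ranks prompts by their total number of field matches, best first; the `PromptResults` shape below is a stand-in, not the library's real type:

```ts
// Stand-in shape: one entry per prompt, mapping each response field to the
// count of data rows whose value matched the ground truth.
type PromptResults = Record<string, number>;

// Total number of matching rows across all fields for one prompt.
const totalMatches = (result: PromptResults) =>
  Object.values(result).reduce((sum, count) => sum + count, 0);

// Descending order: the best prompt ends up first.
const sortResults = (results: PromptResults[]): PromptResults[] =>
  [...results].sort((a, b) => totalMatches(b) - totalMatches(a));
```

Together, these four functions, plus the `prompt.variables` and `groundTruth.fields` lists, are what a use case definition file provides.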
## The prompt-evaluating library
- Saving/loading of data rows, ground truth, and prompt definitions
  - the data can be stored in the file system, a DB, Cloudflare's KV store, Google Sheets, etc.
- Each data row is loaded, the prompt variables are replaced in the user prompt, and then OpenAI is called with the prompt
- The raw OpenAI result is processed with the `parseJson` and `processResponse` functions from the use case definition
- Then all field values are compared with the ground truth, and the count of rows that matched the ground truth field is returned
- This process is repeated for each prompt template
- Then the `sortResults` function from the use case definition is used to sort the results for all prompt templates
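The flow above as a rough sketch; this is not the library's actual code, and the function name, the `callOpenAi` callback, the `{variable}` placeholder syntax, and the row shape are all assumptions:

```ts
// Rough sketch of the evaluation flow; names and shapes are assumptions.
async function evaluateAllPrompts(
  useCase: UseCaseDefinition,
  promptTemplates: Array<{ userPrompt: string }>,
  dataRows: Array<{ variables: Record<string, string>; groundTruth: any }>,
  callOpenAi: (userPrompt: string) => Promise<string>,
) {
  const allResults: any[] = [];
  for (const template of promptTemplates) {
    // Per-field counts of rows that matched the ground truth for this prompt.
    const counts: Record<string, number> = {};
    for (const row of dataRows) {
      // Replace the prompt variables with the values from the data row.
      const userPrompt = useCase.prompt.variables.reduce(
        (prompt, name) => prompt.replaceAll(`{${name}}`, row.variables[name] ?? ""),
        template.userPrompt,
      );
      // Call the model, then parse and process the raw text response.
      const raw = await callOpenAi(userPrompt);
      const response = useCase.processResponse(useCase.parseJson(raw));
      const evaluated = useCase.evaluateResponse?.(response, row.groundTruth) ?? response;
      // For each field that matched the ground truth, increment its row count.
      for (const [field, matched] of Object.entries(evaluated)) {
        if (matched === true) counts[field] = (counts[field] ?? 0) + 1;
      }
    }
    allResults.push(counts);
  }
  // Rank all prompt templates from best to worst.
  return useCase.sortResults(allResults);
}
```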
## UI
The UI contains two editable tables:

- Ground truth
  - contains values for the prompt variables
  - contains the ground truth field values
- Prompts
  - contains the fields for prompt templates
    - model to use, system prompt, user prompt, parameters (temperature etc.)
You can evaluate all prompts at once (the prompt-evaluating library can cache the OpenAI results for faster responses). A table with the sorted results is then displayed, showing all fields from the parsed response and, for each field, the count of rows matching the ground truth.

When you edit a prompt, you can also evaluate only the changed prompt, and the results for all prompts are then sorted again. Each prompt edit/run is saved, so you have a history and can see whether you are making the prompt better or worse over time (not implemented right now).
## The CLI editor
You can run the prompts and display the results in the terminal by using the CLI interface. This can also be used while creating a use case definition, to check that everything works correctly before starting to work in the UI (again, the prompt-evaluating library can cache the OpenAI results for faster responses).
To run a use case using the CLI, run the following command:

```sh
npx tsx ./src/prompt-evaluating-lib/cli.ts <use-case-name>
```

For example:

```sh
npx tsx ./src/prompt-evaluating-lib/cli.ts actionItems
```

## Meta discussion / additional features
- Is this useful?
- Right now everything is written in TypeScript for a fast POC
  - It could make sense to have the use case definition files written in Python
    - Either the whole library could be written in Python, or the Node.js library could call the Python code to process results
  - Having it written in TypeScript has advantages though
    - The whole web server/app infrastructure was ready for free in 5 minutes
    - Easy deployment to serverless or edge (which means the app could run for free on Cloudflare/Vercel for our purposes)
    - Code sharing between server and client
      - the use case definition functions could be used on the client for instant ground truth validation or results sorting
- One possibility is to use the app only for GT/prompt editing and running prompts, and the results could be evaluated manually
  - after manual evaluation, the result could be saved and used to compare against the baseline
- It could be split into multiple parts
  - The core library itself
  - The web server and browser UI (other people could build their own GUIs/servers on top of the core library)
  - The CLI script/result printer (could be part of the core library)
  - The response-processing combinator functions (however, users could use any JSON parsing/validation library they want)
- Could be connected to other APIs/models, not just OpenAI
- The data rows could be fetched from a Google Sheet
  - The results for each data row could be synced back to the Google Sheet
  - Or all of the data editing could stay in Google Sheets, and the app could be used just to run prompts and save the results back to Google Sheets (kind of like a Google Sheets plugin, if something like that is possible)
- The UI could have buttons to import/export data
  - from a JSON file
  - from a CSV file
  - Connect to a Google Sheet (and pick column ranges)
  - Connect to our DB (e.g. action items could be fetched automatically; only rows that have GT annotated would be used in runs)
- Could be used as part of integration tests to detect prompt regressions (or model regressions); a hedged sketch follows below
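As an illustration of that last point only: nothing below is the library's actual API. `runPrompt`, the prompts, and the annotated rows are hypothetical, and comparing total match counts is a simplification of the `sortResults`-based baseline comparison described earlier.

```ts
import { test } from "node:test";
import assert from "node:assert/strict";

// Hypothetical helper: runs one prompt over the annotated rows and returns
// per-field match counts, e.g. { hasActionItems: 18, count: 15 }.
declare function runPrompt(
  useCase: unknown,
  prompt: string,
  rows: unknown[],
): Promise<Record<string, number>>;

// Assumed to exist in the test setup (hypothetical).
declare const useCase: unknown;
declare const baselinePrompt: string;
declare const candidatePrompt: string;
declare const annotatedRows: unknown[];

const totalMatches = (counts: Record<string, number>) =>
  Object.values(counts).reduce((sum, count) => sum + count, 0);

test("edited prompt does not regress against the baseline", async () => {
  const baseline = await runPrompt(useCase, baselinePrompt, annotatedRows);
  const candidate = await runPrompt(useCase, candidatePrompt, annotatedRows);
  // Fail the run if the edited prompt matches fewer ground truth values.
  assert.ok(totalMatches(candidate) >= totalMatches(baseline));
});
```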