Introduction
It is designed to evaluate a task-oriented dialogue system by generating synthetic conversations, extracting task completion metrics, and producing a labeled synthetic dataset.
Tutorial
Here is an example for the customer service assistant chatbot.
-
First, create an API for the Agent you built. It will start an API on the default port 8000.
python model_api.py --input-dir ./examples/customer_service
- Fields:
--input-dir
: The directory that contains the needed files for the orchestrator and documents for the workers.--model
: The openai model type used to generate bot response. Default isgpt-4o
. You could change it to other models likegpt-4o-mini
.--port
: The port number to start the API. Default is 8000.
- Fields:
-
Then, start the evaluation process:
python eval.py \
--model_api http://127.0.0.1:8000/eval/chat \
--config ./examples/customer_service_config.json \
--documents_dir ./examples/customer_service \
--output-dir ./examples/customer_service- Fields:
--model_api
: The api url that you created in the previous step--config
: The path to the config file--documents_dir
: The directory that contains the generated files--output-dir
: The directory to save the evaluation results--num_convos
: Number of synthetic conversations to simulate. Default is 5.--num_goals
: Number of goals/tasks to simulate. Default is 5.--max_turns
: Maximum number of turns per conversation. Default is 5.--model
: The openai model type used to generate bot response. Default isgpt-4o
. You could change it to other models likegpt-4o-mini
.
- Fields:
Results
The evaluation will generate the following outputs in the specified output directory:
-
Simulated Synthetic Dataset (
simulate_data.json
)- JSON file containing simulated conversations generated based on the user's objective to evaluate the task success rate.
-
Labeled Synthetic Dataset (
labeled_data.json
)- JSON file containing labeled conversations generated based on the taskgraph to evaluate the NLU performance.
-
Goal Completion Metrics (
goal_completion.json
)- JSON file summarizing task completion statistics based on the bot's ability to achieve specified goals.