

Fine-tune a Chatbot QA model
Empowering Website Conversations: Part 3
Introduction
In the preceding section, we crafted a question-answer dataset by harnessing OpenAI’s completions API on our cleaned Markdown website files. Our next step is to use this dataset to fine-tune our Chatbot QA model with OpenAI’s fine-tuning API. Along the way, we shall also discuss the significance of comprehensive testing. This important step ensures that our models yield the outcomes we desire, instilling the confidence needed for their reliable deployment.
The previous section can be found here
Fine-tuning OpenAI Models
OpenAI’s fine-tuning process provides several benefits over using the provided base models directly. By customizing pre-trained models like GPT-3.5, fine-tuning empowers us to tailor responses to our chatbot’s needs. This process grants greater control over output, making it easier to enforce guidelines, tonal consistency, and brand identity. With a focus on our domain, fine-tuning deepens the AI solution’s knowledge in our specific area, ensuring more accurate and informed interactions. Moreover, this process is data-efficient, enabling model specialization with limited data, and is also resource-friendly, requiring fewer tokens per prompt compared to supplying the full context with every request.
OpenAI provides several different models for natural language processing (NLP) tasks. At the time of writing, the models available for fine-tuning are gpt-3.5-turbo-0613, babbage-002, and davinci-002. GPT-4 support is expected to be added soon.
OpenAI API
We are going to be using OpenAI’s Python interface and the CLI. The CLI can be installed with the following command and requires Python 3.
$ pip install --upgrade openai
$ openai
usage: openai [-h] [-V] [-v] [-b API_BASE] [-k API_KEY] [-p PROXY [PROXY ...]] [-o ORGANIZATION] {api,tools,wandb} ...
positional arguments:
{api,tools,wandb}
api Direct API calls
tools Client side tools for convenience
wandb Logging with Weights & Biases
optional arguments:
-h, --help show this help message and exit
-V, --version show program's version number and exit
-v, --verbose Set verbosity.
-b API_BASE, --api-base API_BASE
What API base url to use.
-k API_KEY, --api-key API_KEY
What API key to use.
-p PROXY [PROXY ...], --proxy PROXY [PROXY ...]
What proxy to use.
-o ORGANIZATION, --organization ORGANIZATION
Which organization to run as (will use your default organization if not specified)
The CLI provides several commands for interacting with the OpenAI API. To create the fine-tuned model, however, we will use the Python library directly.
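# Upload the training data file to OpenAI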
openai.File.create(
file=open("mydata.jsonl", "rb"),
purpose='fine-tune'
)
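# Then create a fine-tuning job that references the uploaded file's ID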
openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo")
Selecting the Right OpenAI Model for Fine-tuning
OpenAI describes the available models as follows:
Model | Description | Max Tokens | Training Data |
---|---|---|---|
gpt-3.5-turbo-0613 | Snapshot of gpt-3.5-turbo from June 13th 2023 with function calling data. Unlike gpt-3.5-turbo, this model will not receive updates, and will be deprecated 3 months after a new version is released. | 4,096 tokens | Up to Sep 2021 |
davinci-002 | Most capable GPT base model. Can do any task the other models can do, often with higher quality. Replacement for the GPT-3 curie and davinci base models. | 16,384 tokens | Up to Oct 2019 |
babbage-002 | Capable of straightforward tasks, very fast, and lower cost. Replacement for the GPT-3 ada and babbage base models. | 16,384 tokens | Up to Sep 2021 |
The base model you choose for your fine-tuned model should be up to the task, but you should also carefully weigh the cost associated with using each model. For pricing information see OpenAI’s pricing breakdown page. In our example we are going to use gpt-3.5-turbo-0613 as the base for our Question/Answer model.
Train the Chatbot Model
Now that we have the QA dataset, we can start the fine-tuning process! We are going to walk through how to turn the question and answer dataset we generated in the previous section into the JSONL training format for fine-tuning. Note that in the actual training file each record occupies a single line; the examples below are expanded for readability.
{"messages": [
{"role": "system", "content": "<Role of the model>"},
{"role": "user", "content": "<User question or prompt>"},
{"role": "assistant", "content": "<Desired generated result>"}
]}
{"messages": [
{"role": "system", "content": "<Role of the model>"},
{"role": "user", "content": "<User question or prompt>"},
{"role": "assistant", "content": "<Desired generated result>"}
]}
{"messages": [
{"role": "system", "content": "<Role of the model>"},
{"role": "user", "content": "<User question or prompt>"},
{"role": "assistant", "content": "<Desired generated result>"}
]}
The following was done in Google Colab.
Import Dependencies and QA dataset
We need to get access to the QA dataset we created in the previous section.
try:
    import openai
except ImportError:
    !pip install openai
    import openai

import pandas as pd

# Load the QA dataset created in the previous section
qa_df = pd.read_csv('data/embyr_website_qa.csv')
Create the files for training the QA model
The messages for training will be in the following format:
{"messages": [
{
"role": "system",
"content": "You are a factual chatbot to answer questions about AI and Embyr."
},
{"role": "user", "content": "<QUESTION>"},
{"role": "assistant", "content": "<ANSWER>**STOP**"}
]}
Notice the **STOP** characters added to the end of each completion. This helps the model learn when to stop generating output. Without training in a “stop token” that can be referenced later, the model can start to repeat itself or return nonsense. At inference time we will pass **STOP** as a stop sequence so the API cuts the response off there.
The following function will go through the QA dataset and turn each question/answer pair into a separate set of messages. The resulting training file will be saved to qa_train.jsonl.
def create_fine_tuning_dataset(df):
    """
    Create a dataset for fine-tuning the OpenAI model.

    Parameters
    ----------
    df: pd.DataFrame
        The dataframe containing the questions and answers

    Returns
    -------
    pd.DataFrame
        The dataframe containing the training messages, ready for fine-tuning
    """
    rows = []
    for i, row in df.iterrows():
        # Each row stores its questions and answers as newline-separated lists
        for q, a in zip(row.questions.split('\n'), row.answers.split('\n')):
            # Skip lines too short to be a real question/answer pair
            if len(q) > 10 and len(a) > 10:
                # q[2:] and a[2:] drop the leading list markers left over
                # from the generated dataset
                rows.append({"messages": [
                    {"role": "system", "content": "You are a factual chatbot to answer questions about AI and Embyr."},
                    {"role": "user", "content": f"{q[2:].strip()}"},
                    {"role": "assistant", "content": f"{a[2:].strip()} **STOP**"}
                ]})
    return pd.DataFrame(rows)

ft = create_fine_tuning_dataset(qa_df)
ft.to_json('qa_train.jsonl', orient='records', lines=True)
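Before uploading, it is worth a quick sanity check that every line of the file parses as JSON, along with a rough token count to estimate training cost. The following sketch assumes the tiktoken package is installed; the count is approximate because it ignores the per-message formatting overhead OpenAI bills for.
import json
import tiktoken

# Encoding used by the gpt-3.5-turbo family
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
total_tokens = 0
with open('qa_train.jsonl') as f:
    for line in f:
        record = json.loads(line)  # raises ValueError if a line is malformed
        for message in record["messages"]:
            total_tokens += len(enc.encode(message["content"]))
print(f"{total_tokens} training tokens (approximate)")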
Train the model
We are finally ready to fine-tune a model. To do so, we use the openai Python library. Make sure you have set your API key in the OPENAI_API_KEY environment variable before running the code below. The following uploads the file we just created to OpenAI and then creates a fine-tuning job for that file. You may need to wait a bit between uploading the file and starting the fine-tuning job.
import os
openai.api_key = os.getenv("OPENAI_API_KEY")
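# Upload the training file so OpenAI can use it for fine-tuning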
file = openai.File.create(
file=open("qa_train.jsonl", "rb"),
purpose='fine-tune'
)
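# Start the fine-tuning job using the uploaded file's ID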
job = openai.FineTuningJob.create(training_file=file.id, model="gpt-3.5-turbo")
The job may take a while to complete depending on the current queue and how large your data is. To check on the status of your job, you can use the following command to see a list of events.
openai.FineTuningJob.list_events(id=job.id, limit=10)
<OpenAIObject list at 0x7e0857ee8a40> JSON: {
"object": "list",
"data": [
{
"object": "fine_tuning.job.event",
"id": "ftevent-ID",
"created_at": 1693343247,
"level": "info",
"message": "Step 500/1113: training loss=1.35",
"data": {
"step": 500,
"train_loss": 1.3462088108062744,
"train_mean_token_accuracy": 0.6603773832321167
},
"type": "metrics"
},
{
"object": "fine_tuning.job.event",
"id": "ftevent-ID",
"created_at": 1693343112,
"level": "info",
"message": "Step 400/1113: training loss=0.13",
"data": {
"step": 400,
"train_loss": 0.1310240626335144,
"train_mean_token_accuracy": 0.9545454382896423
},
"type": "metrics"
},
{
"object": "fine_tuning.job.event",
"id": "ftevent-ID",
"created_at": 1693342974,
"level": "info",
"message": "Step 300/1113: training loss=1.49",
"data": {
"step": 300,
"train_loss": 1.4866760969161987,
"train_mean_token_accuracy": 0.8333333134651184
},
"type": "metrics"
},
…
Once the job has completed, the account associated with the API key will receive an email with the complete model name.
Hi Embyr,
Your fine-tuning job ftjob-ID has successfully completed, and a new model
ft:gpt-3.5-turbo-0613:embyr::ID has been created for your use.
Try it out on the OpenAI Playground or integrate it into your application
using the Completions API.
Thank you for building on the OpenAI platform,
The OpenAI team
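You don’t have to wait for the email: once training finishes, the same information is available by retrieving the job. A short sketch:
job_status = openai.FineTuningJob.retrieve(job.id)
print(job_status.status)            # "succeeded" once training is done
print(job_status.fine_tuned_model)  # e.g. "ft:gpt-3.5-turbo-0613:embyr::ID"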
Evaluating the Fine-tuned ChatBot
Now that we have trained our model we need to verify it returns our expected output. Testing ensures that the model aligns with your intended objectives, generating responses that meet the desired quality and tone. Beyond validation, testing gauges the model’s ability to handle diverse inputs. It’s also a critical safeguard against potential biases that may have emerged during fine-tuning. By prioritizing user experience, testing validates that the model provides coherent, accurate, and appropriate responses, contributing to elevated user satisfaction.
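For repeatable checks, you can script a small test pass with the Python library. The following is a minimal sketch; the model ID is a placeholder for your own fine-tuned model, and the questions are illustrative.
test_questions = [
    "How can Embyr help me integrate AI into my business?",  # verbatim from training
    "What does Embyr do?",                                   # variation (illustrative)
    "What is Star Wars?",                                    # completely unrelated
]

for question in test_questions:
    response = openai.ChatCompletion.create(
        model="ft:gpt-3.5-turbo-0613:embyr::ID",  # placeholder: your model ID
        messages=[
            {"role": "system", "content": "You are a factual chatbot to answer questions about AI and Embyr."},
            {"role": "user", "content": question},
        ],
        temperature=0.0,
        max_tokens=500,
        stop=["**STOP**"],
    )
    print(question)
    print(response.choices[0].message.content)
    print()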
I would suggest running through several of the questions used for training verbatim, variations on those questions, as well as some completely unrelated questions. This can also be done interactively with the OpenAI CLI chat_completions subcommand:
$ openai api chat_completions.create -h
usage: openai api chat_completions.create [-h] -g ROLE CONTENT [-e ENGINE | -m MODEL] [-n N] [-M MAX_TOKENS] [-t TEMPERATURE]
[-P TOP_P] [--stop STOP] [--stream]
required arguments:
-g ROLE CONTENT, --message ROLE CONTENT
A message in `{role} {content}` format. Use this argument multiple times to add multiple messages.
optional arguments:
-e ENGINE, --engine ENGINE
The engine to use. See https://learn.microsoft.com/en-us/azure/cognitive-services/openai/chatgpt-
quickstart?pivots=programming-language-python for more about what engines are available.
-m MODEL, --model MODEL
The model to use.
-n N, --n N How many completions to generate for the conversation.
-M MAX_TOKENS, --max-tokens MAX_TOKENS
The maximum number of tokens to generate.
-t TEMPERATURE, --temperature TEMPERATURE
What sampling temperature to use. Higher values means the model will take more risks. Try 0.9 for more
creative applications, and 0 (argmax sampling) for ones with a well-defined answer. Mutually exclusive with
`top_p`.
-P TOP_P, --top_p TOP_P
An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the
tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are
considered. Mutually exclusive with `temperature`.
--stop STOP A stop sequence at which to stop generating tokens for the message.
--stream Stream messages as they're ready.
I used 0.0 for temperature, which controls the amount of randomness introduced into the results. The lower the value, the more deterministic the response will be, which is useful to make sure the models are starting from the correct content. Replace the model passed to -m with the ID of your trained model.
$ openai api chat_completions.create -m ft:gpt-3.5-turbo-0613:embyr::7t0RQy5K --message system "You are a factual chatbot to answer questions about AI and Embyr." --message user "How can Embyr help me integrate AI into my business?" -M 500 --stop '**STOP**'
Embyr can help businesses integrate AI into their existing systems by providing consulting services, assisting with project scoping, and managing the implementation process.
$ openai api chat_completions.create -m ft:gpt-3.5-turbo-0613:embyr::7t0RQy5K --message system "You are a factual chatbot to answer questions about AI and Embyr." --message user "How can AI help my business?" -M 500 --stop '**STOP**'
AI can help businesses automate processes, gain insight from data, and provide better customer experiences. AI technologies, applications, and use cases are vast and continually evolving.
$ openai api chat_completions.create -m ft:gpt-3.5-turbo-0613:embyr::7t0RQy5K --message system "You are a factual chatbot to answer questions about AI and Embyr." --message user "What is Star Wars?" -M 500 --stop '**STOP**'
Star Wars is a popular science fiction franchise that was created by George Lucas. It has become an astronomical success and consists of movies, TV series, books, comics, and more.
Hmm… that worked but we probably don’t want our ChatBot to be able to answer questions about everything. In the next section, we will discuss how we can use discriminators to help guide our chatbot to stay on topic.
Conclusion
We have successfully trained the QA model that will act as the core of our custom Embyr Website chatbot. We converted our QA dataset into the required JSONL format for fine-tuning, and utilized OpenAI’s fine-tuning API to create a custom model trained on all things Embyr. But we aren’t done yet. After testing our model, we know we still need to design a way to ensure the input and output responses always relate to Embyr. In the next section, we will discuss how we can handle this with the use of discriminator fine-tuned models that use babbage-002 to “decide” if the input and/or output are related.
Part 1: What are Chatbots, and why would I want one?
Part 2: From Markdown to Training Data?
Part 4: Safeguard your Chatbot with Discriminators