Empowering Website Conversations: Part 3

Introduction

In the preceding section, we crafted a question-answer dataset by harnessing OpenAI’s completions API on our cleaned Markdown website files. Our next step is to leverage this dataset to fine-tune our Chatbot QA model with OpenAI’s fine-tuning API. As we delve into this process, we will also discuss the significance of comprehensive testing. This important step ensures that our models yield the outcomes we desire, instilling the confidence needed for reliable deployment.

The previous section can be found here

Fine-tuning OpenAI Models

OpenAI’s fine-tuning process provides several benefits over using the provided base models directly. By customizing pre-trained models like GPT-3.5, fine-tuning empowers us to tailor responses to our chatbot’s needs. It gives us greater control over the output, including response guidelines, tonal consistency, and brand identity. With a focus on our domain, fine-tuning deepens the AI solution’s knowledge in our specific area, ensuring more accurate and informed interactions. Moreover, the process is data-efficient, enabling model specialization with limited data, and resource-friendly, requiring fewer tokens per prompt than supplying the full context each time.

OpenAI provides several different models for natural language processing (NLP) tasks. At the time of writing, the models available for fine-tuning are gpt-3.5-turbo-0613, babbage-002, and davinci-002. GPT-4 support is expected to be added soon.

OpenAI API

We are going to be using OpenAI’s Python interface and the CLI. The CLI can be installed with the following command and requires Python 3.

$ pip install --upgrade openai
$ openai
usage: openai [-h] [-V] [-v] [-b API_BASE] [-k API_KEY] [-p PROXY [PROXY ...]] [-o ORGANIZATION] {api,tools,wandb} ...

positional arguments:
  {api,tools,wandb}
  api               Direct API calls
  tools             Client side tools for convenience
  wandb             Logging with Weights & Biases

optional arguments:
  -h, --help          show this help message and exit
  -V, --version       show program's version number and exit
  -v, --verbose       Set verbosity.
  -b API_BASE, --api-base API_BASE
                      What API base url to use.
  -k API_KEY, --api-key API_KEY
                      What API key to use.
  -p PROXY [PROXY ...], --proxy PROXY [PROXY ...]
                      What proxy to use.
  -o ORGANIZATION, --organization ORGANIZATION
                      Which organization to run as (will use your default organization if not specified)

The CLI provides several commands for interacting with the OpenAI API. To create the fine-tuned model, we will use the Python library directly.

# Upload the training file to OpenAI
openai.File.create(
  file=open("mydata.jsonl", "rb"),
  purpose='fine-tune'
)

# Start a fine-tuning job, referencing the uploaded file by its ID
openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo")

Selecting the Right OpenAI Model for Fine-tuning

OpenAI describes the available models as follows:

gpt-3.5-turbo-0613: Snapshot of gpt-3.5-turbo from June 13th, 2023 with function calling data. Unlike gpt-3.5-turbo, this model will not receive updates, and will be deprecated 3 months after a new version is released. Max tokens: 4,096. Training data: up to Sep 2021.

davinci-002: Most capable GPT base model. Can do any task the other models can do, often with higher quality. Replacement for the GPT-3 curie and davinci base models. Max tokens: 16,384. Training data: up to Oct 2019.

babbage-002: Capable of straightforward tasks, very fast, and lower cost. Replacement for the GPT-3 ada and babbage base models. Max tokens: 16,384. Training data: up to Sep 2021.

The base model you choose for your fine-tuned model should be up to the task, but you should also carefully weigh the cost associated with using it. For pricing information, see OpenAI’s pricing breakdown page. In our example we are going to use gpt-3.5-turbo-0613 as the base for our Question/Answer model.
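
If cost is a concern, you can get a rough sense of the training bill before submitting a job. Below is a minimal sketch, not part of the original pipeline, that uses the tiktoken library (installed separately with pip install tiktoken) to count the tokens in the qa_train.jsonl training file we build later in this post. The per-1K-token price is a placeholder; take the current rate from OpenAI’s pricing page.

import json
import tiktoken

# Rough token count for a chat-format JSONL training file. This ignores the
# small per-message overhead OpenAI adds, so treat it as a ballpark figure.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

total_tokens = 0
with open("qa_train.jsonl") as f:
    for line in f:
        record = json.loads(line)
        for message in record["messages"]:
            total_tokens += len(enc.encode(message["content"]))

price_per_1k = 0.008  # placeholder: check the pricing page for the current training rate
print(f"~{total_tokens} training tokens, ~${total_tokens / 1000 * price_per_1k:.2f} per epoch")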

Train the Chatbot Model

Now that we have the QA dataset, we can start the fine-tuning process! We are going to walk through how we can turn the question and answer dataset we generated in the previous section into the JSONL training format for fine-tuning.

{"messages": [
    {"role": "system", "content": "<Role of the model>"},
    {"role": "user", "content": "<User question or prompt>"},
    {"role": "assistant", "content": "<Desired generated result>"}
]}
{"messages": [
    {"role": "system", "content": "<Role of the model>"},
    {"role": "user", "content": "<User question or prompt>"},
    {"role": "assistant", "content": "<Desired generated result>"}
]}
{"messages": [
    {"role": "system", "content": "<Role of the model>"},
    {"role": "user", "content": "<User question or prompt>"},
    {"role": "assistant", "content": "<Desired generated result>"}
]}

The following was done in Google Colab.

Import Dependencies and QA dataset

We need to get access to the QA dataset we created in the previous section.

try:
  import openai
except ImportError:
  # Install the library if it is not already available (e.g. in Colab)
  !pip install openai
  import openai

import pandas as pd

# Load the question/answer dataset created in the previous section
qa_df = pd.read_csv('data/embyr_website_qa.csv')
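
A quick sanity check that the dataset loaded as expected is worthwhile; the short sketch below assumes the questions and answers columns hold newline-separated items, which is what the conversion function further down relies on:

# Confirm the shape, columns, and newline-separated structure of the dataset
print(qa_df.shape)
print(qa_df.columns.tolist())
print(qa_df.iloc[0].questions.split('\n')[:3])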

Create the files for training the QA model

The messages for training will be in the following format:

{"messages": [
    {
       "role": "system",
       "content": "You are a factual chatbot to answer questions about AI and Embyr."
    },
    {"role": "user", "content": "<QUESTION>"},
    {"role": "assistant", "content": "<ANSWER>**STOP**"}
]}

Notice the **STOP** characters added to the end of the completion. This helps the model know when to stop generating output. Without training in a “stop token” that can later be passed as a stop sequence, the model can start to repeat itself or return nonsense.

The following function will go through the QA dataset and turn each question/answer pair into a separate set of messages. The resulting training file will be saved to qa_train.jsonl.

def create_fine_tuning_dataset(df):
  """
  Create a dataset for fine-tuning the OpenAI model

  Parameters
  ----------
  df: pd.DataFrame
      The dataframe containing the questions and answers

  Returns
  -------
  pd.DataFrame
      The dataframe of chat messages, ready for fine-tuning
  """
  rows = []
  for i, row in df.iterrows():
      for q, a in zip(row.questions.split('\n'), row.answers.split('\n')):
          # Skip blank or trivially short entries
          if len(q) > 10 and len(a) > 10:
              # q[2:] / a[2:] drop the leading list marker (e.g. "1.")
              # left over from the generated QA text
              rows.append({"messages": [
                  {"role": "system", "content": "You are a factual chatbot to answer questions about AI and Embyr."},
                  {"role": "user", "content": f"{q[2:].strip()}"},
                  {"role": "assistant", "content": f"{a[2:].strip()} **STOP**"}
              ]})
  return pd.DataFrame(rows)

ft = create_fine_tuning_dataset(qa_df)
ft.to_json('qa_train.jsonl', orient='records', lines=True)
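
Before uploading the file, a quick format check can save a failed job later. This is a minimal sketch, loosely modelled on the validation ideas in OpenAI’s cookbook rather than an official validator, that confirms each line is valid JSON with the expected roles and that every assistant message ends with the **STOP** marker:

import json

# Lightweight format check of the JSONL training file before upload
valid_roles = {"system", "user", "assistant"}
with open("qa_train.jsonl") as f:
    for n, line in enumerate(f, start=1):
        record = json.loads(line)  # raises if the line is not valid JSON
        messages = record["messages"]
        assert all(m["role"] in valid_roles for m in messages), f"bad role on line {n}"
        assert messages[-1]["role"] == "assistant", f"line {n} does not end with an assistant message"
        assert messages[-1]["content"].endswith("**STOP**"), f"missing **STOP** on line {n}"

print("qa_train.jsonl looks good")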

Train the model

We are finally ready to fine-tune a model. To do so, we use the openai Python library.

Make sure you have set your API key in the OPENAI_API_KEY environment variable before trying to run the commands. The following uploads the file we just created to OpenAI and then creates a fine-tuning job for that file. You may need to wait a bit between uploading the file and starting the fine-tuning job while the file is processed.

import os
openai.api_key = os.getenv("OPENAI_API_KEY")

# Upload the training file
file = openai.File.create(
  file=open("qa_train.jsonl", "rb"),
  purpose='fine-tune'
)

# Start the fine-tuning job against the uploaded file
job = openai.FineTuningJob.create(training_file=file.id, model="gpt-3.5-turbo")
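
If the job creation call complains that the file is still being processed, you can poll the file’s status first. A small sketch, assuming the File API reports a status of “processed” once the file is ready:

import time

# Wait for the uploaded file to finish processing before starting the job
while openai.File.retrieve(file.id).status != "processed":
    time.sleep(5)

job = openai.FineTuningJob.create(training_file=file.id, model="gpt-3.5-turbo")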

The job may take a while to complete depending on the current queue and how large your data is. To check on the status of your job, you can use the following command to see a list of events.

openai.FineTuningJob.list_events(id=job.id, limit=10)

<OpenAIObject list at 0x7e0857ee8a40> JSON: {
  "object": "list",
  "data": [
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-ID",
      "created_at": 1693343247,
      "level": "info",
      "message": "Step 500/1113: training loss=1.35",
      "data": {
        "step": 500,
        "train_loss": 1.3462088108062744,
        "train_mean_token_accuracy": 0.6603773832321167
      },
      "type": "metrics"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-ID",
      "created_at": 1693343112,
      "level": "info",
      "message": "Step 400/1113: training loss=0.13",
      "data": {
        "step": 400,
        "train_loss": 0.1310240626335144,
        "train_mean_token_accuracy": 0.9545454382896423
      },
      "type": "metrics"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-ID",
      "created_at": 1693342974,
      "level": "info",
      "message": "Step 300/1113: training loss=1.49",
      "data": {
        "step": 300,
        "train_loss": 1.4866760969161987,
        "train_mean_token_accuracy": 0.8333333134651184
      },
      "type": "metrics"
    },
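
The metrics events are also a convenient way to eyeball how training is progressing. A short sketch that pulls the loss values out of the events returned above (the field names match the JSON shown):

# Collect the reported training loss from the job's metric events
events = openai.FineTuningJob.list_events(id=job.id, limit=50)
losses = [
    (e["data"]["step"], e["data"]["train_loss"])
    for e in events["data"]
    if e.get("type") == "metrics"
]
for step, loss in sorted(losses):
    print(f"step {step}: loss {loss:.3f}")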

Once the job has completed, the account associated with the API key will receive an email with the full name of the new model.

    Hi Embyr,

    Your fine-tuning job ftjob-ID has successfully completed, and a new model
    ft:gpt-3.5-turbo-0613:embyr::ID has been created for your use.

    Try it out on the OpenAI Playground or integrate it into your application
    using the Completions API.

    Thank you for building on the OpenAI platform,
    The OpenAI team
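
If you would rather not wait for the email, the same information is available from the API once the job finishes. A minimal sketch:

# Check the job status and grab the fine-tuned model name once it is done
job_info = openai.FineTuningJob.retrieve(job.id)
print(job_info.status)            # e.g. "running" or "succeeded"
print(job_info.fine_tuned_model)  # populated once the job has completed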

Evaluating the Fine-tuned ChatBot

Now that we have trained our model we need to verify it returns our expected output. Testing ensures that the model aligns with your intended objectives, generating responses that meet the desired quality and tone. Beyond validation, testing gauges the model’s ability to handle diverse inputs. It’s also a critical safeguard against potential biases that may have emerged during fine-tuning. By prioritizing user experience, testing validates that the model provides coherent, accurate, and appropriate responses, contributing to elevated user satisfaction.

I would suggest running through several of the questions used for training verbatim, variations on those questions, as well as some completely unrelated questions. This can be done with the OpenAI CLI chat_completions subcommand:

$ openai api chat_completions.create -h
usage: openai api chat_completions.create [-h] -g ROLE CONTENT [-e ENGINE | -m MODEL] [-n N] [-M MAX_TOKENS] [-t TEMPERATURE]
                                        [-P TOP_P] [--stop STOP] [--stream]

required arguments:
  -g ROLE CONTENT, --message ROLE CONTENT
                      A message in `{role} {content}` format. Use this argument multiple times to add multiple messages.

optional arguments:
  -e ENGINE, --engine ENGINE
                      The engine to use. See https://learn.microsoft.com/en-us/azure/cognitive-services/openai/chatgpt-
                      quickstart?pivots=programming-language-python for more about what engines are available.
  -m MODEL, --model MODEL
                      The model to use.
  -n N, --n N         How many completions to generate for the conversation.
  -M MAX_TOKENS, --max-tokens MAX_TOKENS
                      The maximum number of tokens to generate.
  -t TEMPERATURE, --temperature TEMPERATURE
                      What sampling temperature to use. Higher values means the model will take more risks. Try 0.9 for more
                      creative applications, and 0 (argmax sampling) for ones with a well-defined answer. Mutually exclusive with
                      `top_p`.
  -P TOP_P, --top_p TOP_P
                      An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the
                      tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are
                      considered. Mutually exclusive with `temperature`.
  --stop STOP         A stop sequence at which to stop generating tokens for the message.
  --stream            Stream messages as they're ready.

I used 0.0 for temperature, which controls the amount of randomness introduced into the results. The lower the value, the more deterministic the response will be, which is useful for checking that the model is drawing on the correct content. Replace the model passed to -m with the ID of your own fine-tuned model.

$ openai api chat_completions.create -m ft:gpt-3.5-turbo-0613:embyr::7t0RQy5K --message system "You are a factual chatbot to answer questions about AI and Embyr." --message user "How can Embyr help me integrate AI into my business?" -M 500 --stop '**STOP**'
Embyr can help businesses integrate AI into their existing systems by providing consulting services, assisting with project scoping, and managing the implementation process.
$ openai api chat_completions.create -m ft:gpt-3.5-turbo-0613:embyr::7t0RQy5K --message system "You are a factual chatbot to answer questions about AI and Embyr." --message user "How can AI help my business?" -M 500 --stop '**STOP**'
AI can help businesses automate processes, gain insight from data, and provide better customer experiences. AI technologies, applications, and use cases are vast and continually evolving.
$ openai api chat_completions.create -m ft:gpt-3.5-turbo-0613:embyr::7t0RQy5K --message system "You are a factual chatbot to answer questions about AI and Embyr." --message user "What is Star Wars?" -M 500 --stop '**STOP**'
Star Wars is a popular science fiction franchise that was created by George Lucas. It has become an astronomical success and consists of movies, TV series, books, comics, and more.
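
These spot checks can also be scripted with the openai Python library, which makes it easy to rerun the same battery of questions whenever the model is retrained. A minimal sketch using the model ID from the runs above:

# Run a small battery of test questions against the fine-tuned model
test_questions = [
    "How can Embyr help me integrate AI into my business?",
    "How can AI help my business?",
    "What is Star Wars?",
]

for question in test_questions:
    response = openai.ChatCompletion.create(
        model="ft:gpt-3.5-turbo-0613:embyr::7t0RQy5K",
        messages=[
            {"role": "system", "content": "You are a factual chatbot to answer questions about AI and Embyr."},
            {"role": "user", "content": question},
        ],
        temperature=0.0,
        max_tokens=500,
        stop=["**STOP**"],
    )
    print(question)
    print(response.choices[0].message.content)
    print()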

Hmm… that worked but we probably don’t want our ChatBot to be able to answer questions about everything. In the next section, we will discuss how we can use discriminators to help guide our chatbot to stay on topic.

Conclusion

We have successfully trained the QA model that will act as the core of our custom Embyr website chatbot. We converted our QA dataset into the required JSONL format for fine-tuning and used OpenAI’s fine-tuning API to create a custom model trained on all things Embyr. But we aren’t done yet. After testing our model, we know we still need a way to ensure the inputs and responses always relate to Embyr. In the next section, we will discuss how we can handle this with discriminator fine-tuned models, built on babbage-002, that “decide” whether the input and/or output are related.

Part 1: What are Chatbots, and why would I want one?

Part 2: From Markdown to Training Data?

Part 4: Safeguard your Chatbot with Discriminators

References

OpenAI Fine-tuning

Olympics example from OpenAI Cookbook

Chat API reference

pandas