Fine-tuning OpenAI models

The OpenAI documentation suggests that fine-tuning lets you get more out of the models by providing:

  • Higher quality results than prompting
  • Ability to train on more examples than can fit in a prompt
  • Token savings due to shorter prompts
  • Lower latency requests

I wanted to validate to what extent fine-tuning OpenAI models is worth it.

The general idea is very simple:

  1. Prepare training data that describes the system message, the user message, and the expected assistant response.
    1. I used ChatGPT 4 to generate random inputs.
    2. I then wrote a prompt that produces the expected output when using gpt-4-turbo.
    3. I then used the prompt to generate a training dataset.
  2. Upload the data to OpenAI https://platform.openai.com/finetune/ and start the fine-tuning process.
    • I used gpt-3.5-turbo-0125 as the base model for this test.
  3. Run your future prompts using the new model ID that you are provided with (steps 2 and 3 are sketched in code right after this list).
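
For reference, steps 2 and 3 can also be done programmatically. A minimal sketch using the OpenAI Node SDK (the training file name is an assumption; the fine-tuned model ID is whatever the completed job reports):

import fs from 'node:fs';
import OpenAI from 'openai';

const openai = new OpenAI();

const main = async () => {
  // Upload the JSONL training file.
  const file = await openai.files.create({
    file: fs.createReadStream('training.jsonl'),
    purpose: 'fine-tune',
  });

  // Start the fine-tuning job against the base model.
  const job = await openai.fineTuning.jobs.create({
    training_file: file.id,
    model: 'gpt-3.5-turbo-0125',
  });

  console.log(job.id);

  // Once the job completes, OpenAI reports a fine-tuned model ID
  // (e.g. "ft:gpt-3.5-turbo-0125:..."), which you then pass as `model`
  // in subsequent chat completion requests.
};

void main();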

The prompt I used was:

Parse user input and return mentions of geographical locations.

Reply with a JSON that contains an array of locations. Each location should describe city and country.

* "city" is null if city name is not mentioned in the input.
* "country" is null if country name is not mentioned in the input and cannot be inferred from the city name.

Example: "graphic designer in New York"

Expected Output:

{
  "locations": [
    {
      "city": "New York",
      "country": "United States"
    }
  ]
}

The prompt is simplified for the sake of this example. In reality, the prompt was trying to extract different types of entities from the user input, and provided many more examples.

I then generated random inputs with ChatGPT 4, ran them through the prompt to produce the expected outputs, and formatted the results as training data:

{
  "messages": [
    {
      "content": "Parse user input and return mentions of geographical locations.\n\nReply with a JSON that contains an array of locations. Each location should describe city and country.\n\n* \"city\" is null if city name is not mentioned in the input.\n* \"country\" is null if country name is not mentioned in the input and cannot be inferred from the city name.\n\nExample: \"graphic designer in New York\"\n\nExpected Output:\n\n{\n  \"locations\": [\n    {\n      \"city\": \"New York\",\n      \"country\": \"United States\"\n    }\n  ]\n}",
      "role": "system"
    },
    {
      "content": "graphic designer in New York",
      "role": "user"
    },
    {
      "content": "{\"locations\":[{\"city\":\"New York\",\"country\":\"United States\"}]}",
      "role": "assistant"
    }
  ]
}
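
As an aside, the file uploaded to OpenAI has to be in JSONL format, i.e. one training example per line. A minimal sketch of producing it (here `examples` holds just the single example from above, but would normally hold the full generated dataset):

import fs from 'node:fs';

// Each training example becomes one line of the JSONL file.
const examples = [
  {
    messages: [
      // The system prompt included here is what the fine-tuned model will
      // expect at inference time (see the Gotchas section), abbreviated for brevity.
      { role: 'system', content: 'Parse user input and return mentions of geographical locations. ...' },
      { role: 'user', content: 'graphic designer in New York' },
      {
        role: 'assistant',
        content: '{"locations":[{"city":"New York","country":"United States"}]}',
      },
    ],
  },
];

fs.writeFileSync(
  'training.jsonl',
  examples.map((example) => JSON.stringify(example)).join('\n') + '\n',
);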

Including the full system prompt in every training example turned out to be a mistake. Refer to the Gotchas section below for more information.

I used 200 messages in the training sample (apparently somewhere between 50 and 100 is enough [1]).

Fine-tuning took about 20 minutes and cost no more than USD 5.

I then built a test suite that would cover various edge cases and compared the results of the fine-tuned model with the non-tuned model.

it('extracts mentions of geographical locations (city with inferred country)', async () => {
  const searchAttributes = await extractSearchAttributes(
    'graphic designer in New York',
  );
 
  expect(searchAttributes).toEqual({
    locations: [
      {
        city: 'New York',
        country: 'United States',
      },
    ],
  });
});
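
extractSearchAttributes is a thin helper around the chat completions API. Roughly something like the following, where the model ID and system prompt constants are placeholders:

import OpenAI from 'openai';

const openai = new OpenAI();

// Placeholders: use the model ID reported by the fine-tuning job, and whatever
// system prompt the model was trained to expect (see the Gotchas section).
const MODEL = 'ft:gpt-3.5-turbo-0125:...';
const SYSTEM_PROMPT = 'Parse user input and return mentions of geographical locations.';

export const extractSearchAttributes = async (userInput: string) => {
  const completion = await openai.chat.completions.create({
    model: MODEL,
    messages: [
      { role: 'system', content: SYSTEM_PROMPT },
      { role: 'user', content: userInput },
    ],
  });

  // The model replies with a JSON string describing the extracted locations.
  return JSON.parse(completion.choices[0].message.content ?? '{}');
};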

It is important to note that these edge cases were not present in the training dataset.

I used this to quickly sanity-check the accuracy of the fine-tuned model against the non-tuned model. Writing the test cases out by hand was also valuable, as it forced me to think through the problem and the edge cases.

The fine-tuned model crushed all the tests.

After fine-tuning, the gpt-3.5-turbo model performed just as well as gpt-4-turbo.

This was a win.

The fine-tuned model takes anywhere from 40% to 70% less time to respond.

It varies a lot, but it is consistently faster than the non-tuned model.

However, it still takes anywhere from 300ms to 500ms to produce a response. In contrast, we get 100-200ms with Groq without fine-tuning.

The cost of using the fine-tuned model is not the same as the cost of the gpt-3.5-turbo model.

Model                        Stage     Cost
gpt-3.5-turbo (fine-tuned)   Input     $3.00 / 1M tokens
gpt-3.5-turbo (fine-tuned)   Output    $6.00 / 1M tokens
gpt-3.5-turbo (fine-tuned)   Training  $8.00 / 1M tokens
gpt-3.5-turbo                Input     $0.50 / 1M tokens
gpt-3.5-turbo                Output    $1.00 / 1M tokens
gpt-4-turbo                  Input     $10.00 / 1M tokens
gpt-4-turbo                  Output    $30.00 / 1M tokens

So the cost of using the fine-tuned model is about 6 times higher than the cost of using the base gpt-3.5-turbo model. However, the fine-tuned model requires far fewer system prompt tokens to produce a response. Continuing with our example, with the fine-tuned model my system prompt is 300 tokens long, while I need a system prompt of approximately 1,500 tokens to get equivalent results with gpt-4-turbo. (I was not able to get reliable results with gpt-3.5-turbo without fine-tuning.)

Let's say that I need to make 10,000 invocations:

  • With gpt-3.5-turbo (fine-tuned) it would cost me USD 9 (((300 * 10,000) / 1,000,000) * 3)
  • With gpt-4-turbo it would cost me USD 150 (((1,500 * 10,000) / 1,000,000) * 10)
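
The same arithmetic as a quick sketch, using the input prices from the table above:

// Input-token cost in USD for a number of calls, given a price per 1M tokens.
const inputCostUSD = (promptTokens: number, calls: number, pricePerMillionUSD: number) =>
  ((promptTokens * calls) / 1_000_000) * pricePerMillionUSD;

inputCostUSD(300, 10_000, 3); // 9 (gpt-3.5-turbo, fine-tuned)
inputCostUSD(1_500, 10_000, 10); // 150 (gpt-4-turbo)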

That's a saving of 94%.

That's a win.

When I started reading about fine-tuning, what threw me off was that many examples insisted that the training prompt should reflect how the model is going to be used.

In retrospect, I misunderstood this to mean 'the training dataset should use the same prompt as the system prompt that we are trying to optimize,' which in my case was the prompt used to build the training dataset.

If you refer back to the process, what happened was that I built a prompt that was already able to produce the desired output. It was a long prompt, but it worked. I then used that very prompt to create a dataset for fine-tuning the model. The problem is that I also included that very prompt as the system prompt in the training dataset. The fine-tuned model worked fine, but it expected the original (long) system prompt, which largely defeated the intent of reducing token usage.

Anyway, the learning here is that it does not matter what the prompt was when you were preparing the training dataset. The only thing that matters is how you want to use the model in the future, i.e. my training data should have described a short prompt that would trigger the fine-tuned model to produce the desired output, e.g.

{
  "messages": [
    {
      "content": "Parse user input and return mentions of geographical locations.",
      "role": "system"
    },
    {
      "content": "graphic designer in New York",
      "role": "user"
    },
    {
      "content": "{\"locations\":[{\"city\":\"New York\",\"country\":\"United States\"}]}",
      "role": "assistant"
    }
  ]
}

A model fine-tuned using the above training data would then know how to parse user input and what to return, without needing to be reminded of the original system prompt.

When experimenting, create a table to log every exchange:

CREATE TABLE <namespace>_openai_chat_completion (
  id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  -- the raw user input sent to the model (citext makes the uniqueness check case-insensitive)
  user_prompt citext NOT NULL,
  -- the model response
  response json NOT NULL,
  -- set when the exchange has been reviewed and added to the training sample
  included timestamp with time zone
);

-- deduplicate exchanges by user prompt
CREATE UNIQUE INDEX openai_chat_completion_user_prompt_idx ON <namespace>_openai_chat_completion(user_prompt citext_ops);

Then, record every exchange in the database.

Then, over time, manually review the entries in the table and mark as included those that cover interesting edge cases. This allows you to iterate on those edge cases, improve the model, and keep adding new cases to the training sample.
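
A sketch of that loop, assuming node-postgres, a hypothetical myapp namespace, and the table above:

import { Pool } from 'pg';

const pool = new Pool();

// Record every exchange; duplicate prompts are ignored thanks to the unique index.
export const recordExchange = async (userPrompt: string, response: unknown) => {
  await pool.query(
    `INSERT INTO myapp_openai_chat_completion (user_prompt, response)
     VALUES ($1, $2)
     ON CONFLICT (user_prompt) DO NOTHING`,
    [userPrompt, JSON.stringify(response)],
  );
};

// Export the manually reviewed rows (included IS NOT NULL) as JSONL training data.
export const exportTrainingData = async (systemPrompt: string) => {
  const { rows } = await pool.query(
    `SELECT user_prompt, response
     FROM myapp_openai_chat_completion
     WHERE included IS NOT NULL`,
  );

  return rows
    .map((row) =>
      JSON.stringify({
        messages: [
          { role: 'system', content: systemPrompt },
          { role: 'user', content: row.user_prompt },
          { role: 'assistant', content: JSON.stringify(row.response) },
        ],
      }),
    )
    .join('\n');
};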

This approach felt far more productive than my initial approach of mucking around with JSON files.

If you are optimizing for speed, fine-tuning OpenAI models is going to be an improvement. However, it is not going to be the fastest option out there.

If you are optimizing for cost or accuracy, fine-tuning OpenAI models is definitely going to be worth it.

  1. https://platform.openai.com/docs/guides/fine-tuning/example-count-recommendations
