How you can fine-tune document data extraction using the latest LLMs

Hello! In this article, we’ll explore how to enhance data extraction from documents by fine-tuning and optimising our interactions with large language models (LLMs). We'll take GPT-4o as an example.


1. Understanding GPT-4o’s Capabilities

Before diving into the details of fine-tuning and best practices, let’s take a quick look at GPT-4o’s core capabilities:

  • Natural Language Understanding: GPT-4o can read and understand text in context, making it helpful for tasks like summarisation and data extraction.
  • Contextual Reasoning: You can guide GPT-4o to focus on certain aspects of a document with carefully constructed prompts or instructions.
  • Generative Responses: GPT-4o generates human-like text from the inputs you provide, which is great for summarisation but can introduce inaccuracies if it isn’t prompted properly.

With these capabilities in mind, remember that good prompt design often goes hand in hand with model performance. Even small changes in how you present your prompt can make a big difference in your results.

2. Leveraging Prompt Engineering

2.1 Keep Prompts Clear and Structured

One of the simplest ways to improve data extraction from documents is to optimise your prompt. Provide clear instructions and contextual information about what you need. Avoid vague language; be as direct and specific as possible about the data you want to extract. For example, if you need the “Invoice Number” and “Total Amount” from a text, ask for them explicitly:

import openai

client = openai.OpenAI(api_key="YOUR_OPENAI_API_KEY")

prompt = """
You are an assistant designed to extract specific pieces of information from text.
Extract the following from the text:
- Invoice Number
- Total Amount

Text:
\"\"\"
Invoice Number: INV-12345
Date: 2025-01-19
Total Amount: £250.00
\"\"\"

Please provide the data in JSON format, with keys 'invoice_number' and 'total_amount'.
"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=100,
    temperature=0
)

print(response.choices[0].message.content.strip())

In the example above:

  • We explicitly mention exactly which data we want to extract.
  • We provide an example text.
  • We instruct the model to respond in JSON format.

2.2 Use Step-by-Step Guidance

You can also use a step-by-step approach (sometimes called chain-of-thought prompting) to clarify your needs. For instance:

prompt = """
Step 1: Read the document carefully and identify the date, invoice number, and total amount.
Step 2: Once you find these values, list them in an easily machine-readable format.

Document:
\"\"\"
Invoice Number: INV-67890
Issue Date: 2025-02-15
Total Amount: £340.75
\"\"\"

Perform Step 1, then Step 2.
"""

import openai

client = openai.OpenAI(api_key="YOUR_OPENAI_API_KEY")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=150,
    temperature=0
)

print(response.choices[0].message.content.strip())

By walking the model through the steps, you reduce ambiguity and improve the likelihood of accurate data extraction.

3. Fine-Tuning Your Model

3.1 Why Fine-Tune?

Fine-tuning is the process of training a model on your own dataset so that it better understands the specific type of text and questions relevant to your use case. While GPT-4o is great out of the box, a fine-tuned model can be more accurate and consistent for your specific needs, such as extracting data from a particular type of form or legal document. (Fine-tuning availability varies by model; the examples below use gpt-3.5-turbo.)

3.2 Steps to Fine-Tune

  1. Prepare Your Training Data

Create a dataset of prompt-response pairs. Here, your prompt will be the document snippet (or relevant question), and the response will be the correctly extracted data. For instance, if your training set has 100 invoice text snippets, each should be paired with the correct extracted data in your desired format (e.g., JSON).

  2. Format Data for Fine-Tuning

OpenAI expects a JSONL file format for fine-tuning. For chat models such as gpt-3.5-turbo, each line is one training example: a messages array pairing the user prompt with the correct assistant completion:

{"messages": [{"role": "user", "content": "Extract invoice details:\nInvoice Number: INV-001\n..."}, {"role": "assistant", "content": "{\"invoice_number\": \"INV-001\", ... }"}]}
{"messages": [{"role": "user", "content": "Extract invoice details:\nInvoice Number: INV-002\n..."}, {"role": "assistant", "content": "{\"invoice_number\": \"INV-002\", ... }"}]}
// ... more lines
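If your labelled examples live in code, a short script can generate this file. A minimal sketch using the chat-style messages format (the invoice snippets and file name are illustrative):

```python
import json

# Illustrative invoice snippets paired with their gold-standard extractions.
examples = [
    {
        "text": "Invoice Number: INV-001\nTotal Amount: £120.00",
        "fields": {"invoice_number": "INV-001", "total_amount": "£120.00"},
    },
    {
        "text": "Invoice Number: INV-002\nTotal Amount: £89.50",
        "fields": {"invoice_number": "INV-002", "total_amount": "£89.50"},
    },
]

# Write one chat-format training example per line (JSONL).
with open("my_data.jsonl", "w") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "user", "content": "Extract invoice details:\n" + ex["text"]},
                {"role": "assistant", "content": json.dumps(ex["fields"])},
            ]
        }
        f.write(json.dumps(record) + "\n")
```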
  3. Upload Your Dataset

Use the OpenAI Python SDK (or the REST API) to upload your JSONL file for fine-tuning:

import openai

client = openai.OpenAI(api_key="YOUR_OPENAI_API_KEY")

training_file = client.files.create(
    file=open("my_data.jsonl", "rb"),
    purpose="fine-tune"
)
  4. Initiate Fine-Tuning

Once your data is uploaded, you can start a fine-tuning job:

import openai

client = openai.OpenAI(api_key="YOUR_OPENAI_API_KEY")

job = client.fine_tuning.jobs.create(
    training_file="file-your-file-id",  # the file ID returned by the upload step
    model="gpt-3.5-turbo"
)

Monitor the fine-tuning job’s progress, and once complete, you’ll have a fine-tuned model specialised in extracting data from your type of documents.
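While the job runs, you can poll its status with the SDK’s fine_tuning.jobs.retrieve call. A minimal sketch, where client is an OpenAI() instance and the job ID placeholder is illustrative:

```python
import time

def wait_for_fine_tune(client, job_id, poll_seconds=30):
    """Poll a fine-tuning job until it reaches a terminal state; return the
    fine-tuned model name, or raise if the job failed or was cancelled."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status == "succeeded":
            return job.fine_tuned_model
        if job.status in ("failed", "cancelled"):
            raise RuntimeError(f"Fine-tuning ended with status: {job.status}")
        time.sleep(poll_seconds)

# Usage (client is an OpenAI() instance; the job ID comes from the create call):
# model_name = wait_for_fine_tune(client, "ftjob-your-job-id")
```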

  5. Use Your Fine-Tuned Model

After training, make requests to your fine-tuned model by specifying its name:

import openai

client = openai.OpenAI(api_key="YOUR_OPENAI_API_KEY")

response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo:your-unique-model-name",
    messages=[{
        "role": "user",
        "content": "Extract data:\nInvoice Number: INV-123\nDate: 2025-03-10\nTotal: £400.00"
    }],
    max_tokens=100,
    temperature=0
)

print(response.choices[0].message.content.strip())

4. Improving Accuracy with Retrieval-Augmented Generation

Sometimes, you have a large corpus of documents, and you need to ensure the model only uses relevant text. One effective strategy is Retrieval-Augmented Generation (RAG). Rather than passing the entire document to ChatGPT, you can:

  1. Embed and Index Your Documents

Use a text-embedding model from OpenAI (for example, text-embedding-ada-002) to convert each document into embeddings. Store these embeddings in a vector database (like Pinecone, Weaviate, or FAISS).

  2. Search for Relevant Chunks

When you receive a request for data extraction, embed the user query, retrieve the most relevant document chunks from your vector database, and provide those chunks to ChatGPT as context.

  3. Keep the Context Window Lean

By only passing the relevant chunks, you reduce confusion and ensure ChatGPT focuses on the most pertinent data.

This approach can significantly boost accuracy, especially for large, unwieldy documents.
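To make the retrieval step concrete, here is a minimal in-memory sketch using cosine similarity. The toy 3-dimensional vectors in the usage example stand in for real embeddings (e.g. from text-embedding-ada-002), and a plain Python list stands in for a vector database:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(query_embedding, chunk_embeddings, chunks, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = sorted(
        zip(chunks, chunk_embeddings),
        key=lambda pair: cosine_similarity(query_embedding, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]
```

In production you would compute the embeddings with an embedding model and let your vector database do this search for you; the logic, however, is the same.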

5. Handling Edge Cases and Errors

Even with fine-tuning and retrieval augmentation, occasional inaccuracies can occur. Here are some tips:

  • Validation Rules: Implement simple checks to validate the model’s output (e.g., if an extracted date doesn’t match your expected date format, request a clarification or re-run extraction).
  • Regex Matching: For some structured fields like invoice numbers or emails, you might use regex to post-process ChatGPT’s output. This helps filter out accidental mistakes.
  • Ask Follow-Up Questions: If your first extraction is incomplete or unclear, programmatically ask ChatGPT follow-up questions for clarification.
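As a sketch, a validator along these lines (the field names and patterns are assumptions based on the invoice example used throughout) can catch malformed output before it reaches your database:

```python
import re

def validate_extraction(data):
    """Run simple post-extraction checks on the model's parsed JSON output
    and return a list of problems (empty means the output looks valid)."""
    errors = []
    # Invoice numbers like INV-12345 (assumed format for this example).
    if not re.fullmatch(r"INV-\d+", data.get("invoice_number", "")):
        errors.append("invoice_number does not match expected pattern")
    # ISO dates like 2025-01-19.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", data.get("date", "")):
        errors.append("date is not in YYYY-MM-DD format")
    # Amounts like £250.00.
    if not re.fullmatch(r"£\d+(\.\d{2})?", data.get("total_amount", "")):
        errors.append("total_amount is not a recognisable amount")
    return errors
```

If the list is non-empty, you can re-run the extraction or ask a follow-up question as described above.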

6. Example End-to-End Workflow

Here’s a simplified workflow to illustrate how you might combine fine-tuning and best practices:

  1. Upload and Index Documents: Store them in a vector database.
  2. Embed Query: When a user asks for data, embed their query using an embedding model.
  3. Retrieve Relevant Chunks: Pull the most relevant chunks from your vector database.
  4. Prompt ChatGPT: Provide a system or user prompt with the relevant document content, asking for specific fields to be extracted.
  5. Post-Process: Validate or transform the output into your desired format.
  6. Store or Present Results: Save the extracted data in your database or display it to the user.
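The workflow above can be sketched as a single function. All of the names are illustrative, and the embedding model, vector store, chat model, and validation rules are injected as callables so you can swap in your real components:

```python
import json

def extract_from_corpus(query, embed_fn, retrieve_fn, llm_fn, validate_fn):
    """End-to-end extraction: embed the query, retrieve relevant chunks,
    prompt the model, then parse and validate its output.  The *_fn
    arguments are stand-ins for your embedding model, vector database,
    chat model, and validation rules."""
    query_embedding = embed_fn(query)            # step 2: embed the query
    chunks = retrieve_fn(query_embedding)        # step 3: retrieve relevant chunks
    prompt = (                                   # step 4: prompt the model
        "Extract the requested fields as JSON.\n\n"
        f"Request: {query}\n\nContext:\n" + "\n---\n".join(chunks)
    )
    raw = llm_fn(prompt)
    data = json.loads(raw)                       # step 5: post-process the output
    errors = validate_fn(data)
    return data, errors                          # step 6: store or present results
```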

By combining these strategies, you’ll likely see a big improvement in the accuracy and consistency of your data extraction. Just remember to keep experimenting, iterating, and refining. Best of luck, and have a lovely time! ✨