TL;DR: This study compares four methods for extracting structured data from emails using GPT-4o. It finds that Structured One-Shot (a single call with a strict output format) is the most cost-effective and efficient, while Unstructured Many-Shot (multiple calls, one per data point) achieves the highest accuracy but at a higher cost. For most real-world applications, Structured One-Shot offers the best balance of performance and affordability.
Introduction
In today’s digital world, businesses rely on email communication for a vast array of functions—from handling customer support requests to processing sales inquiries. However, much of this communication exists in an unstructured format, making it difficult to extract and utilize key details efficiently. Without automation, organizations must manually sift through emails to pull out important information like customer names, phone numbers, or order details, which is both time-consuming and error-prone.
Traditional methods, like Named Entity Recognition (NER), often struggle with the complexity and nuance of free-form text, as they are designed to identify only predefined entities (e.g., names, locations, dates). This can result in missed or incomplete extractions when the data doesn't fit neatly into these categories. The real challenge, therefore, is converting this free-form text into structured data that can seamlessly integrate with internal systems—capturing not just specific entities, but also understanding context, intent, and relationships.
To overcome these limitations, we turned to OpenAI's GPT-4o model, which offers a more flexible, context-aware approach to text processing. Unlike NER, which is rigid and often unable to handle subtle variations in language, GPT-4o can adapt to diverse email formats and complex requests. We tested four different prompting strategies to determine which approach yields the most accurate, efficient, and cost-effective results for automating this process.
The Business Benefit of Automating Data Extraction
Email is both a vital communication channel and a potential data bottleneck. Manually extracting key information from emails is slow, expensive, and error-prone. Automating this with advanced models like GPT-4o provides a critical pathway to enhanced operational efficiency: businesses can significantly reduce the time and resources spent on manual processing, freeing up valuable employee time for strategic work. This efficiency gain translates directly into cost savings and improved productivity across departments.
Beyond efficiency, automation unlocks the hidden value within email data. Unstructured email content becomes structured, easily integrated into CRM, analytics, and other business systems. This transformation allows organizations to leverage email data for better decision-making, personalized customer experiences, and proactive identification of trends and issues. Furthermore, automated systems enhance customer service responsiveness and scale effortlessly with growing email volumes, making email data extraction automation a crucial investment for modern, data-driven businesses.
Model Prompting and Data Extraction
Prompting refers to the practice of instructing a language model to generate specific responses based on input queries. By carefully designing prompts, we can guide GPT-4o to extract structured data from unstructured text, such as emails. In this study, we experimented with both unstructured prompts, where the model receives open-ended instructions to return data in JSON format, and structured prompts, which enforce strict output constraints using predefined schemas. The key goal of prompting is to ensure consistency, accuracy, and reliability in data extraction while balancing cost and performance.
Creating Structured Outputs with GPT-4o
When we talk about "structured" output, we mean that GPT-4o is instructed to follow a predefined output format (often JSON) with specific fields and rules. The reason we tested both "structured" and "unstructured" prompts is that OpenAI has fine-tuned GPT-4o to handle response formats, but this fine-tuning wasn’t part of its original core training. In theory, GPT-4o should be very good at following user instructions in plain text—since that was part of its initial training—but might produce slightly varied or inconsistent outputs when faced with complex constraints (like enumerations). By contrast, the new "response format" tools can be more consistent in enforcing strict output structure, but we aren’t entirely sure whether the extra fine-tuning will increase or decrease overall accuracy in extracting the correct data.
Enums in particular illustrate this difference. An "enum" (short for enumeration) is a predefined list of valid values for a field, for example ["Red", "Green", "Blue"]. When using a strict response format (like a JSON schema requiring an enum), GPT-4o is more likely to stick to valid values. When using an unstructured prompt, GPT-4o might produce an unexpected value, like "Teal", which is not in the allowed list.

To create a structured response format, we use a JSON schema element, which follows this rough setup:
{ "type": "json_schema", "json_schema": { "name": "fetch_customer_name", "schema": { "type": "object", "properties": { "propertyName": { "type": "string", "title": "name", "description": "name of the person" } }, "required": ["propertyName"], "additionalProperties": false } } }
We can add multiple properties to the inner `properties` element, like so:
"properties": { "firstName": { "type": "string", "title": "first name", "description": "first name of the person" }, "lastName": { "type": "string", "title": "last name", "description": "last name of the person" } }, "required": ["firstName", "lastName"]
If such a schema is specified, GPT-4o is forced to output data matching it. A correct response to the name schema above might look like:
{ "firstName": "Tobias", "lastName": "Rehfeldt" }
There are also native integrations with language-specific setups; for example, the OpenAI Python package accepts structured Pydantic models directly. Given that Pydantic is already present in many Python projects, this makes it easy to constrain the GPT output without writing additional JSON schemas.
```python
from pydantic import BaseModel

class User(BaseModel):
    first_name: str
    last_name: str
```
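As a rough sketch of how such a model can be wired in (assuming a recent version of the `openai` package, where the beta `parse` helper accepts a Pydantic model as `response_format`):

```python
from openai import OpenAI
from pydantic import BaseModel

class User(BaseModel):
    first_name: str
    last_name: str

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract the first and last name from the email."},
        {"role": "user", "content": "..."},  # placeholder for the raw email text
    ],
    response_format=User,  # the SDK converts the Pydantic model to a JSON schema
)

user = completion.choices[0].message.parsed  # a validated User instance
print(user.first_name, user.last_name)
```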
Whether you use a JSON schema or a language-specific object, either method ensures that our output is consistent every time, reducing guesswork and making downstream data processing simpler.
Structured outputs can be organized in two ways: flat or nested.
A flat structure represents a single piece of information, such as a customer’s name. In contrast, a nested structure involves an element that contains other elements. For example, in an online order, the "order" would be the parent element, which could contain multiple "items" as child elements. Each item might include details like quantity, price, and description. Here's an example of a structured output from GPT:
{ "order": [ { "quantity": 15, "price": 10, "description": "lollipops" }, { "quantity": 7, "price": 50, "description": "rock candy" } ] }
The Four Approaches
We tested four ways of interacting with GPT-4o:
Unstructured One-Shot
- Setup: A single API call prompts GPT-4o with one open-ended instruction covering all fields, with no enforced output format.
- Example Prompt: "Extract the first and last name."
- Example Output (in one go): {"firstName": "Tobias", "lastName": "Rehfeldt"}
Unstructured Many-Shot
- Setup: Multiple API calls, each requesting a single key.
- Example Prompts:
- Call 1: "Extract the first name." → Output:
{"value": "Tobias"}
- Call 2: "Extract the last name." → Output:
{"value": "Rehfeldt"}
- Call 1: "Extract the first name." → Output:
Structured One-Shot
- Setup: A single API call with a strict response format (like the JSON schema or Pydantic model above).
- Example Prompt: "Return the data in valid JSON matching this schema: <schema or model definition>"
- Example Output: {"firstName": "Tobias", "lastName": "Rehfeldt"}
Structured Many-Shot
- Setup: Multiple calls, each using a response format for a single field.
- Example Prompts:
- Call 1: "Return the first name in valid JSON." → Output:
{"value": "Tobias"}
- Call 2: "Return the last name in valid JSON." → Output:
{"value": "Rehfeldt"}
- Call 1: "Return the first name in valid JSON." → Output:
Methodology
Data Collection
We gathered an array of real emails containing various personal and technical details, surrounded by irrelevant text such as headers, footers, links, and images. The data we extracted includes specific strings (e.g., names), unspecific strings (descriptions of protocols), numbers, enumerations, and dates. We also have nested structures containing strings (both specific and unspecific), numbers, and enumerations.
Procedure
From the data collection we took a few "hard" emails, which contained all of the needed information but in a very unstructured, hard-to-read writing style. We then extracted the information 20 times per email and measured the consistency of the outputs for each method.
Model Prompts
- Unstructured prompts explain what to extract and how, simply asking for key-value pairs in a JSON object with no strict format.
- Structured prompts explain what to extract and how, and use a JSON schema to ensure GPT-4o adheres to a specific format.
Evaluation Metrics
- Accuracy: How often GPT-4o returned correct information in the specified fields.
- Consistency: Whether the output followed the desired structure reliably.
- Token Use & Cost: The total number of tokens consumed, impacting both speed and monetary cost.
Results and Findings
We found the following rough ranking (from least to most accurate):
- Unstructured One-Shot: This approach most often missed information that was provided to it, and when it did find information, it was wrong more often than in the other setups.
- Structured Many-Shot: This performed better overall than Unstructured One-Shot, but not to a degree where the increased cost and token usage would be worth it.
- Structured One-Shot: Performed much more reliably than Structured Many-Shot while requiring just one call. This method strikes a good balance between cost and consistency.
- Unstructured Many-Shot: This method exceeded the accuracy of Structured One-Shot because each single-field request was very focused. However, this approach required multiple calls.
In our testing we had multiple emails containing more than 10,000 tokens each, and about 15 fields to extract. We also divided these fields into chunks: instead of finding all of the information in one call, we made 4 calls, 2 for nested structures (1 field each) and 2 for flat fields (about 6.5 fields each on average). Given this setup, the one-shot methods make 4 total calls, totaling roughly 40K tokens at a cost of about $0.10 per run. In the many-shot methods, however, it does not matter how many chunks we divide the fields into, as each field is requested separately regardless; this ended at roughly 150,000 tokens and about $0.40 per run. Because the one-shot methods can be controlled more efficiently in chunks, they also make it easier to control cost.
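As a back-of-the-envelope check of those figures (assuming GPT-4o's list price of roughly $2.50 per million input tokens at the time of testing; output tokens add a small amount on top):

```python
PRICE_PER_M_INPUT = 2.50  # USD per 1M input tokens (assumed GPT-4o list price)
EMAIL_TOKENS = 10_000     # a large email from our test set

one_shot_tokens = 4 * EMAIL_TOKENS    # 4 chunked calls, each re-sending the email
many_shot_tokens = 15 * EMAIL_TOKENS  # ~15 fields, one call (and one email copy) each

print(f"one-shot:  ~${one_shot_tokens * PRICE_PER_M_INPUT / 1e6:.2f}")   # ~$0.10
print(f"many-shot: ~${many_shot_tokens * PRICE_PER_M_INPUT / 1e6:.2f}")  # ~$0.38
```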
Conclusion and Final Ranking
If cost and speed are top priorities, Structured One-Shot is a good choice. If you need maximum accuracy regardless of increased cost, our findings show that Unstructured Many-Shot delivers the best overall results using GPT-4o. Generally, however, we would recommend a Structured One-Shot approach for production-scale email parsing, because it offers clear schema adherence, solid accuracy, and a single call. A backend built around Structured One-Shot is also easier to set up and maintain.
Furthermore, with reasoning models like OpenAI's o1, Gemini's Flash-Thinking-Exp, or DeepSeek-R1 becoming widely available, the question arises of how they will affect this type of data extraction. Given that these models are slower, have lower token limits, and cost more, we would expect a one-shot approach to be more suitable for them. Given their reasoning capabilities, we would assume that Structured One-Shot would be the most suitable approach for these models, performing at or above the Unstructured Many-Shot approach using GPT-4o.
Ready to take the next step in email data extraction using AI? At Flowtale, we are constantly exploring and adapting our data extraction methods to each model's strengths, and we are here to help you navigate this rapidly evolving landscape. Our tool, Flowform, helps extract data and automate your business processes. If you need consultation on how to extract data from emails, or want to automate your business processes, contact us.