
Share
This article delves into using GPT-4o’s structured outputs to streamline web scraping, showcasing how Pydantic models can refine and enhance data extraction processes.
I'm pretty excited about the new structured outputs feature in OpenAI’s API, so I decided to take it for a spin and develop an AI-assisted web scraper. This post summarizes my learnings and shares some interesting findings.
The first experiment was straightforward: I asked GPT-4o to extract data from an HTML string using the new structured outputs feature. To structure the output, I used the following Pydantic models:
from typing import List, Dict
from pydantic import BaseModel
class ParsedColumn(BaseModel):
name: str
values: List[str]
class ParsedTable(BaseModel):
name: str
columns: List[ParsedColumn]
The system prompt I used was:
You’re an expert web scraper. You’re given the HTML contents of a table and you have to extract structured data from it.
Here are some interesting things I found when parsing different tables:
After experimenting with simple tables, I wanted to see how the model would handle more complex ones. I passed a 10-day weather forecast from Weather.com, which contains a large row at the top and smaller rows for the other nine days. Interestingly, GPT-4o was able to parse this correctly:
For the remaining nine days, the table shows both day and night forecasts (see screenshot above). The model correctly parsed this data and added a Day/Night column. Here’s how it looks in the browser (note that you need to click on the button to the right of each row to display this):
At first, I thought the parsed Condition column was a hallucination because I didn’t see it on the website. However, upon inspecting the source code, I realized these tags exist but are invisible in the table.

When thinking about where to find "easy tables," my first thought was Wikipedia. Surprisingly, a seemingly simple table from the Human Development Index page breaks the model because rows with repeated values are merged:
While the model is able to retrieve individual columns (as instructed by the system prompt), they don’t have the same size, making it impossible to represent the data as a table.
I tried modifying the system prompt with the following:
Tables might collapse rows into a single row. If that’s the case, extract the collapsed row as multiple JSON values to ensure all columns contain the same number of rows.
Unfortunately, this didn’t work. I have yet to try further modifications to the system prompt to instruct the model to extract rows instead of columns.
Running an OpenAI API call every time can become very expensive, so I decided to ask the model to return XPaths (a language for selecting nodes in XML documents) instead. This way, you can use a traditional web scraping library to extract data based on the provided XPaths.
Here’s how I modified the system prompt:
You’re an expert web scraper. Given the HTML contents of a table, provide the XPath expressions needed to extract each column's values.
This approach reduces the number of API calls and leverages the model’s ability to understand complex HTML structures without incurring the cost of multiple requests.
Overall, GPT-4o’s structured outputs feature shows great promise for AI-assisted web scraping. While it has some limitations, especially with merged rows, it can handle complex tables quite well. If you’re interested, you can check out the demo and source code to see how it works in action.
Tags
Original Sources
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
More from The Engineer →This Week's Edition
13 September 2024
133 articles
Related Articles
Related Articles
More Stories