Extract structured data from web search with Instructor


Instructor gives you typed, validated output from any LLM: you define a Pydantic model and get an instance back, with automatic retries if the model mis-formats its reply. But Instructor only shapes text the LLM already has, whether from training data or from the prompt. Pair it with Superhighway and you get the missing half: live web content to extract from. Search for fresh facts, then extract a clean schema.

The split: retrieval vs. extraction

	Superhighway	Instructor
Job	Bring in live web content	Turn text into a typed Pydantic model
Output	Search results, page markdown, synthesized research	Validated `BaseModel` instances
Without the other	Raw text the agent must parse by hand	Clean schema — but stale, from training data

Together: search → extract. Superhighway fetches what's true today; Instructor hands you a typed object you can trust.

Setup

pip install instructor openai requests
export SUPERHIGHWAY_API_KEY="your_key"   # free at superhighway.walls.sh/pricing
export OPENAI_API_KEY="your_key"

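Instructor wraps any supported client: OpenAI, Anthropic, Gemini, Mistral, Cohere. The examples below use OpenAI. To switch to Anthropic, replace instructor.from_openai(OpenAI()) with instructor.from_anthropic(Anthropic()), set model to a Claude model name, and pass max_tokens, which Anthropic's API requires. The response_model pattern stays the same.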

Path 1: Search → extract (the basic pattern)

Call /search for live results, concatenate them into context, and let Instructor extract a typed model:

import os
import instructor
import requests
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List

client = instructor.from_openai(OpenAI())

class SearchSummary(BaseModel):
    topic: str
    key_findings: List[str] = Field(description="Main findings from the search results")
    summary: str

# Step 1 — search the live web with Superhighway
resp = requests.get(
    "https://superhighway.walls.sh/search",
    params={"q": "latest AI agent frameworks 2025", "limit": 5},
    headers={"Authorization": f"Bearer {os.environ['SUPERHIGHWAY_API_KEY']}"},
    timeout=10,
)
resp.raise_for_status()  # surface auth or quota errors before parsing the body
results = resp.json()["results"]
context = "\n\n".join(f"{r['title']}\n{r['description']}" for r in results)

# Step 2 — extract a typed model with Instructor
summary = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=SearchSummary,
    messages=[
        {"role": "user", "content": f"Extract key findings from these search results:\n\n{context}"}
    ],
)

print(summary.key_findings)   # typed List[str], guaranteed to match the schema

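If the model returns malformed JSON or a field of the wrong type, Instructor catches the Pydantic ValidationError and re-prompts the model with the error message, so a one-off formatting slip still ends in clean, typed data. Retries are bounded: pass max_retries to create() to set the limit, and expect an exception if every attempt fails validation.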
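
The validate-then-re-prompt idea can be sketched without any network calls. This is an illustration of the pattern only, not Instructor's actual implementation: a hand-written check stands in for Pydantic so the sketch has no dependencies, and the names validate_summary, extract_with_retries, and fake_model are hypothetical.

```python
import json
from typing import Callable, Dict, List


def validate_summary(raw: str) -> Dict:
    """Parse JSON and check the fields SearchSummary expects (stand-in for Pydantic)."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("topic"), str):
        raise ValueError("topic must be a string")
    findings = data.get("key_findings")
    if not isinstance(findings, list) or not all(isinstance(f, str) for f in findings):
        raise ValueError("key_findings must be a list of strings")
    if not isinstance(data.get("summary"), str):
        raise ValueError("summary must be a string")
    return data


def extract_with_retries(
    ask_model: Callable[[List[Dict]], str], prompt: str, max_attempts: int = 3
) -> Dict:
    """Call the model, validate the reply, and re-prompt with the error on failure."""
    messages = [{"role": "user", "content": prompt}]
    last_error = None
    for _ in range(max_attempts):
        raw = ask_model(messages)
        try:
            return validate_summary(raw)
        except ValueError as err:
            last_error = err
            # Feed the bad reply and the error back so the model can correct itself.
            messages.append({"role": "assistant", "content": raw})
            messages.append(
                {"role": "user", "content": f"Validation failed: {err}. Reply with corrected JSON only."}
            )
    raise RuntimeError(f"no valid reply after {max_attempts} attempts") from last_error


# Hypothetical stub model: the first reply has the wrong type, the second is valid.
replies = iter([
    '{"topic": "agents", "key_findings": "not a list", "summary": "s"}',
    '{"topic": "agents", "key_findings": ["a", "b"], "summary": "s"}',
])
calls = []  # records how many messages the model saw on each call


def fake_model(messages: List[Dict]) -> str:
    calls.append(len(messages))
    return next(replies)


result = extract_with_retries(fake_model, "Extract key findings from these results: ...")
print(result["key_findings"])
```

The second call sees three messages (the original prompt, the rejected reply, and the validation error), which is what gives the model a chance to correct itself rather than repeat the same mistake.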

Path 2: Research → extract (deeper, page-level content)

/search returns titles and snippets. For richer extraction, use /research, which reads the top result pages and synthesizes them — then hand that to Instructor for a structured schema:

from typing import Optional

class CompanyProfile(BaseModel):
    name: str
    products: List[str]
    founded_year: Optional[int] = None
    key_facts: List[str]

resp = requests.get(
    "https://superhighway.walls.sh/research",
    params={"q": "Anthropic AI company overview", "pages": 3},
    headers={"Authorization": f"Bearer {os.environ['SUPERHIGHWAY_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()  # surface auth or quota errors before parsing the body
research = resp.json()

# /research already read the top pages — feed the page content to Instructor
context = "\n\n".join(
    f"## {p['title']}\n{p['markdown'][:3000]}"
    for p in research["pages"] if not p.get("error")
)

profile = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=CompanyProfile,
    messages=[{"role": "user", "content": f"Extract a company profile:\n\n{context}"}],
)

print(profile.products)        # typed List[str]
print(profile.founded_year)    # typed Optional[int]

Because /research reads real pages rather than just titles, the extracted model is grounded in full source text, not snippets. The Optional[int] on founded_year lets the model leave a field empty when the sources don't say — Instructor validates that too.
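
Page markdown can be long, so it may help to cap the context by a total budget as well as per page. The helper below is a sketch under assumptions: it reuses the title, markdown, and error fields from the example above, while build_context and the total_chars budget are hypothetical additions, and the sample pages are made-up data rather than real API output.

```python
from typing import Dict, Iterable, List


def build_context(
    pages: Iterable[Dict], per_page_chars: int = 3000, total_chars: int = 9000
) -> str:
    """Join page markdown into one prompt context, skipping failed or empty pages."""
    sections: List[str] = []
    used = 0
    for page in pages:
        if page.get("error"):
            continue  # the page could not be read
        body = (page.get("markdown") or "").strip()
        if not body:
            continue  # nothing to extract from
        remaining = total_chars - used
        if remaining <= 0:
            break  # overall budget spent
        body = body[: min(per_page_chars, remaining)]
        title = page.get("title") or "Untitled"
        sections.append(f"## {title}\n{body}")
        used += len(body)
    return "\n\n".join(sections)


# Made-up sample pages for an offline check.
pages = [
    {"title": "A", "markdown": "x" * 10},
    {"title": "B", "error": "timeout"},
    {"title": "C", "markdown": ""},
    {"title": None, "markdown": "y" * 10},
]
context = build_context(pages, per_page_chars=6, total_chars=8)
print(context)
```

Character counts are only a rough proxy for tokens; a tokenizer-based budget would be more precise if the model's context limit is tight.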

Path 3: Agentic loop — search, decide, search again

Instructor's structured output makes control flow easy: extract a decision model and branch on its boolean fields. Here the agent keeps searching until it has enough to answer:

class SearchDecision(BaseModel):
    has_enough_info: bool
    next_query: Optional[str] = Field(default=None, description="Follow-up query if more info is needed")
    answer: Optional[str] = Field(default=None, description="Final answer once enough info is gathered")

def search_web(query: str, limit: int = 5) -> str:
    r = requests.get(
        "https://superhighway.walls.sh/search",
        params={"q": query, "limit": limit},
        headers={"Authorization": f"Bearer {os.environ['SUPERHIGHWAY_API_KEY']}"},
        timeout=10,
    )
    r.raise_for_status()  # surface auth or quota errors before parsing the body
    results = r.json()["results"]
    return "\n\n".join(f"{x['title']}\n{x['description']}" for x in results)

question = "Which AI agent framework added native MCP support most recently?"
query = question
gathered = ""

for _ in range(4):  # cap the loop
    gathered += "\n\n" + search_web(query)
    decision = client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=SearchDecision,
        messages=[{
            "role": "user",
            "content": f"Question: {question}\n\nWhat we know so far:{gathered}\n\n"
                       f"Do you have enough to answer? If not, give the next search query.",
        }],
    )
    if decision.has_enough_info:
        print(decision.answer)
        break
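    query = decision.next_query or question
else:
    # The loop hit its cap without a confident answer
    print("No answer found within the search budget.")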

The typed has_enough_info boolean drives the loop — no brittle string parsing of the model's reply. Each iteration grounds the next decision in fresh Superhighway results.
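
One way to make the loop testable offline is to pass the search and decision steps in as functions. The sketch below assumes that shape: run_search_loop, Decision, and the two stubs are hypothetical names, and a dataclass stands in for the Pydantic SearchDecision model to keep the example dependency-free. It also stops early if the model proposes no new query or repeats one it has already tried.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Decision:
    """Stand-in for the SearchDecision model above."""
    has_enough_info: bool
    next_query: Optional[str] = None
    answer: Optional[str] = None


def run_search_loop(
    question: str,
    search: Callable[[str], str],
    decide: Callable[[str, str], Decision],
    max_rounds: int = 4,
) -> Optional[str]:
    """Search, ask for a decision, and repeat until answered or out of rounds."""
    query = question
    seen = set()
    gathered: List[str] = []
    for _ in range(max_rounds):
        seen.add(query)
        gathered.append(search(query))
        decision = decide(question, "\n\n".join(gathered))
        if decision.has_enough_info and decision.answer:
            return decision.answer
        next_query = (decision.next_query or "").strip()
        if not next_query or next_query in seen:
            break  # no new query to try, so stop instead of repeating a search
        query = next_query
    return None


# Hypothetical stubs in place of Superhighway and the LLM.
searched = []


def fake_search(q: str) -> str:
    searched.append(q)
    return f"results for {q}"


def fake_decide(question: str, context: str) -> Decision:
    if "follow-up" in context:
        return Decision(True, answer="Framework X")
    return Decision(False, next_query="follow-up")


answer = run_search_loop("Which framework?", fake_search, fake_decide)
print(answer, searched)
```

In real use, search would be the search_web function above and decide would wrap the client.chat.completions.create call with response_model=SearchDecision.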

Why the pairing works

Complementary, not overlapping — Superhighway supplies live web content; Instructor enforces a typed schema on top of it. Retrieval + extraction.
Self-healing output — Instructor retries on validation failure, so a single mis-formatted LLM response doesn't break your pipeline.
Any backend — works with every Instructor-supported provider: OpenAI, Anthropic, Gemini, Mistral, Cohere.
Less prompt engineering — /research already synthesizes pages, so Instructor's model only has to shape the answer, not find it.

Which Superhighway endpoints to use

Endpoint	Feeds Instructor	Price (per call)
`/search`	Titles + snippets — fast, broad context	$0.001
`/news`	Recent items with dates — extract timelines/events	$0.001
`/scrape`	One page as markdown — extract from a known URL	$0.002
`/research`	Top pages read + synthesized — richest extraction	$0.005
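
For budgeting, the prices above can be turned into a quick estimate. This sketch assumes the listed prices are in US dollars per call and covers Superhighway fees only, not LLM token costs; estimate_cost and the call counts are hypothetical.

```python
from decimal import Decimal
from typing import Dict

# Prices copied from the table above, assumed to be USD per call.
PRICE_PER_CALL = {
    "/search": Decimal("0.001"),
    "/news": Decimal("0.001"),
    "/scrape": Decimal("0.002"),
    "/research": Decimal("0.005"),
}


def estimate_cost(calls: Dict[str, int]) -> Decimal:
    """Sum price * count for each endpoint; Decimal avoids float rounding drift."""
    return sum(
        (PRICE_PER_CALL[endpoint] * count for endpoint, count in calls.items()),
        Decimal("0"),
    )


# Path 3 at its cap of four rounds makes at most four /search calls.
worst_case_loop = estimate_cost({"/search": 4})
# A hypothetical mixed workload: ten searches plus two research calls.
mixed = estimate_cost({"/search": 10, "/research": 2})
print(worst_case_loop, mixed)
```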

Full API reference: /openapi.json. Prefer autonomous pay-per-call with no API key? Superhighway speaks x402.