Extract structured data from web search with Instructor
Instructor gives you typed, validated output from any LLM — you define a Pydantic model and get it back, retried automatically if the model mis-formats. But Instructor only shapes what the LLM already knows. Pair it with Superhighway and you get the missing half: live web content to extract from. Search for fresh facts, then extract a clean schema.
The split: retrieval vs. extraction
| | Superhighway | Instructor |
|---|---|---|
| Job | Bring in live web content | Turn text into a typed Pydantic model |
| Output | Search results, page markdown, synthesized research | Validated BaseModel instances |
| Without the other | Raw text the agent must parse by hand | Clean schema — but stale, from training data |
Together: search → extract. Superhighway fetches what's true today; Instructor hands you a typed object you can trust.
Setup
pip install instructor openai requests
export SUPERHIGHWAY_API_KEY="your_key" # free at superhighway.walls.sh/pricing
export OPENAI_API_KEY="your_key"
Instructor wraps any supported client: OpenAI, Anthropic, Gemini, Mistral, Cohere. The examples below use OpenAI. To switch providers, swap instructor.from_openai(OpenAI()) for instructor.from_anthropic(Anthropic()) and change the model name; the call shape stays the same, though Anthropic also requires a max_tokens argument.
Path 1: Search → extract (the basic pattern)
Call /search for live results, concatenate them into context, and let Instructor extract a typed model:
import os

import instructor
import requests
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List

client = instructor.from_openai(OpenAI())

class SearchSummary(BaseModel):
    topic: str
    key_findings: List[str] = Field(description="Main findings from the search results")
    summary: str

# Step 1: search the live web with Superhighway
resp = requests.get(
    "https://superhighway.walls.sh/search",
    params={"q": "latest AI agent frameworks 2025", "limit": 5},
    headers={"Authorization": f"Bearer {os.environ['SUPERHIGHWAY_API_KEY']}"},
    timeout=10,
)
resp.raise_for_status()  # fail loudly on HTTP errors instead of on a missing key below
results = resp.json()["results"]
context = "\n\n".join(f"{r['title']}\n{r['description']}" for r in results)

# Step 2: extract a typed model with Instructor
summary = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=SearchSummary,
    messages=[
        {"role": "user", "content": f"Extract key findings from these search results:\n\n{context}"}
    ],
)
print(summary.key_findings)  # List[str], validated against SearchSummary
If the model returns malformed JSON or a field of the wrong type, Instructor catches the Pydantic ValidationError and re-prompts the model with the error — so you get clean, typed data even when the LLM mis-formats once.
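The validate-and-re-prompt cycle described above can be sketched without any network calls. This is an illustration of the idea only, not Instructor's implementation: `extract_with_retries` is an illustrative name, and `fake_llm` is a hypothetical stand-in that returns a bad payload first and a corrected one once it sees the error in the prompt.

```python
import json
from typing import Callable, List

from pydantic import BaseModel, ValidationError


class SearchSummary(BaseModel):
    topic: str
    key_findings: List[str]
    summary: str


def extract_with_retries(
    ask: Callable[[str], str], prompt: str, max_retries: int = 2
) -> SearchSummary:
    """Validate the reply; on failure, append the error to the prompt and ask again."""
    attempt_prompt = prompt
    last_error: Exception = RuntimeError("no attempts made")
    for _ in range(max_retries + 1):
        raw = ask(attempt_prompt)
        try:
            return SearchSummary(**json.loads(raw))
        except (ValidationError, ValueError, TypeError) as err:
            # Covers schema violations, invalid JSON, and non-object JSON.
            last_error = err
            attempt_prompt = (
                f"{prompt}\n\nYour previous reply was invalid:\n{err}\n"
                "Return corrected JSON."
            )
    raise last_error


def fake_llm(prompt: str) -> str:
    """Hypothetical model: wrong type first, valid JSON after seeing the error."""
    if "previous reply was invalid" in prompt:
        return json.dumps({"topic": "agents", "key_findings": ["a", "b"], "summary": "ok"})
    return json.dumps({"topic": "agents", "key_findings": "not a list", "summary": "ok"})


result = extract_with_retries(fake_llm, "Extract key findings")
print(result.key_findings)
```

With Instructor itself you do not write this loop; the number of attempts is controlled by the `max_retries` argument to `create`.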
Path 2: Research → extract (deeper, page-level content)
/search returns titles and snippets. For richer extraction, use /research, which reads the top result pages and synthesizes them — then hand that to Instructor for a structured schema:
from typing import Optional

class CompanyProfile(BaseModel):
    name: str
    products: List[str]
    founded_year: Optional[int] = None
    key_facts: List[str]

resp = requests.get(
    "https://superhighway.walls.sh/research",
    params={"q": "Anthropic AI company overview", "pages": 3},
    headers={"Authorization": f"Bearer {os.environ['SUPERHIGHWAY_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()  # surface HTTP errors before touching the JSON body
research = resp.json()

# /research already read the top pages; feed the page content to Instructor
context = "\n\n".join(
    f"## {p['title']}\n{p['markdown'][:3000]}"  # cap each page at 3,000 characters
    for p in research["pages"]
    if not p.get("error")  # skip pages that reported an error
)

profile = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=CompanyProfile,
    messages=[{"role": "user", "content": f"Extract a company profile:\n\n{context}"}],
)
print(profile.products)      # List[str]
print(profile.founded_year)  # Optional[int]; None when the sources don't say
Because /research reads real pages rather than just titles, the extracted model is grounded in full source text, not snippets. The Optional[int] on founded_year lets the model leave a field empty when the sources don't say — Instructor validates that too.
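The example above caps each page at 3,000 characters but puts no limit on the combined context. A minimal sketch of a budget-aware variant follows. The helper name `build_context` and the `total_chars` default are assumptions for illustration; the `title`, `markdown`, and `error` fields are the ones used in the example above.

```python
from typing import Dict, List


def build_context(
    pages: List[Dict], per_page_chars: int = 3000, total_chars: int = 9000
) -> str:
    """Join page markdown into one prompt context, skipping failed or empty pages."""
    sections: List[str] = []
    used = 0  # counts section text only, not the blank-line separators
    for page in pages:
        if page.get("error"):
            continue  # same filter as the comprehension in the example
        body = (page.get("markdown") or "").strip()
        if not body:
            continue  # nothing to extract from
        title = page.get("title") or "Untitled"
        section = f"## {title}\n{body[:per_page_chars]}"
        if used + len(section) > total_chars:
            break  # stop before exceeding the overall budget
        sections.append(section)
        used += len(section)
    return "\n\n".join(sections)


sample_pages = [
    {"title": "A", "markdown": "x" * 5000},
    {"title": "B", "error": "timeout", "markdown": ""},
    {"markdown": "hello"},
]
print(build_context(sample_pages, per_page_chars=10, total_chars=100))
```

Truncating by characters is crude; if the model's context window is the real constraint, a token count would be a better budget.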
Path 3: Agentic loop — search, decide, search again
Instructor's structured output makes control flow easy: extract a decision model and branch on its boolean fields. Here the agent keeps searching until it has enough to answer:
class SearchDecision(BaseModel):
    has_enough_info: bool
    next_query: Optional[str] = Field(default=None, description="Follow-up query if more info is needed")
    answer: Optional[str] = Field(default=None, description="Final answer once enough info is gathered")

def search_web(query: str, limit: int = 5) -> str:
    """Run one /search call and return titles + snippets as plain text."""
    r = requests.get(
        "https://superhighway.walls.sh/search",
        params={"q": query, "limit": limit},
        headers={"Authorization": f"Bearer {os.environ['SUPERHIGHWAY_API_KEY']}"},
        timeout=10,
    )
    r.raise_for_status()
    results = r.json()["results"]
    return "\n\n".join(f"{x['title']}\n{x['description']}" for x in results)

question = "Which AI agent framework added native MCP support most recently?"
query = question
gathered = ""

for _ in range(4):  # cap the loop
    gathered += "\n\n" + search_web(query)
    decision = client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=SearchDecision,
        messages=[{
            "role": "user",
            "content": f"Question: {question}\n\nWhat we know so far:{gathered}\n\n"
                       f"Do you have enough to answer? If not, give the next search query.",
        }],
    )
    if decision.has_enough_info:
        print(decision.answer)
        break
    query = decision.next_query or question  # fall back to the original question
else:
    # Runs only if the loop never hit `break`
    print("No confident answer within 4 searches.")
The typed has_enough_info boolean drives the loop — no brittle string parsing of the model's reply. Each iteration grounds the next decision in fresh Superhighway results.
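To exercise the loop's control flow without spending API calls, the search and decision steps can be passed in as functions. The sketch below is one way to do that, under stated assumptions: `run_search_loop` is an illustrative name, `Decision` is a plain dataclass standing in for SearchDecision so the code runs offline, and the repeated-query check is an addition that is not in the loop above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Decision:
    """Stand-in with the same fields as SearchDecision."""
    has_enough_info: bool
    next_query: Optional[str] = None
    answer: Optional[str] = None


def run_search_loop(
    question: str,
    search: Callable[[str], str],
    decide: Callable[[str, str], Decision],
    max_rounds: int = 4,
) -> Optional[str]:
    """Search, ask for a decision, and repeat until answered or the cap is hit."""
    query = question
    gathered = ""
    seen = set()
    for _ in range(max_rounds):
        if query in seen:
            break  # the same query came back; stop rather than pay for it again
        seen.add(query)
        gathered += "\n\n" + search(query)
        decision = decide(question, gathered)
        if decision.has_enough_info:
            return decision.answer
        query = decision.next_query or question
    return None  # no answer within the budget; the caller decides what to do


# Hypothetical fakes: the answer appears only after the follow-up search.
searched: List[str] = []

def fake_search(q: str) -> str:
    searched.append(q)
    return f"results for {q}"

def fake_decide(question: str, gathered: str) -> Decision:
    if "results for follow-up" in gathered:
        return Decision(True, answer="Framework X")
    return Decision(False, next_query="follow-up")

print(run_search_loop("Q?", fake_search, fake_decide))
```

In real use, `search` would be search_web and `decide` would wrap the Instructor call that returns a SearchDecision.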
Why the pairing works
- Complementary, not overlapping: Superhighway supplies live web content; Instructor enforces a typed schema on top of it. Retrieval + extraction.
- Self-healing output: Instructor retries on validation failure, so a single mis-formatted LLM response doesn't break your pipeline.
- Any backend: works with every Instructor-supported provider, including OpenAI, Anthropic, Gemini, Mistral, and Cohere.
- Less prompt engineering: /research already synthesizes pages, so Instructor's model only has to shape the answer, not find it.
Which Superhighway endpoints to use
| Endpoint | Feeds Instructor | Price |
|---|---|---|
| /search | Titles + snippets (fast, broad context) | $0.001 |
| /news | Recent items with dates (extract timelines/events) | $0.001 |
| /scrape | One page as markdown (extract from a known URL) | $0.002 |
| /research | Top pages read + synthesized (richest extraction) | $0.005 |
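For a rough budget, the prices in the table can be multiplied out per pipeline. The sketch below assumes each price is in USD per call, which the table does not state explicitly, and `estimate_cost` is an illustrative helper name. It covers Superhighway calls only and ignores LLM token costs.

```python
from decimal import Decimal
from typing import Dict

# Prices copied from the table; "USD per call" is an assumption.
PRICE_PER_CALL: Dict[str, Decimal] = {
    "/search": Decimal("0.001"),
    "/news": Decimal("0.001"),
    "/scrape": Decimal("0.002"),
    "/research": Decimal("0.005"),
}


def estimate_cost(calls: Dict[str, int]) -> Decimal:
    """Sum price * count per endpoint; an unknown endpoint raises KeyError."""
    return sum(
        (PRICE_PER_CALL[endpoint] * count for endpoint, count in calls.items()),
        Decimal("0"),
    )


# Path 3 at its cap of four /search calls, versus one /research call (Path 2).
print(estimate_cost({"/search": 4}))
print(estimate_cost({"/research": 1}))
```

Decimal avoids the float rounding noise that shows up when summing many sub-cent amounts.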
Full API reference: /openapi.json. Prefer autonomous pay-per-call with no API key? Superhighway speaks x402. Get a free API key.