Build a drug discovery research agent
Drug discovery intelligence — pipeline status, mechanism of action, clinical trial results, competitive landscape, patent expiry — is some of the most expensive data in the life sciences. Evaluate Pharma and Cortellis charge five and six figures a year for it. But most of the underlying signal is public: ClinicalTrials.gov registrations, FDA approval records, PubMed abstracts, bioRxiv preprints, SEC filings, and company press releases. This guide builds a Python agent that mines those public sources and assembles a structured drug discovery brief. It chains all four Superhighway endpoints — /research for the therapeutic landscape and mechanism, /search for trial registrations and regulatory filings, /scrape for trial endpoints and results, and /news for recent trial readouts and FDA decisions — then uses an LLM to emit a structured drug discovery brief as JSON.
This guide is for research and informational purposes only. It retrieves publicly available scientific and regulatory information. It is not medical advice and not investment advice. Always consult qualified healthcare professionals for medical decisions and licensed financial advisors for investment decisions.
1. What you'll build
A Python agent that takes a drug name, molecular target, disease area, or biotech company and produces a structured drug discovery intelligence brief:
- Synthesizes the therapeutic landscape: mechanism of action, development status, clinical evidence
- Finds clinical trial registrations, FDA submissions, and PubMed research
- Scrapes trial details, endpoints, and results from ClinicalTrials.gov and regulatory pages
- Pulls recent pipeline news: trial readouts, FDA decisions, drug approvals, clinical holds, M&A
- Uses an LLM to generate a structured brief — mechanism, development stage, key trials, competitive landscape, regulatory status, and patent timeline as JSON
2. Setup
pip install openai requests python-dotenv
Create a .env file with your two keys:
SUPERHIGHWAY_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
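It can help to fail fast when either key is missing, since an unset variable otherwise only surfaces later as an authorization error. A minimal sketch, assuming the two variable names above; `missing_keys` and `REQUIRED_KEYS` are hypothetical names for illustration, not part of either SDK.

```python
import os

# The two variables the guide's .env file defines.
REQUIRED_KEYS = ("SUPERHIGHWAY_API_KEY", "OPENAI_API_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS) -> list:
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]
```

Calling `missing_keys()` after the .env file has been loaded returns an empty list when both keys are present, so the agent can exit with a clear message otherwise.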
3. Research the therapeutic landscape
Start with /research, which pulls multi-source background into one synthesis — how the drug or target works, where it sits in development, what the clinical evidence shows, and who the competitors are. This is the grounding context for every later step.
import json
import os

import requests
from dotenv import load_dotenv

load_dotenv()  # read SUPERHIGHWAY_API_KEY and OPENAI_API_KEY from .env

SUPERHIGHWAY_KEY = os.getenv("SUPERHIGHWAY_API_KEY")
BASE = "https://superhighway.walls.sh"


def research_landscape(drug_or_target: str) -> str:
    """Deep synthesis: mechanism, clinical evidence, safety, competitive landscape."""
    r = requests.get(
        f"{BASE}/research",
        params={
            "q": f"{drug_or_target} mechanism action clinical trial efficacy safety therapeutic",
            "pages": 6
        },
        headers={"Authorization": f"Bearer {SUPERHIGHWAY_KEY}"}
    )
    data = r.json()
    # Prefer the synthesis, fall back to raw markdown, and cap the length for the prompt.
    return (data.get("synthesis") or data.get("markdown") or "")[:3000]
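Every step in this agent depends on a network call, so a transient failure in one of them can abort the whole run. One way to soften that is a small retry wrapper with exponential backoff. This is a sketch under stated assumptions, not something the Superhighway API prescribes: `with_retries` is a hypothetical helper, and the retry count and delays are arbitrary defaults.

```python
import time


def with_retries(fn, *args, retries=3, backoff=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs); on an exception, wait and try again.

    The wait doubles after each failed attempt (backoff, 2*backoff, ...).
    The last exception is re-raised once the attempts are used up.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(backoff * 2 ** attempt)
```

Usage would look like `landscape = with_retries(research_landscape, "semaglutide")`. The wrapper only retries when the wrapped function raises, which covers connection errors and unparseable responses; HTTP error statuses are retried only if the function raises on them.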
4. Find clinical trials and regulatory data
Two /search calls do the legwork. The first hunts for ClinicalTrials.gov registrations and trial results — NCT IDs, phase, endpoints; the second goes after FDA approval history and the competitive pipeline. Together they surface the candidate pages the agent will scrape for hard data.
def find_clinical_trials(drug_or_target: str) -> list[dict]:
    """Find ClinicalTrials.gov registrations, phases, and trial results."""
    r = requests.get(
        f"{BASE}/search",
        params={
            "q": f"{drug_or_target} ClinicalTrials.gov phase 1 2 3 clinical trial endpoint results",
            "limit": 6
        },
        headers={"Authorization": f"Bearer {SUPERHIGHWAY_KEY}"}
    )
    return r.json().get("results", [])


def find_regulatory_data(drug_or_target: str) -> list[dict]:
    """Find FDA approval history, NDA/BLA submissions, and competitive pipeline."""
    r = requests.get(
        f"{BASE}/search",
        params={
            "q": f"{drug_or_target} FDA approval NDA BLA regulatory submission pipeline competitive",
            "limit": 6
        },
        headers={"Authorization": f"Bearer {SUPERHIGHWAY_KEY}"}
    )
    return r.json().get("results", [])
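Because only a handful of pages get scraped, the order of the combined results matters. A possible refinement is to drop duplicate URLs and move registry and regulator pages to the front. The sketch below assumes each result is a dict with a `url` key; the domain list and the helper names `domain_rank` and `prioritize_results` are illustrative choices, not part of the API.

```python
from urllib.parse import urlparse

# Illustrative priority order: trial registry first, then regulator, then literature.
PRIORITY_DOMAINS = ("clinicaltrials.gov", "fda.gov", "pubmed.ncbi.nlm.nih.gov")


def domain_rank(url: str) -> int:
    """Lower rank means higher priority; unknown domains sort last."""
    host = (urlparse(url).hostname or "").lower()
    for rank, domain in enumerate(PRIORITY_DOMAINS):
        if host == domain or host.endswith("." + domain):
            return rank
    return len(PRIORITY_DOMAINS)


def prioritize_results(results: list) -> list:
    """Remove results with missing or repeated URLs, then sort by domain priority."""
    seen = set()
    unique = []
    for result in results:
        url = result.get("url")
        if url and url not in seen:
            seen.add(url)
            unique.append(result)
    # sorted() is stable, so results within one domain keep their search order.
    return sorted(unique, key=lambda r: domain_rank(r["url"]))
```

Passing `find_clinical_trials(query) + find_regulatory_data(query)` through `prioritize_results` before scraping means the page budget is spent on primary sources first.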
5. Scrape trial and regulatory pages
/scrape turns each candidate URL into clean, LLM-ready markdown. This is where the agent pulls actual trial endpoints, dosing, primary outcomes, and results out of ClinicalTrials.gov entries, FDA documents, and company pipeline pages.
def scrape_trial_page(url: str) -> dict:
    """Scrape a trial registration, FDA page, or pipeline page for endpoints and results."""
    r = requests.get(
        f"{BASE}/scrape",
        params={"url": url},
        headers={"Authorization": f"Bearer {SUPERHIGHWAY_KEY}"}
    )
    data = r.json()
    return {
        "url": url,
        "title": data.get("title", ""),
        "content": data.get("markdown", "")[:2500],
    }
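ClinicalTrials.gov identifiers have a fixed shape, the letters NCT followed by eight digits, so they can be pulled out of the scraped text with a regular expression. A minimal sketch, assuming pages shaped like the dicts returned by `scrape_trial_page`; `extract_nct_ids` is a hypothetical helper.

```python
import re

# "NCT" followed by exactly eight digits, bounded so longer digit runs do not match.
NCT_ID = re.compile(r"\bNCT\d{8}\b")


def extract_nct_ids(pages: list) -> dict:
    """Map each NCT ID to the URLs of the scraped pages that mention it."""
    found = {}
    for page in pages:
        text = f"{page.get('title', '')} {page.get('content', '')}"
        for nct_id in sorted(set(NCT_ID.findall(text))):
            found.setdefault(nct_id, []).append(page.get("url", ""))
    return found
```

The resulting map gives a source-backed list of trial IDs. Any NCT ID that later appears in the generated brief but not in this map deserves a manual check.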
6. Get recent pipeline news
/news surfaces what just happened — trial readouts, FDA decisions, drug approvals, clinical holds, licensing deals, and pharma M&A. This is the time-sensitive layer that a static literature scan misses.
def get_pipeline_news(drug_or_target: str) -> list[dict]:
    """Get recent trial results, FDA decisions, approvals, and pipeline updates."""
    r = requests.get(
        f"{BASE}/news",
        params={
            "q": f"{drug_or_target} clinical trial results FDA approval pipeline biotech pharma",
            "count": 6
        },
        headers={"Authorization": f"Bearer {SUPERHIGHWAY_KEY}"}
    )
    return r.json().get("articles", [])
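The same announcement is often carried by several outlets, which can crowd out distinct stories in a six-item list. A simple guard is to drop articles whose titles match after normalizing case and whitespace. This sketch assumes each article dict has a `title` key, as the brief generator later in this guide does; `dedupe_news` is a hypothetical helper.

```python
def dedupe_news(articles: list) -> list:
    """Keep the first article for each distinct title, ignoring case and extra spaces."""
    seen = set()
    unique = []
    for article in articles:
        # Normalize: casefold and collapse runs of whitespace.
        key = " ".join((article.get("title") or "").casefold().split())
        if not key or key in seen:
            continue
        seen.add(key)
        unique.append(article)
    return unique
```

Exact-title matching is deliberately conservative: reworded headlines about the same event will still pass through, which is usually preferable to discarding a distinct story.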
7. Generate the drug discovery brief with an LLM
Now hand everything to the LLM. The system prompt forbids inventing trial results or efficacy figures and bars any medical or investment recommendation — the agent summarizes what's in the sources, it does not advise. The output is structured JSON so it slots straight into a competitive landscape memo, a pipeline tracker, or a dashboard.
from openai import OpenAI

llm = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

DISCLAIMER = (
    "This summary is based on publicly available scientific and regulatory "
    "information. It is not medical advice, not investment advice, and does not "
    "replace expert scientific or clinical judgment. Drug development involves "
    "significant risk and uncertainty. Consult qualified healthcare professionals "
    "for medical decisions and licensed financial advisors for investment decisions."
)


def generate_brief(
    drug_or_target: str,
    landscape: str,
    trial_pages: list[dict],
    news: list[dict],
    context: dict
) -> dict | None:
    """Generate a structured drug discovery intelligence brief."""
    # Condense the scraped pages and headlines into short bullet lists for the prompt.
    trial_text = "\n".join(
        f"- {p['title']}: {p['content'][:400]}"
        for p in trial_pages[:5]
        if p.get("content")
    )
    news_text = "\n".join(
        f"- {n.get('title', '')} ({n.get('source', '')})"
        for n in news[:6]
    )
    query_type = context.get("query_type", "drug")
    therapeutic_area = context.get("therapeutic_area", "general")
    stage_focus = context.get("stage_focus", "all")
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"""You are a drug discovery analyst preparing a research brief.
The subject is a {query_type}; focus on the {therapeutic_area} therapeutic area and {stage_focus} development stages.
Be specific and factual. Only use information from the provided sources.
Do not invent trial results, efficacy figures, or NCT IDs — if a detail isn't in the sources,
say 'not found in sources.' Do not make medical or investment recommendations."""
            },
            {
                "role": "user",
                "content": f"""Write a drug discovery brief for: {drug_or_target}
Therapeutic Landscape:
{landscape[:2000]}
Clinical Trials & Regulatory Data:
{trial_text}
Recent Pipeline News:
{news_text}
Return JSON with:
- drug_or_target: string (the drug, target, disease area, or company researched)
- therapeutic_area: string (e.g. "oncology", "CNS", "cardiovascular", "immunology")
- mechanism_of_action: string (how it works — target, pathway, modality: small molecule / biologic / gene therapy / cell therapy)
- development_stage: "preclinical" | "phase-1" | "phase-2" | "phase-3" | "approved" | "post-market" | "discontinued" | "multiple-programs"
- clinical_evidence_summary: string (what the data shows — efficacy signals, safety profile, dose ranges)
- key_trials: list of 3-5 strings (important trials found: NCT ID, phase, indication, status, key results if available)
- competitive_landscape: string (other drugs in the same class, approved competitors, pipeline competition)
- regulatory_status: string (FDA/EMA approvals, IND/NDA/BLA status, breakthrough designation, orphan drug status)
- patent_and_exclusivity: string (patent expiry, exclusivity periods, biosimilar competition timeline if relevant)
- recent_developments: list of 3 strings (from the news — trial results, regulatory decisions, M&A, licensing deals)
- investment_considerations: string (pipeline risk, market size, competitive positioning — based on public data only)
- data_quality: "high" | "medium" | "low" (how much source material was found)"""
            }
        ],
        response_format={"type": "json_object"}
    )
    try:
        brief = json.loads(response.choices[0].message.content)
        brief["disclaimer"] = DISCLAIMER
        return brief
    except (json.JSONDecodeError, TypeError):
        # TypeError covers a response whose message content is None.
        return None
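JSON mode guarantees parseable JSON, not that the model followed the requested schema, so a light check before the brief reaches a tracker or dashboard can catch drift. The sketch below mirrors the field list and allowed values from the prompt above; `validate_brief` is a hypothetical helper and checks only shape, not factual accuracy.

```python
# Fields and allowed values copied from the prompt's "Return JSON with" list.
REQUIRED_FIELDS = (
    "drug_or_target", "therapeutic_area", "mechanism_of_action",
    "development_stage", "clinical_evidence_summary", "key_trials",
    "competitive_landscape", "regulatory_status", "patent_and_exclusivity",
    "recent_developments", "investment_considerations", "data_quality",
)
STAGES = {
    "preclinical", "phase-1", "phase-2", "phase-3",
    "approved", "post-market", "discontinued", "multiple-programs",
}
QUALITY_LEVELS = {"high", "medium", "low"}


def validate_brief(brief: dict) -> list:
    """Return a list of schema problems; an empty list means the shape looks right."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in brief]
    stage = brief.get("development_stage")
    if stage is not None and stage not in STAGES:
        problems.append(f"unexpected development_stage: {stage!r}")
    quality = brief.get("data_quality")
    if quality is not None and quality not in QUALITY_LEVELS:
        problems.append(f"unexpected data_quality: {quality!r}")
    for key in ("key_trials", "recent_developments"):
        if key in brief and not isinstance(brief[key], list):
            problems.append(f"{key} should be a list")
    return problems
```

A non-empty result is a reasonable trigger for regenerating the brief or flagging it for review rather than publishing it as is.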
8. The full research pipeline
Wire the steps together: research the landscape, find trials and regulatory data, scrape the top pages, pull news, then generate the brief. The context dict tells the agent whether you're researching a drug, a target, a disease area, or a company — and which therapeutic area and development stage to slant toward.
def research_drug_or_target(
    query: str,
    context: dict | None = None,
    max_pages: int = 5
) -> dict | None:
    """
    Run the full drug discovery research pipeline.
    context: {
        "query_type": "drug" | "target" | "disease-area" | "company",
        "therapeutic_area": "oncology" | "cns" | "cardiovascular" | "immunology" | "infectious-disease" | "general",
        "stage_focus": "all" | "early" | "late" | "approved"
    }
    """
    if context is None:
        context = {"query_type": "drug", "therapeutic_area": "general", "stage_focus": "all"}
    print(f"Researching drug/target: {query}")

    # Step 1: Landscape synthesis
    print("Synthesizing therapeutic landscape...")
    landscape = research_landscape(query)

    # Step 2: Find trials and regulatory data
    print("Finding clinical trials and regulatory data...")
    results = find_clinical_trials(query) + find_regulatory_data(query)

    # Step 3: Scrape the top trial and regulatory pages
    print(f"Scraping {min(len(results), max_pages)} pages...")
    trial_pages = []
    seen = set()
    for result in results:
        url = result.get("url")
        if not url or url in seen:
            continue
        seen.add(url)
        page = scrape_trial_page(url)
        if page["content"]:
            trial_pages.append(page)
        if len(trial_pages) >= max_pages:
            break

    # Step 4: Recent pipeline news
    print("Pulling recent pipeline news...")
    news = get_pipeline_news(query)

    # Step 5: Generate the brief
    print("Generating drug discovery brief...")
    return generate_brief(query, landscape, trial_pages, news, context)
def print_brief(brief: dict):
    if not brief:
        print("Could not generate brief.")
        return
    print(f"\n{'='*60}")
    print(f"Drug Discovery Brief: {brief.get('drug_or_target', 'Subject')}")
    print(f"Therapeutic area: {brief.get('therapeutic_area', '?')}")
    print(f"Development stage: {brief.get('development_stage', '?')}")
    print(f"Data quality: {brief.get('data_quality', '?').upper()}")
    print(f"{'='*60}")
    print(f"\nMechanism of Action:\n{brief.get('mechanism_of_action', '')}")
    print(f"\nClinical Evidence:\n{brief.get('clinical_evidence_summary', '')}")
    trials = brief.get("key_trials", [])
    if trials:
        print("\nKey Trials:")
        for t in trials:
            print(f" * {t}")
    print(f"\nCompetitive Landscape:\n{brief.get('competitive_landscape', '')}")
    print(f"\nRegulatory Status:\n{brief.get('regulatory_status', '')}")
    print(f"\nPatent & Exclusivity:\n{brief.get('patent_and_exclusivity', '')}")
    developments = brief.get("recent_developments", [])
    if developments:
        print("\nRecent Developments:")
        for d in developments:
            print(f" ! {d}")
    print(f"\nInvestment Considerations:\n{brief.get('investment_considerations', '')}")
    print(f"\n{brief.get('disclaimer', '')}")
if __name__ == "__main__":
    import sys

    # Usage: python agent.py "semaglutide GLP-1 mechanism" drug
    query = sys.argv[1] if len(sys.argv) > 1 else "semaglutide GLP-1 mechanism"
    query_type = sys.argv[2] if len(sys.argv) > 2 else "drug"
    CONTEXT = {
        "query_type": query_type,
        "therapeutic_area": "general",
        "stage_focus": "all",
    }
    brief = research_drug_or_target(query, CONTEXT, max_pages=5)
    # print_brief handles a missing brief itself, so call it unconditionally;
    # otherwise a failed run would exit without any message.
    print_brief(brief)
9. Common research topics
- Specific drugs — "semaglutide GLP-1 mechanism", "pembrolizumab PD-1 checkpoint", "lecanemab Alzheimer amyloid".
- Molecular targets — "KRAS G12C inhibitors", "CDK4/6 inhibitors breast cancer", "GLP-1 receptor agonists".
- Disease areas — "NASH/MASH drug pipeline 2025", "Alzheimer's disease therapy landscape", "obesity treatment competitive landscape".
- Biotech pipelines — "Moderna mRNA pipeline", "BioNTech oncology programs", "Recursion Pharmaceuticals AI drug discovery".
- Emerging modalities — "RNA interference therapeutics", "cell therapy CAR-T pipeline", "CRISPR gene editing clinical trials".
10. Use cases
- Pharma researcher — competitive landscape scan before starting a new program: who else is in the space?
- Biotech VC — quick pipeline assessment before a meeting with a biotech startup founder.
- Clinical trial coordinator — understand the competitive trial landscape for a new study design.
- Science journalist — background brief on a drug in the news before writing an article.
- Computational chemist — find published data on a target before designing a new compound.
11. Extending the agent
- Pipeline tracker — schedule a weekly /news?q={drug}+clinical+trial+results+FDA call to track development milestones as they break.
- Competitive sweep — batch-run for each drug in a therapeutic class and compare development_stage and key_trials side by side.
- Patent monitor — add a /search?q={drug}+patent+expiry+exclusivity+biosimilar call to track exclusivity cliffs.
- Conference prep — run before JP Morgan, ASCO, ASH, or ESMO to brief on drugs being presented.
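The competitive sweep can be sketched as a small comparison step over briefs that have already been generated. The code assumes a dict mapping each drug name to its brief (or to None when generation failed); `compare_briefs` and the `STAGE_ORDER` ranking are hypothetical, and stages outside the linear ladder, such as "discontinued" or "multiple-programs", are simply sorted last.

```python
# Linear development ladder, least to most advanced (illustrative ranking).
STAGE_ORDER = ["preclinical", "phase-1", "phase-2", "phase-3", "approved", "post-market"]


def compare_briefs(briefs: dict) -> list:
    """Build one row per subject, most advanced development stage first."""
    rows = []
    for name, brief in briefs.items():
        brief = brief or {}  # tolerate failed runs that returned None
        rows.append({
            "subject": name,
            "development_stage": brief.get("development_stage") or "unknown",
            "key_trials": list(brief.get("key_trials") or []),
        })

    def stage_key(row):
        stage = row["development_stage"]
        return STAGE_ORDER.index(stage) if stage in STAGE_ORDER else -1

    return sorted(rows, key=stage_key, reverse=True)
```

With one brief per drug in a class, for example `{name: research_drug_or_target(name) for name in drugs}`, the rows can be printed or written to a spreadsheet for a side-by-side view.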
12. Important disclaimer
This agent retrieves publicly available scientific and regulatory information from clinical trial registries, FDA records, published research, and news sources. It is not medical advice and not investment advice. Drug development involves significant risk and uncertainty — most candidates that enter clinical trials never reach approval. Always consult qualified healthcare professionals for medical decisions and licensed financial advisors for investment decisions.
13. Getting your API key
Grab a free Superhighway key at /pricing (1,000 calls/month, no credit card). For an agent that provisions its own access, skip the key entirely with x402: it pays $0.002 per call in USDC on Base — no signup, no key management. See the x402 pay-per-call guide for the wallet setup.
For related builds, the healthcare research agent applies the same four-endpoint pattern to treatments and conditions for patients and caregivers, and the financial research agent uses it on companies and markets.