Build a drug discovery research agent

Drug discovery intelligence (pipeline status, mechanism of action, clinical trial results, competitive landscape, patent expiry) is some of the most expensive data in the life sciences. Evaluate Pharma and Cortellis charge five and six figures a year for it. But most of the underlying signal is public: ClinicalTrials.gov registrations, FDA approval records, PubMed abstracts, bioRxiv preprints, SEC filings, and company press releases. This guide builds a Python agent that mines those public sources by chaining all four Superhighway endpoints: /research for the therapeutic landscape and mechanism, /search for trial registrations and regulatory filings, /scrape for trial endpoints and results, and /news for recent trial readouts and FDA decisions. An LLM then assembles the results into a structured drug discovery brief, emitted as JSON.

This guide is for research and informational purposes only. It retrieves publicly available scientific and regulatory information. It is not medical advice and not investment advice. Always consult qualified healthcare professionals for medical decisions and licensed financial advisors for investment decisions.

1. What you'll build

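A Python agent that takes a drug name, molecular target, disease area, or biotech company and produces a structured drug discovery intelligence brief covering mechanism of action, development stage, clinical evidence, key trials, competitive landscape, regulatory status, patent and exclusivity, and recent developments.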

2. Setup

pip install openai requests python-dotenv

Create a .env file with your two keys:

SUPERHIGHWAY_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
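
A missing key tends to surface late, as an authorization error several calls into a run. The sketch below checks for both variables up front. The helper name missing_keys is hypothetical, not part of Superhighway or OpenAI, and it assumes the two variable names shown above.

```python
import os

REQUIRED_KEYS = ("SUPERHIGHWAY_API_KEY", "OPENAI_API_KEY")

def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]

if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing keys: " + ", ".join(missing))
    else:
        print("Both keys are set.")
```

Run the check after the .env file has been loaded into the environment (for example with python-dotenv's load_dotenv()), and stop the agent early if the returned list is not empty.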

3. Research the therapeutic landscape

Start with /research, which pulls multi-source background into one synthesis — how the drug or target works, where it sits in development, what the clinical evidence shows, and who the competitors are. This is the grounding context for every later step.

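import json
import os

import requests
from dotenv import load_dotenv

load_dotenv()  # read SUPERHIGHWAY_API_KEY and OPENAI_API_KEY from the .env file

SUPERHIGHWAY_KEY = os.getenv("SUPERHIGHWAY_API_KEY")
BASE = "https://superhighway.walls.sh"

def research_landscape(drug_or_target: str) -> str:
    """Deep synthesis: mechanism, clinical evidence, safety, competitive landscape."""
    r = requests.get(
        f"{BASE}/research",
        params={
            "q": f"{drug_or_target} mechanism action clinical trial efficacy safety therapeutic",
            "pages": 6
        },
        headers={"Authorization": f"Bearer {SUPERHIGHWAY_KEY}"},
        timeout=120,  # requests has no default timeout; a multi-page synthesis can be slow
    )
    data = r.json()
    # Prefer the synthesis; fall back to raw markdown if it is missing or empty
    text = data.get("synthesis") or data.get("markdown") or ""
    return text[:3000]  # cap the context handed to the LLM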
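
Every later step depends on this call, so one dropped connection can sink a whole run. Below is a minimal retry sketch with exponential backoff. The helper with_retries is hypothetical and wraps any zero-argument callable; the attempt count and delays are arbitrary starting points, not Superhighway recommendations.

```python
import time

def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

Usage would look like with_retries(lambda: research_landscape("semaglutide")). The sleep parameter is injectable so the backoff can be tested without actually waiting.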

4. Find clinical trials and regulatory data

Two /search calls do the legwork. The first hunts for ClinicalTrials.gov registrations and trial results (NCT IDs, phase, endpoints); the second goes after FDA approval history and the competitive pipeline. Together they surface the candidate pages the agent will scrape for hard data.

def find_clinical_trials(drug_or_target: str) -> list[dict]:
    """Find ClinicalTrials.gov registrations, phases, and trial results."""
    r = requests.get(
        f"{BASE}/search",
        params={
            "q": f"{drug_or_target} ClinicalTrials.gov phase 1 2 3 clinical trial endpoint results",
            "limit": 6
        },
        headers={"Authorization": f"Bearer {SUPERHIGHWAY_KEY}"},
        timeout=30,  # requests has no default timeout; avoid hanging on a stalled call
    )
    return r.json().get("results", [])

def find_regulatory_data(drug_or_target: str) -> list[dict]:
    """Find FDA approval history, NDA/BLA submissions, and competitive pipeline."""
    r = requests.get(
        f"{BASE}/search",
        params={
            "q": f"{drug_or_target} FDA approval NDA BLA regulatory submission pipeline competitive",
            "limit": 6
        },
        headers={"Authorization": f"Bearer {SUPERHIGHWAY_KEY}"},
        timeout=30,  # requests has no default timeout; avoid hanging on a stalled call
    )
    return r.json().get("results", [])
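
Search results mix primary sources with commentary, and the scrape budget is small. The sketch below pulls NCT IDs out of any text and moves registry and regulator pages to the front of the list. The domain list is an assumption about which hosts count as primary, and the code relies only on the "url" field that the pipeline reads later in this guide.

```python
import re
from urllib.parse import urlparse

# ClinicalTrials.gov identifiers are "NCT" followed by eight digits
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")
PRIMARY_DOMAINS = ("clinicaltrials.gov", "fda.gov", "nih.gov")  # assumed priority list

def extract_nct_ids(text):
    """Return unique NCT IDs in order of first appearance."""
    return list(dict.fromkeys(NCT_PATTERN.findall(text)))

def prioritize(results):
    """Stable sort that puts results hosted on primary sources first."""
    def rank(result):
        host = urlparse(result.get("url") or "").netloc.lower()
        is_primary = any(host == d or host.endswith("." + d) for d in PRIMARY_DOMAINS)
        return 0 if is_primary else 1
    return sorted(results, key=rank)
```

Because sorted is stable, results keep their original search ranking within each group.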

5. Scrape trial and regulatory pages

/scrape turns each candidate URL into clean, LLM-ready markdown. This is where the agent pulls actual trial endpoints, dosing, primary outcomes, and results out of ClinicalTrials.gov entries, FDA documents, and company pipeline pages.

def scrape_trial_page(url: str) -> dict:
    """Scrape a trial registration, FDA page, or pipeline page for endpoints and results."""
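    r = requests.get(
        f"{BASE}/scrape",
        params={"url": url},
        headers={"Authorization": f"Bearer {SUPERHIGHWAY_KEY}"},
        timeout=60,  # requests has no default timeout; avoid hanging on a slow page
    )
    data = r.json()
    return {
        "url": url,
        "title": data.get("title") or "",
        # "or" guards against an explicit null, which would break the slice
        "content": (data.get("markdown") or "")[:2500],
    }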
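
The fixed 2,500-character slice can cut a page mid-sentence or mid-table row. A minimal alternative, assuming the markdown uses blank lines between paragraphs, is to back up to the nearest break. The helper truncate_at_boundary is hypothetical and not part of the /scrape response.

```python
def truncate_at_boundary(text, limit=2500):
    """Cut text to at most `limit` characters, preferring a paragraph or line break."""
    if len(text) <= limit:
        return text
    window = text[:limit]
    for sep in ("\n\n", "\n", " "):
        cut = window.rfind(sep)
        if cut > limit // 2:  # only accept a break in the second half of the window
            return window[:cut].rstrip()
    return window  # no usable break: fall back to the hard cut
```

The half-window rule keeps the function from discarding most of a page just to land on a clean break.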

6. Get recent pipeline news

/news surfaces what just happened — trial readouts, FDA decisions, drug approvals, clinical holds, licensing deals, and pharma M&A. This is the time-sensitive layer that a static literature scan misses.

def get_pipeline_news(drug_or_target: str) -> list[dict]:
    """Get recent trial results, FDA decisions, approvals, and pipeline updates."""
    r = requests.get(
        f"{BASE}/news",
        params={
            "q": f"{drug_or_target} clinical trial results FDA approval pipeline biotech pharma",
            "count": 6
        },
        headers={"Authorization": f"Bearer {SUPERHIGHWAY_KEY}"},
        timeout=30,  # requests has no default timeout; avoid hanging on a stalled call
    )
    return r.json().get("articles", [])
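
A trial readout is often carried by several outlets under near-identical headlines, which wastes the six slots passed to the LLM. The sketch below drops repeats by comparing normalized titles. It assumes each article has the "title" field the brief generator reads; whether /news already deduplicates is not something this guide confirms.

```python
import re

def dedupe_articles(articles):
    """Drop articles whose titles match once case and punctuation are normalized."""
    seen, unique = set(), []
    for article in articles:
        title = (article.get("title") or "").lower()
        key = re.sub(r"[^a-z0-9]+", " ", title).strip()
        if not key or key in seen:
            continue  # skip untitled entries and repeats
        seen.add(key)
        unique.append(article)
    return unique
```

The first occurrence wins, so the original ordering of the feed is preserved.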

7. Generate the drug discovery brief with an LLM

Now hand everything to the LLM. The system prompt forbids inventing trial results or efficacy figures and bars any medical or investment recommendation — the agent summarizes what's in the sources, it does not advise. The output is structured JSON so it slots straight into a competitive landscape memo, a pipeline tracker, or a dashboard.

from openai import OpenAI

llm = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

DISCLAIMER = (
    "This summary is based on publicly available scientific and regulatory "
    "information. It is not medical advice, not investment advice, and does not "
    "replace expert scientific or clinical judgment. Drug development involves "
    "significant risk and uncertainty. Consult qualified healthcare professionals "
    "for medical decisions and licensed financial advisors for investment decisions."
)

def generate_brief(
    drug_or_target: str,
    landscape: str,
    trial_pages: list[dict],
    news: list[dict],
    context: dict
) -> dict | None:
    """Generate a structured drug discovery intelligence brief."""

    trial_text = "\n".join(
        f"- {p['title']}: {p['content'][:400]}"
        for p in trial_pages[:5]
        if p.get("content")
    )

    news_text = "\n".join(
        f"- {n.get('title', '')} ({n.get('source', '')})"
        for n in news[:6]
    )

    query_type = context.get("query_type", "drug")
    therapeutic_area = context.get("therapeutic_area", "general")
    stage_focus = context.get("stage_focus", "all")

    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"""You are a drug discovery analyst preparing a research brief.
The subject is a {query_type}; focus on the {therapeutic_area} therapeutic area and {stage_focus} development stages.
Be specific and factual. Only use information from the provided sources.
Do not invent trial results, efficacy figures, or NCT IDs — if a detail isn't in the sources,
say 'not found in sources.' Do not make medical or investment recommendations."""
            },
            {
                "role": "user",
                "content": f"""Write a drug discovery brief for: {drug_or_target}

Therapeutic Landscape:
{landscape[:2000]}

Clinical Trials & Regulatory Data:
{trial_text}

Recent Pipeline News:
{news_text}

Return JSON with:
- drug_or_target: string (the drug, target, disease area, or company researched)
- therapeutic_area: string (e.g. "oncology", "CNS", "cardiovascular", "immunology")
- mechanism_of_action: string (how it works — target, pathway, modality: small molecule / biologic / gene therapy / cell therapy)
- development_stage: "preclinical" | "phase-1" | "phase-2" | "phase-3" | "approved" | "post-market" | "discontinued" | "multiple-programs"
- clinical_evidence_summary: string (what the data shows — efficacy signals, safety profile, dose ranges)
- key_trials: list of 3-5 strings (important trials found: NCT ID, phase, indication, status, key results if available)
- competitive_landscape: string (other drugs in the same class, approved competitors, pipeline competition)
- regulatory_status: string (FDA/EMA approvals, IND/NDA/BLA status, breakthrough designation, orphan drug status)
- patent_and_exclusivity: string (patent expiry, exclusivity periods, biosimilar competition timeline if relevant)
- recent_developments: list of 3 strings (from the news — trial results, regulatory decisions, M&A, licensing deals)
- investment_considerations: string (pipeline risk, market size, competitive positioning — based on public data only)
- data_quality: "high" | "medium" | "low" (how much source material was found)"""
            }
        ],
        response_format={"type": "json_object"}
    )

    try:
        # content can be None (e.g. on a refusal), which makes json.loads raise TypeError
        brief = json.loads(response.choices[0].message.content)
        brief["disclaimer"] = DISCLAIMER
        return brief
    except (json.JSONDecodeError, TypeError, KeyError):
        return None
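
JSON mode guarantees parseable JSON, not that the model honored the field list in the prompt. Here is a minimal check of the enumerated and list-valued fields before a brief reaches a tracker or dashboard. The helper validate_brief is hypothetical; the allowed values are copied from the prompt above.

```python
STAGES = ("preclinical", "phase-1", "phase-2", "phase-3", "approved",
          "post-market", "discontinued", "multiple-programs")
QUALITY = ("high", "medium", "low")
LIST_FIELDS = ("key_trials", "recent_developments")

def validate_brief(brief):
    """Return a list of problems; an empty list means the brief looks well-formed."""
    problems = []
    if brief.get("development_stage") not in STAGES:
        problems.append(f"unexpected development_stage: {brief.get('development_stage')!r}")
    if brief.get("data_quality") not in QUALITY:
        problems.append(f"unexpected data_quality: {brief.get('data_quality')!r}")
    for field in LIST_FIELDS:
        value = brief.get(field)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{field} should be a list of strings")
    return problems
```

What to do with a non-empty result is a design choice: log it and keep the brief, or retry the LLM call with the problems appended to the prompt.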

8. The full research pipeline

Wire the steps together: research the landscape, find trials and regulatory data, scrape the top pages, pull news, then generate the brief. The context dict tells the agent whether you're researching a drug, a target, a disease area, or a company — and which therapeutic area and development stage to slant toward.

def research_drug_or_target(
    query: str,
    context: dict | None = None,
    max_pages: int = 5
) -> dict | None:
    """
    Run the full drug discovery research pipeline.

    context: {
        "query_type": "drug" | "target" | "disease-area" | "company",
        "therapeutic_area": "oncology" | "cns" | "cardiovascular" | "immunology" | "infectious-disease" | "general",
        "stage_focus": "all" | "early" | "late" | "approved"
    }
    """
    if context is None:
        context = {"query_type": "drug", "therapeutic_area": "general", "stage_focus": "all"}

    print(f"Researching drug/target: {query}")

    # Step 1: Landscape synthesis
    print("Synthesizing therapeutic landscape...")
    landscape = research_landscape(query)

    # Step 2: Find trials and regulatory data
    print("Finding clinical trials and regulatory data...")
    results = find_clinical_trials(query) + find_regulatory_data(query)

    # Step 3: Scrape the top trial and regulatory pages
    print(f"Scraping {min(len(results), max_pages)} pages...")
    trial_pages = []
    seen = set()
    for result in results:
        url = result.get("url")
        if not url or url in seen:
            continue
        seen.add(url)
        page = scrape_trial_page(url)
        if page["content"]:
            trial_pages.append(page)
        if len(trial_pages) >= max_pages:
            break

    # Step 4: Recent pipeline news
    print("Pulling recent pipeline news...")
    news = get_pipeline_news(query)

    # Step 5: Generate the brief
    print("Generating drug discovery brief...")
    return generate_brief(query, landscape, trial_pages, news, context)

def print_brief(brief: dict):
    if not brief:
        print("Could not generate brief.")
        return

    print(f"\n{'='*60}")
    print(f"Drug Discovery Brief: {brief.get('drug_or_target', 'Subject')}")
    print(f"Therapeutic area: {brief.get('therapeutic_area', '?')}")
    print(f"Development stage: {brief.get('development_stage', '?')}")
    print(f"Data quality: {str(brief.get('data_quality', '?')).upper()}")
    print(f"{'='*60}")

    print(f"\nMechanism of Action:\n{brief.get('mechanism_of_action', '')}")
    print(f"\nClinical Evidence:\n{brief.get('clinical_evidence_summary', '')}")

    trials = brief.get("key_trials", [])
    if trials:
        print("\nKey Trials:")
        for t in trials:
            print(f"  * {t}")

    print(f"\nCompetitive Landscape:\n{brief.get('competitive_landscape', '')}")
    print(f"\nRegulatory Status:\n{brief.get('regulatory_status', '')}")
    print(f"\nPatent & Exclusivity:\n{brief.get('patent_and_exclusivity', '')}")

    developments = brief.get("recent_developments", [])
    if developments:
        print("\nRecent Developments:")
        for d in developments:
            print(f"  ! {d}")

    print(f"\nInvestment Considerations:\n{brief.get('investment_considerations', '')}")
    print(f"\n{brief.get('disclaimer', '')}")

if __name__ == "__main__":
    import sys

    # Usage: python agent.py "semaglutide GLP-1 mechanism" drug
    query = sys.argv[1] if len(sys.argv) > 1 else "semaglutide GLP-1 mechanism"
    query_type = sys.argv[2] if len(sys.argv) > 2 else "drug"

    CONTEXT = {
        "query_type": query_type,
        "therapeutic_area": "general",
        "stage_focus": "all",
    }

    brief = research_drug_or_target(query, CONTEXT, max_pages=5)
    print_brief(brief)  # print_brief reports the failure itself when brief is None
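
Since the brief is meant to slot into a pipeline tracker, one way to get there is to flatten a batch of briefs into CSV. This is a sketch only: the helper briefs_to_csv and the column choice are assumptions, and the longer narrative fields are left out to keep the rows readable.

```python
import csv
import io

# Assumed tracker columns, all taken from the brief's JSON fields
TRACKER_COLUMNS = ("drug_or_target", "therapeutic_area", "development_stage",
                   "regulatory_status", "data_quality")

def briefs_to_csv(briefs):
    """Render briefs as CSV text, one row per brief, skipping failed (None) briefs."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=TRACKER_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for brief in briefs:
        if brief:
            writer.writerow({col: brief.get(col, "") for col in TRACKER_COLUMNS})
    return buffer.getvalue()
```

A batch run could collect the result of research_drug_or_target for each query in a list, pass the list to briefs_to_csv, and write the returned text to a file.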

9. Common research topics

10. Use cases

11. Extending the agent

12. Important disclaimer

This agent retrieves publicly available scientific and regulatory information from clinical trial registries, FDA records, published research, and news sources. It is not medical advice and not investment advice. Drug development involves significant risk and uncertainty — most candidates that enter clinical trials never reach approval. Always consult qualified healthcare professionals for medical decisions and licensed financial advisors for investment decisions.

13. Getting your API key

Grab a free Superhighway key at /pricing (1,000 calls/month, no credit card). For an agent that provisions its own access, skip the key entirely with x402: it pays $0.002 per call in USDC on Base — no signup, no key management. See the x402 pay-per-call guide for the wallet setup.

For related builds, the healthcare research agent applies the same four-endpoint pattern to treatments and conditions for patients and caregivers, and the financial research agent uses it on companies and markets.