Engineering Evolution

Key architectural lessons from transforming v1.0 into the robust v2.0 pipeline.

1. Architectural Flaws & Tight Coupling

The Problem (v1.0)

The legacy database (`database.py`) became a massive god-class managing over 40 distinct tables. Agents read directly from and wrote directly to SQLite using raw SQL queries scattered throughout the codebase. Changing a schema required rewriting agent logic.

The Solution (v2.0)

v2.0 enforces a strict modular architecture. An ORM layer provides type-safe schema definitions, and agents never write raw SQL; instead they go through standardized upsert helpers in `db/data_manager.py`. The table count dropped from 40+ to ~15.
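A minimal sketch of the upsert-helper pattern, using stdlib `sqlite3` for brevity (the helper name `upsert_company` and the table schema are illustrative, not the actual `db/data_manager.py` API):

```python
import sqlite3

def upsert_company(conn: sqlite3.Connection, ticker: str, name: str) -> None:
    """Insert a company row, or update its name if the ticker already exists."""
    conn.execute(
        """
        INSERT INTO companies (ticker, name) VALUES (?, ?)
        ON CONFLICT(ticker) DO UPDATE SET name = excluded.name
        """,
        (ticker, name),
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (ticker TEXT PRIMARY KEY, name TEXT)")
upsert_company(conn, "GILD", "Gilead Sciences")
upsert_company(conn, "GILD", "Gilead Sciences, Inc.")  # updates in place, no duplicate row
rows = conn.execute("SELECT ticker, name FROM companies").fetchall()
print(rows)  # [('GILD', 'Gilead Sciences, Inc.')]
```

Centralizing writes behind helpers like this is what lets a schema change stay a one-file edit instead of a sweep through every agent.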

2. Trial Linkage & the "Zero Trial Bug"

The Problem (v1.0)

Companies that acquired pipelines via M&A (e.g., Gilead acquiring Kite Pharma) appeared to have 0 clinical trials because the trial sponsors remained under the acquired entity's name. Hard-coded regex rules failed to map the renamed sponsors consistently.

The Solution (v2.0)

v2.0 introduces a "Ticker-First" onboarding pipeline. SEC 10-K filings serve as the authoritative source, since they legally disclose all active drug programs and often cite explicit NCT IDs. NCT IDs are stable identifiers that are immune to M&A naming shifts.
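Because NCT IDs follow a fixed `NCT` + 8-digit pattern, pulling them out of filing text reduces to a regex scan. A sketch of the idea (sample text and IDs are placeholders, not real trials):

```python
import re

NCT_RE = re.compile(r"\bNCT\d{8}\b")

def extract_nct_ids(text: str) -> list[str]:
    """Return unique NCT identifiers in order of first appearance."""
    seen: list[str] = []
    for match in NCT_RE.findall(text):
        if match not in seen:
            seen.append(match)
    return seen

sample = (
    "Our lead program (NCT01234567) completed enrollment; "
    "we continue dosing in NCT01234567 and NCT07654321."
)
print(extract_nct_ids(sample))  # ['NCT01234567', 'NCT07654321']
```

Once extracted, these IDs can be joined directly against ClinicalTrials.gov records, sidestepping sponsor-name matching entirely.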

3. Semantic SEC Parsing vs. Raw Full-Text

The Problem (v1.0)

Passing an entire 10-K (50k–150k tokens) into a prompt window resulted in noisy, unfocused extractions. Models would fixate on boilerplate legal language and miss the actual clinical targets.

The Solution (v2.0)

v2.0 uses sec-parser to slice out Item 1 (Business), Item 1A (Risk Factors), and Item 7 (MD&A) before prompting. This pre-slicing cuts input token volume by ~70% and markedly improves LLM extraction recall.
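The slicing idea can be illustrated with a plain regex pass over filing text. This is a simplified stand-in, not sec-parser's actual structural parse; real 10-K heading formats vary far more than this handles:

```python
import re

# Match the Item headings of interest at the start of a line.
# "1A" must precede "1" in the alternation so it wins the match.
ITEM_RE = re.compile(r"^Item\s+(1A|1|7)\.\s", re.MULTILINE | re.IGNORECASE)

def slice_items(filing_text: str) -> dict[str, str]:
    """Split filing text into sections keyed by Item number ('1', '1A', '7')."""
    matches = list(ITEM_RE.finditer(filing_text))
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(filing_text)
        sections[m.group(1).upper()] = filing_text[m.start():end].strip()
    return sections

filing = (
    "Item 1. Business\nWe develop oncology therapeutics...\n"
    "Item 1A. Risk Factors\nClinical trials may fail...\n"
    "Item 7. MD&A\nR&D expenses increased...\n"
)
sections = slice_items(filing)
print(sorted(sections))  # ['1', '1A', '7']
```

Feeding only these three sections to the model is what removes most of the boilerplate legal language from the prompt window.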

4. GPU VRAM & Inference Constraints

The Problem (v1.0)

Initial designs called for llama3.1:70b, but a Tesla P4's 8 GB of VRAM cannot run it. Running agents without orchestration also led to simultaneous GPU requests that caused out-of-memory (OOM) errors.

The Solution (v2.0)

v2.0 uses llama3.1:8b for general extraction, llama3.2:3b for low-latency scoring, and a 4-bit quantized deepseek-r1:7b for synthesis. Explicit CrewAI schedule staggering (e.g., Crew 1 at 08:00, Crew 2 at 09:00) prevents concurrent heavy loads.
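A simplified view of the model routing and staggered timetable. The crew names, times, and `model_for` helper are illustrative, not the actual CrewAI configuration:

```python
from datetime import time

# One model per task tier, sized to fit sequentially in 8 GB of VRAM.
MODEL_FOR_TASK = {
    "extraction": "llama3.1:8b",
    "scoring": "llama3.2:3b",
    "synthesis": "deepseek-r1:7b",  # 4-bit quantized
}

# Staggered start times so only one crew holds the GPU at a time.
CREW_SCHEDULE = {
    "crew_1_extraction": time(8, 0),
    "crew_2_scoring": time(9, 0),
}

def model_for(task: str) -> str:
    """Look up the model assigned to a task tier."""
    return MODEL_FOR_TASK[task]

print(model_for("scoring"))  # llama3.2:3b
```

Time-based staggering is a blunt instrument; a GPU-level lock or request queue would serialize access more robustly, but fixed start times were sufficient here and trivially easy to reason about.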