The Contract Review Problem at Scale
A single M&A transaction at this firm might involve 800 contracts that need to be reviewed for due diligence. At 45 minutes per contract for a junior associate, that's 600 hours of associate time, or $180,000 in labor cost, for the document review phase alone. Partners were billing clients for this work, but it was creating associate burnout and limiting the firm's capacity to take on new matters.
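The arithmetic behind those figures can be checked in a few lines. The inputs come from the paragraph above; the per-hour cost is not stated in the text and is only what the two totals imply.

```python
# Figures taken from the paragraph above.
contracts = 800
minutes_per_contract = 45
labor_cost_usd = 180_000

# Total review time in hours.
review_hours = contracts * minutes_per_contract / 60

# Implied cost per associate hour (derived, not stated in the text).
implied_hourly_cost = labor_cost_usd / review_hours

print(f"Review time: {review_hours:.0f} hours")
print(f"Implied cost: ${implied_hourly_cost:.0f} per associate hour")
```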
Building a Legal-Domain AI
Generic LLMs perform poorly on contract review because they lack knowledge of the firm's specific standard positions, risk thresholds, and preferred language. We fine-tuned GPT-4o on 2,400 contracts that the firm's partners had previously reviewed and annotated, teaching the model the firm's specific risk philosophy and what constitutes a "red flag" vs. an "acceptable deviation."
The extraction pipeline uses a two-stage approach: first, a structured extraction model pulls the 47 key data points into a standardized schema; second, a risk analysis model compares each extracted clause against the firm's playbook and generates a risk rating (green/yellow/red) with a plain-language explanation of why the clause is concerning.
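The two-stage shape described above can be sketched as follows. This is an illustration only: the real stages are fine-tuned models, while this sketch substitutes regular expressions for stage 1 and simple threshold rules for stage 2. The field names, the playbook limits, and the function names are all hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


@dataclass
class RiskFinding:
    field_name: str
    value: int | None
    rating: Risk
    explanation: str


# Hypothetical playbook: a preferred limit and a hard limit per field.
PLAYBOOK = {
    "termination_notice_days": {"preferred_max": 30, "hard_max": 90},
    "payment_terms_days": {"preferred_max": 45, "hard_max": 60},
}

# Stage 1 stand-in: regexes in place of the structured extraction model.
PATTERNS = {
    "termination_notice_days": r"terminate[^.]*?(\d+)\s+days",
    "payment_terms_days": r"payable within\s+(\d+)\s+days",
}


def extract_fields(contract_text: str) -> dict[str, int | None]:
    """Stage 1: pull data points into a fixed schema (missing -> None)."""
    fields: dict[str, int | None] = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, contract_text, flags=re.IGNORECASE)
        fields[name] = int(match.group(1)) if match else None
    return fields


def rate_fields(fields: dict[str, int | None]) -> list[RiskFinding]:
    """Stage 2: compare each extracted value against the playbook."""
    findings = []
    for name, value in fields.items():
        rule = PLAYBOOK[name]
        if value is None:
            rating, why = Risk.YELLOW, "Term not found; needs a manual check."
        elif value > rule["hard_max"]:
            rating, why = Risk.RED, f"{value} days exceeds the hard limit of {rule['hard_max']}."
        elif value > rule["preferred_max"]:
            rating, why = Risk.YELLOW, f"{value} days exceeds the preferred {rule['preferred_max']}."
        else:
            rating, why = Risk.GREEN, f"{value} days is within the preferred {rule['preferred_max']}."
        findings.append(RiskFinding(name, value, rating, why))
    return findings


sample = (
    "Either party may terminate this Agreement on 120 days' written notice. "
    "Invoices are payable within 45 days of receipt."
)
findings = rate_fields(extract_fields(sample))
for finding in findings:
    print(f"{finding.field_name}: {finding.rating.value} - {finding.explanation}")
```

Keeping the stages separate means the schema produced by stage 1 can be inspected or corrected by a reviewer before any risk rating is applied.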
The Due Diligence Workflow
For M&A due diligence, the system processes an entire data room (hundreds of contracts uploaded as PDFs) and generates a consolidated due diligence report in 2 to 3 hours. The report includes a contract inventory, a risk summary sorted by severity, a comparison of key terms across all contracts, and flags for contracts that require partner-level review. Associates review the AI's output rather than reading every contract from scratch.
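As a rough sketch of how per-contract findings might be rolled up into such a report, the snippet below builds an inventory, a severity-sorted risk summary, and a partner-review list. The record layout, the file names, and the rule that any red rating triggers partner review are assumptions for illustration; the cross-contract term comparison is omitted for brevity.

```python
from collections import Counter

# Lower number = more severe, so red sorts first.
SEVERITY = {"red": 0, "yellow": 1, "green": 2}

# Hypothetical per-clause results for a small data room.
results = [
    {"contract": "supply_agreement.pdf", "clause": "indemnification", "rating": "red"},
    {"contract": "supply_agreement.pdf", "clause": "payment_terms", "rating": "green"},
    {"contract": "office_lease.pdf", "clause": "assignment", "rating": "yellow"},
    {"contract": "nda_vendor.pdf", "clause": "term", "rating": "green"},
]


def build_report(results):
    """Consolidate clause-level findings into a single report dict."""
    inventory = sorted({r["contract"] for r in results})
    risk_summary = sorted(
        results, key=lambda r: (SEVERITY[r["rating"]], r["contract"])
    )
    counts = Counter(r["rating"] for r in results)
    # Assumed rule: any red finding sends the contract to a partner.
    partner_review = sorted(
        {r["contract"] for r in results if r["rating"] == "red"}
    )
    return {
        "inventory": inventory,
        "risk_summary": risk_summary,
        "counts": dict(counts),
        "partner_review": partner_review,
    }


report = build_report(results)
for row in report["risk_summary"]:
    print(f"{row['rating']:<6} {row['contract']} ({row['clause']})")
print("Partner review:", report["partner_review"])
```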
Accuracy and Quality Control
We validated the system against 200 contracts that partners had manually reviewed, finding 96.2% agreement on key term extraction and 94.8% agreement on risk ratings. The 3.8% of key term extractions where the system and the partners disagreed were concentrated in highly negotiated, non-standard clauses, exactly the contracts that should receive human partner attention. The system flags its own uncertainty, routing low-confidence extractions for human review.
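The two mechanisms in this paragraph, measuring agreement against partner labels and routing low-confidence extractions to a human, can be sketched as below. The labels, the confidence scores, and the 0.85 threshold are made-up illustrations; the text does not say how the system computes or thresholds its confidence.

```python
def agreement_rate(model_labels, partner_labels):
    """Fraction of items where the model and the partner gave the same label."""
    if len(model_labels) != len(partner_labels):
        raise ValueError("label lists must be the same length")
    if not model_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(m == p for m, p in zip(model_labels, partner_labels))
    return matches / len(model_labels)


def route_by_confidence(extractions, threshold=0.85):
    """Split extractions into auto-accepted and human-review queues.

    The threshold is a hypothetical value, not one given in the text.
    """
    auto_accept, human_review = [], []
    for item in extractions:
        if item["confidence"] < threshold:
            human_review.append(item)
        else:
            auto_accept.append(item)
    return auto_accept, human_review


# Toy validation set: the model and partner disagree on one of four ratings.
model = ["green", "red", "yellow", "green"]
partner = ["green", "red", "red", "green"]
print(f"Agreement: {agreement_rate(model, partner):.1%}")

# Toy extractions with made-up confidence scores.
extractions = [
    {"field": "governing_law", "value": "Delaware", "confidence": 0.98},
    {"field": "liability_cap", "value": "2x fees", "confidence": 0.61},
]
auto_accept, human_review = route_by_confidence(extractions)
print("Needs human review:", [e["field"] for e in human_review])
```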