Consider a dataset: 847 amendment justifications from a single Ugandan ministry, flagged during a routine OC4IDS disclosure review. Amendments averaging 34% above original contract value, most citing "additional works" with no further specifics. Now point a large language model at that dataset.

The model finishes in eleven minutes. It classifies 60% of the justifications as suspicious: vague language, circular reasoning, missing technical basis. A three-person team took six weeks to reach the same classification the previous year.

The numbers match. The conclusions do not.

LLMs accelerate pattern detection in procurement data. They also accelerate false confidence.

The manual review team had context the model lacked. They knew which ministry handles emergency medical procurement. They knew which contractors operate in conflict-affected northern districts where scope changes are routine. They knew that "additional works" from the Roads Authority means something different than "additional works" from the Health Ministry.

The model treated identical language identically. That is precisely the problem.

The Authority Problem

This is the failure mode nobody in the AI-for-governance space is talking about, and it is the most dangerous.

A spreadsheet of red flags carries one kind of weight. "The AI flagged this" carries another. Government officials treat model output as more authoritative than analyst judgment, not less. I have watched this dynamic play out across multiple procurement oversight engagements. A human analyst flags a contract and the response is "let me check." An AI flags a contract and the response is "forward this to the Inspector General."

False positives from a human analyst get questioned. False positives from an AI get acted on.

The political cost of a false accusation generated by a machine is identical to one generated by a person. The institutional response is not.

This creates a perverse incentive structure. Oversight bodies learn that AI-generated flags move faster through bureaucratic channels. So they stop investing in contextual verification. The flags become the findings. The triage tool becomes the verdict machine. And when the inevitable wrongful accusation lands (a procurement officer flagged for corruption because the model did not understand that emergency medical procurement follows different rules), the institution blames the technology, not the process that treated model output as evidence.

Credit scoring and criminal justice faced the same inflection point. Algorithmic credit scores replaced loan officer judgment. Algorithmic risk assessments influenced sentencing decisions. In both domains, the systems were adopted faster than the institutional safeguards that should have accompanied them. Procurement oversight is walking the same path, with one difference: the accused cannot appeal to a regulatory body because no regulatory framework for algorithmic procurement oversight exists yet.

The danger is not that the model gets it wrong. The danger is that institutions treat wrong answers from a machine as more credible than wrong answers from a person.

Mitigating this requires institutional design, not technical fixes. Every AI-generated flag must carry a visible label: "Candidate for review, not a finding." Oversight bodies must be trained to treat model output as one input among many, not as a privileged source. And the humans who verify flags must have the authority and the institutional protection to dismiss them without bureaucratic penalty.

What LLMs Do Well, With a Human in the Loop

I am not arguing against using LLMs for procurement monitoring. I am arguing that every LLM output in this domain requires human verification before anyone acts on it. No exceptions. Procurement decisions affect real contractors, real budgets, and real careers. An automated false accusation carries the same political cost as a deliberate one.

Three use cases show where LLMs assist human analysts, not replace them.

Amendment Classification

Take a dataset of 847 "additional works" justifications. An LLM categorises them into genuine scope changes, suspicious cost escalation, and vague boilerplate faster than any analyst working alone. But a procurement officer still needs to review each classification against institutional context (the ministry's mandate, the contractor's history, the project's timeline) before any flag becomes an investigation.

847: contracts with amendments averaging 34% above original value (illustrative scenario)
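As a sketch of how that first pass might be wired up: the code below builds a few-shot classification prompt and wraps the answer in a triage record. The `call_llm` callable is a placeholder for whatever model API the team uses, and the category names and example justifications are illustrative, not a tested taxonomy. The status label is the point: the output stays a candidate, never a finding.

```python
# Sketch: few-shot triage of amendment justifications.
# `call_llm` is a stand-in for the team's model API; the categories
# and few-shot examples below are illustrative assumptions.

CATEGORIES = ["genuine scope change", "suspicious cost escalation", "vague boilerplate"]

FEW_SHOT = [
    ("Bridge span extended 40m after geotechnical survey; revised BoQ attached.",
     "genuine scope change"),
    ("Additional works as per agreement.", "vague boilerplate"),
]

def build_prompt(justification: str) -> str:
    """Assemble a few-shot classification prompt for one justification."""
    lines = ["Classify each procurement amendment justification into one of: "
             + ", ".join(CATEGORIES) + "."]
    for text, label in FEW_SHOT:
        lines.append(f'Justification: "{text}"\nCategory: {label}')
    lines.append(f'Justification: "{justification}"\nCategory:')
    return "\n\n".join(lines)

def triage(justification: str, call_llm) -> dict:
    """Return a flag labelled as a review candidate, never a verdict."""
    category = call_llm(build_prompt(justification)).strip()
    return {
        "justification": justification,
        "category": category if category in CATEGORIES else "unparsed",
        # The visible label argued for above travels with every flag.
        "status": "candidate for review, not a finding",
    }
```

Everything downstream of `triage` reads the status field before acting; a flag with an unparsed category goes straight to an analyst rather than into a report.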

Bidder Network Detection

Feed registration data (addresses, phone numbers, directors) into an LLM alongside a procurement analyst. The model surfaces entity-resolution connections that SQL joins miss: similar company names with character transpositions, shared beneficial owners using name variations, email domains registered on the same day. Uganda's National Procurement Portal publishes over 100,000 contracts. Manual entity resolution across that volume is not feasible. An LLM makes the initial pass possible, but a human analyst must verify every connection before it becomes evidence.

100,000+: contracts published on Uganda's National Procurement Portal (total portal volume)
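A minimal sketch of that initial entity-resolution pass, using only Python's standard library. The bidder fields and the 0.85 similarity threshold are illustrative assumptions, and every link it emits is a candidate for analyst verification, not evidence:

```python
from difflib import SequenceMatcher
from itertools import combinations

def name_similarity(a: str, b: str) -> float:
    """Fuzzy ratio that survives character transpositions exact joins miss."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_links(bidders: list[dict], threshold: float = 0.85) -> list[tuple]:
    """Pairs of bidders that look related: similar names or a shared email domain.

    Field names ("name", "email") are illustrative; every pair returned
    is a candidate for verification, not a finding of collusion.
    """
    links = []
    for a, b in combinations(bidders, 2):
        if name_similarity(a["name"], b["name"]) >= threshold:
            links.append((a["name"], b["name"], "similar name"))
        elif a["email"].split("@")[-1] == b["email"].split("@")[-1]:
            links.append((a["name"], b["name"], "shared email domain"))
    return links
```

In practice the LLM proposes the candidate pairs from messy free-text fields and a pass like this scores them; either way the output feeds an analyst's queue, not a case file.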

Completeness Scoring

Run OC4IDS field-level validation through an LLM. It identifies not just missing fields but semantically empty ones: descriptions that say "as per agreement," amounts that are placeholder zeros, dates that violate procurement logic sequences. A tender closing date after the award date is not a missing field; it is a data quality failure that standard validation scripts miss. The LLM flags candidates for review. A data specialist confirms them.
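A sketch of what those semantic checks might look like. The field names are simplified stand-ins for OC4IDS paths, and the boilerplate phrase list is illustrative, not exhaustive:

```python
from datetime import date

# Phrases that fill a field without saying anything; list is illustrative.
BOILERPLATE = {"as per agreement", "n/a", "see contract", "tbd"}

def completeness_issues(record: dict) -> list[str]:
    """Flag semantically empty fields and impossible date sequences.

    Field names are simplified stand-ins for OC4IDS paths; each issue
    is a candidate for a data specialist to confirm, not a verdict.
    """
    issues = []
    desc = (record.get("description") or "").strip().lower()
    if not desc or desc in BOILERPLATE:
        issues.append("description is missing or boilerplate")
    if record.get("amount") in (None, 0):
        issues.append("amount is zero or missing (possible placeholder)")
    closing, award = record.get("tender_close"), record.get("award_date")
    if closing and award and closing > award:
        issues.append("tender closing date is after award date")
    return issues
```

The date-sequence check is the part schema validation alone cannot do: both fields are present and well-formed, but together they are impossible.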

Where It Breaks

Three technical failure modes compound the Authority Problem described above.

Context Collapse

"Emergency procurement" in one jurisdiction covers medical supplies in active conflict zones where formal tender processes would cost lives. In another, the same phrase masks deliberate circumvention of competitive bidding. An LLM treats identical language identically. A procurement officer who has worked both contexts does not. Procurement context is institutional, not textual, the model cannot learn it from the document alone.

False Pattern Amplification

The model finds that 30% of contracts in one ministry bypass formal procurement and flags this as anomalous. A procurement officer knows: that ministry handles emergency medical supplies during disease outbreaks. The bypass rate is expected. It is policy, not corruption.
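The fix this example points to is baseline-aware flagging: compare each ministry against its own historical bypass rate rather than a global threshold. A minimal sketch, with illustrative baselines and tolerance:

```python
def anomalous_bypass(ministry: str, observed_rate: float,
                     baselines: dict, tolerance: float = 0.10) -> bool:
    """Flag a bypass rate only if it exceeds the ministry's own
    historical baseline by more than `tolerance`.

    A single global threshold would flag an emergency-procurement
    ministry every quarter; the per-ministry baseline encodes the
    policy context the model lacks. Baselines and tolerance here
    are illustrative values, not measured ones.
    """
    baseline = baselines.get(ministry, 0.0)
    return observed_rate > baseline + tolerance
```

The baselines themselves must come from the institution, not the model; maintaining them is exactly the contextual-verification work the Authority Problem tempts oversight bodies to skip.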

The model produces pattern-shaped outputs at a rate no human team can ground-truth.

This is the core risk. The volume of flags overwhelms the capacity for contextual verification. Analysts start rubber-stamping model outputs because reviewing each flag against institutional context takes longer than the original manual analysis.

Training Data Mismatch

Most LLMs train on English-language legal and business text from high-income countries. Procurement data from Kampala, Maputo, and Kaduna follows different conventions: mixed-language documents, shifting date formats, institutional abbreviations (PPDA, PPOA, BPP) that mean nothing to a model trained on Common Crawl. The model confidently parses what it does not understand. The output looks correct because it is formatted correctly. The substance is unreliable.

What I Would Build

Four design principles for AI-assisted procurement monitoring.

Triage, Not Verdicts

The LLM generates candidate flags. A procurement analyst reviews them against institutional context before any flag reaches an oversight body. The model narrows 847 contracts to 200 candidates. The analyst narrows 200 candidates to 47 investigations. The model is a filter, not a judge.

This is the triage signal framework from red flag detection work, applied to AI outputs. The principle does not change because the pattern detector got faster.
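The filter-not-judge split can be made concrete in a few lines. Here `model_flag` and `analyst_confirms` are hypothetical callables standing in for the two stages; only what survives both becomes an investigation:

```python
def triage_funnel(contracts, model_flag, analyst_confirms):
    """Two-stage funnel: the model filters, the human decides.

    `model_flag` and `analyst_confirms` are placeholders for the
    LLM pass and the analyst's contextual review respectively.
    """
    candidates = [c for c in contracts if model_flag(c)]          # 847 -> ~200
    investigations = [c for c in candidates if analyst_confirms(c)]  # ~200 -> ~47
    return candidates, investigations
```

Nothing in the first list ever reaches an oversight body directly; the second stage is where institutional context gets applied.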

Practical AI Integration

The advice most AI-for-governance articles give is "fine-tune on local data." The reality: fine-tuning an LLM requires thousands of labeled examples, GPU compute, and ML engineering skills that most procurement teams do not have. The teams who can fine-tune a model do not need this article. The teams who need this article cannot fine-tune a model.

The practical middle path: few-shot prompting with jurisdiction-specific examples. Build a retrieval-augmented generation (RAG) pipeline that feeds the LLM local procurement templates, amendment justification patterns, and codelist definitions from your specific jurisdiction. This requires a database and a prompt engineer, not a GPU cluster. The model learns what "normal" looks like in Kampala without retraining its weights.
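A sketch of the retrieval step under those assumptions. Real pipelines use embedding search; plain token overlap shows the shape of the step without extra dependencies. The corpus is whatever local reference material the team maintains (templates, guidelines, codelist definitions), and all names here are illustrative:

```python
def tokenize(text: str) -> set[str]:
    return set(text.lower().split())

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank local reference documents by token overlap with the query.

    Stand-in for embedding-based retrieval; good enough to show
    where jurisdiction-specific context enters the prompt.
    """
    scored = sorted(corpus,
                    key=lambda doc: len(tokenize(doc) & tokenize(query)),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(justification: str, corpus: list[str]) -> str:
    """Prepend local reference material so the model sees local 'normal'."""
    context = "\n".join(retrieve(justification, corpus))
    return (f"Local procurement reference material:\n{context}\n\n"
            f"Classify this amendment justification:\n{justification}")
```

The weights never change; the model's notion of "normal" is carried entirely by what the retrieval step puts in front of it, which is why the corpus, not the model, is where local expertise lives.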

For teams with more resources, fine-tuning remains the gold standard. But start with RAG. Ship something that works this quarter rather than planning a fine-tuning project that ships never.

The Operational Bridge

AI-for-governance writing rarely answers the operational question: who does what, at what cost, using what existing infrastructure?

A minimum viable deployment requires three roles: a prompt engineer who builds and maintains the RAG pipeline, a procurement analyst who reviews every flag against institutional context, and a data specialist who feeds the system clean OC4IDS-formatted data. For a pilot covering one ministry's contracts, expect one person-month of prompt engineering setup, ongoing analyst time proportional to flag volume, and cloud compute costs under $500 per month for a RAG pipeline processing several thousand documents.

This work does not start from zero. The Open Contracting Partnership's procurement analytics tools, Transparency International's procurement integrity tools, and CoST's multi-stakeholder assurance process all provide established frameworks for procurement oversight. An LLM layer sits on top of these existing processes; it does not replace them. Integration means feeding LLM output into existing review workflows, not building parallel oversight systems.

Measure What Matters

The metric that matters is not "flagged 1,000 contracts in 10 minutes." It is "of those 1,000 flags, how many survived contextual review?"

23%: of LLM-flagged amendments warranted investigation after human review (roughly 117 of the 508 flagged)

Put that in context. In the illustrative scenario, the manual team reviewed the same 847 contracts and flagged 15% (roughly 127) for investigation, with roughly 80% precision after supervisory review. The LLM flagged 60% (508 contracts) with 23% precision after human review. Among every 100 flags, that means 77 false accusations for every 23 genuine findings, a ratio that erodes institutional trust within months without human verification. The LLM cast a wider net and caught some cases the manual team missed. But it also generated roughly fifteen times the false positives: about 390 against the manual team's 25. Whether that tradeoff is worth it depends entirely on the review team's capacity.
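The survival-rate metric itself is one line of arithmetic; the numbers below are the illustrative scenario's, not measurements:

```python
def flag_survival_rate(flagged: int, confirmed: int) -> float:
    """Share of flags that survived contextual review (precision)."""
    return confirmed / flagged if flagged else 0.0

# Illustrative scenario figures, not measured results:
llm_precision = flag_survival_rate(flagged=508, confirmed=117)     # ~0.23
manual_precision = flag_survival_rate(flagged=127, confirmed=102)  # ~0.80
```

Track this per deployment, per quarter; a falling survival rate is the earliest signal that analysts have started rubber-stamping.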

If 90% of flags are false positives, the system generates noise. Auditors learn to ignore it. The tool dies the same way transparency portals die: nobody trusts it, so nobody defends it when the budget gets cut.

The Machine Reads Text

An LLM can read 100,000 contracts. It can classify amendments, detect bidder networks, and score data completeness faster than any team I have built. It cannot tell you which flags matter. That requires someone who has built the systems, trained the teams, and survived the politics of disclosure in governments that did not ask to be transparent.

The machine reads the text. The practitioner reads the room.

Playbook

Decision Table
Option | When to Use | Tradeoff
Full automation | Never for procurement oversight | Fast output, high false-positive rate, eroded trust
LLM triage + human review | Established review team with domain expertise | Slower than automation, much higher precision
Manual analysis only | Low contract volume or no fine-tuned model | High precision, doesn't scale beyond hundreds
Failure Modes
  • LLM flags treated as evidence rather than triage signals.
  • General-purpose model deployed without jurisdiction-specific training.
  • False-positive volume overwhelms review capacity.
