Custom-built agentic pipeline that reads messy vendor invoices, cross-references them against Purchase Orders and Warehouse receipts, and autonomously approves, flags, or escalates them, with a live reasoning trace and a full evaluation suite.
Every Accounts Payable (AP) team runs the same gauntlet daily: for every vendor invoice received, a clerk must manually pull the corresponding Purchase Order and Warehouse receipt, then verify that quantities, prices, line items, and totals agree across all three documents. Any mismatch (a shortage, a price increase, an unauthorized SKU, a duplicate submission) must be caught before payment is released.
This is 3-Way Matching. It is repetitive, time-consuming, and deeply error-prone at scale. It is also a textbook agentic AI use case: structured document comparison, tool-based data lookup, deterministic rules, and a final reasoning step that must produce an auditable decision.
The agent is built as a hand-rolled orchestrator in TypeScript, with no LangChain and no LangGraph. This was a deliberate choice: 3-Way Matching has a fixed, deterministic pipeline. The steps never change order, there is no dynamic tool routing, and the control flow is simple enough that a framework would add overhead without value.
Each capability lives in its own src/lib/agent/tools/ file (extract-pdf.ts, lookup-po.ts, query-wms.ts, etc.). Each is a plain async function with typed inputs and Zod-validated outputs.
orchestrator.ts is a single async function runAgent() that calls tools in sequence, emitting a TraceEvent after each step via a callback. There is no agent loop and no recursion.
All DB access goes through db/repo.ts. The same codebase runs against SQLite locally and Neon Postgres in production: one environment variable flip, zero code changes.
Each tool call emits a TraceEvent that is pushed to the browser over a Server-Sent Events stream. The UI updates in real time, with no polling and no WebSocket overhead.
// orchestrator.ts (simplified)
// The return type name AgentResult is assumed.
export async function runAgent(invoiceId: string, emit: EmitFn): Promise<AgentResult> {
  const invoice = await getInvoiceById(invoiceId)
  await checkDuplicate(invoice.invoice_number) // fast, no API
  const extracted = await extractPdf(invoice.pdf_path) // Gemini Vision
  const po = await lookupPo(extracted.po_reference)
  const wms = await queryWms(po.id)
  const vendorMatch = await fuzzyMatchVendor(extracted.vendor_name, po.vendor_name)
  const fxResult = await convertCurrency(...) // if currencies differ
  const decision = await reasonAndDecide({ extracted, po, wms, vendorMatch, fxResult })
  emit({ step: 'decide', status: 'done', detail: decision.status })
  await saveResult(invoiceId, decision)
  return buildResult(invoiceId, decision, trace)
}
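The listing above shows a single emit call, while the text says every step emits a TraceEvent. One way to get that behaviour without repeating emit calls is a small wrapper around each tool call. The sketch below is an assumption, not the project's code: the `runStep` helper, the `error` status, and the `lookup_po` step name are hypothetical, and the `TraceEvent` shape is inferred from the emit call in the listing.

```typescript
// Assumed event shape, modeled on the emit() call in the orchestrator listing.
type TraceEvent = {
  step: string
  status: 'running' | 'done' | 'error'
  detail?: string
}
type EmitFn = (event: TraceEvent) => void

// Hypothetical helper: wraps one tool call so that it always emits
// "running" before the call and "done" (or "error") after it.
async function runStep<T>(
  step: string,
  emit: EmitFn,
  tool: () => Promise<T>,
  summarize: (result: T) => string = () => 'ok',
): Promise<T> {
  emit({ step, status: 'running' })
  try {
    const result = await tool()
    emit({ step, status: 'done', detail: summarize(result) })
    return result
  } catch (err) {
    emit({ step, status: 'error', detail: err instanceof Error ? err.message : String(err) })
    throw err
  }
}

// Usage: collect the events in an array, as a trace log might.
async function demoTrace(): Promise<TraceEvent[]> {
  const trace: TraceEvent[] = []
  const emit: EmitFn = (event) => { trace.push(event) }
  await runStep('lookup_po', emit, async () => ({ id: 'PO-1' }), (po) => po.id)
  return trace
}
```

Because the wrapper rethrows after emitting, a failed tool still stops the fixed pipeline while the UI sees which step failed.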
The first tool call is the hardest: turning an unstructured invoice image into a structured JSON object. The agent sends the raw PDF or image file to Gemini 2.5 Flash as base64-encoded inline data, along with a schema-anchored extraction prompt. The response is validated with a Zod schema before any downstream tool sees it.
// extract-pdf.ts
const result = await model.generateContent([
  EXTRACTION_PROMPT, // structured output instructions
  { inlineData: { mimeType, data: base64 } } // raw file bytes
])
// Zod validates the response and rejects any hallucinated fields
return ExtractedInvoiceSchema.parse(JSON.parse(result.response.text()))
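One detail worth checking in a setup like this: a plain Zod `z.object()` strips unknown keys by default rather than rejecting them, so rejecting unexpected fields requires a strict schema (`.strict()`). The sketch below shows what that strict check amounts to, written without dependencies so it runs on its own. It is a stand-in for illustration, not the project's `ExtractedInvoiceSchema`: only `po_reference` and `vendor_name` appear in the write-up, and the other field names are assumptions.

```typescript
// Assumed shape: only po_reference and vendor_name come from the write-up;
// the remaining fields are illustrative.
type ExtractedInvoice = {
  invoice_number: string
  po_reference: string
  vendor_name: string
  currency: string
  total: number
}

const STRING_FIELDS = ['invoice_number', 'po_reference', 'vendor_name', 'currency'] as const
const NUMBER_FIELDS = ['total'] as const

// Parses raw model text and rejects missing, mistyped, or unexpected keys.
function parseExtractedInvoice(raw: string): ExtractedInvoice {
  const data: unknown = JSON.parse(raw)
  if (typeof data !== 'object' || data === null || Array.isArray(data)) {
    throw new Error('expected a JSON object')
  }
  const record = data as Record<string, unknown>
  const allowed = new Set<string>([...STRING_FIELDS, ...NUMBER_FIELDS])
  const extra = Object.keys(record).filter((key) => !allowed.has(key))
  if (extra.length > 0) throw new Error(`unexpected fields: ${extra.join(', ')}`)
  for (const key of STRING_FIELDS) {
    if (typeof record[key] !== 'string') throw new Error(`${key} must be a string`)
  }
  for (const key of NUMBER_FIELDS) {
    const value = record[key]
    if (typeof value !== 'number' || !Number.isFinite(value)) {
      throw new Error(`${key} must be a finite number`)
    }
  }
  return record as unknown as ExtractedInvoice
}
```

Failing loudly here keeps a malformed extraction from reaching the PO lookup or the rule engine.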
The 12 test scenarios cover 5 invoice variants: clean digital PDF, scanned document, phone photo, handwritten annotation, and crumpled/degraded. All 5 formats are tested against the same extraction schema to prove robustness.
After extraction, the orchestrator runs a sequence of typed tool calls. Each tool is a pure async function: no side effects, no shared state, easy to test in isolation.
Parameterized SQL queries against the Neon Postgres database. Primary lookup is by PO reference from the invoice; fallback is fuzzy vendor-name search across all open POs.
Gemini embeddings + cosine similarity. Catches vendor name drift ("ACME Corp." vs "Acme Corporation Inc.") and near-duplicate fraud ("Apex Logistics" vs "Apax Logistics Inc.").
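The comparison step in embedding-based matching is plain cosine similarity between the two name vectors. A minimal sketch follows, assuming the embeddings have already been fetched as number arrays; the `isVendorMatch` helper and its 0.85 threshold are hypothetical and would need tuning against real vendor data.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Ranges from -1 to 1.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error('vectors must be non-empty and of equal length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0 // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Hypothetical threshold check; 0.85 is an assumed value, not the project's.
function isVendorMatch(similarity: number, threshold: number = 0.85): boolean {
  return similarity >= threshold
}
```

A single threshold cannot by itself tell harmless name drift from a look-alike vendor, since both score high, so a near-match that is not an exact match is a reasonable candidate for review rather than silent approval.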
Live FX rate lookup with a 6-hour Neon cache. Only runs when the invoice currency differs from the PO currency. Downstream tools receive a converted amount for apples-to-apples price comparison.
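The caching behaviour can be sketched as a time-to-live check in front of the rate lookup. This is an illustration under assumptions: an in-memory Map stands in for the database-backed cache, the `RateFetcher` callback stands in for whatever FX API is used, and rounding to two decimals is an assumed convention.

```typescript
const SIX_HOURS_MS = 6 * 60 * 60 * 1000

type CachedRate = { rate: number; fetchedAt: number }
// Stand-in for the live FX API call (hypothetical signature).
type RateFetcher = (from: string, to: string) => Promise<number>

// In-memory stand-in for the database-backed cache.
const rateCache = new Map<string, CachedRate>()

// Returns a cached rate if it is younger than six hours, otherwise fetches
// a fresh one and stores it. `now` is injectable to make expiry testable.
async function getRate(
  from: string,
  to: string,
  fetchRate: RateFetcher,
  now: number = Date.now(),
): Promise<number> {
  if (from === to) return 1 // same currency: no lookup needed
  const key = `${from}:${to}`
  const cached = rateCache.get(key)
  if (cached && now - cached.fetchedAt < SIX_HOURS_MS) return cached.rate
  const rate = await fetchRate(from, to)
  rateCache.set(key, { rate, fetchedAt: now })
  return rate
}

// Converts an amount and rounds to two decimal places.
function convertAmount(amount: number, rate: number): number {
  return Math.round(amount * rate * 100) / 100
}
```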
Fast pre-check against match_results by invoice_number. Fires before any LLM call โ if the invoice was already processed, it is immediately flagged as DUPLICATE at near-zero cost.
The final reason_and_decide step combines two things: a deterministic rule engine that catches every known mismatch type, and a Gemini call that writes a plain-English explanation of exactly what was found and why.
| Condition detected | Decision |
|---|---|
| All quantities and prices match | APPROVED |
| WMS received qty < invoiced qty | FLAGGED: SHORTAGE |
| Invoice unit price > PO agreed price | FLAGGED: PRICE MISMATCH |
| Line item not present in PO | FLAGGED: UNAUTHORIZED ITEMS |
| Same invoice number previously processed | FLAGGED: DUPLICATE |
| Tax totals don't match line-item arithmetic | FLAGGED: TAX MISMATCH |
| Currency conversion involved | FLAGGED: FX CONVERSION |
| Vendor name mismatch beyond fuzzy threshold | FLAGGED: VENDOR MISMATCH |
| Confidence below threshold / ambiguous | ESCALATED: HUMAN REVIEW |
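The table can be read as an ordered list of checks where the first one that fires decides the outcome. The sketch below is one possible encoding, not the project's rule engine: the input shape, the order of the checks, the two thresholds, and the half-cent tolerance are all assumptions made for illustration.

```typescript
type Status = 'APPROVED' | 'FLAGGED' | 'ESCALATED'
type Reason =
  | 'DUPLICATE' | 'UNAUTHORIZED_ITEMS' | 'PRICE_MISMATCH' | 'SHORTAGE'
  | 'TAX_MISMATCH' | 'VENDOR_MISMATCH' | 'FX_CONVERSION' | 'HUMAN_REVIEW'
type Decision = { status: Status; reason: Reason | null }

type MatchLine = {
  sku: string
  invoicedQty: number
  invoicePrice: number   // unit price, already converted to the PO currency
  poPrice: number | null // null when the SKU is not on the PO
  receivedQty: number    // from the WMS receipt
}

type MatchInput = {
  alreadyProcessed: boolean
  confidence: number       // 0..1, extraction confidence
  vendorSimilarity: number // 0..1, from the fuzzy vendor match
  currencyConverted: boolean
  subtotal: number
  tax: number
  total: number
  lines: MatchLine[]
}

const MIN_CONFIDENCE = 0.7         // assumed threshold
const MIN_VENDOR_SIMILARITY = 0.85 // assumed threshold
const TOLERANCE = 0.005            // half a cent, absorbs float rounding

// Evaluates the rules in a fixed (assumed) order; the first hit wins.
function decide(input: MatchInput): Decision {
  const flag = (reason: Reason): Decision => ({ status: 'FLAGGED', reason })
  const { lines } = input
  if (input.alreadyProcessed) return flag('DUPLICATE')
  if (input.confidence < MIN_CONFIDENCE) return { status: 'ESCALATED', reason: 'HUMAN_REVIEW' }
  if (input.vendorSimilarity < MIN_VENDOR_SIMILARITY) return flag('VENDOR_MISMATCH')
  if (lines.some((l) => l.poPrice === null)) return flag('UNAUTHORIZED_ITEMS')
  if (lines.some((l) => l.poPrice !== null && l.invoicePrice > l.poPrice + TOLERANCE)) {
    return flag('PRICE_MISMATCH')
  }
  if (lines.some((l) => l.receivedQty < l.invoicedQty)) return flag('SHORTAGE')
  const lineSum = lines.reduce((sum, l) => sum + l.invoicedQty * l.invoicePrice, 0)
  const arithmeticOff =
    Math.abs(lineSum - input.subtotal) > TOLERANCE ||
    Math.abs(input.subtotal + input.tax - input.total) > TOLERANCE
  if (arithmeticOff) return flag('TAX_MISMATCH')
  if (input.currencyConverted) return flag('FX_CONVERSION')
  return { status: 'APPROVED', reason: null }
}
```

Keeping this step free of LLM calls means the status is reproducible for a given input; the model is only needed to phrase the explanation.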
The UI is a Next.js 14 App Router application. When a user clicks an invoice card, the browser opens a Server-Sent Events connection to /api/agent/stream. Every tool call in the orchestrator emits a TraceEvent that is pushed down the stream immediately, so the trace panel updates step by step as the agent works.
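On the wire, a Server-Sent Events message is a `data:` line followed by a blank line. The sketch below shows how a server might serialize one event, assuming a JSON payload with a `type` field of `step` or `result` (the two values the client code checks for); the other payload fields and the helper name are hypothetical.

```typescript
// Assumed payload shapes; only the `type` values come from the write-up.
type StreamEvent =
  | { type: 'step'; step: string; status: 'running' | 'done'; detail?: string }
  | { type: 'result'; status: string }

// Response headers commonly sent with an SSE stream.
const SSE_HEADERS = {
  'Content-Type': 'text/event-stream',
  'Cache-Control': 'no-cache',
  Connection: 'keep-alive',
}

// Serializes one event as an SSE message. JSON.stringify escapes any
// newlines inside strings, so the payload always fits on one data line.
function formatSseMessage(event: StreamEvent): string {
  return `data: ${JSON.stringify(event)}\n\n`
}
```

In a route handler, each formatted string would typically be encoded with a `TextEncoder` and enqueued into a `ReadableStream` that is returned as the response body with the headers above.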
12 cards with real PDF thumbnails, difficulty badges, and skill tags. Each card shows the agent's result as a colour dot (green/red/amber) once processed. Individual PDF download on each card.
Live log of every tool call: function name, inputs, and output summary. Steps flip from "running" to "done" in-place with animations. Matches exactly what orchestrator.ts emits.
Animated reveal of status badge, confidence bar, flag reason, and full agent reasoning. Includes a direct link to the Langfuse trace for the run.
Single button runs all 12 invoices sequentially. The ActionBar shows live approved/flagged/escalated counters updating as each result arrives.
// SSE stream consumer (client side)
const reader = res.body.getReader()
while (true) {
  const { done, value } = await reader.read()
  if (done) break
  const events = parseSSE(value)
  for (const event of events) {
    if (event.type === 'step') applyStep(event) // updates trace panel
    if (event.type === 'result') setDecision(event) // reveals decision
  }
}
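The `parseSSE` helper is not shown in the write-up. One thing such a parser has to handle is that a network chunk can end in the middle of a message, so it needs to buffer between reads. The sketch below is a guess at a minimal version: it assumes JSON payloads on `data:` lines, `\n` line endings, and a hypothetical factory name `createSseParser`.

```typescript
type StreamEvent = { type: string; [key: string]: unknown }

// Returns a stateful parser. Incomplete messages (and partially decoded
// multi-byte characters) are held back until the next chunk arrives.
function createSseParser(): (chunk: Uint8Array) => StreamEvent[] {
  const decoder = new TextDecoder()
  let buffer = ''
  return (chunk: Uint8Array): StreamEvent[] => {
    buffer += decoder.decode(chunk, { stream: true })
    const messages = buffer.split('\n\n')
    buffer = messages.pop() ?? '' // last piece is incomplete or empty
    const events: StreamEvent[] = []
    for (const message of messages) {
      const data = message
        .split('\n')
        .filter((line) => line.startsWith('data:'))
        .map((line) => line.slice(5).trimStart()) // drop "data:" and leading whitespace
        .join('\n')
      if (data) events.push(JSON.parse(data) as StreamEvent)
    }
    return events
  }
}
```

With this shape, the read loop would create the parser once before the loop and call it on every `value`, rather than parsing each chunk in isolation.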
Every AI demo should have an eval suite. This one runs automatically: hit "Run Eval" and the agent processes all 12 scenarios fresh, then computes a full report against ground-truth labels defined in data/scenarios.json.
Top-line metrics across all 12 scenarios. F1 is macro-averaged across the three classes (APPROVED / FLAGGED / ESCALATED), so each class counts equally regardless of how many scenarios it contains.
Precision, recall, and F1 broken out per class with a visual bar. Makes it easy to see whether the agent over-flags (high recall, low precision) or misses flags (high precision, low recall).
Heat-map grid of actual vs predicted class. Green diagonal = correct. Each red off-diagonal cell is a specific error type; for example, ESCALATED misclassified as APPROVED tells you the agent isn't conservative enough.
Latency percentiles across all 12 runs. p95 catches outliers: a single slow scenario (e.g. handwritten OCR) can skew the average, yet it cannot be singled out without the percentile view.
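The metrics in these panels all derive from the confusion matrix. A minimal sketch of that computation follows; the function names are hypothetical, and returning 0 for a class with no predictions or no ground-truth examples is an assumed convention.

```typescript
const CLASSES = ['APPROVED', 'FLAGGED', 'ESCALATED'] as const
type Label = (typeof CLASSES)[number]
type Matrix = Record<Label, Record<Label, number>>
type ClassMetrics = { precision: number; recall: number; f1: number }

// matrix[actual][predicted] = number of scenarios with that pairing.
function confusionMatrix(actual: Label[], predicted: Label[]): Matrix {
  if (actual.length !== predicted.length) throw new Error('length mismatch')
  const emptyRow = (): Record<Label, number> => ({ APPROVED: 0, FLAGGED: 0, ESCALATED: 0 })
  const matrix: Matrix = { APPROVED: emptyRow(), FLAGGED: emptyRow(), ESCALATED: emptyRow() }
  actual.forEach((label, i) => { matrix[label][predicted[i]] += 1 })
  return matrix
}

// Precision = TP / column total, recall = TP / row total.
function classMetrics(matrix: Matrix, label: Label): ClassMetrics {
  const truePositives = matrix[label][label]
  const predictedTotal = CLASSES.reduce((sum, a) => sum + matrix[a][label], 0)
  const actualTotal = CLASSES.reduce((sum, p) => sum + matrix[label][p], 0)
  const precision = predictedTotal === 0 ? 0 : truePositives / predictedTotal
  const recall = actualTotal === 0 ? 0 : truePositives / actualTotal
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall)
  return { precision, recall, f1 }
}

// Macro F1: the unweighted mean of the per-class F1 scores.
function macroF1(matrix: Matrix): number {
  return CLASSES.reduce((sum, label) => sum + classMetrics(matrix, label).f1, 0) / CLASSES.length
}
```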
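A short sketch of how a percentile can be computed for a small sample such as 12 runs. It uses the nearest-rank method, which is an assumption; the write-up does not say which percentile definition the report uses.

```typescript
// Nearest-rank percentile: sort ascending, then take the value at
// position ceil(p / 100 * n), counting from 1.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no values')
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]')
  const sorted = [...values].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[rank - 1]
}

function mean(values: number[]): number {
  if (values.length === 0) throw new Error('no values')
  return values.reduce((sum, v) => sum + v, 0) / values.length
}
```

For example, with eleven runs taking 1 to 11 seconds and one taking 60 seconds, the mean is 10.5 s while p50 is 6 s and p95 is 60 s, which makes the single slow run visible.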
One of the goals was to make the data behind the agent visible, to show that this isn't a black box but a system you could wire into a real supply chain. Two dedicated views provide that transparency.
Browse all Purchase Orders, WMS receipts, and processed invoices in ERP-style header/detail rows: one header per document, then one row per line item. Non-approved invoices show a warning badge on the specific impacted SKUs.
Mirrors what an Accounts Payable (AP) manager's review queue looks like. Every FLAGGED or ESCALATED invoice is shown as a card with the flag reason, the agent's full explanation, confidence score, and the exact line items responsible highlighted in red.
The impacted-SKU logic is deterministic: for PRICE_MISMATCH, it compares invoice unit price vs PO price per SKU; for SHORTAGE, invoice qty vs WMS received qty; for UNAUTHORIZED_ITEMS, SKUs present in the invoice but absent from the PO.
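The three comparisons can be sketched as a single filter over the invoice lines. The line shapes and the `impactedSkus` name below are hypothetical, and treating a SKU with no WMS record as zero received is an assumption.

```typescript
type SkuFlag = 'PRICE_MISMATCH' | 'SHORTAGE' | 'UNAUTHORIZED_ITEMS'
type InvoiceLine = { sku: string; qty: number; unitPrice: number }
type PoLine = { sku: string; unitPrice: number }
type WmsLine = { sku: string; receivedQty: number }

// Returns the invoice SKUs responsible for the given flag.
function impactedSkus(
  flag: SkuFlag,
  invoice: InvoiceLine[],
  po: PoLine[],
  wms: WmsLine[],
): string[] {
  const poPrice = new Map<string, number>(po.map((l): [string, number] => [l.sku, l.unitPrice]))
  const received = new Map<string, number>(wms.map((l): [string, number] => [l.sku, l.receivedQty]))
  const isImpacted = (line: InvoiceLine): boolean => {
    const agreed = poPrice.get(line.sku)
    if (flag === 'PRICE_MISMATCH') return agreed !== undefined && line.unitPrice > agreed
    if (flag === 'SHORTAGE') return line.qty > (received.get(line.sku) ?? 0)
    return agreed === undefined // UNAUTHORIZED_ITEMS: on the invoice, not on the PO
  }
  return invoice.filter(isImpacted).map((line) => line.sku)
}
```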
Beyond the 12 pre-seeded scenarios, users can upload any invoice PDF, JPEG, PNG, or WEBP. The system:
scenario_id.
Uploaded files are written to /tmp (Vercel's only writable directory) and deleted immediately after processing. They are never committed to git or persisted beyond the request.
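Before an uploaded file is written anywhere, its size and type can be checked from the bytes themselves. The sketch below is one way to do that and is not taken from the project: it identifies the four accepted formats by their leading signature bytes instead of trusting the client-supplied type, and it interprets the 10 MB limit stated in this write-up as 10 * 1024 * 1024 bytes, which is an assumption.

```typescript
const MAX_UPLOAD_BYTES = 10 * 1024 * 1024 // assumed reading of "10 MB"

type AllowedMime = 'application/pdf' | 'image/jpeg' | 'image/png' | 'image/webp'

// True when `bytes` contains `signature` starting at `offset`.
function hasSignature(bytes: Uint8Array, signature: number[], offset: number = 0): boolean {
  return signature.every((byte, i) => bytes[offset + i] === byte)
}

// Detects the file type from its signature bytes; null if not on the allow-list.
function sniffMime(bytes: Uint8Array): AllowedMime | null {
  if (hasSignature(bytes, [0x25, 0x50, 0x44, 0x46])) return 'application/pdf' // "%PDF"
  if (hasSignature(bytes, [0xff, 0xd8, 0xff])) return 'image/jpeg'
  if (hasSignature(bytes, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return 'image/png'
  const isRiff = hasSignature(bytes, [0x52, 0x49, 0x46, 0x46])          // "RIFF"
  const isWebp = hasSignature(bytes, [0x57, 0x45, 0x42, 0x50], 8)       // "WEBP" at byte 8
  if (isRiff && isWebp) return 'image/webp'
  return null
}

// Throws on empty, oversized, or unsupported uploads; returns the detected type.
function validateUpload(bytes: Uint8Array): AllowedMime {
  if (bytes.length === 0) throw new Error('empty file')
  if (bytes.length > MAX_UPLOAD_BYTES) throw new Error('file exceeds the 10 MB limit')
  const mime = sniffMime(bytes)
  if (mime === null) throw new Error('unsupported file type')
  return mime
}
```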
CI runs tsc --noEmit + ESLint + Next.js build on every push. Merge to master triggers an automatic Vercel deploy. Zero-downtime, zero config.
Serverless Postgres that scales to zero between runs and autoscales under load. The same repo runs against SQLite locally via a single env-var switch.
Every agent run creates a Langfuse trace with one span per tool call. The decision panel includes a direct "View Trace" link that opens the full trace in Langfuse, including token counts, latency, inputs, and outputs.
Upstash Redis rate-limiting at the edge (20 req/min/IP). Daily hard cap on total LLM calls. Bring-your-own-invoice (BYOI) uploads have a file size limit (10 MB) and MIME validation. All LLM calls are server-side only.
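To illustrate what a 20 requests per minute per IP limit means in practice, here is a fixed-window counter. It is a sketch only: an in-memory Map stands in for Redis, the fixed-window algorithm is an assumption (the write-up does not say which algorithm is configured), and the function name is hypothetical.

```typescript
const WINDOW_MS = 60 * 1000
const MAX_REQUESTS_PER_WINDOW = 20

type WindowState = { windowStart: number; count: number }

// In-memory stand-in for the shared Redis store.
const requestWindows = new Map<string, WindowState>()

// Returns true if the request is allowed, false if the IP has used up its
// budget for the current window. `now` is injectable for testing.
function allowRequest(ip: string, now: number = Date.now()): boolean {
  const state = requestWindows.get(ip)
  if (!state || now - state.windowStart >= WINDOW_MS) {
    requestWindows.set(ip, { windowStart: now, count: 1 }) // start a new window
    return true
  }
  if (state.count >= MAX_REQUESTS_PER_WINDOW) return false
  state.count += 1
  return true
}
```

An in-process Map like this would not work on a serverless platform, where each instance has its own memory, which is the reason a shared store such as Redis is used for the real limiter.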