FastPay AI: Autonomous 3-Way Invoice Matching Agent

01: The Problem

3-Way Matching: thousands of manual hours in every Accounts Payable (AP) team

Try it live: the full demo is at fastpay-ai.mezapps.com. Click any invoice card to watch the agent reason through it in real time, or hit "Process Today's Batch" to run all 12 scenarios at once.

Every Accounts Payable (AP) team runs the same gauntlet daily: for every vendor invoice received, a clerk must manually pull the corresponding Purchase Order and warehouse receipt, then verify that quantities, prices, line items, and totals agree across all three documents. Any mismatch (a shortage, a price increase, an unauthorized SKU, a duplicate submission) must be caught before payment is released.

This is 3-Way Matching. It is repetitive, time-consuming, and deeply error-prone at scale. It is also a textbook agentic AI use case: structured document comparison, tool-based data lookup, deterministic rules, and a final reasoning step that must produce an auditable decision.

12 hand-crafted test scenarios

7 agent tools per run

$0 monthly infrastructure cost

02: Agent Architecture

Custom orchestrator: no framework, full control

The agent is built as a hand-rolled orchestrator in TypeScript, with no LangChain and no LangGraph. This was a deliberate choice: 3-Way Matching has a fixed, deterministic pipeline. The steps never change order, there is no dynamic tool routing, and the control flow is simple enough that a framework would add overhead without value.

Tool-per-file pattern

Each capability lives in its own file under src/lib/agent/tools/ (extract-pdf.ts, lookup-po.ts, query-wms.ts, etc.). Each is a plain async function with typed inputs and Zod-validated outputs.

Orchestrator as a pipeline

orchestrator.ts is a single async function, runAgent(), that calls tools in sequence and emits a TraceEvent after each step via a callback. There is no agent loop and no recursion.

Repository pattern

All DB access goes through db/repo.ts. The same codebase runs against SQLite locally and Neon Postgres in production: one environment variable flip, zero code changes.

Streaming via SSE

Each tool call emits a TraceEvent that is pushed to the browser over a Server-Sent Events stream. The UI updates in real time, with no polling and no WebSocket overhead.

// orchestrator.ts, simplified (AgentResult is an assumed name for the return type)
export async function runAgent(invoiceId: string, emit: EmitFn): Promise<AgentResult> {
  const invoice  = await getInvoiceById(invoiceId)
  await checkDuplicate(invoice.invoice_number)   // fast, no API

  const extracted   = await extractPdf(invoice.pdf_path)  // Gemini Vision
  const po          = await lookupPo(extracted.po_reference)
  const wms         = await queryWms(po.id)
  const vendorMatch = await fuzzyMatchVendor(extracted.vendor_name, po.vendor_name)
  const fxResult    = await convertCurrency(...)           // if currencies differ

  const decision = await reasonAndDecide({ extracted, po, wms, vendorMatch, fxResult })
  emit({ step: 'decide', status: 'done', detail: decision.status })

  await saveResult(invoiceId, decision)
  return buildResult(invoiceId, decision, trace)
}
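One way to keep the per-step trace emission uniform is a small wrapper around each tool call. The following is a minimal sketch under stated assumptions, not the project's actual code: `runStep` is a hypothetical helper, and the `TraceEvent` shape beyond `step`, `status`, and `detail` (including the `ms` field and the `'error'` status) is invented for illustration.

```typescript
// Hypothetical shapes: only step/status/detail appear in the simplified orchestrator.
type TraceEvent = {
  step: string
  status: 'running' | 'done' | 'error'
  detail?: string
  ms?: number
}
type EmitFn = (event: TraceEvent) => void

// Runs one tool, emitting a "running" event before it and a "done" or "error" event after it.
async function runStep<T>(
  step: string,
  emit: EmitFn,
  tool: () => Promise<T>,
  summarize: (result: T) => string = () => ''
): Promise<T> {
  const started = Date.now()
  emit({ step, status: 'running' })
  try {
    const result = await tool()
    emit({ step, status: 'done', detail: summarize(result), ms: Date.now() - started })
    return result
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err)
    emit({ step, status: 'error', detail, ms: Date.now() - started })
    throw err // the caller decides whether a failed step means escalation
  }
}
```

With a helper like this, a pipeline line could read `const po = await runStep('lookup_po', emit, () => lookupPo(ref), (p) => p.id)`, so every step reports its start and finish the same way.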

03: Vision-Based Extraction

Reading invoices the way a human would: including messy ones

The extraction step is the hardest: turning an unstructured invoice image into a structured JSON object. The agent sends the raw PDF or image file to Gemini 2.5 Flash as base64-encoded inline data, along with a schema-anchored extraction prompt. The response is validated with a Zod schema before any downstream tool sees it.

// extract-pdf.ts
const result = await model.generateContent([
  EXTRACTION_PROMPT,                         // structured output instructions
  { inlineData: { mimeType, data: base64 } } // raw file bytes
])

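// Zod validates the response: a missing or mistyped field throws here.
// Unknown (hallucinated) fields are stripped by default, or rejected if the schema uses .strict().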
return ExtractedInvoiceSchema.parse(JSON.parse(result.response.text()))
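A practical wrinkle with this pattern is that model output can arrive wrapped in a Markdown code fence or with stray prose around the JSON, in which case a bare `JSON.parse` throws. The sketch below shows one defensive approach. `parseModelJson` and `isExtractedCore` are hypothetical helpers, and the hand-written type guard is only a stand-in for the Zod schema; the field names `po_reference` and `vendor_name` are taken from the orchestrator snippet, and nothing else about the extracted shape is assumed.

```typescript
// Only the two fields the orchestrator visibly uses; the real schema has more.
type ExtractedCore = { po_reference: string; vendor_name: string }

// Pulls the outermost {...} span out of raw model text, which tolerates
// Markdown fences or stray prose around the JSON object.
function parseModelJson(text: string): unknown {
  const start = text.indexOf('{')
  const end = text.lastIndexOf('}')
  if (start === -1 || end <= start) {
    throw new Error('No JSON object found in model output')
  }
  return JSON.parse(text.slice(start, end + 1))
}

// Hand-written stand-in for the Zod check: narrows unknown to ExtractedCore.
function isExtractedCore(value: unknown): value is ExtractedCore {
  if (typeof value !== 'object' || value === null) return false
  const v = value as Record<string, unknown>
  return typeof v.po_reference === 'string' && typeof v.vendor_name === 'string'
}
```

The parsed value would then be handed to the schema validator, so a malformed response fails before any database lookup runs.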

Why Gemini multimodal over a dedicated OCR service? Dedicated OCR services extract text but miss context: they can't infer that a handwritten note says "5% loyalty discount applied" or that a crumpled, low-contrast scan still has a readable PO number. Gemini natively understands layout, tables, annotations, and partial text.

The 12 test scenarios cover 5 invoice variants: clean digital PDF, scanned document, phone photo, handwritten annotation, and crumpled/degraded. All 5 formats are tested against the same extraction schema to prove robustness.

04: Tool-Based Matching

Five typed tools that query real data sources

After extraction, the orchestrator runs a sequence of typed tool calls. Each tool is a pure async function: no side effects, no shared state, easy to test in isolation.

lookup_po + query_wms

Parameterized SQL queries against the Neon Postgres database. Primary lookup is by PO reference from the invoice; fallback is fuzzy vendor-name search across all open POs.

fuzzy_match_vendor

Gemini embeddings + cosine similarity. Catches vendor name drift ("ACME Corp." vs "Acme Corporation Inc.") and near-duplicate fraud ("Apex Logistics" vs "Apax Logistics Inc.").
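The comparison itself is plain vector math once the two names have been embedded. A minimal sketch, assuming the embeddings are already available as number arrays; the threshold value and the `isSameVendor` helper are illustrative assumptions, since the actual cutoff is not stated here.

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('Vectors must be non-empty and the same length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0 // zero vector: treat as no similarity
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Assumed threshold, for illustration only.
const VENDOR_MATCH_THRESHOLD = 0.85

function isSameVendor(invoiceVec: number[], poVec: number[]): boolean {
  return cosineSimilarity(invoiceVec, poVec) >= VENDOR_MATCH_THRESHOLD
}
```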

convert_currency

Live FX rate lookup with a 6-hour Neon cache. Only runs when invoice currency ≠ PO currency. Downstream tools receive a converted amount for apples-to-apples price comparison.
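The caching rule can be sketched as a time-to-live check in front of the live lookup. This is an in-memory stand-in for the Neon-backed cache, written under assumptions: `FxRateCache`, `convertAmount`, and the injected `fetchRate` function are hypothetical names, and rounding to cents is an illustrative choice.

```typescript
const SIX_HOURS_MS = 6 * 60 * 60 * 1000

type CachedRate = { rate: number; fetchedAt: number }

// In-memory stand-in for the database-backed rate cache.
class FxRateCache {
  private rates = new Map<string, CachedRate>()
  private ttlMs: number

  constructor(ttlMs: number = SIX_HOURS_MS) {
    this.ttlMs = ttlMs
  }

  // Returns a cached rate, or undefined if it is missing or older than the TTL.
  get(from: string, to: string, now: number = Date.now()): number | undefined {
    const hit = this.rates.get(`${from}:${to}`)
    if (!hit || now - hit.fetchedAt >= this.ttlMs) return undefined
    return hit.rate
  }

  set(from: string, to: string, rate: number, now: number = Date.now()): void {
    this.rates.set(`${from}:${to}`, { rate, fetchedAt: now })
  }
}

// Converts only when the currencies differ; fetchRate is a hypothetical live-rate lookup.
async function convertAmount(
  amount: number,
  from: string,
  to: string,
  cache: FxRateCache,
  fetchRate: (from: string, to: string) => Promise<number>,
  now: number = Date.now()
): Promise<number> {
  if (from === to) return amount
  let rate = cache.get(from, to, now)
  if (rate === undefined) {
    rate = await fetchRate(from, to)
    cache.set(from, to, rate, now)
  }
  return Math.round(amount * rate * 100) / 100 // round to cents
}
```

Passing `now` explicitly keeps the expiry logic testable without waiting six hours.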

check_duplicate

Fast pre-check against match_results by invoice_number. It fires before any LLM call: if the invoice was already processed, it is immediately flagged as DUPLICATE at near-zero cost.

Multi-line item matching: POs and WMS receipts contain 1–3 SKUs each. The agent compares every line item individually, so a quantity shortage on SKU-A still flags the invoice even when SKU-B matches, and the specific impacted SKU is highlighted in the UI.

05: Decision Logic

Deterministic rules + LLM-generated explanation

The final reason_and_decide step combines two things: a deterministic rule engine that catches every known mismatch type, and a Gemini call that writes a plain-English explanation of exactly what was found and why.

Condition detected	Decision
All quantities and prices match	APPROVED
WMS received qty < invoiced qty	FLAGGED, SHORTAGE
Invoice unit price > PO agreed price	FLAGGED, PRICE MISMATCH
Line item not present in PO	FLAGGED, UNAUTHORIZED ITEMS
Same invoice number previously processed	FLAGGED, DUPLICATE
Tax totals don't match line-item arithmetic	FLAGGED, TAX MISMATCH
Currency conversion involved	FLAGGED, FX CONVERSION
Vendor name mismatch beyond fuzzy threshold	FLAGGED, VENDOR MISMATCH
Confidence below threshold / ambiguous	ESCALATED, HUMAN REVIEW
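The three line-level rows of this table (shortage, price mismatch, unauthorized items) reduce to per-SKU comparisons. The sketch below illustrates that part only and is not the project's rule engine: the `Line` and `Flag` shapes and the `checkLines` and `decide` names are assumptions, prices are assumed to be in a single currency already, and the duplicate, tax, FX, vendor, and escalation rules are left out.

```typescript
type Line = { sku: string; qty: number; unitPrice: number }
type Flag = {
  reason: 'SHORTAGE' | 'PRICE_MISMATCH' | 'UNAUTHORIZED_ITEMS'
  sku: string
  detail: string
}

// Compares each invoice line against the PO (authorization, price) and the
// WMS receipt (received quantity), returning one flag per problem found.
function checkLines(invoice: Line[], po: Line[], receivedQty: Record<string, number>): Flag[] {
  const poBySku = new Map(po.map((l) => [l.sku, l] as const))
  const flags: Flag[] = []
  for (const line of invoice) {
    const poLine = poBySku.get(line.sku)
    if (!poLine) {
      flags.push({ reason: 'UNAUTHORIZED_ITEMS', sku: line.sku, detail: 'not on PO' })
      continue
    }
    if (line.unitPrice > poLine.unitPrice) {
      flags.push({
        reason: 'PRICE_MISMATCH',
        sku: line.sku,
        detail: `invoiced ${line.unitPrice}, PO ${poLine.unitPrice}`,
      })
    }
    const received = receivedQty[line.sku] ?? 0
    if (received < line.qty) {
      flags.push({
        reason: 'SHORTAGE',
        sku: line.sku,
        detail: `invoiced ${line.qty}, received ${received}`,
      })
    }
  }
  return flags
}

// No flags means the invoice can be approved; anything else is flagged.
function decide(flags: Flag[]): 'APPROVED' | 'FLAGGED' {
  return flags.length === 0 ? 'APPROVED' : 'FLAGGED'
}
```

Because every flag carries its SKU, the same output can drive both the decision and the per-line highlighting in the UI.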

Why deterministic rules AND an LLM? The rules catch every known mismatch type reliably, whereas the LLM cannot be trusted to do arithmetic on its own. But rules produce cryptic output: "SHORTAGE on SKU MON-27Q: invoiced 50, received 48". The LLM turns that into auditable prose that an Accounts Payable (AP) clerk can attach to a dispute ticket: "Invoice CSP-2024-2201 claims 50 units of the 27-inch QHD monitor (SKU MON-27Q) at $320 each, but the WMS receipt records only 48 units received at dock, a shortfall of 2 units valued at $640…"

06: Live Streaming UI

Watching the agent think: in real time

The UI is a Next.js 14 App Router application. When a user clicks an invoice card, the browser opens a Server-Sent Events connection to /api/agent/stream. Every tool call in the orchestrator emits a TraceEvent that is pushed down the stream immediately, so the trace panel updates step by step as the agent works.
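On the server side, each event has to be written in the SSE wire format: a `data:` line followed by a blank line. A minimal sketch of that serialization follows. The `StreamEvent` shape is an assumption built around the `step` and `result` event types the client code checks for, and `toSseFrame` and `SSE_HEADERS` are hypothetical names.

```typescript
// Assumed event shapes; only the 'step' and 'result' type tags come from the client code.
type StreamEvent =
  | { type: 'step'; step: string; status: 'running' | 'done'; detail?: string }
  | { type: 'result'; status: string }

// Response headers commonly set for an SSE endpoint.
const SSE_HEADERS: Record<string, string> = {
  'Content-Type': 'text/event-stream',
  'Cache-Control': 'no-cache',
  Connection: 'keep-alive',
}

// Serializes one event as an SSE frame. JSON.stringify never emits raw
// newlines, so the payload always fits on a single "data:" line.
function toSseFrame(event: StreamEvent): string {
  return `data: ${JSON.stringify(event)}\n\n`
}
```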

Invoice Gallery

12 cards with real PDF thumbnails, difficulty badges, and skill tags. Each card shows the agent's result as a colour dot (green/red/amber) once processed. Individual PDF download on each card.

Agent Trace Panel

Live log of every tool call: function name, inputs, and output summary. Steps flip from "running" to "done" in place with animations. The log matches exactly what orchestrator.ts emits.

Decision Output Panel

Animated reveal of status badge, confidence bar, flag reason, and full agent reasoning. Includes a direct link to the Langfuse trace for the run.

Process Today's Batch

Single button runs all 12 invoices sequentially. The ActionBar shows live approved/flagged/escalated counters updating as each result arrives.

// SSE stream consumer, client side
const reader = res.body.getReader()
while (true) {
  const { done, value } = await reader.read()
  if (done) break

  const events = parseSSE(value)
  for (const event of events) {
    if (event.type === 'step')   applyStep(event)  // updates trace panel
    if (event.type === 'result') setDecision(event) // reveals decision
  }
}
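One detail the simplified loop glosses over is that a network chunk can end in the middle of an SSE frame, so the parser needs to buffer until it sees the blank-line delimiter. The sketch below shows one way a helper like `parseSSE` could handle that. `SseBuffer` is a hypothetical class, it assumes `\n` line endings and JSON payloads on `data:` lines, and it ignores other SSE fields such as `event:` and `id:`.

```typescript
// Accumulates decoded text and returns only complete frames (terminated by a blank line).
class SseBuffer {
  private pending = ''
  private decoder = new TextDecoder()

  // Feed one network chunk; returns the parsed payload of every frame it completes.
  push(chunk: Uint8Array): unknown[] {
    // stream: true keeps multi-byte characters intact across chunk boundaries
    this.pending += this.decoder.decode(chunk, { stream: true })
    const frames = this.pending.split('\n\n')
    this.pending = frames.pop() ?? '' // the last piece may be an incomplete frame
    const events: unknown[] = []
    for (const frame of frames) {
      const data = frame
        .split('\n')
        .filter((line) => line.startsWith('data:'))
        .map((line) => line.slice(5).trimStart())
        .join('\n')
      if (data) events.push(JSON.parse(data))
    }
    return events
  }
}
```

In the read loop, `buffer.push(value)` would take the place of `parseSSE(value)`, with a single `SseBuffer` instance created before the loop starts.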

07: Evaluation Framework

Proving the agent is reliable: with numbers

Every AI demo should have an eval suite. This one runs automatically: hit "Run Eval" and the agent processes all 12 scenarios fresh, then computes a full report against ground-truth labels defined in data/scenarios.json.

Accuracy & Macro F1

Top-line metrics across all 12 scenarios. F1 is macro-averaged across the three classes (APPROVED / FLAGGED / ESCALATED), so each class counts equally regardless of how many scenarios it contains.

Per-class P / R / F1

Precision, recall, and F1 broken out per class with a visual bar. Makes it easy to see whether the agent over-flags (high recall, low precision) or misses flags (high precision, low recall).

3×3 Confusion Matrix

Heat-map grid of actual vs predicted class. Green diagonal = correct. Red off-diagonal = specific error type, e.g. ESCALATED misclassified as APPROVED tells you the agent isn't conservative enough.
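For reference, the per-class metrics and the macro average can be computed directly from (expected, got) pairs. This is a minimal sketch rather than the project's eval code: the function names and the convention of returning 0 when a denominator is zero are assumptions.

```typescript
const CLASSES = ['APPROVED', 'FLAGGED', 'ESCALATED'] as const
type Label = (typeof CLASSES)[number]
type ClassMetrics = { precision: number; recall: number; f1: number }

// One-vs-rest precision, recall, and F1 for each class.
function perClassMetrics(pairs: { expected: Label; got: Label }[]): Record<Label, ClassMetrics> {
  const out = {} as Record<Label, ClassMetrics>
  for (const cls of CLASSES) {
    const tp = pairs.filter((p) => p.expected === cls && p.got === cls).length
    const fp = pairs.filter((p) => p.expected !== cls && p.got === cls).length
    const fn = pairs.filter((p) => p.expected === cls && p.got !== cls).length
    const precision = tp + fp === 0 ? 0 : tp / (tp + fp)
    const recall = tp + fn === 0 ? 0 : tp / (tp + fn)
    const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall)
    out[cls] = { precision, recall, f1 }
  }
  return out
}

// Macro F1: the unweighted mean of the per-class F1 scores.
function macroF1(metrics: Record<Label, ClassMetrics>): number {
  return CLASSES.reduce((sum, cls) => sum + metrics[cls].f1, 0) / CLASSES.length
}
```

As a worked case, take five scenarios where one FLAGGED invoice is wrongly APPROVED and the rest are correct (two APPROVED, one FLAGGED, one ESCALATED). APPROVED then has precision 2/3 and recall 1 (F1 = 0.8), FLAGGED has precision 1 and recall 0.5 (F1 ≈ 0.667), ESCALATED scores 1, and the macro F1 is about 0.822.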

p50 / p95 Latency

Latency percentiles across all 12 runs. p95 exposes outliers: a single slow scenario (e.g. handwritten OCR) can skew the average, yet the average alone does not reveal that one run was responsible.
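A short sketch of the percentile computation, using the nearest-rank definition. That choice is an assumption (the write-up does not say which definition the eval uses), and `percentile` is a hypothetical helper.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('No samples')
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]')
  const sorted = [...samples].sort((a, b) => a - b) // copy so the input is not mutated
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[rank - 1]
}

function mean(samples: number[]): number {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length
}
```

With eleven runs at 2,000 ms and one at 14,000 ms, the mean is 3,000 ms, p50 is 2,000 ms, and p95 is 14,000 ms. Under nearest-rank with only 12 samples, p95 is simply the slowest run, which is exactly the outlier the average hides.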

Eval skip detection: if an invoice is missing from the database (e.g. seed not run), the eval shows an amber ⚠ warning row instead of a silent dash. The per-scenario table also labels mismatches as "expected / got" instead of showing two unlabelled chips.

08: Database Explorer & Escalations

ERP-style data visibility baked into the demo

One of the goals was to make the data behind the agent visible, to show that this isn't a black box but a system you could wire into a real supply chain. Two dedicated views provide that transparency.

Database Explorer

Browse all Purchase Orders, WMS receipts, and processed invoices in ERP-style header/detail rows: one header per document, then one row per line item. Non-approved invoices show a ⚠ badge on the specific impacted SKUs.

Escalations View

Mirrors what an Accounts Payable (AP) manager's review queue looks like. Every FLAGGED or ESCALATED invoice is shown as a card with the flag reason, the agent's full explanation, confidence score, and the exact line items responsible highlighted in red.

The impacted-SKU logic is deterministic: for PRICE_MISMATCH, it compares invoice unit price vs PO price per SKU; for SHORTAGE, invoice qty vs WMS received qty; for UNAUTHORIZED_ITEMS, SKUs present in the invoice but absent from the PO.

09: Bring Your Own Invoice

Upload any real invoice and watch the agent find a flaw

Beyond the 12 pre-seeded scenarios, users can upload any invoice as a PDF, JPEG, PNG, or WEBP file. The system:

1. Extracts the invoice data with Gemini Vision
2. Generates a synthetic PO and WMS receipt from the extracted data
3. Injects one intentional discrepancy (randomly chosen: price variance, quantity shortage, or unauthorized line item)
4. Runs the full agent pipeline and explains what it found
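The generate-and-inject steps above can be sketched as follows. This is an illustration under assumptions, not the project's code: `injectDiscrepancy`, the `Line` shape, the 10% price reduction, and the one-unit shortage are all invented, and the discrepancy is always applied to the first invoice line for simplicity. The random source is injectable so the behaviour can be tested deterministically.

```typescript
type Line = { sku: string; qty: number; unitPrice: number }
type Discrepancy = 'PRICE_VARIANCE' | 'QUANTITY_SHORTAGE' | 'UNAUTHORIZED_ITEM'

const KINDS: Discrepancy[] = ['PRICE_VARIANCE', 'QUANTITY_SHORTAGE', 'UNAUTHORIZED_ITEM']

// Builds a synthetic PO and WMS receipt that agree with the invoice,
// then breaks exactly one of them. Assumes every invoice line has qty >= 1.
function injectDiscrepancy(invoice: Line[], rand: () => number = Math.random) {
  if (invoice.length === 0) throw new Error('Invoice has no line items')
  const kind = KINDS[Math.floor(rand() * KINDS.length)]
  const po = invoice.map((l) => ({ ...l })) // copies, so the invoice is never mutated
  const received: Record<string, number> = {}
  for (const l of invoice) received[l.sku] = l.qty

  const target = invoice[0]
  if (kind === 'PRICE_VARIANCE') {
    // The PO "agreed" a lower price than the invoice charges.
    po[0].unitPrice = Math.round(target.unitPrice * 0.9 * 100) / 100
  } else if (kind === 'QUANTITY_SHORTAGE') {
    // The warehouse received one unit fewer than invoiced.
    received[target.sku] = Math.max(0, target.qty - 1)
  } else {
    // The PO never listed this line, so the invoiced item is unauthorized.
    po.splice(0, 1)
  }
  return { kind, po, received }
}
```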

Subtle orchestrator bug, and how it was fixed: an early version re-extracted the uploaded PDF during the agent run, which caused the agent to read the original PO reference from the document (e.g. "PO-2024-0005") and find the wrong PO in the database, making the injected discrepancy invisible. The fix: BYOI invoices skip PDF re-extraction entirely and use the line items already stored in the DB, with the synthetic BYOI PO reference reconstructed from the scenario_id.

Uploaded files are written to /tmp (Vercel's only writable directory) and deleted immediately after processing. They are never committed to git or persisted beyond the request.

10: Deployment & Observability

Production-grade on a $0/month budget

Vercel + GitHub Actions

CI runs tsc --noEmit + ESLint + Next.js build on every push. Merge to master triggers an automatic Vercel deploy. Zero-downtime, zero config.

Neon Postgres

Serverless Postgres that scales to zero between runs and autoscales under load. The same repo runs against SQLite locally via a single env-var switch.

Langfuse observability

Every agent run creates a Langfuse trace with one span per tool call. The decision panel includes a direct "View Trace" link that opens the full trace in Langfuse: token counts, latency, inputs, and outputs.

Abuse protection

Upstash Redis rate-limiting at the edge (20 req/min/IP). Daily hard cap on total LLM calls. BYOI file size limit (10 MB) and MIME validation. All LLM calls are server-side only.
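The limits above can be illustrated with an in-memory sketch. It is a stand-in for the Redis-backed limiter, not a use of the Upstash API: `FixedWindowLimiter` and `isAcceptableUpload` are hypothetical names, a fixed window is an assumed algorithm, and 10 MB is interpreted here as 10 × 1024 × 1024 bytes.

```typescript
// In-memory fixed-window limiter, for illustration only. A serverless
// deployment needs shared storage such as Redis, since instances do not share memory.
class FixedWindowLimiter {
  private windows = new Map<string, { start: number; count: number }>()
  private limit: number
  private windowMs: number

  constructor(limit: number = 20, windowMs: number = 60000) {
    this.limit = limit
    this.windowMs = windowMs
  }

  // Returns true if this request fits in the caller's current window.
  allow(ip: string, now: number = Date.now()): boolean {
    const w = this.windows.get(ip)
    if (!w || now - w.start >= this.windowMs) {
      this.windows.set(ip, { start: now, count: 1 })
      return true
    }
    if (w.count >= this.limit) return false
    w.count += 1
    return true
  }
}

const MAX_UPLOAD_BYTES = 10 * 1024 * 1024 // assumed reading of "10 MB"
const ALLOWED_MIME = new Set(['application/pdf', 'image/jpeg', 'image/png', 'image/webp'])

// Checks the declared MIME type and size of an uploaded invoice.
function isAcceptableUpload(mime: string, sizeBytes: number): boolean {
  return ALLOWED_MIME.has(mime) && sizeBytes > 0 && sizeBytes <= MAX_UPLOAD_BYTES
}
```

A declared MIME type can be spoofed by the client, so a production check would typically also inspect the file's leading bytes.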