AI Agents

StackAI

AI Agents for the Enterprise

How to Build a Document Extraction Agent on StackAI (Step-by-Step Tutorial)

If you’ve ever tried to build a document extraction agent on StackAI, you already know the hardest part isn’t getting an LLM to read a PDF. It’s getting reliable, structured data out of messy real-world documents and pushing that data into the systems your team actually uses.


In this guide, you’ll learn how to build a document extraction agent on StackAI end-to-end: ingestion, OCR for PDFs, schema design, LLM document parsing, validation and exception handling, and finally webhook or API export (plus practical options like Sheets and databases). The goal is a production-ready document extraction workflow that’s accurate, maintainable, and easy to iterate on as templates change.


What a Document Extraction Agent Is (and Why It Matters)

A document extraction agent is an automated workflow that turns unstructured documents (PDFs, scans, images) into structured data (usually JSON) that downstream systems can consume.


A production-grade document extraction agent typically does five things:


  • Ingests documents (upload, email, drive folder, API/webhook)

  • Runs OCR and text normalization when needed

  • Extracts fields into a structured schema (JSON schema extraction)

  • Validates outputs and routes exceptions for review

  • Exports the results to business systems (Sheets, CRM, DB, webhook / API export)


This matters because most operational work is still buried in PDFs: invoices, contracts, onboarding packets, insurance forms, and compliance documentation. When extraction is reliable, you can move faster without sacrificing controls.


Common use cases for document extraction

Teams usually start with a narrow, high-volume workflow where accuracy has obvious business value, like:


  • Invoice extraction for AP: vendor, invoice number, due date, totals, line items

  • Contract data extraction: parties, renewal dates, fees, key clauses

  • Claims and insurance forms: policy numbers, claimant info, diagnosis/procedure codes

  • KYC/onboarding: IDs, proof of address, business registrations


Where extraction typically fails

Most PDF data extraction projects break down for predictable reasons:


  • Poor scans, skewed pages, faint text, or multi-page documents

  • Tables and line items that get merged or dropped

  • Template drift (the vendor updates layout and fields move)

  • No validation layer (wrong totals slip through)

  • No exception handling (every edge case becomes a manual fire drill)


The rest of this tutorial is designed to prevent those failure modes from day one.


Before You Start: What You’ll Build (Architecture + Example Output)

To keep this practical, the running example here is invoice extraction. Invoices are perfect for learning because they combine messy layouts with clear validation rules (numbers should add up, due dates should be after invoice dates, currencies should be consistent).


What the finished agent does

Your StackAI document extraction agent will:


  • Accept a PDF or image (scanned or digital)

  • Run OCR for PDFs when needed (especially scanned PDFs)

  • Extract invoice fields into a consistent JSON output

  • Validate the output (schema + business rules)

  • Export the result to your system of choice (webhook/API is the most flexible)


Example target JSON output

This is the shape you want before you touch prompts. Schema-first design is how you get consistent structured data extraction across varying templates.


{
 "vendor_name": "Acme Office Supplies",
 "invoice_number": "INV-10493",
 "invoice_date": "2026-01-15",
 "due_date": "2026-02-14",
 "currency": "USD",
 "subtotal": 1280.00,
 "tax": 102.40,
 "total": 1382.40,
 "line_items": [
   {
     "description": "Printer paper, 10 reams",
     "quantity": 2,
     "unit_price": 45.00,
     "line_total": 90.00
   },
   {
     "description": "Toner cartridge - black",
     "quantity": 4,
     "unit_price": 297.50,
     "line_total": 1190.00
   }
 ]
}

Why schema-first wins

When you define fields upfront, three things get easier immediately:


  1. Prompting becomes more precise because the model isn’t guessing the shape of the answer.

  2. Validation becomes straightforward (types, required fields, allowed formats).

  3. Exports become reliable because downstream mapping doesn’t change every time.


This is the difference between a demo and a document extraction workflow your finance team can trust.


Step 1 — Set Up Your StackAI Project and Agent Workflow

In StackAI, you’ll want to structure your agent like a pipeline rather than a single monolithic step. Teams get better reliability when they break work into small stages with clear inputs and outputs.


Recommended workflow structure

Use a modular flow that mirrors how humans actually process documents:


  1. Ingestion

  2. OCR + text preparation

  3. Extraction (LLM document parsing into schema)

  4. Validation + exception handling (human-in-the-loop review when needed)

  5. Export (webhook / API export, Sheets, DB, etc.)


This approach is especially important if you’re building multiple agents over time. In enterprise settings, the highest-performing initiatives avoid “do everything” agents and instead build targeted workflows with clear inputs and outputs, then scale from there.


Versioning and naming

Adopt conventions early so you can maintain and audit changes:


  • Agent name: invoice_extraction_v1

  • Prompt version: prompt_v3_line_items_fix

  • Schema version: invoice_schema_v2

  • Validation version: validation_rules_v1


Even a simple naming standard prevents painful confusion later when accuracy changes after a prompt tweak.


Build a small test dataset

Start with 5–10 documents. Make sure they’re intentionally varied:


  • A clean digital PDF (easy baseline)

  • A scanned PDF with skew or low contrast

  • A multi-page invoice

  • A long line-item table

  • One invoice with discounts or shipping lines

  • A non-USD currency example (even if you don’t plan to support it yet)


This dataset becomes your regression suite for Step 8.


Step 2 — Ingest Documents (PDFs, Images, Email, or Upload)

Your ingestion choice depends on whether you’re testing or going live.


Ingestion options

For most teams, the progression looks like this:


  • Manual upload for development and testing

  • Drive folder or shared inbox for production intake

  • Webhook/API for integration into an internal app or portal


If you’re planning to build a document extraction agent on StackAI for real operations, webhook-based ingestion is usually the cleanest because it’s deterministic and easier to secure.


Capture metadata at ingestion

The file alone is rarely enough. Add metadata so downstream actions are traceable:


  • source (email, drive, portal, API)

  • received_at timestamp

  • uploader or sender identity (if applicable)

  • customer_id / vendor_id / property_id (whatever matters to your workflow)

  • document_type (if known, or leave for a classifier later)


Good metadata makes exception handling far easier because reviewers can route issues back to the right owner.


Pre-processing tips that dramatically improve accuracy

A few small steps can improve OCR and extraction more than any clever prompt:


  • Convert images to a consistent format (PDF or PNG)

  • Deskew and rotate pages before OCR

  • Split extremely large PDFs (especially if they contain multiple documents)

  • Avoid overly aggressive compression that destroys text edges


Document ingestion checklist

  • Ensure correct orientation (no sideways scans)

  • Avoid cropped margins (totals often live in corners)

  • Keep original file name (useful for audit trails)

  • Store document source metadata alongside the file


Step 3 — OCR and Text Preparation (Make Messy Docs Usable)

OCR is where most document extraction workflows either become stable or fragile. It’s also where many tutorials cut corners.


When OCR is required

  • Scanned PDFs: OCR is required (the “text layer” is basically an image)

  • Digital PDFs: OCR may not be required, but you still need text extraction and normalization

  • Photos of documents: OCR is required and quality varies widely


A robust workflow detects whether the document has extractable text before defaulting to OCR. That keeps costs down and reduces noise.


OCR best practices for real documents

The big decision is whether to preserve layout.


  • Plain text OCR is simpler for the LLM, but tables and columns may collapse.

  • Layout-aware OCR is better for invoices and statements, but can introduce artifacts like repeated headers.


In invoice extraction, layout often matters because line items rely on row structure. If line items are critical, prioritize table-aware or layout-preserving OCR.


Text normalization that improves LLM extraction

After OCR, normalize the text before sending it to the model:


  • Remove repeated headers/footers that appear on every page

  • Join hyphenated words broken across lines

  • Normalize whitespace (collapse excessive spacing)

  • Preserve page boundaries when multi-page context matters
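The normalization steps above can be sketched as one pass over the per-page OCR output. The page-break marker and the "repeated first line" heuristic are assumptions to adapt to your documents:

```python
import re

def normalize_ocr_text(pages: list[str]) -> str:
    """Light cleanup of OCR output before LLM extraction (a sketch)."""
    # Treat a line that opens every page as a repeated header and drop it.
    first_lines = [p.strip().splitlines()[0].strip() for p in pages if p.strip()]
    repeated = {l for l in first_lines if first_lines.count(l) == len(pages)}
    cleaned = []
    for page in pages:
        lines = [l for l in page.splitlines() if l.strip() not in repeated]
        text = "\n".join(lines)
        # Join words hyphenated across line breaks: "pay-\nment" -> "payment"
        text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
        # Collapse runs of spaces/tabs but keep line structure
        text = re.sub(r"[ \t]{2,}", " ", text)
        cleaned.append(text.strip())
    # Preserve page boundaries with an explicit marker
    return "\n\n--- PAGE BREAK ---\n\n".join(cleaned)
```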


OCR quality gating (don’t skip this)

Add a quick check before you attempt extraction:


  • If extracted text length is suspiciously low, OCR likely failed

  • If most characters are non-alphanumeric, you may have encoding noise

  • If the document language is unexpected, route to review

  • If confidence is low, don’t “force extraction” — escalate


A simple rule like “if text < 500 characters for a 2-page invoice, route to review” can save hours of debugging later.
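That kind of gate can be a few lines of code run before extraction. The thresholds below are illustrative defaults, not tuned values:

```python
def ocr_quality_gate(text: str, page_count: int,
                     min_chars_per_page: int = 250) -> tuple[bool, str]:
    """Return (ok, reason); a failing reason routes the doc to review."""
    stripped = text.strip()
    # Suspiciously little text for the page count -> OCR likely failed
    if len(stripped) < min_chars_per_page * page_count:
        return False, "text_too_short"
    # Mostly non-alphanumeric characters -> likely encoding noise
    alnum = sum(c.isalnum() for c in stripped)
    if alnum / max(len(stripped), 1) < 0.5:
        return False, "mostly_non_alphanumeric"
    return True, "ok"
```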


Step 4 — Define the Extraction Schema (Fields, Types, Rules)

Schema design is where you translate business requirements into something the agent can reliably produce and validate.


Start with business requirements

Define:


  • Required fields: invoice_number, vendor_name, total, invoice_date

  • Optional fields: PO number, remit_to_address, notes

  • Acceptance criteria: totals must reconcile within a tolerance, dates must be valid


This prevents the common failure where the agent returns plausible-looking JSON that can’t actually be used.


A practical invoice extraction schema

Use snake_case and predictable types. Keep formats strict.


  • vendor_name: string

  • invoice_number: string

  • invoice_date: string (ISO 8601: YYYY-MM-DD)

  • due_date: string (ISO 8601) or null

  • currency: string (ISO 4217 like USD, EUR)

  • subtotal: number or null

  • tax: number or null

  • total: number

  • line_items: array of objects

  • notes: string or null


For line items, keep it minimal at first: description, quantity, unit_price, line_total. You can expand later with SKU, tax category, or service dates.
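Expressed as a JSON Schema, the field list above might look like this. The regex patterns and required-field list are a sketch to tighten or extend for your own requirements:

```json
{
  "type": "object",
  "required": ["vendor_name", "invoice_number", "invoice_date", "total"],
  "properties": {
    "vendor_name":    {"type": "string"},
    "invoice_number": {"type": "string"},
    "invoice_date":   {"type": "string", "pattern": "^\\d{4}-\\d{2}-\\d{2}$"},
    "due_date":       {"type": ["string", "null"], "pattern": "^\\d{4}-\\d{2}-\\d{2}$"},
    "currency":       {"type": ["string", "null"], "pattern": "^[A-Z]{3}$"},
    "subtotal":       {"type": ["number", "null"]},
    "tax":            {"type": ["number", "null"]},
    "total":          {"type": "number"},
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": ["string", "null"]},
          "quantity":    {"type": ["number", "null"]},
          "unit_price":  {"type": ["number", "null"]},
          "line_total":  {"type": ["number", "null"]}
        }
      }
    },
    "notes": {"type": ["string", "null"]}
  }
}
```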


Add field-level descriptions

Field descriptions do more than help humans. They improve extraction reliability because the model has less ambiguity. For example:


  • invoice_number: “Unique invoice identifier as printed on the invoice. Do not invent one.”

  • total: “Total amount due including tax and fees. Prefer ‘Amount Due’ if present.”


Define constraints and rules

Use constraints to reduce garbage outputs:


  • Dates must be ISO 8601

  • Currency must be one of your supported codes

  • Numbers must be numeric (no “$1,234.00” strings)

  • If a value is missing, return null instead of guessing


Plan for multi-entity extraction early, even if you don’t support it yet. For example, some PDFs contain multiple invoices in one file.


Step 5 — Build the Extraction Prompt (Reliable, Structured Outputs)

A good extraction prompt is direct, strict about formatting, and explicit about what not to do. The goal is consistent JSON schema extraction, not a narrative summary.


Prompt template (copy/paste)

Use a structure like this and adapt the schema section to your fields:


You are a document extraction system.

Task:
Extract invoice data from the provided document text. Output MUST be valid JSON that matches the schema below.

Rules:

- Return only JSON. No extra text.
- If a field is not present or cannot be confidently determined, return null.
- Do not infer or guess values that are not explicitly stated in the document.
- Use ISO 8601 dates: YYYY-MM-DD.
- Numbers must be numeric values (no currency symbols, no commas).
- Currency must be a 3-letter code (e.g., USD, EUR). If not stated, return null.
- Line items: extract each row as a separate object when possible.

Schema:
{
 "vendor_name": string|null,
 "invoice_number": string|null,
 "invoice_date": string|null,
 "due_date": string|null,
 "currency": string|null,
 "subtotal": number|null,
 "tax": number|null,
 "total": number|null,
 "line_items": [
   {
     "description": string|null,
     "quantity": number|null,
     "unit_price": number|null,
     "line_total": number|null
   }
 ]
}

Few-shot examples

If you want to push accuracy quickly, add two short examples:


  • One clean invoice with obvious totals

  • One messy invoice where the “total” appears as “Balance Due” or where tax is included


Keep examples short and focused on edge cases. You’re teaching formatting and decision rules, not summarization.


Common prompt pitfalls to avoid

  • Vague field definitions (“total” without specifying whether it’s subtotal or amount due)

  • No instruction for missing values (models fill gaps)

  • No numeric formatting rules (you get “$1,382.40” as a string)

  • No guidance for tables (line items collapse into one blob)


If you plan to build a document extraction agent on StackAI that’s stable over time, the prompt should read like a contract: explicit, testable, and hard to misinterpret.


Step 6 — Add Validation, Error Handling, and Human-in-the-Loop

Extraction is not the finish line. Validation is what makes this safe enough for finance, legal, or compliance workflows.


Validation layer 1: schema validation

Schema validation checks:


  • Types are correct (numbers are numbers, arrays are arrays)

  • Required fields exist (invoice_number, vendor_name, total)

  • Date strings match expected format


If schema validation fails, you can automatically re-run with a stricter prompt, or route the document to review.


Validation layer 2: business rule validation

Business rules catch “looks right but is wrong” outputs:


  • subtotal + tax ≈ total (use a small tolerance like 0.01–0.05 depending on rounding)

  • invoice_date ≤ due_date (if both present)

  • total > 0

  • If line_items exist, sum(line_total) ≈ subtotal (optional but powerful)


These rules are also great for exception routing because they produce clear error messages.
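The business rules above can be sketched as a single validation pass that returns routable error messages. Field names follow the example schema; the tolerance value is an assumption to adjust for your rounding conventions:

```python
from datetime import date

def validate_invoice(inv: dict, tol: float = 0.05) -> list[str]:
    """Business-rule checks; an empty list means the invoice passed."""
    errors = []
    subtotal, tax, total = inv.get("subtotal"), inv.get("tax"), inv.get("total")
    if total is None or total <= 0:
        errors.append("total must be a positive number")
    if None not in (subtotal, tax, total) and abs(subtotal + tax - total) > tol:
        errors.append(f"subtotal + tax != total ({subtotal} + {tax} != {total})")
    if inv.get("invoice_date") and inv.get("due_date"):
        if date.fromisoformat(inv["invoice_date"]) > date.fromisoformat(inv["due_date"]):
            errors.append("invoice_date is after due_date")
    items = inv.get("line_items") or []
    line_sum = sum(i.get("line_total") or 0 for i in items)
    if items and subtotal is not None and abs(line_sum - subtotal) > tol:
        errors.append(f"line items sum to {line_sum}, subtotal is {subtotal}")
    return errors
```

Because each failure is a plain sentence naming the rule, the same list can drive both exception routing and the reviewer UI.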


Validation rules checklist

  • Required fields present: vendor_name, invoice_number, total

  • Date format correct and logically consistent

  • Totals reconcile within tolerance

  • Currency consistent across amounts (or explicitly null)

  • Line items extracted as separate entries when present


Confidence scoring and gating

Even without an explicit confidence score from each step, you can derive useful signals:


  • Missing invoice_number or total is a hard fail

  • OCR text quality low is a hard fail

  • Totals not reconciling is a soft fail that should trigger review

  • Too many nulls is a warning sign


Human-in-the-loop review (what it should look like)

Human-in-the-loop review works best when reviewers see:


  • The original PDF

  • The extracted JSON

  • The specific fields that failed validation

  • The text snippet where the field was found (or where it should have been found)


Most importantly, corrections should feed back into iteration: update the schema descriptions, add an example, or adjust OCR settings.


Step 7 — Export the Extracted Data (Sheets, CRM, DB, Webhook/API)

Once your agent outputs validated JSON, you can connect it to almost anything. This is where document extraction becomes operational automation.


Common export targets

  • Webhook / API export to an internal service

  • Google Sheets row (quick ops workflows)

  • Airtable or Notion database (lightweight tracking)

  • Postgres insert (reporting and audit trails)

  • CRM or ERP integration via middleware


Webhook export is usually the best default because it keeps your integration logic in your application, where you can handle retries, deduplication, and authentication cleanly.
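A sketch of what the receiving side might expect: a signed POST with an idempotency key derived from the payload. The endpoint URL, secret, and header names here are hypothetical, not a StackAI convention:

```python
import hashlib, hmac, json
import urllib.request

WEBHOOK_URL = "https://example.com/hooks/invoices"  # hypothetical endpoint
WEBHOOK_SECRET = b"replace-me"                      # hypothetical shared secret

def build_request(extracted: dict) -> urllib.request.Request:
    """Build a signed webhook request; the receiver verifies the HMAC
    signature and drops deliveries whose idempotency key it has seen."""
    body = json.dumps(extracted, sort_keys=True).encode()
    signature = hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
    idempotency_key = hashlib.sha256(body).hexdigest()
    return urllib.request.Request(
        WEBHOOK_URL, data=body, method="POST",
        headers={"Content-Type": "application/json",
                 "X-Signature": signature,
                 "X-Idempotency-Key": idempotency_key})

# urllib.request.urlopen(build_request({...}))  # send; retries live in your app
```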


Mapping tips (especially for line items)

Line items are nested arrays, which some destinations don’t handle well. A practical approach is:


  • Store invoice-level fields in one record

  • Store line_items as a separate list in a second system/table

  • Or store line_items as JSON in a single field if your DB supports it


Also store both:


  • raw_text (from OCR/text extraction)

  • extracted_json (the structured output)


This is essential for auditability and debugging.


Idempotency and deduplication

Documents get re-uploaded. Emails get forwarded. Webhooks get retried. Build deduplication into your workflow:


  • Use (vendor_name + invoice_number + invoice_date) as a natural key when available

  • If invoice_number is missing, generate a file hash of the PDF and use that as a fallback

  • Keep a processing log with statuses: received, extracted, validated, exported, failed_review
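The natural-key-with-fallback idea can be sketched in a few lines. The key format is an assumption; any stable, normalized string works:

```python
import hashlib

def dedup_key(extracted: dict, pdf_bytes: bytes) -> str:
    """Natural key when the invoice fields are present, file hash otherwise."""
    vendor = extracted.get("vendor_name")
    number = extracted.get("invoice_number")
    inv_date = extracted.get("invoice_date")
    if vendor and number and inv_date:
        # Normalize so "Acme " and "acme" collide as intended
        return f"{vendor.strip().lower()}|{number.strip().upper()}|{inv_date}"
    return "filehash:" + hashlib.sha256(pdf_bytes).hexdigest()
```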


What to log for traceability

For production-grade PDF data extraction, log:


  • Document ID and source metadata

  • Extraction timestamp

  • Model version / configuration

  • Prompt version

  • Schema version

  • Validation results (pass/fail + which rules failed)


This makes changes debuggable and helps explain accuracy shifts.


Step 8 — Test and Iterate (Evaluation for Accuracy and Drift)

If you want this to stay reliable, you need lightweight evaluation. Otherwise, each improvement attempt becomes guesswork.


Build a test set that reflects reality

Keep your original 5–10 docs, then expand:


  • 20–50 documents once you’re serious about production

  • Include new vendor templates as they appear

  • Keep a few “nightmare docs” on purpose


Measure accuracy in a way that matches business value

Track at least these three metrics:


  1. Field-level accuracy: percent of fields that are correct

  2. Document-level success rate: percent of documents that pass validation without review

  3. Critical field pass rate: invoice_number, vendor_name, total, due_date


Critical field pass rate is often the best early metric because it’s directly tied to whether the workflow is usable.
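The three metrics can be computed from a list of golden outputs and a matching list of predictions, along these lines (a sketch; the critical-field list matches the example schema):

```python
def extraction_metrics(golden: list[dict], predicted: list[dict],
                       critical: tuple[str, ...] = ("invoice_number",
                                                    "vendor_name", "total",
                                                    "due_date")) -> dict:
    """Field-level accuracy, document success rate, critical field pass rate."""
    field_hits = field_total = doc_pass = crit_pass = 0
    for gold, pred in zip(golden, predicted):
        correct = {k: pred.get(k) == v for k, v in gold.items()}
        field_hits += sum(correct.values())
        field_total += len(correct)
        doc_pass += all(correct.values())
        # Only score critical fields that the golden output actually has
        crit_pass += all(correct.get(k, False) for k in critical if k in gold)
    n = len(golden)
    return {"field_accuracy": field_hits / field_total,
            "document_success_rate": doc_pass / n,
            "critical_field_pass_rate": crit_pass / n}
```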


Regression testing

When you update OCR settings, change the prompt, or revise the schema:


  • Re-run the full test set

  • Compare results against “golden” outputs

  • Make sure you didn’t fix one vendor and break three others


Handling template drift

Template drift is inevitable. The best mitigation strategies:


  • Add new examples to your prompt or evaluation set

  • Improve schema descriptions for ambiguous fields

  • Add a routing step that classifies document types (invoice vs receipt vs statement)

  • If you have a few dominant templates, consider specialized prompts per template


This is how teams scale from one successful workflow to many without compounding risk.


Best Practices for Production-Grade Document Extraction

Once you’ve built the baseline, these practices are what make it durable.


Choose the right approach per document type

  • Template-heavy workflows: combine rules with targeted extraction and strict validation

  • Template-light workflows: rely on LLM document parsing with strong schema constraints and review gating


Cost and performance controls

  • Cache OCR outputs so you don’t re-OCR the same file during iteration

  • Only run heavy extraction when OCR quality passes basic checks

  • Consider a two-pass approach: a cheaper first pass for header fields, then a targeted second pass for line items or total reconciliation only when needed


Security and compliance basics

Document workflows often include PII, banking details, or contract terms. Build basic controls early:


  • Restrict access to documents and outputs by role

  • Define retention policies (don’t keep everything forever by default)

  • Redact sensitive fields when exporting to lower-trust systems

  • Maintain audit logs for review and compliance needs


Observability and maintenance cadence

  • Monitor failure rates by vendor/template

  • Track which validation rules fail most often (these point to prompt/schema gaps)

  • Review failed docs monthly and update prompts/examples accordingly


Troubleshooting Guide (Common Issues + Fixes)

Output isn’t valid JSON

Cause: the prompt allows extra text, or the model is being “helpful.”


Fix:


  • Add a strict “Return only JSON” rule

  • Enforce schema validation and auto-retry with a stricter prompt

  • Remove any instruction that invites explanations


Totals are wrong or tax doesn’t reconcile

Cause: the model chose subtotal instead of total, or misread a “Balance Due.”


Fix:


  • Add explicit definitions: total = amount due including tax

  • Add business rule validation and route mismatches to review

  • If needed, run a second pass that explicitly re-checks totals


Line items are missing or merged

Cause: OCR collapsed a table into text blocks.


Fix:


  • Use layout-aware OCR settings when possible

  • Add line-item extraction rules: “each row becomes one array item”

  • Consider a dedicated line-item extraction step separate from header fields


Dates and currencies are inconsistent

Cause: formatting drift across documents.


Fix:


  • Lock ISO date formatting and currency code rules

  • Add normalization (strip symbols, parse formats)

  • Keep a “raw_value” field only if you truly need it


OCR returns gibberish

Cause: low-quality scan, compression artifacts, or wrong language settings.


Fix:


  • Improve input quality (deskew, increase contrast)

  • Re-run OCR with different settings

  • Fail fast and route to human-in-the-loop review instead of forcing extraction


Conclusion + Next Steps

To build a document extraction agent on StackAI that actually holds up in production, focus on the full pipeline, not just the extraction step. The reliable path is:


Ingest → OCR → schema-first extraction → validation → exception handling → export


Once this baseline is working, the most valuable next upgrades are:


  • Add a document type classifier/router (invoice vs receipt vs statement)

  • Build a lightweight review queue for exceptions

  • Maintain an evaluation set and run regression tests after every change


If you want to see what a production-grade document extraction workflow looks like in your environment, book a StackAI demo: https://www.stack-ai.com/demo
