From PDF to Pricing Signal: The Document Intelligence Layer

Why building the data layer first is the unglamorous prerequisite for everything else in insurance technology.

Insurance runs on documents. A submission is a PDF. A policy is a PDF. A statement of values is an Excel file pretending to be a PDF. A loss run is twelve scanned pages from a printer made in 2007, sometimes with the carrier's logo stamped diagonally across the table you're trying to read.

Before any of the interesting work in insurance can happen, that stack has to become something a machine can reason over. Pricing needs structured exposure. Benchmarking needs normalized coverage. Compliance verification needs both, plus the full policy text. None of it is possible until the documents have been read accurately, consistently, and at portfolio scale.

Most of the industry has solved this by hiring data entry teams. We didn't want to. So we built a document intelligence layer that turns insurance paperwork into structured, verifiable data without the labor cost and without the inconsistency.

Why generic document AI doesn't work here

The reason general-purpose document tools struggle with insurance is that insurance documents disagree with themselves.

Limits on the declarations page get modified by endorsements buried seventy pages later. Totals at the bottom of loss runs rarely match the row sums, because reserves get released mid-period and nobody updates the footer. A certificate of insurance often has the certificate holder field retyped by a leasing agent, typo included. Construction type on a statement of values might be coded against a classification standard that varies by insurer. The deductible the broker quoted is not always the deductible the carrier issued, and the issued one is what counts.

Reading insurance well, in the sense of producing data you can actually price against, requires three things that don't exist out of the box. Domain-specific extraction that knows where to look in each document type. Validation that catches when a document contradicts itself. And calibrated confidence scoring that signals when a human should look at the result. Each of these has to be built deliberately, and built differently for every document type insurance throws at you.

A specialist for every document

Our extraction layer is not a single model with one giant schema. It's a set of specialists.

The certificate-of-insurance specialist knows that the certificate holder field is often where the lender's name was retyped by hand. The loss run specialist knows that row totals and footer totals are independent sources of truth and treats them that way. The policy specialist treats the declarations page as authoritative for limits and the endorsements as authoritative for modifications, and it knows how to resolve a conflict between the two.
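
To make the declarations-versus-endorsements rule concrete, here is a minimal sketch in Python. It is an illustration, not our production code: the `Endorsement` and `ResolvedField` types, the field names, and the rule that endorsements apply in page order are all simplifying assumptions. Real policies can also depend on effective dates and endorsement wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endorsement:
    """Hypothetical, simplified endorsement: it overrides one named field."""
    form_number: str
    page: int
    field: str
    new_value: float


@dataclass(frozen=True)
class ResolvedField:
    value: float
    source: str  # where the winning value came from, kept for the audit trail


def resolve_limits(declarations: dict[str, float],
                   endorsements: list[Endorsement]) -> dict[str, ResolvedField]:
    """Start from the declarations page, then apply endorsements in page order."""
    resolved = {
        name: ResolvedField(value, "declarations")
        for name, value in declarations.items()
    }
    # Assumption: a later endorsement wins over an earlier one.
    for e in sorted(endorsements, key=lambda e: e.page):
        source = f"endorsement {e.form_number} (page {e.page})"
        resolved[e.field] = ResolvedField(e.new_value, source)
    return resolved


declarations = {"building_limit": 5_000_000.0, "wind_deductible": 25_000.0}
endorsements = [Endorsement("END-07", 71, "wind_deductible", 100_000.0)]
resolved = resolve_limits(declarations, endorsements)
print(resolved["wind_deductible"])
print(resolved["building_limit"])
```

Keeping the source next to each value means a reviewer can see that the wind deductible came from an endorsement rather than the declarations page.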

Each specialist is powered by modern AI, but the AI is not the whole system. Around it sit constraints on what gets extracted, validators that check internal consistency, and reconciliation logic that resolves contradictions between fields. The model proposes. The constraints decide whether to accept the proposal.
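
One way to picture "the model proposes, the constraints decide" is a list of validators run over a proposed extraction. The sketch below assumes a loss run proposal shaped as a plain dictionary; the key names and the two rules are hypothetical examples rather than our actual rule set.

```python
from typing import Callable

# A validator inspects a proposed extraction and returns a list of problems.
Validator = Callable[[dict], list[str]]


def footer_matches_rows(proposal: dict) -> list[str]:
    """Row sums and the footer total are independent; report any disagreement."""
    row_sum = round(sum(row["incurred"] for row in proposal["rows"]), 2)
    footer = round(proposal["footer_total_incurred"], 2)
    if row_sum != footer:
        return [f"row sum {row_sum} != footer total {footer}"]
    return []


def paid_not_above_incurred(proposal: dict) -> list[str]:
    """Illustrative rule: flag any claim whose paid amount exceeds incurred."""
    return [
        f"claim {row['claim_id']}: paid exceeds incurred"
        for row in proposal["rows"]
        if row["paid"] > row["incurred"]
    ]


def decide(proposal: dict, validators: list[Validator]) -> tuple[bool, list[str]]:
    """Accept the model's proposal only if every validator passes."""
    problems = [p for check in validators for p in check(proposal)]
    return (not problems, problems)


proposal = {  # what a model might return for a two-claim loss run
    "rows": [
        {"claim_id": "A1", "paid": 10_000.0, "incurred": 12_500.0},
        {"claim_id": "A2", "paid": 0.0, "incurred": 4_000.0},
    ],
    "footer_total_incurred": 18_000.0,  # stale footer: the rows sum to 16,500
}
accepted, problems = decide(proposal, [footer_matches_rows, paid_not_above_incurred])
print(accepted, problems)
```

The proposal is rejected here even though the JSON is well formed, because the footer and the rows disagree. What happens next (reconcile automatically or send to review) is a separate decision.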

This is the part that separates a working extraction system from a demo. A demo extracts a clean field from a clean document. A working system handles the document where the broker forgot to update page 4 after the carrier changed the wind deductible on page 12.

Calibrated confidence, not vibes

One of the biggest failure modes in document AI is unjustified confidence. A model returns clean JSON and downstream code assumes it's right. We don't.

Every field in our data layer carries a calibrated confidence score. Not the model's self-reported confidence, which is unreliable in both directions, but a score derived from how the field was extracted, how the underlying text was captured, and what we've learned historically about accuracy on that specific field type. We know which fields the system is reliably good at and which ones still need a second pair of eyes.
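
A simple way to ground a score in history rather than in the model's own opinion is to track how often reviewers confirmed past extractions of the same kind. The sketch below is one possible approach, not a description of our scoring: the `EmpiricalCalibrator` class, the three-part key, and the use of Laplace smoothing are assumptions chosen for illustration.

```python
from collections import defaultdict

# Key: (field_type, extraction_method, capture_type). All three labels are
# hypothetical; a real system would define its own taxonomy.
Key = tuple[str, str, str]


class EmpiricalCalibrator:
    """Confidence = smoothed share of past extractions that reviewers confirmed."""

    def __init__(self) -> None:
        self._correct: dict[Key, int] = defaultdict(int)
        self._total: dict[Key, int] = defaultdict(int)

    def record(self, key: Key, was_correct: bool) -> None:
        """Log one reviewed extraction for this kind of field."""
        self._total[key] += 1
        self._correct[key] += int(was_correct)

    def confidence(self, key: Key) -> float:
        # Laplace smoothing: an unseen key scores 0.5 rather than 0 or 1.
        return (self._correct.get(key, 0) + 1) / (self._total.get(key, 0) + 2)


cal = EmpiricalCalibrator()
digital = ("building_limit", "table_parse", "native_pdf")
scanned = ("construction_type", "free_text", "ocr_scan")
for _ in range(98):
    cal.record(digital, True)            # 98 of 98 confirmed
for outcome in [True] * 6 + [False] * 4:
    cal.record(scanned, outcome)         # 6 of 10 confirmed

print(round(cal.confidence(digital), 3))  # 99 / 100 = 0.99
print(round(cal.confidence(scanned), 3))  # 7 / 12 = 0.583
```

The smoothing keeps a field type with little review history from looking perfect after a handful of lucky extractions.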

High-confidence fields enter the data layer directly. Lower-confidence fields go to a review queue staffed by our auditor team. The queue shrinks as the system gets better. It does not go to zero, and we don't try to make it. Some insurance documents are genuinely ambiguous, and some decisions should always have a human in the loop.
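
The routing step itself can be very small. In this sketch the `ExtractedField` type and the 0.95 cutoff are illustrative assumptions; in practice the threshold would likely vary by field type and by how costly an error is.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # calibrated score in [0, 1]


REVIEW_THRESHOLD = 0.95  # illustrative only, not a recommended cutoff


def route(fields: list[ExtractedField],
          threshold: float = REVIEW_THRESHOLD,
          ) -> tuple[list[ExtractedField], list[ExtractedField]]:
    """Split fields into those accepted directly and those sent to review."""
    accepted, review_queue = [], []
    for f in fields:
        (accepted if f.confidence >= threshold else review_queue).append(f)
    return accepted, review_queue


fields = [
    ExtractedField("building_limit", "5000000", 0.99),
    ExtractedField("construction_type", "Joisted masonry", 0.58),
]
accepted, review_queue = route(fields)
print([f.name for f in accepted])
print([f.name for f in review_queue])
```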

What this layer makes possible

A normalized data layer is the foundation everything else sits on.

  • Benchmarking that compares like to like. We've written about why benchmarking has historically been bad in commercial property insurance. The short version is that you cannot compare "premium per square foot" between two policies with different deductibles, sublimits, and endorsements. Normalized coverage data across thousands of properties is what makes meaningful benchmarking possible.
  • Loss run intelligence. Years of unstructured claims PDFs across a portfolio, once normalized, become a feature set. Frequency curves by peril. Severity distributions. Time since last loss. This is what drives renewal forecasting and risk-adjusted pricing.
  • Compliance verification. Verifying that a policy meets a requirement requires both the structured fields and the underlying clause text. The combination is what allows clause-level verification at portfolio scale.
  • Submission preparation. Forms auto-populate from extracted exposures, prior policies, and broker history. The goal isn't to replace the broker. It's to remove the part of the broker's day spent retyping data that already exists.
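
As an example of what "loss runs become a feature set" can mean, the sketch below derives frequency by peril, simple severity statistics, and time since last loss from normalized claims. The `Claim` type, the feature names, and the sample claims are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from statistics import mean, median


@dataclass(frozen=True)
class Claim:
    loss_date: date
    peril: str
    incurred: float


def loss_features(claims: list[Claim], as_of: date, years: float) -> dict:
    """Summarize normalized claims over an observation window of `years`."""
    counts = Counter(c.peril for c in claims)
    amounts = [c.incurred for c in claims]
    last_loss = max((c.loss_date for c in claims), default=None)
    return {
        "frequency_per_year_by_peril": {p: n / years for p, n in counts.items()},
        "mean_severity": mean(amounts) if amounts else None,
        "median_severity": median(amounts) if amounts else None,
        "days_since_last_loss": (as_of - last_loss).days if last_loss else None,
    }


claims = [  # fabricated sample data, for illustration only
    Claim(date(2021, 3, 14), "wind", 40_000.0),
    Claim(date(2022, 8, 2), "water", 15_000.0),
    Claim(date(2023, 1, 20), "water", 5_000.0),
]
features = loss_features(claims, as_of=date(2024, 1, 20), years=5)
print(features)
```

An account with no claims returns `None` for severity and time since last loss instead of a misleading zero.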

What this layer doesn't do

A few things worth being honest about.

We don't extract from images below a readable resolution threshold. On a degraded scan, we capture what is legible and flag the rest. We can't extract data that isn't in the document. If a statement of values is missing construction type, the field stays empty and a human handles it. And we don't trust any single extraction blindly. The system produces probabilistic outputs with confidence attached, and the platform downstream is designed to deal with uncertainty deliberately rather than pretend it isn't there.

Why this matters

Insurance is not a field where "the model said so" is a satisfying answer. Every number has to have a source, a justification, and an audit trail. The systems that try to take a shortcut here tend to ship hallucinated coverage data that nobody catches until a claim gets denied.

Building a document intelligence layer that lives up to that standard is harder than it looks. There is no off-the-shelf version. There is no model release that solves it. The work is in the structure around the model, the validators between the extractions, and the discipline to flag uncertainty instead of papering over it. Doing it properly is the foundation of everything else we've built. Pricing, benchmarking, compliance, renewal forecasting. None of them are honest unless the underlying data is.