Document Intelligence

Document Intelligence parses PDF and DOCX files into LLM-ready structure, extracts schema-driven fields, and combines parse with PII sanitization in a single workflow.

Beta: opt-in per tenant. Endpoints return 403 until the matching feature flag is enabled. During rollout, Document Intelligence is available to approved Professional and Business tenants, and by request on Starter. The quickest route to a first successful call today is raw HTTP; see Onboarding path below.

Parse, Extract, Safe-Parse — pick the right one

The three operations all run on the same parse substrate. They differ by what they return and what gets persisted.

Operation	Endpoint	What it returns	When to use
Parse	`POST /api/v1/parse`	Raw `canonical_document` + `markdown_render` + `chunks_v1`	You need RAG-ready chunks, search-ready text, or layout-aware structure. PII is not removed.
Extract	`POST /api/v1/extract`	`extract_result` matching your JSON Schema or `template_id`	You need specific fields out of a document (invoice totals, dates, names).
Safe-Parse	`POST /api/v1/workflows/safe-parse`	`sanitized_canonical_document` + `sanitized_markdown_render` + `sanitized_chunks_v1`	You need parse output that is safe to embed, store, or send to a third-party LLM. The raw canonical is ephemeral and never returned.

Safe-Parse is not a third parser. It is parse plus sanitize exposed as one workflow, so you never have to orchestrate the two steps yourself and raw artifacts are never persisted by default.

Extract runs an internal parse if you upload a file. To reuse a parse you already submitted, pass `parse_artifact_id` instead.

Onboarding path

The published Python SDK, Node SDK, CLI, and MCP server do not yet expose document-intelligence operations. New package versions that include parse/extract/safe-parse are being prepared. Until those ship:

Use raw HTTP (curl, httpx, fetch) for parse, extract, and safe-parse.
Use the SDKs for redaction.
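As a rough illustration of the raw HTTP path, the sketch below builds (but does not send) an upload request for any of the three operations using only the Python standard library. The endpoint paths come from the table above. The host `https://api.example.com`, the `Authorization: Bearer` header, and the multipart field name `file` are assumptions made for illustration; check your tenant's API reference for the actual values.

```python
import urllib.request
import uuid

# Hypothetical host: substitute your tenant's API base URL.
BASE_URL = "https://api.example.com"

# Endpoint paths as listed on this page.
ENDPOINTS = {
    "parse": "/api/v1/parse",
    "extract": "/api/v1/extract",
    "safe-parse": "/api/v1/workflows/safe-parse",
}


def build_upload_request(operation, filename, data, api_key):
    """Build a multipart/form-data POST for one of the three operations.

    Assumptions (not confirmed by this page): Bearer-token auth and a
    form field named "file" carrying the document bytes.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        data,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        BASE_URL + ENDPOINTS[operation],
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the request easy to inspect before any network call.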

When SDK / CLI / MCP support is published, this page will be updated and a migration callout added to each integration page.

Async, with artifacts

All three endpoints are asynchronous: they return a `job_id` immediately and process in the background. Poll `GET /api/v1/documents/jobs/{job_id}` for status and the list of produced artifact IDs.
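A polling loop for the job endpoint might look like the sketch below. The HTTP call is injected as `fetch_json` so the loop can be exercised without a network. The response field `status` and the terminal values `"completed"` and `"failed"` are assumptions, since this page does not list the job states.

```python
import time

# Assumed terminal states; the real status vocabulary is not documented here.
TERMINAL_STATUSES = {"completed", "failed"}


def job_url(base_url, job_id):
    """Return the job status URL described on this page."""
    return f"{base_url.rstrip('/')}/api/v1/documents/jobs/{job_id}"


def wait_for_job(fetch_json, base_url, job_id,
                 interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll the job until it reaches an assumed terminal status.

    fetch_json is any callable that takes a URL and returns the decoded
    JSON body as a dict (for example, a thin wrapper around urllib).
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_json(job_url(base_url, job_id))
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        sleep(interval)
```

A fixed interval keeps the sketch short. A production client would likely add backoff and distinguish a failed job from a completed one before reading artifact IDs.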

Each job produces one or more artifacts — immutable JSON payloads stored on your behalf:

Workflow	Artifacts produced
`parse`	`canonical_document`, `markdown_render`, `chunks_v1`
`extract`	`canonical_document` (intermediate), `extract_result`
`safe-parse`	`canonical_document` (ephemeral), `sanitized_canonical_document`, `sanitized_markdown_render`, `sanitized_chunks_v1`

Retrieve artifact content with `GET /api/v1/documents/{artifact_id}/content`.
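Since artifacts are JSON payloads, fetching one can be sketched with the standard library as follows. The base URL and the Bearer authorization header are assumptions; only the path shape comes from this page.

```python
import json
import urllib.request
from urllib.parse import quote


def artifact_content_url(base_url, artifact_id):
    """Return the content URL for an artifact, escaping the ID for safety."""
    return f"{base_url.rstrip('/')}/api/v1/documents/{quote(artifact_id, safe='')}/content"


def fetch_artifact(base_url, artifact_id, api_key, opener=urllib.request.urlopen):
    """Download one artifact and decode it as JSON.

    Bearer-token auth is an assumption; adjust to your tenant's scheme.
    """
    req = urllib.request.Request(
        artifact_content_url(base_url, artifact_id),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with opener(req) as resp:
        return json.loads(resp.read())
```

The `opener` parameter exists only so the function can be tested with a stand-in for `urlopen`.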

Supported file types

Format	Extension
PDF	`.pdf`
Word document	`.docx`

Other formats (images, audio, video) are supported by the Redaction API, not Document Intelligence.
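A client can reject unsupported files before uploading. The minimal check below is based on the table above; treating the extension as case-insensitive is an assumption, and the server may also validate file contents.

```python
from pathlib import Path

# Extensions listed in the supported file types table.
SUPPORTED_EXTENSIONS = {".pdf", ".docx"}


def is_supported(filename):
    """Return True if the filename ends in a supported extension."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```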

Feature flags and beta gating

Each operation is gated by a separate feature flag. Contact support to enable them on your tenant:

Flag	Endpoint	Plan default
`document_parse_api`	`/parse`	Off on Free; opt-in on Starter; on for approved Professional/Business
`document_extract_api`	`/extract`	Off on Free; opt-in on Starter; on for approved Professional/Business
`document_safe_parse_workflow`	`/workflows/safe-parse`	Off on Free; opt-in on Starter; on for approved Professional/Business

Calls to a disabled endpoint return 403 with the message `feature '<flag>' is not enabled for this tenant`.
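To tell a feature-flag rejection apart from other 403 responses, a client can match on the message text. The sketch below takes the status code and the message as plain arguments because this page does not specify where in the response body the message appears.

```python
import re

# Pattern built from the error message quoted on this page.
_FLAG_ERROR = re.compile(r"feature '([^']+)' is not enabled for this tenant")


def disabled_flag(status_code, message):
    """Return the disabled flag name, or None if this is not a flag error."""
    if status_code != 403:
        return None
    match = _FLAG_ERROR.search(message)
    return match.group(1) if match else None
```

A non-`None` result names the flag to request from support, for example `document_parse_api`.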

Upload size limits

Plan	Max file size
Free	10 MB
Starter	50 MB
Professional	50 MB
Business	200 MB
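A pre-flight size check avoids uploading a file the server will reject. The limits below are copied from the table. Whether the API counts 1 MB as 1,000,000 or 1,048,576 bytes is not stated, so this sketch assumes the smaller decimal value to stay on the safe side.

```python
# Limits in MB, from the upload size table.
MAX_UPLOAD_MB = {"free": 10, "starter": 50, "professional": 50, "business": 200}

# Assumption: decimal megabytes. The page does not define the unit precisely.
BYTES_PER_MB = 1_000_000


def within_limit(plan, size_bytes):
    """Return True if a file of size_bytes fits the plan's upload limit."""
    return size_bytes <= MAX_UPLOAD_MB[plan.lower()] * BYTES_PER_MB
```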

Next steps

Parse
Extract
Safe Parse
Workflows (RAG, extraction recipes)