
Document Intelligence

Document Intelligence parses PDF and DOCX files into LLM-ready structure, extracts schema-driven fields, and combines parse with PII sanitization in a single workflow.

Beta: opt-in per tenant. Endpoints return 403 until the matching feature flag is enabled. During rollout, Document Intelligence is available to approved Professional and Business tenants, and by request on Starter. The quickest route to a first successful call today is raw HTTP; see Onboarding path below.

Parse, Extract, Safe-Parse — pick the right one

The three operations all run on the same parse substrate. They differ by what they return and what gets persisted.

| Operation | Endpoint | What it returns | When to use |
| --- | --- | --- | --- |
| Parse | POST /api/v1/parse | Raw canonical_document + markdown_render + chunks_v1 | You need RAG-ready chunks, search-ready text, or layout-aware structure. PII is not removed. |
| Extract | POST /api/v1/extract | extract_result matching your JSON Schema or template_id | You need specific fields out of a document (invoice totals, dates, names). |
| Safe-Parse | POST /api/v1/workflows/safe-parse | sanitized_canonical_document + sanitized_markdown_render + sanitized_chunks_v1 | You need parse output that is safe to embed, store, or send to a third-party LLM. The raw canonical is ephemeral and never returned. |

safe_parse is not a third parser. It is parse + sanitize exposed as one workflow so you never have to choreograph the two yourself, and so raw artifacts are never persisted by default.

extract runs an internal parse if you upload a file; pass parse_artifact_id to reuse a parse you already submitted instead.
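The selection rules above can be written down as a small lookup. This is a minimal sketch, not part of any published SDK: the names `pick_endpoint` and `extract_source`, the goal labels, and the `file` key are hypothetical. Only the endpoint paths and the `parse_artifact_id` field name come from this page.

```python
# Endpoint paths as listed in the operations table.
ENDPOINTS = {
    "chunks": "/api/v1/parse",                    # RAG-ready chunks, PII not removed
    "fields": "/api/v1/extract",                  # schema-driven field extraction
    "sanitized": "/api/v1/workflows/safe-parse",  # parse + sanitize in one workflow
}


def pick_endpoint(goal: str) -> str:
    """Return the endpoint path for a goal label (labels are hypothetical)."""
    try:
        return ENDPOINTS[goal]
    except KeyError:
        raise ValueError(f"unknown goal {goal!r}; expected one of {sorted(ENDPOINTS)}")


def extract_source(file_path=None, parse_artifact_id=None) -> dict:
    """Describe the input for extract: a fresh upload OR a reused parse, not both."""
    if (file_path is None) == (parse_artifact_id is None):
        raise ValueError("provide exactly one of file_path or parse_artifact_id")
    if parse_artifact_id is not None:
        return {"parse_artifact_id": parse_artifact_id}
    return {"file": file_path}  # "file" is an assumed key, not a documented one
```

Reusing `parse_artifact_id` avoids a second parse of a document you have already submitted.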

Onboarding path

The published Python SDK, Node SDK, CLI, and MCP server do not yet expose document-intelligence operations. New package versions that include parse/extract/safe-parse are being prepared. Until those ship:

  • Use raw HTTP (curl, httpx, fetch) for parse, extract, and safe-parse.
  • Use the SDKs for redaction.
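As a rough illustration of the raw-HTTP route, the sketch below builds (but does not send) a parse request using only the Python standard library. Several details are assumptions that this page does not confirm: the base URL is a placeholder, Bearer-token authentication is assumed, and the upload is assumed to be multipart/form-data with a field named `file`. Check the endpoint reference for the actual request shape before relying on it.

```python
import urllib.request
import uuid

BASE_URL = "https://api.example.com"  # placeholder, not the real host


def build_parse_request(filename: str, content: bytes, api_key: str) -> urllib.request.Request:
    """Build a POST /api/v1/parse request with a single-file multipart body."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # Field name "file" is an assumption.
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        BASE_URL + "/api/v1/parse",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # auth scheme is an assumption
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The same pattern applies to extract and safe-parse by swapping the path.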

When SDK / CLI / MCP support is published, this page will be updated and a migration callout added to each integration page.

Async, with artifacts

All three endpoints are asynchronous — they return a job_id immediately and process in the background. Poll GET /api/v1/documents/jobs/{job_id} for status and the list of produced artifact IDs.
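The submit-then-poll pattern can be sketched as follows. The job response shape is not specified on this page, so the `status` field and the terminal values `"completed"` and `"failed"` are assumptions. The `fetch_job` callable is a hypothetical stand-in for whatever code performs GET /api/v1/documents/jobs/{job_id} and returns the decoded JSON.

```python
import time
from typing import Callable

# Assumed terminal status values; adjust to the real job schema.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    fetch_job: Callable[[str], dict],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout_s}s")
        sleep(interval_s)
```

Injecting `fetch_job` and `sleep` keeps the loop independent of any HTTP client and easy to exercise without network access.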

Each job produces one or more artifacts — immutable JSON payloads stored on your behalf:

| Workflow | Artifacts produced |
| --- | --- |
| parse | canonical_document, markdown_render, chunks_v1 |
| extract | canonical_document (intermediate), extract_result |
| safe-parse | canonical_document (ephemeral), sanitized_canonical_document, sanitized_markdown_render, sanitized_chunks_v1 |

Retrieve artifact content with GET /api/v1/documents/{artifact_id}/content.
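A small sketch of how a client might build the content URL and keep track of which artifact types each workflow is expected to yield. The helper names and the base URL argument are hypothetical. The `PRIMARY_ARTIFACTS` mapping is one reading of the artifacts table that leaves out the entries marked intermediate or ephemeral.

```python
from urllib.parse import quote

# Artifact types per workflow, excluding those marked intermediate/ephemeral.
PRIMARY_ARTIFACTS = {
    "parse": ["canonical_document", "markdown_render", "chunks_v1"],
    "extract": ["extract_result"],
    "safe-parse": [
        "sanitized_canonical_document",
        "sanitized_markdown_render",
        "sanitized_chunks_v1",
    ],
}


def artifact_content_url(base_url: str, artifact_id: str) -> str:
    """Build the GET URL for an artifact's content, escaping the ID for the path."""
    return f"{base_url.rstrip('/')}/api/v1/documents/{quote(artifact_id, safe='')}/content"
```

Percent-encoding the ID is a defensive choice in case identifiers ever contain characters that are not path-safe.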

Supported file types

| Format | Extension |
| --- | --- |
| PDF | .pdf |
| Word document | .docx |

Other formats (images, audio, video) are supported by the Redaction API, not Document Intelligence.
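A client can reject unsupported files before uploading. The check below is a minimal sketch: `is_supported` is a hypothetical helper, and treating extensions case-insensitively is an assumption about how the service behaves.

```python
from pathlib import PurePath

# Extensions accepted by Document Intelligence, per the table above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx"}


def is_supported(filename: str) -> bool:
    """Return True if the file's final extension is .pdf or .docx."""
    return PurePath(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

This inspects only the file name. It does not verify that the content is actually a PDF or DOCX file.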

Feature flags and beta gating

Each operation is gated by a separate feature flag. Contact support to enable them on your tenant:

| Flag | Endpoint | Plan default |
| --- | --- | --- |
| document_parse_api | /parse | Off on Free; opt-in on Starter; on for approved Professional/Business |
| document_extract_api | /extract | Off on Free; opt-in on Starter; on for approved Professional/Business |
| document_safe_parse_workflow | /workflows/safe-parse | Off on Free; opt-in on Starter; on for approved Professional/Business |

Calls to a disabled endpoint return 403 with the message "feature '<flag>' is not enabled for this tenant".
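To tell a disabled feature flag apart from other 403 responses, a client can look for the message format quoted above. This is a sketch under assumptions: `disabled_flag` is a hypothetical helper, and because the response envelope (plain text or a JSON field) is not documented here, it searches the raw body text.

```python
import re
from typing import Optional

# Matches the documented message and captures the flag name.
_FLAG_MESSAGE = re.compile(r"feature '([^']+)' is not enabled for this tenant")


def disabled_flag(status_code: int, body_text: str) -> Optional[str]:
    """Return the disabled flag name for a feature-gating 403, else None."""
    if status_code != 403:
        return None
    match = _FLAG_MESSAGE.search(body_text)
    return match.group(1) if match else None
```

A non-None result is a signal to contact support about enabling that flag instead of retrying the call.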

Upload size limits

| Plan | Max file size |
| --- | --- |
| Free | 10 MB |
| Starter | 50 MB |
| Professional | 50 MB |
| Business | 200 MB |
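The limits above lend themselves to a pre-upload check. In this sketch `fits_plan_limit` is a hypothetical helper, and 1 MB is assumed to mean 1,000,000 bytes. The page does not say whether the service counts decimal or binary megabytes, so the smaller interpretation is used to stay on the safe side.

```python
# Per-plan upload limits in MB, from the table above.
MAX_UPLOAD_MB = {"free": 10, "starter": 50, "professional": 50, "business": 200}

# Assumption: decimal megabytes (the more conservative reading).
BYTES_PER_MB = 1_000_000


def fits_plan_limit(size_bytes: int, plan: str) -> bool:
    """Return True if a file of size_bytes is within the plan's upload limit."""
    return size_bytes <= MAX_UPLOAD_MB[plan.lower()] * BYTES_PER_MB
```

For a file on disk, `os.path.getsize(path)` gives the byte count to pass in.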

Next steps