Document Intelligence
Document Intelligence parses PDF and DOCX files into LLM-ready structure, extracts schema-driven fields, and combines parse with PII sanitization in a single workflow.
Beta: opt-in per tenant. Endpoints return 403 until the matching feature flag is enabled. Document Intelligence is available for approved Professional and Business tenants during rollout, and by request on Starter. The shortest reliable path to a first successful call today is raw HTTP; see Onboarding path below.
Parse, Extract, Safe-Parse — pick the right one
The three operations all run on the same parse substrate. They differ by what they return and what gets persisted.
| Operation | Endpoint | What it returns | When to use |
|---|---|---|---|
| Parse | POST /api/v1/parse | Raw canonical_document + markdown_render + chunks_v1 | You need RAG-ready chunks, search-ready text, or layout-aware structure. PII is not removed. |
| Extract | POST /api/v1/extract | extract_result matching your JSON Schema or template_id | You need specific fields out of a document (invoice totals, dates, names). |
| Safe-Parse | POST /api/v1/workflows/safe-parse | sanitized_canonical_document + sanitized_markdown_render + sanitized_chunks_v1 | You need parse output that is safe to embed, store, or send to a third-party LLM. The raw canonical is ephemeral and never returned. |
safe_parse is not a third parser. It is parse + sanitize exposed as one workflow so you never have to choreograph the two yourself, and so raw artifacts are never persisted by default.
extract runs an internal parse if you upload a file; pass parse_artifact_id to reuse a parse you already submitted instead.
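The two choices described here (schema or template, file upload or reused parse) can be checked before a request is sent. The sketch below is illustrative only: `template_id` and `parse_artifact_id` come from this page, while the `schema` field name, the JSON encoding of the schema, and the rule that exactly one option of each pair is given are assumptions.

```python
import json


def build_extract_fields(*, schema=None, template_id=None,
                         parse_artifact_id=None, has_file=False):
    """Return the non-file form fields for an extract call.

    Hypothetical helper: the "schema" field name and the mutual-exclusion
    rules are assumptions, not documented behavior.
    """
    # Exactly one of schema / template_id describes the fields to extract.
    if (schema is None) == (template_id is None):
        raise ValueError("provide exactly one of schema or template_id")
    # Either upload a file (internal parse) or reuse an earlier parse.
    if has_file == (parse_artifact_id is not None):
        raise ValueError("provide a file upload or parse_artifact_id, not both or neither")

    fields = {}
    if schema is not None:
        fields["schema"] = json.dumps(schema)
    else:
        fields["template_id"] = template_id
    if parse_artifact_id is not None:
        fields["parse_artifact_id"] = parse_artifact_id
    return fields
```

Reusing a parse would then look like `build_extract_fields(template_id="tpl_invoice", parse_artifact_id="art_1")`, where both identifiers are placeholders.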
Onboarding path
The published Python SDK, Node SDK, CLI, and MCP server do not yet expose document-intelligence operations. New package versions that include parse/extract/safe-parse are being prepared. Until those ship:
- Use raw HTTP (curl, httpx, fetch) for parse, extract, and safe-parse.
- Use the SDKs for redaction.
When SDK / CLI / MCP support is published, this page will be updated and a migration callout added to each integration page.
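As a rough illustration of the raw-HTTP path, the sketch below builds a multipart upload for any of the three endpoints using only the Python standard library. The endpoint paths come from this page; the base URL, the bearer-token `Authorization` header, and the `file` form field name are assumptions, so check them against your tenant's API reference before relying on them.

```python
import json
import urllib.request
import uuid

# Assumptions (not confirmed by this page): base URL, bearer-token auth,
# and the multipart form field name "file".
BASE_URL = "https://api.example.com"

ENDPOINTS = {
    "parse": "/api/v1/parse",
    "extract": "/api/v1/extract",
    "safe-parse": "/api/v1/workflows/safe-parse",
}


def build_upload_request(operation, filename, content, api_key):
    """Build (but do not send) a multipart POST for one operation."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        BASE_URL + ENDPOINTS[operation],
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def submit(request):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Building the request separately from sending it keeps the upload logic easy to inspect and test; the same shape carries over to httpx or fetch.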
Async, with artifacts
All three endpoints are asynchronous — they return a job_id immediately and process in the background. Poll GET /api/v1/documents/jobs/{job_id} for status and the list of produced artifact IDs.
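A polling loop for this pattern might look like the sketch below. The job payload schema is not documented on this page, so the `status` field and the terminal values `completed` and `failed` are assumptions; `fetch_job` stands in for whatever HTTP client call you use against GET /api/v1/documents/jobs/{job_id}.

```python
import time

# Assumption: the job payload is a dict with a "status" field whose
# terminal values are "completed" and "failed".
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(fetch_job, job_id, interval=2.0, timeout=300.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job reaches a terminal status or the timeout elapses.

    fetch_job is any callable that performs the GET request for job_id
    and returns the decoded JSON as a dict.
    """
    deadline = clock() + timeout
    while True:
        job = fetch_job(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {job.get('status')!r} after {timeout}s"
            )
        sleep(interval)
```

Injecting `fetch_job`, `sleep`, and `clock` keeps the loop independent of any particular HTTP library and lets it be exercised without network access.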
Each job produces one or more artifacts — immutable JSON payloads stored on your behalf:
| Workflow | Artifacts produced |
|---|---|
| parse | canonical_document, markdown_render, chunks_v1 |
| extract | canonical_document (intermediate), extract_result |
| safe-parse | canonical_document (ephemeral), sanitized_canonical_document, sanitized_markdown_render, sanitized_chunks_v1 |
Retrieve artifact content with GET /api/v1/documents/{artifact_id}/content.
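Once a job finishes, a client may want to confirm that the artifacts it expects are present before fetching them. The sketch below encodes the returned artifact kinds listed on this page and builds the content path; how a job response labels each artifact's kind is not documented here, so passing a plain list of kind names is an assumption.

```python
# Artifact kinds each workflow returns, per the tables on this page.
# The intermediate and ephemeral canonical_document entries are left out
# because they are not the deliverable of those workflows.
RETURNED_ARTIFACTS = {
    "parse": ["canonical_document", "markdown_render", "chunks_v1"],
    "extract": ["extract_result"],
    "safe-parse": [
        "sanitized_canonical_document",
        "sanitized_markdown_render",
        "sanitized_chunks_v1",
    ],
}


def content_path(artifact_id):
    """Path for retrieving one artifact's content."""
    return f"/api/v1/documents/{artifact_id}/content"


def missing_artifacts(workflow, produced_kinds):
    """Return the expected artifact kinds that the job did not produce."""
    produced = set(produced_kinds)
    return [kind for kind in RETURNED_ARTIFACTS[workflow] if kind not in produced]
```

An empty list from `missing_artifacts` means every expected kind was reported, and `content_path` can then be called for each artifact ID.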
Supported file types
| Format | Extension |
|---|---|
| PDF | .pdf |
| Word document | .docx |
Other formats (images, audio, video) are supported by the Redaction API, not Document Intelligence.
Feature flags and beta gating
Each operation is gated by a separate feature flag. Contact support to enable them on your tenant:
| Flag | Endpoint | Plan default |
|---|---|---|
| document_parse_api | /parse | Off on Free; opt-in on Starter; on for approved Professional/Business |
| document_extract_api | /extract | Off on Free; opt-in on Starter; on for approved Professional/Business |
| document_safe_parse_workflow | /workflows/safe-parse | Off on Free; opt-in on Starter; on for approved Professional/Business |
Calls to a disabled endpoint return 403 with the message feature '<flag>' is not enabled for this tenant.
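Client code can tell a feature-flag 403 apart from other authorization failures by matching that message. The sketch below assumes the message text arrives as a plain string; where it sits in the response body (for example, under a JSON key) is not specified on this page.

```python
import re

# Pattern taken from the documented error message.
FLAG_ERROR = re.compile(r"feature '([^']+)' is not enabled for this tenant")


def disabled_flag(status_code, message):
    """Return the flag name if this is a feature-flag 403, else None."""
    if status_code != 403:
        return None
    match = FLAG_ERROR.search(message or "")
    return match.group(1) if match else None
```

A non-`None` result names the flag to ask support to enable; any other 403 should be handled as an ordinary permission error.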
Upload size limits
| Plan | Max file size |
|---|---|
| Free | 10 MB |
| Starter | 50 MB |
| Professional | 50 MB |
| Business | 200 MB |
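Checking the file size locally avoids uploading a document that would be rejected. The limits below are copied from the table; treating 1 MB as 1,048,576 bytes and the limit as inclusive are assumptions, since this page does not define either.

```python
# Per-plan upload limits in MB, from the table above.
MAX_UPLOAD_MB = {
    "free": 10,
    "starter": 50,
    "professional": 50,
    "business": 200,
}

# Assumption: 1 MB = 1024 * 1024 bytes (the page does not say which unit applies).
BYTES_PER_MB = 1024 * 1024


def fits_upload_limit(plan, size_bytes):
    """Return True if a file of size_bytes is within the plan's limit."""
    limit_bytes = MAX_UPLOAD_MB[plan.lower()] * BYTES_PER_MB
    return size_bytes <= limit_bytes
```

For a file on disk, `fits_upload_limit("starter", os.path.getsize(path))` would perform the check after importing `os`.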