Parse

POST /api/v1/parse submits a PDF or DOCX file for structural parsing. Processing is asynchronous: the endpoint returns immediately with a job ID (the id field of the response). Pass that ID as {job_id} to GET /api/v1/documents/jobs/{job_id} and poll until the job completes or fails.

Required feature flag: document_parse_api

Request

The request is a multipart/form-data upload.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| file | file | Yes | PDF or DOCX file to parse |
| language | string | No | Language code (default: en) |
| config | string | No | JSON string with additional config options |
| tenant_id | string | No | Tenant override (defaults to the authenticated tenant) |

Example

curl -X POST https://api.expunct.ai/api/v1/parse \
  -H "X-API-Key: pk_live_abc123" \
  -F "file=@/path/to/document.pdf" \
  -F "language=en"
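The same upload can be sketched in Python with only the standard library. This is a minimal illustration, not an official client: the function names `build_parse_request` and `submit_for_parsing` are hypothetical, and the multipart body is assembled by hand from the `file` and `language` fields listed above.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path

API_BASE = "https://api.expunct.ai"  # host taken from the curl example


def build_parse_request(path, api_key, language="en"):
    """Build (but do not send) a multipart/form-data POST to /api/v1/parse."""
    file_path = Path(path)
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(file_path.name)[0] or "application/octet-stream"

    # Plain text form field: language
    language_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="language"\r\n'
        "\r\n"
        f"{language}\r\n"
    ).encode("utf-8")

    # File form field: headers first, raw bytes appended below
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_path.name}"\r\n'
        f"Content-Type: {mime}\r\n"
        "\r\n"
    ).encode("utf-8")

    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = language_part + file_header + file_path.read_bytes() + closing

    return urllib.request.Request(
        f"{API_BASE}/api/v1/parse",
        data=body,
        method="POST",
        headers={
            "X-API-Key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def submit_for_parsing(path, api_key, language="en"):
    """Send the upload and return the job ID from the 202 response."""
    request = build_parse_request(path, api_key, language)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["id"]
```

Splitting request construction from sending keeps the multipart encoding easy to inspect without touching the network; `submit_for_parsing("/path/to/document.pdf", "pk_live_abc123")` would perform the actual call.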

Response (202 Accepted)

| Field | Type | Description |
| --- | --- | --- |
| id | string | Job ID; use it to poll status |
| status | string | Initial status: queued |
| workflow_kind | string | Always parse |
| media_type | string | pdf or docx |
| progress_pct | number | Processing progress (0–100) |
| created_at | string | ISO 8601 timestamp |
| updated_at | string | ISO 8601 timestamp |

{
  "id": "3f2a1b4c-8d9e-4f2a-b1c4-5d6e7f8a9b0c",
  "status": "queued",
  "workflow_kind": "parse",
  "media_type": "pdf",
  "progress_pct": 0,
  "created_at": "2025-03-01T12:00:00Z",
  "updated_at": "2025-03-01T12:00:00Z"
}

Polling for completion

curl https://api.expunct.ai/api/v1/documents/jobs/3f2a1b4c-8d9e-4f2a-b1c4-5d6e7f8a9b0c \
  -H "X-API-Key: pk_live_abc123"
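A polling loop around this request might look like the sketch below. It treats completed and failed as the terminal statuses, as documented for the job detail response. The helper names, the 2-second interval, and the 5-minute timeout are assumptions chosen for illustration; the API does not prescribe a polling cadence here.

```python
import json
import time
import urllib.request

API_BASE = "https://api.expunct.ai"  # host taken from the curl example
TERMINAL_STATUSES = {"completed", "failed"}


def fetch_job(job_id, api_key):
    """GET the job detail document for one job."""
    request = urllib.request.Request(
        f"{API_BASE}/api/v1/documents/jobs/{job_id}",
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(job_id, fetch, interval_s=2.0, timeout_s=300.0, sleep=time.sleep):
    """Poll fetch(job_id) until the job reaches a terminal status.

    interval_s and timeout_s are illustrative defaults, not documented limits.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout_s}s")
        sleep(interval_s)
```

A caller would wire the two together with something like `wait_for_job(job_id, lambda j: fetch_job(j, "pk_live_abc123"))` and then check whether the returned status is completed or failed before reading artifacts.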

Job detail response

GET /api/v1/documents/jobs/{job_id} returns the job plus all produced artifacts:

| Field | Type | Description |
| --- | --- | --- |
| id | string | Job ID |
| status | string | queued, processing, completed, or failed |
| workflow_kind | string | parse |
| progress_pct | number | 0–100 |
| error_message | string | Set only when status is failed |
| artifacts | array | List of artifact metadata objects |

Each artifact in artifacts:

| Field | Type | Description |
| --- | --- | --- |
| id | string | Artifact ID |
| artifact_kind | string | See artifact types below |
| page_count | number | Page count (canonical document only) |
| retention_class | string | short_ttl or persistent |
| payload_purged | boolean | true if content has been deleted |
| created_at | string | ISO 8601 timestamp |

Artifact types

A completed parse job produces three artifacts:

| artifact_kind | Description |
| --- | --- |
| canonical_document | Structured JSON: pages, blocks, tables, reading order |
| markdown_render | Document rendered as Markdown |
| chunks_v1 | Semantic chunks ready for vector embedding |

Retrieving artifact content

curl https://api.expunct.ai/api/v1/documents/{artifact_id}/content \
  -H "X-API-Key: pk_live_abc123"
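To go from a job detail response to artifact content, one option is to index the artifacts by artifact_kind and skip any whose payload_purged flag is true, since their content has been deleted. The sketch below assumes exactly that; `retrievable_artifacts` and `fetch_artifact_content` are hypothetical helper names, and the content is returned as raw bytes because the response encoding for each kind is not specified here.

```python
import urllib.request

API_BASE = "https://api.expunct.ai"  # host taken from the curl example


def retrievable_artifacts(job):
    """Map artifact_kind -> artifact ID, leaving out purged artifacts."""
    return {
        artifact["artifact_kind"]: artifact["id"]
        for artifact in job.get("artifacts", [])
        if not artifact.get("payload_purged", False)
    }


def fetch_artifact_content(artifact_id, api_key):
    """GET the raw content bytes for one artifact."""
    request = urllib.request.Request(
        f"{API_BASE}/api/v1/documents/{artifact_id}/content",
        headers={"X-API-Key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

For example, `retrievable_artifacts(job).get("markdown_render")` yields the ID to pass to `fetch_artifact_content`, or None if that artifact is missing or purged.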

canonical_document shape

{
  "document_id": "3f2a1b4c-...",
  "page_count": 3,
  "block_count": 42,
  "parse_route": "text_native",
  "parse_duration_ms": 380,
  "pages": [
    {
      "page_number": 1,
      "blocks": [
        {
          "block_id": "b_001",
          "kind": "heading",
          "text": "Invoice #INV-2024-001",
          "reading_order": 0
        },
        {
          "block_id": "b_002",
          "kind": "paragraph",
          "text": "Issued: March 1, 2025",
          "reading_order": 1
        }
      ],
      "tables": []
    }
  ]
}

parse_route values:

| Value | Meaning |
| --- | --- |
| text_native | Text extracted directly from the PDF layer |
| hybrid_default | Mix of native text and OCR |
| hybrid_verified | OCR output verified against native text |
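As an illustration of consuming the canonical_document shape, the sketch below walks pages by page_number and blocks by reading_order to recover plain text. Whether reading_order restarts on each page or runs across the whole document is not stated here, so sorting within each page is an assumption that holds in either case. The function names are hypothetical.

```python
def blocks_in_reading_order(document):
    """Return all blocks, ordered by page and then by reading_order."""
    ordered = []
    for page in sorted(document["pages"], key=lambda p: p["page_number"]):
        ordered.extend(sorted(page["blocks"], key=lambda b: b["reading_order"]))
    return ordered


def plain_text(document):
    """Join block texts, one block per line."""
    return "\n".join(block["text"] for block in blocks_in_reading_order(document))
```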

chunks_v1 shape

{
  "document_id": "3f2a1b4c-...",
  "source_artifact_id": "art_abc...",
  "chunks": [
    {
      "chunk_id": "c_001",
      "text": "Invoice #INV-2024-001\nIssued: March 1, 2025",
      "page_number": 1,
      "block_ids": ["b_001", "b_002"],
      "token_count": 12
    }
  ]
}
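Before embedding, the chunks usually need to be reshaped into whatever record format the vector store expects. The sketch below shows one possible mapping; the record layout (id, text, metadata) and the `document_id:chunk_id` key format are assumptions made for illustration, not part of the API.

```python
def to_embedding_records(chunks_doc):
    """Flatten a chunks_v1 payload into records for a vector store.

    The output layout is a hypothetical convention, not defined by the API.
    """
    document_id = chunks_doc["document_id"]
    return [
        {
            "id": f"{document_id}:{chunk['chunk_id']}",
            "text": chunk["text"],
            "metadata": {
                "page_number": chunk["page_number"],
                "block_ids": chunk["block_ids"],
                "token_count": chunk["token_count"],
            },
        }
        for chunk in chunks_doc["chunks"]
    ]
```

Keeping block_ids in the metadata makes it possible to trace a retrieved chunk back to the blocks of the canonical document it came from.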

Error responses

| Status | Meaning |
| --- | --- |
| 400 | Unsupported file type or invalid config JSON |
| 403 | Feature flag document_parse_api not enabled |
| 413 | File exceeds plan upload limit |
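A client can translate these status codes into readable messages with a small lookup. The sketch below copies the messages from the table above; the names `PARSE_ERRORS` and `describe_parse_error` are hypothetical, and any status not listed falls through to a generic message because the error body format is not described here.

```python
# Messages copied from the error table; other statuses are not documented here.
PARSE_ERRORS = {
    400: "Unsupported file type or invalid config JSON",
    403: "Feature flag document_parse_api not enabled",
    413: "File exceeds plan upload limit",
}


def describe_parse_error(status_code):
    """Return a human-readable explanation for a failed parse request."""
    return PARSE_ERRORS.get(status_code, f"Unexpected HTTP status {status_code}")
```

With the standard library, this would typically be called from an `except urllib.error.HTTPError as err:` block as `describe_parse_error(err.code)`.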