Parse
POST /api/v1/parse submits a PDF or DOCX for structural parsing. Processing is asynchronous: the endpoint returns a job ID (the id field of the response) immediately. Poll for completion using GET /api/v1/documents/jobs/{job_id}.
Required feature flag: document_parse_api
Request
The request is a multipart/form-data upload.
| Field | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | PDF or DOCX file to parse |
| language | string | No | Language code (default: en) |
| config | string | No | JSON string with additional config options |
| tenant_id | string | No | Tenant override (defaults to authenticated tenant) |
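The fields in the table map directly onto a multipart upload. The following is a minimal Python sketch, assuming the third-party requests library is installed; the helper names build_form_fields and submit_for_parsing are illustrative and not part of the API. Because the config field is a JSON string, the sketch serializes a dictionary before sending it.

```python
import json
from pathlib import Path

PARSE_URL = "https://api.expunct.ai/api/v1/parse"


def build_form_fields(language="en", config=None, tenant_id=None):
    """Return the non-file form fields for POST /api/v1/parse."""
    fields = {"language": language}
    if config is not None:
        # config travels as a JSON *string*, not as nested form fields.
        fields["config"] = json.dumps(config)
    if tenant_id is not None:
        fields["tenant_id"] = tenant_id
    return fields


def submit_for_parsing(path, api_key, **options):
    """Upload a PDF or DOCX and return the decoded 202 response body."""
    import requests  # third-party dependency (assumption): pip install requests

    file_path = Path(path)
    with file_path.open("rb") as handle:
        response = requests.post(
            PARSE_URL,
            headers={"X-API-Key": api_key},
            files={"file": (file_path.name, handle)},
            data=build_form_fields(**options),
            timeout=30,
        )
    response.raise_for_status()
    return response.json()
```

Calling submit_for_parsing("/path/to/document.pdf", "pk_live_abc123", language="en")["id"] would yield the job ID to poll with.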
Example
cURL
curl -X POST https://api.expunct.ai/api/v1/parse \
  -H "X-API-Key: pk_live_abc123" \
  -F "file=@/path/to/document.pdf" \
  -F "language=en"

Response (202 Accepted)
| Field | Type | Description |
|---|---|---|
| id | string | Job ID; use it to poll status |
| status | string | Initial status: queued |
| workflow_kind | string | Always parse |
| media_type | string | pdf or docx |
| progress_pct | number | Processing progress (0–100) |
| created_at | string | ISO 8601 timestamp |
| updated_at | string | ISO 8601 timestamp |
{
  "id": "3f2a1b4c-8d9e-4f2a-b1c4-5d6e7f8a9b0c",
  "status": "queued",
  "workflow_kind": "parse",
  "media_type": "pdf",
  "progress_pct": 0,
  "created_at": "2025-03-01T12:00:00Z",
  "updated_at": "2025-03-01T12:00:00Z"
}

Polling for completion
cURL
curl https://api.expunct.ai/api/v1/documents/jobs/3f2a1b4c-8d9e-4f2a-b1c4-5d6e7f8a9b0c \
  -H "X-API-Key: pk_live_abc123"

Job detail response
GET /api/v1/documents/jobs/{job_id} returns the job plus all produced artifacts:
| Field | Type | Description |
|---|---|---|
| id | string | Job ID |
| status | string | queued, processing, completed, or failed |
| workflow_kind | string | parse |
| progress_pct | number | 0–100 |
| error_message | string | Set only when status is failed |
| artifacts | array | List of artifact metadata objects |
Each artifact in artifacts:
| Field | Type | Description |
|---|---|---|
| id | string | Artifact ID |
| artifact_kind | string | See artifact types below |
| page_count | number | Page count (canonical document only) |
| retention_class | string | short_ttl or persistent |
| payload_purged | boolean | true if content has been deleted |
| created_at | string | ISO 8601 timestamp |
Artifact types
A completed parse job produces three artifacts:
| artifact_kind | Description |
|---|---|
| canonical_document | Structured JSON: pages, blocks, tables, reading order |
| markdown_render | Document rendered as Markdown |
| chunks_v1 | Semantic chunks ready for vector embedding |
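Since each artifact carries an artifact_kind, a client will usually want to look artifacts up by kind rather than by position in the array. The helper below is one way to do that, assuming the job detail has already been decoded into a dictionary. It skips artifacts whose payload_purged flag is true, since their content has been deleted. The function name is illustrative.

```python
def artifacts_by_kind(job):
    """Map artifact_kind to artifact metadata, skipping purged payloads."""
    return {
        artifact["artifact_kind"]: artifact
        for artifact in job.get("artifacts", [])
        if not artifact.get("payload_purged", False)
    }
```

For a completed job, artifacts_by_kind(job)["markdown_render"]["id"] would give the artifact ID of the Markdown rendering, provided its payload has not been purged.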
Retrieving artifact content
curl https://api.expunct.ai/api/v1/documents/{artifact_id}/content \
  -H "X-API-Key: pk_live_abc123"

canonical_document shape
{
  "document_id": "3f2a1b4c-...",
  "page_count": 3,
  "block_count": 42,
  "parse_route": "text_native",
  "parse_duration_ms": 380,
  "pages": [
    {
      "page_number": 1,
      "blocks": [
        {
          "block_id": "b_001",
          "kind": "heading",
          "text": "Invoice #INV-2024-001",
          "reading_order": 0
        },
        {
          "block_id": "b_002",
          "kind": "paragraph",
          "text": "Issued: March 1, 2025",
          "reading_order": 1
        }
      ],
      "tables": []
    }
  ]
}

parse_route values:
| Value | Meaning |
|---|---|
| text_native | Text extracted directly from the PDF layer |
| hybrid_default | Mix of native text and OCR |
| hybrid_verified | OCR output verified against native text |
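To illustrate how the canonical_document structure might be consumed, the sketch below flattens a document into plain text by visiting pages in page_number order and blocks in reading_order. Whether reading_order restarts on each page or runs across the whole document is not stated here, so the sketch sorts within each page, which gives the same result under either interpretation. The function name is illustrative.

```python
def document_text(canonical_document):
    """Join block text page by page, following each block's reading_order."""
    lines = []
    pages = sorted(canonical_document["pages"], key=lambda page: page["page_number"])
    for page in pages:
        blocks = sorted(page["blocks"], key=lambda block: block["reading_order"])
        for block in blocks:
            lines.append(block["text"])
    return "\n".join(lines)
```

Applied to the sample document above, this returns the heading text followed by the paragraph text, separated by a newline.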
chunks_v1 shape
{
  "document_id": "3f2a1b4c-...",
  "source_artifact_id": "art_abc...",
  "chunks": [
    {
      "chunk_id": "c_001",
      "text": "Invoice #INV-2024-001\nIssued: March 1, 2025",
      "page_number": 1,
      "block_ids": ["b_001", "b_002"],
      "token_count": 12
    }
  ]
}

Error responses
| Status | Meaning |
|---|---|
| 400 | Unsupported file type or invalid config JSON |
| 403 | Feature flag document_parse_api not enabled |
| 413 | File exceeds plan upload limit |
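A client can turn these status codes into descriptive exceptions. The sketch below relies only on the HTTP status, because the shape of the error response body is not described here. ParseRequestError and check_submit_status are hypothetical names introduced for illustration.

```python
ERROR_MEANINGS = {
    400: "Unsupported file type or invalid config JSON",
    403: "Feature flag document_parse_api not enabled",
    413: "File exceeds plan upload limit",
}


class ParseRequestError(Exception):
    """Raised when POST /api/v1/parse does not return 202 Accepted."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status


def check_submit_status(status):
    """Return quietly on 202; otherwise raise with the documented meaning."""
    if status == 202:
        return
    raise ParseRequestError(status, ERROR_MEANINGS.get(status, "Unexpected response"))
```

Statuses outside the table fall back to a generic message instead of a guessed meaning.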