Parse

POST /api/v1/parse submits a PDF or DOCX file for structural parsing. Processing is asynchronous: the endpoint responds immediately with a job record, and the record's id field is the job_id used for polling. Poll GET /api/v1/documents/jobs/{job_id} until the job status is completed or failed.

Required feature flag: document_parse_api

Request

The request is a multipart/form-data upload.

Field	Type	Required	Description
`file`	file	Yes	PDF or DOCX file to parse
`language`	string	No	Language code (default: `en`)
`config`	string	No	JSON string with additional config options
`tenant_id`	string	No	Tenant override (defaults to authenticated tenant)
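Because `config` travels as an ordinary form field, it has to be serialized to a JSON string before upload. The sketch below assembles the non-file fields under that assumption. The `build_form_fields` helper and the `ocr_fallback` option are hypothetical, since the accepted config keys are not listed on this page.

```python
from __future__ import annotations

import json
from typing import Any, Optional


def build_form_fields(
    language: str = "en",
    config: Optional[dict[str, Any]] = None,
    tenant_id: Optional[str] = None,
) -> dict[str, str]:
    """Build the non-file multipart fields for POST /api/v1/parse."""
    fields = {"language": language}
    if config is not None:
        # `config` is documented as a JSON *string*, so serialize the dict.
        fields["config"] = json.dumps(config)
    if tenant_id is not None:
        # Only send an override when one is wanted; otherwise the
        # authenticated tenant applies.
        fields["tenant_id"] = tenant_id
    return fields


# "ocr_fallback" is a made-up option used purely for illustration.
fields = build_form_fields(config={"ocr_fallback": True})
print(fields)
```

The resulting dictionary can be passed as the `data=` argument alongside `files=` in an httpx upload.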

Example

cURL


curl -X POST https://api.expunct.ai/api/v1/parse \
  -H "X-API-Key: pk_live_abc123" \
  -F "file=@/path/to/document.pdf" \
  -F "language=en"

Python


import httpx
 
with open("/path/to/document.pdf", "rb") as f:
    response = httpx.post(
        "https://api.expunct.ai/api/v1/parse",
        headers={"X-API-Key": "pk_live_abc123"},
        files={"file": ("document.pdf", f, "application/pdf")},
        data={"language": "en"},
    )
 
response.raise_for_status()  # surface 400/403/413 instead of failing on a missing "id"
job = response.json()
print(job["id"])  # e.g. "3f2a1b4c-..."

Node.js


import FormData from 'form-data';
import fs from 'fs';
import fetch from 'node-fetch';
 
const form = new FormData();
form.append('file', fs.createReadStream('/path/to/document.pdf'), 'document.pdf');
form.append('language', 'en');
 
const response = await fetch('https://api.expunct.ai/api/v1/parse', {
  method: 'POST',
  headers: { 'X-API-Key': 'pk_live_abc123', ...form.getHeaders() },
  body: form,
});
 
if (!response.ok) throw new Error(`Upload failed with HTTP ${response.status}`);

const job = await response.json();
console.log(job.id);

Response (202 Accepted)

Field	Type	Description
`id`	string	Job ID; pass it as `job_id` when polling for status
`status`	string	Initial status: `queued`
`workflow_kind`	string	Always `parse`
`media_type`	string	`pdf` or `docx`
`progress_pct`	number	Processing progress (0–100)
`created_at`	string	ISO 8601 timestamp
`updated_at`	string	ISO 8601 timestamp


{
  "id": "3f2a1b4c-8d9e-4f2a-b1c4-5d6e7f8a9b0c",
  "status": "queued",
  "workflow_kind": "parse",
  "media_type": "pdf",
  "progress_pct": 0,
  "created_at": "2025-03-01T12:00:00Z",
  "updated_at": "2025-03-01T12:00:00Z"
}
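One way to work with the 202 body is to load it into a small typed record so that timestamps become real `datetime` values. This is only a sketch: the `ParseJob` class and `parse_job` function are illustrative names, not part of any official client.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ParseJob:
    id: str
    status: str
    workflow_kind: str
    media_type: str
    progress_pct: float
    created_at: datetime
    updated_at: datetime


def _parse_ts(value: str) -> datetime:
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11,
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def parse_job(payload: dict) -> ParseJob:
    """Convert the JSON body of a 202 response into a ParseJob."""
    return ParseJob(
        id=payload["id"],
        status=payload["status"],
        workflow_kind=payload["workflow_kind"],
        media_type=payload["media_type"],
        progress_pct=payload["progress_pct"],
        created_at=_parse_ts(payload["created_at"]),
        updated_at=_parse_ts(payload["updated_at"]),
    )


accepted = parse_job({
    "id": "3f2a1b4c-8d9e-4f2a-b1c4-5d6e7f8a9b0c",
    "status": "queued",
    "workflow_kind": "parse",
    "media_type": "pdf",
    "progress_pct": 0,
    "created_at": "2025-03-01T12:00:00Z",
    "updated_at": "2025-03-01T12:00:00Z",
})
print(accepted.id, accepted.status, accepted.created_at.isoformat())
```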

Polling for completion

cURL


curl https://api.expunct.ai/api/v1/documents/jobs/3f2a1b4c-8d9e-4f2a-b1c4-5d6e7f8a9b0c \
  -H "X-API-Key: pk_live_abc123"

Python


import time
import httpx
 
job_id = "3f2a1b4c-8d9e-4f2a-b1c4-5d6e7f8a9b0c"
headers = {"X-API-Key": "pk_live_abc123"}
 
while True:
    r = httpx.get(
        f"https://api.expunct.ai/api/v1/documents/jobs/{job_id}",
        headers=headers,
    )
    r.raise_for_status()  # fail fast on HTTP errors rather than on a missing key
    job = r.json()
    print(f"Status: {job['status']} ({job['progress_pct']}%)")
 
    if job["status"] == "completed":
        for artifact in job["artifacts"]:
            print(f"  {artifact['artifact_kind']}: {artifact['id']}")
        break
    elif job["status"] == "failed":
        print(f"Failed: {job['error_message']}")
        break
 
    time.sleep(2)
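The loop above polls forever at a fixed two-second interval. For long-running jobs, a bounded wait with exponential backoff may be preferable. The following is a sketch of that variant, not a documented client feature: `wait_for_job`, its default timings, and the injectable `fetch_job` callable are all assumptions made for illustration.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    fetch_job: Callable[[], dict],
    timeout_s: float = 300.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job is completed or failed, or raise TimeoutError.

    `fetch_job` performs one GET on the job detail endpoint and returns
    the decoded JSON body.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        job = fetch_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        # Give up if the next sleep would run past the deadline.
        if time.monotonic() + delay > deadline:
            raise TimeoutError(
                f"job still {job['status']!r} after {timeout_s}s"
            )
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # double the wait, up to a cap


# Usage with httpx (not executed here):
#   url = f"https://api.expunct.ai/api/v1/documents/jobs/{job_id}"
#   job = wait_for_job(lambda: httpx.get(url, headers=headers).json())
```

Passing the HTTP call in as a callable keeps the waiting logic independent of the HTTP library and easy to exercise with canned responses.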

Node.js


const jobId = '3f2a1b4c-8d9e-4f2a-b1c4-5d6e7f8a9b0c';
const headers = { 'X-API-Key': 'pk_live_abc123' };
 
while (true) {
  const r = await fetch(
    `https://api.expunct.ai/api/v1/documents/jobs/${jobId}`,
    { headers },
  );
  if (!r.ok) throw new Error(`Polling failed with HTTP ${r.status}`);
  const job = await r.json();
  console.log(`Status: ${job.status} (${job.progress_pct}%)`);
 
  if (job.status === 'completed') {
    for (const artifact of job.artifacts) {
      console.log(`  ${artifact.artifact_kind}: ${artifact.id}`);
    }
    break;
  } else if (job.status === 'failed') {
    console.error(`Failed: ${job.error_message}`);
    break;
  }
 
  await new Promise((r) => setTimeout(r, 2000));
}

Job detail response

GET /api/v1/documents/jobs/{job_id} returns the job plus all produced artifacts:

Field	Type	Description
`id`	string	Job ID
`status`	string	`queued`, `processing`, `completed`, or `failed`
`workflow_kind`	string	`parse`
`progress_pct`	number	0–100
`error_message`	string	Set only when status is `failed`
`artifacts`	array	List of artifact metadata objects

Each artifact in artifacts:

Field	Type	Description
`id`	string	Artifact ID
`artifact_kind`	string	See artifact types below
`page_count`	number	Page count (set only on the `canonical_document` artifact)
`retention_class`	string	`short_ttl` or `persistent`
`payload_purged`	boolean	`true` if content has been deleted
`created_at`	string	ISO 8601 timestamp

Artifact types

A completed parse job produces three artifacts:

`artifact_kind`	Description
`canonical_document`	Structured JSON — pages, blocks, tables, reading order
`markdown_render`	Document rendered as Markdown
`chunks_v1`	Semantic chunks ready for vector embedding
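Since each artifact carries its own `artifact_kind` and `payload_purged` flag, a client will usually want a lookup from kind to artifact ID that skips purged payloads. A minimal sketch follows; the helper name and the `art_*` IDs are made up for illustration.

```python
def available_artifact_ids(job: dict) -> dict:
    """Map artifact_kind -> artifact id, skipping artifacts with purged payloads."""
    ids = {}
    for artifact in job.get("artifacts", []):
        if artifact.get("payload_purged"):
            continue  # content has been deleted, nothing left to download
        ids[artifact["artifact_kind"]] = artifact["id"]
    return ids


# Hypothetical job detail body, trimmed to the fields used here.
sample_job = {
    "status": "completed",
    "artifacts": [
        {"id": "art_1", "artifact_kind": "canonical_document", "payload_purged": False},
        {"id": "art_2", "artifact_kind": "markdown_render", "payload_purged": True},
        {"id": "art_3", "artifact_kind": "chunks_v1", "payload_purged": False},
    ],
}
print(available_artifact_ids(sample_job))
```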

Retrieving artifact content


curl https://api.expunct.ai/api/v1/documents/{artifact_id}/content \
  -H "X-API-Key: pk_live_abc123"
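In the command above, `{artifact_id}` is a placeholder for an `id` taken from the job's `artifacts` list. The same request can be sketched in Python using only the standard library (the earlier examples use httpx, which works equally well). The helper names and the 30-second timeout are assumptions, and the function returns raw bytes because the response format for each artifact kind is not specified on this page.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://api.expunct.ai/api/v1"


def build_content_request(artifact_id: str, api_key: str) -> urllib.request.Request:
    """Prepare the GET request for an artifact's content."""
    # Percent-encode the ID so it is always a single, safe path segment.
    safe_id = urllib.parse.quote(artifact_id, safe="")
    return urllib.request.Request(
        f"{BASE_URL}/documents/{safe_id}/content",
        headers={"X-API-Key": api_key},
        method="GET",
    )


def fetch_artifact_content(
    artifact_id: str, api_key: str, timeout_s: float = 30.0
) -> bytes:
    """Download the artifact payload as raw bytes; decoding is left to the caller."""
    request = build_content_request(artifact_id, api_key)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read()
```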

canonical_document shape


{
  "document_id": "3f2a1b4c-...",
  "page_count": 3,
  "block_count": 42,
  "parse_route": "text_native",
  "parse_duration_ms": 380,
  "pages": [
    {
      "page_number": 1,
      "blocks": [
        {
          "block_id": "b_001",
          "kind": "heading",
          "text": "Invoice #INV-2024-001",
          "reading_order": 0
        },
        {
          "block_id": "b_002",
          "kind": "paragraph",
          "text": "Issued: March 1, 2025",
          "reading_order": 1
        }
      ],
      "tables": []
    }
  ]
}

parse_route values:

Value	Meaning
`text_native`	Text extracted directly from the PDF layer
`hybrid_default`	Mix of native text and OCR
`hybrid_verified`	OCR output verified against native text
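Given the canonical_document shape above, plain text can be recovered by walking pages in order and sorting each page's blocks by `reading_order`. This sketch assumes `reading_order` is comparable within a page, which matches the sample but is not stated explicitly. The `document_text` helper is illustrative, and table contents are ignored.

```python
def document_text(doc: dict) -> str:
    """Join block text in page order, then reading order within each page."""
    lines = []
    for page in sorted(doc["pages"], key=lambda p: p["page_number"]):
        for block in sorted(page["blocks"], key=lambda b: b["reading_order"]):
            lines.append(block["text"])
    return "\n".join(lines)


# Blocks are deliberately listed out of order to show the sort.
sample_doc = {
    "page_count": 1,
    "pages": [
        {
            "page_number": 1,
            "blocks": [
                {"block_id": "b_002", "kind": "paragraph",
                 "text": "Issued: March 1, 2025", "reading_order": 1},
                {"block_id": "b_001", "kind": "heading",
                 "text": "Invoice #INV-2024-001", "reading_order": 0},
            ],
            "tables": [],
        }
    ],
}
print(document_text(sample_doc))
```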

chunks_v1 shape


{
  "document_id": "3f2a1b4c-...",
  "source_artifact_id": "art_abc...",
  "chunks": [
    {
      "chunk_id": "c_001",
      "text": "Invoice #INV-2024-001\nIssued: March 1, 2025",
      "page_number": 1,
      "block_ids": ["b_001", "b_002"],
      "token_count": 12
    }
  ]
}
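Because every chunk reports its `token_count`, chunks can be grouped into batches that stay under a per-request token budget before being sent to an embedding service. The sketch below shows one greedy way to do that. The `batch_chunks` helper, the budget of 20, and the extra sample chunks are assumptions for illustration.

```python
def batch_chunks(chunks: list, max_tokens: int) -> list:
    """Greedily group chunks so each batch's token_count sum is <= max_tokens.

    A single chunk larger than max_tokens still gets a batch of its own.
    """
    batches, current, used = [], [], 0
    for chunk in chunks:
        tokens = chunk["token_count"]
        if current and used + tokens > max_tokens:
            batches.append(current)  # close the full batch
            current, used = [], 0
        current.append(chunk)
        used += tokens
    if current:
        batches.append(current)
    return batches


# Hypothetical chunks, trimmed to the fields used here.
sample_chunks = [
    {"chunk_id": "c_001", "text": "Invoice #INV-2024-001", "token_count": 12},
    {"chunk_id": "c_002", "text": "Net 30", "token_count": 5},
    {"chunk_id": "c_003", "text": "Total due on receipt", "token_count": 10},
]
batches = batch_chunks(sample_chunks, max_tokens=20)
print([[c["chunk_id"] for c in batch] for batch in batches])
```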

Error responses

Status	Meaning
`400`	Unsupported file type or invalid config JSON
`403`	Feature flag `document_parse_api` not enabled
`413`	File exceeds plan upload limit
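A client can translate these status codes into a descriptive exception right after submitting a file. The following is a sketch only: `ParseRequestError` and `check_submit_status` are hypothetical names, and any status not listed in the table is reported generically.

```python
ERROR_MEANINGS = {
    400: "Unsupported file type or invalid config JSON",
    403: "Feature flag document_parse_api not enabled",
    413: "File exceeds plan upload limit",
}


class ParseRequestError(Exception):
    """Raised when POST /api/v1/parse does not return 202 Accepted."""

    def __init__(self, status_code: int, meaning: str) -> None:
        super().__init__(f"HTTP {status_code}: {meaning}")
        self.status_code = status_code
        self.meaning = meaning


def check_submit_status(status_code: int) -> None:
    """Return silently on 202; otherwise raise with the documented meaning."""
    if status_code == 202:
        return
    meaning = ERROR_MEANINGS.get(status_code, "Unexpected response status")
    raise ParseRequestError(status_code, meaning)


# Usage with httpx (not executed here):
#   check_submit_status(response.status_code)
```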