Extract
POST /api/v1/extract extracts structured fields from a PDF or DOCX using a JSON Schema or a built-in template. Processing is asynchronous.
Required feature flag: document_extract_api
Two input paths
| Path | When to use |
|---|---|
| Upload a file directly | Single-step convenience: parse and extract in one call |
| Pass a parse_artifact_id | Reuse an existing parse result: faster, no re-parsing |
Exactly one of file or parse_artifact_id must be provided.
Schema vs. template
| Option | When to use |
|---|---|
| template_id | Use a built-in schema (currently: invoice) |
| extraction_schema | Provide your own JSON Schema |
Exactly one of template_id or extraction_schema must be provided.
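The two "exactly one of" rules above can be checked on the client before a request is sent. The following is a minimal sketch; `check_extract_inputs` is a hypothetical helper name, not part of the API, and the server remains the authority on what it accepts.

```python
def check_extract_inputs(file=None, parse_artifact_id=None,
                         template_id=None, extraction_schema=None):
    """Return a list of problems with the inputs (empty list if none).

    Hypothetical client-side helper mirroring the two documented rules.
    """
    problems = []
    # Rule 1: exactly one document source (file or parse_artifact_id).
    if (file is None) == (parse_artifact_id is None):
        problems.append("provide exactly one of file or parse_artifact_id")
    # Rule 2: exactly one schema source (template_id or extraction_schema).
    if (template_id is None) == (extraction_schema is None):
        problems.append("provide exactly one of template_id or extraction_schema")
    return problems
```

Calling it with a file and a template returns an empty list; calling it with both a file and a parse_artifact_id reports the conflict that the API would otherwise reject with a 400.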
Request
The request is a multipart/form-data upload.
| Field | Type | Required | Description |
|---|---|---|---|
| file | file | One of file/parse_artifact_id | PDF or DOCX file to parse and extract |
| parse_artifact_id | string | One of file/parse_artifact_id | ID of an existing canonical_document artifact |
| template_id | string | One of template_id/extraction_schema | Built-in template ID (e.g. invoice) |
| extraction_schema | string | One of template_id/extraction_schema | JSON Schema string |
| language | string | No | Language code (default: en) |
| config | string | No | JSON string with additional config options |
| tenant_id | string | No | Tenant override |
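Because extraction_schema and config are typed as strings in the table above, a schema held as a dictionary has to be serialized before it goes into the form. The sketch below builds only the text fields of the multipart form; `build_extract_fields` is a hypothetical name, and attaching the file part is left to whichever HTTP client is in use.

```python
import json


def build_extract_fields(template_id=None, extraction_schema=None,
                         parse_artifact_id=None, language=None, config=None):
    """Build the text fields of the multipart form as a dict of strings.

    Hypothetical helper: the file part, if any, is attached separately.
    """
    fields = {}
    if parse_artifact_id is not None:
        fields["parse_artifact_id"] = parse_artifact_id
    if template_id is not None:
        fields["template_id"] = template_id
    if extraction_schema is not None:
        # Documented as a JSON Schema string, so dicts are serialized here.
        fields["extraction_schema"] = (
            extraction_schema if isinstance(extraction_schema, str)
            else json.dumps(extraction_schema)
        )
    if language is not None:
        fields["language"] = language
    if config is not None:
        # config is also documented as a JSON string.
        fields["config"] = config if isinstance(config, str) else json.dumps(config)
    return fields
```

Optional fields that are not supplied are simply omitted, so the documented defaults (such as language en) apply on the server side.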
Example: file upload with built-in template
cURL
curl -X POST https://api.expunct.ai/api/v1/extract \
  -H "X-API-Key: pk_live_abc123" \
  -F "file=@/path/to/invoice.pdf" \
  -F "template_id=invoice"
Example: reuse an existing parse artifact
cURL
curl -X POST https://api.expunct.ai/api/v1/extract \
  -H "X-API-Key: pk_live_abc123" \
  -F "parse_artifact_id=art_3f2a1b4c..." \
  -F "template_id=invoice"
Example: custom schema
cURL
curl -X POST https://api.expunct.ai/api/v1/extract \
  -H "X-API-Key: pk_live_abc123" \
  -F "file=@/path/to/contract.pdf" \
  -F 'extraction_schema={
    "type": "object",
    "properties": {
      "party_name": { "type": "string", "description": "Name of the contracting party" },
      "effective_date": { "type": "string", "description": "Contract effective date" },
      "total_value": { "type": "number", "description": "Total contract value" }
    },
    "required": ["party_name", "effective_date"]
  }'
Response (202 Accepted)
The response has the same shape as the parse response, with workflow_kind set to "extract".
Polling and artifacts
Poll GET /api/v1/documents/jobs/{job_id}. A completed extract job produces:
| artifact_kind | Description |
|---|---|
| canonical_document | Intermediate parse result (ephemeral, deleted after extraction) |
| extract_result | Extracted fields with confidence scores and citations |
Retrieve artifact content with GET /api/v1/documents/{artifact_id}/content.
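Since processing is asynchronous, a client typically polls the job endpoint until the job reaches a terminal state. The sketch below shows one way to structure that loop. This section does not specify the job response fields, so the caller supplies both the fetch function (which would wrap GET /api/v1/documents/jobs/{job_id}) and the completion test; `poll_job` and its parameters are hypothetical names.

```python
import time


def poll_job(fetch_job, is_finished, interval_s=2.0, timeout_s=120.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call fetch_job() until is_finished(job) is true or the timeout passes.

    fetch_job:   callable returning the decoded job JSON (assumed to wrap
                 GET /api/v1/documents/jobs/{job_id}).
    is_finished: callable deciding whether the job is in a terminal state;
                 the status field names are not specified in this section.
    sleep/clock: injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        job = fetch_job()
        if is_finished(job):
            return job
        if clock() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval_s)
```

Keeping the HTTP call and the completion check outside the loop means the same function works regardless of how the job payload names its status field.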
extract_result shape
{
  "document_id": "3f2a1b4c-...",
  "source_artifact_id": "art_abc...",
  "template_id": "invoice",
  "schema_used": { "...": "..." },
  "fields": [
    {
      "field_name": "invoice_number",
      "value": "INV-2024-001",
      "confidence": 0.85,
      "citations": [
        {
          "page_number": 1,
          "block_id": "b_001",
          "text_snippet": "Invoice #INV-2024-001\nIssued: March 1, 2025"
        }
      ]
    },
    {
      "field_name": "total_amount",
      "value": 4250.00,
      "confidence": 0.85,
      "citations": [
        {
          "page_number": 2,
          "block_id": "b_041",
          "text_snippet": "Total Due: $4,250.00"
        }
      ]
    },
    {
      "field_name": "vendor_name",
      "value": null,
      "confidence": 0.0,
      "citations": []
    }
  ],
  "raw_output": {
    "invoice_number": "INV-2024-001",
    "total_amount": 4250.00
  },
  "validation_errors": [],
  "extraction_duration_ms": 45,
  "model_versions": { "extraction_engine": "rule_v1" }
}
Field confidence levels
| Score range | Meaning |
|---|---|
| 0.80–1.0 | Label found in same block as value |
| 0.70–0.79 | Label found in adjacent block |
| 0.30–0.69 | Pattern match only (no label context) |
| 0.0 | Field not found |
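The score ranges above can be turned into a small lookup for triaging results, for example to route low-confidence fields to manual review. This is a sketch only: the function names and the returned labels are hypothetical, and scores strictly between 0.0 and 0.30 are treated as undocumented because the table does not cover them.

```python
def confidence_meaning(score):
    """Map a confidence score to the meaning given in the table."""
    if score >= 0.80:
        return "label in same block"
    if score >= 0.70:
        return "label in adjacent block"
    if score >= 0.30:
        return "pattern match only"
    if score == 0.0:
        return "not found"
    # Scores between 0.0 and 0.30 are not covered by the table.
    return "undocumented"


def fields_below(extract_result, threshold):
    """Return names of found (non-null) fields with confidence below threshold."""
    return [
        f["field_name"]
        for f in extract_result["fields"]
        if f["value"] is not None and f["confidence"] < threshold
    ]
```

A review threshold of 0.70, for instance, would flag only fields that were matched by pattern without any label context.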
validation_errors
Present when a field marked required in the schema was not found:
"validation_errors": ["required field 'invoice_number' not found"]
Built-in templates
invoice
Extracts common invoice fields from PDF or DOCX invoices.
| Field | Type | Required |
|---|---|---|
| invoice_number | string | Yes |
| invoice_date | string | Yes |
| total_amount | number | Yes |
| vendor_name | string | No |
| vendor_address | string | No |
| customer_name | string | No |
| customer_address | string | No |
| due_date | string | No |
| purchase_order_number | string | No |
| currency | string | No |
| subtotal | number | No |
| tax_amount | number | No |
| tax_rate | string | No |
| discount_amount | number | No |
| amount_due | number | No |
| payment_terms | string | No |
| line_items | array | No |
line_items is an array of objects with description, quantity, unit_price, and amount.
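A client that consumes invoice results may want its own check for the three fields marked required in the table, independent of validation_errors. The sketch below assumes the extract_result shape shown earlier (a fields list of objects with field_name and value); the constant and function names are hypothetical.

```python
# Fields marked "Yes" in the Required column of the invoice template table.
REQUIRED_INVOICE_FIELDS = ("invoice_number", "invoice_date", "total_amount")


def missing_required_invoice_fields(extract_result):
    """List required invoice fields that are absent or null in the result."""
    found = {f["field_name"]: f["value"] for f in extract_result.get("fields", [])}
    return [name for name in REQUIRED_INVOICE_FIELDS if found.get(name) is None]
```

A field counts as missing both when it does not appear in fields at all and when it appears with a null value, as vendor_name does in the sample result.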
Error responses
| Status | Meaning |
|---|---|
| 400 | Missing required fields, conflicting inputs, or invalid JSON |
| 403 | Feature flag document_extract_api not enabled |
| 404 | parse_artifact_id not found |
| 410 | Parse artifact payload has been purged |
| 413 | File exceeds plan upload limit |
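The status codes above can be mapped to a single exception type so that calling code handles failures in one place. This is a minimal sketch: `ExtractError` and `raise_for_extract_status` are hypothetical names, the messages are copied from the table, and any status not listed here is reported as unexpected rather than guessed at.

```python
# Messages taken from the error table in this section.
EXTRACT_ERRORS = {
    400: "Missing required fields, conflicting inputs, or invalid JSON",
    403: "Feature flag document_extract_api not enabled",
    404: "parse_artifact_id not found",
    410: "Parse artifact payload has been purged",
    413: "File exceeds plan upload limit",
}


class ExtractError(Exception):
    """Raised for any non-202 response from the extract endpoint."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status


def raise_for_extract_status(status):
    """Return quietly on 202 Accepted; raise ExtractError otherwise."""
    if status == 202:
        return
    message = EXTRACT_ERRORS.get(status, "Unexpected status (not listed in this section)")
    raise ExtractError(status, message)
```

Callers can branch on the status attribute; a 410, for example, may be a reasonable point to fall back to uploading the file again instead of reusing the purged artifact.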