
Safe Parse

POST /api/v1/workflows/safe-parse submits a PDF or DOCX for combined structural parsing and PII redaction. The document is parsed, detected PII is replaced in place, and three sanitized artifacts are produced, ready for storage or indexing with PII removed.

Processing is asynchronous. Poll GET /api/v1/documents/jobs/{job_id} for status.

Required feature flag: document_safe_parse_workflow

How it works

  1. Parse — the file is ingested and converted to a canonical document (pages, blocks, tables)
  2. Sanitize — Expunct’s PII detection engine (Presidio) scans every text block and replaces detected entities with redaction labels (e.g. [PERSON], [EMAIL_ADDRESS])
  3. Render — sanitized markdown and semantic chunks are produced from the sanitized canonical document

The raw canonical document is ephemeral — it is deleted after sanitization and never returned to the caller. Only sanitized artifacts are retained.
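
The Sanitize step can be illustrated with a minimal sketch. This is not Expunct's or Presidio's implementation: the Span class, the find_span and redact helpers, and the sample text are all hypothetical, and the "detection" here is a plain substring lookup. The sketch only shows how non-overlapping detected spans might be swapped for type labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical detection result: character offsets plus an entity type."""
    start: int
    end: int
    entity_type: str


def find_span(text: str, needle: str, entity_type: str) -> Span:
    """Stand-in for a real detector: locate a known substring."""
    start = text.index(needle)
    return Span(start, start + len(needle), entity_type)


def redact(text: str, spans: list[Span]) -> str:
    """Replace each non-overlapping span with a [TYPE] label.

    Spans are applied right to left so that earlier offsets remain valid.
    """
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.entity_type}]" + text[span.end:]
    return text


sample = "Patient Jane Doe visited. Contact: jane@example.com"
spans = [
    find_span(sample, "Jane Doe", "PERSON"),
    find_span(sample, "jane@example.com", "EMAIL_ADDRESS"),
]
redacted = redact(sample, spans)
print(redacted)
```

Replacing from the end of the string backwards avoids recomputing offsets after each substitution, since a label is rarely the same length as the text it replaces.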

Request

The request is a multipart/form-data upload.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| file | file | Yes | PDF or DOCX file to parse and sanitize |
| language | string | No | Language code (default: en) |
| policy_id | string | No | Redaction policy ID; controls which entity types are redacted |
| config | string | No | JSON string with additional config options |
| tenant_id | string | No | Tenant override (defaults to authenticated tenant) |

Config options

Pass additional options as a JSON string in the config field:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| redaction_mode | string | type_label | How to render redacted spans. type_label → [PERSON]; mask → ████ |
| pii_types | array | ["all"] | Entity types to redact (e.g. ["PERSON", "EMAIL_ADDRESS"]). all means every supported type |
| pii_categories | array | ["PII","PCI","PHI"] | Categories to include |
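
Because config travels as a JSON string inside a form field, it is easy to send a malformed value. The following sketch assembles the non-file form fields in Python. The function name build_form_fields is hypothetical and not part of any Expunct SDK; the field names, config keys, and the two redaction modes are the ones documented above.

```python
import json

# The two modes documented in the config table.
VALID_REDACTION_MODES = ("type_label", "mask")


def build_form_fields(language="en", policy_id=None, pii_types=None,
                      pii_categories=None, redaction_mode=None):
    """Return the text form fields for a safe-parse upload (hypothetical helper).

    Only options that were explicitly set are serialized, so the
    server-side defaults apply to everything else.
    """
    fields = {"language": language}
    if policy_id is not None:
        fields["policy_id"] = policy_id

    config = {}
    if pii_types is not None:
        config["pii_types"] = list(pii_types)
    if pii_categories is not None:
        config["pii_categories"] = list(pii_categories)
    if redaction_mode is not None:
        if redaction_mode not in VALID_REDACTION_MODES:
            raise ValueError(f"unknown redaction_mode: {redaction_mode!r}")
        config["redaction_mode"] = redaction_mode

    # config must be a JSON *string*, not a nested form structure.
    if config:
        fields["config"] = json.dumps(config)
    return fields


fields = build_form_fields(pii_types=["PERSON", "EMAIL_ADDRESS"],
                           redaction_mode="mask")
print(fields)
```

The resulting dictionary can be passed as the text fields of a multipart/form-data request, alongside the file part, with whichever HTTP client is in use.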

Example

curl -X POST https://api.expunct.ai/api/v1/workflows/safe-parse \
  -H "X-API-Key: pk_live_abc123" \
  -F "file=@/path/to/document.pdf" \
  -F "language=en"

Example — restrict to specific entity types

curl -X POST https://api.expunct.ai/api/v1/workflows/safe-parse \
  -H "X-API-Key: pk_live_abc123" \
  -F "file=@/path/to/document.pdf" \
  -F 'config={"pii_types": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"], "redaction_mode": "type_label"}'

Example — apply a saved redaction policy

curl -X POST https://api.expunct.ai/api/v1/workflows/safe-parse \
  -H "X-API-Key: pk_live_abc123" \
  -F "file=@/path/to/document.pdf" \
  -F "policy_id=pol_hipaa_strict"

Response (202 Accepted)

The response has the same shape as the parse response, with workflow_kind set to "safe_parse".

{
  "id": "7a8b9c0d-1e2f-3a4b-c5d6-e7f8a9b0c1d2",
  "status": "queued",
  "workflow_kind": "safe_parse",
  "media_type": "pdf",
  "progress_pct": 0,
  "created_at": "2025-03-01T12:00:00Z",
  "updated_at": "2025-03-01T12:00:00Z"
}

Polling for completion

Poll GET /api/v1/documents/jobs/{job_id}, using the id from the 202 response as job_id, until status is completed or failed.

curl https://api.expunct.ai/api/v1/documents/jobs/7a8b9c0d-1e2f-3a4b-c5d6-e7f8a9b0c1d2 \
  -H "X-API-Key: pk_live_abc123"
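
A polling loop around that request might look like the sketch below. The wait_for_job function, its interval and timeout defaults, and the injected fetch_job callable are assumptions made for illustration; this page does not specify a recommended poll interval. The HTTP call itself is left to the caller, so the loop can be exercised here with canned responses.

```python
import time
from typing import Callable

# Terminal states named in this section.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(fetch_job: Callable[[], dict],
                 interval_s: float = 2.0,
                 timeout_s: float = 300.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch_job until the job reaches a terminal status.

    fetch_job is any callable returning the parsed job JSON, for example a
    wrapper around GET /api/v1/documents/jobs/{job_id}.
    Raises TimeoutError if no terminal status is seen within timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish within the timeout")
        sleep(interval_s)


# Demonstration with canned responses instead of real HTTP calls.
canned = iter([
    {"status": "queued", "progress_pct": 0},
    {"status": "queued", "progress_pct": 0},
    {"status": "completed", "progress_pct": 100},
])
final_job = wait_for_job(lambda: next(canned), sleep=lambda _: None)
print(final_job["status"])
```

Injecting both the fetch function and the sleep function keeps the loop independent of any particular HTTP client and makes it testable without waiting.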

Artifacts

A completed safe-parse job produces three artifacts:

| artifact_kind | Retention | Description |
| --- | --- | --- |
| sanitized_canonical_document | Persistent | PII-free structured document (pages, blocks, tables) |
| sanitized_markdown_render | Persistent | Sanitized document rendered as Markdown |
| sanitized_chunks_v1 | Persistent | Semantic chunks of the sanitized document, ready for embedding |

The raw canonical_document is ephemeral — it is created internally during sanitization and deleted before the job completes. It is never included in the artifact list.

Retrieve artifact content with GET /api/v1/documents/{artifact_id}/content.

sanitized_canonical_document shape

Identical structure to the canonical_document from the parse workflow, but with all PII replaced:

{
  "document_id": "7a8b9c0d-...",
  "page_count": 2,
  "block_count": 18,
  "parse_route": "text_native",
  "pages": [
    {
      "page_number": 1,
      "blocks": [
        {
          "block_id": "b_001",
          "kind": "paragraph",
          "text": "Patient [PERSON] visited on [DATE_TIME]. Contact: [EMAIL_ADDRESS]",
          "reading_order": 0
        }
      ],
      "tables": []
    }
  ]
}
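
As a rough way to audit what was redacted, the labels in a sanitized canonical document can be tallied. The sketch below assumes the type_label redaction mode and the document shape shown above; count_labels and the regular expression are hypothetical helpers, and any bracketed uppercase text that appeared in the source document itself would be counted as well.

```python
import re
from collections import Counter

# Matches type labels such as [PERSON] or [EMAIL_ADDRESS].
LABEL_RE = re.compile(r"\[([A-Z_]+)\]")


def count_labels(document: dict) -> Counter:
    """Tally redaction labels across every block of every page."""
    counts = Counter()
    for page in document.get("pages", []):
        for block in page.get("blocks", []):
            counts.update(LABEL_RE.findall(block.get("text", "")))
    return counts


# Trimmed copy of the example document above.
example_document = {
    "page_count": 2,
    "pages": [
        {
            "page_number": 1,
            "blocks": [
                {
                    "block_id": "b_001",
                    "kind": "paragraph",
                    "text": "Patient [PERSON] visited on [DATE_TIME]. "
                            "Contact: [EMAIL_ADDRESS]",
                    "reading_order": 0,
                }
            ],
            "tables": [],
        }
    ],
}
label_counts = count_labels(example_document)
print(dict(label_counts))
```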

sanitized_chunks_v1 shape

Identical structure to chunks_v1 but sourced from the sanitized canonical document:

{
  "document_id": "7a8b9c0d-...",
  "source_artifact_id": "art_sanitized...",
  "chunks": [
    {
      "chunk_id": "c_001",
      "text": "Patient [PERSON] visited on [DATE_TIME]. Contact: [EMAIL_ADDRESS]",
      "page_number": 1,
      "block_ids": ["b_001"],
      "token_count": 14
    }
  ]
}
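
Since the chunks are described as ready for embedding, a typical next step is to flatten them into records for a vector store. The sketch below assumes the payload shape shown above. The chunks_to_records function, the record layout, and the id format are hypothetical choices for illustration, not a format that any particular vector store or the Expunct API requires.

```python
def chunks_to_records(payload: dict) -> list[dict]:
    """Turn a sanitized_chunks_v1 payload into flat records for indexing.

    Each record carries the chunk text plus enough metadata to trace the
    chunk back to its page and source blocks.
    """
    document_id = payload["document_id"]
    return [
        {
            # Hypothetical id scheme: document id and chunk id joined by a colon.
            "id": f"{document_id}:{chunk['chunk_id']}",
            "text": chunk["text"],
            "metadata": {
                "page_number": chunk["page_number"],
                "block_ids": chunk["block_ids"],
                "token_count": chunk["token_count"],
            },
        }
        for chunk in payload["chunks"]
    ]


# Copy of the example payload above, with its elided ids left as-is.
example_payload = {
    "document_id": "7a8b9c0d-...",
    "source_artifact_id": "art_sanitized...",
    "chunks": [
        {
            "chunk_id": "c_001",
            "text": "Patient [PERSON] visited on [DATE_TIME]. "
                    "Contact: [EMAIL_ADDRESS]",
            "page_number": 1,
            "block_ids": ["b_001"],
            "token_count": 14,
        }
    ],
}
records = chunks_to_records(example_payload)
print(records[0]["id"])
```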

Redaction modes

| redaction_mode | Example output |
| --- | --- |
| type_label (default) | [PERSON], [EMAIL_ADDRESS], [PHONE_NUMBER] |
| mask | ████ (fixed-length block character) |

Error responses

| Status | Meaning |
| --- | --- |
| 400 | Unsupported file type or invalid config JSON |
| 403 | Feature flag document_safe_parse_workflow not enabled |
| 413 | File exceeds plan upload limit |
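
Client code may want to turn these status codes into readable messages. The sketch below mirrors the table above; the describe_error function and its fallback wording are hypothetical, and the format of the error response body is not covered on this page, so only the status code is used.

```python
# Status codes and meanings copied from the error table.
ERROR_MEANINGS = {
    400: "Unsupported file type or invalid config JSON",
    403: "Feature flag document_safe_parse_workflow not enabled",
    413: "File exceeds plan upload limit",
}


def describe_error(status_code: int) -> str:
    """Return a readable message for a safe-parse error status."""
    meaning = ERROR_MEANINGS.get(status_code)
    if meaning is None:
        return f"HTTP {status_code}: not listed in the safe-parse error table"
    return f"HTTP {status_code}: {meaning}"


print(describe_error(413))
```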