Skip to content

运行提取

概述

提取端点接受一个文档(base64 编码),通过您定义的模板进行处理,并返回与模板变量匹配的结构化数据。单次 API 调用即可完成整个流程 -- 上传、AI 处理和结构化输出。

所有提取请求通过 POST /v1/extractions/run 发送。

准备文档

在将文档发送到 API 之前,您需要对文件内容进行 base64 编码。

约束条件:

  • 最大请求体大小:15 MB
  • 支持的 MIME 类型:
    • application/pdf(PDF)
    • application/vnd.openxmlformats-officedocument.wordprocessingml.document(DOCX)
bash
# Base64-encode a PDF file
base64 -i invoice.pdf -o invoice_b64.txt

# Or inline (macOS / Linux)
PDF_BASE64=$(base64 -w 0 invoice.pdf)
typescript
import { readFileSync } from "fs";

const pdfBuffer = readFileSync("invoice.pdf");
const pdfBase64 = pdfBuffer.toString("base64");
python
import base64

with open("invoice.pdf", "rb") as f:
    pdf_base64 = base64.b64encode(f.read()).decode("utf-8")

WARNING

Base64 编码会使文件大小增加约 33%。10 MB 的 PDF 编码后约为 13.3 MB,因此请保持源文件小于约 11 MB 以符合 15 MB 的请求限制。

发起请求

/v1/extractions/run 发送 POST 请求,包含以下 JSON 请求体:

字段类型必需描述
templateIdstring用于提取的模板 ID
fileNamestring原始文件名(例如 "invoice.pdf"
pdfBase64stringBase64 编码的文件内容
mimeTypestring文件的 MIME 类型
runIdstring可选标识符,用于将多个提取分组为一个批次
bash
curl -X POST https://api.docmap.io/v1/extractions/run \
  -H "Authorization: Bearer dm_live_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "templateId": "tmpl_abc123",
    "fileName": "invoice.pdf",
    "pdfBase64": "JVBERi0xLjQKJeLj...",
    "mimeType": "application/pdf"
  }'
typescript
const response = await fetch("https://api.docmap.io/v1/extractions/run", {
  method: "POST",
  headers: {
    Authorization: "Bearer dm_live_your_api_key",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    templateId: "tmpl_abc123",
    fileName: "invoice.pdf",
    pdfBase64: pdfBase64,
    mimeType: "application/pdf",
  }),
});

const { data } = await response.json();
console.log(data.extractedData);
python
import requests

response = requests.post(
    "https://api.docmap.io/v1/extractions/run",
    headers={
        "Authorization": "Bearer dm_live_your_api_key",
        "Content-Type": "application/json",
    },
    json={
        "templateId": "tmpl_abc123",
        "fileName": "invoice.pdf",
        "pdfBase64": pdf_base64,
        "mimeType": "application/pdf",
    },
)

data = response.json()["data"]
print(data["extractedData"])

理解响应

成功的提取返回一个包含在 data 对象中的响应:

json
{
  "data": {
    "id": "ext_abc123def456",
    "userId": "user_789",
    "templateId": "tmpl_abc123",
    "templateName": "Invoice Parser",
    "fileName": "invoice.pdf",
    "status": "completed",
    "extractedData": {
      "vendor_name": "Acme Corp",
      "invoice_number": "INV-2024-001",
      "total_amount": 1250.00,
      "line_items": [
        { "description": "Widget A", "quantity": 10, "unit_price": 125.00 }
      ]
    },
    "error": null,
    "variables": [
      { "name": "vendor_name", "type": "string", "description": "Company name of the vendor" },
      { "name": "invoice_number", "type": "string", "description": "Invoice reference number" },
      { "name": "total_amount", "type": "number", "description": "Total invoice amount" },
      { "name": "line_items", "type": "array", "description": "List of line items" }
    ],
    "source": "api",
    "runId": null,
    "processingTimeMs": 3420,
    "createdAt": "2025-07-15T10:30:00.000Z"
  }
}
字段描述
id唯一提取 ID(以 ext_ 为前缀)
status运行中(异步模式)为 "processing",数据成功提取为 "completed",处理出错为 "failed"
extractedData键与模板变量名匹配的对象。提取失败时为 null
error提取失败时的错误信息字符串。成功时为 null
processingTimeMsAI 处理文档所用时间(毫秒)
source通过 API 密钥触发时为 "api",通过网页界面触发时为 "dashboard"
variables用于此次提取的模板变量定义
runId您提供的批次标识符,未指定时为 null
templateName所使用模板的可读名称
createdAt提取创建时间的 ISO 8601 时间戳

TIP

在访问 extractedData 之前,请始终先检查 status 字段。如果 status"failed",则 error 字段包含错误描述。

批量提取

要将多个文件作为逻辑批次处理,请在每个提取请求中传入相同的 runId。这不会改变文档的处理方式 -- 每个文件仍然独立提取 -- 但允许您一起查询某个批次的所有结果。

typescript
const runId = "batch-invoices-2025-07";
const files = ["invoice-001.pdf", "invoice-002.pdf", "invoice-003.pdf"];

// Process each file with the same runId
const results = await Promise.all(
  files.map(async (fileName) => {
    const pdfBase64 = readFileSync(fileName).toString("base64");

    const response = await fetch("https://api.docmap.io/v1/extractions/run", {
      method: "POST",
      headers: {
        Authorization: "Bearer dm_live_your_api_key",
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        templateId: "tmpl_abc123",
        fileName,
        pdfBase64,
        mimeType: "application/pdf",
        runId,
      }),
    });

    return response.json();
  })
);

然后获取该批次的所有提取结果:

bash
curl "https://api.docmap.io/v1/extractions?runId=batch-invoices-2025-07" \
  -H "Authorization: Bearer dm_live_your_api_key"
typescript
const response = await fetch(
  "https://api.docmap.io/v1/extractions?runId=batch-invoices-2025-07",
  {
    headers: { Authorization: "Bearer dm_live_your_api_key" },
  }
);

const { data } = await response.json();
console.log(`Batch contains ${data.length} extractions`);
python
response = requests.get(
    "https://api.docmap.io/v1/extractions",
    params={"runId": "batch-invoices-2025-07"},
    headers={"Authorization": "Bearer dm_live_your_api_key"},
)

data = response.json()["data"]
print(f"Batch contains {len(data)} extractions")

TIP

列表端点还支持按 templateId 筛选和 limit 参数(1--100,默认 50)。您可以组合筛选条件:?runId=batch-001&templateId=tmpl_abc123&limit=100

异步工作流

默认情况下,提取请求是同步的 -- API 会阻塞直到处理完成。对于长时间运行的提取,或者想要避免 HTTP 超时时,可以通过在 URL 中添加 ?async=true 来使用异步模式。API 会立即返回 "processing" 状态,您可以轮询另一个端点直到结果就绪。

提交 + 轮询模式

typescript
import { readFileSync } from "fs";

const API_BASE = "https://api.docmap.io";
const API_KEY = process.env.DOCMAP_API_KEY!;

// 1. Submit the extraction asynchronously
const submitResponse = await fetch(`${API_BASE}/v1/extractions/run?async=true`, {
  method: "POST",
  headers: {
    Authorization: `Bearer ${API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    templateId: "tmpl_abc123",
    fileName: "invoice.pdf",
    pdfBase64: readFileSync("invoice.pdf").toString("base64"),
    mimeType: "application/pdf",
  }),
});

const { data: submitted } = await submitResponse.json();
console.log(`Extraction ${submitted.id} submitted, status: ${submitted.status}`);

// 2. Poll until complete
async function poll(extractionId: string): Promise<any> {
  for (let i = 0; i < 30; i++) {
    const res = await fetch(`${API_BASE}/v1/extractions/${extractionId}`, {
      headers: { Authorization: `Bearer ${API_KEY}` },
    });
    const { data } = await res.json();
    if (data.status !== "processing") return data;
    await new Promise((r) => setTimeout(r, 2000));
  }
  throw new Error("Extraction timed out");
}

const result = await poll(submitted.id);
console.log(`Final status: ${result.status}`);
console.log("Extracted data:", result.extractedData);
python
import base64
import time
import requests

API_BASE = "https://api.docmap.io"
API_KEY = "dm_live_your_api_key"
headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# 1. Submit the extraction asynchronously
with open("invoice.pdf", "rb") as f:
    pdf_base64 = base64.b64encode(f.read()).decode("utf-8")

submit_response = requests.post(
    f"{API_BASE}/v1/extractions/run?async=true",
    headers=headers,
    json={
        "templateId": "tmpl_abc123",
        "fileName": "invoice.pdf",
        "pdfBase64": pdf_base64,
        "mimeType": "application/pdf",
    },
)

submitted = submit_response.json()["data"]
print(f"Extraction {submitted['id']} submitted, status: {submitted['status']}")

# 2. Poll until complete
def poll(extraction_id: str):
    for _ in range(30):
        res = requests.get(
            f"{API_BASE}/v1/extractions/{extraction_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
        data = res.json()["data"]
        if data["status"] != "processing":
            return data
        time.sleep(2)
    raise TimeoutError("Extraction timed out")

result = poll(submitted["id"])
print(f"Final status: {result['status']}")
print("Extracted data:", result["extractedData"])

何时使用异步模式

处理大型文档时、HTTP 客户端超时较短时,或者想要提交多个提取后再收集结果时,请使用异步模式。对于大多数单文档提取,同步模式更简单。

完整工作流示例

以下是一个完整的端到端 TypeScript 示例,从磁盘读取 PDF,运行提取,并处理成功和失败情况:

typescript
import { readFileSync } from "fs";

const API_BASE = "https://api.docmap.io";
const API_KEY = process.env.DOCMAP_API_KEY!;

async function extractInvoice(filePath: string) {
  // 1. Read and encode the PDF
  const pdfBuffer = readFileSync(filePath);
  const pdfBase64 = pdfBuffer.toString("base64");
  const fileName = filePath.split("/").pop()!;

  console.log(`Processing ${fileName} (${(pdfBuffer.length / 1024).toFixed(0)} KB)...`);

  // 2. Run the extraction
  const response = await fetch(`${API_BASE}/v1/extractions/run`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      templateId: "tmpl_invoice_parser",
      fileName,
      pdfBase64,
      mimeType: "application/pdf",
    }),
  });

  if (!response.ok) {
    const error = await response.json();
    throw new Error(`API error ${response.status}: ${error.error.message}`);
  }

  const { data } = await response.json();

  // 3. Check the extraction result
  if (data.status === "failed") {
    console.error(`Extraction failed: ${data.error}`);
    return null;
  }

  console.log(`Extraction completed in ${data.processingTimeMs}ms`);
  console.log("Extracted data:", JSON.stringify(data.extractedData, null, 2));

  return data.extractedData;
}

// Run it
extractInvoice("./invoices/invoice-001.pdf")
  .then((result) => {
    if (result) {
      console.log(`Vendor: ${result.vendor_name}`);
      console.log(`Total: $${result.total_amount}`);
    }
  })
  .catch(console.error);

DocMap API 文档