Document Extraction API Overview
This page provides a comprehensive overview of the API, including authentication, available endpoints, and a step-by-step guide to processing documents.
Introduction
Welcome to the Suparse Document Processing API! Our goal is to provide a powerful yet simple interface to automate the extraction of structured data from your documents. Whether you're processing invoices, receipts, or bank statements, this API is designed to handle the entire lifecycle, from upload to data retrieval.
This guide will walk you through the essential concepts, including authentication, the available endpoints, and the asynchronous workflow you'll use to process your documents.
Authentication
All API requests must be authenticated using an API key. You can generate and manage your API keys from your user dashboard.
Provide your API key in the X-API-Key header with every request.
X-API-Key: pk_abcd1234_secretsecretsecretsecretsecret
Requests without a valid API key will fail with a 401 Unauthorized error.
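To keep the key out of source code and build the header in one place, something like the following could be used. This is a minimal sketch, not part of an official client: the helper names auth_headers and api_key_from_env, and the SUPARSE_API_KEY environment variable, are assumptions made for this example.

```python
import os


def auth_headers(api_key: str) -> dict:
    """Build the header that authenticates a request with an API key."""
    if not api_key:
        raise ValueError("An API key is required")
    return {"X-API-Key": api_key}


def api_key_from_env(name: str = "SUPARSE_API_KEY") -> str:
    """Read the key from the environment; returns "" when it is not set."""
    # The variable name is an assumption for this sketch, not an official setting.
    return os.environ.get(name, "")
```

The returned dictionary can be passed as the headers argument of an HTTP client call, for example requests.get(url, headers=auth_headers(key)).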
API Endpoints
Here is a summary of the primary endpoints for managing your documents:
Method | Endpoint | Description |
---|---|---|
POST | /api/v1/documents/{doc_type} | Uploads a new document for processing. |
GET | /api/v1/documents/ | Lists all your accessible documents with pagination and filtering. |
GET | /api/v1/documents/{document_id}/result | Retrieves the extracted data and status for a specific processed document. |
POST | /api/v1/documents/download_selected | Downloads extracted data for multiple documents in JSON, CSV, XLSX, or QuickBooks CSV format. |
DELETE | /api/v1/documents/{document_id} | Deletes a document permanently. |
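
The paths in the table can be turned into full URLs with a few small helpers. The sketch below is illustrative only: the function names are hypothetical, and the base URL is the one used in the code example on this page.

```python
from urllib.parse import quote

BASE_URL = "https://api.suparse.com/api/v1"  # as used in the code example on this page


def upload_url(doc_type: str) -> str:
    """POST target for uploading a document of the given type."""
    return f"{BASE_URL}/documents/{quote(doc_type, safe='')}"


def list_url() -> str:
    """GET target for listing documents."""
    return f"{BASE_URL}/documents/"


def result_url(document_id: str) -> str:
    """GET target for the extracted data and status of one document."""
    return f"{BASE_URL}/documents/{quote(document_id, safe='')}/result"


def download_selected_url() -> str:
    """POST target for downloading data for multiple documents."""
    return f"{BASE_URL}/documents/download_selected"


def delete_url(document_id: str) -> str:
    """DELETE target for removing one document."""
    return f"{BASE_URL}/documents/{quote(document_id, safe='')}"
```

Percent-encoding the path segments with quote is a defensive choice in this sketch, so that an unexpected character in an ID cannot change the path.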
The Asynchronous Workflow
Document processing is an asynchronous operation. You upload a file and then check for the result later. This ensures that you get a fast response and that intensive processing happens in the background.
Here’s the typical flow:
[Client] --1. Upload File--> [API: 202 Accepted]
[Client] <--2. Poll for Result-- [API: 202 Processing / 200 OK / 404 Error]
Step 1: Upload the Document
Begin by making a POST request to the Upload Document endpoint with your file. If the request is valid, the API will immediately respond with a 202 Accepted status, a unique document_id, and a status of "queued".
Step 2: Poll for Processing Results
Because processing happens in the background, you must periodically check for the result using the Get Document Result endpoint.
Best Practices for Polling:
- Check the Status Code:
  - A 202 Accepted response means the document is still processing. You should wait and try again.
  - A 200 OK response means processing is complete and the body contains your extracted data.
  - A 404 Not Found response indicates either the document ID is invalid or processing failed. The detail field will provide more information.
- To avoid excessive requests, we recommend making the first request 5 seconds after submitting the file for processing, and then polling every 3 seconds.
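The polling advice above can be sketched as a small loop that is independent of any HTTP client. This is an illustration under stated assumptions, not an official implementation: fetch is a caller-supplied function that performs the GET request and returns the status code with the parsed JSON body, the function name poll_until_done is hypothetical, and the limit of 20 attempts is an arbitrary choice.

```python
import time
from typing import Callable, Optional, Tuple

# fetch() is supplied by the caller and returns (status_code, parsed_json_body).
Fetch = Callable[[], Tuple[int, dict]]


def poll_until_done(
    fetch: Fetch,
    first_delay: float = 5.0,   # wait 5 seconds before the first request
    interval: float = 3.0,      # then wait 3 seconds between requests
    max_attempts: int = 20,     # arbitrary cap chosen for this sketch
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Return the body of the first 200 OK response, or None if attempts run out."""
    delay = first_delay
    for _ in range(max_attempts):
        sleep(delay)  # wait before every request, including the first
        delay = interval
        status, body = fetch()
        if status == 200:
            return body  # processing is complete
        if status == 202:
            continue  # still processing, try again
        if status == 404:
            # Invalid document ID or failed processing; surface the detail field.
            raise LookupError(body.get("detail", "document not found or processing failed"))
        raise RuntimeError(f"Unexpected status {status}")
    return None
```

Passing sleep as a parameter keeps the loop easy to test without real waiting; in normal use the default time.sleep applies.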
Step 3: Download or Use the Results
Once the status is 200 OK, the response body will contain your structured data. You can use this data directly or download it in bulk using the Download Selected Documents endpoint.
Step 4: Delete the Document (Optional)
After you have retrieved the results, you can make a DELETE request to the Delete Document endpoint. This action is irreversible.
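A deletion call might look like the sketch below. It is a hedged illustration rather than official client code: the function name delete_document is hypothetical, session stands for any object with a requests-style delete() method (such as a requests.Session), the 30-second timeout is an arbitrary choice, and treating 404 as "nothing to delete" is an assumption of this example.

```python
BASE_URL = "https://api.suparse.com/api/v1"  # as used in the code example on this page


def delete_document(session, document_id: str, api_key: str) -> bool:
    """Delete a document; return True when the API answers 204 No Content."""
    url = f"{BASE_URL}/documents/{document_id}"
    response = session.delete(url, headers={"X-API-Key": api_key}, timeout=30)
    if response.status_code == 204:
        return True  # deleted permanently
    if response.status_code == 404:
        return False  # the document does not exist
    raise RuntimeError(f"Delete failed with status {response.status_code}")
```

Because deletion cannot be undone, it is safest to call this only after the extracted data has been stored on your side.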
Handling Responses & Errors
Understanding the HTTP status codes our API returns will help you build a robust integration.
- 200 OK: The request was successful, and the response body contains the requested data.
- 202 Accepted: Your document was successfully uploaded and is queued for processing. Poll for the result.
- 204 No Content: Your DELETE request was successful.
- 400 Bad Request: The request was malformed (e.g., invalid JSON, wrong file type, invalid UUID format).
- 401 Unauthorized: Your X-API-Key is missing, invalid, or expired.
- 403 Forbidden: You do not have permission to perform this action (e.g., insufficient credits).
- 404 Not Found: The requested resource (like a document or its result) does not exist.
- 429 Too Many Requests: You have exceeded the rate limit for an endpoint.
- 500 Internal Server Error: An unexpected error occurred on our end. Please try again later.
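One way to act on these codes is to map each of them to a next step in the client. The sketch below is only an illustration: the Action enum and the next_action function are hypothetical names, and retrying on 429 as well as on 500 is an assumption of this example rather than documented behavior.

```python
from enum import Enum


class Action(Enum):
    DONE = "done"                # request succeeded, nothing more to do
    WAIT = "wait"                # document queued or processing, poll again
    RETRY = "retry"              # transient problem, try again later
    FIX_REQUEST = "fix_request"  # the request itself must change first


def next_action(status_code: int) -> Action:
    """Suggest what a client could do after receiving a status code."""
    if status_code in (200, 204):
        return Action.DONE
    if status_code == 202:
        return Action.WAIT
    if status_code in (429, 500):
        # Assumption: back off and retry on rate limiting and server errors.
        return Action.RETRY
    # 400, 401, 403, 404 and anything unrecognized need a change on the client side.
    return Action.FIX_REQUEST
```

Sending unknown codes to FIX_REQUEST is a conservative default, so that unexpected responses are never retried in a loop.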
Code Examples
```python
import requests
import time
import os
import json

# --- Configuration ---
API_KEY = "pk_abcd1234_secretsecretsecretsecretsecret"  # Replace with your actual API key
BASE_URL = "https://api.suparse.com/api/v1"
FILE_PATH = "/path/to/your/invoice.pdf"  # Replace with your file path
DOC_TYPE = "invoice"


def upload_document(file_path, doc_type):
    """Uploads a document and returns its ID."""
    print(f"Uploading {file_path}...")
    url = f"{BASE_URL}/documents/{doc_type}"
    headers = {"X-API-Key": API_KEY}
    with open(file_path, "rb") as f:
        files = {"file": (os.path.basename(file_path), f)}
        response = requests.post(url, headers=headers, files=files)
    if response.status_code == 202:
        data = response.json()
        print(f"Upload successful. Document ID: {data['document_id']}")
        return data['document_id']
    else:
        print(f"Error during upload: {response.status_code} {response.text}")
        return None


def poll_for_result(document_id):
    """Polls for the processing result of a document."""
    url = f"{BASE_URL}/documents/{document_id}/result"
    headers = {"X-API-Key": API_KEY}
    max_attempts = 20  # Increased attempts for shorter polling interval
    delay = 5  # Initial 5-second delay
    for attempt in range(max_attempts):
        print(f"Polling for result (Attempt {attempt + 1}/{max_attempts})...")
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            print("Processing complete!")
            return response.json()
        elif response.status_code == 202:
            print(f"Status is 'processing', waiting for {delay} seconds...")
            time.sleep(delay)
            delay = 3  # Set all subsequent delays to 3 seconds
        else:
            print(f"Polling failed with status {response.status_code}: {response.text}")
            return None
    print("Polling timed out after maximum attempts.")
    return None


if __name__ == "__main__":
    doc_id = upload_document(FILE_PATH, DOC_TYPE)
    if doc_id:
        result = poll_for_result(doc_id)
        if result:
            print("\n--- Extracted Data ---")
            print(json.dumps(result, indent=2))
```