Table Extraction
Extract structured tables from PDF documents using pdfplumber.
Port: 8098
What it Does
Table Extraction parses PDFs and extracts tabular data:
- Table detection - Find tables in PDF pages
- Structure extraction - Headers, rows, columns
- Filtering - Minimum columns, row limits
Use cases:
- Extract financial data from reports
- Parse invoices and receipts
- Convert PDF tables to structured data
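The detection and structure steps above can be sketched with pdfplumber's `page.extract_tables()`. This is only an illustration and not the service's actual code: the helper names `shape_table` and `extract_from_pdf` are hypothetical, and treating the first row of each table as the header is an assumption.

```python
from __future__ import annotations

from typing import Optional


def shape_table(raw: list[list[Optional[str]]], page: int, index: int) -> dict:
    """Turn one raw table into a dict shaped like an entry in the API response.

    Assumption: the first row holds the column headers.
    """
    def clean(cell: Optional[str]) -> str:
        # pdfplumber returns None for empty cells; normalise to "".
        return (cell or "").strip()

    if not raw:
        return {"page": page, "index": index, "columns": [], "rows": [], "row_count": 0}

    header, *body = raw
    columns = [clean(c) for c in header]
    rows = [[clean(c) for c in row] for row in body]
    return {
        "page": page,
        "index": index,
        "columns": columns,
        "rows": rows,
        "row_count": len(rows),
    }


def extract_from_pdf(path: str) -> list[dict]:
    """Collect every table pdfplumber detects, page by page."""
    import pdfplumber  # imported lazily so shape_table works without it

    tables = []
    with pdfplumber.open(path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for index, raw in enumerate(page.extract_tables()):
                tables.append(shape_table(raw, page_number, index))
    return tables
```

Keeping the shaping step separate from the PDF access makes it easy to check against hand-written rows without a PDF on disk.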
API Endpoints
Extract Tables
Endpoint: POST /extract-tables
curl -X POST "http://table-extraction:8098/extract-tables" \
  -F "file=@document.pdf" \
  -F "max_pages=10" \
  -F "min_columns=2"
Request (multipart/form-data):
| Parameter | Type | Description |
|---|---|---|
| file | file | PDF file |
| max_pages | number | Only scan first N pages |
| min_columns | number | Minimum non-empty headers (default: 2) |
| max_rows | number | Limit rows per table (default: 500) |
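One plausible reading of the two filter parameters, written as a small sketch. The function `apply_filters` is hypothetical, and the exact behaviour of the service (for example, whether whitespace-only headers count as empty) is an assumption here.

```python
def apply_filters(columns, rows, min_columns=2, max_rows=500):
    """Return (columns, rows) after filtering, or None if the table is rejected.

    Assumptions: a header counts as non-empty when it contains any
    non-whitespace text, and max_rows truncates rather than rejects.
    """
    non_empty = sum(1 for c in columns if c and c.strip())
    if non_empty < min_columns:
        return None  # too few real headers: likely not a table
    return columns, rows[:max_rows]
```

Under this reading, a table whose header row is `["Name", ""]` is dropped at the default `min_columns=2`, while a table with more than `max_rows` rows is kept but cut short.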
Response:
{
  "page_count": 3,
  "tables": [
    {
      "page": 1,
      "index": 0,
      "columns": ["Name", "Value", "Date"],
      "rows": [
        ["Item A", "100", "2024-01-15"],
        ["Item B", "200", "2024-01-16"]
      ],
      "row_count": 2,
      "preview": "Name, Value, Date → Item A | 100 | 2024-01-15"
    }
  ]
}
Health Check
curl http://table-extraction:8098/health
Usage Example
import requests

# Extract tables from PDF
with open("financial_report.pdf", "rb") as f:
    response = requests.post(
        "http://table-extraction:8098/extract-tables",
        files={"file": f},
        data={
            "max_pages": 20,
            "min_columns": 3
        }
    )

# Fail early on HTTP errors instead of parsing an error body as JSON
response.raise_for_status()
result = response.json()

print(f"Found {len(result['tables'])} tables in {result['page_count']} pages")
for table in result["tables"]:
    print(f"\nPage {table['page']}, Table {table['index']}:")
    print(f"  Columns: {table['columns']}")
    print(f"  Rows: {table['row_count']}")
Convert to DataFrame
import pandas as pd
import requests

# Extract tables (the with-block closes the file after the upload)
with open("data.pdf", "rb") as f:
    result = requests.post(
        "http://table-extraction:8098/extract-tables",
        files={"file": f}
    ).json()

# Convert first table to DataFrame
if result["tables"]:
    table = result["tables"][0]
    df = pd.DataFrame(table["rows"], columns=table["columns"])
    print(df)
Tips
- The default min_columns=2 filters out false positives (single-column “tables”)
- The default max_rows=500 caps the rows returned per table, so a very large table cannot bloat or slow down the response
- Tables with merged cells may have empty values
- Complex layouts may require manual post-processing
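As an example of the post-processing mentioned above, the sketch below forward-fills empty cells in chosen columns. It assumes that a vertically merged cell appears as a value in its first row followed by empty cells in the rows it spans; that is a common pattern but not guaranteed for every PDF. The helper `forward_fill` is hypothetical and not part of the service.

```python
def forward_fill(rows, column_indexes):
    """Copy the last seen value down into empty cells of the given columns.

    Returns new rows; the input is left unchanged. Empty means "" or None.
    """
    filled = []
    last_seen = {}
    for row in rows:
        new_row = list(row)
        for i in column_indexes:
            if i >= len(new_row):
                continue  # ragged row: nothing to fill at this index
            if new_row[i] in ("", None):
                # Keep the empty value if nothing has been seen yet.
                new_row[i] = last_seen.get(i, new_row[i])
            else:
                last_seen[i] = new_row[i]
        filled.append(new_row)
    return filled
```

It can be applied to `table["rows"]` before building a DataFrame, for instance `forward_fill(table["rows"], [0])` when only the first column contains merged cells.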