Table Extraction

Extract structured tables from PDF documents using pdfplumber.

Port: 8098

What it Does

Table Extraction parses PDFs and extracts tabular data:

  • Table detection - Find tables in PDF pages
  • Structure extraction - Headers, rows, columns
  • Filtering - Minimum columns, row limits

Use cases:

  • Extract financial data from reports
  • Parse invoices and receipts
  • Convert PDF tables to structured data

API Endpoints

Extract Tables

Endpoint: POST /extract-tables

curl -X POST "http://table-extraction:8098/extract-tables" \
  -F "file=@document.pdf" \
  -F "max_pages=10" \
  -F "min_columns=2"

Request (multipart/form-data):

Parameter    Type    Description
file         file    PDF file
max_pages    number  Only scan first N pages
min_columns  number  Minimum non-empty headers (default: 2)
max_rows     number  Limit rows per table (default: 500)
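
The optional fields in the table can be assembled in one place before sending the request. The helper below is a minimal client-side sketch, not part of the service: `build_form_fields` is a hypothetical name, and it assumes that the values are positive integers and that leaving a field out lets the service apply the defaults listed above.

```python
def build_form_fields(max_pages=None, min_columns=None, max_rows=None):
    """Return the optional multipart form fields, skipping unset ones.

    Hypothetical client-side helper. Assumption: the service falls back to
    its documented defaults (min_columns=2, max_rows=500) for omitted fields.
    """
    fields = {}
    for name, value in (
        ("max_pages", max_pages),
        ("min_columns", min_columns),
        ("max_rows", max_rows),
    ):
        if value is None:
            continue  # leave the field out entirely
        # Assumption: only positive integers are meaningful for these fields.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
        # Form fields travel as text, so send each number as a string.
        fields[name] = str(value)
    return fields
```

The result can be passed as the `data` argument of `requests.post`, next to `files={"file": f}`.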

Response:

{
  "page_count": 3,
  "tables": [
    {
      "page": 1,
      "index": 0,
      "columns": ["Name", "Value", "Date"],
      "rows": [
        ["Item A", "100", "2024-01-15"],
        ["Item B", "200", "2024-01-16"]
      ],
      "row_count": 2,
      "preview": "Name, Value, Date → Item A | 100 | 2024-01-15"
    }
  ]
}
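
Each entry in "tables" keeps its headers and rows separate, so a common next step is to pair them up. The following is a minimal sketch using the sample response above; `table_to_records` is a hypothetical helper name, and it assumes the headers are unique and non-empty (duplicate headers would overwrite each other in the resulting dicts).

```python
def table_to_records(table):
    """Convert one entry of "tables" into a list of dicts keyed by column name."""
    columns = table["columns"]
    # zip() pairs cells with headers by position and stops at the shorter
    # sequence, so a row with extra cells loses the trailing ones.
    return [dict(zip(columns, row)) for row in table["rows"]]


# The table from the sample response above
sample = {
    "page": 1,
    "index": 0,
    "columns": ["Name", "Value", "Date"],
    "rows": [
        ["Item A", "100", "2024-01-15"],
        ["Item B", "200", "2024-01-16"],
    ],
    "row_count": 2,
}

records = table_to_records(sample)
print(records[0]["Value"])  # prints 100
```

Cell values stay strings, as in the response; convert them to numbers or dates where needed.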

Health Check

Endpoint: GET /health

curl http://table-extraction:8098/health
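
The same check can be made from Python before uploading a document. This is a minimal sketch using only the standard library. It assumes that a healthy service answers with HTTP 200 and does not inspect the response body, whose format is not documented here. `is_healthy` and its `opener` parameter are hypothetical names; `opener` exists only so the function can be exercised without a running service.

```python
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with opener(f"{base_url}/health", timeout=timeout) as response:
            # Assumption: 200 means healthy; the body is ignored.
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False
```

For example, `is_healthy("http://table-extraction:8098")` returns False instead of raising when the service is unreachable.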

Usage Example

import requests

# Extract tables from the first 20 pages, keeping tables with at least 3 columns
with open("financial_report.pdf", "rb") as f:
    response = requests.post(
        "http://table-extraction:8098/extract-tables",
        files={"file": f},
        data={
            "max_pages": 20,
            "min_columns": 3
        }
    )

# Fail early on an HTTP error instead of parsing an error body as JSON
response.raise_for_status()
result = response.json()
print(f"Found {len(result['tables'])} tables in {result['page_count']} pages")

for table in result["tables"]:
    print(f"\nPage {table['page']}, Table {table['index']}:")
    print(f"  Columns: {table['columns']}")
    print(f"  Rows: {table['row_count']}")

Convert to DataFrame

import pandas as pd
import requests

# Extract tables; the with-block closes the file once the upload finishes
with open("data.pdf", "rb") as f:
    response = requests.post(
        "http://table-extraction:8098/extract-tables",
        files={"file": f}
    )
response.raise_for_status()
result = response.json()

# Convert first table to DataFrame
if result["tables"]:
    table = result["tables"][0]
    df = pd.DataFrame(table["rows"], columns=table["columns"])
    print(df)

Tips

  • min_columns=2 filters out false positives (single-column “tables”)
  • max_rows=500 caps the rows returned per table, so very large tables do not slow down the response
  • Tables with merged cells may have empty values
  • Complex layouts may require manual post-processing
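
As one example of such post-processing, cells left empty by a vertically merged cell can be filled from the row above. This is a minimal sketch, assuming that empty cells arrive as None or an empty string (the response format does not specify which) and that the merged value belongs to the first row of the span. `fill_down` is a hypothetical helper, not something the service provides.

```python
def fill_down(rows, column_indexes):
    """Copy the last non-empty value downward in the given columns.

    Returns new rows; the input rows are left unchanged.
    """
    filled = []
    last_seen = {}  # column index -> most recent non-empty value
    for row in rows:
        new_row = list(row)
        for i in column_indexes:
            if i >= len(new_row):
                continue  # ragged row: nothing to fill at this position
            cell = new_row[i]
            # Assumption: None or whitespace-only text marks an empty cell.
            if cell is None or str(cell).strip() == "":
                if i in last_seen:
                    new_row[i] = last_seen[i]
            else:
                last_seen[i] = cell
        filled.append(new_row)
    return filled


rows = [
    ["Q1", "Item A", "100"],
    ["", "Item B", "200"],
    [None, "Item C", "300"],
    ["Q2", "Item D", "400"],
]
print(fill_down(rows, column_indexes=[0]))
```

Only fill columns where a blank really means "same as above"; in a value column, a blank may be a genuinely missing number.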