Table Extraction

Extract structured tables from PDF documents using Camelot and other table detection methods. Returns columns, rows, and metadata for each table found.

Why Table Extraction Matters

Tables are one of the most information-dense structures in documents, yet they are among the hardest elements to extract accurately. Standard text extraction treats table content as flat text, losing the critical row-column relationships that give the data its meaning. A financial figure without its corresponding label and time period is essentially useless. The Table Extraction service solves this by preserving the structural relationships between cells, returning clean column headers and properly aligned row data that downstream processes can work with directly.

Within the IntelligenceBox document intelligence pipeline, table extraction complements the text-based chunking and embedding stages. When a PDF is processed, tables are detected and extracted separately so that their structured data can be stored, queried, and presented in its original tabular format rather than being flattened into paragraph text. This is especially valuable for RAG applications where an LLM needs to reason over numerical data or compare values across rows.

Supported Formats and Accuracy

The service accepts PDF files uploaded via multipart form data. Camelot, the primary extraction engine, works best with digitally created PDFs that contain clearly bordered or lined tables. For scanned documents, ensure that an OCR step has been applied beforehand so that the text layer is present. Borderless tables and complex merged-cell layouts may require manual verification. Accuracy is highest on well-structured financial reports, invoices, and scientific papers where table borders are explicit.

Service Info

  • Port: 8098
  • Internal URL: http://table-extractor:8098
  • Endpoint: /extract-tables
  • Input: PDF (multipart)

Use Cases

  • Extract financial data from reports
  • Parse invoices and receipts
  • Convert PDF tables to spreadsheets
  • Extract data tables from scientific papers
  • Process forms with tabular data

Request Format

The endpoint accepts a multipart/form-data upload with the following fields:

POST /extract-tables (multipart/form-data)
file: <PDF file>
max_pages: 10          // Max pages to process (optional)
min_columns: 2         // Minimum columns for a table (optional)
max_rows: 1000         // Maximum rows per table (optional)
use_camelot: true      // Use Camelot extraction (default: true)
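The same request can be assembled directly in TypeScript with the built-in FormData and fetch APIs (Node 18+). The sketch below is illustrative: the form field names mirror the request format above, but the helper and option names are not part of the service's client package.

```typescript
// Options mirroring the optional form fields listed above.
interface ExtractOptions {
  maxPages?: number;
  minColumns?: number;
  maxRows?: number;
  useCamelot?: boolean;
}

// Build the multipart body for POST /extract-tables.
// Only options that were actually set are appended.
function buildExtractForm(pdf: Blob, filename: string, opts: ExtractOptions = {}): FormData {
  const form = new FormData();
  form.append('file', pdf, filename);
  if (opts.maxPages !== undefined) form.append('max_pages', String(opts.maxPages));
  if (opts.minColumns !== undefined) form.append('min_columns', String(opts.minColumns));
  if (opts.maxRows !== undefined) form.append('max_rows', String(opts.maxRows));
  if (opts.useCamelot !== undefined) form.append('use_camelot', String(opts.useCamelot));
  return form;
}

// Usage: upload a PDF and parse the JSON response.
async function postExtractTables(baseUrl: string, pdf: Blob, filename: string, opts?: ExtractOptions) {
  const res = await fetch(`${baseUrl}/extract-tables`, {
    method: 'POST',
    body: buildExtractForm(pdf, filename, opts),
  });
  if (!res.ok) throw new Error(`extract-tables failed: HTTP ${res.status}`);
  return res.json();
}
```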

Response Format

{
  "page_count": 10,
  "tables": [
    {
      "page": 1,
      "index": 0,
      "columns": ["Product", "Quantity", "Price", "Total"],
      "rows": [
        ["Widget A", "10", "$5.00", "$50.00"],
        ["Widget B", "5", "$10.00", "$50.00"],
        ["Widget C", "20", "$2.50", "$50.00"]
      ],
      "row_count": 3,
      "preview": "| Product | Quantity | Price | Total |\n|..."
    },
    {
      "page": 3,
      "index": 1,
      "columns": ["Date", "Description", "Amount"],
      "rows": [
        ["2024-01-15", "Invoice #123", "$150.00"],
        ["2024-01-20", "Invoice #124", "$200.00"]
      ],
      "row_count": 2
    }
  ]
}
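For type-safe handling in TypeScript, the response can be modelled with interfaces like the following. The field names match the JSON shape above; the interface names themselves are illustrative (the client package may export its own definitions).

```typescript
// Illustrative types matching the response JSON above.
interface ExtractedTable {
  page: number;         // 1-based page number where the table was found
  index: number;        // table index within the document
  columns: string[];    // column headers
  rows: string[][];     // row data, aligned to columns; empty cells are ''
  row_count: number;
  preview?: string;     // Markdown preview (not present on every table)
}

interface ExtractTablesResponse {
  page_count: number;
  tables: ExtractedTable[];
}
```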

TypeScript Client

Using TableExtractionClient
import { TableExtractionClient } from '@/services/gpu/tableExtractionClient';

const client = new TableExtractionClient('http://table-extractor:8098');

// Health check
const healthy = await client.healthCheck();

// Extract tables from PDF
const result = await client.extractTables({
  pdfPath: '/path/to/document.pdf',
  maxPages: 10,
  minColumns: 2,
  maxRows: 500,
  useCamelot: true,
});

// Process extracted tables
console.log(`Found ${result.tables.length} tables in ${result.page_count} pages`);

for (const table of result.tables) {
  console.log(`Table on page ${table.page}:`);
  console.log('Columns:', table.columns.join(', '));
  console.log('Rows:', table.row_count);

  // Convert to objects (empty cells fall back to '')
  const objects = table.rows.map(row =>
    Object.fromEntries(
      table.columns.map((col, i) => [col, row[i] ?? ''])
    )
  );
  console.log('Data:', objects);
}
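The preview field shows tables rendered as Markdown, which is a convenient format when feeding extracted tables to an LLM in a RAG pipeline. A full Markdown rendering can be produced from columns and rows directly; this is a sketch, with ExtractedTable reduced to the two fields it needs from the response format above:

```typescript
// Minimal shape of an extracted table (matches the response format).
interface ExtractedTable {
  columns: string[];
  rows: string[][];
}

// Render a table as a GitHub-style Markdown table, escaping pipes in cells.
function tableToMarkdown(table: ExtractedTable): string {
  const escape = (cell: string) => cell.replace(/\|/g, '\\|');
  const line = (cells: string[]) => `| ${cells.map(escape).join(' | ')} |`;
  const separator = `| ${table.columns.map(() => '---').join(' | ')} |`;
  return [line(table.columns), separator, ...table.rows.map(line)].join('\n');
}
```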

cURL Example

curl -X POST http://table-extractor:8098/extract-tables \
  -F "file=@document.pdf" \
  -F "max_pages=10" \
  -F "min_columns=2" \
  -F "use_camelot=true"

Convert to JSON/CSV

Export table data
// Convert table to array of objects
function tableToObjects(table: ExtractedTable) {
  return table.rows.map(row =>
    Object.fromEntries(
      table.columns.map((col, i) => [col, row[i] ?? ''])
    )
  );
}

// Convert table to CSV string (quote and escape headers as well as cells)
function tableToCsv(table: ExtractedTable) {
  const escape = (cell: string) => `"${cell.replace(/"/g, '""')}"`;
  const header = table.columns.map(escape).join(',');
  const rows = table.rows.map(row => row.map(escape).join(','));
  return [header, ...rows].join('\n');
}

// Usage
const objects = tableToObjects(result.tables[0]);
const csv = tableToCsv(result.tables[0]);

Tips

  • Table detection: Camelot works best with clearly bordered tables.
  • Scanned PDFs: For scanned documents, ensure good OCR quality first.
  • Large PDFs: Use max_pages to limit processing time.
  • Empty cells: Empty cells are returned as empty strings in the rows array.