Docling

Advanced PDF processing service that extracts text, tables, and figures from documents with intelligent chunking for RAG applications.

Service Info

Port: 8080
Internal URL: http://docling:8080
Endpoint: /process
Input: PDF (base64)

Use Cases

Extract text from PDFs with layout preservation
Parse tables into structured data
Extract figures and images from documents
Generate markdown output for documentation
Chunk documents for vector database ingestion

Request Format

POST /process

{
  "pdf_base64": "JVBERi0xLjQKJeLj...",  // Base64-encoded PDF
  "source_name": "document.pdf",         // Filename for reference
  "collection_name": "my-collection",    // Collection identifier
  "client_id": "client-123",             // Optional client ID
  "chunk_size": 1000,                    // Characters per chunk (default: 1000)
  "chunk_overlap": 100,                  // Overlap between chunks (default: 100)
  "return_markdown": true,               // Return full markdown
  "include_plain_text": false            // Include plain text extraction
}
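The `//` comments in the request above are annotations only; JSON itself does not allow comments, so the body sent to the service must omit them. A minimal sketch of building that body in TypeScript follows. The `ProcessRequest` interface and `buildProcessRequest` helper are hypothetical names introduced for illustration; the field names come from the request format above, and which fields beyond `client_id` are optional is an assumption.

```typescript
// Hypothetical type mirroring the documented request fields.
interface ProcessRequest {
  pdf_base64: string;
  source_name: string;
  collection_name: string;
  client_id?: string;
  chunk_size?: number;         // assumed optional because a default (1000) is documented
  chunk_overlap?: number;      // assumed optional because a default (100) is documented
  return_markdown?: boolean;   // assumed optional
  include_plain_text?: boolean; // assumed optional
}

type ProcessOptions = Partial<
  Omit<ProcessRequest, 'pdf_base64' | 'source_name' | 'collection_name'>
>;

// Hypothetical helper: encodes raw PDF bytes and assembles the request body.
function buildProcessRequest(
  pdf: Uint8Array,
  sourceName: string,
  collectionName: string,
  options: ProcessOptions = {},
): ProcessRequest {
  return {
    pdf_base64: Buffer.from(pdf).toString('base64'), // Node.js Buffer
    source_name: sourceName,
    collection_name: collectionName,
    ...options,
  };
}

// Example: the four bytes "%PDF" stand in for a real file's contents.
const exampleRequest = buildProcessRequest(
  new Uint8Array([0x25, 0x50, 0x44, 0x46]),
  'document.pdf',
  'my-collection',
  { chunk_size: 1000, chunk_overlap: 100, return_markdown: true },
);

// JSON.stringify produces a comment-free body suitable for POST /process.
const exampleBody = JSON.stringify(exampleRequest);
```

A valid PDF begins with the bytes `%PDF`, which is why the base64 payload in the request above starts with `JVBER`.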

Response Format

{
  "success": true,
  "chunks": [
    {
      "content": "Extracted text content...",
      "page_no": 1,
      "chunk_id": 0,
      "type": "text",
      "label": "paragraph",
      "bbox": { "x0": 72, "y0": 100, "x1": 540, "y1": 200 }
    }
  ],
  "tables": [
    {
      "page": 2,
      "columns": ["Name", "Value"],
      "rows": [["Item 1", "100"], ["Item 2", "200"]]
    }
  ],
  "figures": [
    {
      "page": 3,
      "caption": "Figure 1: Architecture diagram",
      "image_base64": "..."
    }
  ],
  "markdown": "# Document Title\n\nContent...",
  "metadata": {
    "title": "Document Title",
    "author": "Author Name"
  },
  "page_count": 10,
  "processing_time": 2.5
}
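The response shape can be captured as TypeScript types, which also makes the `tables` entries easy to turn into keyed records. This is a sketch inferred from the single example above, not a published schema: the interface names and the `tableToRecords` helper are hypothetical, and any field may turn out to be optional or nullable in practice.

```typescript
// Hypothetical types inferred from the example response.
interface DoclingChunk {
  content: string;
  page_no: number;
  chunk_id: number;
  type: string;
  label: string;
  bbox?: { x0: number; y0: number; x1: number; y1: number };
}

interface DoclingTable {
  page: number;
  columns: string[];
  rows: string[][];
}

interface DoclingFigure {
  page: number;
  caption: string;
  image_base64: string;
}

interface DoclingResponse {
  success: boolean;
  chunks: DoclingChunk[];
  tables: DoclingTable[];
  figures: DoclingFigure[];
  markdown?: string;
  metadata: Record<string, string>;
  page_count: number;
  processing_time: number;
}

// Hypothetical helper: converts a table's rows into objects keyed by column name.
// Cells missing from a short row are filled with an empty string.
function tableToRecords(table: DoclingTable): Record<string, string>[] {
  return table.rows.map((row) => {
    const record: Record<string, string> = {};
    table.columns.forEach((column, index) => {
      record[column] = row[index] ?? '';
    });
    return record;
  });
}

// Using the table from the example response:
const exampleTable: DoclingTable = {
  page: 2,
  columns: ['Name', 'Value'],
  rows: [['Item 1', '100'], ['Item 2', '200']],
};
const exampleRecords = tableToRecords(exampleTable);
// exampleRecords[0] is { Name: 'Item 1', Value: '100' }
```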

TypeScript Client

Using DoclingClient

import { promises as fs } from 'node:fs';
import { DoclingClient } from '@/services/gpu/doclingClient';

const client = new DoclingClient('http://docling:8080');

// Process from file path
const result = await client.processFileFromPath('/path/to/document.pdf', {
  collectionName: 'my-collection',
  chunkSize: 1000,
  chunkOverlap: 100,
  returnMarkdown: true,
});

// Or process from a buffer (distinct name, since `result` is already declared above)
const pdfBuffer = await fs.readFile('document.pdf');
const bufferResult = await client.processPdfBuffer(pdfBuffer, {
  collectionName: 'my-collection',
  sourceName: 'document.pdf',
});

// Access extracted content
console.log('Chunks:', result.chunks.length);
console.log('Tables:', result.tables.length);
console.log('Figures:', result.figures.length);
console.log('Pages:', result.page_count);
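For the vector database use case, the returned chunks usually need a stable ID and some metadata before ingestion. The sketch below shows one way to map them. The `VectorRecord` shape, the `toVectorRecords` helper, and the ID scheme are all hypothetical and not part of the Docling service or `DoclingClient`; adapt them to whatever your vector store expects.

```typescript
// Subset of the chunk fields shown in the response format.
interface ChunkLike {
  content: string;
  page_no: number;
  chunk_id: number;
  type: string;
  label: string;
}

// Hypothetical record shape for a vector store; not defined by the service.
interface VectorRecord {
  id: string;
  text: string;
  metadata: {
    source: string;
    collection: string;
    page: number;
    type: string;
    label: string;
  };
}

// Hypothetical helper: skips whitespace-only chunks and derives a
// deterministic ID from collection, source, and chunk_id so that
// re-ingesting the same document overwrites rather than duplicates.
function toVectorRecords(
  chunks: ChunkLike[],
  sourceName: string,
  collectionName: string,
): VectorRecord[] {
  return chunks
    .filter((chunk) => chunk.content.trim().length > 0)
    .map((chunk) => ({
      id: `${collectionName}:${sourceName}:${chunk.chunk_id}`,
      text: chunk.content,
      metadata: {
        source: sourceName,
        collection: collectionName,
        page: chunk.page_no,
        type: chunk.type,
        label: chunk.label,
      },
    }));
}

const sampleRecords = toVectorRecords(
  [
    { content: 'Extracted text content...', page_no: 1, chunk_id: 0, type: 'text', label: 'paragraph' },
    { content: '   ', page_no: 1, chunk_id: 1, type: 'text', label: 'paragraph' },
  ],
  'document.pdf',
  'my-collection',
);
```

Embedding the text and writing the records is left to the vector store's own client.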

cURL Example

curl -X POST http://docling:8080/process \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_base64": "'"$(base64 < document.pdf | tr -d '\n')"'",
    "source_name": "document.pdf",
    "collection_name": "my-collection",
    "chunk_size": 1000,
    "return_markdown": true
  }'

Chunking Strategy

Recommended settings for RAG:

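chunk_size: 1000 - About 1000 characters per chunk (the default); small enough that several retrieved chunks fit in a typical LLM prompt
chunk_overlap: 100 - 10% of chunk_size (the default); text cut at a chunk boundary also appears at the start of the next chunk
return_markdown: true - Returns the full document as markdown, which keeps structure such as headings when the content is shown in responses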
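To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal character-window chunker. It is an illustration of the two parameters only, not the service's implementation: the response's `label` and `bbox` fields suggest the real chunker is layout-aware, and `chunkText` is a hypothetical name.

```typescript
// Illustrative sliding-window chunker (assumption: plain character windows).
// Each window starts (chunkSize - chunkOverlap) characters after the previous one.
function chunkText(text: string, chunkSize = 1000, chunkOverlap = 100): string[] {
  if (chunkSize <= 0) {
    throw new RangeError('chunkSize must be positive');
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('chunkOverlap must be in [0, chunkSize)');
  }

  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];

  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window reaches the end, so no chunk is pure overlap.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// A 2,500-character sample text built from a repeating alphabet.
const sampleText = Array.from({ length: 2500 }, (_, i) =>
  String.fromCharCode(97 + (i % 26)),
).join('');

const sampleChunks = chunkText(sampleText); // defaults: 1000 / 100
```

With the recommended settings each window advances 900 characters, so the 2,500-character sample yields three chunks of 1000, 1000, and 700 characters, and the last 100 characters of each chunk repeat as the first 100 of the next.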