Docling

Advanced PDF processing service that extracts text, tables, and figures from documents with intelligent chunking for RAG applications.

Service Info

  • Port: 8080
  • Internal URL: http://docling:8080
  • Endpoint: /process
  • Input: PDF (base64)

Use Cases

  • Extract text from PDFs with layout preservation
  • Parse tables into structured data
  • Extract figures and images from documents
  • Generate markdown output for documentation
  • Chunk documents for vector database ingestion

Request Format

POST /process
{
  "pdf_base64": "JVBERi0xLjQKJeLj...",  // Base64-encoded PDF
  "source_name": "document.pdf",         // Filename for reference
  "collection_name": "my-collection",    // Collection identifier
  "client_id": "client-123",             // Optional client ID
  "chunk_size": 1000,                    // Characters per chunk (default: 1000)
  "chunk_overlap": 100,                  // Overlap between chunks (default: 100)
  "return_markdown": true,               // Return full markdown
  "include_plain_text": false            // Include plain text extraction
}
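
The `//` annotations in the request above are explanatory only; the body actually sent must be plain JSON. As a minimal sketch, assuming a Node.js runtime (for `Buffer`), the payload could be assembled as below. The `DoclingProcessRequest` type and `buildProcessRequest` helper are hypothetical names used for illustration, and which fields the service treats as required is an assumption; only `client_id` is marked optional in the format above.

```typescript
// Hypothetical type mirroring the documented request fields.
// Required vs. optional is assumed, not confirmed by the service.
interface DoclingProcessRequest {
  pdf_base64: string;
  source_name: string;
  collection_name: string;
  client_id?: string;
  chunk_size?: number;
  chunk_overlap?: number;
  return_markdown?: boolean;
  include_plain_text?: boolean;
}

type DoclingProcessOptions = Partial<
  Omit<DoclingProcessRequest, 'pdf_base64' | 'source_name' | 'collection_name'>
>;

// Hypothetical helper: base64-encodes the PDF bytes and applies the
// documented defaults (chunk_size 1000, chunk_overlap 100) unless overridden.
function buildProcessRequest(
  pdf: Uint8Array,
  sourceName: string,
  collectionName: string,
  options: DoclingProcessOptions = {},
): DoclingProcessRequest {
  return {
    pdf_base64: Buffer.from(pdf).toString('base64'),
    source_name: sourceName,
    collection_name: collectionName,
    chunk_size: 1000,
    chunk_overlap: 100,
    ...options,
  };
}
```

The result can be passed through `JSON.stringify` and sent as the body of `POST /process`.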

Response Format

{
  "success": true,
  "chunks": [
    {
      "content": "Extracted text content...",
      "page_no": 1,
      "chunk_id": 0,
      "type": "text",
      "label": "paragraph",
      "bbox": { "x0": 72, "y0": 100, "x1": 540, "y1": 200 }
    }
  ],
  "tables": [
    {
      "page": 2,
      "columns": ["Name", "Value"],
      "rows": [["Item 1", "100"], ["Item 2", "200"]]
    }
  ],
  "figures": [
    {
      "page": 3,
      "caption": "Figure 1: Architecture diagram",
      "image_base64": "..."
    }
  ],
  "markdown": "# Document Title\n\nContent...",
  "metadata": {
    "title": "Document Title",
    "author": "Author Name"
  },
  "page_count": 10,
  "processing_time": 2.5
}
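
Each entry in `tables` carries its header in `columns` and its data in `rows`, so turning a table into keyed records is a small transformation. The sketch below assumes the table shape shown in the response above; `DoclingTable` and `tableToRecords` are hypothetical names, not part of the service or its client.

```typescript
// Hypothetical type for one entry of the "tables" array in the response.
interface DoclingTable {
  page: number;
  columns: string[];
  rows: string[][];
}

// Convert a table into one object per row, keyed by column name.
// Cells missing from a short row are filled with an empty string.
function tableToRecords(table: DoclingTable): Record<string, string>[] {
  return table.rows.map((row) => {
    const record: Record<string, string> = {};
    table.columns.forEach((column, i) => {
      record[column] = row[i] ?? '';
    });
    return record;
  });
}
```

Applied to the example table above, this yields one record per row, with `Name` and `Value` as keys.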

TypeScript Client

Using DoclingClient
import { promises as fs } from 'fs';
import { DoclingClient } from '@/services/gpu/doclingClient';

const client = new DoclingClient('http://docling:8080');

// Process from file path
const result = await client.processFileFromPath('/path/to/document.pdf', {
  collectionName: 'my-collection',
  chunkSize: 1000,
  chunkOverlap: 100,
  returnMarkdown: true,
});

// Or process from an in-memory buffer
// (separate variable name: `result` is already declared above)
const pdfBuffer = await fs.readFile('document.pdf');
const bufferResult = await client.processPdfBuffer(pdfBuffer, {
  collectionName: 'my-collection',
  sourceName: 'document.pdf',
});

// Access extracted content
console.log('Chunks:', result.chunks.length);
console.log('Tables:', result.tables.length);
console.log('Figures:', result.figures.length);
console.log('Pages:', result.page_count);

cURL Example

curl -X POST http://docling:8080/process \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_base64": "'$(base64 < document.pdf | tr -d '\n')'",
    "source_name": "document.pdf",
    "collection_name": "my-collection",
    "chunk_size": 1000,
    "return_markdown": true
  }'

Chunking Strategy

Recommended settings for RAG:

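  • chunk_size: 1000 - About 1000 characters per chunk; a good balance for most LLM context windows
  • chunk_overlap: 100 - 10% of chunk_size, so text near a chunk boundary appears in both neighboring chunks
  • return_markdown: true - Returns the full document as markdown, which keeps headings and tables readable in responses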
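
To see how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch of a plain character-based sliding window. It is an illustration only: `chunkText` is a hypothetical function, and the service's own chunker is not documented here and may split on layout elements rather than raw character offsets.

```typescript
// Illustrative sliding-window chunker (not the service's implementation).
// Each chunk starts (chunkSize - chunkOverlap) characters after the previous one,
// so consecutive chunks share chunkOverlap characters.
function chunkText(text: string, chunkSize = 1000, chunkOverlap = 100): string[] {
  if (chunkSize <= 0) {
    throw new RangeError('chunkSize must be positive');
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('chunkOverlap must be in [0, chunkSize)');
  }
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

With the recommended settings, each new chunk advances 900 characters, so a 2500-character text produces three chunks starting at offsets 0, 900, and 1800.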