Docling
Advanced PDF processing service that extracts text, tables, and figures from documents with intelligent chunking for RAG applications.
Service Info
- Port: 8080
- Internal URL: http://docling:8080
- Endpoint: /process
- Input: PDF (base64)
Use Cases
- Extract text from PDFs with layout preservation
- Parse tables into structured data
- Extract figures and images from documents
- Generate markdown output for documentation
- Chunk documents for vector database ingestion
Request Format
POST /process
{
  "pdf_base64": "JVBERi0xLjQKJeLj...",  // Base64-encoded PDF
  "source_name": "document.pdf",        // Filename for reference
  "collection_name": "my-collection",   // Collection identifier
  "client_id": "client-123",            // Optional client ID
  "chunk_size": 1000,                   // Characters per chunk (default: 1000)
  "chunk_overlap": 100,                 // Overlap between chunks (default: 100)
  "return_markdown": true,              // Return full markdown
  "include_plain_text": false           // Include plain text extraction
}
Response Format
{
  "success": true,
  "chunks": [
    {
      "content": "Extracted text content...",
      "page_no": 1,
      "chunk_id": 0,
      "type": "text",
      "label": "paragraph",
      "bbox": { "x0": 72, "y0": 100, "x1": 540, "y1": 200 }
    }
  ],
  "tables": [
    {
      "page": 2,
      "columns": ["Name", "Value"],
      "rows": [["Item 1", "100"], ["Item 2", "200"]]
    }
  ],
  "figures": [
    {
      "page": 3,
      "caption": "Figure 1: Architecture diagram",
      "image_base64": "..."
    }
  ],
  "markdown": "# Document Title\n\nContent...",
  "metadata": {
    "title": "Document Title",
    "author": "Author Name"
  },
  "page_count": 10,
  "processing_time": 2.5
}
TypeScript Client
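For callers that cannot import the project client, the request and response shapes documented above can be exercised with plain `fetch`. The following is a minimal sketch, not the service's official typings: the interface names and the helpers `buildProcessRequest`, `tableToRecords`, and `postProcess` are hypothetical, and which request fields are optional is an assumption based on the comments and the cURL example in this page. It assumes a Node 18+ runtime with global `fetch` and `Buffer`.

```typescript
// Hypothetical types mirroring the JSON shown under Request Format and
// Response Format. Optionality of fields is an assumption, not confirmed.
interface ProcessRequest {
  pdf_base64: string;
  source_name: string;
  collection_name: string;
  client_id?: string;
  chunk_size?: number;
  chunk_overlap?: number;
  return_markdown?: boolean;
  include_plain_text?: boolean;
}

interface DoclingTable {
  page: number;
  columns: string[];
  rows: string[][];
}

// Build the request body from raw PDF bytes. Defaults follow the documented
// chunk_size and chunk_overlap values; options can override them.
function buildProcessRequest(
  pdf: Uint8Array,
  sourceName: string,
  collectionName: string,
  options: Partial<ProcessRequest> = {},
): ProcessRequest {
  return {
    chunk_size: 1000,
    chunk_overlap: 100,
    ...options,
    pdf_base64: Buffer.from(pdf).toString('base64'),
    source_name: sourceName,
    collection_name: collectionName,
  };
}

// Turn one extracted table into an array of { columnName: cellValue } records.
function tableToRecords(table: DoclingTable): Record<string, string>[] {
  return table.rows.map((row) => {
    const record: Record<string, string> = {};
    table.columns.forEach((col, i) => {
      record[col] = row[i] ?? ''; // missing cells become empty strings
    });
    return record;
  });
}

// POST the body to the /process endpoint and return the parsed JSON response.
async function postProcess(baseUrl: string, body: ProcessRequest): Promise<unknown> {
  const res = await fetch(`${baseUrl}/process`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  if (!res.ok) {
    throw new Error(`Docling request failed: HTTP ${res.status}`);
  }
  return res.json();
}
```

A call such as `postProcess('http://docling:8080', buildProcessRequest(bytes, 'document.pdf', 'my-collection'))` would send the same payload as the cURL example, and `tableToRecords` can be applied to each entry of the response's `tables` array.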
Using DoclingClient
import { readFile } from 'node:fs/promises';
import { DoclingClient } from '@/services/gpu/doclingClient';

const client = new DoclingClient('http://docling:8080');

// Process from file path
const result = await client.processFileFromPath('/path/to/document.pdf', {
  collectionName: 'my-collection',
  chunkSize: 1000,
  chunkOverlap: 100,
  returnMarkdown: true,
});

// Or process from an in-memory buffer
const pdfBuffer = await readFile('document.pdf');
const bufferResult = await client.processPdfBuffer(pdfBuffer, {
  collectionName: 'my-collection',
  sourceName: 'document.pdf',
});

// Access extracted content (bufferResult has the same shape)
console.log('Chunks:', result.chunks.length);
console.log('Tables:', result.tables.length);
console.log('Figures:', result.figures.length);
console.log('Pages:', result.page_count);
cURL Example
# Encode via stdin and strip line wraps so this works with both GNU and BSD base64
curl -X POST http://docling:8080/process \
  -H "Content-Type: application/json" \
  -d '{
    "pdf_base64": "'"$(base64 < document.pdf | tr -d '\n')"'",
    "source_name": "document.pdf",
    "collection_name": "my-collection",
    "chunk_size": 1000,
    "return_markdown": true
  }'
Chunking Strategy
Recommended settings for RAG:
- chunk_size: 1000 - Good balance for most LLM context windows
- chunk_overlap: 100 - 10% overlap preserves context at boundaries
- return_markdown: true - Enables better formatting in responses
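To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of a character-window chunker. It is an illustration only: this page does not document how the service actually splits text (it may well respect layout elements such as paragraphs and tables), and the function name `chunkText` is hypothetical.

```typescript
// Naive character-window chunker: each chunk holds up to chunkSize characters
// and starts (chunkSize - chunkOverlap) characters after the previous one,
// so consecutive chunks share chunkOverlap characters.
function chunkText(text: string, chunkSize = 1000, chunkOverlap = 100): string[] {
  if (chunkSize <= 0) {
    throw new RangeError('chunkSize must be positive');
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('chunkOverlap must be in [0, chunkSize)');
  }
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}
```

Under this scheme a 2,500-character document with the recommended settings yields three chunks starting at offsets 0, 900, and 1,800, so a sentence that straddles a boundary appears in full in at least one chunk as long as it is shorter than the overlap.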
