ColPali (Document Vision)

High-performance vision-language embedding service for document understanding. ColPali generates multi-vector embeddings from document images, enabling visual document retrieval.

Port: 8001
Model: vidore/colqwen2-v1.0-hf

What it Does

ColPali treats documents as images and generates embeddings that capture both text and visual layout. This is ideal for:

  • Document retrieval based on visual similarity
  • Finding documents by layout patterns
  • Cross-modal search (text query → document image)
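Multi-vector embeddings from ColPali-style models are usually compared with late-interaction (MaxSim) scoring: each query vector is matched to its most similar document vector, and the maxima are summed. The sketch below illustrates that scoring in plain Python. This page does not describe how query embeddings are produced, so the query vectors are assumed inputs, and `maxsim_score` and `rank_documents` are hypothetical helper names, not part of the service.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vectors: Sequence[Vector], doc_vectors: Sequence[Vector]) -> float:
    """Late-interaction score: for every query vector, take the best-matching
    document vector (highest dot product), then sum those maxima."""
    if not query_vectors or not doc_vectors:
        raise ValueError("both inputs need at least one vector")
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


def rank_documents(query_vectors: Sequence[Vector],
                   docs: Dict[str, Sequence[Vector]]) -> List[str]:
    """Return document ids ordered from highest to lowest MaxSim score."""
    scores = {doc_id: maxsim_score(query_vectors, vecs) for doc_id, vecs in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

In practice the document vectors would come from the `original` field of the `/embed` response, and a vector database with multi-vector support would likely replace the brute-force loop.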

API Endpoints

Generate Embeddings

Endpoint: POST /embed

curl -X POST "http://colpali-server:8001/embed" \
  -H "Content-Type: application/json" \
  -d '{
    "images": ["<base64-encoded-image>"],
    "batch_size": 8,
    "include_pooling": true
  }'

Request:

| Parameter | Type | Description |
| --- | --- | --- |
| images | array | Base64-encoded document images |
| batch_size | number | Processing batch size (default: 8) |
| include_pooling | boolean | Include pooled embeddings |
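As a rough illustration of the request shape, the helper below builds the JSON body from raw image bytes. `build_embed_payload` is a hypothetical name. Defaulting `include_pooling` to `True` is this sketch's choice, since the page does not state the server-side default for that field.

```python
import base64
import json
from typing import Iterable


def build_embed_payload(images: Iterable[bytes],
                        batch_size: int = 8,
                        include_pooling: bool = True) -> dict:
    """Build a request body for POST /embed from raw image bytes.

    batch_size=8 mirrors the documented default; include_pooling=True is
    an assumption made for this sketch only.
    """
    encoded = [base64.b64encode(img).decode("ascii") for img in images]
    if not encoded:
        raise ValueError("at least one image is required")
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return {
        "images": encoded,
        "batch_size": batch_size,
        "include_pooling": include_pooling,
    }


# Example: serialize a payload for a single (dummy) image
body = json.dumps(build_embed_payload([b"abc"]))
```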

Response:

{
  "embeddings": [{
    "original": [[0.1, 0.2, ...]],
    "mean_pooling_rows": [0.15, ...],
    "mean_pooling_columns": [0.18, ...]
  }]
}
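To sanity-check a response before storing it, a small summary of each entry can help. The sketch below relies only on the field names shown in the example response. It assumes `original` is a list of equal-length vectors and simply reports whether the pooled fields are present. `summarize_embeddings` is a hypothetical helper.

```python
from typing import Any, Dict, List


def summarize_embeddings(response_body: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Report vector count, dimension, and pooled-field presence per image."""
    summaries = []
    for item in response_body.get("embeddings", []):
        original = item.get("original", [])
        summaries.append({
            "num_vectors": len(original),
            # Dimension is read from the first vector (assumed uniform)
            "dim": len(original[0]) if original else 0,
            "has_row_pooling": "mean_pooling_rows" in item,
            "has_column_pooling": "mean_pooling_columns" in item,
        })
    return summaries
```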

Health Check

Endpoint: GET /health

curl http://colpali-server:8001/health
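A client may want to wait for the service to come up before sending images. The sketch below polls the health endpoint using only the standard library. The response body of `/health` is not documented here, so the code treats an HTTP 200 status as healthy, which is an assumption. `http_status` and `wait_until_healthy` are hypothetical helpers.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_healthy(url: str,
                       attempts: int = 10,
                       delay: float = 2.0,
                       get_status: Callable[[str], int] = http_status,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `url` until it returns 200 (assumed to mean healthy) or attempts run out."""
    for attempt in range(attempts):
        if get_status(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Usage might look like `wait_until_healthy("http://colpali-server:8001/health")` at client startup. The injectable `get_status` and `sleep` arguments exist only to make the function easy to test.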

Metrics

Endpoint: GET /metrics

curl http://colpali-server:8001/metrics

Usage Example

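import base64

import requests

# Load the document image and base64-encode it
with open("document.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Generate embeddings (the 60-second timeout is an arbitrary client-side choice)
response = requests.post(
    "http://colpali-server:8001/embed",
    json={
        "images": [image_b64],
        "include_pooling": True
    },
    timeout=60,
)
response.raise_for_status()  # fail early on HTTP errors

result = response.json()
# Row-pooled embedding of the first (and only) image
embedding = result["embeddings"][0]["mean_pooling_rows"]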
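One possible use of a pooled embedding is a quick similarity check between two documents. The sketch below computes cosine similarity and assumes each pooled embedding is a flat list of floats, as the example response suggests. Whether the pooled vectors are intended for this kind of comparison is not stated on this page, and `cosine_similarity` is a hypothetical helper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
```

Given two pooled embeddings `emb_a` and `emb_b` obtained as in the usage example, `cosine_similarity(emb_a, emb_b)` returns a value in [-1, 1], with higher values indicating more similar documents.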

Performance

| Hardware | Batch Size | Images/sec |
| --- | --- | --- |
| RTX 4090 | 8 | ~45 (INT4) |
| RTX 3080 | 8 | ~30 (INT4) |
| Jetson Orin | 4 | ~8 (INT4) |
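The throughput figures can be turned into a back-of-the-envelope estimate for indexing a corpus. The sketch below is plain arithmetic on the approximate numbers in the table. It ignores network transfer, image decoding, and warm-up, so real runs will likely take longer. `estimate_embedding_time` is a hypothetical helper.

```python
import math


def estimate_embedding_time(num_images: int,
                            images_per_sec: float,
                            batch_size: int) -> dict:
    """Estimate batch count and wall-clock seconds at a given throughput."""
    if num_images < 0 or images_per_sec <= 0 or batch_size < 1:
        raise ValueError("invalid input")
    return {
        "batches": math.ceil(num_images / batch_size),
        "seconds": num_images / images_per_sec,
    }


# Example: 10,000 pages at ~45 images/sec with batch size 8
estimate = estimate_embedding_time(10_000, 45.0, 8)
```

At the listed ~45 images/sec, 10,000 pages work out to 1,250 batches and roughly 222 seconds of pure embedding time.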