ColPali (Document Vision)
High-performance vision-language embedding service for document understanding. ColPali generates multi-vector embeddings from document images, enabling visual document retrieval.
Port: 8001
Model: vidore/colqwen2-v1.0-hf
What it Does
ColPali treats documents as images and generates embeddings that capture both text and visual layout. This is ideal for:
- Document retrieval based on visual similarity
- Finding documents by layout patterns
- Cross-modal search (text query → document image)
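Multi-vector embeddings of this kind are typically compared with late-interaction (MaxSim) scoring, as in the ColPali model family: each query vector is matched to its best document vector, and the per-query maxima are summed. The sketch below is a minimal pure-Python illustration, not this service's implementation. `maxsim_score`, `rank_documents`, and the toy vectors are hypothetical, and the endpoints documented here do not show how query vectors are obtained.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late-interaction score: for each query vector, take the best-matching
    document vector (by dot product), then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank_documents(query_vecs: List[Vector], docs: Dict[str, List[Vector]]) -> List[str]:
    """Return document ids sorted by descending MaxSim score.

    `docs` maps a document id to its multi-vector embedding."""
    scores = {doc_id: maxsim_score(query_vecs, vecs) for doc_id, vecs in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-dimensional vectors, made up for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "invoice": [[0.9, 0.1], [0.2, 0.8]],
    "memo": [[0.1, 0.2], [0.3, 0.1]],
}
ranking = rank_documents(query, docs)
```

Real embeddings have many more vectors per page and far higher dimensionality, so a production system would usually do this with a tensor library or a vector database that supports multi-vector scoring.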
API Endpoints
Generate Embeddings
Endpoint: POST /embed
curl -X POST "http://colpali-server:8001/embed" \
  -H "Content-Type: application/json" \
  -d '{
    "images": ["<base64-encoded-image>"],
    "batch_size": 8,
    "include_pooling": true
  }'
Request:
| Parameter | Type | Description |
|---|---|---|
| images | array | Base64-encoded document images |
| batch_size | number | Processing batch size (default: 8) |
| include_pooling | boolean | Include pooled embeddings |
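The request body described above can be assembled with a small helper. This is a minimal sketch, assuming the images are available as raw bytes. `build_embed_payload` is a hypothetical name, the input checks are client-side assumptions rather than documented server behavior, and only the three documented fields are used.

```python
import base64
import json
from typing import Dict, List


def build_embed_payload(
    images: List[bytes],
    batch_size: int = 8,
    include_pooling: bool = True,
) -> Dict[str, object]:
    """Build a JSON-serializable body for POST /embed from raw image bytes."""
    # Client-side sanity checks (assumed, not taken from the service docs).
    if not images:
        raise ValueError("at least one image is required")
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    return {
        "images": [base64.b64encode(img).decode("ascii") for img in images],
        "batch_size": batch_size,
        "include_pooling": include_pooling,
    }


# Placeholder bytes stand in for real image file contents.
payload = build_embed_payload([b"fake-image-bytes"], batch_size=4)
body = json.dumps(payload)
```

The resulting `body` string can be sent as the request body with any HTTP client.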
Response:
{
  "embeddings": [{
    "original": [[0.1, 0.2, ...]],
    "mean_pooling_rows": [0.15, ...],
    "mean_pooling_columns": [0.18, ...]
  }]
}
Health Check
curl http://colpali-server:8001/health
Metrics
curl http://colpali-server:8001/metrics
Usage Example
import base64

import requests

# Load the document image and base64-encode it
with open("document.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Generate embeddings
response = requests.post(
    "http://colpali-server:8001/embed",
    json={
        "images": [image_b64],
        "include_pooling": True,
    },
)
response.raise_for_status()  # fail early on HTTP errors

result = response.json()
# Pooled embedding for the first (and only) image in the request
embedding = result["embeddings"][0]["mean_pooling_rows"]
Performance
| Hardware | Batch Size | Images/sec |
|---|---|---|
| RTX 4090 | 8 | ~45 (INT4) |
| RTX 3080 | 8 | ~30 (INT4) |
| Jetson Orin | 4 | ~8 (INT4) |
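The approximate rates in the table can be turned into a rough time estimate for a bulk embedding job. The sketch below is simple arithmetic over the figures above, assuming throughput stays constant and ignoring network and image-decoding overhead. `estimate_seconds` and `num_batches` are hypothetical helpers.

```python
import math

# Approximate images/sec from the table above (INT4, batch sizes as listed).
IMAGES_PER_SEC = {
    "RTX 4090": 45,
    "RTX 3080": 30,
    "Jetson Orin": 8,
}


def estimate_seconds(num_images: int, hardware: str) -> float:
    """Rough wall-clock seconds to embed `num_images` at the tabulated rate."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    return num_images / IMAGES_PER_SEC[hardware]


def num_batches(num_images: int, batch_size: int) -> int:
    """Number of batches needed to process `num_images`."""
    return math.ceil(num_images / batch_size)


seconds_4090 = estimate_seconds(9000, "RTX 4090")  # 9000 / 45 = 200.0
batches_orin = num_batches(1000, 4)                # 1000 / 4 = 250
```

Since the table values are approximate, treat the result as an order-of-magnitude guide rather than a guarantee.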