Jina Embeddings
Multimodal embedding service that generates dense vectors for both text and images. Returns 2048-dimensional vectors suitable for semantic search.
What Are Multimodal Embeddings?
Multimodal embeddings are numerical representations that map different types of content, such as text and images, into a shared vector space. When two pieces of content are semantically similar, their vectors will be close together regardless of whether they are text, images, or a combination of both. This enables powerful cross-modal retrieval: you can search for images using text descriptions, or find related text passages given an image input.
The Jina Embeddings service in IntelligenceBox uses the Jina CLIP model to produce 2048-dimensional dense vectors. Dense vectors encode meaning across all dimensions, capturing nuanced semantic relationships that simpler keyword-based methods miss. This makes them particularly effective for understanding natural language queries and visual content where exact keyword matching falls short.
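Because the service L2-normalizes its output by default, cosine similarity between two embeddings reduces to a plain dot product. A minimal illustration of that equivalence (toy 4-dimensional vectors stand in for the real 2048-dimensional ones):

```typescript
// Cosine similarity between two L2-normalized vectors is just their dot
// product, which is how nearest-neighbor search engines compare them.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(dot(v, v));
  return v.map((x) => x / norm);
}

// Toy vectors standing in for real 2048-dim embeddings.
const query = l2Normalize([0.2, 0.8, 0.1, 0.3]);
const doc = l2Normalize([0.25, 0.75, 0.05, 0.35]);
console.log(dot(query, doc).toFixed(3)); // ≈ 0.99 — near 1.0 for similar content
```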
Jina Embeddings vs. FastEmbed
IntelligenceBox provides two embedding services. Jina Embeddings produces dense vectors that capture deep semantic meaning and supports both text and image inputs. FastEmbed, on the other hand, produces sparse vectors optimized for keyword-level matching. Choose Jina Embeddings when you need cross-modal search, visual similarity, or rich semantic understanding. Choose FastEmbed when precise keyword matching or domain-specific terminology recall is more important. For the best retrieval quality, combine both in a hybrid search pipeline where dense vectors handle semantic understanding and sparse vectors ensure keyword coverage.
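One common way to build such a hybrid pipeline is reciprocal rank fusion (RRF), which merges the dense and sparse ranked result lists without having to reconcile their score scales. A sketch only; the document ids and the `k` constant are illustrative and not part of either service's API:

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank + 1) per
// document, so items ranked well by both dense and sparse search rise to
// the top of the fused ordering.
function rrfFuse(dense: string[], sparse: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranked of [dense, sparse]) {
    ranked.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// "doc-b" appears in both lists, so it outranks the single-list hits.
console.log(rrfFuse(['doc-a', 'doc-b'], ['doc-b', 'doc-c'])[0]); // "doc-b"
```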
Supported Modalities
Jina Embeddings supports the following input types, all projected into the same 2048-dimensional space:
- Text: Passages, queries, titles, and document chunks via the /text endpoint
- Images: PNG, JPEG, and WebP images as base64-encoded strings via the /image endpoint
- Document pages: Rendered PDF pages treated as images for visual document retrieval
Service Info
- Port: 8080
- Internal URL: http://jina:8080
- Endpoint: /image
- Vector Dim: 2048
Use Cases
- Semantic image search
- Cross-modal retrieval (text-to-image, image-to-text)
- Visual similarity detection
- Document page embeddings for RAG
- Image clustering and classification
Request Format
{
  "images": [
    "base64-encoded-image-1",
    "base64-encoded-image-2"
  ],
  "normalize": true  // L2-normalize vectors (default: true)
}

Response Format
{
  "success": true,
  "embeddings": [
    [0.023, -0.041, 0.087, ...],  // 2048-dim vector
    [0.012, -0.033, 0.056, ...]   // 2048-dim vector
  ],
  "count": 2
}

TypeScript Client
import { JinaClient } from '@/services/gpu/jinaClient';

const client = new JinaClient('http://jina:8080');

// Generate embeddings for images
// (imageBase64_1 and imageBase64_2 are base64-encoded image strings)
const result = await client.embedImages(
  [imageBase64_1, imageBase64_2],
  { normalize: true }
);

// Access embeddings
for (const embedding of result.embeddings) {
  console.log('Dimension:', embedding.length); // 2048
  console.log('Vector:', embedding.slice(0, 5)); // First 5 values
}

// Store in vector database
await qdrant.upsert('my-collection', {
  points: result.embeddings.map((vec, i) => ({
    id: i,
    vector: vec,
    payload: { image_id: i }
  }))
});

cURL Example
curl -X POST http://jina:8080/image \
  -H "Content-Type: application/json" \
  -d '{
    "images": ["'"$(base64 -i image.png)"'"],
    "normalize": true
  }'

Text Embeddings
For text embeddings, use the /text endpoint:
{
  "texts": [
    "First text to embed",
    "Second text to embed"
  ],
  "normalize": true
}

The response has the same format as the image embeddings response.

Notes
- Normalization: Keep normalize: true for cosine similarity search.
- Image format: Supports PNG, JPEG, and WebP as base64 strings.
- Batch size: Process multiple images in one request for efficiency.
- Cross-modal: Text and image embeddings live in the same vector space.
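Putting the notes above together, a text-to-image search can embed the query via the /text endpoint and rank pre-computed image vectors by dot product (valid as cosine similarity because the vectors are normalized). A sketch assuming the documented request/response shapes; the `Indexed` type and helper names are illustrative:

```typescript
// Cross-modal retrieval sketch: text query in, image ids out.
type Indexed = { imageId: string; vector: number[] };

// Embed a query string via the /text endpoint. Assumes the documented
// { embeddings: number[][] } response shape.
async function embedQuery(query: string): Promise<number[]> {
  const res = await fetch('http://jina:8080/text', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ texts: [query], normalize: true }),
  });
  const data = await res.json();
  return data.embeddings[0];
}

// Rank stored image embeddings against the query vector; the dot product
// is the cosine similarity since all vectors are L2-normalized.
function rankImages(queryVec: number[], index: Indexed[]): Indexed[] {
  const dot = (a: number[], b: number[]) =>
    a.reduce((s, v, i) => s + v * b[i], 0);
  return [...index].sort(
    (a, b) => dot(queryVec, b.vector) - dot(queryVec, a.vector)
  );
}
```

In practice the ranking step would run inside a vector database rather than in application code; the in-memory index here just makes the cross-modal comparison explicit.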
