Jina Embeddings
Multimodal embedding service that generates dense vectors for both text and images. Returns 2048-dimensional vectors suitable for semantic search.
What Are Multimodal Embeddings?
Multimodal embeddings are numerical representations that map different types of content, such as text and images, into a shared vector space. When two pieces of content are semantically similar, their vectors will be close together regardless of whether they are text, images, or a combination of both. This enables powerful cross-modal retrieval: you can search for images using text descriptions, or find related text passages given an image input.
The Jina Embeddings service in IntelligenceBox uses the Jina CLIP model to produce 2048-dimensional dense vectors. Dense vectors encode meaning across all dimensions, capturing nuanced semantic relationships that simpler keyword-based methods miss. This makes them particularly effective for understanding natural language queries and visual content where exact keyword matching falls short.
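Because the service L2-normalizes its output by default, cosine similarity between two embeddings reduces to a plain dot product. A minimal illustration of that equivalence (toy 4-dimensional vectors stand in for the real 2048-dimensional ones):

```typescript
// Cosine similarity between two L2-normalized vectors is just their dot
// product, which is how nearest-neighbor search engines compare them.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(dot(v, v));
  return v.map((x) => x / norm);
}

// Toy vectors standing in for real 2048-dim embeddings.
const query = l2Normalize([0.2, 0.8, 0.1, 0.3]);
const doc = l2Normalize([0.25, 0.75, 0.05, 0.35]);
console.log(dot(query, doc).toFixed(3)); // ≈ 0.99 — near 1.0 for similar content
```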
Jina Embeddings vs. FastEmbed
IntelligenceBox provides two embedding services. Jina Embeddings produces dense vectors that capture deep semantic meaning and supports both text and image inputs. FastEmbed, on the other hand, produces sparse vectors optimized for keyword-level matching. Choose Jina Embeddings when you need cross-modal search, visual similarity, or rich semantic understanding. Choose FastEmbed when precise keyword matching or domain-specific terminology recall is more important. For the best retrieval quality, combine both in a hybrid search pipeline where dense vectors handle semantic understanding and sparse vectors ensure keyword coverage.
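One common way to build such a hybrid pipeline is reciprocal rank fusion (RRF), which merges the dense and sparse ranked result lists without having to reconcile their score scales. A sketch only; the document ids and the `k` constant are illustrative and not part of either service's API:

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank + 1) per
// document, so items ranked well by both dense and sparse search rise to
// the top of the fused ordering.
function rrfFuse(dense: string[], sparse: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranked of [dense, sparse]) {
    ranked.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// "doc-b" appears in both lists, so it outranks the single-list hits.
console.log(rrfFuse(['doc-a', 'doc-b'], ['doc-b', 'doc-c'])[0]); // "doc-b"
```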
Supported Modalities
Jina Embeddings supports the following input types, all projected into the same 2048-dimensional space:
- Text: Passages, queries, titles, and document chunks via the /text endpoint
- Images: PNG, JPEG, and WebP images as base64-encoded strings via the /image endpoint
- Document pages: Rendered PDF pages treated as images for visual document retrieval
Service Info
- Port: 8080
- Internal URL: http://jina:8080
- Endpoint: /image
- Vector Dim: 2048
Use Cases
- Semantic image search
- Cross-modal retrieval (text-to-image, image-to-text)
- Visual similarity detection
- Document page embeddings for RAG
- Image clustering and classification
Request Format
{
  "images": [
    "base64-encoded-image-1",
    "base64-encoded-image-2"
  ],
  "normalize": true  // L2-normalize vectors (default: true)
}

Response Format
{
  "success": true,
  "embeddings": [
    [0.023, -0.041, 0.087, ...],  // 2048-dim vector
    [0.012, -0.033, 0.056, ...]   // 2048-dim vector
  ],
  "count": 2
}

TypeScript Client
import { JinaClient } from '@/services/gpu/jinaClient';

const client = new JinaClient('http://jina:8080');

// Generate embeddings for images
// (imageBase64_1 and imageBase64_2 are base64-encoded image strings)
const result = await client.embedImages(
  [imageBase64_1, imageBase64_2],
  { normalize: true }
);

// Access embeddings
for (const embedding of result.embeddings) {
  console.log('Dimension:', embedding.length); // 2048
  console.log('Vector:', embedding.slice(0, 5)); // First 5 values
}

// Store in vector database
await qdrant.upsert('my-collection', {
  points: result.embeddings.map((vec, i) => ({
    id: i,
    vector: vec,
    payload: { image_id: i }
  }))
});

cURL Example
curl -X POST http://jina:8080/image \
  -H "Content-Type: application/json" \
  -d '{
    "images": ["'"$(base64 -i image.png)"'"],
    "normalize": true
  }'

Text Embeddings
For text embeddings, use the /text endpoint:
{
  "texts": [
    "First text to embed",
    "Second text to embed"
  ],
  "normalize": true
}

The response has the same format as the image embeddings response.

Notes
- Normalization: Keep normalize: true for cosine similarity search.
- Image format: Supports PNG, JPEG, and WebP as base64 strings.
- Batch size: Process multiple images in one request for efficiency.
- Cross-modal: Text and image embeddings live in the same vector space.
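Putting the notes above together, a text-to-image search can embed the query via the /text endpoint and rank pre-computed image vectors by dot product (valid as cosine similarity because the vectors are normalized). A sketch assuming the documented request/response shapes; the `Indexed` type and helper names are illustrative:

```typescript
// Cross-modal retrieval sketch: text query in, image ids out.
type Indexed = { imageId: string; vector: number[] };

// Embed a query string via the /text endpoint. Assumes the documented
// { embeddings: number[][] } response shape.
async function embedQuery(query: string): Promise<number[]> {
  const res = await fetch('http://jina:8080/text', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ texts: [query], normalize: true }),
  });
  const data = await res.json();
  return data.embeddings[0];
}

// Rank stored image embeddings against the query vector; the dot product
// is the cosine similarity since all vectors are L2-normalized.
function rankImages(queryVec: number[], index: Indexed[]): Indexed[] {
  const dot = (a: number[], b: number[]) =>
    a.reduce((s, v, i) => s + v * b[i], 0);
  return [...index].sort(
    (a, b) => dot(queryVec, b.vector) - dot(queryVec, a.vector)
  );
}
```

In practice the ranking step would run inside a vector database rather than in application code; the in-memory index here just makes the cross-modal comparison explicit.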
