Jina Reranker

Cross-encoder reranking service that scores document relevance to a query. Use after initial retrieval to improve result quality.

What Is Reranking?

Reranking is a second-stage retrieval technique that re-scores a set of candidate documents against a query using a more computationally expensive but significantly more accurate model. In a typical RAG pipeline, the first stage uses fast approximate nearest-neighbor search over embeddings to retrieve a broad set of candidates. The reranker then evaluates each candidate individually by feeding the full query-document pair through a cross-encoder, producing a precise relevance score. This two-stage approach combines the speed of vector search with the accuracy of cross-encoder models, resulting in noticeably better answer quality from the downstream LLM.

The Jina Reranker is especially effective at resolving ambiguous queries where multiple retrieved documents seem superficially similar but differ in relevance. By attending to the fine-grained interaction between query tokens and document tokens, the cross-encoder catches nuances that bi-encoder embeddings miss. In practice, adding a reranking step typically improves top-5 retrieval precision by 10 to 25 percent with minimal additional latency when the candidate set is kept to a reasonable size.

Service Info

  • Port: 8081
  • Internal URL: http://jina-reranker:8081
  • Endpoint: /rerank
  • Timeout: 120s

Why Reranking?

Vector search is fast but approximate. Reranking uses a more powerful cross-encoder model to precisely score query-document pairs:

  1. Initial retrieval: Vector search returns top 50-100 candidates (fast)
  2. Reranking: Cross-encoder scores each candidate (accurate)
  3. Final results: Return top K reranked documents

Request Format

POST /rerank
{
  "query": "What is the capital of France?",
  "documents": [
    "Paris is the capital and largest city of France.",
    "France is a country in Western Europe.",
    "The Eiffel Tower is located in Paris.",
    "Berlin is the capital of Germany."
  ],
  "top_k": 3,              // Return top K results (optional)
  "return_documents": true  // Include document text in response
}

Response Format

{
  "results": [
    {
      "index": 0,
      "relevance_score": 0.95,
      "document": "Paris is the capital and largest city of France."
    },
    {
      "index": 2,
      "relevance_score": 0.72,
      "document": "The Eiffel Tower is located in Paris."
    },
    {
      "index": 1,
      "relevance_score": 0.45,
      "document": "France is a country in Western Europe."
    }
  ],
  "query": "What is the capital of France?",
  "total_documents": 4,
  "returned_documents": 3
}
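Note that `index` in each result refers to the position in the original `documents` array, so scores can be joined back to the caller's candidate objects even when `return_documents` is false. A minimal sketch (`joinResults` is a hypothetical helper, not part of the client):

```typescript
// Shape of one /rerank result entry, matching the example response above.
interface RerankResult {
  index: number;
  relevance_score: number;
  document?: string;
}

// Map reranked results back onto the original candidate objects by index.
function joinResults<T>(
  candidates: T[],
  results: RerankResult[]
): { candidate: T; score: number }[] {
  return results.map(r => ({
    candidate: candidates[r.index],
    score: r.relevance_score,
  }));
}
```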

TypeScript Client

Using JinaRerankerClient
import { JinaRerankerClient } from '@/services/gpu/jinaRerankerClient';

const client = new JinaRerankerClient('http://jina-reranker:8081');

// Rerank search results
const result = await client.rerank(
  "How do I reset my password?",
  [
    "To reset your password, click the 'Forgot Password' link.",
    "Your account settings can be found in the menu.",
    "Password requirements include 8+ characters.",
    "Contact support for account issues."
  ],
  { top_k: 3, return_documents: true }
);

// Use reranked results
for (const item of result.results) {
  console.log(`Score: ${item.relevance_score.toFixed(2)}`);
  console.log(`Document: ${item.document}`);
  console.log('---');
}

// Health check
const healthy = await client.healthCheck();

RAG Pipeline Example

Two-stage retrieval
// Stage 1: Fast vector search (retrieve candidates)
const candidates = await qdrant.search('documents', {
  vector: queryEmbedding,
  limit: 50  // Get more candidates than needed
});

// Stage 2: Rerank for precision
const reranked = await rerankerClient.rerank(
  userQuery,
  candidates.map(c => c.payload.text),
  { top_k: 5 }
);

// Use top reranked documents for LLM context
const context = reranked.results
  .map(r => r.document)
  .join('\n\n');

Best Practices

  • Candidate count: Rerank 20-100 documents. More = better quality but slower.
  • Document length: Truncate long documents to ~512 tokens for best results.
  • Caching: Cache reranking results for repeated queries when possible.
  • Timeout: Default 120s timeout. Increase for large document sets.
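The caching tip can be sketched as a small in-memory wrapper around whatever rerank call you use. `withRerankCache` and `RerankFn` are illustrative names (the real client API is shown in the TypeScript section above); a production version would also bound the cache size and expire entries.

```typescript
// Any function that reranks documents for a query.
type RerankFn = (query: string, docs: string[]) => Promise<unknown>;

// Wrap a rerank function so repeated (query, documents) pairs reuse the
// in-flight or completed promise instead of hitting the service again.
function withRerankCache(rerankFn: RerankFn): RerankFn {
  const cache = new Map<string, Promise<unknown>>();
  return (query, docs) => {
    const key = JSON.stringify([query, docs]);
    let hit = cache.get(key);
    if (!hit) {
      hit = rerankFn(query, docs);
      cache.set(key, hit);
    }
    return hit;
  };
}
```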