## How it works
Vector similarity search ranks results by approximate proximity in embedding space. This is fast but imprecise. A document can be a close neighbour in embedding space without actually answering the query.
A cross-encoder reranker fixes this. Instead of comparing independently computed embeddings, it receives the query and each candidate document together and produces a much more accurate relevance score using full bidirectional attention over the pair:

- `search()` fetches `top_k × 2` candidates from the vector store (over-fetch)
- The cross-encoder scores every candidate against the query in a single batch
- Results are re-sorted by cross-encoder score and truncated to `top_k`
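The over-fetch-and-rescore flow above can be sketched in plain Python. Here `fetch_candidates` and `cross_encoder_score` are hypothetical stand-ins for the vector-store query and the ONNX cross-encoder, not MemWire internals:

```python
def rerank_search(query, fetch_candidates, cross_encoder_score, top_k=5):
    """Over-fetch from the vector store, rescore with a cross-encoder,
    and return the best top_k results."""
    # Step 1: over-fetch twice as many candidates as requested
    candidates = fetch_candidates(query, limit=top_k * 2)
    # Step 2: score every (query, document) pair in one batch
    scores = cross_encoder_score(query, candidates)
    # Step 3: re-sort by cross-encoder score and truncate to top_k
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]

# Toy demo with a word-overlap "cross-encoder"
docs = ["deadline is Friday", "lunch menu", "project kickoff"]
fetch = lambda q, limit: docs[:limit]
score = lambda q, ds: [len(set(q.split()) & set(d.split())) for d in ds]
top = rerank_search("project deadline", fetch, score, top_k=2)
# → [("deadline is Friday", 1), ("project kickoff", 1)]
```

The over-fetch factor of 2 trades a little extra cross-encoder work for a better chance that a truly relevant document, ranked just outside the top `top_k` by embedding similarity, survives into the rescoring stage.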
## Enabling the reranker
Reranking is disabled by default. Enable it with a single config flag:
```python
from memwire import MemWire, MemWireConfig

config = MemWireConfig(
    qdrant_path="./memwire_data",
    use_reranking=True,
)
memory = MemWire(config=config)

results = memory.search("when is the project deadline?", user_id="alice", top_k=5)
for record, score in results:
    print(f"[{score:.4f}] {record.content}")
```
The model is lazy-loaded: it is downloaded and initialised only on the first `search()` call that triggers reranking, not at startup.
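The lazy-load behaviour can be illustrated with a minimal pattern. This is a sketch, not MemWire's actual implementation; `load_model` stands in for the FastEmbed download-and-initialise step:

```python
class LazyReranker:
    """Defers expensive model initialisation until the first scoring call."""

    def __init__(self, model_name, load_model):
        self._model_name = model_name
        self._load_model = load_model  # expensive: download + ONNX session
        self._model = None

    def score(self, query, documents):
        if self._model is None:
            # First use: pay the download/initialise cost exactly once
            self._model = self._load_model(self._model_name)
        return self._model(query, documents)

# Demo with a fake loader that records how often it runs
calls = []
def fake_loader(name):
    calls.append(name)
    return lambda q, ds: [0.0] * len(ds)

rr = LazyReranker("Xenova/ms-marco-MiniLM-L-6-v2", fake_loader)
rr.score("q", ["a", "b"])
rr.score("q", ["c"])
# the loader ran only once, on the first score() call
```

The same idea keeps process startup cheap for applications that enable `use_reranking` but rarely call `search()`.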
## Changing the model
The default model is `Xenova/ms-marco-MiniLM-L-6-v2`, a lightweight cross-encoder trained on MS MARCO. Swap it via `reranker_model_name`:
```python
config = MemWireConfig(
    qdrant_path="./memwire_data",
    use_reranking=True,
    reranker_model_name="Xenova/ms-marco-MiniLM-L-6-v2",  # default
)
```
Any ONNX cross-encoder supported by FastEmbed can be used here.
## Combining with hybrid search
For the best retrieval quality, run both hybrid search and reranking together:
```python
config = MemWireConfig(
    qdrant_path="./memwire_data",
    use_hybrid_search=True,  # dense + sparse retrieval (default)
    use_reranking=True,      # cross-encoder rescoring on top
)
memory = MemWire(config=config)
```
The pipeline becomes: sparse + dense fusion → cross-encoder rescore → top-k.
Hybrid search and reranking complement each other. Hybrid search maximises recall; the reranker maximises precision from those candidates.
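The combined pipeline can be sketched as follows. The source does not specify MemWire's fusion method, so this sketch assumes reciprocal rank fusion (RRF), a common choice for merging sparse and dense rankings; `cross_score` is a hypothetical stand-in for the cross-encoder:

```python
def rrf_fuse(dense_ranked, sparse_ranked, k=60):
    """Reciprocal rank fusion: merge two ranked lists of doc ids."""
    scores = {}
    for ranking in (dense_ranked, sparse_ranked):
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_rerank(query, dense_ranked, sparse_ranked, cross_score, top_k):
    # Fusion stage maximises recall; keep top_k * 2 fused candidates
    fused = rrf_fuse(dense_ranked, sparse_ranked)[: top_k * 2]
    # Cross-encoder stage maximises precision over those candidates
    scored = sorted(zip(fused, cross_score(query, fused)),
                    key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in scored[:top_k]]

# Toy demo: fixed relevance scores stand in for the cross-encoder
rel = {"a": 0.2, "b": 0.9, "c": 0.1, "d": 0.5}
cross = lambda q, ds: [rel[d] for d in ds]
result = hybrid_rerank("q", ["a", "b", "c"], ["b", "d", "a"], cross, top_k=2)
# → ["b", "d"]
```

Note how "d" makes the final cut despite appearing in only one of the two retrieval lists: fusion keeps it in the candidate pool, and the cross-encoder promotes it.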
| Concern | Detail |
|---|---|
| First call latency | The ONNX model is downloaded once (~85 MB) and cached by FastEmbed. |
| Per-query latency | The cross-encoder scores `top_k × 2` documents in a single batched ONNX forward pass, typically 10–50 ms on CPU. |
| Memory | The reranker model stays resident after first use (~100 MB RAM). |
| Privacy | Everything runs locally. No query or memory content is sent to any external service. |
Reranking only applies to `search()`. The `recall()` method uses graph-based BFS traversal and is not affected by `use_reranking`.
## Configuration reference
| Parameter | Default | Description |
|---|---|---|
| `use_reranking` | `False` | Enable cross-encoder reranking on `search()` results. |
| `reranker_model_name` | `Xenova/ms-marco-MiniLM-L-6-v2` | FastEmbed-compatible cross-encoder model name. |