How RAGFlow Uses Elasticsearch to Filter by Dataset
In RAGFlow, when a user performs a retrieval query, the system doesn’t search the entire document store; it narrows results to only the specified datasets (knowledge bases). The Elasticsearch backend achieves this by injecting a kb_id field filter into every query. This post walks through the architecture, the code path from API to storage, and how other backends handle the same problem differently.
The Core Mechanism: Single Index + Field Filtering
Elasticsearch in RAGFlow follows a single-index, field-filtered design:
- One index per tenant: All document chunks belonging to a tenant live in the same Elasticsearch index.
- Field-level isolation: Every chunk carries a kb_id field that identifies which dataset (knowledge base) it belongs to.
- Query-time filtering: At search time, a terms filter on kb_id restricts results to the requested datasets.
This is a classic multi-tenant pattern in Elasticsearch: rather than creating separate indices per dataset (which would increase cluster overhead), RAGFlow keeps a flat structure and relies on efficient filter caching.
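As a rough mental model of the design above, the sketch below uses a plain Python list in place of one tenant's index. The chunk IDs, the kb_a/kb_b/kb_c names, and the filter_by_dataset helper are all hypothetical and exist only for illustration; this is not RAGFlow or Elasticsearch code.

```python
# Hypothetical in-memory stand-in for a single tenant's index.
# Every chunk lives in the same collection and carries a kb_id.
tenant_index = [
    {"id": "chunk-1", "kb_id": "kb_a", "content": "intro to retrieval"},
    {"id": "chunk-2", "kb_id": "kb_b", "content": "billing FAQ"},
    {"id": "chunk-3", "kb_id": "kb_c", "content": "deployment notes"},
    {"id": "chunk-4", "kb_id": "kb_a", "content": "chunking strategies"},
]


def filter_by_dataset(chunks: list[dict], kb_ids: list[str]) -> list[dict]:
    """Keep only chunks whose kb_id is one of the requested datasets.

    This mimics the semantics of a terms filter: a chunk matches
    if its kb_id equals any value in the list.
    """
    wanted = set(kb_ids)
    return [chunk for chunk in chunks if chunk["kb_id"] in wanted]


hits = filter_by_dataset(tenant_index, ["kb_a", "kb_c"])
print([chunk["id"] for chunk in hits])
```

The point of the sketch is only that membership in a dataset is a property of each chunk, not of where the chunk is stored.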
Code Walkthrough
Step 1 — The Search Interface Receives Dataset IDs
The entry point is ESConnection.search() in rag/utils/es_conn.py. The method accepts a knowledgebase_ids parameter and immediately injects it into the condition dictionary:

    def search(
        self,
        select_fields: list[str],
        highlight_fields: list[str],
        condition: dict,
        match_expressions: list[MatchExpr],
        order_by: OrderByExpr,
        offset: int,
        limit: int,
        index_names: str | list[str],
        knowledgebase_ids: list[str],  # list of dataset IDs to search
        ...
    ):
        ...
        condition["kb_id"] = knowledgebase_ids  # inject into filter conditions

Step 2 — Building the Elasticsearch Bool Query
The condition dictionary is then iterated to construct an Elasticsearch bool query. For list-type values (like kb_id), the code appends a terms filter:

    bool_query = Q("bool", must=[])
    condition["kb_id"] = knowledgebase_ids
    for k, v in condition.items():
        ...
        if isinstance(v, list):
            bool_query.filter.append(Q("terms", **{k: v}))  # terms filter for kb_id
        elif isinstance(v, str) or isinstance(v, int):
            bool_query.filter.append(Q("term", **{k: v}))

The resulting Elasticsearch query looks conceptually like this:

    {
      "query": {
        "bool": {
          "filter": [
            { "terms": { "kb_id": ["dataset_id_1", "dataset_id_2"] } }
          ],
          "must": [ ... ]
        }
      }
    }

Because terms queries are placed in the filter context, Elasticsearch can cache them and skip scoring — making dataset filtering essentially free in terms of relevance computation.
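To make the mapping from the condition dictionary to the query body concrete, here is a minimal sketch that builds the same shape with plain dictionaries instead of the Q helper. The build_bool_query function and the doc_id condition are assumptions made for illustration, and the handling that the excerpt above elides with "..." is deliberately left out.

```python
def build_bool_query(condition: dict, knowledgebase_ids: list[str]) -> dict:
    """Build a plain-dict bool query from filter conditions.

    Hypothetical helper: it mirrors the list -> terms and
    scalar -> term rule shown in the excerpt, nothing more.
    """
    condition = dict(condition)  # copy so the caller's dict is not mutated
    condition["kb_id"] = knowledgebase_ids

    filters = []
    for field, value in condition.items():
        if isinstance(value, list):
            # A list means "match any of these values".
            filters.append({"terms": {field: value}})
        elif isinstance(value, (str, int)):
            # A scalar means "match exactly this value".
            filters.append({"term": {field: value}})

    # must is left empty here; the real query adds match expressions to it.
    return {"query": {"bool": {"filter": filters, "must": []}}}


query = build_bool_query(
    {"doc_id": "doc_42"},  # hypothetical extra condition
    ["dataset_id_1", "dataset_id_2"],
)
print(query["query"]["bool"]["filter"])
```

Every clause lands under filter, so none of these conditions contributes to the relevance score; only whatever is later placed under must does.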
Step 3 — The Full Call Chain (API → Service → Storage)
The dataset IDs flow through several layers before reaching Elasticsearch:
| Layer | Entry Point | Role |
|---|---|---|
| HTTP API | POST /api/v1/retrieval | Accepts dataset_ids in the request body |
| Python SDK | ragflow_sdk.RAGFlow.retrieve() | Wraps the HTTP call with typed parameters |
| Service Layer | settings.retriever.retrieval() | Orchestrates retrieval logic |
| Storage Layer | ESConnection.search() | Constructs and executes the Elasticsearch query |
How Other Backends Handle Dataset Isolation
RAGFlow’s document store abstraction supports multiple backends, each with its own isolation strategy:
| Backend | Isolation Strategy | Implementation |
|---|---|---|
| Elasticsearch | Single index + kb_id field filter | rag/utils/es_conn.py |
| Infinity | Separate table per dataset ({index_name}_{kb_id}) | rag/utils/infinity_conn.py |
| OceanBase | Single table + kb_id field filter | rag/utils/ob_conn.py |
The Infinity backend takes a fundamentally different approach: instead of filtering on a field, it creates a dedicated table for each knowledge base. The table name follows the pattern {index_name}_{kb_id}, so the query is physically scoped to the correct data. The trade-off is more tables to manage in exchange for not having to evaluate a kb_id filter at query time.
OceanBase mirrors the Elasticsearch pattern — a single table with a kb_id column and query-time filtering.
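The difference between the two strategies can be sketched side by side. Both helper functions and the index name ragflow_tenant1 below are hypothetical; only the {index_name}_{kb_id} naming pattern and the kb_id terms filter come from the descriptions above.

```python
def infinity_table_names(index_name: str, kb_ids: list[str]) -> list[str]:
    """Table-per-dataset: the dataset ID selects WHICH tables to query."""
    return [f"{index_name}_{kb_id}" for kb_id in kb_ids]


def kb_id_filter(kb_ids: list[str]) -> dict:
    """Field filter: the dataset ID selects which ROWS of one store match."""
    return {"terms": {"kb_id": kb_ids}}


datasets = ["dataset_id_1", "dataset_id_2"]

# Physical scoping: one table name per requested dataset.
print(infinity_table_names("ragflow_tenant1", datasets))

# Logical scoping: one filter clause applied to a shared index or table.
print(kb_id_filter(datasets))
```

In the first case the scoping happens before the query runs, by choosing targets; in the second it happens inside the query, as one more filter clause.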
Usage Example
To search specific datasets via the RAGFlow API:

    curl --request POST \
         --url http://{address}/api/v1/retrieval \
         --header 'Content-Type: application/json' \
         --header 'Authorization: Bearer <YOUR_API_KEY>' \
         --data '{
           "question": "What is advantage of ragflow?",
           "dataset_ids": ["b2a62730759d11ef987d0242ac120004"]
         }'

You can pass multiple dataset IDs to search across several knowledge bases simultaneously. If you omit dataset_ids, you must provide document_ids instead — at least one scoping parameter is required.
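For readers who prefer Python to curl, the same request can be assembled with the standard library alone. This is a sketch under assumptions: the build_retrieval_request helper, the ragflow.example.com address, and the second dataset ID are placeholders, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_retrieval_request(
    address: str, api_key: str, question: str, dataset_ids: list[str]
) -> urllib.request.Request:
    """Assemble the POST /api/v1/retrieval request without sending it."""
    payload = {"question": question, "dataset_ids": dataset_ids}
    return urllib.request.Request(
        url=f"http://{address}/api/v1/retrieval",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_retrieval_request(
    address="ragflow.example.com",  # placeholder address
    api_key="<YOUR_API_KEY>",
    question="What is advantage of ragflow?",
    dataset_ids=[
        "b2a62730759d11ef987d0242ac120004",
        "dataset_id_2",  # placeholder for a second dataset
    ],
)
print(request.get_method(), request.full_url)
print(json.loads(request.data))

# To actually send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Passing two IDs in dataset_ids is what ends up as a two-element kb_id terms filter on the Elasticsearch side.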
Key Takeaways
- Efficient multi-dataset search: The terms filter on kb_id supports querying across multiple datasets in a single request with minimal overhead.
- Filter caching: Since terms queries in the filter context are cacheable, repeated searches against the same datasets benefit from Elasticsearch’s filter cache.
- Backend-agnostic design: RAGFlow’s storage abstraction means you can swap Elasticsearch for Infinity or OceanBase without changing the retrieval API; only the isolation mechanism differs under the hood.
- Single-index simplicity: By avoiding per-dataset indices, RAGFlow keeps cluster management simple while still achieving strong data isolation at query time.