How RAGflow Bootstraps a Knowledge Base: Inside init_kb and create_idx
How RAGflow Bootstraps a Knowledge Base: Inside init_kb and create_idx
When you create a knowledge base in RAGflow, a silent but critical piece of infrastructure is provisioned before a single document is ever uploaded. That infrastructure is the storage structure the index or table that will hold every parsed document chunk for the lifetime of that knowledge base. This post traces how RAGflow builds that structure through init_kb and its call into create_idx.
What Problem Does This Solve?
RAGflow supports multiple document store backends: Elasticsearch, OpenSearch, and Infinity. Each of these engines has its own data model Elasticsearch uses indices with JSON mappings, while Infinity uses typed tables with schema definitions. Before any document chunks can be written, the correct structure must exist in the target engine.
create_idx is the single method behind that provisioning step. It accepts the index name, dataset ID, vector size, and an optional parser ID, and creates the backend-specific structure on demand. Crucially, it is idempotent calling it on an already-provisioned knowledge base is a no-op.
The Entry Point: init_kb
In rag/svr/task_executor.py, every knowledge base initialization flows through init_kb:
def init_kb(row, vector_size: int):
idxnm = search.index_name(row["tenant_id"])
parser_id = row.get("parser_id", None)
return settings.docStoreConn.create_idx(idxnm, row.get("kb_id", ""), vector_size, parser_id)
Three values are resolved here:
| Parameter | Source | Purpose |
|---|---|---|
idxnm |
search.index_name(tenant_id) ragflow_{tenant_id} |
Namespace the index per tenant |
kb_id |
row["kb_id"] |
Identify the specific knowledge base dataset |
vector_size |
Caller-provided | Must match the embedding model output dimension |
parser_id |
row["parser_id"] |
Allows backend-specific schema customization |
settings.docStoreConn is the active backend connection, so the same init_kb function works regardless of which store is configured.
Backend 1: Elasticsearch
The Elasticsearch implementation in common/doc_store/es_conn_base.py is deliberately minimal:
def create_idx(self, index_name: str, dataset_id: str, vector_size: int, parser_id: str = None):
# parser_id is used by Infinity but not needed for ES (kept for interface compatibility)
if self.index_exist(index_name, dataset_id):
return True
try:
return IndicesClient(self.es).create(index=index_name,
settings=self.mapping["settings"],
mappings=self.mapping["mappings"])
except Exception:
self.logger.exception("ESConnection.createIndex error %s" % index_name)
Key observations:
- The index name follows the tenant pattern (
ragflow_{tenant_id}). All knowledge bases belonging to the same tenant share one Elasticsearch index, distinguished at query time by akb_idfield filter. - Mappings are loaded from a pre-defined configuration object (
self.mapping), keeping schema management centralized. parser_idis accepted but ignored the interface remains consistent across all backends.
Backend 2: OpenSearch
OpenSearch mirrors Elasticsearch’s pattern in rag/utils/opensearch_conn.py:
def create_idx(self, indexName: str, knowledgebaseId: str, vectorSize: int, parser_id: str = None):
if self.index_exist(indexName, knowledgebaseId):
return True
try:
from opensearchpy.client import IndicesClient
return IndicesClient(self.os).create(index=indexName, body=self.mapping)
except Exception:
logger.exception("OSConnection.createIndex error %s" % (indexName))
The structure is identical in spirit: check existence, create from mapping, handle errors. Both engines share the same tenant-scoped single-index design.
Backend 3: Infinity
The Infinity implementation in common/doc_store/infinity_conn_base.py is the most sophisticated. Unlike Elasticsearch, Infinity uses a per-knowledge-base table model:
Table name: {index_name}_{dataset_id}
This means each knowledge base gets its own isolated data table.
Schema Loading and Vector Column Injection
fp_mapping = os.path.join(get_project_base_directory(), "conf", self.mapping_file_name)
schema = json.load(open(fp_mapping))
# Inject vector column based on runtime embedding dimension
vector_name = f"q_{vector_size}_vec"
schema[vector_name] = {"type": f"vector,{vector_size},float"}
The base schema is read from a config file, and the vector column (q_768_vec, q_1024_vec, etc.) is injected at runtime based on the active embedding model. This keeps the static schema clean while supporting flexible embedding dimensions.
Parser-Specific Schema Extensions
When parser_id is TABLE, an extra JSON column is appended:
if parser_id == ParserType.TABLE.value:
schema["chunk_data"] = {"type": "json", "default": "{}"}
This lets table-parsed documents store structured cell data that would not fit into the standard text-chunk schema.
Index Creation: Three Index Types
After the table is created, Infinity builds three categories of indexes automatically.
1. HNSW Vector Index for nearest-neighbor semantic search:
inf_table.create_index(
"q_vec_idx",
IndexInfo(
vector_name,
IndexType.Hnsw,
{"M": "16", "ef_construction": "50", "metric": "cosine", "encode": "lvq"},
),
ConflictType.Ignore,
)
M=16andef_construction=50balance recall quality against build time.metric=cosinecomputes similarity as cosine distance.encode=lvqapplies learned vector quantization to reduce index memory usage.
2. Full-Text Indexes for keyword-based search over varchar fields that declare an analyzer:
for field_name, field_info in schema.items():
if field_info["type"] != "varchar" or "analyzer" not in field_info:
continue
for analyzer in analyzers:
inf_table.create_index(
f"ft_{field_name}_{analyzer}",
IndexInfo(field_name, IndexType.FullText, {"ANALYZER": analyzer}),
ConflictType.Ignore,
)
Multiple analyzers per field are supported, enabling multilingual full-text search on the same column.
3. Secondary Indexes for efficient filtering on metadata fields:
if index_config == "secondary":
inf_table.create_index(
f"sec_{field_name}",
IndexInfo(field_name, IndexType.Secondary),
ConflictType.Ignore,
)
Secondary indexes allow fast lookups on fields like status, type, or tenant_id without scanning the full table.
Why the Multi-Index Strategy Matters
RAGflow’s retrieval model is hybrid: it combines vector similarity search with full-text keyword matching and metadata filtering. Each index type serves a distinct role in that pipeline:
| Index Type | Search Mode | Use Case |
|---|---|---|
| HNSW | Semantic (vector) | Find chunks conceptually similar to the query |
| FullText | Keyword | Find chunks containing specific terms |
| Secondary | Filter | Restrict results by document type, status, or owner |
By provisioning all three during create_idx, RAGflow ensures every knowledge base is immediately capable of hybrid search no deferred index builds, no warm-up period.
Idempotency: Safe to Call Multiple Times
Every create_idx path uses ConflictType.Ignore (Infinity) or an existence check before creation (ES/OpenSearch). This means:
- Calling
init_kbon an existing knowledge base is always safe. - Retrying a failed setup will not corrupt existing data.
- The operation behaves correctly in distributed or retry-heavy environments.
Summary
init_kb create_idx is RAGflow’s knowledge base provisioning contract. Under the hood, it:
- Determines the correct namespace tenant-scoped for ES/OpenSearch, per-dataset table for Infinity.
- Builds the storage structure from a centralized schema definition.
- Injects runtime-variable columns the embedding vector column, sized to match the active model.
- Extends the schema for specific parsers e.g., adding a
chunk_dataJSON column for table-parsed documents. - Creates all necessary indexes HNSW for vectors, full-text for keywords, secondary for metadata in a single atomic provisioning step.
- Guarantees idempotency the operation is always safe to call on already-provisioned storage.
This design cleanly decouples the what (RAGflow’s uniform knowledge base abstraction) from the how (each backend’s native storage primitives), making it straightforward to add new backends without changing any upstream service logic.