How RAGflow Bootstraps a Knowledge Base: Inside init_kb and create_idx

When you create a knowledge base in RAGflow, a silent but critical piece of infrastructure is provisioned before a single document is ever uploaded. That infrastructure is the storage structure the index or table that will hold every parsed document chunk for the lifetime of that knowledge base. This post traces how RAGflow builds that structure through init_kb and its call into create_idx.


What Problem Does This Solve?

RAGflow supports multiple document store backends: Elasticsearch, OpenSearch, and Infinity. Each of these engines has its own data model Elasticsearch uses indices with JSON mappings, while Infinity uses typed tables with schema definitions. Before any document chunks can be written, the correct structure must exist in the target engine.

create_idx is the single method behind that provisioning step. It accepts the index name, dataset ID, vector size, and an optional parser ID, and creates the backend-specific structure on demand. Crucially, it is idempotent calling it on an already-provisioned knowledge base is a no-op.


The Entry Point: init_kb

In rag/svr/task_executor.py, every knowledge base initialization flows through init_kb:

def init_kb(row, vector_size: int):
    idxnm = search.index_name(row["tenant_id"])
    parser_id = row.get("parser_id", None)
    return settings.docStoreConn.create_idx(idxnm, row.get("kb_id", ""), vector_size, parser_id)

Three values are resolved here:

Parameter Source Purpose
idxnm search.index_name(tenant_id) ragflow_{tenant_id} Namespace the index per tenant
kb_id row["kb_id"] Identify the specific knowledge base dataset
vector_size Caller-provided Must match the embedding model output dimension
parser_id row["parser_id"] Allows backend-specific schema customization

settings.docStoreConn is the active backend connection, so the same init_kb function works regardless of which store is configured.


Backend 1: Elasticsearch

The Elasticsearch implementation in common/doc_store/es_conn_base.py is deliberately minimal:

def create_idx(self, index_name: str, dataset_id: str, vector_size: int, parser_id: str = None):
    # parser_id is used by Infinity but not needed for ES (kept for interface compatibility)
    if self.index_exist(index_name, dataset_id):
        return True
    try:
        return IndicesClient(self.es).create(index=index_name,
                                             settings=self.mapping["settings"],
                                             mappings=self.mapping["mappings"])
    except Exception:
        self.logger.exception("ESConnection.createIndex error %s" % index_name)

Key observations:

  • The index name follows the tenant pattern (ragflow_{tenant_id}). All knowledge bases belonging to the same tenant share one Elasticsearch index, distinguished at query time by a kb_id field filter.
  • Mappings are loaded from a pre-defined configuration object (self.mapping), keeping schema management centralized.
  • parser_id is accepted but ignored the interface remains consistent across all backends.

Backend 2: OpenSearch

OpenSearch mirrors Elasticsearch’s pattern in rag/utils/opensearch_conn.py:

def create_idx(self, indexName: str, knowledgebaseId: str, vectorSize: int, parser_id: str = None):
    if self.index_exist(indexName, knowledgebaseId):
        return True
    try:
        from opensearchpy.client import IndicesClient
        return IndicesClient(self.os).create(index=indexName, body=self.mapping)
    except Exception:
        logger.exception("OSConnection.createIndex error %s" % (indexName))

The structure is identical in spirit: check existence, create from mapping, handle errors. Both engines share the same tenant-scoped single-index design.


Backend 3: Infinity

The Infinity implementation in common/doc_store/infinity_conn_base.py is the most sophisticated. Unlike Elasticsearch, Infinity uses a per-knowledge-base table model:

Table name: {index_name}_{dataset_id}

This means each knowledge base gets its own isolated data table.

Schema Loading and Vector Column Injection

fp_mapping = os.path.join(get_project_base_directory(), "conf", self.mapping_file_name)
schema = json.load(open(fp_mapping))

# Inject vector column based on runtime embedding dimension
vector_name = f"q_{vector_size}_vec"
schema[vector_name] = {"type": f"vector,{vector_size},float"}

The base schema is read from a config file, and the vector column (q_768_vec, q_1024_vec, etc.) is injected at runtime based on the active embedding model. This keeps the static schema clean while supporting flexible embedding dimensions.

Parser-Specific Schema Extensions

When parser_id is TABLE, an extra JSON column is appended:

if parser_id == ParserType.TABLE.value:
    schema["chunk_data"] = {"type": "json", "default": "{}"}

This lets table-parsed documents store structured cell data that would not fit into the standard text-chunk schema.

Index Creation: Three Index Types

After the table is created, Infinity builds three categories of indexes automatically.

1. HNSW Vector Index for nearest-neighbor semantic search:

inf_table.create_index(
    "q_vec_idx",
    IndexInfo(
        vector_name,
        IndexType.Hnsw,
        {"M": "16", "ef_construction": "50", "metric": "cosine", "encode": "lvq"},
    ),
    ConflictType.Ignore,
)
  • M=16 and ef_construction=50 balance recall quality against build time.
  • metric=cosine computes similarity as cosine distance.
  • encode=lvq applies learned vector quantization to reduce index memory usage.

2. Full-Text Indexes for keyword-based search over varchar fields that declare an analyzer:

for field_name, field_info in schema.items():
    if field_info["type"] != "varchar" or "analyzer" not in field_info:
        continue
    for analyzer in analyzers:
        inf_table.create_index(
            f"ft_{field_name}_{analyzer}",
            IndexInfo(field_name, IndexType.FullText, {"ANALYZER": analyzer}),
            ConflictType.Ignore,
        )

Multiple analyzers per field are supported, enabling multilingual full-text search on the same column.

3. Secondary Indexes for efficient filtering on metadata fields:

if index_config == "secondary":
    inf_table.create_index(
        f"sec_{field_name}",
        IndexInfo(field_name, IndexType.Secondary),
        ConflictType.Ignore,
    )

Secondary indexes allow fast lookups on fields like status, type, or tenant_id without scanning the full table.


Why the Multi-Index Strategy Matters

RAGflow’s retrieval model is hybrid: it combines vector similarity search with full-text keyword matching and metadata filtering. Each index type serves a distinct role in that pipeline:

Index Type Search Mode Use Case
HNSW Semantic (vector) Find chunks conceptually similar to the query
FullText Keyword Find chunks containing specific terms
Secondary Filter Restrict results by document type, status, or owner

By provisioning all three during create_idx, RAGflow ensures every knowledge base is immediately capable of hybrid search no deferred index builds, no warm-up period.


Idempotency: Safe to Call Multiple Times

Every create_idx path uses ConflictType.Ignore (Infinity) or an existence check before creation (ES/OpenSearch). This means:

  • Calling init_kb on an existing knowledge base is always safe.
  • Retrying a failed setup will not corrupt existing data.
  • The operation behaves correctly in distributed or retry-heavy environments.

Summary

init_kb create_idx is RAGflow’s knowledge base provisioning contract. Under the hood, it:

  1. Determines the correct namespace tenant-scoped for ES/OpenSearch, per-dataset table for Infinity.
  2. Builds the storage structure from a centralized schema definition.
  3. Injects runtime-variable columns the embedding vector column, sized to match the active model.
  4. Extends the schema for specific parsers e.g., adding a chunk_data JSON column for table-parsed documents.
  5. Creates all necessary indexes HNSW for vectors, full-text for keywords, secondary for metadata in a single atomic provisioning step.
  6. Guarantees idempotency the operation is always safe to call on already-provisioned storage.

This design cleanly decouples the what (RAGflow’s uniform knowledge base abstraction) from the how (each backend’s native storage primitives), making it straightforward to add new backends without changing any upstream service logic.


References