# How RAGFlow Creates a Dataset: A Code Walkthrough
RAGFlow exposes datasets (also called knowledge bases) as a first-class resource. Understanding how one gets created end-to-end — from an HTTP request all the way to a database row — is a good entry point into the codebase. This post traces that path through three layers: the API handler, the service layer, and the persistence call.
## The API Layer: Accepting the Request
The entry point for dataset creation is the `create` function in `api/apps/sdk/dataset.py` (lines 55–127). It is registered to handle `POST /api/v1/datasets` and is responsible for:
- Parsing and validating the JSON request body
- Resolving the authenticated tenant from the bearer token
- Forwarding the sanitised parameters to the service layer
Key fields accepted at this layer include `name`, `embedding_model`, `chunk_method`, `parser_config`, and `permission`. Any unsupported key is rejected before the call goes deeper.
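The validation step can be sketched as follows. This is a minimal illustration of the behavior described above, not the actual handler code; `validate_payload` and `ALLOWED_FIELDS` are invented names.

```python
# Fields the creation endpoint accepts (per the post above).
ALLOWED_FIELDS = {"name", "embedding_model", "chunk_method", "parser_config", "permission"}

def validate_payload(payload: dict) -> dict:
    """Reject unsupported keys and missing names before going deeper."""
    unknown = set(payload) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unsupported field(s): {', '.join(sorted(unknown))}")
    if not str(payload.get("name", "")).strip():
        raise ValueError("'name' is required")
    return payload
```

Rejecting unknown keys at the edge keeps the service layer free of defensive checks.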
## The Service Layer: Core Creation Logic
The actual work happens in `create_with_name` inside `api/db/services/knowledgebase_service.py` (lines 374–430). This method:
- Validates the name — empty or whitespace-only names are rejected outright.
- Handles name collisions — if a dataset with the same name already exists under the tenant, a numeric suffix is appended automatically (e.g. `my-docs (2)`).
- Builds the payload — default values are merged with the caller-supplied fields, producing a complete record ready for persistence.
- Constructs `parser_config` — chunk method-specific defaults are applied here; for instance, `naive` and `book` carry different splitting parameters.
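The steps above can be condensed into a short sketch. The function and variable names (`build_dataset_record`, `existing_names`) are invented for illustration and do not match RAGFlow's internals exactly.

```python
def build_dataset_record(name: str, existing_names: set[str], **overrides) -> dict:
    # 1. Validate the name: empty or whitespace-only is rejected outright.
    if not name or not name.strip():
        raise ValueError("dataset name must not be empty")
    name = name.strip()

    # 2. Resolve collisions with a numeric suffix: "my-docs" -> "my-docs (2)".
    final, n = name, 2
    while final in existing_names:
        final = f"{name} ({n})"
        n += 1

    # 3. Merge defaults with caller-supplied fields into a complete record.
    defaults = {"chunk_method": "naive", "permission": "me"}
    return {**defaults, **overrides, "name": final}
```

Note the ordering: caller-supplied `overrides` win over defaults, while the resolved name always wins over both.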
## Saving to the Database
Once the payload is assembled, `KnowledgebaseService.save()` commits it. A unique ID is generated at this point, and the record becomes queryable immediately. The dataset is scoped to the tenant, so the name uniqueness constraint is per-tenant, not global.
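An in-memory sketch makes the two properties above concrete: the ID is generated at save time, and uniqueness is enforced per tenant rather than globally. This is an illustration only, not RAGFlow's actual storage code.

```python
import uuid

_DB: dict[tuple[str, str], dict] = {}  # (tenant_id, name) -> record

def save(tenant_id: str, record: dict) -> dict:
    key = (tenant_id, record["name"])
    if key in _DB:
        raise ValueError("dataset name already taken for this tenant")
    # The caller never supplies an ID; it is generated here.
    stored = {**record, "id": uuid.uuid4().hex, "tenant_id": tenant_id}
    _DB[key] = stored  # queryable immediately after save
    return stored
```

Keying the store on `(tenant_id, name)` is exactly what makes the uniqueness constraint tenant-scoped.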
## Configuration Reference
The following fields can be set when creating a dataset (see `api/apps/sdk/dataset.py`, lines 55–127):
| Field | Description |
|---|---|
| `name` | Human-readable label for the dataset (required, unique per tenant) |
| `embedding_model` | Model used to embed chunks for retrieval |
| `chunk_method` | Splitting strategy: `naive`, `book`, `qa`, and others |
| `parser_config` | Method-specific parser options |
| `permission` | Access control: `me` or `team` |
## Quick Start: Python SDK
The Python SDK wraps the HTTP call with a thin client. A minimal example from the repo (`dataset_example.py`, lines 27–32):
```python
# `client` is an initialized RAGFlow SDK client instance
client.create_dataset(
    name="my-knowledge-base",
    embedding_model="text-embedding-ada-002",
    chunk_method="naive",
)
```
Or directly via HTTP (see `http_api_reference.md`, lines 440–450):
```bash
curl -X POST http://localhost/api/v1/datasets \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-knowledge-base", "chunk_method": "naive"}'
```
## Key Takeaways
- Dataset = KnowledgeBase internally; the two terms are used interchangeably throughout the codebase.
- IDs are auto-generated — callers never supply a dataset ID on creation.
- Name uniqueness is tenant-scoped — the service layer handles collisions automatically with a suffix strategy rather than returning an error.
- `parser_config` is derived — you rarely need to set it manually; the service builds sensible defaults from `chunk_method`.
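The last takeaway can be illustrated with a sketch of deriving `parser_config` from `chunk_method`. The keys and values below are invented for illustration; the real defaults live in RAGFlow's service layer.

```python
def default_parser_config(chunk_method: str) -> dict:
    """Return hypothetical method-specific parser defaults."""
    per_method = {
        "naive": {"chunk_token_num": 128, "delimiter": "\n"},
        "book": {"chunk_token_num": 512},
        "qa": {},  # Q&A pairs carry their own structure; no splitting knobs
    }
    # Unknown methods fall back to an empty config.
    return per_method.get(chunk_method, {})
```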