LiteLLM, Ollama, and vLLM: Understanding Model Switching at Every Layer
Published: March 17, 2026
Introduction
When working with LLMs, a common question arises: why does LiteLLM need a Router to switch between models, while Ollama and llama.cpp can swap models on the fly within a single process? The answer reveals a fundamental architectural divide — and understanding it helps you design better inference stacks from development all the way to cloud-scale production.
This post covers:
- How LiteLLM, Ollama, and vLLM each handle model switching
- Why vLLM deliberately avoids dynamic model swapping
- How cloud providers like Alibaba Cloud build cluster-level inference on top of vLLM
LiteLLM: Switching Across Providers (Protocol Translation)
LiteLLM sits in front of different LLM providers — OpenAI, Anthropic, Google Gemini, Azure, Ollama, and 100+ more. Each provider exposes a different API shape:
OpenAI → POST /v1/chat/completions { "model": "gpt-4o", ... }
Anthropic → POST /v1/messages { "model": "claude-...", "max_tokens": ... } ← different required fields
Gemini → POST /v1beta/models/... completely different URL and body structure
LiteLLM solves this by performing protocol translation (adapter pattern), converting every provider’s API into the OpenAI chat.completions format:
import litellm
# All three use the exact same code — only the model string changes
response = litellm.completion(model="gpt-4o", messages=[{"role": "user", "content": "Hello"}])
response = litellm.completion(model="claude-3-5-sonnet-20241022", messages=[{"role": "user", "content": "Hello"}])
response = litellm.completion(model="gemini/gemini-2.0-flash", messages=[{"role": "user", "content": "Hello"}])
On top of unification, LiteLLM’s Router adds load balancing, fallback, cost tracking, and caching. But the key insight is: LiteLLM does not run any models itself — it routes and translates.
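To make the adapter idea concrete, here is a minimal sketch of one translation step: turning an OpenAI-style chat request into an Anthropic Messages-style body. The function name `openai_to_anthropic` and the default of 1024 tokens are assumptions made for illustration, not LiteLLM's actual internals, and a real adapter also has to handle tools, streaming, and multimodal content.

```python
from typing import Any


def openai_to_anthropic(request: dict[str, Any], default_max_tokens: int = 1024) -> dict[str, Any]:
    """Translate an OpenAI-style chat request into an Anthropic Messages-style body.

    Hypothetical helper for illustration only.
    """
    system_parts = [m["content"] for m in request["messages"] if m["role"] == "system"]
    chat_messages = [m for m in request["messages"] if m["role"] != "system"]

    body: dict[str, Any] = {
        "model": request["model"],
        "messages": chat_messages,
        # Anthropic requires max_tokens; in the OpenAI format it is optional.
        "max_tokens": request.get("max_tokens", default_max_tokens),
    }
    if system_parts:
        # Anthropic takes the system prompt as a top-level field, not as a message.
        body["system"] = "\n".join(system_parts)
    return body


openai_request = {
    "model": "claude-3-5-sonnet-20241022",
    "messages": [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Hello"},
    ],
}
anthropic_body = openai_to_anthropic(openai_request)
print(anthropic_body)
```

The caller keeps writing OpenAI-shaped requests; only the adapter knows that the target provider wants a different body.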
Ollama: Switching Within a Single Runtime (Weight Swapping)
Ollama is a local inference runtime that manages model files on disk and loads/unloads them into VRAM on demand.
User requests model=llama3.2
│
▼
Is it already loaded in VRAM?
│
No → Load weights from disk into VRAM → Run inference → Unload after idle
│
Yes → Run inference directly
VRAM timeline:
[llama3.2 weights████████████][idle][mistral weights████████████][idle]
                              ↑     ↑
                              │     User B requests, load new model
                              User A finishes, unload
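Because Ollama targets single-user, low-concurrency use, the cost of unloading one model and loading another is acceptable: roughly 10-30 seconds of waiting between models, depending on model size and disk speed. By default an idle model stays resident for about five minutes (configurable through keep_alive) before it is unloaded.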
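The load-on-demand flow above can be sketched as a toy runtime with a single model slot and an idle timeout. `ToyModelRuntime` and its fields are hypothetical names, and the single slot is a simplification: the real runtime can keep more than one model resident when memory allows.

```python
class ToyModelRuntime:
    """Toy runtime: one model resident at a time, evicted after an idle period."""

    def __init__(self, keep_alive_s: float = 300.0):
        self.keep_alive_s = keep_alive_s
        self.loaded = None       # name of the resident model, if any
        self.last_used = 0.0     # timestamp of the most recent request
        self.load_count = 0      # how many (slow) weight loads happened

    def _evict_if_idle(self, now: float) -> None:
        if self.loaded is not None and now - self.last_used > self.keep_alive_s:
            self.loaded = None

    def infer(self, model: str, now: float) -> str:
        self._evict_if_idle(now)
        if self.loaded != model:
            # Stands in for unloading the old weights and reading new ones from disk.
            self.loaded = model
            self.load_count += 1
        self.last_used = now
        return f"{model}: ok"


runtime = ToyModelRuntime(keep_alive_s=300)
runtime.infer("llama3.2", now=0)     # cold start: load
runtime.infer("llama3.2", now=10)    # already resident: no load
runtime.infer("mistral", now=20)     # different model: swap
runtime.infer("mistral", now=400)    # idle for more than 300 s: evicted, reload
print(runtime.load_count)
```

Every increment of `load_count` corresponds to the multi-second pause a user would notice.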
# Same port, same API, just change the model field
curl http://localhost:11434/api/chat -d '{"model": "llama3.2", ...}'
curl http://localhost:11434/api/chat -d '{"model": "mistral", ...}'
# ↑ Ollama automatically swaps the model in VRAM
Ollama's native API lives at /api/chat, and it also exposes an OpenAI-compatible endpoint at /v1/chat/completions. Either way, no protocol translation is needed between models: it is a single local API, and model switching is an internal runtime behavior.
The Key Distinction: Where the Switching Happens
| | LiteLLM | Ollama / llama.cpp |
|---|---|---|
| Service location | Proxy layer (cloud or local) | Local inference runtime |
| What is switched | Different companies, different protocols | Different model weights in the same runtime |
| Main work | Protocol translation + routing + load balancing | Model weight loading/unloading in VRAM |
| API unification | LiteLLM provides it | Already unified natively |
They are often used together:
Your App
   │
   ▼
LiteLLM Proxy      ← Unified entry point: cost, fallback, rate limiting
   │        │
   ▼        ▼
OpenAI    Ollama   ← Ollama handles local model swapping internally
          (llama3 / mistral / qwen ...)
LiteLLM’s model: "ollama/llama3.2" simply forwards to Ollama; the actual model switch is performed by Ollama.
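As a rough sketch of what "routes and forwards" means, the snippet below dispatches on the provider prefix of the model string and falls back to the next candidate on failure. The backends are stand-in functions, and `split_model` and `complete` are hypothetical helpers, not LiteLLM's Router API.

```python
# Hypothetical backends: each maps (model, prompt) to a reply or raises.
def call_openai(model: str, prompt: str) -> str:
    raise ConnectionError("simulated outage")


def call_ollama(model: str, prompt: str) -> str:
    return f"[ollama:{model}] {prompt}"


BACKENDS = {"openai": call_openai, "ollama": call_ollama}


def split_model(model: str) -> tuple[str, str]:
    # 'ollama/llama3.2' -> ('ollama', 'llama3.2'); bare names default to 'openai'.
    provider, sep, name = model.partition("/")
    return (provider, name) if sep else ("openai", model)


def complete(prompt: str, candidates: list[str]) -> str:
    """Try each candidate model in order and return the first successful reply."""
    errors = []
    for model in candidates:
        provider, name = split_model(model)
        try:
            return BACKENDS[provider](name, prompt)
        except ConnectionError as exc:
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all candidates failed: " + "; ".join(errors))


reply = complete("Hello", ["gpt-4o", "ollama/llama3.2"])
print(reply)
```

The proxy never touches weights: when the request lands on the Ollama backend, any model swap happens there.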
vLLM: Why It Does NOT Dynamically Swap Models
vLLM is a high-throughput inference engine designed for production serving. It supports OpenAI-compatible APIs, but its model switching story is fundamentally different from Ollama’s.
vLLM’s Modes
Single-model mode (default):
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B \
--port 8000
One model is bound at startup. Changing it requires a restart.
Multi-LoRA mode (dynamic adapters, same base model):
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B \
--enable-lora --max-lora-rank 64
Different LoRA adapters can be swapped per request, but the base model stays the same.
Why Not Just Swap Models Like Ollama?
The reason is architectural — vLLM’s core innovations make dynamic model swapping prohibitively expensive.
1. Static VRAM Allocation with PagedAttention
At startup, vLLM claims a fixed share of each GPU's VRAM (set by --gpu-memory-utilization, 0.9 by default) and splits it between model weights and a pre-allocated KV Cache memory pool:
vLLM VRAM layout (allocated at startup, static):
┌─────────────────────────────────────────────┐
│ Model Weights (static) │ ← bulk of VRAM
├─────────────────────────────────────────────┤
│ KV Cache Memory Pool (pre-allocated) │ ← managed by PagedAttention
│ Page0 │ Page1 │ Page2 │ ... │ PageN │
│ req_A │ req_B │ req_A │ ... │ req_C │ ← shared across concurrent requests
└─────────────────────────────────────────────┘
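The layout above can be imitated with a toy block allocator: a fixed pool of pages is created once, and requests take pages from it as their sequences grow. `ToyBlockPool` is a hypothetical class written for this post and is far simpler than vLLM's real block manager.

```python
class ToyBlockPool:
    """Fixed pool of KV-cache blocks, handed out on demand to requests."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size           # tokens stored per block
        self.free = list(range(num_blocks))    # physical block ids still unused
        self.tables = {}                       # request id -> list of block ids
        self.lengths = {}                      # request id -> tokens stored so far

    def append_token(self, req: str) -> None:
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:           # no block yet, or the last one is full
            if not self.free:
                raise MemoryError("KV pool exhausted")
            self.tables.setdefault(req, []).append(self.free.pop(0))
        self.lengths[req] = n + 1

    def release(self, req: str) -> None:
        """Return all blocks of a finished request to the pool."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)


pool = ToyBlockPool(num_blocks=4, block_size=2)
pool.append_token("req_A")   # req_A takes page 0
pool.append_token("req_A")   # page 0 is now full
pool.append_token("req_B")   # req_B takes page 1
pool.append_token("req_A")   # req_A grows into page 2
print(pool.tables)
```

Pages belonging to one request end up interleaved with pages of others, which is why the pool cannot be torn down while any request is still running.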
Swapping a model would require:
1. Wait for ALL in-flight requests to complete ← may never happen under high load
2. Release the KV Cache pool (all contexts lost) ← breaks PagedAttention continuity
3. Unload weights
4. Load new weights
5. Re-initialize the KV Cache pool and scheduler
This process takes 30 seconds to several minutes — catastrophic for a production request queue.
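A back-of-the-envelope calculation shows why that pause hurts. The arrival rate of 50 requests per second is an assumed figure chosen for illustration, and `requests_stalled` is a hypothetical helper.

```python
def requests_stalled(arrival_rate_per_s: float, swap_seconds: float) -> int:
    """Requests that arrive (and must wait or fail) while the engine is swapping."""
    return round(arrival_rate_per_s * swap_seconds)


ASSUMED_ARRIVAL_RATE = 50  # requests per second, illustrative only

stalled = {swap_s: requests_stalled(ASSUMED_ARRIVAL_RATE, swap_s) for swap_s in (30, 120)}
for swap_s, count in stalled.items():
    print(f"swap of {swap_s}s -> {count} requests stalled")
```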
2. Continuous Batching Cannot Pause
vLLM uses Continuous Batching: new requests are inserted into the batch as soon as old ones finish, so under load the GPU is almost never idle:
Timeline:
t=0 [req_A token1] [req_B token1] [req_C token1] ← batched inference
t=1 [req_A token2] [req_B token2] [req_D token1] ← req_C done, req_D joins
t=2 [req_A token3] [req_E token1] [req_D token2] ← req_B done, req_E joins
...
To swap models, you need a moment when zero requests are in flight — under high concurrency, that moment never comes.
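The scheduling pattern in the timeline can be reproduced with a small simulation. This is a sketch under simplifying assumptions (all requests queued at the start, one token per request per step); `simulate` is a hypothetical function, not vLLM's scheduler.

```python
from collections import deque


def simulate(requests, max_batch):
    """requests: list of (name, tokens_to_generate), all queued at t=0.

    Returns one sorted list of active request names per decode step.
    """
    waiting = deque(requests)
    running = {}       # name -> tokens still to generate
    timeline = []
    while waiting or running:
        # Fill any free batch slots before each step.
        while waiting and len(running) < max_batch:
            name, n = waiting.popleft()
            running[name] = n
        timeline.append(sorted(running))
        # One decode step: every running request emits one token.
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]
    return timeline


timeline = simulate([("A", 3), ("B", 2), ("C", 1), ("D", 2), ("E", 1)], max_batch=3)
for t, batch in enumerate(timeline):
    print(f"t={t}", batch)
```

With these inputs every step runs a full batch, so there is no step at which the engine could swap weights without interrupting someone.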
3. Design Philosophy Comparison
| Ollama's assumptions | vLLM's assumptions |
|---|---|
| Few users | Many users (100+ concurrent) |
| Requests are bursty | Requests stream in continuously |
| VRAM utilization is secondary | VRAM utilization must be maximized |
| Swap downtime is acceptable | Any downtime is unacceptable |
| → Dynamic load/unload ✅ | → Static allocation, one model ✅ |
vLLM vs Ollama Summary
| Feature | Ollama | vLLM |
|---|---|---|
| Dynamic model swap | ✅ Auto unload/load | ⚠️ Limited (LoRA only) |
| Target scenario | Local dev, convenience | Production, high throughput |
| GPU utilization | Moderate | ✅ PagedAttention maximizes it |
| Concurrent requests | Limited | ✅ Built for high concurrency |
| Dynamic LoRA switching | ❌ | ✅ |
| Multi-tenant isolation | ❌ | ✅ |
Cloud-Scale Inference: How Alibaba Cloud and Others Build on vLLM
In cluster environments like Alibaba Cloud’s PAI-EAS or Bailian, the architecture goes far beyond a single vLLM instance:
┌─────────────────────────────────────────────────┐
│ User API Requests │
└─────────────────┬───────────────────────────────┘
│
┌─────────────────▼───────────────────────────────┐
│ API Gateway / Load Balancer │
│ (traffic shaping, auth, rate limiting, billing)│
└─────────────────┬───────────────────────────────┘
│
┌─────────────────▼───────────────────────────────┐
│ Inference Scheduler (proprietary) │
│ request routing / model versioning / autoscaling│
└──────┬──────────┬──────────┬────────────────────┘
│ │ │
┌──────▼──┐ ┌───▼─────┐ ┌─▼───────┐
│ vLLM │ │ vLLM │ │ vLLM │ ← multiple instances
│ Inst A │ │ Inst B │ │ Inst C │
│ 8×A100 │ │ 8×A100 │ │ 8×A100 │
└─────────┘ └─────────┘ └─────────┘
Cloud providers use vLLM as the per-node inference engine, then add cluster-level capabilities on top.
Key Cluster Optimizations
1. Tensor Parallelism vs Pipeline Parallelism
Single machine, 8 GPUs (vLLM natively supports):
GPU0 | GPU1 | GPU2 | GPU3 | GPU4 | GPU5 | GPU6 | GPU7
←────────── Tensor Parallelism: split one layer across GPUs ──────────→
Multi-machine (vLLM can do this through Ray, but it requires additional cluster engineering):
Machine A [GPU0~7] ←→ Machine B [GPU0~7] ←→ Machine C [GPU0~7]
←──────────── Pipeline Parallelism: different layers on different machines ──────────→
Interconnect bandwidth (approximate):
NVLink (intra-node) ~600 GB/s vs InfiniBand (inter-node) ~50 GB/s
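To illustrate what "split one layer across GPUs" means, the sketch below shards a weight matrix by columns, computes each shard separately, and concatenates the results. It uses plain Python lists instead of GPU tensors, assumes the column count divides evenly, and all function names are hypothetical.

```python
def matmul(x, w):
    """x: [n][k], w: [k][m] -> [n][m], using plain Python lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split w into `parts` column shards (assumes the width divides evenly)."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    shards = split_columns(w, parts)
    partials = [matmul(x, shard) for shard in shards]   # one partial result per "GPU"
    # Stand-in for an all-gather: concatenate the column slices of each row.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, parts=2)
print(full, sharded)
```

The concatenation step is the part that crosses the interconnect on every layer, which is why tensor parallelism is usually kept inside a single NVLink-connected node.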
2. Prefill-Decode Disaggregation
This is the most impactful cluster optimization in recent years:
Traditional vLLM (Prefill + Decode on the same instance):
Request → [Prefill: process input prompt] → [Decode: generate tokens one by one] → Output
          GPU compute-bound                 GPU memory-bandwidth-bound
Problem: two completely different compute profiles interfere with each other.
PD Disaggregation (cloud providers):
                          ┌────────────────────────────────────┐
Request ──→ Scheduler ──→ │ Prefill Cluster                    │ ──→ KV Cache Transfer
                          │ (compute-heavy, fewer big GPUs)    │             │
                          └────────────────────────────────────┘             │
                                                                             ▼
                          ┌────────────────────────────────────┐             │
Output ←───────────────── │ Decode Cluster                     │ ←───────────┘
                          │ (memory-heavy, more smaller GPUs)  │
                          └────────────────────────────────────┘
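The handoff between the two clusters can be sketched as two functions connected by a KV cache object. Everything here is a placeholder: `KVHandle`, `prefill_worker`, and `decode_worker` are hypothetical names, strings stand in for tensors, and the network transfer is only a comment.

```python
from dataclasses import dataclass, field


@dataclass
class KVHandle:
    request_id: str
    tokens: list = field(default_factory=list)   # stands in for per-token K/V tensors


def prefill_worker(request_id, prompt_tokens):
    """Process the whole prompt in one pass (compute-bound) and return its KV cache."""
    return KVHandle(request_id, list(prompt_tokens))


def decode_worker(kv, max_new_tokens):
    """Generate token by token (memory-bound), extending the transferred KV cache."""
    generated = []
    for i in range(max_new_tokens):
        tok = f"tok{i}"            # placeholder for real sampling
        kv.tokens.append(tok)      # each new token adds one entry to the cache
        generated.append(tok)
    return generated


kv = prefill_worker("req-1", ["Hello", "world"])
# In a disaggregated deployment, `kv` would be shipped over the network here.
generated = decode_worker(kv, max_new_tokens=3)
print(generated, len(kv.tokens))
```

Because the two stages only share the KV cache, each cluster can be sized and scaled for its own bottleneck.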
3. Centralized KV Cache
Single-instance vLLM:
KV Cache lives in local VRAM → requests must stay on the same instance
Cluster problem:
Multi-turn chat → each request may route to a different instance → KV Cache miss → re-Prefill
Solution:
┌────────────────────────────────────────┐
│       Centralized KV Cache Store       │
│ (high-speed NVMe / memory pool / RDMA) │
└──────┬──────────────┬──────────────┬───┘
       │              │              │
     Inst A         Inst B         Inst C
Any instance can read previous KV Cache
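One way to picture the shared store is a map keyed by a hash of the token prefix, so any instance can find the longest prefix that was already computed. This is a toy sketch: `ToyKVStore` is hypothetical, the stored value is a string instead of tensors, and production systems index by block rather than probing every prefix length.

```python
import hashlib


class ToyKVStore:
    """Shared store mapping a token prefix to its (placeholder) KV cache."""

    def __init__(self):
        self._blobs = {}

    @staticmethod
    def key(tokens):
        return hashlib.sha256("\x1f".join(tokens).encode()).hexdigest()

    def put(self, tokens, kv):
        self._blobs[self.key(tokens)] = kv

    def longest_prefix(self, tokens):
        """Return (n, kv) for the longest stored prefix of tokens, or (0, None)."""
        for n in range(len(tokens), 0, -1):
            kv = self._blobs.get(self.key(tokens[:n]))
            if kv is not None:
                return n, kv
        return 0, None


store = ToyKVStore()
turn1 = ["hi", "there"]
store.put(turn1, kv="kv-for-turn1")            # written by instance A

turn2 = turn1 + ["how", "are", "you"]
hit_len, cached = store.longest_prefix(turn2)  # read by instance B
to_prefill = turn2[hit_len:]                   # only the new tokens need prefill
print(hit_len, cached, to_prefill)
```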
4. Elastic Scaling
vLLM single instance: loading model weights takes time
Llama-70B (4-bit) ≈ 35 GB → load time 30s ~ 2min
Cloud optimizations:
① Weight pre-warming (popular models stay resident)
② Predictive scaling (scale before traffic arrives)
③ Spot instance utilization (preemptible GPUs for cost savings)
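The load-time range quoted above is roughly what simple arithmetic predicts. The read bandwidths below (1.0 GB/s and 0.3 GB/s) are assumed values for illustration, not measurements, and `load_seconds` is a hypothetical helper that ignores initialization overhead.

```python
def load_seconds(weight_gb: float, read_gb_per_s: float) -> float:
    """Time to stream the weights from storage, ignoring all other startup cost."""
    return weight_gb / read_gb_per_s


# 70B parameters at 4 bits (0.5 bytes) each.
weights_gb = 70e9 * 0.5 / 1e9

fast = load_seconds(weights_gb, 1.0)   # assumed fast local storage
slow = load_seconds(weights_gb, 0.3)   # assumed slower or networked storage
print(f"{weights_gb:.0f} GB -> {fast:.0f}s to {slow:.0f}s")
```

Under these assumptions a cold start costs between about half a minute and two minutes, which is the gap that pre-warming and predictive scaling try to hide.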
What Major Cloud Providers Use
| Provider | Approach |
|---|---|
| Alibaba Cloud PAI-EAS | Modified vLLM + proprietary scheduler |
| ByteDance | Custom LightSeq / modified vLLM |
| Tencent Cloud | Modified vLLM + proprietary acceleration engine |
| Moonshot AI | Mooncake (PD disaggregation + KV Cache pooling) |
| AWS SageMaker | vLLM / TGI as selectable backends |
| Google Cloud | Custom Pathways system for first-party models |
Putting It All Together
Your App
     │
     ▼
LiteLLM Proxy           ← Unified entry: routing / fallback / cost tracking
     │            │
     ▼            ▼
    vLLM        Ollama
(production)  (local dev)
  Qwen2.5      llama3.2
  Llama-3      mistral
Each layer exists for a reason:
- LiteLLM — protocol translation and provider routing (cross-provider switching)
- Ollama — convenient local model management with dynamic VRAM swapping (same-runtime switching)
- vLLM — maximum throughput for one model with PagedAttention and Continuous Batching (no switching by design)
- Cloud schedulers — cluster-level routing, PD disaggregation, centralized KV Cache, and elastic scaling (infrastructure-level orchestration)
In short: Ollama treats VRAM as a resource pool where models are tenants. vLLM treats VRAM as a battlefield where the model is permanently stationed. Cloud providers connect many battlefields into a war theater with centralized logistics. And LiteLLM is the diplomat that speaks every army’s language.