Configuration setup
FSDP2 actor (fully_shard)
colocated / hybrid engine
vLLM TP=1 (4 DP replicas)
param_offload=False
actor optimizer_offload=False
ref param_offload=True
gpu_memory_utilization=0.25
free_cache_engine=True
vLLM ≥0.8.5 → sleep_level=2
TL;DR: the actor (FSDP) and the rollout (vLLM)
live on the same GPUs. They cannot both hold their full footprint at once, so verl time-shares the
GPU: vLLM is put to sleep (its weights + KV pool released back to the CUDA driver via vLLM's CuMemAllocator)
while the actor trains, and woken while it generates. With actor offload off, the FSDP actor +
optimizer stay resident the whole time, so vLLM only gets a fraction of each GPU, and its memory is the part
that churns.
How to read the bars: each section below shows one GPU (rank 0, 80 GB) at that phase. Heights are illustrative, scaled to tell the story, not measured.
| FSDP actor params (sharded ¼) | Optimizer state (fp32 master + Adam m,v) |
| Gradients + activations | Ref policy (offloaded → CPU) |
| vLLM weights (full model, TP=1) | vLLM KV-cache pool |
| Free / reserved | Weight all-gather (transient, during sync) |
Each of the N GPUs permanently holds only 1/N of every parameter, without any rank ever materializing the whole model on its GPU.
cpu_init_weights → builds on CPU and loads the real checkpoint weights.
init_empty_weights (accelerate) → builds on the meta device with zero-byte tensors, only registering the shape and dtype.
get_init_weight_context_manager() · FSDPEngine._build_module()
fully_shard wraps each decoder layer (still on CPU) so every parameter becomes
a DTensor with placement Shard(0): one logical tensor split along dim 0 across the GPU
mesh, each rank owning a 1/N slice (.shape stays global, .to_local() is the
local slice).
The sharded module then moves to the GPU: rank 0 via .to(device) (its
real 1/N shard), ranks 1…N-1 via .to_empty(device) (an empty 1/N slot). Note the broadcast
source is a separate full CPU snapshot (full_state), held only on rank 0.
Rank 0 then broadcasts it. Every rank (rank 0 included) transiently allocates a
full-size GPU buffer to receive it, then copies out only its own 1/N slice into its shard slot.
apply_fsdp2() (→ fully_shard()) · fsdp2_load_full_state_dict() (→ _broadcast_state_dict())
mp_policy = MixedPrecisionPolicy(param_dtype=param_dtype, reduce_dtype=reduce_dtype)
full_state = module.state_dict()  # snapshot (rank 0 = real weights)
apply_fsdp2(module, fsdp_kwargs, self.engine_config)  # shard each layer → Shard(0) DTensors
fsdp2_load_full_state_dict(module, full_state, fsdp_mesh, offload_policy)  # broadcast → fill shards
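The Shard(0) layout and the broadcast-then-slice fill can be illustrated without any GPU. The sketch below is a pure-Python stand-in, not verl or PyTorch code: nested lists play the role of tensors, and `shard_dim0` and `fill_shard_from_broadcast` are hypothetical helper names. It assumes ceil-sized chunks along dim 0 and an evenly divisible example.

```python
import math


def shard_dim0(rows, world_size, rank):
    """Return the rows that `rank` owns when `rows` is split along dim 0."""
    chunk = math.ceil(len(rows) / world_size)  # ceil-sized chunks (assumption)
    return rows[rank * chunk:(rank + 1) * chunk]


def fill_shard_from_broadcast(full_rows, world_size, rank):
    """Mimic the fill step: receive the full tensor, keep only the local slice."""
    recv_buffer = list(full_rows)  # transient full-size buffer on every rank
    local = shard_dim0(recv_buffer, world_size, rank)
    del recv_buffer                # the full-size copy does not outlive the fill
    return local


world_size = 4
full = [[r * 10 + c for c in range(3)] for r in range(8)]  # one 8x3 "parameter"
shards = [fill_shard_from_broadcast(full, world_size, r) for r in range(world_size)]

global_shape = (len(full), len(full[0]))            # what .shape would report
local_shape = (len(shards[0]), len(shards[0][0]))   # what .to_local() would hold
reassembled = [row for shard in shards for row in shard]
print(global_shape, local_shape)
```

Each rank ends up with 2 of the 8 rows, and concatenating the local slices in rank order reproduces the original tensor.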
The optimizer is built over module.parameters(), which are already the sharded DTensors, so its state is
naturally 1/N-sized. The Adam m/v buffers (≈ 2× fp32 of the shard) are allocated lazily on the first
step(); they appear during the first actor update, not at init.
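As a rough sizing aid for the m/v buffers, the arithmetic can be written out. This is a back-of-the-envelope sketch: the 7B parameter count is a hypothetical example (the model size is not stated here), and `adam_mv_bytes_per_rank` is an illustrative helper, not a verl function.

```python
GIB = 1024 ** 3


def adam_mv_bytes_per_rank(num_params, world_size, bytes_per_fp32=4):
    """Bytes for Adam's m and v buffers on one rank (2 x fp32 of the shard)."""
    shard_params = num_params / world_size
    return 2 * shard_params * bytes_per_fp32


num_params = 7_000_000_000  # hypothetical model size, for illustration only
world_size = 4              # matches the 4 GPUs in this setup
mv_gib = adam_mv_bytes_per_rank(num_params, world_size) / GIB
print(f"m+v per rank: {mv_gib:.1f} GiB (allocated on the first step())")
```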
FSDPEngine._build_optimizer() → build_optimizer() · _build_model_optimizer()
With tensor-parallel size 1, vLLM does not shard. Each GPU gets its own complete copy of the model.
load_format="dummy": each replica allocates
full-model weight buffers and fills them with random values (no disk read).
load_format = "dummy" · initialize_dummy_weights()
vLLM takes what is left of its gpu_memory_utilization budget after the weights and carves it into fixed-size blocks (each
holding ~16 tokens of KV, all layers), which it physically reserves up front as the empty KV pool (prints
#GPU blocks: N). Space is committed at init; the actual K/V values are only written during
generation. The engine uses enable_sleep_mode=True so the pool can be released on sleep.
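The block-count arithmetic behind the "#GPU blocks" figure can be sketched as follows. This is a simplified model, not vLLM's profiling code: the layer count, KV-head count, head size, weight size, and activation peak are all hypothetical values, and the exact meaning of gpu_memory_utilization has varied between vLLM versions.

```python
GIB = 1024 ** 3


def kv_block_bytes(num_layers, num_kv_heads, head_dim, block_tokens=16, dtype_bytes=2):
    """Bytes of one KV block: K and V, all layers, `block_tokens` tokens."""
    return 2 * num_layers * num_kv_heads * head_dim * block_tokens * dtype_bytes


def num_gpu_blocks(total_bytes, gpu_memory_utilization, weight_bytes,
                   peak_activation_bytes, block_bytes):
    """Blocks that fit in the budget once weights and activations are paid for."""
    budget = total_bytes * gpu_memory_utilization
    kv_bytes = budget - weight_bytes - peak_activation_bytes
    return max(0, int(kv_bytes // block_bytes))


# Hypothetical model and footprint numbers, for illustration only.
block = kv_block_bytes(num_layers=28, num_kv_heads=8, head_dim=128)
blocks = num_gpu_blocks(
    total_bytes=80 * GIB,
    gpu_memory_utilization=0.25,    # the value used in this setup
    weight_bytes=6 * GIB,           # assumed bf16 weight footprint
    peak_activation_bytes=2 * GIB,  # assumed profiling peak
    block_bytes=block,
)
print(f"block = {block / 2**20:.2f} MiB, #GPU blocks = {blocks}")
```

With these assumed numbers the 20 GiB budget leaves 12 GiB for the pool; a larger model or a lower gpu_memory_utilization shrinks the block count directly.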
gpu_memory_utilization, enable_sleep_mode · determine_num_available_blocks
The dummy weights are overwritten by the actor's real weights at the first sync; with TP=1 it's a straight
whole-tensor copy, no resharding needed. See section 2, Wake vLLM and sync weights.
get_per_tensor_param() · update_weights() → update_weights_from_ipc()
A step's rollout begins by waking vLLM's freed memory and pushing the
actor's fresh weights into it (gated on free_cache_engine=True).
What sleep(level=2) did. After the previous rollout and before the actor started training, vLLM called
sleep(level=2). It unmapped and freed the
physical pages (handing that GPU memory back so the actor could train), but kept the virtual
reservations and the tensor objects. So the model structure survived; it just had no bytes behind it.
What wake_up does now. It maps fresh physical pages back onto those same
virtual addresses. The weight tensors instantly become valid again (same shapes, same addresses), but the
new pages hold undefined leftover bytes (no re-init, not the old weights; possibly the actor's stale
grad/activation data).
resume() → wake_up() · release() → sleep()
get_per_tensor_param processes the
parameters one at a time through a lazy generator. For each parameter, param.full_tensor() calls
all_gather, which combines the shards from all ranks so that every rank (not just rank 0) ends up with
the full tensor; the result is then cast to bf16. Only one parameter is live at a time (the pink block in the GPU memory diagram above). The whole model is never unsharded at once, so the transient stays small.
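The one-parameter-at-a-time behaviour comes from the generator shape, which a small pure-Python sketch can show. This is a stand-in, not verl's get_per_tensor_param: lists replace tensors, list concatenation replaces the all_gather, the bf16 cast is omitted, and the parameter names are made up.

```python
def per_tensor_params(sharded_model):
    """Yield (name, full_tensor) lazily, one parameter at a time."""
    for name, shards in sharded_model.items():
        # Stand-in for param.full_tensor(): concatenate every rank's shard.
        full = [value for shard in shards for value in shard]
        yield name, full  # the previous `full` is garbage once we move on


# Hypothetical 4-rank model: each parameter is a list of per-rank shards.
sharded_model = {
    "embed.weight": [[1, 2], [3, 4], [5, 6], [7, 8]],
    "layer0.weight": [[9], [10], [11], [12]],
}

peak_live = 0
for name, full in per_tensor_params(sharded_model):
    peak_live = max(peak_live, len(full))  # size of the single live full tensor

total = sum(len(shard) for shards in sharded_model.values() for shard in shards)
print(f"peak transient = {peak_live} elements, whole model = {total} elements")
```

The peak transient equals the largest single parameter, not the sum over the model, which is why the unsharded footprint stays small.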
get_per_tensor_param() →
DTensor.full_tensor()
What .full_tensor() runs under the hood
param.full_tensor()                      ← what you call (DTensor convenience method)
└─ redistribute Shard(0) → Replicate     ← DTensor: every rank needs all shards
   └─ all_gather(...)                    ← the actual collective function it invokes
      └─ ncclAllGather                   ← low-level NCCL/GPU primitive that moves the bytes
CUDA IPC (Inter-Process Communication) avoids a host hop between the two processes: one process exports
a handle to its GPU memory, the other maps and reads it, giving a direct GPU→GPU copy.
update_weights() is the orchestrator (rollout side). It is the entry point verl calls with
the per-tensor generator from step 2. It pulls tensors from the generator, batched by
update_weights_bucket_megabytes (e.g. WEIGHT_BUCKET_MB=512), and dispatches the
transfer to the vLLM worker via RPC, passing a use_shm flag (IPC vs. shared-memory fallback).
update_weights_from_ipc() is the mover (in each vLLM worker). It imports the actor's CUDA
IPC handle, maps the same GPU memory into its own process, and copies it GPU→GPU into the model's weight
buffers, overwriting the dummy/garbage values with the current policy.
update_weights() → update_weights_from_ipc()
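The bucketing step can be pictured as a generator that groups tensors until a byte limit is reached. The sketch below is one plausible way to do it, not verl's implementation: `bucketize` is a hypothetical helper, the tensor names and sizes are invented, and the real code moves GPU tensors rather than names.

```python
MB = 1024 * 1024


def bucketize(named_sizes, bucket_bytes):
    """Group (name, nbytes) pairs into buckets of at most `bucket_bytes`.

    A tensor larger than the limit still gets a bucket of its own.
    """
    bucket, used = [], 0
    for name, nbytes in named_sizes:
        if bucket and used + nbytes > bucket_bytes:
            yield bucket           # flush before the limit would be exceeded
            bucket, used = [], 0
        bucket.append(name)
        used += nbytes
    if bucket:
        yield bucket               # flush the final partial bucket


# Hypothetical per-tensor sizes, as a lazy stream like the per-tensor generator.
stream = iter([
    ("a", 300 * MB), ("b", 200 * MB), ("c", 100 * MB),
    ("d", 600 * MB), ("e", 10 * MB),
])
buckets = list(bucketize(stream, bucket_bytes=512 * MB))
print(buckets)
```

Because the input is consumed lazily, at most one bucket's worth of gathered tensors needs to be alive at a time.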
update_weights() dispatches the transfer
update_weights(generator)                ← rollout side: batch by WEIGHT_BUCKET_MB, dispatch
└─ update_weights_from_ipc(use_shm)      ← in each vLLM worker:
   ├─ import CUDA IPC handle             ← map the actor's GPU memory into this process
   └─ load_weights()                     ← copy GPU→GPU into vLLM's weight buffers
(fallback: shared memory if IPC unsupported → use_shm)
resume(["kv_cache"]) → wake_up(tags=["kv_cache"])
re-allocates the (empty) KV-cache pool.
resume()
data.train_batch_size
(TRAIN_BATCH_SIZE=8) is the number of prompts per training step;
actor_rollout_ref.rollout.n (ROLLOUT_N=8) is the number of responses per prompt
(the GRPO group). So vLLM generates TRAIN_BATCH_SIZE × ROLLOUT_N = 64 sequences this step.
How many of them run concurrently is bounded by max_num_seqs, max_num_batched_tokens, and
the KV-pool size (set by gpu_memory_utilization and MAX_MODEL_LEN). As sequences
finish, queued ones are admitted, so the running batch flexes over time.
The ROLLOUT_N responses for one prompt come
from SamplingParams(n=ROLLOUT_N).
generate_sequences() · vLLM SamplingParams(n=...), max_num_seqs, max_num_batched_tokens
release() → vLLM sleep()
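The rollout batch arithmetic and the way the running batch flexes can be mimicked with a toy scheduler. This is a simplified sketch, not vLLM's scheduler: it enforces only a max_num_seqs-style cap (ignoring max_num_batched_tokens and KV-pool pressure), and the response lengths and the cap of 16 are invented for illustration.

```python
from collections import deque


def simulate_rollout(lengths, max_num_seqs):
    """Return the running-batch size at each decode step."""
    waiting = deque(lengths)  # remaining tokens for each queued sequence
    running = []
    trace = []
    while waiting or running:
        # Admit queued sequences while there is room in the running batch.
        while waiting and len(running) < max_num_seqs:
            running.append(waiting.popleft())
        trace.append(len(running))
        # Every running sequence emits one token; finished ones leave.
        running = [left - 1 for left in running if left > 1]
    return trace


TRAIN_BATCH_SIZE = 8  # prompts per step
ROLLOUT_N = 8         # responses per prompt (the GRPO group)
num_sequences = TRAIN_BATCH_SIZE * ROLLOUT_N

lengths = [1 + (i % 4) for i in range(num_sequences)]  # hypothetical lengths
trace = simulate_rollout(lengths, max_num_seqs=16)     # hypothetical cap
print(num_sequences, "sequences; running batch per step:", trace)
```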
release() puts vLLM to sleep
release()                                ← verl trigger (gated on free_cache_engine)
└─ sleep(level)                          ← vLLM engine: free its GPU memory
   └─ CuMemAllocator                     ← unmaps/frees the physical pages (keeps the virtual addresses)
level 1: offload weights to CPU, drop KV (weights restored from CPU on wake)
level 2: discard weights AND KV (weights re-synced from the actor on wake) ← this run, vLLM ≥ 0.8.5
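The difference between the two sleep levels can be captured in a toy model. The class below is purely illustrative and is not vLLM's API: `ToyEngine` and its methods are hypothetical, and `None` stands in for "pages with no defined contents".

```python
class ToyEngine:
    """Toy stand-in for an inference engine with two sleep levels."""

    def __init__(self, weights):
        self.gpu_weights = list(weights)  # resident weight values
        self.cpu_backup = None            # only populated by level 1
        self.kv_pool = []                 # reserved but empty KV pool

    def sleep(self, level):
        if level == 1:
            self.cpu_backup = list(self.gpu_weights)  # offload weights to CPU
        self.gpu_weights = None  # physical pages released
        self.kv_pool = None      # KV is dropped at both levels

    def wake_up(self):
        # Level 1 restores from the CPU copy. After level 2 there is nothing
        # to restore, so the weights stay undefined until a sync.
        self.gpu_weights = self.cpu_backup
        self.cpu_backup = None
        self.kv_pool = []        # pool is re-created empty

    def load_weights(self, weights):
        self.gpu_weights = list(weights)  # sync from the actor


actor_weights = [0.1, 0.2, 0.3]

level1 = ToyEngine(actor_weights)
level1.sleep(level=1)
level1.wake_up()  # weights are back without any sync

level2 = ToyEngine(actor_weights)
level2.sleep(level=2)
level2.wake_up()
undefined_after_wake = level2.gpu_weights is None
level2.load_weights(actor_weights)  # the re-sync that level 2 depends on
```

Level 2 only makes sense when a fresh copy of the weights arrives on every wake, which is the case here because the actor re-syncs them each step.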
The ref policy is wrapped with CPUOffloadPolicy(pin_memory=True), so its sharded params stay
pinned on CPU.
This phase runs a no_grad forward of the ref to get the
ref log-probs of the rollout responses.
CPUOffloadPolicy (set for forward_only) · ref log-probs in RayPPOTrainer.fit() → compute_ref_log_prob()
The actor is sharded with reshard_after_forward=True, so each layer's gathered parameters are freed after
its forward and re-gathered for backward. With
enable_gradient_checkpointing=True, inner activations are not saved, only the layer input.
With actor offload off, nothing is moved on or off the GPU (_context_switch is a no-op); vLLM being asleep is what makes room.
apply_fsdp2() / fully_shard() ·
verl/workers/engine/base.py · _context_switch() · actor update in
RayPPOTrainer.fit() → update_actor()
Standard colocated verl GRPO (FSDP2 actor + vLLM hybrid engine). Code refs are verl HEAD in
.venv-verl. Bar sizes are illustrative. Open this file in any browser; no network needed.