Stop Letting Your GPU Nap: Stack Jobs and Supercharge Your Experiments
Tips for ML researchers on shared clusters who are tired of slow experiments and sleepy GPUs.
Wait, Why Is My GPU So Bored? 🥹
Ever peeked at nvidia-smi mid-training and felt personally offended by a 15% GPU utilization reading?
You’re not alone.
In many ML setups—especially in deep reinforcement learning or self-supervised learning—the GPU ends up spending more time waiting around than doing actual work. Here’s why:
- Your model might be tiny (looking at you, MLPs and small CNNs).
- Environment steps in RL live on the CPU and take their sweet time.
- Data augmentation and preprocessing often clog the CPU while the GPU twiddles its thumbs.
- Even classic vision or SimCLR jobs on CIFAR-10 barely make a dent in a modern A100's capacity.
Moral of the story? You’ve got untapped compute just sitting there.
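Not sure if that's your situation? One quick check is to time how long each training step spends waiting for the next batch versus doing actual GPU work. Here's a rough sketch in plain PyTorch; it assumes your script already has a DataLoader, a model on the GPU, and an optimizer (the names here are just stand-ins, not anything from a specific library):

```python
import time
import torch
import torch.nn.functional as F

def profile_step_split(loader, model, optimizer, device="cuda", steps=50):
    """Rough wall-clock split: waiting on the DataLoader vs. doing GPU work."""
    data_time, gpu_time = 0.0, 0.0
    model.train()
    t0 = time.perf_counter()
    for i, (x, y) in enumerate(loader):
        data_time += time.perf_counter() - t0      # CPU side: loading + augmentation

        t1 = time.perf_counter()
        x, y = x.to(device), y.to(device)
        loss = F.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()                   # wait for the GPU so the timing is honest
        gpu_time += time.perf_counter() - t1

        if i + 1 >= steps:
            break
        t0 = time.perf_counter()

    print(f"data wait: {data_time:.1f}s | GPU work: {gpu_time:.1f}s over {steps} steps")
```

If the data-wait number dwarfs the GPU number, congratulations: your GPU is napping.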
Signs of GPU Underuse
Here’s how to know your GPU’s taking a nap:
- nvidia-smi shows plenty of free VRAM (e.g., 5 GB used out of 40 GB).
- The "GPU-Util" column idles in the teens while the CPU sits near 100%.
- Example: a fastai ResNet-18 computer-vision run on an A100 sat at ~20% util with memory to spare (reference), and an RLlib DQN job with a 256k batch size only briefly spiked above 25%.
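If you'd rather log those numbers from a script than keep refreshing nvidia-smi, here's a minimal polling sketch using the NVML bindings. It assumes the nvidia-ml-py package (imported as pynvml) is installed; the interval and sample count are arbitrary:

```python
# pip install nvidia-ml-py   (provides the pynvml module)
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)                 # GPU 0

for _ in range(12):                                           # ~1 minute of samples
    util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu   # % of time SMs were busy
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"util: {util:3d}% | VRAM: {mem.used / 1e9:.1f} / {mem.total / 1e9:.1f} GB")
    time.sleep(5)

pynvml.nvmlShutdown()
```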
You might be tempted to buy more GPUs. Don’t. Use what you already have better.
The Secret: Run Multiple Jobs at Once
If your current job is only using a slice of the GPU, just stack more on top!
Here’s the magic formula:
# Run three jobs in parallel
for cfg in cfg1.yaml cfg2.yaml cfg3.yaml; do
    python train.py --config "$cfg" &
done
wait  # Let them all finish before exiting
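Prefer to orchestrate from Python instead of bash (say, to give each run its own log file)? A rough equivalent with subprocess looks like this; train.py and the config names are the same hypothetical ones as above:

```python
import subprocess

configs = ["cfg1.yaml", "cfg2.yaml", "cfg3.yaml"]

# Launch one training process per config, each writing to its own log file
procs = []
for cfg in configs:
    log = open(f"{cfg}.log", "w")
    p = subprocess.Popen(
        ["python", "train.py", "--config", cfg],
        stdout=log,
        stderr=subprocess.STDOUT,
    )
    procs.append((p, log))

# The Python equivalent of `wait`: block until every job finishes
for p, log in procs:
    p.wait()
    log.close()
```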
Why it works:
- Each job uses a slice of VRAM; their peaks rarely coincide (see the memory-cap sketch after this list).
- Streaming Multiprocessors (SMs) stay busier: when one job waits on the CPU, another is mid-backprop.
- More info on SMs: each SM handles the actual math operations (matrix multiplies, convolutions, and so on). An A100 has 108 of them, so it can handle a lot of parallel math, provided you feed it well.
- You triple your sweep throughput without touching the cluster queue.
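Want extra insurance on those VRAM slices? PyTorch can cap how much of the card a single process's caching allocator is allowed to claim. A tiny sketch; the 0.3 fraction is an arbitrary example, not a recommendation:

```python
import torch

# Cap this process at ~30% of GPU 0's memory. Allocations beyond that raise
# an out-of-memory error in *this* job instead of starving its neighbours.
torch.cuda.set_per_process_memory_fraction(0.3, device=0)

# ...then build your model and run training as usual.
```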
This trick works great for:
- Hyperparameter sweeps
- Seed averaging
- Trying three ideas because you’re impatient (relatable)
Tips, Pitfalls, and Gotchas (With Explanations!)
| ✅ / ⚠️ | What You Should Know | Why It Matters |
|---|---|---|
| ✅ | Leave ~10% of VRAM unused | PyTorch loves to surprise you with memory spikes. A small buffer helps you avoid sudden OOM crashes that wipe out all of your jobs. |
| ✅ | Use /scratch or SSD storage | If three jobs all hit the disk at once on slow storage, your fancy parallelism turns into a data-loading traffic jam. |
| ✅ | Tag runs in your logger (e.g., wandb.init(group="stacked")) | Keeps your dashboards from looking like a spaghetti bowl of metrics. Easier to compare, track, and brag about. |
| ✅ | Watch num_workers and thread counts (see the sketch below this table) | Each job spawns its own data loaders. Multiply that by three and suddenly your system has 48 worker processes hoarding RAM. Keep things lean. |
| ⚠️ | Don't stack giant models | If you're running LLMs, ViTs, or anything eating 80%+ of VRAM, just… don't. You'll get out-of-memory errors faster than you can say "SIGKILL". |
| ⚠️ | Know your cluster's rules | Some clusters have strict policies: one job per GPU, no background processes, etc. Break them, and you might lose access. Nobody wants that email. |
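To make the logger and num_workers rows concrete, here's roughly what the "keep things lean" settings look like inside each training script. The group name, thread count, worker count, and dummy dataset are all illustrative, not prescriptions:

```python
import os
os.environ.setdefault("OMP_NUM_THREADS", "4")   # ideally set in the launch script, before heavy imports

import torch
from torch.utils.data import DataLoader, TensorDataset
import wandb

torch.set_num_threads(4)                        # modest CPU threading per stacked job

# Dummy data just to keep the example self-contained
train_dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))

# Fewer loader workers per job: 3 jobs x 4 workers = 12 processes, not 48
loader = DataLoader(train_dataset, batch_size=256, num_workers=4, pin_memory=True)

# Group stacked runs so the dashboard stays readable ("offline" here just avoids needing a login)
wandb.init(project="my-project", group="stacked", mode="offline")
```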
TL;DR 💛
If your GPU looks bored, it probably is.
Instead of leaving it idle, stack 2–3 light-to-medium jobs on the same card. You’ll:
- Finish sweeps 2–3x faster
- Reduce total GPU-hours
- Help your labmates get off the waitlist
Your Move 💅
- Fire up a few extra jobs.
- Monitor nvidia-smi.
- Watch your GPU actually break a sweat.
- Flex your productivity gains.
You don’t need more compute—you just need to use it smarter.