Here is a situation that plays out more often than most people admit. Someone downloads a large language model, points their machine at it, and waits. The model loads. Slowly. Then the first response starts generating. One word. Pause. Another word. Longer pause. By the time the third sentence appears, the person has already started questioning every hardware decision they have ever made.
This is not a software problem. It is not a configuration problem. It is a VRAM problem, and it is almost always avoidable if the hardware conversation happens before the download does.
Running LLMs locally makes genuine business sense. Full control over your data, no API costs compounding every month, no dependency on anyone else’s uptime. But the hardware requirements are specific in ways that catch most people off guard, and the gap between a setup that works and one that just technically runs is wider than any spec sheet will tell you.
The One Constraint That Drives Everything
There are a lot of numbers involved in LLM hardware planning, but almost all of them trace back to one thing: how much memory your GPU has, specifically VRAM.
When an LLM generates a response, the model’s weights need to be loaded into GPU memory and stay there throughout. If the model does not fit, the system starts borrowing from regular system RAM to compensate. This sounds like a reasonable fallback, but in practice it is a cliff: a model that spills into system RAM can slow to 1 or 2 tokens per second, slower than a person types, while the same model running fully in VRAM delivers 45 or more.
That gap is the difference between a tool people will actually use and one that gets quietly abandoned after a week.
Why Speed Is About More Than Just Having Enough VRAM
Here is something that surprises most people after they have already bought hardware. Having enough VRAM to load a model does not guarantee it will run fast. There is a second variable that quietly determines performance: memory bandwidth, which is essentially how quickly data moves between the GPU’s memory and its processing cores.
Think of it this way. If VRAM is the size of a water tank, memory bandwidth is the width of the pipe. A large tank with a narrow pipe still delivers water slowly.
Even when a model fits comfortably in VRAM, a GPU with low memory bandwidth will throttle its own performance. The processing cores end up sitting idle, waiting for data that memory cannot deliver fast enough. This is why two GPUs with identical VRAM figures can feel completely different in real use.
The RTX 5090, for example, delivers 213 tokens per second on common 8B models, a 67 percent improvement over the RTX 4090, despite only having 8GB more VRAM. Most of that gap comes down to bandwidth, not raw memory size.
When evaluating options, VRAM tells you what you can run. Bandwidth tells you how well it will actually run.
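The water-tank analogy can be turned into a rough back-of-envelope estimate: generating each token requires streaming the full set of weights from memory, so bandwidth divided by model size caps tokens per second. The sketch below uses published bandwidth figures for the two cards mentioned above, but treat the exact numbers as assumptions to verify against the vendor spec sheets.

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on generation speed: every token requires reading
    all the weights once, so throughput <= bandwidth / model size."""
    return bandwidth_gb_s / model_size_gb

# An 8B model at 4-bit quantization occupies roughly 5GB (assumed).
model_gb = 5.0
for name, bandwidth in [("RTX 4090 (~1008 GB/s)", 1008.0),
                        ("RTX 5090 (~1792 GB/s)", 1792.0)]:
    ceiling = max_tokens_per_sec(bandwidth, model_gb)
    print(f"{name}: ceiling ~{ceiling:.0f} tokens/sec")
```

Real throughput lands well below this ceiling because of compute and kernel overheads, but the ratio between the two cards (about 1.78x on bandwidth) tracks the measured gap between them closely.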
How Much VRAM Do You Actually Need?
Larger models produce better outputs but demand more memory. The good news is that memory requirements can be reduced significantly through quantization, a process that compresses model data without gutting its usefulness.
At 16-bit precision, the format most models ship in, a model needs roughly 2GB of VRAM per billion parameters. At 4-bit quantization, that drops to about 0.6 to 0.8GB per billion. A 13B model that would need 26GB of VRAM at 16-bit can be brought down to around 10GB. Same model, very different hardware requirement.
Here is how the tiers break down practically:
- Small models (1 to 3B parameters): 4 to 6GB VRAM. Good for prototyping and simple internal tools, limited beyond that.
- Mid-range models (7 to 14B parameters): 8 to 12GB VRAM. The sweet spot for most business use cases. Models like Llama 3 8B run at a genuinely interactive 40 or more tokens per second on a mid-range card.
- Larger models (13 to 30B parameters): 16 to 24GB VRAM. Noticeably stronger reasoning quality. Where most serious deployments land.
- 70B and above: 35 to 40GB VRAM at Q4 quantization. This is firmly multi-GPU territory. Even the most powerful consumer GPU available today, the RTX 5090 at 32GB, falls short of comfortably running a full 70B model.
One mistake worth flagging: assuming bigger always means better. A well-configured 13B model handles most practical business workloads comfortably. Forcing a 70B model onto hardware that cannot properly support it often produces worse results than simply running a smaller model cleanly.
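Reading the tiers in the other direction, from the card you have to the models it can hold, can be sketched the same way. The 20 percent headroom and the ~0.5GB-per-billion weight figure at 4-bit are assumptions, not hard limits:

```python
def largest_q4_model_billion(vram_gb: float, headroom: float = 0.8) -> float:
    """Largest parameter count (billions) that fits at 4-bit quantization,
    leaving 20% of VRAM (assumed) free for KV cache and overhead."""
    usable_gb = vram_gb * headroom
    return usable_gb / 0.5  # ~0.5 GB of weights per billion params at 4-bit

for vram in (8, 12, 16, 24, 32):
    print(f"{vram}GB VRAM -> ~{largest_q4_model_billion(vram):.0f}B params at Q4")
```

This lines up with the tiers above: a 24GB card lands in the high-30B range, and even 32GB falls short of a 70B model without aggressive compromises.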
GPU Recommendations for 2026
By 2026, the RTX 50-series Blackwell cards have become the current generation standard. The RTX 40-series remains capable and widely available as a solid entry point, but Blackwell’s improvements in memory bandwidth make it the more future-proof choice for teams making a decision today.
Entry level: RTX 40-series (8 to 24GB VRAM)
Still a strong option for teams starting out or working within tighter budgets. The RTX 4090 with 24GB VRAM remains one of the most popular choices for local LLM deployment, delivering 128 tokens per second on 8B models and a mature software ecosystem. For teams not ready to move to 50-series hardware, the 4090 is still a capable workhorse.
Mid-range: RTX 5080 and RTX 5070 Ti (16GB VRAM)
Both cards bring Blackwell’s architectural improvements and faster GDDR7 memory, which meaningfully improves real-world speed over their 40-series equivalents. The caveat worth knowing upfront: 16GB of VRAM will start to feel limiting as model sizes continue trending upward, so teams planning to run larger models should factor that into the decision now rather than later.
High end: RTX 5090 (32GB VRAM)
The ceiling for consumer hardware right now. 32GB of GDDR7 memory and 213 tokens per second on 8B models make it the strongest single-card option available. It handles models up to the 30B range comfortably, and can push into larger models with aggressive quantization, though 70B models are a genuine stretch even at this level. For teams serving multiple concurrent users or running larger models continuously, professional grade hardware is the more reliable path.
Professional grade: 40GB and above
This is where consumer cards stop being the right answer. Professional GPU configurations with 48GB or more of VRAM are built for sustained workloads, multi-user access, and the kind of day-long reliability that teams depending on local LLM infrastructure actually need. For 70B models and above, this is the only tier that handles them without compromise.
The Rest of the System Still Matters
VRAM and bandwidth get all the attention, but the hardware around the GPU creates its own problems when underspecified.
System RAM
When VRAM runs out and the model starts offloading to system memory, RAM requirements scale fast. A practical guide:
- 7B models: 16 to 32GB
- 13 to 30B models: 32 to 48GB
- 70B and above: 64GB minimum, ideally more
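Where those RAM numbers come from is worth making concrete: when a model is larger than VRAM, the remainder has to live in system memory. A simple split estimate, with the 2GB VRAM reserve being an assumption:

```python
def offload_split_gb(model_gb: float, vram_gb: float, reserve_gb: float = 2.0):
    """Split a model between VRAM and system RAM, reserving ~2GB of VRAM
    (assumed) for KV cache and overhead. Returns (on_gpu, on_cpu) in GB."""
    on_gpu = min(model_gb, max(vram_gb - reserve_gb, 0.0))
    return on_gpu, model_gb - on_gpu

# A 70B model at Q4 (~40GB) on a 24GB card:
gpu_gb, cpu_gb = offload_split_gb(model_gb=40.0, vram_gb=24.0)
print(f"{gpu_gb:.0f}GB in VRAM, {cpu_gb:.0f}GB spilling into system RAM")
```

That spilled portion is why 70B-class models demand 64GB or more of system RAM, and why every gigabyte of it runs at the slow speeds described earlier.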
Storage
Model files are large and load times matter more in daily use than most teams expect:
- A 70B model at Q4 quantization takes around 40GB on disk
- A 13B model at 16-bit precision takes 26GB
- A slow mechanical drive turns a two-second load into a two-minute wait
A fast NVMe SSD is non-negotiable for anyone running models regularly.
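The load-time difference is just file size divided by sustained read speed. The drive throughput figures below are typical ballpark values, not measurements of any specific product:

```python
def load_time_seconds(model_gb: float, disk_read_gb_s: float) -> float:
    """Time to stream a model file from disk at a sustained read rate."""
    return model_gb / disk_read_gb_s

# A 70B model at Q4, ~40GB on disk, across illustrative drive speeds:
for drive, rate in [("Mechanical HDD (~0.15 GB/s)", 0.15),
                    ("SATA SSD (~0.5 GB/s)", 0.5),
                    ("NVMe Gen4 SSD (~5 GB/s)", 5.0)]:
    print(f"{drive}: ~{load_time_seconds(40.0, rate):.0f} seconds")
```

For a team that swaps between models during the day, that multi-minute HDD figure is paid over and over.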
CPU
For most local deployments the CPU is a supporting player. Where it becomes relevant is when VRAM runs out and the model starts offloading to system memory. In those situations a processor with higher memory bandwidth reduces the performance hit. For full GPU inference, it matters far less than the GPU and RAM.
Setting Up a Local LLM Server for Team Access
Running a model on one machine for one person is straightforward. Setting up a local LLM server that an entire team queries is a different problem, and a few things that barely matter in single-user setups become genuinely important:
- Concurrent users multiply memory pressure fast. Single-user benchmarks do not reflect what happens when five people query the same model simultaneously.
- ECC memory on professional GPU cards reduces the risk of silent errors during long inference runs, which matters when the system runs continuously.
- Linux, specifically Ubuntu, is the most stable and well-supported environment for LLM server frameworks. Most tooling is built and tested for it first.
- Network throughput becomes a real variable once multiple clients are making requests across the local network at the same time.
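The first point, memory pressure from concurrent users, can be estimated: each active conversation carries its own KV cache, whose size depends on the model's geometry and context length. A sketch using Llama-3-8B-style dimensions (32 layers, 8 KV heads, head dimension 128; check the model card for the model you actually deploy):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size: 2 (key + value) x layers x KV heads x head dim
    x bytes per element, multiplied by the context length in tokens."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9

# One user at an 8K context with a 16-bit cache (assumed geometry):
per_user = kv_cache_gb(32, 8, 128, 8192)
print(f"~{per_user:.1f}GB per user; 5 concurrent users add "
      f"~{5 * per_user:.0f}GB on top of the model weights")
```

That extra memory sits alongside the weights in VRAM, which is why a card that benchmarks comfortably for one user can spill into system RAM with five.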
A quick reference for where VRAM limits land at the server level:
- 12GB GPU: 7B models and heavily quantized 13B variants
- 16GB GPU: 13 to 30B range comfortably
- 24GB GPU: practical entry point for larger models with quantization
- 40GB and above: where serious multi-user local LLM infrastructure begins
When Buying Hardware Outright Is Not the Right First Move
The hardware requirements for running capable LLMs locally are real, and they are moving quickly. A configuration that feels well-specified today may feel limiting within eighteen months as newer models become the standard. For teams still figuring out exactly what their workloads look like, committing to permanent infrastructure before validating those requirements is a risk that tends to be underestimated until it is too late to change course.
Renting an AI server is a practical middle path for exactly this situation. It gives teams access to the GPU resources they need for a specific phase of work, without locking capital into a configuration that has not yet been proven out. At Rank Computers, this is a pattern that comes up consistently among organisations building out their first local LLM environment. Running real workloads on properly specified hardware answers the infrastructure question far more reliably than any planning spreadsheet.
Before Any Hardware Gets Approved
The organisations that get local LLM deployment right are not the ones with the largest budgets. They are the ones that match the hardware to the actual workload before spending anything.
Know which models the team needs to run and what those models realistically demand. Know how many people will be querying the system at the same time, because that number shifts the hardware requirement more than most benchmarks suggest. And know whether this is still an experiment or whether teams are about to depend on it for real work, because those two situations call for very different infrastructure decisions.
Get those three questions answered honestly, and the hardware decision mostly makes itself.