Quantization
The one trick that lets a 'too big' model run on your machine: store each number with fewer bits. A 4-bit model is a quarter the size of the full one — and almost as smart.
A model is just billions of numbers (its parameters or weights). By default each is stored in 16 bits, which is two bytes. A 12-billion-parameter model is therefore 12 billion × 2 bytes, about 24 GB, which won't fit in memory on most machines.
Quantization rounds those numbers to fewer bits. Store each in 8 bits and the model halves to ~12 GB. Store each in 4 bits and it's ~6 GB — a quarter of the original — and it still answers nearly as well.
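To make "rounding to fewer bits" concrete, here is a minimal sketch of one simple scheme: symmetric round-to-nearest with a single scale factor. It is an illustration only. The function names and sample weights are made up, and real formats use more elaborate methods than this.

```python
def quantize(weights, bits):
    """Toy symmetric round-to-nearest quantization (illustrative, not a real file format)."""
    levels = 2 ** (bits - 1) - 1                         # 127 for 8-bit, 7 for 4-bit
    scale = (max(abs(w) for w in weights) or 1.0) / levels
    codes = [round(w / scale) for w in weights]          # small integers that fit in `bits` bits
    return codes, scale

def dequantize(codes, scale):
    """Turn the stored integers back into approximate weights."""
    return [c * scale for c in codes]

def max_error(weights, bits):
    """Largest gap between an original weight and its restored value."""
    codes, scale = quantize(weights, bits)
    return max(abs(w - r) for w, r in zip(weights, dequantize(codes, scale)))

weights = [0.12, -0.53, 0.98, -0.05, 0.44]               # made-up sample values
for bits in (8, 4):
    codes, _ = quantize(weights, bits)
    print(f"{bits}-bit codes: {codes}, max error: {max_error(weights, bits):.4f}")
```

Fewer bits means fewer available levels, so the restored values drift further from the originals. That drift is the quality cost being traded for size.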
What the labels mean
- FP16 / BF16 (full): the original, ~2 GB per billion params. Best quality, biggest footprint. Only worth it if you have the memory to spare.
- Q8 (8-bit): ~1 GB per billion. Quality loss is essentially invisible. A great default when you have the room.
- Q4 (4-bit): ~0.5 GB per billion. The workhorse of local AI — roughly a quarter of the size for a small, usually unnoticeable quality dip. This is what most people should run.
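The per-billion figures above can be turned into a small lookup. This sketch uses a hypothetical `weights_gb` helper and covers the weights only, not the context window.

```python
# Approximate bytes per parameter for each label, taken from the list above.
BYTES_PER_PARAM = {"FP16": 2.0, "BF16": 2.0, "Q8": 1.0, "Q4": 0.5}

def weights_gb(params_billions, label):
    """Approximate size of the weights alone, in GB."""
    return params_billions * BYTES_PER_PARAM[label]

for label in ("FP16", "Q8", "Q4"):
    print(f"12B at {label}: ~{weights_gb(12, label):g} GB")
```

For the 12-billion-parameter example this gives about 24, 12, and 6 GB.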
The names you'll see when downloading
In LM Studio and on Hugging Face you'll see tags like Q4_K_M, Q5_K_M, Q6_K, Q8_0. The number after Q is the nominal bit-width. _K marks a “k-quant”, a smarter scheme that spends bits where they matter most, and a trailing _S, _M, or _L picks its small, medium, or large variant. For most people, Q4_K_M is the sweet-spot default. Step up to Q5_K_M or Q6_K if you have memory headroom and want a touch more quality.
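If you want to read these tags in a script, a small parser is enough. The sketch below is hypothetical: `parse_tag` and its pattern handle only the tag shapes named here (Q4_K_M, Q6_K, Q8_0 and similar) and reject anything else.

```python
import re

# Matches tags such as Q4_K_M, Q5_K_S, Q6_K, Q8_0. Other naming schemes are not handled.
TAG = re.compile(r"Q(\d+)_(?:(K)(?:_([SML]))?|(\d))")

def parse_tag(tag):
    """Split a quantization tag into its nominal bit-width and k-quant details."""
    m = TAG.fullmatch(tag)
    if m is None:
        raise ValueError(f"unrecognized tag: {tag!r}")
    return {
        "bits": int(m.group(1)),          # the number after Q
        "k_quant": m.group(2) == "K",     # True for the _K family
        "variant": m.group(3),            # "S", "M", "L", or None
    }

for tag in ("Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"):
    print(tag, parse_tag(tag))
```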
The rule of thumb
To estimate the memory a model needs: params (in billions) × bytes-per-param gives the weights in GB, then add 15–40% for the context window. So a 7B model at Q4 is 7 × 0.5 = 3.5 GB of weights, ~4–5 GB in practice. A 14B at Q4 is ~7 GB of weights, ~8–10 GB in practice. That math is exactly what the Analyze page runs for you against your real memory.
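The rule of thumb can be written as a few lines of Python. This is a sketch of the arithmetic only, not the Analyze page's actual code, and the function name and the default overhead range are assumptions taken from the 15–40% figure above.

```python
def estimate_memory_gb(params_billions, bytes_per_param, overhead=(0.15, 0.40)):
    """Return (weights, low, high) in GB using the rule of thumb.

    `overhead` is the assumed extra fraction for the context window.
    """
    weights = params_billions * bytes_per_param
    low = weights * (1 + overhead[0])
    high = weights * (1 + overhead[1])
    return weights, low, high

for size in (7, 14):
    weights, low, high = estimate_memory_gb(size, 0.5)   # 0.5 bytes per param = Q4
    print(f"{size}B at Q4: {weights:.1f} GB of weights, {low:.1f} to {high:.1f} GB in practice")
```

Compare the upper figure against the memory you actually have free. If it does not fit, drop to a smaller model or a lower bit-width.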