LocalAI Navigator

Quantization

The one trick that lets a 'too big' model run on your machine: store each number with fewer bits. A 4-bit model is a quarter the size of the full one — and almost as smart.

A model is just billions of numbers (its parameters or weights). By default each is stored in 16 bits (two bytes). A 12-billion-parameter model is therefore 12 billion × 2 bytes, about 24 GB, which is more memory than most consumer machines have.

Quantization rounds those numbers to fewer bits. Store each in 8 bits and the model halves to ~12 GB. Store each in 4 bits and it's ~6 GB — a quarter of the original — and it still answers nearly as well.
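To make "rounds those numbers to fewer bits" concrete, here is a minimal sketch of one simple scheme: scale the weights so the largest one maps to the biggest integer the bit-width allows, round, and scale back when the model runs. The function names and the sample weights are made up for illustration, and real formats are more elaborate than this (they quantize small blocks of weights separately, for example).

```python
def quantize(weights, bits):
    """Round floats to signed integer codes that fit in `bits` bits.

    Illustrative symmetric scheme: one shared scale for the whole list.
    """
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Turn the integer codes back into approximate floats."""
    return [c * scale for c in codes]


def worst_error(weights, bits):
    """Largest gap between an original weight and its rounded version."""
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


# Hypothetical sample weights, not taken from any real model.
weights = [0.12, -0.53, 0.91, -0.07, 0.33]

for bits in (8, 4):
    codes, _ = quantize(weights, bits)
    print(f"{bits}-bit codes: {codes}  worst error: {worst_error(weights, bits):.4f}")
```

Fewer bits means fewer distinct values to round to, so the 4-bit version lands further from the originals than the 8-bit one. That rounding error is the quality cost you trade for the smaller size.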

Quantization — same model, less memory
Memory per 1 billion parameters:
  • FP16 (full): 2 GB
  • Q8 (8-bit): 1 GB
  • Q4 (4-bit): 0.5 GB
A 12B model: 24 GB at FP16 → 12 GB at Q8 → just 6 GB at Q4. Quality drops only slightly.

What the labels mean

  • FP16 / BF16 (full): the original, ~2 GB per billion params. Best quality, biggest footprint. Only worth it if you have the memory to spare.
  • Q8 (8-bit): ~1 GB per billion. Quality loss is essentially invisible. A great default when you have the room.
  • Q4 (4-bit): ~0.5 GB per billion. The workhorse of local AI — roughly a quarter of the size for a small, usually unnoticeable quality dip. This is what most people should run.

The names you'll see when downloading

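In LM Studio and on Hugging Face you'll see tags like Q4_K_M, Q5_K_M, Q6_K, Q8_0. The number is the nominal bit-width. The _K marks a “k-quant”, a smarter scheme that spends bits where they matter most, and a trailing _S, _M, or _L picks its small, medium, or large variant. Because k-quants keep some weights at higher precision, their files usually come out a little larger than the nominal bit-width suggests. For most people, Q4_K_M is the sweet-spot default. Step up to Q5_K_M or Q6_K if you have memory headroom and want a touch more quality.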
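If you want to read these tags programmatically, a small parser is enough for the four shapes listed above. This is a sketch: parse_quant_tag is a hypothetical helper, not part of LM Studio or Hugging Face, and the size figure it returns is only the nominal bits ÷ 8 approximation used on this page.

```python
import re

# Matches the tag shapes named on this page: Q4_K_M, Q5_K_M, Q6_K, Q8_0.
_TAG = re.compile(r"^Q(\d+)_(K|\d)(?:_([SML]))?$")


def parse_quant_tag(tag):
    """Split a quantization tag into its parts; return None if unrecognized."""
    match = _TAG.match(tag)
    if match is None:
        return None
    bits = int(match.group(1))
    return {
        "bits": bits,                         # nominal bit-width
        "k_quant": match.group(2) == "K",     # True for the _K family
        "variant": match.group(3),            # "S", "M", "L", or None
        "approx_gb_per_billion": bits / 8,    # nominal estimate only
    }


for tag in ("Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"):
    print(tag, parse_quant_tag(tag))
```

Tags with any other shape return None rather than a guess.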

The rule of thumb

To estimate the memory a model needs: params (in billions) × bytes-per-param gives the size of the weights in GB, then add 15–40% for the context window. So a 7B model at Q4 is 7 × 0.5 = 3.5 GB of weights, ~4–5 GB in practice. A 14B at Q4 is 14 × 0.5 = 7 GB of weights, ~8–10 GB in practice. That math is exactly what the Analyze page runs for you against your real memory.
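The rule of thumb is easy to write down as code. The sketch below is an illustration of the arithmetic only, not the Analyze page's actual implementation; estimate_memory_gb and the BYTES_PER_PARAM table are names chosen here, and the 15–40% overhead range is the one quoted above.

```python
# Bytes per parameter for the three formats described on this page.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def estimate_memory_gb(params_billion, fmt, overhead=(0.15, 0.40)):
    """Return (weights_gb, low_total_gb, high_total_gb) for a model.

    weights = params (in billions) x bytes per param;
    totals add the low and high ends of the context-window overhead.
    """
    weights = params_billion * BYTES_PER_PARAM[fmt]
    low, high = overhead
    return weights, weights * (1 + low), weights * (1 + high)


for params, fmt in [(7, "Q4"), (14, "Q4"), (12, "FP16")]:
    weights, low, high = estimate_memory_gb(params, fmt)
    print(f"{params}B at {fmt}: {weights:.1f} GB weights, "
          f"{low:.1f}-{high:.1f} GB with context")
```

Compare the high end of the range against your free memory if you plan to use long contexts, since the context window is what drives the overhead.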