Install it (any device)

Two apps do almost everything: LM Studio if you want a window to click, Ollama if you like the terminal. Pick one and you're running in ten minutes.

You don't need to compile anything or understand machine learning. Two tools cover every desktop platform:

LM Studio: a polished desktop app. Browse models, click download, chat. The friendliest way to start. lmstudio.ai
Ollama: a tiny command-line server. One command pulls a model, one command runs it. Ideal as a backend for other tools. ollama.com
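Because Ollama runs as a local server, other programs can talk to it over HTTP. Below is a minimal Python sketch of such a client. It assumes the server is running on its default address (http://localhost:11434) and that the model has already been pulled. It calls Ollama's /api/generate endpoint with streaming turned off and uses only the standard library. The helper names build_request and generate are hypothetical, not part of any Ollama library.

```python
import json
import urllib.request

# Default address of a local Ollama server (assumes the app or `ollama serve` is running).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the full reply text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With "stream": False, the server answers with one JSON object whose
    # "response" field holds the generated text.
    return body["response"]


# Usage (needs a running server and a pulled model):
# print(generate("qwen3.5:8b", "Translate 'good morning' into French."))
```

Keeping request construction separate from the network call makes the payload easy to inspect and test without a server running.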


Windows

LM Studio: download the .exe from lmstudio.ai, install, open the Discover tab, pick a model that fits your RAM (this tool tells you which), click download, then Chat.
Ollama: install the Windows build, then in PowerShell: ollama run llama4:3b. The first run downloads the model; after that it loads from the local copy and works offline.
A discrete NVIDIA GPU helps a lot: a model that fits entirely in the GPU's VRAM runs several times faster than one running from system RAM.

macOS (Apple Silicon)

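Apple Silicon is excellent for this: M-series unified memory is shared between the CPU and GPU, so the GPU can use most of a 16 GB Mac's memory for a model, while a 16 GB Windows laptop with a small GPU is limited to that GPU's few gigabytes of VRAM.
LM Studio: download the .dmg, drag it to Applications, done. It can run models through Apple's MLX framework, which is typically the fastest option on Apple Silicon.
Ollama: brew install ollama, start the server with brew services start ollama, then ollama run gemma4:4b.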

Ubuntu / Linux

Ollama (recommended): one line — curl -fsSL https://ollama.com/install.sh | sh — then ollama run qwen3.5:8b. It auto-detects an NVIDIA GPU and uses it.
LM Studio ships an AppImage if you want the GUI.
For NVIDIA GPUs, install the proprietary driver and CUDA before installing Ollama; Ollama then offloads model layers to VRAM automatically, and any layers that don't fit stay on the CPU.
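The idea behind partial offload can be illustrated with a rough estimate of how many layers fit on the card. This is a simplified, hypothetical heuristic, not Ollama's actual scheduler. It assumes the weights are split evenly across layers and that a fixed amount of VRAM (reserve_gb, an assumed 1 GB by default) is kept free for the context and runtime buffers.

```python
import math


def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Rough count of layers that fit in VRAM; the rest would run on the CPU.

    Hypothetical simplification: every layer is the same size, and
    reserve_gb of VRAM stays free for the context cache and buffers.
    """
    if n_layers <= 0 or model_gb <= 0:
        raise ValueError("model_gb and n_layers must be positive")
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Never report more layers than the model actually has.
    return min(n_layers, math.floor(usable_gb / per_layer_gb))
```

Under these assumptions, an 8 GB model with 32 layers on an 8 GB card gets 28 layers on the GPU and the remaining 4 on the CPU. Those CPU-side layers are what slow generation down.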

Phone (Android & iOS)

Yes, a phone can run a real model, just a small one. A modern flagship with 8–12 GB of RAM comfortably runs a 1–4B model at Q4 (4-bit quantization), which is plenty for offline chat, drafting, and translation.

iOS: apps like LLM Farm, Private LLM, or Enclave run 1–4B models fully on-device.
Android: PocketPal and Layla load GGUF models locally; MLC Chat runs models in its own compiled format. Ollama also runs inside Termux on higher-end devices.

Why phones cap at ~4B models

Whatever the device, the rule is the same: the model has to fit in memory. The next section explains how quantization shrinks a model so a much bigger one fits than you'd expect.
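The fit rule comes down to simple arithmetic. The weights alone take roughly parameters × bits per weight ÷ 8 bytes, so a 4B model at 4 bits is about 2 GB (GB here meaning 10^9 bytes). The sketch below treats Q4 as exactly 4 bits per weight, although real Q4 files are somewhat larger. It also adds a hypothetical 20% margin for the context cache and runtime.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone: parameters x bits / 8 bytes."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, free_ram_gb: float,
         overhead: float = 1.2) -> bool:
    """True if the weights plus an assumed overhead margin fit in free memory.

    overhead=1.2 is a hypothetical 20% allowance for the context cache
    and runtime buffers, not a measured figure.
    """
    return weights_gb(params_billion, bits_per_weight) * overhead <= free_ram_gb
```

Suppose a phone leaves 4 GB of RAM for the app, with the OS and other apps taking the rest (an illustrative figure). Then fits(4, 4, 4.0) is true, while fits(8, 4, 4.0) is false: an 8B model at Q4 needs about 4 GB for its weights alone.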