Optimize your setup
A model that fits but crawls is usually a memory problem, not a speed problem. The fix is almost always the same: keep the context window and the conversation tight.
Once a model is running, the single thing that makes or breaks the experience is memory pressure. When the model plus its context no longer fits in fast memory (RAM or VRAM), your machine starts swapping to disk and everything grinds. Avoiding that is most of what optimization means here.
The context window is RAM you're spending
The context window is how much text the model holds in mind at once. It isn't free: every token in context takes up memory in the “KV cache”, which grows in proportion to the window size. Doubling the window doubles that cache, which can add gigabytes on top of the model weights themselves.
So the model weights are a fixed cost; the context window is a variable cost that you control. Set it to what the task actually needs. A 128K window for a quick chat is pure waste: many runtimes reserve that memory up front, whether you use it or not.
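To put rough numbers on that variable cost, here is a minimal sketch of the usual back-of-the-envelope KV cache estimate: one key and one value vector per layer, per token. The model shape below (32 layers, 8 KV heads, head dimension 128, 16-bit cache) is an illustrative assumption for an 8B-class model, not something read from your setup, and real runtimes add some overhead on top.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for a full context window."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

# Assumed shape for illustration only; check your model's own metadata.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for window in (8_192, 32_768, 131_072):
    size = kv_cache_bytes(window, **shape) / GIB
    print(f"{window:>7} tokens -> {size:.1f} GiB")
# Under these assumptions: 1.0 GiB at 8K, 4.0 GiB at 32K, 16.0 GiB at 128K.
```

The cost is linear in the window size, so under these assumptions dropping from 128K to 8K frees about 15 GiB without touching the model itself.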
Keep sessions tight
This is the habit that matters most, and it's the opposite of how people use cloud chatbots:
- Don't dump everything into one endless thread. Every past message stays in context and keeps costing memory and speed. A long-running chat slowly chokes the model.
- Start a fresh session per task. New topic → new chat. You get full speed back and the model isn't distracted by unrelated history.
- Paste only the relevant excerpt, not the whole 80-page document, unless you genuinely need all of it in view at once.
- Match the context window to the job. 4K–8K for chat and coding help; reserve 32K+ only for long-document work.
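One way to apply the last point is to estimate how many tokens a task needs and pick the smallest window that covers it. The sketch below is only a heuristic: the four-characters-per-token ratio is a crude assumption for English text, and the window sizes and the `pick_context_window` helper are hypothetical names chosen for illustration.

```python
WINDOW_SIZES = (4_096, 8_192, 32_768, 131_072)  # illustrative options

def estimate_tokens(text):
    """Very rough token count, assuming about 4 characters per token."""
    return max(1, len(text) // 4)

def pick_context_window(prompt_text, reply_tokens=1_024, sizes=WINDOW_SIZES):
    """Return the smallest window that holds the prompt plus the reply."""
    needed = estimate_tokens(prompt_text) + reply_tokens
    for size in sizes:
        if needed <= size:
            return size
    return None  # nothing fits: paste a shorter excerpt instead

short_question = "How do I reverse a list in Python?"
long_excerpt = "x" * 60_000  # stands in for a long pasted excerpt

print(pick_context_window(short_question))
print(pick_context_window(long_excerpt))
```

A real tokenizer will give different counts, so treat the result as a starting point and leave some margin.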
Other levers, in order of impact
- Drop one quant level (Q5 → Q4) if you're tight. It frees memory immediately, usually for only a small quality loss.
- Offload to GPU/VRAM where you can. On a discrete GPU, push as many layers as fit into VRAM; it's far faster than system RAM.
- Close memory hogs — browsers with 50 tabs, other apps. Unified-memory Macs share that RAM with everything.
- Pick a smaller model that fits comfortably over a bigger one that barely fits. An 8B running fast beats a 14B that swaps to disk every reply.
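To compare the quant and model-size levers, a rough weight-size estimate is parameters times bits per weight. The bits-per-weight figures below (about 4.5 for a Q4-style quant, 5.5 for a Q5-style one) are approximations assumed for illustration; actual files vary by quantization scheme.

```python
GIB = 1024 ** 3

def weights_gib(params_billion, bits_per_weight):
    """Approximate size of the model weights in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB

# Assumed bits-per-weight values; real quantized files differ somewhat.
for label, params, bpw in [
    ("8B  at ~Q5", 8, 5.5),
    ("8B  at ~Q4", 8, 4.5),
    ("14B at ~Q4", 14, 4.5),
]:
    print(f"{label}: {weights_gib(params, bpw):.1f} GiB")
```

Under these assumptions, dropping one quant level on an 8B model saves a little under 1 GiB, while stepping down from 14B to 8B at the same quant saves roughly 3 GiB.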
Rule of thumb: leave 20–25% of your memory free for the OS and the context. The Analyze page already bakes that headroom into its recommendations.
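If you want to check the rule of thumb by hand, it can be sketched as a one-line budget. This is not how the Analyze page computes its recommendations; the `weights_budget_gib` helper and the example sizes are hypothetical.

```python
def weights_budget_gib(total_memory_gib, headroom=0.25):
    """Memory left for model weights after reserving headroom."""
    # headroom is the fraction kept free for the OS and the context.
    return total_memory_gib * (1 - headroom)

def fits_comfortably(model_gib, total_memory_gib, headroom=0.25):
    """True if the weights fit inside the budget with headroom to spare."""
    return model_gib <= weights_budget_gib(total_memory_gib, headroom)

# Example: a machine with 16 GiB of memory and 25% held back.
print(weights_budget_gib(16))      # 12.0
print(fits_comfortably(4.2, 16))   # True
print(fits_comfortably(13.0, 16))  # False
```

If a large context window alone would eat more than the headroom, shrink the window or the model rather than the margin.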