Current LLaMA.cpp Setup & Use
2026-05-18
Prerequisites
This assumes prior installation of LLaMA.cpp. That was a bit of a headache for me for some reason. LLaMA.cpp is an open-source program written C/C++ that runs open-weight large language models offline, in private, and for free. Learning about LLMs is a huge topic, prone to all kinds of wild conjecture and misunderstanding.
My System Limitations
The things that matter for this are:
- 11 GB VRAM
- 16 core CPU
- 32 GB RAM
To get any higher parameter models to run in this context, I have found that Mixture-of-Expert models provide me the best results.
Goals
I don’t care that much about speed. I want the highest quality reasoning/inference I can accomplish on my system, with enough KV cache (i.e., context window) to run small code repository analyses or long-form prompts. In my experiments, I have found that most the models can get to around 65 K tokens in a context before I start to notice attention and reasoning degradation. That’s just my impression, I don’t really know how this works, but even the big frontier models do it pretty consistently.
Open-Weight Models
The best source I can find is Hugging Face. Accounts are free. There are a huge selection of open weight models being released and the technology is developing very fast. I was very skeptical about how safe these would be. Use discretion… Models are released by companies like Google, OpenAI, Meta, or Alibaba. My understanding is that this is motivated by the desire of each to achieve ecosystem dominance and adoption. It’s not because these companies are friendly or wish me well.
My preference is the models quantized by bartowski (a very reputable developer in the field) or unsloth (a company that sells software for doing local training of models, with a good reputation).
Two new models were released last month, the Gemma 4 family and Qwen 3.6. I enjoy the Gemma 4 31 billion version but can’t run it locally. It is (for now) accessible on Google’s AI Studio or cheaply via API use. My understanding is that they are currently allowing 1000 or so requests per day for free. This is subject to change. Compare to using Google’s Gemini 3.1 Pro which can rack up a $10 charge in a non-CI/CD single session, and Gemma is a steal.
For local inference I use Gemma 4 26B A4B IT and Qwen 3.6 35B A3B. Qwen is my favorite this week. These are the best models I can currently run.
Splitting Between VRAM, CPU, RAM
Llama.cpp allows you to split the model “layers” between VRAM, CPU, and RAM. This radically reduces their speed, but not the quality of inference. You also can set the parameters for your session directly on invocation. There is a built in localhost interface when a model is running, or you can access via API.
Until I write myself a little dashboard app for launching and switching between models and settings, here are my common command strings:
Gemma 4 E4B IT
google_gemma-4-E4B-it-Q6_K_L.gguf
I can run this little version in VRAM. Limited reasoning, but fairly fast. I use:
llama-server -m "D:\models\google_gemma-4-E4B-it-Q6_K_L.gguf" -ngl 999 -c 65536 --temp 1.0 --top-p 0.95 --top-k 50
Qwen 3.5 9B (MTP)
Qwen3.5-9B-Q4_K_M.gguf
Also runs in VRAM, pretty solid.
For general use:
llama-server -m "D:\models\Qwen3.5-9B-Q4_K_M.gguf" -ngl 999 -c 65536 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 -ctk q8_0 -ctv q8_0 --repeat-penalty 1.0 --reasoning on
Thinking mode for precise coding tasks:
llama-server -m "D:\models\Qwen3.5-9B-Q4_K_M.gguf" -ngl 999 -c 65536 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 -ctk q8_0 -ctv q8_0 --repeat-penalty 1.0 --reasoning on --reasoning-format none
Instruct (or non-thinking) mode for general tasks:
llama-server -m "D:\models\Qwen3.5-9B-Q4_K_M.gguf" -ngl 999 -c 65536 --temp 0.7 --top-p 0.80 --top-k 20 --min-p 0.0 --presence-penalty 1.5 -ctk q8_0 -ctv q8_0 --repeat-penalty 1.0 --reasoning off
Instruct (or non-thinking) mode for reasoning tasks:
llama-server -m "D:\models\Qwen3.5-9B-Q4_K_M.gguf" -ngl 999 -c 65536 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 -ctk q8_0 -ctv q8_0 --repeat-penalty 1.0 --reasoning off
Qwen 2.5.1 Coder 7B Instruct
Qwen2.5.1-Coder-7B-Instruct-Q5_K_L.gguf
This is a programming specific model, trained for only that purpose. I have found the new 3.6 35B to be superior for all purposes, but I haven’t deleted it yet.
llama-server -m "D:\models\Qwen2.5.1-Coder-7B-Instruct-Q5_K_L.gguf" -ngl 999 -c 65536 --temp 0.2 --top-p 0.95 --top-k 50
Qwen 3.6 35B A3B
Qwen_Qwen3.6-35B-A3B-Q4_K_L.gguf
My current best quality. Split onto CPU/RAM I can only get about 10-15 tokens per second performance. KV quantization at q8_0.
For general use:
llama-server -m "D:\models\Qwen_Qwen3.6-35B-A3B-Q4_K_L.gguf" -ngl 16 -c 65536 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 -ctk q8_0 -ctv q8_0 --repeat-penalty 1.0 --reasoning on --reasoning-format none
For coding/deterministic use:
llama-server -m "D:\models\Qwen_Qwen3.6-35B-A3B-Q4_K_L.gguf" -ngl 16 -c 65536 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 -ctk q8_0 -ctv q8_0 --repeat-penalty 1.0 --reasoning on --reasoning-format none
Gemma 4 26B A4B IT
gemma-4-26B-A4B-it-UD-Q4_K_M.gguf
Perhaps slightly better than Qwen at “conversational” use? I still prefer Qwen.
Note: I need to play with the split on this one… I’m sure I can get 65 K context if I try…
llama-server -m "D:\models\gemma-4-26B-A4B-it-UD-Q4_K_M.gguf" -ngl 14 -c 16384 --temp 1.0 --top-p 0.95 --top-k 50 -ctk q8_0 -ctv q8_0
Others
I recently grabbed a translation model and a medical model for testing, and set up an app called Off Grid on my IPhone and am playing with micro-models (Gemma 3 1B and Qwen 3.5 2B) to see what they are capable of. Maybe for quick reminders or forgotten words or names to look up? They seem pretty useless so far.
Next…
I want a new MacPro. With the unified memory on those laptops… I could run really high parameter models, and probably get close enough to frontier model performance that I would never have to touch the muddy waters of the dystopian walled gardens of the tech companies while they try to figure out how to dominate and profit but could still stay current on LLM behavior and use while they infiltrate every aspect of developed societies everywhere… The LLMs are not the problem. It’s the humans.