Current LLaMA.cpp Setup & Use Copypasta, Large Language Models, Computers

Current LLaMA.cpp Setup & Use

2026-05-18

Quick notes regarding how I have been using Llama.cpp and my current settings for local LLMs.

Prerequisites

This assumes prior installation of LLaMA.cpp. That was a bit of a headache for me for some reason. LLaMA.cpp is an open-source program written C/C++ that runs open-weight large language models offline, in private, and for free. Learning about LLMs is a huge topic, prone to all kinds of wild conjecture and misunderstanding.

My System Limitations

The things that matter for this are:

To get any higher parameter models to run in this context, I have found that Mixture-of-Expert models provide me the best results.

Goals

I don’t care that much about speed. I want the highest quality reasoning/inference I can accomplish on my system, with enough KV cache (i.e., context window) to run small code repository analyses or long-form prompts. In my experiments, I have found that most the models can get to around 65 K tokens in a context before I start to notice attention and reasoning degradation. That’s just my impression, I don’t really know how this works, but even the big frontier models do it pretty consistently.

Open-Weight Models

The best source I can find is Hugging Face. Accounts are free. There are a huge selection of open weight models being released and the technology is developing very fast. I was very skeptical about how safe these would be. Use discretion… Models are released by companies like Google, OpenAI, Meta, or Alibaba. My understanding is that this is motivated by the desire of each to achieve ecosystem dominance and adoption. It’s not because these companies are friendly or wish me well.

My preference is the models quantized by bartowski (a very reputable developer in the field) or unsloth (a company that sells software for doing local training of models, with a good reputation).

Two new models were released last month, the Gemma 4 family and Qwen 3.6. I enjoy the Gemma 4 31 billion version but can’t run it locally. It is (for now) accessible on Google’s AI Studio or cheaply via API use. My understanding is that they are currently allowing 1000 or so requests per day for free. This is subject to change. Compare to using Google’s Gemini 3.1 Pro which can rack up a $10 charge in a non-CI/CD single session, and Gemma is a steal.

For local inference I use Gemma 4 26B A4B IT and Qwen 3.6 35B A3B. Qwen is my favorite this week. These are the best models I can currently run.

Splitting Between VRAM, CPU, RAM

Llama.cpp allows you to split the model “layers” between VRAM, CPU, and RAM. This radically reduces their speed, but not the quality of inference. You also can set the parameters for your session directly on invocation. There is a built in localhost interface when a model is running, or you can access via API.

Until I write myself a little dashboard app for launching and switching between models and settings, here are my common command strings:

Gemma 4 E4B IT

google_gemma-4-E4B-it-Q6_K_L.gguf

I can run this little version in VRAM. Limited reasoning, but fairly fast. I use:

llama-server -m "D:\models\google_gemma-4-E4B-it-Q6_K_L.gguf" -ngl 999 -c 65536 --temp 1.0 --top-p 0.95 --top-k 50

Qwen 3.5 9B (MTP)

Qwen3.5-9B-Q4_K_M.gguf

Also runs in VRAM, pretty solid.

For general use:

llama-server -m "D:\models\Qwen3.5-9B-Q4_K_M.gguf" -ngl 999 -c 65536 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 -ctk q8_0 -ctv q8_0 --repeat-penalty 1.0 --reasoning on

Thinking mode for precise coding tasks:

llama-server -m "D:\models\Qwen3.5-9B-Q4_K_M.gguf" -ngl 999 -c 65536 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 -ctk q8_0 -ctv q8_0 --repeat-penalty 1.0 --reasoning on --reasoning-format none

Instruct (or non-thinking) mode for general tasks:

llama-server -m "D:\models\Qwen3.5-9B-Q4_K_M.gguf" -ngl 999 -c 65536 --temp 0.7 --top-p 0.80 --top-k 20 --min-p 0.0 --presence-penalty 1.5 -ctk q8_0 -ctv q8_0 --repeat-penalty 1.0 --reasoning off

Instruct (or non-thinking) mode for reasoning tasks:

llama-server -m "D:\models\Qwen3.5-9B-Q4_K_M.gguf" -ngl 999 -c 65536 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 -ctk q8_0 -ctv q8_0 --repeat-penalty 1.0 --reasoning off

Qwen 2.5.1 Coder 7B Instruct

Qwen2.5.1-Coder-7B-Instruct-Q5_K_L.gguf

This is a programming specific model, trained for only that purpose. I have found the new 3.6 35B to be superior for all purposes, but I haven’t deleted it yet.

llama-server -m "D:\models\Qwen2.5.1-Coder-7B-Instruct-Q5_K_L.gguf" -ngl 999 -c 65536 --temp 0.2 --top-p 0.95 --top-k 50

Qwen 3.6 35B A3B

Qwen_Qwen3.6-35B-A3B-Q4_K_L.gguf

My current best quality. Split onto CPU/RAM I can only get about 10-15 tokens per second performance. KV quantization at q8_0.

For general use:

llama-server -m "D:\models\Qwen_Qwen3.6-35B-A3B-Q4_K_L.gguf" -ngl 16 -c 65536 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 1.5 -ctk q8_0 -ctv q8_0 --repeat-penalty 1.0 --reasoning on --reasoning-format none

For coding/deterministic use:

llama-server -m "D:\models\Qwen_Qwen3.6-35B-A3B-Q4_K_L.gguf" -ngl 16 -c 65536 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 -ctk q8_0 -ctv q8_0 --repeat-penalty 1.0 --reasoning on --reasoning-format none

Gemma 4 26B A4B IT

gemma-4-26B-A4B-it-UD-Q4_K_M.gguf

Perhaps slightly better than Qwen at “conversational” use? I still prefer Qwen.

Note: I need to play with the split on this one… I’m sure I can get 65 K context if I try…

llama-server -m "D:\models\gemma-4-26B-A4B-it-UD-Q4_K_M.gguf" -ngl 14 -c 16384 --temp 1.0 --top-p 0.95 --top-k 50 -ctk q8_0 -ctv q8_0

Others

I recently grabbed a translation model and a medical model for testing, and set up an app called Off Grid on my IPhone and am playing with micro-models (Gemma 3 1B and Qwen 3.5 2B) to see what they are capable of. Maybe for quick reminders or forgotten words or names to look up? They seem pretty useless so far.

Next…

I want a new MacPro. With the unified memory on those laptops… I could run really high parameter models, and probably get close enough to frontier model performance that I would never have to touch the muddy waters of the dystopian walled gardens of the tech companies while they try to figure out how to dominate and profit but could still stay current on LLM behavior and use while they infiltrate every aspect of developed societies everywhere… The LLMs are not the problem. It’s the humans.

Search Titles & Keywords