GLM-4.6: Run Locally Guide

A guide on how to run Z.ai's GLM-4.6 and GLM-4.6V-Flash models on your own local device!

GLM-4.6 and GLM-4.6V-Flash are the latest reasoning models from Z.ai, achieving SOTA performance on coding and agent benchmarks while offering improved conversational chat. GLM-4.6V-Flash, the smaller 9B model, was released in December 2025, and you can run it now too.

The full 355B parameter model requires 400GB of disk space, while the Unsloth Dynamic 2-bit GGUF (GLM-4.6-GGUF) reduces the size to 135GB (a roughly 66% reduction).

All uploads use Unsloth Dynamic 2.0 for SOTA 5-shot MMLU and Aider performance, meaning you can run & fine-tune quantized GLM LLMs with minimal accuracy loss.

Tutorials navigation:

  • Run GLM-4.6V-Flash
  • Run GLM-4.6

Unsloth Chat Template fixes

One of our significant fixes addresses an issue with prompting GGUFs where the second prompt wouldn't work. We fixed this issue, but it still persists in GGUFs without our fixes: when using any non-Unsloth GLM-4.6 GGUF, the first conversation works fine, but the second one breaks.

We've resolved this in our chat template, so when using our version, the second conversation and beyond (third, fourth, etc.) work without any errors. There are still some issues with tool-calling, which we haven't fully investigated yet due to bandwidth limitations. We've already informed the GLM team about these remaining issues.

⚙️ Usage Guide

The 2-bit dynamic quant UD-Q2_K_XL uses 135GB of disk space. It works well on a single 24GB GPU with 128GB of RAM using MoE offloading. The 1-bit UD-TQ1 GGUF also works natively in Ollama!

You must use --jinja for llama.cpp quants, as this enables our fixed chat template! You might get incorrect results if you do not use --jinja.

The 4-bit quants will fit on a single 40GB GPU (with MoE layers offloaded to RAM). Expect around 5 tokens/s with this setup if you also have around 165GB of RAM. For optimal performance (5+ tokens/s), we recommend at least 205GB of unified memory, or 205GB of combined RAM+VRAM. To learn how to increase generation speed and fit longer contexts, read here.

According to Z.ai, there are different settings for GLM-4.6V-Flash & GLM-4.6 inference:

| Setting | GLM-4.6V-Flash | GLM-4.6 |
| --- | --- | --- |
| temperature | 0.8 | 1.0 |
| top_p | 0.6 (recommended) | 0.95 (recommended for coding) |
| top_k | 2 (recommended) | 40 (recommended for coding) |
| Context length | 128K or less | 200K or less |
| repeat_penalty | 1.1 | — |
| max_generate_tokens | 16,384 | 16,384 |

  • Use --jinja for llama.cpp variants - we fixed some chat template issues as well!

Run GLM-4.6 Tutorials:

See our step-by-step guides for running GLM-4.6V-Flash and the large GLM-4.6 models.

GLM-4.6V-Flash

Currently, GLM-4.6V-Flash only works with text via llama.cpp; vision support will come later.

✨ Run in llama.cpp

1. Obtain the latest llama.cpp from GitHub. You can also use the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
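A minimal build sketch for Linux with CUDA (the package list and build targets are typical; adjust as needed):

```bash
# Build llama.cpp from source with CUDA support.
# Set -DGGML_CUDA=OFF instead if you only want CPU inference.
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp/
```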

2. If you want to use llama.cpp directly to load models, you can do the below (:Q8_K_XL is the quantization type). You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. Remember the model has a maximum context length of 128K.
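A sketch of such a command; the Hugging Face repo name and quant tag are assumptions, so substitute the actual GLM-4.6V-Flash GGUF upload you want:

```bash
# Save downloaded GGUFs to a specific folder (optional).
export LLAMA_CACHE="unsloth"
# Pull and run the Q8_K_XL quant straight from Hugging Face.
# Repo name is an assumption - check the Unsloth Hugging Face page for the exact name.
./llama.cpp/llama-cli \
    -hf unsloth/GLM-4.6V-Flash-GGUF:Q8_K_XL \
    --jinja \
    --ctx-size 16384 \
    --temp 0.8 \
    --top-p 0.6 \
    --top-k 2
```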

3. Download the model (after installing huggingface_hub and hf_transfer via pip) using the snippet below. You can choose UD-Q4_K_XL (dynamic 4-bit quant) or other quantized versions like Q8_K_XL.
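A minimal download sketch using snapshot_download; the repo id and file pattern are assumptions, so match them to the actual upload and quant you want:

```python
# pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # optional: faster downloads

from huggingface_hub import snapshot_download

# Repo id is an assumption - check the Unsloth Hugging Face page for the exact name.
snapshot_download(
    repo_id="unsloth/GLM-4.6V-Flash-GGUF",
    local_dir="unsloth/GLM-4.6V-Flash-GGUF",
    allow_patterns=["*UD-Q4_K_XL*"],  # e.g. dynamic 4-bit; use "*Q8_K_XL*" for 8-bit
)
```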

GLM-4.6

🦙 Run in Ollama

1. Install Ollama if you haven't already! To run more variants of the model, see here.
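For example, on Linux you can install it with the official install script:

```bash
# Install Ollama on Linux (see ollama.com for macOS and Windows installers).
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
```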

2. Run the model! Note you can call ollama serve in another terminal if it fails! We include all our fixes and suggested parameters (temperature etc.) in the params file of our Hugging Face upload!
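For example, the 1-bit quant can be pulled straight from Hugging Face; the exact tag is an assumption, so pick whichever quant you want from the GLM-4.6-GGUF repo:

```bash
# Run the 1-bit quant directly from our Hugging Face upload.
# The :TQ1_0 tag is an assumption - match it to the quant folder name in the repo.
ollama run hf.co/unsloth/GLM-4.6-GGUF:TQ1_0
```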

3. To run other quants, you first need to merge the GGUF split files into one, as in the code below. Then run the merged model locally.
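A sketch using the llama-gguf-split tool built earlier; the shard file names are illustrative:

```bash
# Merge split GGUF shards into a single file so Ollama can load it.
# Point --merge at the FIRST shard; file names below are illustrative.
./llama.cpp/llama-gguf-split --merge \
    GLM-4.6-GGUF/UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
    GLM-4.6-UD-Q2_K_XL-merged.gguf
```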

✨ Run in llama.cpp

1. Obtain the latest llama.cpp from GitHub here. You can follow the same build instructions shown in the GLM-4.6V-Flash section above. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

2. If you want to use llama.cpp directly to load models, you can do the below (:Q2_K_XL is the quantization type). You can also download via Hugging Face (point 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location. Remember the model has a maximum context length of 200K.
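A sketch, assuming the UD-Q2_K_XL quant from our GLM-4.6-GGUF upload (swap the tag for whichever quant you prefer):

```bash
# Save downloaded GGUFs to a specific folder (optional).
export LLAMA_CACHE="unsloth"
# Pull and run the dynamic 2-bit quant straight from Hugging Face.
./llama.cpp/llama-cli \
    -hf unsloth/GLM-4.6-GGUF:UD-Q2_K_XL \
    --jinja \
    --ctx-size 16384 \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 40 \
    -ot ".ffn_.*_exps.=CPU"
```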

3. Download the model (after installing huggingface_hub and hf_transfer via pip) using the snippet below. You can choose UD-Q2_K_XL (dynamic 2-bit quant) or other quantized versions like Q4_K_XL. We recommend our 2.7-bit dynamic quant UD-Q2_K_XL to balance size and accuracy.
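A minimal download sketch, mirroring the Flash example above; the file pattern should match the quant folder you want:

```python
# pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # optional: faster downloads

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/GLM-4.6-GGUF",
    local_dir="unsloth/GLM-4.6-GGUF",
    allow_patterns=["*UD-Q2_K_XL*"],  # recommended 2.7-bit dynamic quant
)
```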

4. You can edit --threads 32 for the number of CPU threads, --ctx-size 16384 for the context length, and --n-gpu-layers 2 for how many layers to offload to the GPU. Try adjusting these if your GPU runs out of memory, and remove --n-gpu-layers entirely for CPU-only inference.
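For example, a full invocation on the downloaded multi-part GGUF might look like the following; the path, thread count and offload settings are illustrative, so tune them for your hardware:

```bash
# Run a locally downloaded multi-part GGUF (point --model at the first shard).
./llama.cpp/llama-cli \
    --model unsloth/GLM-4.6-GGUF/UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
    --jinja \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 2 \
    -ot ".ffn_.*_exps.=CPU" \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 40
```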

✨ Deploy with llama-server and OpenAI's completion library

To use llama-server for deployment, use the following command:
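A sketch of such a command; the model path and port are illustrative:

```bash
# Serve GLM-4.6 over an OpenAI-compatible HTTP API.
./llama.cpp/llama-server \
    --model unsloth/GLM-4.6-GGUF/UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
    --jinja \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 2 \
    -ot ".ffn_.*_exps.=CPU" \
    --host 0.0.0.0 \
    --port 8001
```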

Then use OpenAI's Python library (after pip install openai):
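A minimal sketch, assuming llama-server is listening on port 8001 as above:

```python
# pip install openai
from openai import OpenAI

# llama-server does not check the API key by default; any string works.
client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="GLM-4.6",  # llama-server serves one model, but the field is still required
    messages=[{"role": "user", "content": "Write a Fibonacci function in Python."}],
    temperature=1.0,
    top_p=0.95,
)
print(response.choices[0].message.content)
```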

💽 Model uploads

All our uploads, including those that are not imatrix-based or dynamic, utilize our calibration dataset, which is specifically optimized for conversational, coding, and language tasks.

  • Full GLM-4.6 model uploads below:

We also uploaded IQ4_NL and Q4_1 quants, which run faster specifically on ARM and Apple devices respectively.

| MoE Bits | Type + Link | Disk Size | Details |
| --- | --- | --- | --- |
| 1.66bit | | 84GB | 1.92/1.56bit |
| 1.78bit | | 96GB | 2.06/1.56bit |
| 1.93bit | | 107GB | 2.5/2.06/1.56bit |
| 2.42bit | | 115GB | 2.5/2.06bit |
| 2.71bit | | 135GB | 3.5/2.5bit |
| 3.12bit | | 145GB | 3.5/2.06bit |
| 3.5bit | | 158GB | 4.5/3.5bit |
| 4.5bit | | 204GB | 5.5/4.5bit |
| 5.5bit | | 252GB | 6.5/5.5bit |

🏂 Improving generation speed

If you have more VRAM, you can try offloading more MoE layers, or offloading whole layers themselves.

Normally, -ot ".ffn_.*_exps.=CPU" offloads all MoE layers to the CPU! This effectively allows you to fit all non-MoE layers on one GPU, improving generation speed. You can customize the regular expression to keep more layers on the GPU if you have more capacity.

If you have a bit more GPU memory, try -ot ".ffn_(up|down)_exps.=CPU". This offloads the up and down projection MoE layers.

Try -ot ".ffn_(up)_exps.=CPU" if you have even more GPU memory. This offloads only up projection MoE layers.

You can also customize the regex, for example -ot "\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\.ffn_(gate|up|down)_exps.=CPU" means to offload gate, up and down MoE layers but only from the 6th layer onwards.

llama.cpp also offers a high-throughput mode via llama-parallel; read more about it here. You can also quantize the KV cache to 4 bits, for example, to reduce VRAM/RAM movement, which can also speed up generation.

📐How to fit long context (full 200K)

To fit longer context, you can use KV cache quantization to quantize the K and V caches to lower bits. This can also increase generation speed due to reduced RAM / VRAM data movement. The allowed options for K quantization (default is f16) include the below.

--cache-type-k f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

You should use the _1 variants (e.g. q4_1, q5_1) for somewhat increased accuracy, although they're slightly slower.

You can also quantize the V cache, but you will need to compile llama.cpp with Flash Attention support via -DGGML_CUDA_FA_ALL_QUANTS=ON and use --flash-attn to enable it. Then you can use it together with --cache-type-k:

--cache-type-v f32, f16, bf16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
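For example, a sketch that quantizes both caches to q4_1 (the model path is illustrative, and llama.cpp must have been compiled with -DGGML_CUDA_FA_ALL_QUANTS=ON as noted above):

```bash
# Quantize the K and V caches to 4-bit to fit the full 200K context in less memory.
./llama.cpp/llama-cli \
    --model unsloth/GLM-4.6-GGUF/UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
    --jinja \
    --ctx-size 200000 \
    --flash-attn \
    --cache-type-k q4_1 \
    --cache-type-v q4_1 \
    -ot ".ffn_.*_exps.=CPU"
```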
