llama-bench Command Generator

📖 Notes & Getting Started with llama-bench

What is llama-bench? A tool within llama.cpp that measures how fast a model runs on your hardware. It reports prompt processing speed in tokens per second (t/s), which is how fast the model reads input, and generation speed in t/s, which is how fast it writes output.

Key rules & interactions:

ctx-size must be ≥ n-prompt + n-gen or the benchmark is invalid.
batch-size should usually be ≥ ubatch-size.
-ngl 99 offloads up to 99 layers to the GPU, which covers every layer of most models. Use this unless you want some layers to stay on the CPU.
-fa 1 (Flash Attention) is almost always faster on modern GPUs.
No Conflicts for Tokens: Values set for Prompt Tokens (-p), Generate Tokens (-n), and Combo (-pg) do not override each other. Instead, each field adds a separate set of benchmarks to run.
Conflicts: If you use both --threads and --cpu-mask, the thread count sets the size of the threadpool while the CPU mask restricts which cores those threads may run on. A mismatch (for example, more threads than cores in the mask) can cause severe bottlenecks. Separately, when using --fit-target, the context size is automatically raised to n-prompt + n-gen + n-depth if the provided context is smaller.
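
The two size rules and the additive behaviour of -p, -n, and -pg can be sketched in a few lines of Python. This is only an illustration of the rules as stated above, not code from llama-bench or from this generator. The function names are hypothetical, and the pp/tg labels are used as an illustrative naming convention.

```python
def check_config(ctx_size, n_prompt, n_gen, batch_size, ubatch_size):
    """Return a list of warnings for the two size rules (empty if none apply)."""
    warnings = []
    if ctx_size < n_prompt + n_gen:
        warnings.append(
            f"ctx-size {ctx_size} < n-prompt + n-gen ({n_prompt + n_gen})"
        )
    if batch_size < ubatch_size:
        warnings.append(f"batch-size {batch_size} < ubatch-size {ubatch_size}")
    return warnings


def planned_tests(prompt_sizes, gen_sizes, combos):
    """List the benchmarks that -p, -n and -pg would each contribute.

    The three inputs are independent: none of them replaces another.
    """
    tests = [f"pp{p}" for p in prompt_sizes]          # prompt-only runs (-p)
    tests += [f"tg{n}" for n in gen_sizes]            # generation-only runs (-n)
    tests += [f"pp{p}+tg{n}" for p, n in combos]      # combined runs (-pg)
    return tests
```

For example, check_config(512, 512, 128, 256, 512) reports both problems, and planned_tests([512], [128], [(512, 128)]) yields three separate benchmarks rather than one.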

Formula Examples (Range Syntax):

4096-131072*2: Multiplies by 2 each step (4096, 8192, 16384, ..., 131072).
4-16+4: Adds 4 each step (4, 8, 12, 16).
128,512,1024: Specific comma-separated list of values.
0-100: Every number within a range (0, 1, 2, 3, ..., 98, 99, 100).
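
The four forms above can be expanded with a short parser. The following is a minimal Python sketch of the syntax as described here. It is an assumption about the behaviour, not the parser used by llama-bench or by this generator, and the function name expand_values is hypothetical.

```python
import re

# One item is N, A-B, A-B+S, or A-B*M (all non-negative integers).
_ITEM = re.compile(r"(\d+)(?:-(\d+)(?:([+*])(\d+))?)?")


def expand_values(spec):
    """Expand a comma-separated range spec into a flat list of integers."""
    values = []
    for item in spec.split(","):
        match = _ITEM.fullmatch(item.strip())
        if match is None:
            raise ValueError(f"unrecognized range item: {item!r}")
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        op = match.group(3) or "+"                     # default: add 1 each step
        step = int(match.group(4)) if match.group(4) else 1
        if op == "+" and step < 1:
            raise ValueError("additive step must be at least 1")
        if op == "*" and (step < 2 or first < 1):
            # Guards against loops that would never advance (e.g. 0-100*2).
            raise ValueError("multiplicative ranges need start >= 1 and factor >= 2")
        current = first
        while current <= last:
            values.append(current)
            current = current + step if op == "+" else current * step
    return values
```

With this sketch, expand_values("4-16+4") gives [4, 8, 12, 16] and expand_values("4096-131072*2") gives six values from 4096 to 131072.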

Extended Non-Native Functionality (Bash Loops):

This command generator can also build longer, more extensive benchmark runs. Sweeps that llama-bench supports natively stay inside the llama-bench command line; Bash loops are used only for scalar process parameters that llama-bench cannot parse as lists.
Auto-Generated Bash Loops: If you apply range syntax to a scalar variable (e.g., setting delay to 0-60+15), the UI detects this and switches the output from a single command to a Bash script.
The script stores each scalar variable's values in a Bash array (e.g., delay_list=('0' '15' '30' '45' '60')) and wraps the llama-bench invocation in nested for loops, one per array.
The multi-value scalar variables that trigger Bash loops are: reps, delay, numa, and prio.
JSON Restriction: The .json output format is not supported with Bash looping. The UI automatically converts it to .jsonl so that the output appended by each run remains parseable.
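
To make the loop wrapping concrete, here is a minimal Python sketch that produces a script of the shape described above. It is not the generator's actual code: the function name, the results.jsonl file name, and the mapping from the UI's scalar names to the flags -r, --delay, --numa, and --prio are assumptions made for illustration.

```python
import shlex

# Assumed mapping from the scalar names above to llama-bench flags.
SCALAR_FLAGS = {"reps": "-r", "delay": "--delay", "numa": "--numa", "prio": "--prio"}


def build_bash_script(base_cmd, scalars, out_file="results.jsonl"):
    """Wrap base_cmd (a list of arguments) in one nested for loop per scalar."""
    for name, values in scalars.items():
        if name not in SCALAR_FLAGS:
            raise ValueError(f"not a looped scalar: {name}")
        if not all(v.isalnum() for v in values):
            raise ValueError(f"unexpected value for {name}: {values}")

    lines = ["#!/usr/bin/env bash", "set -euo pipefail"]
    # One Bash array per scalar, e.g. delay_list=('0' '15' '30')
    for name, values in scalars.items():
        items = " ".join(f"'{v}'" for v in values)
        lines.append(f"{name}_list=({items})")

    indent = ""
    for name in scalars:
        lines.append(f'{indent}for {name} in "${{{name}_list[@]}}"; do')
        indent += "  "

    # The native llama-bench command, plus one flag per loop variable.
    cmd = " ".join(shlex.quote(arg) for arg in base_cmd)
    for name in scalars:
        cmd += f' {SCALAR_FLAGS[name]} "${name}"'
    # Each run appends to the same file, which is why JSONL is used.
    lines.append(f"{indent}{cmd} >> {shlex.quote(out_file)}")

    for _ in scalars:
        indent = indent[:-2]
        lines.append(f"{indent}done")
    return "\n".join(lines) + "\n"


print(build_bash_script(
    ["llama-bench", "-m", "model.gguf", "-p", "512", "-o", "jsonl"],
    {"delay": ["0", "15", "30"], "prio": ["0", "2"]},
))
```

Native list parameters such as -p stay in the base command untouched; only the scalar parameters become loop variables.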

Paste an existing command in Import Command to tweak it. Use Configuration Preview to see every run before executing.

Configuration Preview

Import Command

Models

Execution

Tokens

GPU / Offload

CPU / Threads

Memory / Cache

Fit / Autotune

Bash Output Script