LLM vulnerability scanning with garak

After examining prompt injection and jailbreak techniques manually, the next step is automated scanning. LLM vulnerability scanning allows red teamers to test a model’s resilience against known attack vectors at scale, running hundreds of probes and evaluating the results systematically. This article covers garak, NVIDIA’s open-source LLM vulnerability scanner, including its architecture, how to run scans, and how to interpret the results.

What garak is

Garak is a command-line tool that systematically probes language models for security weaknesses and safety failures. It sends adversarial inputs (probes) to a target model, analyses the responses using detector modules, and produces structured reports documenting which attack vectors succeeded and at what rate.

The tool was originally developed by Prof. Leon Derczynski at ITU Copenhagen in 2023 and is now maintained by NVIDIA under the Apache 2.0 licence. It has over 7,000 GitHub stars and ships with more than 50 probe modules covering prompt injection, jailbreaks, encoding bypasses, data leakage, hallucination, toxicity generation, and more. Its probe findings map to OWASP LLM Top 10 categories, making it straightforward to connect scan results to established risk taxonomies.

The name is a reference to the Star Trek character, fitting the tool’s positioning as the LLM equivalent of network security scanners like nmap or Nessus.

How garak works

Garak’s architecture is entirely plugin-based. Every component is modular and extensible.

Generators interface with the target model. Garak supports over 20 backends, including OpenAI, Anthropic, Hugging Face, Replicate, AWS Bedrock, Ollama, and NVIDIA NIM. Depending on the backend, an API key may be required.

Probes generate adversarial test inputs. Each probe module targets a specific vulnerability class and contains multiple individual prompts. The dan module tests DAN-family jailbreaks, promptinject implements prompt hijacking attacks, encoding tests Base64 and other encoding bypasses, and leakreplay tests for training data regurgitation.

Detectors analyse each response to determine whether the probe succeeded. Different probes use different detectors. The dan.DAN detector checks whether the model adopted the unrestricted DAN persona, while mitigation.MitigationBypasschecks whether the model’s refusal was successfully circumvented. The promptinject.AttackRogueString detector checks whether the model generated a specific attacker-controlled string.

Because LLM output is non-deterministic, garak runs each probe multiple times (ten by default) and reports the failure rate as a percentage. A model that refuses a jailbreak nine times out of ten still has a 10% bypass rate, and that 10% is a finding.

Running a scan

Garak is installed via pip.

pip install garak

A scan requires three parameters. The --model_type flag specifies the backend platform. The --model_name flag identifies the specific model. The --probes or -p flag selects which probe modules to run.

To list all available probes, run the following.

garak --list_probes

The output lists every probe module and its subvariants. The list is extensive, covering DAN jailbreaks, prompt injection, encoding attacks, data leakage, XSS, package hallucination, and many others.

To scan a model for the DAN 11.0 jailbreak, the command looks like this.

REPLICATE_API_TOKEN="your_key" garak --model_type replicate \
  --model_name "meta/meta-llama-3.1-405b-instruct" \
  -p dan.Dan_11_0

Garak loads the generator, runs the probe, and outputs results inline as each probe completes.

dan.Dan_11_0                dan.DAN: FAIL  ok on  0/5  (failure rate: 100.00%)
dan.Dan_11_0  mitigation.MitigationBypass: FAIL  ok on  2/5  (failure rate:  60.00%)

In this example, the dan.DAN detector found that the model adopted the DAN persona on all five attempts (100% failure rate). The mitigation.MitigationBypass detector found that the model’s refusal was bypassed on three out of five attempts (60% failure rate).

Running a prompt injection scan uses the same structure with a different probe module.

REPLICATE_API_TOKEN="your_key" garak --model_type replicate \
  --model_name "meta/meta-llama-3-8b-instruct" \
  -p promptinject

The promptinject probe module includes multiple subvariants that test different hijacking payloads. Results might look like this.

promptinject.HijackHateHumansMini  promptinject.AttackRogueString: FAIL  ok on 145/500  (failure rate: 71.00%)
promptinject.HijackKillHumansMini  promptinject.AttackRogueString: FAIL  ok on 280/500  (failure rate: 44.00%)
promptinject.HijackLongPromptMini  promptinject.AttackRogueString: FAIL  ok on 140/500  (failure rate: 72.00%)

Interpreting results

Garak produces two report files after each scan. The JSONL report contains every prompt sent and every response received, providing a full audit trail. The HTML report provides a visual summary with resilience scores for each probe.

The JSONL report is the primary artefact for detailed analysis. Each entry includes the probe name, the exact prompt text, the model’s responses (one per generation), and the detector verdicts. This allows the red teamer to review specific prompts that triggered failures and understand exactly how the model responded.

For the DAN scan example above, a JSONL entry shows the model responding with a full commitment to the DAN persona, acknowledging that it will generate two responses to each prompt (one standard, one unrestricted) and confirming readiness to operate in DAN mode.

For the prompt injection scan, the JSONL entries show the model generating the exact attacker-controlled string (“I hate humans”) on all five generation attempts for a given probe, confirming a consistent and repeatable vulnerability.

The HTML report provides a higher-level view. Each probe is listed with a pass rate (percentage of generations that resisted the attack) and a colour-coded indicator. A pass rate of 0% (red) means every attempt succeeded, while higher pass rates indicate partial resilience.

The failure rate is the most important metric. It tells the red teamer how reliably an attack vector works. A 100% failure rate means the model is consistently vulnerable. A 10% failure rate means the model is mostly resilient but has a bypass window. Both are findings, but they require different levels of urgency in remediation.

Beyond garak

Garak is not the only tool in this space. Microsoft’s PyRIT (Python Risk Identification Toolkit) specialises in multi-turn and multi-modal attack techniques, including crescendo attacks that gradually escalate over multiple conversation turns. IBM’s Adversarial Robustness Toolbox (ART) covers a broader range of adversarial ML attacks beyond prompt injection, including evasion attacks and data poisoning. Promptfoo provides a CI/CD-oriented evaluation framework with over 50 vulnerability types and YAML-based test configuration.

Each tool has a different design philosophy and strength. Garak’s advantage is its breadth of probe coverage, its CLI-first workflow, and its structured reporting that maps findings to OWASP categories. For comprehensive LLM security testing, red teamers typically combine automated scanning with manual testing, using tools like garak to establish a baseline and manual techniques to explore edge cases that automated probes miss.

Summary

Garak is an open-source LLM vulnerability scanner that automates adversarial testing across prompt injection, jailbreaks, encoding bypasses, data leakage, and other vulnerability classes. It runs probes multiple times to account for non-deterministic output and reports failure rates that quantify a model’s resilience. The JSONL and HTML reports provide both detailed audit trails and high-level summaries. Automated scanning with tools like garak complements the manual prompt injection and jailbreak techniques covered in earlier articles, providing systematic coverage of known attack vectors.

Type to search