Part of LLM Development

Claude Code Skills for Safety & Quality

Shipping an LLM feature without safety testing is like shipping a web app without input validation — it's only a matter of time. These skills cover evaluation frameworks, safety reviews, adversarial testing, and the quality assurance practices specific to AI systems. The goal isn't to eliminate all risk — it's to know where the edges are and have a plan for when outputs go sideways.

Published by ClaudeVaultLast updated 6 skills

Key takeaway

ClaudeVault's AI safety and quality skills give Claude Code a structured approach to LLM evaluation, prompt injection testing, guardrail design, and adversarial review — mapped to the OWASP LLM Top 10:2025. Persistent attackers still bypass published injection defenses more than 90% of the time, so the discipline here is layered defense, not a single guard that nobody tries to break.

At a glance

  • 6 skills covering eval writing, guardrail design, adversarial testing, prompt injection testing, AI code review, and safety review
  • Maps findings to the OWASP LLM Top 10:2025, including the new vector and embedding weakness category
  • Works with DeepEval, promptfoo, Langfuse, and LangSmith for evaluation harness generation
  • Covers 20+ single-turn and multi-turn red team attack patterns built into the prompt injection tester
  • Designed around layered defense — joint research from OpenAI, Anthropic, and DeepMind shows single-layer injection defenses fail more than 90% of the time against persistent attackers

When you reach for these skills

  • When an LLM feature is approaching launch and the team has no eval suite beyond manual spot-checks

  • When a prompt injection attack has been demonstrated against a production system and the team needs a defense-in-depth plan fast

  • When an AI code reviewer is producing inconsistent findings and the team cannot tell if the prompt, model, or harness is the problem

  • When a compliance audit asks for documented LLM safety testing and the honest answer is 'we click around'

How these skills work together

A full Claude Code LLM safety pass layers these skills from strategy down to the runtime guardrail, because evals without guardrails catch nothing at inference time and guardrails without evals cannot prove anything.

  1. 1

    Plan the safety strategy before writing tests

    Start with the AI testing strategist. Claude reviews the feature, maps it to the OWASP LLM Top 10:2025 categories that actually apply, and writes a prioritized test plan so the team is not testing every category with the same depth regardless of risk.

  2. 2

    Write the eval suite against real failure modes

    The eval writer generates a DeepEval or promptfoo suite with golden examples, adversarial cases, and LLM-as-judge metrics. Claude seeds the suite with production failures the team has already seen, not synthetic ones, because synthetic evals catch synthetic bugs.

  3. 3

    Run injection testing before the feature ships

    Use the prompt injection tester to run the feature against 20+ single-turn and multi-turn attack patterns — direct injection, indirect injection via retrieved content, jailbreaks, role-play escalation. Claude reports the specific prompts that bypassed the defenses, not just a pass-fail score.

  4. 4

    Design guardrails at the inference boundary

    The guardrail designer adds runtime enforcement — input classification, output filtering, PII redaction, and structured output schema checks — at the points where evals cannot reach. Claude picks between Guardrails AI and NeMo Guardrails based on the host language and deployment topology.

  5. 5

    Review the full safety posture end to end

    Finally, the AI safety reviewer reads the eval suite, the guardrail config, and the deployment topology together and writes a posture review the compliance team can file. Claude flags gaps, not just findings, so the team sees which OWASP categories still have no coverage.

Outcome

A prioritized safety strategy, a real-world eval suite, documented injection defenses, runtime guardrails, and a written posture review — not a single 'safety layer' that falls apart under the first serious attack.

Compare the skills

SkillBest forComplexityPrimary use case
AI Testing StrategistPre-launch LLM features without a test planAdvancedOWASP-mapped prioritized strategies
Eval WriterTeams building DeepEval or promptfoo suitesIntermediateGolden examples, adversarial cases, LLM-as-judge
Prompt Injection TesterFeatures with user-controlled inputAdvancedSingle-turn and multi-turn attack patterns
Guardrail DesignerProduction LLM systems needing runtime enforcementIntermediateInput classification, output filters, PII redaction
AI Code ReviewerTeams using Claude to review LLM integration codeIntermediateFeature-specific LLM code quality checks
AI Safety ReviewerCompliance-facing safety posture reviewsAdvancedEnd-to-end safety posture audits

Skills in this topic

AI Safety Reviewer

Performs structured safety risk assessments for LLM-powered systems across five dimensions — harmful output, bias, data leakage, misuse potential, and failure modes. Use when launching an AI feature, auditing an existing system, or preparing for compliance review. AI safety audit, risk assessment, responsible AI.

AI Testing Strategist

Designs test strategies for AI features — quality dimensions, golden test sets, regression boundaries, LLM-as-judge pipelines, and CI-compatible eval automation. Use when planning how to test non-deterministic AI output, setting up eval infrastructure, or defining quality gates. AI testing, eval strategy, quality assurance, regression testing.

Guardrail Designer

Designs layered safety boundaries for AI systems — input validation, content filtering, output guardrails, PII detection, and escalation paths. Use when adding safety layers to a chatbot, content generator, or any AI feature that interacts with users. Guardrails, content filter, AI safety, input validation.

AI Code Reviewer

Reviews AI/LLM integration code for safety vulnerabilities, cost blowouts, reliability gaps, and architectural anti-patterns that general code reviewers miss. Use when reviewing code that calls AI models, processes AI outputs, or manages AI infrastructure. AI code review, LLM integration, prompt injection, cost audit.

Prompt Injection Tester

Generates adversarial test suites for prompt injection vulnerabilities — direct injection, indirect injection via documents, multi-turn escalation, tool abuse, and system prompt extraction. Use when red-teaming an AI system, testing guardrails, or preparing for security review. Prompt injection, jailbreak, adversarial testing, red team.

Eval Writer

Creates evaluation suites for AI features — scoring rubrics, test cases with coverage strategy, pass/fail criteria, and LLM-as-judge configurations. Use when measuring AI output quality, building eval pipelines, or writing test cases for prompt changes. Evals, evaluation, scoring rubric, AI quality, benchmarking.

Frequently asked questions

What are the OWASP LLM Top 10 risks?

Prompt injection sits at number one in the 2025 edition, followed by sensitive information disclosure, supply chain vulnerabilities, data and model poisoning, improper output handling, and excessive agency. The 2025 update added vector and embedding weaknesses, system prompt leakage, and unbounded consumption as net-new categories.

Can I fully prevent prompt injection?

No. Joint research from OpenAI, Anthropic, and DeepMind showed persistent attackers bypass published injection defenses more than 90% of the time. The goal is layered defense — input classification, output filtering, structured output constraints, and runtime guardrails — so no single failure compromises the system.

DeepEval versus OpenAI Evals versus promptfoo — which should I pick?

DeepEval is pytest-like and batteries-included, which fits teams already testing with pytest. OpenAI Evals is primitive and flexible, which fits teams building custom metrics. Promptfoo is the strongest option for red-teaming with 20+ attack patterns built in. The eval writer skill picks based on the team's existing test stack.

Is LLM-as-judge reliable enough for production evals?

Useful for scale, but it needs human calibration on a golden set first and should be paired with deterministic metrics like string match, schema validation, or exact-answer scoring. LLM-as-judge on its own drifts with model upgrades and can agree with itself about wrong answers, which makes it dangerous as a single signal.

How do I red-team an LLM feature?

Use promptfoo or DeepTeam with the built-in attack library — direct prompt injection, indirect injection via retrieved content, role-play escalation, jailbreaks, and PII exfiltration patterns. Multi-turn matters: most real attacks play out across three to five turns, not a single clever prompt. The prompt injection tester skill automates the multi-turn patterns.

What is the difference between evals and guardrails?

Evals measure quality offline against a test set before deploy. Guardrails enforce rules at inference time against real user traffic. Evals catch regressions you expect to see; guardrails catch attacks you do not. Production systems need both — evals without guardrails catch nothing at runtime, and guardrails without evals cannot prove coverage.