Part of LLM Development

Claude Code Skills for RAG & Retrieval

RAG sounds simple — retrieve context, stuff it in the prompt, get better answers. In practice, it's a retrieval engineering problem. Embedding strategy, chunking, reranking, hybrid search, context window management — each decision affects answer quality in ways that are hard to debug after the fact. These skills help you build retrieval pipelines that actually surface the right information, not just the most similar vectors.

Published by ClaudeVault · 4 skills

Key takeaway

ClaudeVault's RAG and retrieval skills give Claude Code a structured approach to the decisions that actually move answer quality — embedding model selection, chunking at semantic boundaries, hybrid BM25-plus-vector search, rerankers, and conversation memory that does not silently lose context mid-session. The focus is retrieval engineering as a design discipline, not a vector database configuration exercise.

At a glance

  • 4 skills covering RAG architecture, embedding strategy, retrieval pipeline optimization, and conversation memory design
  • Works across Pinecone, Weaviate, Chroma, and Qdrant, plus hybrid search using BM25 with semantic reranking
  • Targets the 200-1000 token chunk range with 10-20% overlap — the band most production RAG systems settle into
  • Handles long-context trade-offs: RAG queries average roughly 1,250x cheaper than pure long-context LLM calls
  • Covers Voyage 4, Cohere embed-v4, and text-embedding-3-large, plus the benchmark differences that matter at scale
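
A minimal sketch of the fixed-size-with-overlap approach that chunk band implies, assuming tiktoken for tokenization; the 512-token size and 15% overlap are illustrative picks from the 200-1000 token / 10-20% range above, not values the skills prescribe, and semantic-boundary splitting is left aside.

```python
# Fixed-size chunking with overlap (sketch). Sizes are illustrative.
import tiktoken

def chunk_text(text: str, chunk_tokens: int = 512, overlap: float = 0.15) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = max(1, int(chunk_tokens * (1 - overlap)))  # stride leaves ~15% overlap between chunks
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        if not window:
            break
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```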

When you reach for these skills

  • When citation accuracy matters because answers ship to end users and hallucinated sources are a legal risk

  • When the RAG pipeline retrieves plausible-looking chunks that never actually answer the question

  • When a chatbot loses context midway through a conversation because memory is just the last ten messages

  • When embedding costs are growing faster than traffic and the team needs to test a cheaper model without tanking relevance

How these skills work together

A typical Claude Code RAG pass chains these four skills from the pipeline decision down to the runtime memory design, because fixing retrieval without fixing memory just moves the bug.

  1. Decide the pipeline shape before choosing tools

    Start with the RAG advisor. Claude reviews the task — Q&A, summarization, agent tool call — and decides whether you need vanilla retrieval, hybrid search with BM25, a reranker, or GraphRAG. Most teams skip this step and end up with a reranker bolted onto a broken chunking strategy.
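
A rough sketch of what the hybrid option looks like in practice: BM25 lexical scores fused with cosine similarity over embeddings via a weighted sum. The rank_bm25 dependency and the embed callable are assumptions for illustration, not the skill's implementation.

```python
# Hybrid BM25 + vector retrieval via weighted score fusion (sketch).
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(query: str, docs: list[str], doc_vecs: np.ndarray,
                  embed, alpha: float = 0.5, k: int = 10) -> list[tuple[str, float]]:
    bm25 = BM25Okapi([d.split() for d in docs])
    lexical = np.array(bm25.get_scores(query.split()))
    lexical = lexical / (lexical.max() or 1.0)            # scale lexical scores to [0, 1]

    q = np.asarray(embed(query))
    semantic = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))

    fused = alpha * semantic + (1 - alpha) * lexical      # alpha is the hybrid weight to tune
    top = np.argsort(fused)[::-1][:k]
    return [(docs[i], float(fused[i])) for i in top]
```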

  2. Pick the embedding model for the actual workload

    The embedding strategy advisor benchmarks candidate models — Voyage 4, Cohere embed-v4, text-embedding-3-large — on your own data, not the vendor's benchmark set. Claude generates the eval harness and reports NDCG@10 so the model choice has numbers behind it instead of marketing copy.
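
A minimal sketch of what such a harness can look like: NDCG@10 computed per query from graded relevance labels, averaged per candidate model. The label format and retriever callables here are assumptions, not the skill's actual output.

```python
# NDCG@10 comparison across candidate embedding models (sketch).
import math

def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, int], k: int = 10) -> float:
    dcg = sum(relevance.get(doc_id, 0) / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

def compare_models(queries: dict[str, dict[str, int]], retrievers: dict) -> dict[str, float]:
    # queries: {query_text: {doc_id: graded relevance}}
    # retrievers: {model_name: fn(query) -> ranked list of doc_ids}
    return {
        name: sum(ndcg_at_k(retrieve(q), labels) for q, labels in queries.items()) / len(queries)
        for name, retrieve in retrievers.items()
    }
```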

  3. Optimize the retrieval pipeline end to end

    Use the retrieval pipeline optimizer to tune chunk size, overlap, hybrid weight, and reranker depth. Claude walks the pipeline, measures recall at each stage, and surfaces the specific stage dropping relevant documents — because most RAG performance problems are one stage, not six.
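
One way to see that in code: compute recall against labeled relevant documents after each stage, so the stage that drops them is obvious. The stage names and signatures below are illustrative, not the optimizer's API.

```python
# Per-stage recall measurement (sketch): each stage narrows the candidate set.
def recall(retrieved_ids: set[str], relevant_ids: set[str]) -> float:
    return len(retrieved_ids & relevant_ids) / len(relevant_ids) if relevant_ids else 1.0

def recall_by_stage(query: str, relevant_ids: set[str], stages):
    # stages: ordered list of (name, fn); each fn after the first takes the previous candidates
    # candidates are assumed to be dicts with an "id" field
    candidates, report = None, []
    for name, stage in stages:
        candidates = stage(query) if candidates is None else stage(query, candidates)
        report.append((name, recall({c["id"] for c in candidates}, relevant_ids)))
    return report  # e.g. [("hybrid", 0.92), ("rerank", 0.61)] points straight at the reranker depth
```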

  4. Design conversation memory that does not silently expire

    Finally, the conversation memory designer handles the long-session case. Claude picks between buffer window, summary memory, and vector memory based on session length and reference frequency, then writes the expiry and re-summarization rules so memory stays bounded without quietly losing earlier context.
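
A minimal sketch of the bounded buffer-plus-summary shape, assuming you supply your own count_tokens and summarize calls; the turn count and token budget are placeholders, not the designer's defaults.

```python
# Hybrid conversation memory (sketch): verbatim recent turns + bounded summary.
from collections import deque

class HybridMemory:
    def __init__(self, count_tokens, summarize, recent_turns: int = 10, budget: int = 4000):
        self.recent = deque(maxlen=recent_turns)   # rolling buffer of verbatim turns
        self.summary = ""                          # compressed record of older turns
        self.count_tokens, self.summarize, self.budget = count_tokens, summarize, budget

    def add_turn(self, role: str, text: str) -> None:
        if len(self.recent) == self.recent.maxlen:
            self.summary = self.summarize(self.summary, self.recent[0])  # fold the evicted turn in
        self.recent.append(f"{role}: {text}")
        while self.count_tokens(self.summary) > self.budget:
            self.summary = self.summarize("", self.summary)              # re-summarize to stay bounded

    def context(self) -> str:
        return "\n".join(filter(None, [self.summary, *self.recent]))
```

Vector memory for very long sessions would sit alongside this, indexing evicted turns for semantic recall rather than summarizing them away.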

Outcome

An embedding model chosen on your own data, a retrieval pipeline measured and tuned stage by stage, and a memory design that holds up across 50-turn conversations.

Compare the skills

Skill | Best for | Complexity | Primary use case
RAG Advisor | New RAG systems and architecture reviews | Advanced | Pipeline shape selection and pattern fit
Embedding Strategy Advisor | Embedding cost or relevance regressions | Advanced | Model benchmarking on production data
Retrieval Pipeline Optimizer | RAG systems where recall is the bottleneck | Advanced | Chunk, hybrid weight, and reranker tuning
Conversation Memory Designer | Chatbots with long multi-turn sessions | Intermediate | Buffer, summary, and vector memory design

Skills in this topic

RAG Advisor

Designs retrieval-augmented generation pipelines — document ingestion, chunking, embedding, vector storage, retrieval, reranking, and generation prompt structure. Use when building a RAG system from scratch or evaluating whether RAG is the right approach. RAG architecture, vector search, semantic search, knowledge base.

Conversation Memory Designer

Designs conversation memory systems for chatbots and agents — sliding window, summarization, semantic recall, entity memory, and hybrid architectures with explicit token budgets and eviction policies. Use when building multi-turn or multi-session conversational AI that needs to remember context. Conversation memory, chat history, session persistence.

Embedding Strategy Advisor

Designs complete embedding pipelines — model selection, chunking strategy, vector index configuration, and query-time processing for search, RAG, and similarity matching. Use when choosing embedding models, configuring vector databases, or designing chunking for a new corpus. Embeddings, vector search, HNSW, chunking.

Retrieval Pipeline Optimizer

Diagnoses and fixes underperforming RAG retrieval through systematic failure analysis — recall failures, precision failures, ranking failures — with targeted optimizations and eval loops. Use when an existing RAG pipeline returns wrong, irrelevant, or poorly-ranked results. Retrieval quality, search optimization, reranking.

Frequently asked questions

Is RAG dead in 2026 now that context windows are huge?

No. RAG is roughly 1,250x cheaper per query than stuffing everything into a long context, and long-context models suffer from position bias that degrades accuracy when the answer sits in the middle of the prompt. The current best practice is hybrid: RAG picks the evidence set, long context reasons over it.

What is the ideal chunk size for RAG?

Most production systems settle between 200 and 1,000 tokens with 10-20% overlap on semantic boundaries. Smaller chunks improve precision but lose surrounding context; larger chunks keep context but dilute the signal. The retrieval pipeline optimizer runs the sweep against your own corpus instead of guessing from a blog post.

Do I need a reranker in my RAG pipeline?

Yes for any production system. Hybrid search followed by a cross-encoder reranker is the current best practice because the reranker evaluates query-document pairs with full attention, which pure vector search cannot. The retrieval pipeline optimizer walks the stages and measures the delta the reranker actually provides.
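
A sketch of the reranking stage, assuming the sentence-transformers CrossEncoder class is available; the model name is a common public example, not a recommendation from the skill.

```python
# Cross-encoder reranking over first-stage candidates (sketch).
from sentence_transformers import CrossEncoder

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[tuple[str, float]]:
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in candidates])  # scores each query-doc pair jointly
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_k]]
```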

Which embedding model should I pick in 2026?

It depends on your corpus. Voyage 4 currently leads the RTEB benchmark by roughly 14% over text-embedding-3-large, but your own data may not match the benchmark distribution. The embedding strategy advisor skill generates an eval harness that scores each candidate on your documents so the decision is defensible.

How do I handle chat memory for long conversations?

Use a hybrid design: a rolling buffer for the most recent turns, summary memory for everything older, and vector memory for sessions long enough that references cross the summary boundary. The conversation memory designer picks the split based on session length and reference density, not a framework default.

RAG versus long context — which is cheaper?

RAG is roughly 1,250x cheaper per query on average according to Meilisearch benchmarks, because long context pays for the entire prompt on every call while RAG only pays for the retrieved slice. The real decision is quality, not cost — long context wins on reasoning over retrieved evidence, RAG wins on evidence selection.
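
The shape of that ratio is just prompt-token arithmetic; a back-of-envelope version with illustrative numbers (the corpus and slice sizes below are assumptions, not the benchmark's inputs):

```python
# Back-of-envelope prompt-token ratio between long context and RAG (illustrative numbers).
full_context_tokens = 500_000    # whole knowledge base stuffed into every prompt
retrieved_slice_tokens = 400     # what a RAG query actually sends
print(full_context_tokens / retrieved_slice_tokens)  # 1250.0 -- the same order as the cited figure
```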