RAG 101: Retrieval-Augmented Generation Made Simple
Welcome to this edition!
In every issue, we break down one concept shaping the future of technology, explained in plain English, with examples you can try on your own.
This time, we're diving into Retrieval-Augmented Generation (RAG): the approach that makes GenAI systems not only smarter, but also more trustworthy.
If LLMs are the brain, RAG is the memory. It retrieves the right information from your data (docs, FAQs, PDFs) and then lets the model generate an answer that is both relevant and grounded in facts.
From customer support bots that actually know your company policies to research assistants that cite sources, RAG is powering the next wave of useful AI.
Let's explore what it is, why it matters, and how you can get started.
Why RAG Matters
You ask a chatbot, "What's our latest refund policy?"
It answers correctly, quoting yesterday's update from your company wiki.
This isn't luck. Large Language Models (LLMs) are great at language, but they don't know your private docs or yesterday's edits. Retrieval-Augmented Generation (RAG) fills that gap: it fetches the right facts first, then generates the answer using those facts. The result is current, grounded, and explainable responses with links to sources.
RAG is the missing bridge between your data and your model. It reduces hallucinations, keeps answers fresh, and avoids expensive fine-tuning. If you want useful GenAI in the real world (support bots, knowledge search, coding help), RAG is the fastest, safest path.
What is RAG?
RAG (Retrieval-Augmented Generation) is a technique that combines search + generation.
First, it retrieves relevant information from your own data (docs, PDFs, FAQs, knowledge bases).
Then, it augments the user's question with that information and sends it to a Large Language Model (LLM).
Finally, the model generates an answer that is grounded in your data, not just in its pretraining.
In short: RAG makes AI smarter, more accurate, and up-to-date by giving it the right facts at the right time.
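Here is a minimal sketch of that loop in Python; retrieve_chunks and call_llm are hypothetical placeholders for your retriever and LLM client (the full working example later in this issue fills them in).

# A minimal sketch of the RAG loop (retrieve_chunks and call_llm are hypothetical helpers).
def answer_with_rag(question: str) -> str:
    chunks = retrieve_chunks(question, k=5)  # 1) retrieve relevant context
    context = "\n\n".join(chunks)            # 2) augment the question with it
    prompt = (
        "Use only the context below to answer. Cite sources. "
        "If the answer is missing, say 'I don't know.'\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)                  # 3) generate a grounded answer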
Core Concepts
1) RAG (Retrieval-Augmented Generation)
A two-step process: retrieve relevant context from your data, then ask the LLM to answer using that context.
Example: fetch 3 wiki pages about "refunds", then generate a concise policy answer with citations.
2) Embedding
Turn text (or images/audio) into a vector: a list of numbers that captures meaning.
Example: "return period" and "refund window" embed to nearby vectors.
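You can see this yourself with the sentence-transformers library (a quick sketch; any embedding model behaves similarly):

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vecs = model.encode(["return period", "refund window", "express shipping"],
                    normalize_embeddings=True)

# Related phrases land close together; unrelated ones score lower.
print(util.cos_sim(vecs[0], vecs[1]))  # high similarity
print(util.cos_sim(vecs[0], vecs[2]))  # noticeably lower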
3) Vector
The numeric list (e.g., length 384/768/1024) representing content or a query.
Example: [0.12, -0.48, ...] stands for a paragraph about exchange policy.
4) Retriever
The component that finds the most relevant chunks for a query (often via a vector database).
Example: top-k nearest neighbors for "refund timeline".
5) Reranker
A lightweight model or rule that reorders retrieved chunks for better relevance.
Example: from 20 candidates, keep the 5 that mention "window: 30 days".
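One common option (a sketch, not the only way) is a cross-encoder reranker, which reads each (query, passage) pair together and scores relevance:

# pip install sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "refund timeline"
candidates = [
    "Refunds are allowed within a 30-day window after delivery.",
    "Express shipping takes 2-3 business days.",
]
# Score each (query, passage) pair, then keep only the top candidates for the prompt.
scores = reranker.predict([(query, c) for c in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)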
6) Chunking
Split long docs into smaller pieces to keep meaning tight and retrieval precise.
Example: 300–500 token chunks with 10–20% overlap.
7) Prompt Template
The instruction scaffold that tells the LLM how to answer using the retrieved context.
Example: "Use only the context below. Cite sources. If missing, say 'I don't know.'"
8) Context Window
How much text the LLM can read at once. Larger windows fit more chunks but still need focus.
Example: pick the top 5–10 chunks, not 200.
9) Grounding & Citations
Attaching sources to the answer so readers can verify.
Example: "According to /policy/refunds.md (updated 2025-08-18)..."
10) Hybrid Search
Combine keyword (BM25) and vector search for best coverage.
Example: filter "policy" by keyword, rank by semantic similarity to "refund".
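A rough sketch of hybrid scoring, using the rank-bm25 package for the keyword side; the 0.4/0.6 blend weights are arbitrary and should be tuned on your own data:

# pip install rank-bm25 sentence-transformers numpy
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = ["Refund policy: 30-day window after delivery.",
        "Exchange policy for unopened items.",
        "Shipping times and carriers."]

# Keyword side (BM25 over whitespace tokens) and semantic side (embeddings).
bm25 = BM25Okapi([d.lower().split() for d in docs])
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)

query = "refund policy"
kw_scores = np.array(bm25.get_scores(query.lower().split()))
sem_scores = util.cos_sim(model.encode(query, normalize_embeddings=True), doc_vecs).numpy()[0]

# Blend the two signals; normalize BM25 so the scales are comparable.
hybrid = 0.4 * (kw_scores / (kw_scores.max() + 1e-9)) + 0.6 * sem_scores
print(docs[int(hybrid.argmax())])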
How RAG Works: Step by Step
If you want to build real skills in Generative AI, not just theory, my bootcamp is the perfect starting point.
Step 1: Collect Data
You start with whatever knowledge base you want the AI to use. This could be:
Policy documents (PDFs, Word files)
FAQs or support tickets
Website/blog content
Code READMEs or API docs
Think of this as building a library. Without books, a librarian (the LLM) has nothing useful to answer with.
Step 2: Chunk
LLMs can't process very long documents at once. So we split documents into smaller chunks, usually 300–500 tokens (2–3 paragraphs).
Small chunks capture focused meaning (e.g., one policy rule, one section of code).
Overlap (20–30 tokens) prevents cutting sentences in half or losing context.
This is like dividing a long textbook into digestible flashcards. Each flashcard carries one clear idea.
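A simple chunker could look like this (a sketch that approximates tokens with whitespace-separated words; a real pipeline would use your embedding model's tokenizer):

# Split text into overlapping word-window chunks.
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 30):
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap  # consecutive chunks share `overlap` words
    return chunks

# Usage (assuming a local refunds.md file as an example):
chunks = chunk_text(open("refunds.md").read())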
Step 3: Embed
Each chunk is passed into an embedding model. This converts text into a vector: a list of numbers that represent meaning.
Example: "refund window" → [0.12, -0.44, 0.91, ...]
Similar phrases ("return period") get vectors close by.
This is how the system "understands" meaning, not just keywords.
Step 4: Store
The vectors and their associated metadata (like title, filename, date updated, section) are stored in a vector database.
Metadata lets you filter results (e.g., only policies updated after 2024).
The DB builds indexes for fast search.
Imagine a card catalog in a library, but instead of just author/title, it also understands "topics by meaning."
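A local sketch with FAISS, which stores only vectors, so metadata lives in a parallel Python list (a managed vector DB stores both together). The random vectors below are stand-ins for real embeddings:

# pip install faiss-cpu numpy
import faiss, numpy as np

metadata = [
    {"path": "refunds.md",   "updated": "2025-08-18", "section": "Refunds"},
    {"path": "exchanges.md", "updated": "2025-06-02", "section": "Exchanges"},
]
vectors = np.random.rand(len(metadata), 384).astype("float32")  # stand-in embeddings
faiss.normalize_L2(vectors)  # normalize so inner product == cosine similarity

index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)
# Row i of the index corresponds to metadata[i], so a search hit maps back to its source.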
Step 5: Query
A user types a question: "Can I get a refund after 35 days?"
This text is treated just like your documents: it will also be turned into a vector.
So now the question and the documents live in the same meaning space.
Step 6: Embed Query
The question is embedded using the same model you used for documents.
Same dimensions (e.g., 768 numbers).
Same meaning-space.
This ensures queries and chunks can be compared apples-to-apples.
Step 7: Retrieve
The query vector is compared to stored vectors using k-NN (k-nearest neighbors) search.
Finds the top k most similar chunks.
Uses distance metrics like cosine similarity.
Metadata filters can narrow results (e.g., only "refund" docs).
This is like asking the librarian: "Which 5 flashcards are closest in meaning to my question?"
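Continuing the storage sketch from Step 4, retrieval is a k-NN search plus an optional metadata filter applied in Python (plain FAISS has no built-in filtering; managed vector DBs do this natively):

# Search the FAISS index, then filter hits by metadata.
# Assumes query_vec is a normalized float32 vector from the same embedding model.
def retrieve_filtered(query_vec, k=5, section=None):
    scores, ids = index.search(query_vec.reshape(1, -1), k)
    hits = []
    for score, i in zip(scores[0], ids[0]):
        if i == -1:
            continue  # returned when the index holds fewer than k vectors
        if section and metadata[i]["section"] != section:
            continue  # drop chunks that fail the metadata filter
        hits.append((metadata[i]["path"], float(score)))
    return hits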
Step 8: Generate
The retrieved chunks are placed inside a prompt template and sent to an LLM.
The model is instructed: "Answer using only this context. If missing, say 'I don't know.'"
The LLM blends natural language skill with factual context.
The LLM is no longer guessing; it's grounded in your data.
Step 9: Return
The system outputs:
A natural-language answer.
Citations (which chunks/docs were used).
Logs for debugging (which chunks were retrieved, scores, etc.).
This builds trust because the answer isn't just text; it's traceable to real sources.
Mini Case Study: Refunds Guide Bot
Goal
Build a chatbot that answers refund questions for customers, with clear citations to the company's policy documents.
1. What Data We Use
Documents: refunds.md, exchanges.md, plus ~20 FAQ pages.
Metadata saved: document name (path), last updated date, product line.
2. How It Works
Prepare the text: Clean Markdown, split into smaller pieces (about 300–350 words). Add a little overlap so meaning isn't lost.
Turn into numbers: Each chunk is converted into a vector (embedding) using a small, fast model.
Save in a database: Store vectors + metadata in a vector database. Use an HNSW index to make searching fast.
Search smartly: When a user asks a question, first check keywords like ârefundâ or âexchangeâ (BM25). Then combine that with vector search to capture meaning.
Keep it fresh: If two results are equally relevant, prefer the one updated most recently (see the sketch after this list).
Ask with guardrails: Use a prompt template that says:
"Answer only from the context below. Cite document names. If missing, say 'I don't know.'"
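One way to implement that freshness preference is to blend the similarity score with a small recency bonus (a sketch; the 0.9/0.1 weights and 180-day half-life are arbitrary and worth tuning on your own queries):

from datetime import date

# Freshness bonus that decays with document age: 1.0 for today, halves every 180 days.
def freshness_boost(updated_iso: str, half_life_days: int = 180) -> float:
    age_days = (date.today() - date.fromisoformat(updated_iso)).days
    return 0.5 ** (age_days / half_life_days)

# hits = [(path, similarity, updated_iso), ...]
def rerank_by_recency(hits):
    return sorted(hits, key=lambda h: 0.9 * h[1] + 0.1 * freshness_boost(h[2]), reverse=True)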
3. Example Interaction
User question:
"Can I get a refund after 35 days?"
System steps:
Embeds the question into a vector.
Finds similar chunks in refunds.md and exchanges.md.
Builds a prompt with those chunks.
Sends to the LLM to generate a clear answer.
Bot's answer:
"Our policy allows refunds within 30 days of delivery (see /policy/refunds.md, updated 2025-08-18). After 30 days, exchanges are possible for unopened items; refunds are not available."
4. Why This Works
Always up-to-date: No retraining needed; just re-embed if policies change.
Trustworthy: Every answer includes a citation.
Efficient: You only embed once; retrieval is fast.
Sample Working Example (Python, end-to-end)
This shows the full loop: text → embeddings → index → retrieve → compose prompt.
(You can plug the final prompt into any LLM: OpenAI, Ollama, or a local model.)
# pip install sentence-transformers faiss-cpu numpy
from sentence_transformers import SentenceTransformer
import numpy as np, faiss, textwrap

# 1) Tiny corpus (pretend these are chunks)
docs = [
    ("refunds.md", "Refunds are allowed within 30 days of delivery with receipt."),
    ("exchanges.md", "Exchanges are allowed within 45 days if the product is unopened."),
    ("shipping.md", "Standard shipping takes 5-7 days. Express shipping takes 2-3 days."),
    ("giftcards.md", "Gift cards are non-refundable but can be reissued if lost with proof.")
]

# 2) Embed with a small, fast model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = [d[1] for d in docs]
emb = model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)

# 3) Build a cosine-sim index (inner product on normalized vectors)
dim = emb.shape[1]
index = faiss.IndexFlatIP(dim)
index.add(emb)

# 4) Simple retriever
def retrieve(query, k=3):
    qv = model.encode([query], convert_to_numpy=True, normalize_embeddings=True)
    scores, idx = index.search(qv, k)
    return [(docs[i][0], docs[i][1], float(scores[0][j])) for j, i in enumerate(idx[0])]

# 5) Compose a grounded prompt
def compose_prompt(query, passages):
    ctx = "\n\n".join([f"[{p[0]}] {p[1]}" for p in passages])
    template = f"""You are a helpful support assistant.
Use only the context to answer. If the answer is not in the context, say "I don't know."

Question: {query}

Context:
{ctx}

Answer with a short explanation and cite sources by filename in brackets.
"""
    return template

# 6) Try it
query = "Can I get a refund after 35 days?"
hits = retrieve(query, k=3)
prompt = compose_prompt(query, hits)

print("Top retrievals:")
for path, txt, sc in hits:
    print(f"- {path} (score={sc:.3f}) → {txt}")

print("\nPrompt sent to LLM:\n")
print(textwrap.indent(prompt, "  "))
What to do next
Paste the printed prompt into your LLM of choice (see the sketch below for sending it programmatically).
Observe that it answers "No, refunds are within 30 days" and cites refunds.md.
Change the question to "What if the box is unopened after 40 days?" and watch it cite exchanges.md.
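For example, with the OpenAI Python SDK (a sketch that assumes an API key in the OPENAI_API_KEY environment variable; swap in whichever model or provider you actually use):

# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model works here
    messages=[{"role": "user", "content": prompt}],  # `prompt` from the script above
)
print(response.choices[0].message.content)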
Getting Started Roadmap
Pick data: 30–100 pages you understand (Markdown/PDF/HTML).
Chunk & embed: sentence-transformers/all-MiniLM-L6-v2 is a great start.
Index: FAISS locally (Flat first); try HNSW when >50k chunks.
Query → retrieve: top-k (5–10), cosine similarity.
Prompt: "Use only the context; cite sources; if missing, say 'I don't know.'"
Evaluate: prepare 20–50 test questions; check accuracy and citations (see the sketch after this list).
Hybrid: add BM25 + recency filter; rerank to polish.
Scale: move to Qdrant/Weaviate/Milvus; add update pipelines and caching.
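Here is a minimal evaluation sketch that reuses the retrieve function from the working example above; test_set is a hypothetical hand-written list, and hit@3 just checks whether the expected document shows up in the top 3:

# Check that each test question retrieves the document you expect.
test_set = [
    {"question": "Can I get a refund after 35 days?", "expected_doc": "refunds.md"},
    {"question": "How long does express shipping take?", "expected_doc": "shipping.md"},
]

hits_at_k = 0
for case in test_set:
    retrieved_paths = [path for path, _text, _score in retrieve(case["question"], k=3)]
    hits_at_k += case["expected_doc"] in retrieved_paths

print(f"hit@3: {hits_at_k / len(test_set):.0%}")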
Use Cases You Can Build This Month
Internal Answer Bot: plug in company policies, handbooks, and runbook docs; answer with citations.
Developer Copilot (Docs-Only): feed API docs/READMEs; retrieve relevant sections; generate example snippets.
Customer Support Drafts: retrieve policy language and generate polite, consistent replies for agents to edit.
Compliance Search: retrieve the relevant clauses and explain them with source links.
Reader Q&A
Do I need a GPU?
No. CPU is fine for MiniLM embeddings and FAISS. A GPU helps for huge corpora or heavy rerankers.
Should I fine-tune or use RAG?
Start with RAG. It's cheaper and stays current. Fine-tune later if you need tone/style control.
Which vector DB should I pick?
Local experiments: FAISS. Managed: Qdrant, Weaviate, or Milvus; choose what fits your stack and budget.
Quick Glossary
RAG: retrieve relevant context, then generate the answer with it.
Embedding: a numeric representation of meaning.
Vector DB: database optimized for vector search.
k-NN: find the k nearest vectors to a query.
Context Window: how much text the LLM can read at once.
Grounding: tying answers to real sources with citations.
Hallucination: confident but false outputs; RAG reduces this.
Reranker: second-pass model to improve ordering of retrieved chunks.
Summary & Next Steps
RAG = fetch first, then generate. It keeps answers factual, current, and explainable.
You only need a few basics: chunking, embeddings, a vector index, and a clear prompt template.
Start small, evaluate with real questions, then add hybrid search and reranking.
This week's mini-project
Ingest 50–100 of your own docs.
Build the tiny RAG loop above (retrieve → prompt → LLM).
Add citations and an "I don't know" fallback.
Share one win and one failure you observed, then iterate on chunking and k.
You now have the foundation to make GenAI that actually knows your stuff.
Hope you enjoyed reading this article.
If you found it valuable, hit a like and consider subscribing for more such content every week.
If you have any questions or suggestions, leave a comment.
This post is public so feel free to share it.
I actively post coding, system design and software engineering related content on
Spread the word and earn rewards!
If you enjoy my newsletter, share it with your friends and earn a one-on-one meeting with me when they subscribe. Let's grow the community together.
I hope you have a lovely day!
See you soon,
Rocky





