# Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with language model generation to produce more accurate and grounded responses.
## How RAG Works

The RAG pipeline consists of three main stages:

1. **Indexing**: Documents are split into chunks, embedded into vectors, and stored in a vector database like Qdrant, Pinecone, or Weaviate.
2. **Retrieval**: When a user query arrives, it is embedded using the same model, and the most similar document chunks are retrieved via approximate nearest neighbor (ANN) search.
3. **Generation**: The retrieved chunks are passed as context to a large language model (LLM) like GPT-4 or Claude, which generates a response grounded in the provided information.
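The three stages can be sketched end to end in a few lines. This is a toy illustration of the data flow, not a production pipeline: the `embed` function below is a bag-of-words stand-in for a real embedding model, and the in-memory list stands in for a vector database.

```python
import math

def embed(text: str, vocab: list[str]) -> list[float]:
    # Stand-in for a real embedding model: a bag-of-words vector,
    # so texts sharing words get similar vectors.
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Indexing: chunk the corpus and store (chunk, vector) pairs.
docs = [
    "RAG combines retrieval with generation.",
    "Vector databases store embeddings.",
    "Chunking splits documents into pieces.",
]
vocab = sorted({w for d in docs for w in d.lower().split()})
index = [(d, embed(d, vocab)) for d in docs]  # the "vector database"

# 2. Retrieval: embed the query with the same model, find the nearest chunk.
query = "how do vector databases store embeddings"
qv = embed(query, vocab)
top = max(index, key=lambda pair: cosine(qv, pair[1]))

# 3. Generation: pass the retrieved chunk to the LLM as context.
prompt = f"Context: {top[0]}\n\nQuestion: {query}\nAnswer:"
```

In a real system, step 2 would use ANN search over millions of vectors rather than a linear scan, and step 3 would send `prompt` to an LLM API.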
## Key Benefits

- **Reduced hallucination**: By grounding responses in retrieved documents, RAG significantly reduces the tendency of LLMs to generate incorrect information.
- **Up-to-date knowledge**: Unlike fine-tuning, RAG allows models to access the latest information without retraining.
- **Source attribution**: Each response can be traced back to specific source documents, enabling citation and verification.
- **Domain specificity**: Organizations can build RAG systems over their proprietary knowledge bases.
## Chunking Strategies

Effective chunking is critical for RAG performance:

- **Fixed-size chunking**: Split text into chunks of N tokens with overlap. Simple, but may break semantic boundaries.
- **Semantic chunking**: Split by paragraphs, sections, or sentences, respecting document structure.
- **Recursive chunking**: Start with large chunks, then recursively split any that exceed the size limit.
- **Agentic chunking**: Use an LLM to determine optimal split points based on content.

The optimal chunk size depends on the embedding model and use case, typically ranging from 256 to 1024 tokens.
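Fixed-size chunking with overlap, the first strategy above, can be sketched as a sliding window. For simplicity this example counts whitespace-separated words; a real system would count tokens from the embedding model's tokenizer.

```python
def chunk_fixed(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    # Slide a window of `size` tokens, advancing by `size - overlap`
    # so consecutive chunks share `overlap` tokens of context.
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window has reached the end of the text
    return chunks

tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_fixed(tokens, size=4, overlap=2)
# Each chunk repeats the last 2 tokens of the previous one.
```

The overlap is what mitigates the "broken semantic boundary" problem: a sentence cut at a chunk edge still appears whole in the neighboring chunk.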
## Vector Databases

Vector databases are purpose-built for storing and searching high-dimensional embeddings:

- **Qdrant**: Open-source, supports filtering and payload storage, written in Rust.
- **Pinecone**: Managed cloud service with a simple API; scales automatically.
- **Weaviate**: Open-source with built-in vectorization modules and a GraphQL API.
- **Milvus**: Open-source, designed for billion-scale vector search.
- **ChromaDB**: Lightweight, designed for AI application development.
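Despite their differences, all of these systems answer the same core query: given a vector, return the k most similar stored vectors. The exact (brute-force) version of that query, which ANN indexes such as HNSW approximate at scale, fits in a few lines:

```python
import math

def knn(query: list[float], points: dict[str, list[float]], k: int) -> list[str]:
    # Exact k-nearest-neighbor search by cosine similarity.
    # Vector databases approximate this with ANN indexes so that
    # search stays fast at millions or billions of points.
    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    return sorted(points, key=lambda pid: -cos(query, points[pid]))[:k]

points = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
nearest = knn([1.0, 0.05], points, k=2)
```

The trade-off each product makes is how to avoid this linear scan (graph indexes, inverted file indexes, quantization) while keeping recall close to the exact result.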
## Advanced RAG Techniques
### Hybrid Search

Combining dense vector search with sparse keyword search (BM25) improves recall by capturing both semantic similarity and exact term matches.
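One common way to merge the two result lists is reciprocal rank fusion (RRF), which needs only ranks, not directly comparable scores. A minimal sketch (the constant `k = 60` is the conventional default from the RRF literature):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: each input ranking contributes
    # 1 / (k + rank) per document, so documents found by both the
    # dense and the sparse search rise to the top of the fused list.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d2", "d1", "d5"]   # ranking from vector search
sparse = ["d2", "d7", "d1"]  # ranking from BM25 keyword search
fused = rrf([dense, sparse])
```

Because RRF ignores raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on incompatible scales.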
### Re-ranking

After initial retrieval, a cross-encoder model re-ranks the results for better precision. Models like Cohere Rerank or BGE Reranker are commonly used.
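The two-stage pattern looks like this. The term-overlap `score` function below is only a stand-in for a real cross-encoder, which would score each (query, document) pair jointly with a neural model; the structure (over-fetch cheaply, then re-order with a more expensive scorer) is the point.

```python
def rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    # Stage 2 of retrieval: re-order an over-fetched candidate list
    # with a more precise (and more expensive) per-pair scorer.
    def score(doc: str) -> float:
        # Stand-in scorer: fraction of query terms present in the doc.
        # A real system would call a cross-encoder here.
        q = set(query.lower().split())
        d = set(doc.lower().split())
        return len(q & d) / len(q)
    return sorted(candidates, key=score, reverse=True)[:top_n]

# Candidates as a first-stage retriever might return them.
candidates = [
    "cats are popular pets",
    "the stock market fell today",
    "popular pets include cats and dogs",
]
best = rerank("which pets are popular", candidates, top_n=2)
```

Cross-encoders are far too slow to score a whole corpus, which is why they are applied only to the few dozen candidates the first stage returns.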
### Query Expansion

Generating multiple reformulations of the user query helps retrieve a broader set of relevant documents.
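The pattern is: run retrieval once per reformulation, then merge the results with deduplication. In this sketch the reformulations are hardcoded and the retriever is a toy keyword matcher; in practice an LLM generates the variants and a vector or hybrid search does the retrieval.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Toy keyword retriever, used only to illustrate the pattern.
    q = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda d: len(q & set(d.lower().split())),
        reverse=True,
    )
    return [d for d in scored if q & set(d.lower().split())][:k]

def expanded_search(reformulations: list[str], corpus: list[str]) -> list[str]:
    # Union the per-variant results, preserving first-seen order.
    seen: set[str] = set()
    merged: list[str] = []
    for q in reformulations:
        for doc in retrieve(q, corpus):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

corpus = [
    "reset your password in account settings",
    "password recovery via email link",
    "update billing information",
]
# In practice an LLM would generate these variants from the user query.
variants = ["forgot password", "how to reset password", "account recovery"]
results = expanded_search(variants, corpus)
```

Each variant phrases the need differently, so the union covers documents that any single phrasing would have missed.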
### Knowledge Graphs

Integrating knowledge graphs with RAG adds structured relationships between entities, enabling multi-hop reasoning and better context understanding.
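Multi-hop reasoning over a graph can be sketched as breadth-first traversal of (subject, relation, object) triples. The tiny hand-built graph below is illustrative; a real system would query a graph store and feed the reachable facts into the LLM's context alongside retrieved text.

```python
from collections import deque

# Toy knowledge graph as (subject, relation, object) triples.
triples = [
    ("Marie Curie", "won", "Nobel Prize in Physics"),
    ("Marie Curie", "spouse", "Pierre Curie"),
    ("Pierre Curie", "won", "Nobel Prize in Physics"),
    ("Nobel Prize in Physics", "awarded_by", "Royal Swedish Academy of Sciences"),
]

def neighbors(entity: str):
    # Treat edges as bidirectional for traversal.
    for s, _, o in triples:
        if s == entity:
            yield o
        elif o == entity:
            yield s

def entities_within(start: str, hops: int) -> set[str]:
    # Breadth-first search: every entity reachable within `hops` edges.
    # This reachable set is the structured context a graph-augmented
    # RAG system can inject alongside retrieved text chunks.
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

context = entities_within("Marie Curie", hops=2)
```

A question like "which institution awards the prize Marie Curie won?" needs two hops (Curie → prize → awarding body), which pure chunk retrieval can miss when the two facts live in different documents.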