Retrieval-Augmented Generation (RAG) is an AI architecture that combines:
- Information Retrieval: Finding relevant documents from a knowledge base
- Augmented Context: Using retrieved information to enhance prompts
- Generation: Producing accurate, contextual responses using LLMs
Cloud-based RAG drawbacks:
- ❌ Costs money for every API call
- ❌ Your data leaves your machine
- ❌ Rate limits and quotas
- ❌ Internet dependency
- ❌ Privacy concerns
Local RAG advantages:
- ✅ Zero cost after initial setup
- ✅ 100% private - data never leaves your machine
- ✅ No rate limits - unlimited queries
- ✅ Works offline - no internet needed
- ✅ Full control - customize everything
Documents are split into manageable chunks for processing:
- Chunk Size: 512 characters (configurable)
- Overlap: 50 characters to maintain context
- Smart Splitting: Respects paragraphs, sentences, and markdown structure
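A minimal character-based splitter illustrating the chunk-size/overlap scheme above (a simplified sketch: the real splitter also respects paragraph, sentence, and markdown boundaries):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    # Slide a window of chunk_size characters, stepping by chunk_size - overlap
    # so consecutive chunks share `overlap` characters of context.
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = chunk_text("x" * 1000)
print(len(chunks))  # 3 chunks, starting at offsets 0, 462, 924
```

The overlap means the last 50 characters of one chunk reappear at the start of the next, so a sentence cut at a boundary is still fully visible in at least one chunk.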
Convert text into numerical vectors for similarity search:
```python
from src.embeddings_local import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text:latest")
```
- Model: `nomic-embed-text` (768 dimensions)
- Speed: ~0.2s per document
- Quality: Excellent for most use cases
```python
from src.embeddings_local import SentenceTransformerEmbeddings

embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
```
- Model: `all-MiniLM-L6-v2` (384 dimensions)
- Speed: ~0.1s per document
- Quality: Good, slightly lower than Ollama
Store and search embeddings efficiently:
LanceDB - Our choice over ChromaDB:
- 10x faster for large datasets
- Zero-copy data access
- Native hybrid search (vector + keyword)
- Automatic versioning
- Better memory efficiency
```python
from src.vector_store_lancedb import LanceDBVectorStore

store = LanceDBVectorStore(
    collection_name="documents",
    embedding_dim=768
)
```
Generate responses using Ollama:
```python
from src.llm_local import OllamaLLM

llm = OllamaLLM(model="mistral:latest")
```
Model recommendations by RAM:
- 8GB: `mistral:7b` - good balance
- 16GB: `llama2:13b` - better quality
- 32GB+: `mixtral:8x7b` - best quality
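The RAM tiers above can be encoded as a small helper; `pick_model` is illustrative and not part of the project API:

```python
def pick_model(ram_gb: int) -> str:
    # Map available RAM to the model tiers recommended above.
    if ram_gb >= 32:
        return "mixtral:8x7b"   # best quality
    if ram_gb >= 16:
        return "llama2:13b"     # better quality
    return "mistral:7b"         # good balance

print(pick_model(16))  # llama2:13b
```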
Orchestrates the entire process:
```python
from src.rag_pipeline_local import LocalRAGPipeline

# Initialize
rag = LocalRAGPipeline()

# Add documents
rag.add_documents(["Document text..."])

# Query
response = rag.query("Your question")
```
Documents → Chunking → Embeddings → Vector Store
- Documents are split into chunks
- Each chunk is converted to an embedding vector
- Vectors are stored in LanceDB with metadata
Query → Embedding → Vector Search → Context Retrieval
- User query is converted to an embedding
- Similar vectors are found in the database
- Original text chunks are retrieved
Context + Query → LLM → Response
- Retrieved context is combined with the query
- LLM generates a response using the context
- Response includes source references
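The full flow above can be condensed into a toy in-memory sketch. All names here are illustrative, and a bag-of-words counter stands in for a real embedding model:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real pipeline uses a neural model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Indexing: chunks -> "vectors" stored alongside their original text
store = [(embed(c), c) for c in [
    "Paris is the capital of France.",
    "Rust is a systems programming language.",
]]

# Retrieval: embed the query, find the most similar chunk
query = "What is the capital of France?"
best_vec, context = max(store, key=lambda entry: cosine(embed(query), entry[0]))

# Generation: combine retrieved context with the query into an LLM prompt
prompt = f"Context: {context}\n\nQuestion: {query}"
print(context)  # Paris is the capital of France.
```

The real system swaps in neural embeddings, an ANN index instead of a linear scan, and an LLM call on `prompt`, but the data flow is the same.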
Embeddings are automatically cached to avoid recomputation:
```python
# Cache location: ./data/embedding_cache/
# Cache key: MD5 hash of model:text
```
Documents are processed in batches for efficiency:
```python
# Default batch size: 100 documents
# Configurable in vector_store_lancedb.py
```
Create an ANN index for faster search:
```python
store.create_index(metric="L2", nprobes=20)
```
Typical cloud API costs:
- OpenAI Embeddings: $0.13 per million tokens
- Anthropic Claude: $0.25-1.25 per million tokens
- Monthly cost: $10-1000+ depending on usage
Local setup costs:
- Initial setup: ~1 hour
- Model downloads: ~5-30GB storage
- Running cost: $0.00 forever
At just 100 queries/day, you save ~$11/month, so you break even in less than 3 months.
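As a back-of-envelope check of the figure above (the per-query cloud cost is an assumed illustrative value, chosen to be consistent with the ~$11/month claim):

```python
queries_per_day = 100
cost_per_cloud_query = 0.0037  # assumed average $/query (embeddings + completion)
monthly_savings = queries_per_day * 30 * cost_per_cloud_query
print(f"${monthly_savings:.2f}/month")  # $11.10/month
```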
Combine vector similarity with keyword matching:
```python
results = store.search(
    query="Python programming",
    hybrid_search=True,
    top_k=5
)
```
Filter results by metadata:
```python
results = store.search(
    query="Your query",
    filter_metadata={"source": "documentation"}
)
```
Customize system prompts for different use cases:
```python
response = rag.query(
    query="Explain this code",
    system_prompt="You are a code tutor..."
)
```
- "Ollama not found"
  - Solution: install Ollama from https://ollama.ai
  - Start the service: `ollama serve`
- "Model not found"
  - Solution: pull the models: `ollama pull nomic-embed-text:latest` and `ollama pull mistral:latest`
- "Out of memory"
  - Solution: use smaller models (`phi` instead of `mistral`)
  - Reduce the batch size
- "Slow performance"
  - Solution: create an ANN index
  - Use an SSD for the data directory
  - Enable caching
- Experiment with Models: Try different models for your use case
- Add Your Data: Index your own documents
- Customize Chunking: Adjust chunk size for your content
- Build Applications: Create chatbots, Q&A systems, etc.
- Optimize Performance: Fine-tune for your hardware
- Local RAG is production-ready: This isn't a toy - it's real, usable technology
- Cost savings are massive: Literally $0 to run after setup
- Privacy is absolute: Your data never leaves your machine
- Performance is excellent: Often faster than cloud APIs
- Customization is unlimited: You control every aspect
Remember: The best RAG system is the one that costs nothing to run and keeps your data private. That's what we've built here.