A 100% local, ZERO-COST RAG system that replaces:
- OpenAI Embeddings ($0.13/million tokens) → Ollama/SentenceTransformers (FREE)
- Anthropic Claude ($0.25-1.25/million tokens) → Ollama Local LLMs (FREE)
- ChromaDB → LanceDB (10x faster)
- Python 3.9+ (via system or UV)
- 8GB RAM minimum (16GB recommended, 32GB optimal)
- OS: Windows 10/11, macOS 10.15+, Linux (Ubuntu 20.04+)
- Shell: PowerShell 5.1+ (Windows) or Bash (Mac/Linux)
.\setup_complete.ps1chmod +x setup_complete.sh
./setup_complete.sh.\setup_uv.ps1./setup_uv.sh
# Or directly:
curl -LsSf https://astral.sh/uv/install.sh | sh# Download and run installer:
# https://github.com/ollama/ollama/releases/latest/download/OllamaSetup.exe
# Or use script:
.\install_ollama_windows.ps1# Using Homebrew
brew install ollama
# Or using script
./install_ollama_unix.sh# Official installer
curl -fsSL https://ollama.ai/install.sh | sh
# Or using script
./install_ollama_unix.sh# Direct path
& "$env:LOCALAPPDATA\Programs\Ollama\ollama.exe" serve
# Or if in PATH
ollama serve
# Or use batch file
.\run_ollama_direct.batollama serve
# Or in background
nohup ollama serve > /tmp/ollama.log 2>&1 &
# Or use script
./run_ollama_direct.shKeep this running in a separate terminal!
# Pull embedding model (274MB)
ollama pull nomic-embed-text
# Pull LLM based on your RAM:
# 8GB RAM - Mistral 7B (4.4GB)
ollama pull mistral
# 16GB RAM - Llama 2 13B
ollama pull llama2:13b
# 32GB+ RAM - Mixtral (best quality)
ollama pull mixtral$ollama = "$env:LOCALAPPDATA\Programs\Ollama\ollama.exe"
& $ollama pull nomic-embed-text
& $ollama pull mistraluv pip install lancedb pyarrow sentence-transformerspip install lancedb pyarrow sentence-transformers# Run comprehensive test
python test_local_rag.py# List models
curl http://localhost:11434/api/tags
# Test embedding
curl -X POST http://localhost:11434/api/embeddings \
-H "Content-Type: application/json" \
-d '{"model":"nomic-embed-text","prompt":"test"}'
# Test LLM
curl -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{"model":"mistral","prompt":"Hello","stream":false}'RAG Shit/
├── src/
│ ├── vector_store_lancedb.py # LanceDB implementation
│ ├── embeddings_local.py # Ollama/ST embeddings
│ ├── llm_local.py # Ollama LLM wrapper
│ └── rag_pipeline_local.py # Complete local pipeline
├── config/
│ ├── local_rag_config.yaml # Configuration file
│ └── settings.py # Python settings
├── Windows Scripts/
│ ├── install_ollama_windows.ps1 # Ollama installer
│ ├── setup_ollama_models.ps1 # Model downloader
│ ├── setup_complete.ps1 # One-click setup
│ ├── run_ollama_direct.bat # Direct launcher
│ └── setup_uv.ps1 # UV installer
├── Unix Scripts/
│ ├── install_ollama_unix.sh # Ollama installer
│ ├── setup_ollama_models.sh # Model downloader
│ ├── setup_complete.sh # One-click setup
│ ├── run_ollama_direct.sh # Direct launcher
│ └── setup_uv.sh # UV installer
├── test_local_rag.py # Test suite
├── LOCAL_RAG_SETUP.md # Complete guide
├── QUICKSTART_LOCAL.md # Quick start
└── MIGRATION_GUIDE.md # Migration from APIs
Models are stored in: %USERPROFILE%\.ollama\models
Available models after setup:
nomic-embed-text- 768-dim embeddingsmistral:7b- 7B parameter LLM
Data stored in: ./data/lancedb/
Features enabled:
- Vector search with ANN index
- Hybrid search capability
- Automatic versioning
No API keys needed! But if you want fallback to APIs:
# .env file (OPTIONAL - only for fallback)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...from src.rag_pipeline_local import LocalRAGPipeline
# Initialize pipeline
rag = LocalRAGPipeline(
llm_model="mistral:7b",
embedding_model="nomic-embed-text"
)
# Add documents
docs = ["Your document text here..."]
rag.add_documents(docs)
# Query - ZERO COST!
response = rag.query("What is this about?")
print(f"Answer: {response.answer}")
print(f"Cost: ${response.cost}") # Always $0.00!rag = LocalRAGPipeline(
llm_model="mistral:7b",
embedding_model="all-MiniLM-L6-v2",
use_sentence_transformers=True
)Solution: Use full path or add to PATH:
$env:Path += ";$env:LOCALAPPDATA\Programs\Ollama"Solution: Start in new window:
Start-Process "$env:LOCALAPPDATA\Programs\Ollama\ollama.exe" -ArgumentList "serve"Solution: Pull the model:
& "$env:LOCALAPPDATA\Programs\Ollama\ollama.exe" pull mistralSolution: Run setup and restart PowerShell:
.\setup_uv.ps1
# Then restart PowerShellOn 32GB RAM system with Mistral 7B:
- Embedding generation: ~0.2s per document
- Vector search: ~0.05s for 10k documents
- LLM generation: ~2-5s for 500 tokens
- Total query time: ~3-6s
- Cost: $0.00
| RAM | Embedding Model | LLM Model | Quality |
|---|---|---|---|
| 4GB | all-MiniLM-L6-v2 (ST) | phi-2 | Basic |
| 8GB | nomic-embed-text | mistral:7b | Good |
| 16GB | nomic-embed-text | llama2:13b | Excellent |
| 32GB+ | mxbai-embed-large | mixtral:8x7b | Best |
| Operation | Old (APIs) | New (Local) | Savings |
|---|---|---|---|
| 1M embeddings | $130 | $0 | 100% |
| 1M LLM tokens | $250-1250 | $0 | 100% |
| Monthly (100 queries/day) | ~$11.40 | $0 | 100% |
| Yearly | ~$137 | $0 | 100% |
-
Optimize for your hardware:
- GPU? Install CUDA-enabled llama.cpp
- More RAM? Use larger models
- SSD? Move LanceDB to SSD for faster search
-
Production deployment:
- Set up Ollama as Windows service
- Add monitoring/logging
- Implement caching layer
-
Advanced features:
- Implement reranking
- Add hybrid search
- Use quantized models for speed
You now have a complete local RAG system that:
- Costs nothing to run
- Protects your data (100% local)
- Runs offline (no internet needed)
- Has no rate limits
- Performs comparably to cloud solutions
Stop paying for API calls. This is the way.