A production-ready RAG (Retrieval-Augmented Generation) pipeline designed to work with NebulaBlock's Inference API. This project demonstrates how to build a complete RAG system with document indexing, semantic search, state-of-the-art reranking, and answer generation.
- Production-Ready: Robust error handling, compression support, and browser-like headers
- State-of-the-Art Models: BAAI/bge-reranker-v2-m3 for superior reranking performance
- Lightweight: Minimal dependencies, no heavy ML frameworks
- Configurable: Environment-based configuration for all endpoints and models
- OpenAI-Compatible: Works with OpenAI-compatible APIs
- Complete Pipeline: Document splitting → embedding → retrieval → reranking → generation
- CLI Interface: Easy-to-use command-line interface with comprehensive options
- In-Memory Store: Fast vector similarity search with cosine similarity
- Compression Support: Handles Brotli and Gzip compression automatically
- Cloudflare Bypass: Browser-like headers to avoid security blocks
- Python 3.8+
- NebulaBlock API access
- Internet connection for API calls
```bash
# Clone the repository
git clone <repository-url>
cd rag-example

# Install in development mode
pip install -e .
```

Alternatively, install the dependencies directly and run the CLI in place:

```bash
# Clone the repository
git clone <repository-url>
cd rag-example

# Install dependencies
pip install -r requirements.txt

# Run directly
python -m nebularag.cli.main --help
```

Create a `.env` file in the project root with the following variables:
```env
# Required
NEBULABLOCK_BASE_URL=https://inference.nebulablock.com/v1
NEBULABLOCK_API_KEY=sk-your-api-key-here

# Optional (defaults shown)
NEBULABLOCK_EMBEDDINGS_PATH=/embeddings
NEBULABLOCK_RERANK_PATH=/rerank
NEBULABLOCK_CHAT_PATH=/chat/completions

# Models (optimized for performance)
NEBULABLOCK_EMBEDDING_MODEL=Qwen/Qwen3-Embedding-8B
NEBULABLOCK_RERANKER_MODEL=BAAI/bge-reranker-v2-m3
NEBULABLOCK_CHAT_MODEL=mistralai/Mistral-Small-3.2-24B-Instruct-2506
```

- Embedding: `Qwen/Qwen3-Embedding-8B` - high-quality 4096-dimensional embeddings
- Reranker: `BAAI/bge-reranker-v2-m3` - state-of-the-art model for relevance scoring
- Chat: `mistralai/Mistral-Small-3.2-24B-Instruct-2506` - powerful instruction-following model
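For orientation, these variables can be read with standard tooling. Below is a minimal, illustrative sketch using `python-dotenv`; the project's actual `nebularag/config/settings.py` may differ in detail:

```python
# Illustrative only: the project's nebularag/config/settings.py may differ.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read key=value pairs from a local .env file


def get_settings() -> dict:
    """Collect NebulaBlock settings, falling back to the documented defaults."""
    return {
        "base_url": os.environ["NEBULABLOCK_BASE_URL"],  # required
        "api_key": os.environ["NEBULABLOCK_API_KEY"],    # required
        "embeddings_path": os.getenv("NEBULABLOCK_EMBEDDINGS_PATH", "/embeddings"),
        "rerank_path": os.getenv("NEBULABLOCK_RERANK_PATH", "/rerank"),
        "chat_path": os.getenv("NEBULABLOCK_CHAT_PATH", "/chat/completions"),
        "embedding_model": os.getenv("NEBULABLOCK_EMBEDDING_MODEL", "Qwen/Qwen3-Embedding-8B"),
        "reranker_model": os.getenv("NEBULABLOCK_RERANKER_MODEL", "BAAI/bge-reranker-v2-m3"),
        "chat_model": os.getenv("NEBULABLOCK_CHAT_MODEL",
                                "mistralai/Mistral-Small-3.2-24B-Instruct-2506"),
    }
```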
The repository is laid out as follows:

```
rag-example/
├── nebularag/                  # Main package
│   ├── cli/                    # Command-line interface
│   │   └── main.py             # CLI entry point
│   ├── clients/                # External API clients
│   │   └── nebula_client.py    # NebulaBlock API client
│   ├── config/                 # Configuration management
│   │   └── settings.py         # Environment settings
│   ├── core/                   # Core RAG components
│   │   ├── rag_pipeline.py     # Main RAG pipeline
│   │   └── vector_store.py     # In-memory vector store
│   └── utils/                  # Utility functions
│       ├── file_utils.py       # File operations
│       └── text_processing.py  # Text splitting utilities
├── tests/                      # Test suite
│   └── test_api.py             # API connectivity tests
├── examples/                   # Usage examples
│   └── basic_usage.py          # Programmatic usage example
├── docs/                       # Sample documents
│   └── sample.md               # Example markdown file
├── setup.py                    # Package configuration
├── requirements.txt            # Python dependencies
├── .env.example                # Environment template
├── .gitignore                  # Git ignore rules
└── README.md                   # This file
```
1. Prepare your documents: Place `.txt`, `.md`, or `.pdf` files in a directory (e.g., `docs/`)
2. Set up environment: Copy `.env.example` to `.env` and fill in your API credentials
3. Run the RAG pipeline:

```bash
python -m nebularag.cli.main --docs docs --question "Why machine learning with nebula block?"
```
```bash
# Custom chunk size and overlap
python -m nebularag.cli.main \
    --docs docs \
    --question "Why machine learning with nebula block?" \
    --chunk-size 1000 \
    --chunk-overlap 150 \
    --top-k 15 \
    --rerank-k 8
```

| Option | Description | Default |
|---|---|---|
| `--docs` | Path to documents directory | Required |
| `--question` | Question to ask | Required |
| `--chunk-size` | Size of text chunks | 800 |
| `--chunk-overlap` | Overlap between chunks | 120 |
| `--top-k` | Number of candidates to retrieve | 12 |
| `--rerank-k` | Number of candidates kept after reranking | 6 |
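These flags map onto a conventional argument parser. A hypothetical sketch of the wiring is shown below; the actual `nebularag/cli/main.py` may differ:

```python
# Hypothetical sketch of the CLI wiring; the real nebularag/cli/main.py may differ.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the NebulaBlock RAG pipeline")
    parser.add_argument("--docs", required=True, help="Path to documents directory")
    parser.add_argument("--question", required=True, help="Question to ask")
    parser.add_argument("--chunk-size", type=int, default=800, help="Size of text chunks")
    parser.add_argument("--chunk-overlap", type=int, default=120, help="Overlap between chunks")
    parser.add_argument("--top-k", type=int, default=12, help="Candidates to retrieve")
    parser.add_argument("--rerank-k", type=int, default=6, help="Candidates kept after reranking")
    return parser
```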
```python
from nebularag import RAGPipeline, NebulaBlockClient, read_text_files

# Initialize the RAG pipeline
client = NebulaBlockClient()
rag = RAGPipeline(
    client=client,
    chunk_size=800,
    chunk_overlap=120,
    top_k=12,
    rerank_k=6
)

# Load and index documents
docs = read_text_files('docs')
rag.index_texts(docs)

# Ask questions
result = rag.answer("What is the main topic?")
print(f"Answer: {result['answer']}")
print(f"Sources: {len(result['sources'])} chunks")
```

Test your NebulaBlock API connection:

```bash
python tests/test_api.py
```
1. Document Processing:
   - Reads `.txt`, `.md`, and `.pdf` files from the specified directory
   - Extracts text content from PDFs using PyPDF2
   - Splits documents into overlapping chunks (default: 800 chars, 120 overlap; see the sketch after this list)
2. Indexing:
   - Generates embeddings for each chunk using Qwen/Qwen3-Embedding-8B
   - Stores embeddings in an in-memory vector store with cosine similarity
3. Retrieval:
   - Embeds the user question
   - Retrieves the top-K most similar chunks by cosine similarity
4. Reranking:
   - Sends the retrieved candidates to BAAI/bge-reranker-v2-m3
   - Scores each candidate's relevance to the question
   - Keeps the top rerank-K candidates
5. Generation:
   - Combines the reranked chunks into a context block
   - Sends context + question to Mistral-Small-3.2-24B-Instruct-2506
   - Returns the generated answer with source citations
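The splitting step is the easiest to picture concretely. Here is a minimal sketch of fixed-size chunking with character overlap, using the documented defaults; the project's `nebularag/utils/text_processing.py` may implement a different strategy:

```python
from typing import List


def split_text(text: str, chunk_size: int = 800, chunk_overlap: int = 120) -> List[str]:
    """Cut text into chunk_size-character pieces whose edges overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each chunk's start advances
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

Each chunk repeats the final 120 characters of its predecessor, so a sentence that straddles a boundary still appears intact in at least one chunk.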
The client assumes OpenAI/Cohere-like JSON structures but keeps endpoints configurable:
- Embeddings: `POST /embeddings` with `{"model": "...", "input": [...]}`
- Reranking: `POST /rerank` with `{"model": "...", "query": "...", "documents": [...]}`
- Chat: `POST /chat/completions` with `{"model": "...", "messages": [...]}`
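If you want to call these endpoints without the bundled client, plain `requests` works. A minimal sketch, assuming the response bodies follow the OpenAI/Cohere shapes named above:

```python
import os

import requests

BASE = os.environ["NEBULABLOCK_BASE_URL"]
HEADERS = {"Authorization": f"Bearer {os.environ['NEBULABLOCK_API_KEY']}"}

# Embeddings: POST /embeddings
emb = requests.post(f"{BASE}/embeddings", headers=HEADERS, json={
    "model": os.environ.get("NEBULABLOCK_EMBEDDING_MODEL", "Qwen/Qwen3-Embedding-8B"),
    "input": ["hello world"],
}).json()
vector = emb["data"][0]["embedding"]  # OpenAI-style response shape assumed

# Reranking: POST /rerank (Cohere-style request shape assumed)
rr = requests.post(f"{BASE}/rerank", headers=HEADERS, json={
    "model": os.environ.get("NEBULABLOCK_RERANKER_MODEL", "BAAI/bge-reranker-v2-m3"),
    "query": "what is RAG?",
    "documents": ["RAG combines retrieval with generation.", "Unrelated text."],
}).json()

# Chat: POST /chat/completions
chat = requests.post(f"{BASE}/chat/completions", headers=HEADERS, json={
    "model": os.environ.get("NEBULABLOCK_CHAT_MODEL",
                            "mistralai/Mistral-Small-3.2-24B-Instruct-2506"),
    "messages": [{"role": "user", "content": "Hi"}],
}).json()
print(chat["choices"][0]["message"]["content"])  # OpenAI-style response shape assumed
```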
- Compression Support: Automatically handles Brotli and Gzip compression
- Cloudflare Bypass: Uses browser-like headers to avoid security blocks
- Error Handling: Comprehensive error handling with retries and fallbacks
- Unicode Support: Robust text encoding with UTF-8 and Latin-1 fallbacks
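The gist of the header and compression handling can be reproduced with `requests`. A simplified sketch follows; the bundled client's exact header set is an implementation detail:

```python
import requests  # with the 'brotli' package installed, urllib3 decodes Brotli bodies

# Browser-like headers reduce the chance of hitting Cloudflare challenge pages.
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"),
    "Accept": "application/json",
    "Accept-Encoding": "br, gzip",  # advertise Brotli and Gzip support
}

session = requests.Session()
session.headers.update(BROWSER_HEADERS)

# requests decompresses gzip natively; Brotli ('br') additionally requires
# `pip install brotli`, after which decoding is automatic as well.
# The /models path below assumes an OpenAI-style model-listing endpoint.
resp = session.get("https://inference.nebulablock.com/v1/models")
print(resp.status_code, resp.headers.get("Content-Encoding"))
```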
```bash
# With sample documents
python -m nebularag.cli.main \
    --docs docs \
    --question "Why machine learning with nebula block?"
```

```bash
# Run the comprehensive demo
python examples/basic_usage.py
```

If you prefer the official OpenAI client:
```python
from openai import OpenAI
import os

client = OpenAI(
    base_url=os.environ["NEBULABLOCK_BASE_URL"],
    api_key=os.environ["NEBULABLOCK_API_KEY"]
)

# Embedding
response = client.embeddings.create(
    model=os.environ["NEBULABLOCK_EMBEDDING_MODEL"],
    input=["hello world"]
)

# Chat
response = client.chat.completions.create(
    model=os.environ["NEBULABLOCK_CHAT_MODEL"],
    messages=[{"role": "user", "content": "Hi"}]
)
```

```bash
# Test API connectivity
python tests/test_api.py

# Test imports
python -c "from nebularag import NebulaBlockClient, RAGPipeline; print('Import successful!')"

# Run the full demo
python examples/basic_usage.py
```

- New Vector Store: Implement the interface in `nebularag/core/vector_store.py` (a hypothetical sketch follows this list)
- New Splitters: Add functions to `nebularag/utils/text_processing.py`
- New Clients: Extend `nebularag/clients/nebula_client.py` or create new client classes
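A drop-in store only needs to index vectors and return the most similar ones. The sketch below is hypothetical: it assumes an `add`/`search` interface, so check `nebularag/core/vector_store.py` for the actual method names before implementing:

```python
from typing import List, Tuple

import numpy as np


class NumpyVectorStore:
    """Hypothetical drop-in store; mirror the real interface in
    nebularag/core/vector_store.py (method names here are assumed)."""

    def __init__(self) -> None:
        self._vectors: List[np.ndarray] = []
        self._texts: List[str] = []

    def add(self, texts: List[str], embeddings: List[List[float]]) -> None:
        for text, emb in zip(texts, embeddings):
            v = np.asarray(emb, dtype=np.float32)
            self._vectors.append(v / np.linalg.norm(v))  # unit length: dot product == cosine
            self._texts.append(text)

    def search(self, query_embedding: List[float], k: int = 12) -> List[Tuple[str, float]]:
        q = np.asarray(query_embedding, dtype=np.float32)
        q /= np.linalg.norm(q)
        sims = np.stack(self._vectors) @ q  # all cosine similarities in one matmul
        top = np.argsort(-sims)[:k]
        return [(self._texts[i], float(sims[i])) for i in top]
```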
- This is a production-ready implementation with robust error handling
- For production use, consider:
- Persistent vector databases (Pinecone, Weaviate, etc.)
- Semantic chunking strategies
- Caching mechanisms
- Rate limiting and retry logic (see the backoff sketch after this list)
- The reranker uses BAAI/bge-reranker-v2-m3 for superior performance
- All API calls are synchronous; async support can be added for better performance
- Compression and encoding issues are handled automatically
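As a starting point for the retry recommendation, here is a simple exponential-backoff wrapper. It is illustrative only; tune the exception types, status codes, and limits for your deployment:

```python
import time

import requests


def post_with_retries(url: str, payload: dict, headers: dict,
                      max_attempts: int = 4, base_delay: float = 1.0) -> requests.Response:
    """POST with exponential backoff on transient failures (illustrative sketch)."""
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post(url, json=payload, headers=headers, timeout=30)
            # Treat rate limits and server errors as retryable until attempts run out.
            if resp.status_code in (429, 500, 502, 503, 504) and attempt < max_attempts:
                raise requests.HTTPError(f"retryable status {resp.status_code}")
            resp.raise_for_status()
            return resp
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError):
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, 2s, 4s, ...
```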
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- ModuleNotFoundError: Make sure you've installed the package with `pip install -e .`
- API Key Error: Verify your `NEBULABLOCK_API_KEY` is set correctly
- Connection Error: Check your `NEBULABLOCK_BASE_URL` and internet connection
- Empty Results: Ensure your documents directory contains `.txt`, `.md`, or `.pdf` files
- Compression Error: Install Brotli with `pip install "brotli>=1.0.9"`
- Cloudflare Block: The client automatically uses browser-like headers to bypass this

- Check the Issues page
- Review the API documentation for NebulaBlock
- Test your API connection with `python tests/test_api.py`
- Embedding Model: Qwen/Qwen3-Embedding-8B provides 4096-dimensional embeddings
- Reranker: BAAI/bge-reranker-v2-m3 offers state-of-the-art relevance scoring
- Chat Model: Mistral-Small-3.2-24B-Instruct-2506 delivers high-quality responses
- Vector Search: Cosine similarity with in-memory storage for fast retrieval
- Compression: Automatic Brotli/Gzip handling for efficient data transfer
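For a concrete feel of the vector-search cost, the snippet below times cosine top-k over synthetic data at the embedding model's 4096-dimensional size; the corpus size is made up for illustration:

```python
import time

import numpy as np

# Toy measurement on synthetic data: cosine top-5 over 10,000 pre-normalized
# 4096-dimensional vectors (the dimensionality of Qwen3-Embedding-8B).
rng = np.random.default_rng(0)
index = rng.standard_normal((10_000, 4096)).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)
query = rng.standard_normal(4096).astype(np.float32)
query /= np.linalg.norm(query)

start = time.perf_counter()
top5 = np.argsort(-(index @ query))[:5]  # one matmul, then rank by similarity
print(f"top-5 of 10k vectors in {time.perf_counter() - start:.4f}s: {top5}")
```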