
Jerry-145/PDF_QA_Bot


PDF Q&A Bot

RAG-based document question-answering app with:

  • Frontend: React app (frontend/)
  • Backend API: Node + Express (server.js)
  • RAG Service: FastAPI + Hugging Face + FAISS (rag-service/)
  • 🔐 Authentication: Role-based access control with JWT tokens

Upload a PDF, ask questions about its content, and generate a short summary. You can export the chat as CSV or TXT (plain text).


🚀 Important: Context Leakage Fix Implemented

Issue Resolved: The system previously showed content from old PDFs when answering questions about new PDFs. This has been completely fixed.

For testing and understanding the fix, see:


⚠️ SECURITY NOTE: As of v2.0.0, authentication is required for all PDF processing endpoints. See the Authentication section below.

Architecture

  1. Frontend uploads file to Node backend (/upload)
  2. Node forwards file path to FastAPI (/process-pdf)
  3. FastAPI detects file format (.pdf, .docx, .txt, .md), loads and splits the document, builds vector index with embeddings
  4. For /ask and /summarize, FastAPI retrieves relevant chunks and generates output with a Hugging Face model
  5. 🔐 All API endpoints now require valid JWT authentication
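
Step 3 can be illustrated with a minimal sketch. The helper names `detect_format` and `split_into_chunks`, the format labels, and the chunk sizes are assumptions for illustration only; the actual service may rely on a document-loader library instead.

```python
from pathlib import Path

# Extensions listed in the Architecture steps above; the labels are illustrative.
SUPPORTED_FORMATS = {".pdf": "pdf", ".docx": "docx", ".txt": "text", ".md": "markdown"}


def detect_format(file_path: str) -> str:
    """Return a format label based on the file extension (case-insensitive)."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return SUPPORTED_FORMATS[suffix]


def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.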

Project Structure

.
├── frontend/           # React UI
├── rag-service/        # FastAPI RAG service
│   ├── auth/           # Authentication system
│   │   ├── models.py   # User models & roles
│   │   ├── schemas.py  # Pydantic schemas
│   │   ├── security.py # JWT & password utils
│   │   ├── middleware.py # Auth middleware
│   │   └── router.py   # Auth endpoints
│   ├── tests/          # Comprehensive test suite
│   ├── database.py     # Database configuration
│   └── main.py         # FastAPI app with auth integration
├── server.js           # Node API gateway
├── uploads/            # Uploaded files (runtime)
└── CONTRIBUTING.md

Prerequisites

  • Node.js 18+ (LTS recommended)
  • Python 3.10+
  • pip

1) Clone and Install Dependencies

From repository root:

npm install
cd frontend && npm install
cd ../rag-service && python -m pip install -r requirements.txt

2) Environment Variables

Create .env in repo root (or edit existing):

# Optional model override
HF_GENERATION_MODEL=google/flan-t5-base

# REQUIRED: Authentication Configuration
SECRET_KEY=your-super-secret-jwt-key-change-this-in-production
ACCESS_TOKEN_EXPIRE_MINUTES=30

# Optional: Database Configuration (defaults to SQLite)
DATABASE_URL=sqlite:///./pdf_qa_bot.db
# For PostgreSQL: DATABASE_URL=postgresql://user:password@localhost/dbname
# For MySQL: DATABASE_URL=mysql://user:password@localhost/dbname

🔐 Security Configuration Notes:

  • SECRET_KEY: MUST be changed in production! Generate with:
    python -c "from secrets import token_urlsafe; print(token_urlsafe(32))"
  • DATABASE_URL: Defaults to SQLite for development. Use PostgreSQL/MySQL for production.
  • Keep real secrets out of git.
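
As a rough sketch of how these variables might be read at startup, the hypothetical `load_settings` helper below applies the defaults shown above and flags the placeholder secret. The function name, the returned keys, and the assumption that `google/flan-t5-base` is the fallback model are illustrative, not taken from the project code.

```python
import os
from typing import Mapping

# Placeholder value from the sample .env above; it should never reach production.
PLACEHOLDER_SECRET = "your-super-secret-jwt-key-change-this-in-production"


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect settings from the environment, failing fast when SECRET_KEY is missing."""
    secret = env.get("SECRET_KEY", "")
    if not secret:
        raise RuntimeError("SECRET_KEY is required")
    return {
        "secret_key": secret,
        "uses_placeholder_secret": secret == PLACEHOLDER_SECRET,
        "access_token_expire_minutes": int(env.get("ACCESS_TOKEN_EXPIRE_MINUTES", "30")),
        "database_url": env.get("DATABASE_URL", "sqlite:///./pdf_qa_bot.db"),
        # Assumed fallback; the README only documents this as an optional override.
        "generation_model": env.get("HF_GENERATION_MODEL", "google/flan-t5-base"),
    }
```

A deployment script could refuse to start when `uses_placeholder_secret` is true.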

Authentication

Overview

The PDF QA Bot now includes comprehensive authentication and authorization:

  • JWT-based Authentication: Secure token-based authentication
  • Role-based Access Control: User and Admin roles with extensible permission system
  • Password Security: bcrypt hashing with industry-standard practices
  • API Protection: All PDF processing endpoints require authentication
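
To make the token mechanics concrete, here is a standard-library sketch of HS256 signing and verification as defined for JWTs. It is an illustration only: the project most likely uses a JWT library in `auth/security.py`, and the function names `create_token` and `verify_token` are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> bytes:
    return hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()


def create_token(claims: dict, secret: str, expires_in: int = 1800) -> str:
    """Build header.payload.signature with an `exp` claim `expires_in` seconds ahead."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {**claims, "exp": int(time.time()) + expires_in}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode()) for part in (header, payload)
    )
    return f"{signing_input}.{_b64url(_sign(signing_input, secret))}"


def verify_token(token: str, secret: str) -> dict:
    """Return the payload if the signature matches and the token has not expired."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("invalid signature")
    payload = json.loads(_b64url_decode(payload_b64))
    if payload["exp"] <= time.time():
        raise ValueError("token expired")
    return payload
```

The default of 1800 seconds mirrors `ACCESS_TOKEN_EXPIRE_MINUTES=30` from the environment section.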

User Roles & Permissions

| Role | Permissions |
|------|-------------|
| User | Upload PDFs, Ask questions, Summarize documents, View documents |
| Admin | All user permissions + Manage users, Delete documents, Compare documents, View all data |
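
One way to model the roles above is a simple role-to-permission mapping. The permission strings and the `has_permission` helper below are hypothetical names chosen for this sketch; the real definitions live in `auth/models.py` and may differ.

```python
from enum import Enum


class Role(str, Enum):
    USER = "user"
    ADMIN = "admin"


# Permission names are illustrative labels for the capabilities listed above.
USER_PERMISSIONS = frozenset(
    {"upload_pdf", "ask_question", "summarize_document", "view_documents"}
)
ADMIN_ONLY_PERMISSIONS = frozenset(
    {"manage_users", "delete_documents", "compare_documents", "view_all_data"}
)

ROLE_PERMISSIONS = {
    Role.USER: USER_PERMISSIONS,
    # Admins inherit every user permission and add their own.
    Role.ADMIN: USER_PERMISSIONS | ADMIN_ONLY_PERMISSIONS,
}


def has_permission(role: Role, permission: str) -> bool:
    """Check whether a role grants a permission; unknown roles grant nothing."""
    return permission in ROLE_PERMISSIONS.get(role, frozenset())
```

Adding a new role under this design means adding one enum member and one mapping entry.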

Quick Start Guide

1. Create Admin User

After starting the application, create your first admin user:

# Register admin user via API
curl -X POST "http://localhost:5000/auth/register" \
  -H "Content-Type: application/json" \
  -d '{
    "username": "admin",
    "email": "admin@yourcompany.com",
    "password": "your_secure_password_123",
    "full_name": "System Administrator",
    "role": "admin"
  }'

2. Login & Get Token

# Login to get JWT token
curl -X POST "http://localhost:5000/auth/login" \
  -H "Content-Type: application/json" \
  -d '{
    "username": "admin",
    "password": "your_secure_password_123"
  }'

# Response:
{
  "access_token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9...",
  "token_type": "bearer",
  "expires_in": 1800,
  "user": {
    "id": 1,
    "username": "admin",
    "email": "admin@yourcompany.com",
    "role": "admin",
    "is_active": true
  }
}

3. Use Protected Endpoints

All PDF processing endpoints now require the Authorization header:

# Upload PDF with authentication
curl -X POST "http://localhost:5000/upload" \
  -H "Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9..." \
  -F "file=@your-document.pdf"

# Ask question with authentication
curl -X POST "http://localhost:5000/ask" \
  -H "Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9..." \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What is the main topic?",
    "doc_ids": ["doc-id-from-upload"]
  }'
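
The same `/ask` call can be prepared from Python with only the standard library. The helper name `build_ask_request` is an assumption for this sketch; it builds the request object without sending it, so it can be inspected or passed to `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_ask_request(
    base_url: str, token: str, question: str, doc_ids: list[str]
) -> urllib.request.Request:
    """Build (but do not send) an authenticated POST /ask request."""
    body = json.dumps({"question": question, "doc_ids": doc_ids}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/ask",
        data=body,
        method="POST",
        headers={
            # Same header format as the curl examples above.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Calling `urllib.request.urlopen(request)` would send it; a 401 response then surfaces as `urllib.error.HTTPError`.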

Authentication API Endpoints

Public Endpoints (No Auth Required)

  • POST /auth/register - Register new user
  • POST /auth/login - Login and get JWT token

User Endpoints (Auth Required)

  • GET /auth/me - Get current user profile
  • PUT /auth/me - Update current user profile
  • POST /auth/change-password - Change password

Admin Endpoints (Admin Role Required)

  • GET /auth/users - List all users
  • GET /auth/users/{user_id} - Get user by ID
  • PUT /auth/users/{user_id} - Update user by ID
  • DELETE /auth/users/{user_id} - Delete user
  • POST /auth/users/{user_id}/activate - Activate user
  • POST /auth/users/{user_id}/deactivate - Deactivate user

Legacy Support

For backward compatibility, a deprecated anonymous upload endpoint is available:

  • POST /upload/anonymous - Upload without authentication (⚠️ Deprecated, will be removed)

Frontend Integration

The React frontend automatically handles authentication:

  1. Login Flow: Users must login before accessing features
  2. Token Management: Automatic JWT token storage and refresh
  3. Route Protection: Authenticated routes with role-based access
  4. User Management: Admin interface for managing users

Database Schema

The authentication system uses these database tables:

-- Users table
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username VARCHAR(50) UNIQUE NOT NULL,
    email VARCHAR(100) UNIQUE NOT NULL,
    hashed_password VARCHAR(255) NOT NULL,
    full_name VARCHAR(100),
    role VARCHAR(20) DEFAULT 'user',
    is_active BOOLEAN DEFAULT true,
    is_verified BOOLEAN DEFAULT false,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
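
To see the column defaults in action, the sketch below creates this table in an in-memory SQLite database and inserts one row. It is illustrative only: the project defines the table through its ORM models, the `create_demo_db` helper is hypothetical, and the boolean defaults are written as `1`/`0` because SQLite stores booleans as integers.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username VARCHAR(50) UNIQUE NOT NULL,
    email VARCHAR(100) UNIQUE NOT NULL,
    hashed_password VARCHAR(255) NOT NULL,
    full_name VARCHAR(100),
    role VARCHAR(20) DEFAULT 'user',
    is_active BOOLEAN DEFAULT 1,
    is_verified BOOLEAN DEFAULT 0,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
"""


def create_demo_db() -> sqlite3.Connection:
    """Create the users table in memory and insert a single demo user."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # allows access by column name
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO users (username, email, hashed_password) VALUES (?, ?, ?)",
        ("alice", "alice@example.com", "<bcrypt hash goes here>"),
    )
    conn.commit()
    return conn
```

A row inserted without `role`, `is_active`, or `is_verified` picks up the defaults, and a second insert with the same username fails with `sqlite3.IntegrityError` because of the `UNIQUE` constraint.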

Security Best Practices

✅ Implemented:

  • JWT tokens with configurable expiration
  • bcrypt password hashing
  • Role-based permission system
  • Input validation and sanitization
  • Rate limiting on authentication endpoints
  • SQL injection prevention with SQLAlchemy ORM
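
The README does not say how rate limiting on the authentication endpoints is implemented. One common approach is a sliding-window counter per client, sketched below under that assumption; the class name `SlidingWindowLimiter` and its parameters are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `max_requests` per key within the last `window_seconds`."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        """Record a request for `key` and report whether it is within the limit."""
        now = self._clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A login handler could key the limiter on the client IP or username and return HTTP 429 when `allow` is false. This in-memory version only works for a single process.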

🔒 Production Recommendations:

  • Use PostgreSQL/MySQL instead of SQLite
  • Configure HTTPS/TLS encryption
  • Set up proper CORS policies
  • Implement session management
  • Add API rate limiting
  • Monitor authentication logs
  • Regular security audits

Notes:

  • OPENAI_API_KEY is not required for the current Hugging Face RAG flow.

3) Run the App (3 terminals)

Terminal A - RAG service (port 5000)

cd rag-service
uvicorn main:app --host 0.0.0.0 --port 5000 --reload

Terminal B - Node backend (port 4000)

# from the repository root (where server.js lives)
cd <your-repo-directory>
node server.js

Terminal C - Frontend (port 3000)

# navigate into the frontend subfolder from the repo root
cd frontend
npm start

Open: http://localhost:3000

Docker Deployment

For production deployment or simplified development setup, you can use Docker Compose to run all services with proper health checks and dependency management.

Quick Start with Docker

# Build and start all services
docker-compose up --build

# Or run in background
docker-compose up --build -d

# Check service status
docker-compose ps

Health Check Verification

All services include health and readiness endpoints for monitoring and debugging:

Gateway Service (Node.js)

# Basic health check
curl http://localhost:4000/healthz

# Readiness check (includes RAG service connectivity)
curl http://localhost:4000/readyz

RAG Service (FastAPI)

# Basic health check
curl http://localhost:5000/healthz

# Readiness check (includes model and component status)
curl http://localhost:5000/readyz

Docker Health Status

# View service health in Docker Compose
docker-compose ps

# View health check logs
docker-compose logs gateway
docker-compose logs rag-service

Service Health States

  • 🟒 Healthy: Service is up and responding
  • 🟑 Starting: Service is starting up (grace period)
  • πŸ”΄ Unhealthy: Service failed multiple health checks

The Docker Compose setup ensures:

  • RAG service starts and is healthy before gateway starts
  • Gateway waits for RAG service to be ready before accepting requests
  • Frontend waits for gateway to be healthy before starting
  • Automatic restart on health check failures
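
Outside Docker, the same start-up ordering can be approximated by polling a readiness endpoint before launching the next service. The helpers `http_ready` and `wait_until_ready` below are hypothetical names in a minimal standard-library sketch; the attempt count and delay are arbitrary.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx responses.
        return False


def wait_until_ready(
    check: Callable[[], bool],
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `check` up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

For example, `wait_until_ready(lambda: http_ready("http://localhost:5000/readyz"))` blocks until the RAG service reports ready or the attempts run out.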

API Endpoints

Node backend (http://localhost:4000)

  • POST /upload (multipart form-data, field: file) - 🔐 Auth Required
  • POST /ask ({ "question": "..." }) - 🔐 Auth Required
  • POST /summarize ({}) - 🔐 Auth Required

FastAPI RAG service (http://localhost:5000)

PDF Processing Endpoints (🔐 Authentication Required)

  • POST /upload (multipart form-data, field: file) - Upload and process a document; accepts .pdf, .docx, .txt, .md
  • POST /process-pdf - Process PDF from file path
  • POST /ask ({ "question": "..." }) - Ask questions about documents
  • POST /summarize ({}) - Summarize documents
  • POST /compare - Compare multiple documents
  • GET /documents - List processed documents
  • GET /similarity-matrix - Get document similarity matrix

Authentication Endpoints (Public)

  • POST /auth/register - Register new user
  • POST /auth/login - Login and get JWT token

User Management Endpoints (🔐 Authentication Required)

  • GET /auth/me - Get current user profile
  • PUT /auth/me - Update profile
  • POST /auth/change-password - Change password

Admin Endpoints (👑 Admin Role Required)

  • GET /auth/users - List all users
  • GET /auth/users/{user_id} - Get user details
  • PUT /auth/users/{user_id} - Update user
  • DELETE /auth/users/{user_id} - Delete user
  • POST /auth/users/{user_id}/activate - Activate user
  • POST /auth/users/{user_id}/deactivate - Deactivate user

Legacy Endpoints (⚠️ Deprecated)

  • POST /upload/anonymous - Upload without auth (will be removed)

Interactive API docs: http://localhost:5000/docs

Troubleshooting

General Issues

  • Cannot POST /upload from frontend

    • Restart frontend dev server after config changes: npm start
    • Ensure Node backend is running on port 4000
  • Upload fails / connection refused

    • Ensure FastAPI is running on port 5000
  • Slow first request

    • Hugging Face model downloads on first run (can take time)
  • Port already in use

    • Stop old processes or change ports consistently in frontend/backend/service

Authentication Issues

  • 401 Unauthorized errors

    • Ensure you're logged in and have a valid JWT token
    • Check that Authorization header is properly formatted: Bearer <token>
    • Token may have expired (default: 30 minutes) - login again
  • 403 Forbidden errors

    • Your user role may not have sufficient permissions
    • Contact admin to check your user role and permissions
    • Some endpoints require admin role
  • Database connection errors

    • Check DATABASE_URL in .env file
    • Ensure database file has proper permissions (SQLite)
    • For PostgreSQL/MySQL: verify connection credentials
  • JWT Secret Key warnings

    • Change SECRET_KEY in .env file from default value
    • Use a securely generated random key for production
  • User registration fails

    • Username/email may already exist
    • Password must be at least 8 characters
    • Check required fields are filled
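
When debugging 401 errors, it can help to check whether a token has simply expired. The sketch below reads the `exp` claim from the token payload without verifying the signature, so it is suitable only for local troubleshooting. It assumes the tokens carry a standard `exp` claim; the helper names are hypothetical.

```python
import base64
import json
from datetime import datetime, timezone
from typing import Optional


def token_expiry(token: str) -> datetime:
    """Read the `exp` claim from a JWT payload WITHOUT verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a three-part JWT")
    # JWT segments are base64url without padding; restore it before decoding.
    padded = parts[1] + "=" * (-len(parts[1]) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    return datetime.fromtimestamp(payload["exp"], tz=timezone.utc)


def is_expired(token: str, now: Optional[datetime] = None) -> bool:
    """Compare the token's expiry with `now` (defaults to the current UTC time)."""
    current = now if now is not None else datetime.now(timezone.utc)
    return token_expiry(token) <= current
```

If `is_expired` returns true, logging in again to obtain a fresh token is the expected fix.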

Testing Authentication

Run the test suite to verify authentication is working:

cd rag-service
pip install pytest pytest-asyncio httpx pytest-mock
pytest tests/ -v

Reset Database (Development)

If you need to reset the user database:

cd rag-service
rm -f pdf_qa_bot.db  # Remove SQLite database
# Database will be recreated on next startup

Development Notes

  • RAG index is in-memory (rebuilds after restart)
  • Summarization and QA use retrieved context from the last processed PDF
  • 🔐 Authentication: SQLite database persists users across restarts
  • JWT tokens expire after 30 minutes by default (configurable)
  • Role-based permissions are extensible for future role types
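
The in-memory index can be pictured as a store keyed by document ID, where a query only searches the documents it names. The sketch below is a toy stand-in: it scores chunks by word overlap instead of FAISS embeddings, and the `InMemoryIndex` class and its methods are hypothetical names, not the project's API.

```python
import re
from collections import Counter


def _tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; a crude substitute for embeddings."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InMemoryIndex:
    """Keeps chunks per document in a dict, so the contents vanish on restart."""

    def __init__(self) -> None:
        self._chunks: dict[str, list[str]] = {}

    def add_document(self, doc_id: str, chunks: list[str]) -> None:
        """Store (or replace) the chunks for one document."""
        self._chunks[doc_id] = list(chunks)

    def search(
        self, question: str, doc_ids: list[str], top_k: int = 3
    ) -> list[tuple[str, str]]:
        """Return up to `top_k` (doc_id, chunk) pairs from the requested documents only."""
        query = Counter(_tokens(question))
        scored = []
        for doc_id in doc_ids:
            for chunk in self._chunks.get(doc_id, []):
                overlap = sum((query & Counter(_tokens(chunk))).values())
                if overlap:
                    scored.append((overlap, doc_id, chunk))
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(doc_id, chunk) for _, doc_id, chunk in scored[:top_k]]
```

Restricting the search to the requested `doc_ids` is what keeps chunks from an earlier upload out of answers about a newer one.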

Testing

The project includes comprehensive test coverage for authentication and API endpoints.

Running Tests

cd rag-service

# Install test dependencies
pip install pytest pytest-asyncio httpx pytest-mock

# Run all tests
pytest tests/ -v

# Run specific test categories
pytest tests/test_auth_endpoints.py -v  # Authentication tests
pytest tests/test_security.py -v       # Security utilities tests
pytest tests/test_middleware.py -v     # Middleware tests
pytest tests/test_models.py -v         # Database model tests
pytest tests/test_protected_endpoints.py -v  # API protection tests

# Run with coverage
pip install pytest-cov
pytest tests/ --cov=. --cov-report=html

Test Categories

  • Authentication Tests: User registration, login, token validation
  • Authorization Tests: Role-based access control, permissions
  • Security Tests: Password hashing, JWT token management
  • API Protection Tests: Endpoint security, rate limiting
  • Database Tests: User models, relationships, queries
  • Integration Tests: End-to-end authentication flows

Test Configuration

Tests use a separate SQLite database (test_pdf_qa_bot.db) that's automatically created and cleaned up. No configuration needed for basic testing.

Security Considerations

For Development

  • Default SECRET_KEY is acceptable for local development
  • SQLite database is fine for testing and development
  • CORS is configured permissively for frontend integration

For Production

⚠️ IMPORTANT: Before deploying to production:

  1. Change SECRET_KEY: Generate a secure random key
  2. Use Production Database: PostgreSQL or MySQL recommended
  3. Configure CORS: Restrict allowed origins to your domain
  4. Enable HTTPS: Use TLS encryption for all communications
  5. Environment Variables: Use proper secrets management
  6. Update Dependencies: Keep all packages up to date
  7. Rate Limiting: Configure appropriate rate limits
  8. Monitoring: Set up authentication and error logging

Extensibility

The authentication system is designed for future extension:

  • Additional Roles: Easy to add MODERATOR, PREMIUM_USER, etc.
  • Custom Permissions: Granular permission system expandable
  • OAuth Integration: Architecture supports OAuth2/OIDC providers
  • Multi-tenant: Database schema ready for organization support
  • Audit Logging: Framework in place for comprehensive audit trails

Advanced Issues

See ADVANCED_ISSUES.md for critical security, performance, and architecture issues that need attention before production deployment.

Contributing

Refer to CONTRIBUTING.md for detailed instructions on creating a branch, naming conventions, committing changes, and submitting pull requests.
