The Common Knowledge Base provides multiple API interfaces for different purposes:
- External API (Ruuter): Public endpoints with authentication
- Internal API (Ruuter Internal): Service-to-service communication
- Microservice APIs: Direct service endpoints for specialized operations
- Base URL: http://localhost:8080/ckb
- Authentication: JWT Bearer token required
- Documentation: available at /docs (OpenAPI/Swagger)
Each microservice exposes its own API with auto-generated documentation:
| Service | Port | Docs URL | Purpose |
|---|---|---|---|
| File Processing | 8888 | http://localhost:8888/docs | File operations and storage |
| Scrapper | 8000 | http://localhost:8000/docs | Web scraping tasks |
| Cleaning | 8001 | http://localhost:8001/docs | Content cleaning |
| Scheduler | 8003 | http://localhost:8003/docs | Task scheduling |
| Data Export | 8002 | http://localhost:8002/docs | Data export operations |
| Search | 3000 | - | Content search and indexing |
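Assuming the default local ports from the table above, a quick reachability check over the documented docs endpoints can be scripted:

```shell
# Probe each FastAPI service's OpenAPI docs endpoint. Ports are the
# defaults from the table above; the Search service (3000) exposes
# no docs URL and is skipped.
for port in 8888 8000 8001 8003 8002; do
  url="http://localhost:$port/docs"
  # -m 2: short timeout; curl prints 000 when the port is unreachable
  code=$(curl -s -o /dev/null -m 2 -w "%{http_code}" "$url")
  echo "$url -> $code"
done
```

A `000` status means the port is unreachable; `200` means the docs page is up.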
```http
POST /ckb/auth/login
Content-Type: application/json

{
  "username": "string",
  "password": "string"
}
```

Response:

```json
{
  "access_token": "eyJ...",
  "token_type": "bearer",
  "expires_in": 3600
}
```

Include the token in the Authorization header of subsequent requests:

```http
Authorization: Bearer eyJ...
```

To retrieve information about the authenticated user:

```http
GET /ckb/auth/jwt/userinfo
Authorization: Bearer eyJ...
```

List all agencies.
Response:

```json
[
  {
    "id": "uuid",
    "base_id": "uuid",
    "name": "Agency Name",
    "sector": "Healthcare",
    "external_id": "ext_123",
    "created_at": "2025-01-01T00:00:00Z",
    "updated_at": "2025-01-01T00:00:00Z",
    "type": "client",
    "data_hash": "sha1_hash"
  }
]
```

Get specific agency details.
Query Parameters:
- base_id (UUID): Agency base ID
Create new agency.
Request Body:

```json
{
  "name": "string",
  "sector": "string",
  "external_id": "string"
}
```

Update agency information.
Request Body:

```json
{
  "base_id": "uuid",
  "name": "string",
  "sector": "string"
}
```

Delete agency.
Request Body:

```json
{
  "base_id": "uuid"
}
```

List all sources.
Query Parameters:
- agency_base_id (UUID, optional): Filter by agency
Response:

```json
[
  {
    "id": "uuid",
    "base_id": "uuid",
    "agency_base_id": "uuid",
    "url": "https://example.com",
    "subsector": "Public Health",
    "type": "url_to_scrape",
    "status": "running",
    "update_automatically": true,
    "cron_schedule": "0 */6 * * *",
    "last_scraped_at": "2025-01-01T12:00:00Z",
    "next_scrapping_at": "2025-01-01T18:00:00Z"
  }
]
```

List API sources specifically.
Get specific source details.
Query Parameters:
- base_id (UUID): Source base ID
Create new data source.
Request Body:

```json
{
  "agency_base_id": "uuid",
  "url": "https://example.com",
  "subsector": "string",
  "type": "url_to_scrape",
  "update_automatically": true,
  "cron_schedule": "0 */6 * * *"
}
```

Update scraping frequency.
Request Body:

```json
{
  "base_id": "uuid",
  "cron_schedule": "0 */12 * * *"
}
```

Update source subsector.
Request Body:

```json
{
  "base_id": "uuid",
  "subsector": "New Subsector"
}
```

Trigger source scraping.
Request Body:

```json
{
  "base_id": "uuid"
}
```

Stop source processing.
Request Body:

```json
{
  "base_id": "uuid"
}
```

Delete source.
Request Body:

```json
{
  "base_id": "uuid"
}
```

List source files.
Query Parameters:
- agency_base_id (UUID, optional): Filter by agency
- source_base_id (UUID, optional): Filter by source
- type (string, optional): Filter by file type
Response:

```json
[
  {
    "id": "uuid",
    "base_id": "uuid",
    "source_base_id": "uuid",
    "agency_base_id": "uuid",
    "url": "https://example.com/page",
    "page_title": "Document Title",
    "file_name": "document.pdf",
    "type": "scraped_file",
    "status": "finished",
    "subsector": "Healthcare",
    "is_excluded": false,
    "original_data_url": "s3://bucket/raw/file",
    "cleaned_data_url": "s3://bucket/cleaned/file",
    "created_at": "2025-01-01T00:00:00Z",
    "last_scraped_at": "2025-01-01T12:00:00Z"
  }
]
```

Add uploaded files to a source.
Request Headers:
- Cookie: Contains a JWT with user information for tracking the uploader
Request Body:

```json
{
  "agencyId": "uuid",
  "sourceId": "uuid",
  "files": [
    {
      "base_id": "uuid",
      "file_name": "document.pdf",
      "original_data_url": "s3://bucket/uploads/file",
      "subsector": "Legal",
      "file_size": 13264
    }
  ]
}
```

Response:

```json
[
  {
    "id": "uuid",
    "url": null,
    "hash": "",
    "original_data_url": "s3://bucket/uploads/file",
    "path": "s3://bucket/uploads/file"
  }
]
```

Note: The uploaded_by field is automatically populated from the JWT cookie (the user's idCode).
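When scripting this call, the request body can be assembled safely with jq rather than string interpolation; a sketch (the IDs and file values are placeholders):

```shell
# Build the add-uploaded-files payload with jq; the field names
# mirror the request body documented above, the values are fake.
AGENCY_ID="11111111-1111-1111-1111-111111111111"
SOURCE_ID="22222222-2222-2222-2222-222222222222"
PAYLOAD=$(jq -n \
  --arg agency "$AGENCY_ID" \
  --arg source "$SOURCE_ID" \
  '{
    agencyId: $agency,
    sourceId: $source,
    files: [{
      file_name: "document.pdf",
      original_data_url: "s3://bucket/uploads/file",
      subsector: "Legal",
      file_size: 13264
    }]
  }')
echo "$PAYLOAD"
# Then: curl -X POST http://localhost:8080/ckb/source-file/add-uploaded-files \
#   -H "Content-Type: application/json" -d "$PAYLOAD"
```

Using `jq -n --arg` keeps quoting correct even when IDs or file names contain special characters.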
Get presigned upload URLs.
Request Body:

```json
{
  "source_base_id": "uuid",
  "file_names": ["file1.pdf", "file2.docx"]
}
```

Response:

```json
{
  "upload_urls": [
    {
      "file_name": "file1.pdf",
      "upload_url": "https://s3.amazonaws.com/presigned-url",
      "expires_at": "2025-01-01T01:00:00Z"
    }
  ]
}
```

Exclude files from processing.
Request Body:

```json
{
  "base_ids": ["uuid1", "uuid2"]
}
```

Edit file metadata.
Request Body:

```json
{
  "base_id": "uuid",
  "page_title": "New Title",
  "subsector": "New Subsector"
}
```

Trigger file reprocessing.
Request Body:

```json
{
  "base_ids": ["uuid1", "uuid2"]
}
```

Delete files.
Request Body:

```json
{
  "base_ids": ["uuid1", "uuid2"]
}
```

List processing reports.
Query Parameters:
- agency_base_id (UUID, optional): Filter by agency
- source_base_id (UUID, optional): Filter by source
Response:

```json
[
  {
    "id": "uuid",
    "base_id": "uuid",
    "agency_base_id": "uuid",
    "source_base_id": "uuid",
    "agency_name": "Agency Name",
    "url": "https://example.com",
    "scraping_started_at": "2025-01-01T12:00:00Z",
    "scraping_finished_at": "2025-01-01T12:30:00Z",
    "errors": 2,
    "scraping_log_url": "s3://bucket/logs/scraping.log",
    "cleaning_log_url": "s3://bucket/logs/cleaning.log"
  }
]
```

Get detailed processing logs.
Delete processing reports.
Request Body:

```json
{
  "base_ids": ["uuid1", "uuid2"]
}
```

Generate download URL for a file.

Query Parameters:

- blob_storage_path (string): Path to file in blob storage

Response:

```json
{
  "download_url": "https://s3.amazonaws.com/presigned-url",
  "expires_at": "2025-01-01T01:00:00Z"
}
```

OpenAPI Docs: http://localhost:8888/docs
- POST /upload-urls - Generate presigned upload URLs
- POST /upload - Async file upload with task tracking
- POST /upload-sync - Synchronous file upload
- POST /upload-file-content - Direct content upload
- GET /upload/{task_id} - Upload task status
- POST /download-urls - Generate download URLs
- POST /download-files-to-volume - Download to local storage
- POST /download-files-to-volume-async - Async download
- GET /download-task/{task_id} - Download task status
- POST /delete-files - Delete from blob storage
- POST /delete-files-async - Async deletion
- POST /move-files - Move files in storage
- POST /move-files-async - Async file move
- POST /zip-and-upload-folders - Create archives
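Several of these endpoints are asynchronous and return a task id that can be polled. A polling sketch, assuming the task-status response carries a `status` field with a `running` value (an assumption; verify the actual schema at http://localhost:8888/docs):

```shell
# Poll an async upload task until it leaves the "running" state.
# The "status" field name and its values are assumptions; check the
# task-status schema at http://localhost:8888/docs.
poll_task() {
  i=0
  while [ "$i" -lt 5 ]; do
    STATUS=$(curl -s -m 2 "http://localhost:8888/upload/$1" 2>/dev/null \
      | jq -r '.status' 2>/dev/null)
    if [ -n "$STATUS" ] && [ "$STATUS" != "running" ]; then
      echo "$STATUS"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "timed out waiting for task $1" >&2
  return 1
}

# Usage: poll_task "<task_id returned by POST /upload>"
```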
OpenAPI Docs: http://localhost:8000/docs
- POST /specified-pages-scrapper-task - Scrape specific URLs
- POST /entire-source-scrapper-task - Comprehensive site scraping
- POST /sitemap-collect-scrapper-task - Sitemap-based scraping
- POST /eesti-scrapper-task - Estonian government sites
- POST /specified-api-files-scrapper-task - Process API files
- POST /uploaded-file - Process uploaded files
- POST /generate-edited-metadata - Generate file metadata
OpenAPI Docs: http://localhost:8001/docs
- POST /clean_file - Clean and extract text from files
OpenAPI Docs: http://localhost:8003/docs
- POST /render_next_run - Calculate next execution time
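A sketch of calling the scheduler directly. The request body shape here is an assumption (check http://localhost:8003/docs for the real schema); the cron expression `0 */6 * * *` (minute 0 of every 6th hour) matches the cron_schedule values used elsewhere in this document:

```shell
# Assumed request body; verify the actual schema at
# http://localhost:8003/docs before relying on this.
BODY='{"cron_schedule": "0 */6 * * *"}'
curl -s -m 2 -X POST http://localhost:8003/render_next_run \
  -H "Content-Type: application/json" \
  -d "$BODY" || echo "scheduler not reachable"
```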
OpenAPI Docs: http://localhost:8002/docs
- GET /exports - List available export tasks
- POST /exports - Trigger all exports
- GET /exports/{task_name} - Get specific export task
- POST /exports/{task_name} - Trigger specific export
- POST /index/{sourceId}/bulk - Bulk index documents
- GET /search/{sourceId} - Search within source
- GET /document/{sourceId}/{documentId} - Get full document
- DELETE /index/{sourceId} - Delete source index
- DELETE /documents/{sourceId}/{sourceFileId} - Delete documents
- GET /stats/{sourceId} - Get source statistics
- GET /health - Service health check
Error response format:

```json
{
  "error": "Error description",
  "code": "ERROR_CODE",
  "details": "Additional error details"
}
```

- 200: Success
- 400: Bad Request - Invalid parameters
- 401: Unauthorized - Missing or invalid authentication
- 403: Forbidden - Insufficient permissions
- 404: Not Found - Resource does not exist
- 500: Internal Server Error - Server processing error
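When scripting against the API, the status code can be captured and branched on. A minimal sketch (the endpoint path used here is a hypothetical example, not confirmed by this document):

```shell
# Capture the response body to a file and the HTTP status to a
# variable; /ckb/agency/all is a hypothetical endpoint path.
STATUS=$(curl -s -o /tmp/resp.json -m 2 -w "%{http_code}" \
  "http://localhost:8080/ckb/agency/all" 2>/dev/null)
case "$STATUS" in
  200)    echo "success" ;;
  401)    echo "unauthorized: obtain a fresh bearer token" ;;
  000|"") echo "connection failed: is the stack running?" ;;
  *)      echo "unexpected status: $STATUS" ;;
esac
```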
Agency:

```json
{
  "id": "uuid",
  "base_id": "uuid",
  "name": "string",
  "sector": "string",
  "external_id": "string",
  "type": "client | api",
  "zip_dirty": "boolean",
  "is_zipping": "boolean",
  "data_hash": "string",
  "created_at": "timestamp",
  "updated_at": "timestamp"
}
```

Source:

```json
{
  "id": "uuid",
  "base_id": "uuid",
  "agency_base_id": "uuid",
  "url": "string",
  "subsector": "string",
  "type": "url_to_scrape | file | api",
  "status": "new | running | finished | failed",
  "update_automatically": "boolean",
  "cron_schedule": "string",
  "last_scraped_at": "timestamp",
  "next_scrapping_at": "timestamp",
  "is_stopping": "boolean",
  "created_at": "timestamp",
  "updated_at": "timestamp"
}
```

Source file:

```json
{
  "id": "uuid",
  "base_id": "uuid",
  "source_base_id": "uuid",
  "agency_base_id": "uuid",
  "url": "string",
  "page_title": "string",
  "file_name": "string",
  "type": "scraped_file | uploaded_file | api_file",
  "status": "scraping | cleaning | finished | not_found | failed",
  "subsector": "string",
  "is_excluded": "boolean",
  "original_data_url": "string",
  "cleaned_data_url": "string",
  "edited_data_url": "string",
  "original_metadata_url": "string",
  "cleaned_metadata_url": "string",
  "edited_metadata_url": "string",
  "original_data_hash": "string",
  "external_id": "string",
  "created_at": "timestamp",
  "updated_at": "timestamp",
  "last_scraped_at": "timestamp",
  "originally_scraped": "timestamp"
}
```

Report:

```json
{
  "id": "uuid",
  "base_id": "uuid",
  "agency_base_id": "uuid",
  "source_base_id": "uuid",
  "agency_name": "string",
  "url": "string",
  "scraping_started_at": "timestamp",
  "scraping_finished_at": "timestamp",
  "errors": "integer",
  "scraping_log_url": "string",
  "cleaning_log_url": "string",
  "created_at": "timestamp",
  "updated_at": "timestamp"
}
```

Data collection workflow:

```bash
# 1. Login and get token
TOKEN=$(curl -X POST http://localhost:8080/ckb/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username": "admin", "password": "password"}' \
  | jq -r '.access_token')

# 2. Create agency
AGENCY_RESPONSE=$(curl -X POST http://localhost:8080/ckb/agency/add \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "Test Agency", "sector": "Healthcare"}')
AGENCY_ID=$(echo "$AGENCY_RESPONSE" | jq -r '.base_id')

# 3. Add data source
SOURCE_RESPONSE=$(curl -X POST http://localhost:8080/ckb/source/add \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"agency_base_id\": \"$AGENCY_ID\", \"url\": \"https://example.com\", \"type\": \"url_to_scrape\"}")
SOURCE_ID=$(echo "$SOURCE_RESPONSE" | jq -r '.base_id')

# 4. Trigger scraping
curl -X POST http://localhost:8080/ckb/source/refresh \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"base_id\": \"$SOURCE_ID\"}"

# 5. Monitor progress
curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:8080/ckb/reports/all?source_base_id=$SOURCE_ID"

# 6. List collected files
curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:8080/ckb/source-file/all?source_base_id=$SOURCE_ID"

# 7. Search content (if search service is running)
curl "http://localhost:3000/search/$SOURCE_ID?q=healthcare&page=1&size=10"
```

File upload workflow:

```bash
# 1. Get upload URLs
UPLOAD_RESPONSE=$(curl -X POST http://localhost:8080/ckb/source-file/get-upload-urls \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"source_base_id\": \"$SOURCE_ID\", \"file_names\": [\"document.pdf\"]}")
UPLOAD_URL=$(echo "$UPLOAD_RESPONSE" | jq -r '.upload_urls[0].upload_url')

# 2. Upload file to S3
curl -X PUT "$UPLOAD_URL" \
  -H "Content-Type: application/pdf" \
  --data-binary @document.pdf

# 3. Add file to source
curl -X POST http://localhost:8080/ckb/source-file/add-uploaded-files \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"source_base_id\": \"$SOURCE_ID\", \"files\": [{\"file_name\": \"document.pdf\", \"original_data_url\": \"uploads/path/document.pdf\"}]}"
```

- Authentication: 100 requests/minute per user
- Data operations: 1000 requests/hour per user
- File uploads: 50 uploads/hour per user
- Service calls: No rate limiting (internal network)
- Resource limits: Based on infrastructure capacity
- External API: Versioned via URL path (/v1/, /v2/)
- Internal APIs: Backward compatibility maintained
- Database Schema: Liquibase migration versioning
- Service APIs: Semantic versioning in OpenAPI specs
- External API: New version for breaking changes
- Internal APIs: Gradual migration approach
- Database: Liquibase rollback support
- Client Libraries: Version compatibility matrix
- JWT Tokens: RS256 signature algorithm
- Token Expiry: 1 hour default, refresh tokens available
- Scopes: Role-based access control
- User Roles: Admin, operator, viewer
- Resource Access: Agency-based data isolation
- API Permissions: Endpoint-level access control
- HTTPS Only: All external communications encrypted
- Parameter Validation: Input sanitization and validation
- SQL Injection Prevention: Parameterized queries via Resql
- File Security: Presigned URLs with expiration
Each FastAPI service automatically generates interactive API documentation:
```bash
# File Processing API
open http://localhost:8888/docs

# Scrapper API
open http://localhost:8000/docs

# Cleaning API
open http://localhost:8001/docs

# Scheduler API
open http://localhost:8003/docs

# Data Export API
open http://localhost:8002/docs
```

- Interactive Testing: Test endpoints directly in browser
- Request/Response Examples: Complete API examples
- Schema Validation: Parameter and response validation
- Authentication Testing: JWT token integration
- External → Internal: User operations trigger internal workflows
- Internal → Services: Pipeline coordination via service APIs
- Service → Service: Direct API calls for processing workflows
- Database: All data access via Resql query engine
- Graceful Degradation: Continue processing when non-critical services fail
- Retry Logic: Automatic retry for transient failures
- Circuit Breakers: Prevent cascade failures
- Error Propagation: Proper error context preservation
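The retry behaviour described above can also be applied from the client side; this is a sketch of a generic retry-with-backoff helper around curl, not the services' internal implementation (attempt counts and delays are illustrative):

```shell
# Generic retry-with-backoff helper; the attempt count and delays
# are illustrative, not values used by the services themselves.
retry() {
  attempts=0
  max=3
  delay=1
  until "$@"; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge "$max" ]; then
      echo "giving up after $attempts attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
  done
}

# Example: retry the search service's health check.
retry curl -s -f -m 2 http://localhost:3000/health || echo "health check failed"
```

Doubling the delay between attempts smooths out transient failures without hammering a struggling service.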