Production-Grade Multi-Tenant FastAPI RAG Engine

The Engineering Challenge: Build a production-grade, multi-tenant RAG system with full user isolation, secure authentication, and a highly concurrent pipeline, all without relying on heavy abstraction layers like LangChain. The goal was to show that first-principles engineering with raw async Python, strict data contracts, and real database testing yields a faster, more maintainable, and fully verifiable AI backend.

Core Architecture:

Authentication & Multi-Tenancy: OAuth2 Password Bearer with JWT access tokens and bcrypt password hashing. OTP-based email verification via async SMTP (aiosmtplib) for registration and password reset. Every document and vector is scoped to a user_id, so queries are isolated at the database level.
Document Ingestion: SHA-256 file fingerprinting for deduplication, non-blocking PyMuPDF parsing offloaded via asyncio.to_thread, zero-dependency O(N) sliding-window chunking that snaps to natural punctuation boundaries, and bulk insert via asyncpg.executemany over the PostgreSQL binary protocol.
Retrieval & Generation: HNSW vector search with cosine similarity via pgvector (<=> operator), grounded system prompts constraining the LLM to provided context, and token-by-token streaming via StreamingResponse.
Observability: structlog JSON logging with rotating file handler and optional Better Stack cloud aggregation.
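
As an illustration of the ingestion steps above, here is a minimal sketch of SHA-256 fingerprinting and a sliding-window chunker that snaps to punctuation. It is not the project's actual code: the function names, the default window size and overlap, and the rule of only snapping within the second half of the window are all assumptions made for this example.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw upload bytes."""
    return hashlib.sha256(data).hexdigest()


def chunk_text(
    text: str,
    size: int = 500,
    overlap: int = 50,
    boundaries: str = ".!?\n",
) -> list[str]:
    """Split text into windows of at most `size` characters.

    Each window is cut back to the last boundary character found in its
    second half, if any. Requiring overlap < size // 2 guarantees forward
    progress on every step, so the scan stays roughly linear in len(text).
    """
    if size < 2 or not (0 <= overlap < size // 2):
        raise ValueError("need size >= 2 and 0 <= overlap < size // 2")

    chunks: list[str] = []
    n = len(text)
    start = 0
    while start < n:
        end = min(start + size, n)
        if end < n:
            # Look for the right-most boundary in the second half of the window.
            floor = start + size // 2
            cut = max(text.rfind(b, floor, end) for b in boundaries)
            if cut != -1:
                end = cut + 1  # keep the punctuation mark with its sentence
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        if end >= n:
            break
        start = end - overlap
    return chunks
```

In a deduplication flow of this kind, the digest would typically be looked up (per user) before any parsing or embedding work starts, so a repeated upload is rejected after a single hash pass over the bytes.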

Key Technical Implementations:

Secure Authentication & Multi-Tenancy: Complete auth system with JWT access tokens (PyJWT), bcrypt password hashing, and OTP-based email verification for both registration and password reset flows, all fully async.
Zero-Blocking Document Ingestion: CPU-bound PyMuPDF C-extension execution offloaded to background thread pools (asyncio.to_thread), safeguarding the event loop during massive file uploads.
Pure Python Semantic Chunking: Custom O(N) sliding-window text chunker that snaps to natural punctuation boundaries, preserving semantic meaning without third-party ML orchestration libraries.
Cryptographic Deduplication: SHA-256 fingerprinting on incoming document byte streams instantly rejects duplicate files, saving GPU embedding compute and preventing vector database bloat.
Lightning-Fast Vector Search & Streaming: HNSW indexing and cosine distance (<=>) in Postgres for millisecond retrieval. Retrieved chunks are compiled into a grounded system prompt, and the LLM response is streamed token-by-token directly to the client.
Enterprise Observability: structlog JSON logging with rotating file handler, asynchronously piped into Better Stack for real-time cloud log aggregation.
Exhaustive Test Suite: 468 tests across unit, integration, and API route layers achieving 100% line coverage. Integration tests run against real ephemeral pgvector containers in CI, with no mocked SQL.
CI/CD Pipeline: GitHub Actions enforcing Ruff lint + format gates before running the full test suite against a real database.
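
The retrieval step described above can be sketched as a tenant-scoped pgvector query plus a grounded system prompt. This is an illustrative sketch only: the table name chunks, its columns, the index name, and the prompt wording are hypothetical and not taken from the project. The SQL uses asyncpg-style positional parameters; how the embedding parameter is bound depends on the driver setup (for example, a registered vector codec), which is left out here.

```python
# Hypothetical schema: chunks(user_id, content, embedding vector(N)).

# HNSW index using pgvector's cosine operator class.
CREATE_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine distance operator (smaller means more similar).
# The user_id filter keeps every search inside a single tenant's rows.
SEARCH_SQL = """
SELECT content, embedding <=> $1 AS distance
FROM chunks
WHERE user_id = $2
ORDER BY embedding <=> $1
LIMIT $3;
"""


def build_grounded_prompt(chunks: list[str]) -> str:
    """Compile retrieved chunks into a system prompt that restricts the
    model to the supplied context (wording is an assumption)."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}"
    )
```

Numbering the chunks in the prompt is a design choice of this sketch; it makes it easier to ask the model to cite which passage an answer came from.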

What This Solved:

Framework Bloat: Eliminated the overhead, latency, and unpredictability of high-level wrappers (LangChain/LlamaIndex) to gain total, deterministic control over the execution flow.
AI Infrastructure Latency: Reduced perceived user latency from ~15 seconds to sub-second levels via asynchronous HTTP streaming and database connection pooling.
Security & Isolation: Full ownership of auth flows (no vendor lock-in), with per-user data isolation ensuring no cross-tenant data leakage.
Confidence in Production: 100% line coverage combined with real database integration tests catches entire classes of bugs that mocked tests would miss.

Tech Stack: Python 3.12+, FastAPI, Uvicorn, PostgreSQL 16 (Neon), pgvector, asyncpg, Ollama, httpx, PyJWT, bcrypt, aiosmtplib, PyMuPDF, Pydantic v2, pydantic-settings, Docker, GitHub Actions, pytest, pytest-asyncio, pytest-cov, structlog, Better Stack, uv.

System Architecture:

Figure: System architecture of the Production-Grade Multi-Tenant FastAPI RAG Engine.

Md Samshad Rahman
