AI Document Intelligence Platform
Built a production-grade RAG system to process large PDF datasets, extract structured insights, and generate strategic summaries using FastAPI, LangChain, and pgvector.
Year
2024
Category
AI / RAG
Stack
FastAPI, LangChain, pgvector

Key Results
96%
Query Accuracy
on the benchmark set
8 min
Report Time
down from 3–4 hrs
500+
Daily Queries
at sub-2s latency
3×
Team Expansion
within 2 months
01
The Challenge
A fast-growing edtech company had over 40,000 PDF documents spread across shared drives with zero searchability. Analysts were spending 3–4 hours per report manually reading and extracting data. Leadership needed a system that could handle high document volumes and return structured, citable answers — without hallucinating.
02
What We Built
We architected a full RAG pipeline using FastAPI for the backend API layer, LangChain for orchestration, and pgvector on PostgreSQL for vector storage. Documents are chunked, embedded, and indexed on upload. At query time, the system retrieves the top-k relevant chunks and passes them to the LLM with a structured prompt that enforces source citation and confidence scoring.
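The citation-and-confidence enforcement described above can be sketched as a prompt-assembly step. This is a minimal illustration, not the production prompt; the function name, chunk fields, and wording are assumptions.

```python
# Sketch: assemble retrieved chunks into a prompt that forces the model
# to cite its sources and report a confidence score. Field names
# (doc_title, page, text) are illustrative, not the production schema.

def build_cited_prompt(question: str, chunks: list[dict]) -> str:
    """Format top-k chunks with stable source tags the LLM must cite."""
    context_lines = []
    for i, chunk in enumerate(chunks, start=1):
        # Each chunk carries its document title and page so answers are auditable.
        context_lines.append(
            f"[S{i}] ({chunk['doc_title']}, p.{chunk['page']}) {chunk['text']}"
        )
    context = "\n".join(context_lines)
    return (
        "Answer using ONLY the sources below. Cite every claim as [S<n>].\n"
        "If the sources do not contain the answer, say so.\n"
        "End with a line 'Confidence: <low|medium|high>'.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Because the prompt carries stable `[S<n>]` tags, the frontend can map citations in the answer back to specific chunks.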
03
Results
Report generation time dropped from 3–4 hours to under 8 minutes. The system now handles 500+ document queries per day with sub-2s average response times. The client expanded the platform to three additional departments within two months, eliminating an estimated 1,200 analyst hours per quarter.
04
Before & After
Before
Analysts spent 3–4 hours manually reading PDFs to produce a single report
40,000+ documents were unsearchable across shared drives
Answers required senior analysts — a bottleneck on every project
No audit trail — impossible to verify where answers came from
After
Reports generated in under 8 minutes with cited, structured output
Every document instantly queryable with sub-2s response times
Any team member can query the system independently, 24/7
Every answer includes source citations with page-level references
05
How We Built It
We broke the build into four sequential phases, each with a clear deliverable before moving to the next.
Data audit & pipeline design
Mapped all 40k documents, assessed quality, and designed the chunking and embedding strategy. Validated pgvector latency targets before committing to the stack.
Ingestion pipeline
Built the S3 upload flow, PyMuPDF text extraction, 512-token chunking with overlap, and the Celery worker that manages embedding jobs asynchronously.
Query API & LLM integration
Implemented the FastAPI query endpoints, the LangChain LCEL pipeline, similarity search via PostgreSQL functions, and a structured prompt with citation enforcement.
Flutter frontend & streaming UI
Built the Flutter Web interface with SSE token streaming, inline citation chips, document upload flow, and query history with exportable report generation.
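The 512-token chunking with 64-token overlap from the ingestion phase can be sketched as a sliding window. Whitespace tokens stand in for a real tokenizer (the production pipeline would use something like tiktoken); the function is an illustration, not the client code.

```python
# Sketch of the 512-token / 64-token-overlap chunking step.
# Whitespace "tokens" stand in for the real tokenizer (e.g. tiktoken).

def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Slide a window of `size` tokens forward by `size - overlap` each step."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window reached the end; avoid tiny duplicate tails
    return chunks
```

Each chunk shares its first 64 tokens with the tail of the previous one, so sentences that straddle a boundary appear whole in at least one chunk.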
06
System Architecture
The system splits into two flows: an async ingestion pipeline and a synchronous query pipeline. Ingestion moves through S3, a Celery worker, and pgvector. Queries run through synchronous FastAPI endpoints that hit pgvector, assemble context, and call OpenAI.
Ingestion Layer
S3 + Celery + PyMuPDF
PDFs land in S3. A Celery worker extracts text with PyMuPDF, chunks into 512-token segments with 64-token overlap, and dispatches embedding jobs.
Embedding Service
OpenAI text-embedding-3-small
Each chunk is embedded via the OpenAI embeddings API. Embeddings are 1536-dimensional vectors stored with chunk metadata in pgvector.
Vector Store
PostgreSQL + pgvector
HNSW index on the embedding column enables approximate nearest-neighbour search at sub-50ms latency across 2M+ vectors.
Query API
FastAPI + LangChain LCEL
Receives a user query, embeds it, runs similarity search, assembles top-k chunks into a prompt, and streams the LLM response back to the client.
LLM Layer
OpenAI GPT-4o
Receives a structured prompt with retrieved context, a system instruction enforcing citation, and a JSON output schema for structured responses.
Frontend
Flutter Web
Streams the response token-by-token using SSE. Renders citations as inline source chips that expand to show the original document chunk.
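The embedding service above can be sketched with the official `openai` Python SDK (v1.x interface). The batching helper is pure; the batch size and function names are assumptions for illustration, not the production code.

```python
# Sketch of the embedding step: batch chunk texts, then call the OpenAI
# embeddings API. Batch size of 100 is an illustrative assumption.

def batch(items: list[str], size: int = 100) -> list[list[str]]:
    """Split chunk texts into API-sized groups (one request per group)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_chunks(texts: list[str]) -> list[list[float]]:
    """Embed chunk texts with text-embedding-3-small (1536-dim vectors)."""
    from openai import OpenAI  # official SDK, v1.x interface
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    vectors: list[list[float]] = []
    for group in batch(texts):
        resp = client.embeddings.create(
            model="text-embedding-3-small", input=group
        )
        vectors.extend(d.embedding for d in resp.data)
    return vectors
```

In the real pipeline this runs inside the Celery worker, so a failed batch can be retried without re-extracting the document.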
07
Tech Stack
Backend
FastAPI
AI / ML
LangChain, OpenAI GPT-4o, text-embedding-3-small
Database
PostgreSQL + pgvector
Infrastructure
S3, Celery
Frontend
Flutter Web
08
How We Approached the Problem
Before writing code we mapped the full data lifecycle: parse, chunk, embed, store, retrieve. Each stage had its own failure modes. We ran a two-day spike to validate pgvector latency targets before committing — it passed, sparing us the infrastructure cost and operational complexity of a dedicated vector database.
Alternatives considered & rejected
Pinecone as vector store
Added an external dependency and egress cost. pgvector on the existing Postgres instance met p95 latency targets at a fraction of the price.
LlamaIndex instead of LangChain
LlamaIndex had better document loaders but weaker chain composition. LangChain LCEL made it easier to build testable, swappable pipeline steps.
Async queue for every query
Queries needed to feel synchronous in the UI. We reserved the queue only for ingestion, keeping query paths as fast synchronous endpoints.
09
Data Modelling
Two core tables: documents (metadata and processing status) and document_chunks (text segments with vector embeddings). Keeping them separate means that if the embedding model changes, we can re-embed a document's chunks without touching its metadata.
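Illustrative DDL for the two-table split, held as SQL strings. Column names and the org-scoping column are assumptions, not the client schema; the vector column requires the pgvector extension.

```python
# Hypothetical schema sketch for the documents / document_chunks split.
DOCUMENTS_DDL = """
CREATE TABLE documents (
    id          BIGSERIAL PRIMARY KEY,
    org_id      BIGINT NOT NULL,          -- tenant scoping for org-level filtering
    title       TEXT NOT NULL,
    s3_key      TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending',  -- pending | processing | ready | failed
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);
"""

CHUNKS_DDL = """
CREATE TABLE document_chunks (
    id          BIGSERIAL PRIMARY KEY,
    document_id BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    page        INT,
    content     TEXT NOT NULL,
    embedding   vector(1536)  -- text-embedding-3-small dimensionality
);
-- HNSW index for approximate nearest-neighbour search (cosine distance)
CREATE INDEX ON document_chunks USING hnsw (embedding vector_cosine_ops);
"""
```

Dropping and rebuilding document_chunks (or its embedding column) re-embeds content without touching the documents row.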
10
API Layer
FastAPI handles document upload (async) and query (synchronous streaming). LangChain LCEL composes the query pipeline as typed, testable steps — embed → retrieve → prompt → stream.
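The embed → retrieve → prompt → stream shape can be sketched without dependencies. In production these are LCEL steps composed with the `|` operator; the stand-in functions below only illustrate why each stage stays independently testable and swappable.

```python
# Dependency-free sketch of the query pipeline's shape. Each stage is a
# plain function stand-in for an LCEL step; all bodies are illustrative.

def embed_query(query: str) -> list[float]:
    # Stand-in embedder; production calls text-embedding-3-small.
    return [float(len(query))]

def retrieve(vector: list[float], k: int = 4) -> list[str]:
    # Stand-in retriever; production runs pgvector similarity search.
    return [f"chunk-{i}" for i in range(k)]

def build_prompt(query: str, chunks: list[str]) -> str:
    return f"Context: {'; '.join(chunks)}\nQuestion: {query}"

def answer(query: str) -> str:
    # embed -> retrieve -> prompt; the streaming LLM call is elided here.
    return build_prompt(query, retrieve(embed_query(query)))
```

Because each step has a plain input/output type, any one of them can be swapped (a different embedder, a reranking retriever) without touching the rest of the chain.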
11
Database Functions
Similarity search and chunk retrieval moved into PostgreSQL functions rather than application code. This keeps heavy lifting close to the data and makes org-level filtering easy without touching application logic.
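A similarity-search function of the kind described above might look like the following. The function signature, similarity formula, and org filter are assumptions layered on a hypothetical schema, not the deployed code.

```python
# Illustrative pgvector search function; <=> is pgvector's cosine-distance
# operator, so 1 - distance gives cosine similarity. Schema is assumed.
MATCH_CHUNKS_SQL = """
CREATE OR REPLACE FUNCTION match_chunks(
    query_embedding vector(1536),
    match_count     int,
    org_filter      bigint
)
RETURNS TABLE (chunk_id bigint, content text, similarity float)
LANGUAGE sql STABLE AS $$
    SELECT c.id,
           c.content,
           1 - (c.embedding <=> query_embedding) AS similarity
    FROM document_chunks c
    JOIN documents d ON d.id = c.document_id
    WHERE d.org_id = org_filter
    ORDER BY c.embedding <=> query_embedding
    LIMIT match_count;
$$;
"""
```

The application then calls `SELECT * FROM match_chunks(...)` with one round trip; adding new filters means editing the function, not the Python code.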
12
Frontend Connection
All network logic lives in a DocumentRepository class injected via Riverpod. The streamQuery method returns a Stream<String> — the widget listens and appends tokens as they arrive.
13
Lessons Learned
Chunk overlap matters more than chunk size
512 tokens with no overlap left answers that spanned chunk boundaries incomplete. Adding 64-token overlap improved completeness measurably with negligible storage cost.
Validate the embedding model before you build
Switching from ada-002 to text-embedding-3-small mid-project required re-embedding 2M+ chunks. A model evaluation spike upfront would have saved two days.
PostgreSQL functions beat ORM for vector ops
Moving similarity search into a SQL function was significantly faster than composing through SQLAlchemy, which generated suboptimal query plans that bypassed the HNSW index.
Stream everything LLM-related to the client
A 4-second blank screen waiting for the full response felt like an error. SSE streaming with a typing indicator transformed perceived responsiveness even though total latency was identical.
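The SSE streaming path from that lesson can be sketched as follows: each token is framed in the Server-Sent Events wire format and streamed from a FastAPI endpoint. The route name and done-marker convention are illustrative assumptions, not the production API.

```python
# Sketch of SSE token streaming. The FastAPI wiring sits in a factory so
# the pure framing helpers stay dependency-free and testable.

def sse_event(token: str) -> str:
    """Wrap one token in the Server-Sent Events wire format."""
    return f"data: {token}\n\n"

def token_stream(tokens):
    """Yield SSE-framed tokens, then a done marker so the client stops listening."""
    for t in tokens:
        yield sse_event(t)
    yield "data: [DONE]\n\n"

def create_app():
    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse

    app = FastAPI()

    @app.get("/query/stream")  # illustrative route, not the production path
    def stream_query(q: str):
        # Production pulls tokens from the LLM stream; a stub stands in here.
        tokens = iter(q.split())
        return StreamingResponse(
            token_stream(tokens), media_type="text/event-stream"
        )

    return app
```

The client renders each `data:` frame as it arrives, which is what turns a 4-second wait into visible progress.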

