AI Document Intelligence Platform

Built a production-grade RAG system to process large PDF datasets, extract structured insights, and generate strategic summaries using FastAPI, LangChain, and pgvector.

Year

2024

Category

AI / RAG

Stack

FastAPI, LangChain, pgvector

Key Results

96%

Query Accuracy

on benchmark set

8 min

Report Time

down from 3–4 hrs

500+

Daily Queries

at sub-2s latency

3 depts

Team Expansion

within 2 months

01

The Challenge

A fast-growing edtech company had over 40,000 PDF documents spread across shared drives with zero searchability. Analysts were spending 3–4 hours per report manually reading and extracting data. Leadership needed a system that could handle high document volumes and return structured, citable answers — without hallucinating.

02

What We Built

We architected a full RAG pipeline using FastAPI for the backend API layer, LangChain for orchestration, and pgvector on PostgreSQL for vector storage. Documents are chunked, embedded, and indexed on upload. At query time, the system retrieves the top-k relevant chunks and passes them to the LLM with a structured prompt that enforces source citation and confidence scoring.

03

Results

Report generation time dropped from 3–4 hours to under 8 minutes. The system now handles 500+ document queries per day with sub-2s average response times. The client expanded the platform to three additional departments within two months, eliminating an estimated 1,200 analyst hours per quarter.

04

Before & After

Before

Analysts spent 3–4 hours manually reading PDFs to produce a single report

40,000+ documents were unsearchable across shared drives

Answers required senior analysts — a bottleneck on every project

No audit trail — impossible to verify where answers came from

After

Reports generated in under 8 minutes with cited, structured output

Every document instantly queryable with sub-2s response times

Any team member can query the system independently, 24/7

Every answer includes source citations with page-level references

05

How We Built It

We broke the build into four sequential phases, each with a clear deliverable before moving to the next.

1

Data audit & pipeline design

Mapped all 40k documents, assessed quality, and designed the chunking and embedding strategy. Validated pgvector latency targets before committing to the stack.

2

Ingestion pipeline

Built the S3 upload flow, PyMuPDF text extraction, 512-token chunking with overlap, and the Celery worker that manages embedding jobs asynchronously.
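The chunking step can be sketched in a few lines. This is a minimal illustration, not the production code: it uses whitespace "tokens" as a stand-in, where the real pipeline tokenises properly (e.g. with tiktoken) after PyMuPDF extraction.

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into fixed-size chunks with overlap.

    Consecutive chunks share `overlap` tokens, so an answer that
    spans a chunk boundary still appears whole in one chunk.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

# Whitespace split stands in for a real tokenizer here.
text = "word " * 1000
chunks = chunk_tokens(text.split(), size=512, overlap=64)
```

With 1,000 tokens this yields three chunks: the stride is 512 − 64 = 448 tokens, so each chunk repeats the last 64 tokens of its predecessor.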

3

Query API & LLM integration

Implemented FastAPI query endpoints, LangChain LCEL pipeline, similarity search via PostgreSQL functions, and structured prompt with citation enforcement.

4

Flutter frontend & streaming UI

Built the Flutter Web interface with SSE token streaming, inline citation chips, document upload flow, and query history with exportable report generation.

06

System Architecture

The system splits into two flows: an async ingestion pipeline and a synchronous query pipeline. Ingestion moves through S3, a Celery worker, and pgvector. Queries are synchronous FastAPI endpoints that hit pgvector, assemble context, and call OpenAI.

1

Ingestion Layer

S3 + Celery + PyMuPDF

PDFs land in S3. A Celery worker extracts text with PyMuPDF, chunks into 512-token segments with 64-token overlap, and dispatches embedding jobs.

2

Embedding Service

OpenAI text-embedding-3-small

Each chunk is embedded via the OpenAI embeddings API. Embeddings are 1536-dimensional vectors stored with chunk metadata in pgvector.

3

Vector Store

PostgreSQL + pgvector

HNSW index on the embedding column enables approximate nearest-neighbour search at sub-50ms latency across 2M+ vectors.
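The index itself is one DDL statement; a sketch below, with table and column names assumed for illustration (`m` and `ef_construction` shown at pgvector's defaults):

```sql
-- Approximate nearest-neighbour index for cosine distance.
CREATE INDEX ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```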

4

Query API

FastAPI + LangChain LCEL

Receives a user query, embeds it, runs similarity search, assembles top-k chunks into a prompt, and streams the LLM response back to the client.

5

LLM Layer

OpenAI GPT-4o

Receives a structured prompt with retrieved context, a system instruction enforcing citation, and a JSON output schema for structured responses.
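A hypothetical sketch of the prompt assembly — the exact wording, citation format, and JSON schema here are illustrative assumptions, not the production prompt:

```python
# Assumed citation format "[doc_id:page]" and response schema.
SYSTEM_PROMPT = (
    "Answer ONLY from the provided context. Every claim must cite its "
    'source as [doc_id:page]. If the context is insufficient, say so. '
    'Respond as JSON: {"answer": str, "citations": [str], '
    '"confidence": float}'
)

def build_messages(question, chunks):
    """Assemble a chat payload from retrieved chunks.

    `chunks` is a list of dicts with `text`, `doc_id`, and `page` keys.
    """
    context = "\n\n".join(
        f"[{c['doc_id']}:{c['page']}] {c['text']}" for c in chunks
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

msgs = build_messages(
    "What was Q3 enrolment?",
    [{"doc_id": "rpt-12", "page": 4, "text": "Q3 enrolment rose 18%."}],
)
```

Prefixing each chunk with its source tag is what lets the model cite at page level without any post-hoc lookup.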

6

Frontend

Flutter Web

Streams the response token-by-token using SSE. Renders citations as inline source chips that expand to show the original document chunk.

07

Tech Stack

Backend

FastAPI, Python, LangChain, Celery

AI / ML

OpenAI GPT-4o, text-embedding-3-small, pgvector

Database

PostgreSQL, Redis

Infrastructure

AWS S3, Docker, AWS ECS

Frontend

Flutter Web, Dart

08

How We Approached the Problem

Before writing code we mapped the full data lifecycle: parse, chunk, embed, store, retrieve. Each stage had its own failure modes. We ran a two-day spike to validate pgvector latency targets before committing — it passed, sparing us the cost and operational complexity of a dedicated vector database.

Alternatives considered & rejected

Pinecone as vector store

Added an external dependency and egress cost. pgvector on the existing Postgres instance met p95 latency targets at a fraction of the price.

LlamaIndex instead of LangChain

LlamaIndex had better document loaders but weaker chain composition. LangChain LCEL made it easier to build testable, swappable pipeline steps.

Async queue for every query

Queries needed to feel synchronous in the UI. We reserved the queue only for ingestion, keeping query paths as fast synchronous endpoints.

09

Data Modelling

Two core tables: documents (metadata and processing status) and document_chunks (text segments with vector embeddings). Keeping them separate means we can re-embed all chunks for a document if the embedding model changes without touching metadata.
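A sketch of that split, with illustrative column names (the `vector(1536)` dimension matches text-embedding-3-small):

```sql
-- Metadata and processing status live apart from the embeddings,
-- so re-embedding rewrites only document_chunks.
CREATE TABLE documents (
    id          bigserial PRIMARY KEY,
    org_id      bigint NOT NULL,
    filename    text NOT NULL,
    status      text NOT NULL DEFAULT 'pending',
    uploaded_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE document_chunks (
    id          bigserial PRIMARY KEY,
    document_id bigint NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index int NOT NULL,
    content     text NOT NULL,
    embedding   vector(1536)  -- pgvector column
);
```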

10

API Layer

FastAPI handles document upload (async) and query (synchronous streaming). LangChain LCEL composes the query pipeline as typed, testable steps — embed → retrieve → prompt → stream.
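The pipeline shape can be shown with plain functions standing in for the LCEL steps — every implementation below is a stub for illustration, but the composition mirrors the real chain: each stage is a swappable callable that can be unit-tested alone.

```python
from typing import Iterable, List

def embed(query: str) -> List[float]:
    # Stub: the real step calls the OpenAI embeddings API.
    return [float(len(query))]

def retrieve(vector: List[float], k: int = 4) -> List[str]:
    # Stub: the real step runs pgvector similarity search.
    return [f"chunk-{i}" for i in range(k)]

def build_prompt(query: str, chunks: List[str]) -> str:
    return f"Context: {'; '.join(chunks)}\nQuestion: {query}"

def stream_answer(prompt: str) -> Iterable[str]:
    # Stub: the real step streams LLM tokens back over SSE.
    yield from prompt.split()

def run_query(query: str) -> Iterable[str]:
    """Compose embed -> retrieve -> prompt -> stream."""
    return stream_answer(build_prompt(query, retrieve(embed(query))))

tokens = list(run_query("What was Q3 enrolment?"))
```

Swapping the embedding model or retriever then means replacing one function, with the rest of the chain untouched.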

11

Database Functions

Similarity search and chunk retrieval moved into PostgreSQL functions rather than application code. This keeps heavy lifting close to the data and makes org-level filtering easy without touching application logic.
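A sketch of such a function, with names and the `org_id` column assumed for illustration (`<=>` is pgvector's cosine-distance operator):

```sql
CREATE OR REPLACE FUNCTION match_chunks(
    query_embedding vector(1536),
    match_count     int,
    org_filter      bigint
)
RETURNS TABLE (chunk_id bigint, content text, distance float)
LANGUAGE sql STABLE AS $$
    SELECT dc.id, dc.content,
           dc.embedding <=> query_embedding AS distance
    FROM document_chunks dc
    JOIN documents d ON d.id = dc.document_id
    WHERE d.org_id = org_filter          -- org-level filtering in SQL
    ORDER BY dc.embedding <=> query_embedding
    LIMIT match_count;
$$;
```

The application then calls `SELECT * FROM match_chunks(...)` instead of composing the query through the ORM.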

12

Frontend Connection

All network logic lives in a DocumentRepository class injected via Riverpod. The streamQuery method returns a Stream<String> — the widget listens and appends tokens as they arrive.

13

Lessons Learned

Chunk overlap matters more than chunk size

512 tokens with no overlap left answers that spanned chunk boundaries incomplete. Adding 64-token overlap improved completeness measurably with negligible storage cost.

Validate the embedding model before you build

Switching from ada-002 to text-embedding-3-small mid-project required re-embedding 2M+ chunks. A model evaluation spike upfront would have saved two days.

PostgreSQL functions beat ORM for vector ops

Moving similarity search into a SQL function was significantly faster than composing through SQLAlchemy, which generated suboptimal query plans that bypassed the HNSW index.

Stream everything LLM-related to the client

A 4-second blank screen waiting for the full response felt like an error. SSE streaming with a typing indicator transformed perceived responsiveness even though total latency was identical.
