AI / RAG2024

AI Sales Intelligence Platform

Built a production-grade multi-step RAG system to process city-level healthcare distribution reports, distill them into ranked priority signals, and surface expansion recommendations through a role-based dashboard — helping regional teams decide where to push next.

Year

2024

The Challenge

A mid-sized pharma distribution company was expanding into new cities across India. Their strategy team had commissioned hundreds of city-level market reports — hospital density, chemist network coverage, competitor presence, demographic data — all locked inside PDFs across shared drives. Regional managers were spending 3–4 hours per city manually reading through reports to judge whether to push distribution there. Leadership needed a system that could reason across all of it and tell the team which cities to prioritise — with a clear, auditable signal instead of a gut call.

What We Built

We built a multi-step RAG pipeline where city-level documents are chunked, embedded, and stored in pgvector. At query time we run a three-stage LangChain LCEL chain rather than a single LLM call: the first prompt extracts key market signals from retrieved chunks, the second reasons across those signals to produce a structured city assessment, and a final distillation prompt collapses everything into a 4-word priority label — 'Expand Now', 'High Potential', 'Needs Nurturing', or 'Deprioritise'. These labels populate a sortable priority table on the dashboard. Role-based auth ensures each company sees only their own city data, and regional leads can flag cities and assign them to field reps directly from the table.

Results

City analysis time dropped from 3–4 hours to under 8 minutes. The pipeline handles 500+ queries daily at sub-2s latency. Within two months the client rolled the dashboard out to three regional teams, replacing spreadsheet-based tracking entirely and giving leadership a live view of expansion priorities across 200+ cities.

Before & After

Before

Regional managers spent 3–4 hours per city reading PDFs to assess expansion potential

Hundreds of city reports were unsearchable across shared drives

Prioritisation depended on whoever had read the reports most recently

No audit trail — impossible to verify why a city was flagged as high priority

After

Each city analysed and labelled in under 8 minutes with a cited 4-word priority signal

Every report instantly queryable with sub-2s response times

Any team member can query the system independently, 24/7, with consistent output

Every label traces back to source citations with document and page references

How We Built It

We broke the build into four sequential phases, each with a clear deliverable before moving to the next.

Data audit & pipeline design

Catalogued all city reports, assessed quality and structure variance across different report formats, and designed the chunking strategy. Validated pgvector latency targets before committing to the stack.

Ingestion pipeline

Built the S3 upload flow, PyMuPDF text extraction, 512-token chunking with 64-token overlap, and the Celery worker that manages embedding jobs asynchronously across hundreds of documents.

Multi-step query chain

Implemented the three-stage LangChain LCEL chain — signal extraction, structured city reasoning, and 4-word priority distillation — with citation enforcement at every step so every label is traceable.

Next.js dashboard & role-based auth

Built the Next.js frontend with shadcn/ui and Tailwind CSS featuring a sortable priority table, SSE token streaming for city deep-dives, expandable citation chips, and role-based access so each company sees only their own regional data.

System Architecture

The system splits into two flows: an async ingestion pipeline and a synchronous multi-step query pipeline. Ingestion moves through S3, a Celery worker, and pgvector. Queries are synchronous FastAPI endpoints that run a three-stage LangChain chain and stream the final output back to the dashboard.

Ingestion Layer

S3 + Celery + PyMuPDF

City reports land in S3. A Celery worker extracts text with PyMuPDF, chunks into 512-token segments with 64-token overlap, and dispatches embedding jobs.

Embedding Service

OpenAI text-embedding-3-small

Each chunk is embedded via the OpenAI embeddings API. Embeddings are 1536-dimensional vectors stored with chunk metadata in pgvector.

Vector Store

PostgreSQL + pgvector

HNSW index on the embedding column enables approximate nearest-neighbour search at sub-50ms latency across millions of vectors.

Multi-Step Query Chain

FastAPI + LangChain LCEL

Three chained prompts run in sequence: (1) extract market signals from retrieved chunks, (2) reason across signals into a structured city assessment, (3) distil the assessment into a 4-word priority label with confidence score.

LLM Layer

OpenAI GPT-4o

Each prompt stage uses GPT-4o with a structured output schema. The final stage is constrained to four possible labels to eliminate hallucinated verdicts.

Frontend

Next.js + shadcn/ui + Tailwind CSS

Streams city assessment tokens via SSE. Renders the distilled priority label in a sortable table. Citation chips built with shadcn Badge and Popover expand to show the original document chunk each signal was drawn from.

Tech Stack

Backend

FastAPIPythonLangChainCelery

AI / ML

OpenAI GPT-4otext-embedding-3-smallpgvector

Database

PostgreSQLRedis

Infrastructure

AWS S3DockerAWS ECS

Frontend

Next.jsshadcn/uiTailwind CSSTypeScript

How We Approached the Problem

Before writing code we mapped the full reasoning chain the system needed to replicate: what signals matter for a city, how those signals combine into a judgement, and what the final output needs to look like for a field rep to act on it immediately. We deliberately chose a three-stage chain over a single large prompt — each stage is independently testable and the intermediate outputs are logged, making it easy to debug where the chain goes wrong on edge-case cities.

Alternatives considered & rejected

Single-shot prompt for the full analysis

A single prompt asking for extraction, reasoning, and distillation in one pass produced inconsistent label quality. Breaking it into three stages with intermediate validation improved accuracy significantly.

Pinecone as vector store

Added an external dependency and egress cost. pgvector on the existing Postgres instance met p95 latency targets at a fraction of the price.

LlamaIndex instead of LangChain

LlamaIndex had better document loaders but weaker chain composition. LangChain LCEL made it easier to build testable, swappable pipeline steps across the three prompt stages.

Data Modelling

Three core tables: companies (RBAC anchor), documents (city report metadata per company), and document_chunks (text segments with embeddings). Keeping company_id on both documents and chunks means every similarity search is automatically scoped — no application-layer filtering needed.

API Layer

The query endpoint runs a three-stage LangChain LCEL chain. Each stage is a typed Runnable — independently testable and swappable. The final stage is constrained to four labels via an enum in the output schema, eliminating free-text hallucinations in the priority signal.

Database Functions

Similarity search is scoped to the calling company at the SQL level — not in application code. This means even if a bug bypasses the application auth layer, cross-company data leakage is structurally impossible.

Frontend Connection

A custom hook manages the SSE stream and assembles the three-stage chain output as it arrives. The priority table updates optimistically — the label renders as soon as the distillation stage streams back, without waiting for the full response to complete.

Lessons Learned

Single-shot prompts break on complex reasoning

Asking one prompt to extract signals, reason across them, and produce a label in one pass gave inconsistent results on edge-case cities. Breaking it into three stages with typed intermediate outputs fixed accuracy and made debugging trivial.

Constrain the final output to an enum

Leaving the priority label as free text produced creative variations like 'Expand Cautiously' that broke the frontend sort logic. Enforcing a Literal type in the Pydantic schema eliminated this entirely.

Scope at the database level, not just the application

Putting company_id filtering inside a SQL function means cross-company data leakage is structurally impossible — not just dependent on application code being correct.

Stream intermediate stages, not just the final answer

Showing 'Extracting signals... Assessing city... Distilling priority...' as each stage completed made a 6-second chain feel fast. A blank screen for the same duration felt broken.

More Work

LLM / Search

Enterprise GPT Knowledge Assistant

SaaS / AWS

Subscription Streaming Platform

Start a project

Have something similar in mind?

Get in touch All work