LLM / Search2024

Enterprise GPT Knowledge Assistant

Designed and deployed a secure document-aware chatbot with vector search, embeddings, and context-aware responses for internal knowledge workflows.

Year

2024

The Challenge

A 200-person professional services firm needed employees to query internal knowledge — HR policies, project templates, client SOPs — without bothering senior staff or digging through SharePoint. Knowledge was siloed, new hires took weeks to ramp up, and senior staff fielded the same questions repeatedly.

What We Built

A secure internal chatbot with role-based document access. Django powers the API and auth layer, Pinecone handles vector storage with one namespace per permission role, and Auth0 provides SSO. The Next.js frontend streams responses token-by-token and surfaces citations inline.

Results

Internal support tickets dropped 41% in the first month. Onboarding shortened by an estimated 60%. The assistant handles over 1,200 queries per week with a 94% satisfaction rate. The firm rolled out to two partner offices within the first quarter.

Before & After

Before

Employees emailed HR or senior staff for every policy question — answers took hours or days

New hires spent weeks shadowing colleagues just to learn where information lived

Knowledge was siloed by department — no cross-team visibility into SOPs or templates

No audit trail — impossible to know if employees were getting accurate answers

After

Any employee gets an instant cited answer from the assistant in seconds

Onboarding cut by 60% — new hires query the assistant independently from day one

A single interface surfaces relevant knowledge across all permitted departments

Every response is grounded in source documents with traceable citations

How We Built It

We designed the permission model before touching AI. Getting access control right first meant we never had to retrofit security onto a working system.

Permission model & Auth0 setup

Designed the role-to-namespace mapping, configured Auth0 SSO, extended Django's user model, and validated that JWT claims could drive Pinecone namespace resolution end-to-end.

Document ingestion pipeline

Built Celery workers for chunking, OpenAI embedding, and role-scoped Pinecone upserts. Established the convention that Auth0 role names mirror Pinecone namespaces exactly.

Django query API with SSE streaming

Implemented the DRF query view, QueryService with multi-namespace retrieval and re-ranking, and the StreamingHttpResponse SSE pipeline.

Next.js chat interface

Built the useChatStream hook, streaming chat UI, citation chip rendering, and server-component conversation history loading. Deployed to Vercel.

System Architecture

A three-tier architecture with an AI retrieval layer between the API and LLM. Django owns business logic and permission enforcement. Pinecone owns vector search. Next.js owns the UI.

Auth Layer

Auth0 + Django

Auth0 handles SSO and issues JWTs. Django validates the token on every request and resolves the user's role and Pinecone namespaces before any query reaches Pinecone.

Document Ingestion

Django + Celery + OpenAI

Uploaded documents are queued in Celery, chunked into 400-token segments, embedded via OpenAI, and upserted into the role-scoped Pinecone namespace.

Vector Store

Pinecone

One namespace per permission role. Metadata on each vector stores document_id, chunk_index, and page. Queries filter by namespace before similarity scoring.

Query API

Django REST Framework

Receives a question, resolves the user's namespaces, queries Pinecone for top-k chunks, assembles the prompt, and streams the OpenAI response as SSE.

LLM Layer

OpenAI GPT-4o

Structured system prompt instructs the model to answer only from provided context and cite sources. Temperature 0 for deterministic responses.

Frontend

Next.js + TypeScript

Server components fetch conversation history. A client component connects to the Django SSE endpoint and renders tokens as they stream, with citation chips inline.

Tech Stack

Backend

DjangoDjango REST FrameworkPythonCelery

AI / ML

OpenAI GPT-4otext-embedding-3-small

Vector Store

Pinecone

Database

PostgreSQLRedis

Auth

Auth0

Frontend

Next.jsTypeScriptVercel

How We Approached the Problem

The core constraint was access control — employees in different roles could not see each other's documents. We designed the permission model first. Every Pinecone vector carries metadata with the document's permission scope, and every query filters on that scope server-side.

Alternatives considered & rejected

Single shared vector namespace

A single namespace makes per-user filtering possible but expensive at scale. Namespace-per-role gave hard isolation and faster retrieval.

FastAPI instead of Django

The firm's user and permission tables were already in a Django app. DRF let us reuse ORM models, permission classes, and admin panel without rebuilding auth.

Vercel AI SDK for streaming

The Vercel AI SDK assumes a Next.js API route as the streaming origin. Our backend was Django, so we implemented SSE directly with StreamingHttpResponse.

Data Modelling

Django models own the relational side: users, roles, documents, conversations. Pinecone owns the vector side. The document model stores Pinecone vector IDs so we can delete or re-embed without a full index rebuild.

API Layer

Django REST Framework handles the query endpoint. The view resolves namespaces from the user's roles, delegates to QueryService which retrieves from Pinecone, and streams the OpenAI response via StreamingHttpResponse.

Database Functions

PostgreSQL functions handle conversation context fetching and bulk document status updates after ingestion.

Frontend Connection

A useChatStream hook manages the full lifecycle — sending a message, reading the SSE stream, appending tokens to the buffer, and handling errors. The component never touches fetch directly.

Lessons Learned

Namespace-per-role beats metadata filtering at scale

A single namespace with role metadata became the bottleneck at 500k+ vectors. Per-role namespaces dropped p95 query latency by 60% since each query scanned a fraction of the index.

Django ORM is a liability for bulk vector operations

Upserting thousands of embeddings through the ORM was slow — one INSERT per object. We bypassed it using psycopg2 execute_values and batched Pinecone upserts of 100 vectors.

Auth0 roles should mirror Pinecone namespaces exactly

A custom role-to-namespace mapping table was a bug source when roles were renamed. Enforcing the convention that Auth0 role name IS the namespace eliminated the mapping layer entirely.

Multi-turn context needs a hard token budget

Full conversation history hit the context window unexpectedly on long sessions. We cap at the last 10 messages and summarise older turns into a single system message.

More Work

AI / RAG

AI Sales Intelligence Platform

SaaS / AWS

Subscription Streaming Platform

Start a project

Have something similar in mind?

Get in touch All work