March 5, 2026
Build a Local/Private RAG System with LlamaIndex and Python
A practical guide to implementing RAG on local documents using LlamaIndex. Load files, create embeddings, build a vector index, and query your data with Python.
Goal: Build a local RAG (Retrieval-Augmented Generation) system over your files using LlamaIndex.
RAG sounds complicated. In reality it's simple. You give the system documents. It chops them into pieces, finds the relevant ones, and the model answers using those pieces. No guessing. No hallucinating wild stuff. Just answers grounded in your files.
Perfect for contracts, internal docs, research notes, or anything that lives in folders on your machine.
Setup
Install Python
Download Python from the official site and install it.
Check it in terminal:
python --version
If it prints a version number, you're good. If not, your terminal is silently judging you.
Create a project folder
mkdir rag-project
cd rag-project
This keeps everything clean instead of scattering Python files around your computer like digital confetti.
Create a virtual environment
python -m venv venv
Activate it.
macOS / Linux
source venv/bin/activate
Windows
venv\Scripts\activate
Your terminal should now show (venv). Congratulations, you've entered Python's tiny universe.

Add your documents
Create a folder called data.
Put your PDF, TXT, DOCX files there.
rag-project
└── data
    ├── contract.pdf
    ├── notes.txt
    └── handbook.docx
These files form the knowledge base your RAG system will search.
Optional: Add an API key
If you plan to use hosted models like OpenAI/Anthropic, set the key.
macOS / Linux
export OPENAI_API_KEY="your_api_key"
Windows
setx OPENAI_API_KEY "your_api_key"
If you plan to run fully local models later, you can skip this.
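Before moving on, it can be worth confirming the key is actually visible to Python — a common gotcha, especially on Windows where setx only affects newly opened terminals. A small sketch (the helper name is ours, not part of any library):

```python
import os

def has_api_key():
    """Return True if OPENAI_API_KEY is set in this process's environment."""
    return bool(os.environ.get("OPENAI_API_KEY"))

if has_api_key():
    print("OPENAI_API_KEY found")
else:
    print("OPENAI_API_KEY missing - set it, or skip if going fully local")
```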
Install dependencies
pip install llama-index openai
Two packages and you're already building a document AI system. Five years ago this needed a research lab.
Build the Index
Create app.py.
Load documents and build a vector index.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
That's it. Your documents are now searchable by meaning instead of keywords.
Ask Questions
Now query the data.
response = query_engine.query("Summarize the contract terms")
print(response)
The system retrieves the relevant chunks and feeds them to the model. The answer comes directly from your documents.
No guessing. No Wikipedia energy.
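If you want to verify that yourself, LlamaIndex query responses carry the retrieved chunks as source nodes. A minimal sketch for inspecting them, reusing the `query_engine` from above:

```python
# Inspect which chunks the answer was grounded in.
response = query_engine.query("Summarize the contract terms")
for node_with_score in response.source_nodes:
    print("score:", node_with_score.score)
    # First 200 characters of the retrieved chunk:
    print(node_with_score.node.get_content()[:200])
```

Useful for debugging retrieval: if the wrong chunks show up here, the model never had a chance.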
What happens behind the scenes
- Documents are split into chunks
- Chunks are converted into embeddings
- Relevant chunks are retrieved
- The LLM answers using those chunks
This is the core idea of Retrieval-Augmented Generation.
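The four steps above can be sketched in plain Python. This toy uses word-count overlap as a stand-in for real embeddings — no model, no LlamaIndex, and the final "LLM answers" step is omitted — purely to make the flow concrete:

```python
def chunk(text, size=5):
    """Step 1: split text into chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Step 2 (toy): 'embed' text as a word-count dictionary."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def similarity(a, b):
    """Overlap score between two toy embeddings."""
    return sum(min(a[w], b[w]) for w in a if w in b)

def retrieve(chunks, query, top_k=1):
    """Step 3: return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: similarity(embed(c), q), reverse=True)
    return ranked[:top_k]

doc = "The contract ends in June. Payment is due monthly. Either party may cancel."
chunks = chunk(doc)
# Step 4 would feed the retrieved chunks to an LLM; here we just print them.
print(retrieve(chunks, "When does the contract end?"))
```

Real systems swap the word-count trick for learned embeddings and cosine similarity, but the pipeline shape is the same.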
Control embeddings
You can control which embedding model is used.
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Good embeddings = better retrieval.
Bad embeddings = your AI suddenly forgetting where things are.
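For a fully local setup, LlamaIndex also supports Hugging Face embedding models through a separate integration package. A sketch, assuming `llama-index-embeddings-huggingface` is installed and using `BAAI/bge-small-en-v1.5` only as an example model choice:

```python
# Local embeddings instead of OpenAI.
# Assumes: pip install llama-index-embeddings-huggingface
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"  # downloaded on first use
)
```

With this set, indexing and retrieval never send your document text to an external API.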
Persist the index
Rebuilding an index every run is annoying. Save it once.
index.storage_context.persist(persist_dir="./storage")
Reload later.
from llama_index.core import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
Now your RAG system boots instantly.
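Putting the two halves together, a common pattern is to load the saved index when `./storage` exists and build it otherwise. A sketch of that load-or-build logic:

```python
import os

from llama_index.core import (
    StorageContext,
    VectorStoreIndex,
    SimpleDirectoryReader,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"

if os.path.exists(PERSIST_DIR):
    # Reuse the saved index.
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
else:
    # First run: build the index and save it for next time.
    documents = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)

query_engine = index.as_query_engine()
```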
Where this is useful
Document AI shows up everywhere:
- internal knowledge assistants
- contract review
- research search tools
- support documentation bots
Clean chunking and good embeddings matter more than complicated pipelines.
Everything else is just plumbing.