March 5, 2026
Build a Local/Private RAG System with LlamaIndex and Python
A practical guide to implementing RAG on local documents using LlamaIndex. Load files, create embeddings, build a vector index, and query your data with Python.
Goal: Build a local RAG (Retrieval-Augmented Generation) system over your files using LlamaIndex.
RAG sounds complicated. In reality it's simple. You give the system documents. It chops them into pieces, finds the relevant ones, and the model answers using those pieces. No guessing. No hallucinating wild stuff. Just answers grounded in your files.
Perfect for contracts, internal docs, research notes, or anything that lives in folders on your machine.
Setup
Install Python
Download Python from the official site and install it.
Check it in terminal:
python --version
If it prints a version number, you're good. If not, your terminal is silently judging you.
Create a project folder
mkdir rag-project
cd rag-project
This keeps everything clean instead of scattering Python files around your computer like digital confetti.
Create a virtual environment
python -m venv venv
Activate it.
macOS / Linux
source venv/bin/activate
Windows
venv\Scripts\activate
Your terminal should now show (venv). Congratulations, you've entered Python's tiny universe.

Add your documents
Create a folder called data.
Put your PDF, TXT, DOCX files there.
rag-project
└── data
    ├── contract.pdf
    ├── notes.txt
    └── handbook.docx
These files form the knowledge base your RAG system will search.
Optional: Add an API key
If you plan to use hosted models like OpenAI/Anthropic, set the key.
macOS / Linux
export OPENAI_API_KEY="your_api_key"
Windows
setx OPENAI_API_KEY "your_api_key"
If you plan to run fully local models later, you can skip this.
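Before moving on, it can be worth confirming the key is actually visible to Python — a common gotcha, especially on Windows where setx only affects newly opened terminals. A small sketch (the helper name is ours, not part of any library):

```python
import os

def has_api_key():
    """Return True if OPENAI_API_KEY is set in this process's environment."""
    return bool(os.environ.get("OPENAI_API_KEY"))

if has_api_key():
    print("OPENAI_API_KEY found")
else:
    print("OPENAI_API_KEY missing - set it, or skip if going fully local")
```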
Install dependencies
pip install llama-index openai
Two packages and you're already building a document AI system. Five years ago this needed a research lab.
Build the Index
Create app.py.
Load documents and build a vector index.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
That's it. Your documents are now searchable by meaning instead of keywords.
Ask Questions
Now query the data.
response = query_engine.query("Summarize the contract terms")
print(response)
The system retrieves the relevant chunks and feeds them to the model. The answer comes directly from your documents.
No guessing. No Wikipedia energy.
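If you want to verify that yourself, LlamaIndex query responses carry the retrieved chunks as source nodes. A minimal sketch for inspecting them, reusing the `query_engine` from above:

```python
# Inspect which chunks the answer was grounded in.
response = query_engine.query("Summarize the contract terms")
for node_with_score in response.source_nodes:
    print("score:", node_with_score.score)
    # First 200 characters of the retrieved chunk:
    print(node_with_score.node.get_content()[:200])
```

Useful for debugging retrieval: if the wrong chunks show up here, the model never had a chance.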
What happens behind the scenes
- Documents are split into chunks
- Chunks are converted into embeddings
- Relevant chunks are retrieved
- The LLM answers using those chunks
This is the core idea of Retrieval-Augmented Generation.
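The four steps above can be sketched in plain Python. This toy uses word-count overlap as a stand-in for real embeddings — no model, no LlamaIndex, and the final "LLM answers" step is omitted — purely to make the flow concrete:

```python
def chunk(text, size=5):
    """Step 1: split text into chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Step 2 (toy): 'embed' text as a word-count dictionary."""
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def similarity(a, b):
    """Overlap score between two toy embeddings."""
    return sum(min(a[w], b[w]) for w in a if w in b)

def retrieve(chunks, query, top_k=1):
    """Step 3: return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: similarity(embed(c), q), reverse=True)
    return ranked[:top_k]

doc = "The contract ends in June. Payment is due monthly. Either party may cancel."
chunks = chunk(doc)
# Step 4 would feed the retrieved chunks to an LLM; here we just print them.
print(retrieve(chunks, "When does the contract end?"))
```

Real systems swap the word-count trick for learned embeddings and cosine similarity, but the pipeline shape is the same.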
Control embeddings
You can control which embedding model is used.
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Good embeddings = better retrieval.
Bad embeddings = your AI suddenly forgetting where things are.
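For a fully local setup, LlamaIndex also supports Hugging Face embedding models through a separate integration package. A sketch, assuming `llama-index-embeddings-huggingface` is installed and using `BAAI/bge-small-en-v1.5` only as an example model choice:

```python
# Local embeddings instead of OpenAI.
# Assumes: pip install llama-index-embeddings-huggingface
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"  # downloaded on first use
)
```

With this set, indexing and retrieval never send your document text to an external API.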
Persist the index
Rebuilding an index every run is annoying. Save it once.
index.storage_context.persist(persist_dir="./storage")
Reload later.
from llama_index.core import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
Now your RAG system boots instantly.
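Putting the two halves together, a common pattern is to load the saved index when `./storage` exists and build it otherwise. A sketch of that load-or-build logic:

```python
import os

from llama_index.core import (
    StorageContext,
    VectorStoreIndex,
    SimpleDirectoryReader,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"

if os.path.exists(PERSIST_DIR):
    # Reuse the saved index.
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
else:
    # First run: build the index and save it for next time.
    documents = SimpleDirectoryReader("./data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)

query_engine = index.as_query_engine()
```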
Where this is useful
Document AI shows up everywhere:
- internal knowledge assistants
- contract review
- research search tools
- support documentation bots
Clean chunking and good embeddings matter more than complicated pipelines.
Everything else is just plumbing.