Advanced RAG & LLM Systems

LLM RAG System

Self-Correcting Retrieval-Augmented Generation with autonomous decision-making capabilities. An advanced AI system that combines large language models with intelligent retrieval mechanisms and self-correction protocols for enhanced accuracy and reliability.

[Figure: LLM RAG System Architecture]

Project Overview

The Self-Correcting RAG System is a locally deployed assistant built with Python, Ollama (running llama3.1), and ChromaDB, designed to deliver factually grounded answers from regulatory PDFs such as the Auckland Unitary Plan. Each question is routed through a custom state graph that chooses between local document retrieval and Tavily-powered web search. A local LLM generates the response, which is then validated through relevance grading, hallucination detection, and reflection loops, with regeneration triggered when necessary. The pipeline is fully automated: it checks each answer for accuracy and completeness and iteratively refines it within a single interaction, with no manual oversight.
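As a minimal sketch of how such a pipeline can share data between graph nodes, the state can be modeled as a TypedDict; the field names below are illustrative assumptions rather than the project's actual schema.

```python
# Minimal sketch of the shared state passed between graph nodes.
# Field names are illustrative assumptions, not the project's actual schema.
from typing import List, TypedDict


class GraphState(TypedDict):
    question: str         # the user's query
    documents: List[str]  # retrieved or web-sourced context chunks
    generation: str       # the current draft answer
    retries: int          # how many regeneration loops have run so far
```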



Key Capabilities

  • Autonomous Decision-Making – Dynamically chooses vector retrieval or web search based on zoning and regulatory relevance.
  • Self-Correcting Answer Generation – Uses reflection and hallucination grading to refine responses without user input.
  • StateGraph-Controlled Workflow – Executes modular nodes for retrieval, generation, grading, reflection, and routing.
  • Chroma-Backed Relevance Retrieval – Retrieves and filters document chunks from ChromaDB using Ollama embeddings.
  • Iterative Output Validation – Detects missing info or hallucinations and triggers regeneration from updated context.
  • Fully Local & Privacy-First – Runs entirely offline with local LLMs and PDF data, preserving user privacy and control.

Self-Correction Workflow

[Diagrams: RAG Pipeline Flow and Self-Correction Loop]

1. Intelligent Routing via Indicator Node

A custom routing node determines whether the query is best answered via local ChromaDB retrieval or external Tavily web search, using topic-aware logic instead of keyword rules.
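A routing node along these lines could ask the local model to classify the query; the prompt wording, function name, and return labels below are assumptions for illustration, not the project's exact implementation.

```python
# Hypothetical router: asks llama3.1 (via the ollama Python client) whether the
# question concerns local zoning/regulatory content or needs a web search.
import ollama

ROUTE_PROMPT = (
    "Decide where to send a user question. Answer with a single word: "
    "'vectorstore' if it concerns Auckland Unitary Plan zoning or regulatory "
    "rules, otherwise 'websearch'.\n\nQuestion: {question}"
)


def route_question(state: dict) -> str:
    response = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": ROUTE_PROMPT.format(question=state["question"])}],
    )
    decision = response["message"]["content"].strip().lower()
    return "websearch" if "websearch" in decision else "vectorstore"
```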

2. Filtered Retrieval with Ollama Embeddings

Chunks are retrieved from ChromaDB using `mxbai-embed-large` embeddings and then passed through a relevance grading process to filter out off-topic results.
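A sketch of that step, assuming a ChromaDB collection named unitary_plan stored under ./chroma_db and a simple yes/no grading prompt (all illustrative choices):

```python
# Sketch: embed the question with mxbai-embed-large, query ChromaDB, then keep
# only chunks the local LLM grades as relevant. Names and paths are assumptions.
import chromadb
import ollama

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("unitary_plan")


def retrieve_and_grade(question: str, k: int = 4) -> list[str]:
    query_emb = ollama.embeddings(model="mxbai-embed-large", prompt=question)["embedding"]
    results = collection.query(query_embeddings=[query_emb], n_results=k)

    relevant = []
    for chunk in results["documents"][0]:
        verdict = ollama.chat(
            model="llama3.1",
            messages=[{
                "role": "user",
                "content": (
                    "Does this passage help answer the question? Reply yes or no.\n"
                    f"Question: {question}\nPassage: {chunk}"
                ),
            }],
        )["message"]["content"].lower()
        if "yes" in verdict:
            relevant.append(chunk)
    return relevant
```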

3. Structured Generation and Reflection Loop

A local LLM (llama3.1 via Ollama) generates an initial response, which a reflection step then reviews to determine whether key content was omitted or the answer could be better grounded in the retrieved context.
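A generate-then-reflect pair might look like the following; the prompts and the boolean return convention are illustrative assumptions.

```python
# Sketch of generation followed by a reflection check for completeness.
import ollama


def generate(question: str, documents: list[str]) -> str:
    context = "\n\n".join(documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return ollama.chat(
        model="llama3.1", messages=[{"role": "user", "content": prompt}]
    )["message"]["content"]


def reflect(question: str, answer: str, documents: list[str]) -> bool:
    """Return True when the reflection step judges the draft answer complete."""
    context = "\n\n".join(documents)
    prompt = (
        "Does the draft answer omit any key information from the context? "
        "Reply 'complete' or 'incomplete'.\n\n"
        f"Question: {question}\nDraft answer: {answer}\nContext:\n{context}"
    )
    verdict = ollama.chat(
        model="llama3.1", messages=[{"role": "user", "content": prompt}]
    )["message"]["content"]
    return "incomplete" not in verdict.lower()
```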

4. Hallucination Detection and Iterative Correction

If hallucination is detected or the answer is incomplete, the system reuses or augments the retrieved documents and regenerates a grounded response, looping until completion.
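Reusing the generate and reflect helpers from the sketch above, the correction loop can be expressed roughly as follows; the grounding prompt and the retry cap are assumptions.

```python
# Sketch of the correction loop: regenerate while the answer is ungrounded or
# incomplete, capped at a fixed number of retries.
import ollama


def is_grounded(answer: str, documents: list[str]) -> bool:
    context = "\n\n".join(documents)
    prompt = (
        "Is every claim in the answer supported by the context? Reply yes or no.\n\n"
        f"Context:\n{context}\n\nAnswer: {answer}"
    )
    verdict = ollama.chat(
        model="llama3.1", messages=[{"role": "user", "content": prompt}]
    )["message"]["content"]
    return "yes" in verdict.lower()


def answer_with_correction(question: str, documents: list[str], max_retries: int = 3) -> str:
    answer = generate(question, documents)       # from the generation sketch above
    for _ in range(max_retries):
        if is_grounded(answer, documents) and reflect(question, answer, documents):
            break
        answer = generate(question, documents)   # regenerate from the same context
    return answer
```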

Advanced Capabilities

Self-Correction Engine

Performs automated reflection and hallucination validation on locally generated answers, triggering regeneration using the same or revised context when needed.

StateGraph-Based Orchestration

Uses a modular LangGraph state machine to control routing, retrieval, validation, and regeneration steps for robust, deterministic flow control.
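Wiring the nodes together with LangGraph might look like this, reusing the GraphState sketch from the overview; the node names, routing labels, and decision labels are illustrative assumptions.

```python
# Sketch of the LangGraph wiring: conditional entry for routing, then a
# conditional edge after generation that either ends or loops back.
from langgraph.graph import END, StateGraph


def build_graph(route, retrieve, web_search, generate_node, decide_next):
    """Each argument is a node or decision function over the shared GraphState."""
    workflow = StateGraph(GraphState)  # the TypedDict from the overview sketch
    workflow.add_node("retrieve", retrieve)
    workflow.add_node("web_search", web_search)
    workflow.add_node("generate", generate_node)

    # Route the incoming question to local retrieval or web search.
    workflow.set_conditional_entry_point(
        route, {"vectorstore": "retrieve", "websearch": "web_search"}
    )
    workflow.add_edge("retrieve", "generate")
    workflow.add_edge("web_search", "generate")

    # After generation: finish when grounded and complete, otherwise regenerate.
    workflow.add_conditional_edges(
        "generate", decide_next, {"useful": END, "retry": "generate"}
    )
    return workflow.compile()
```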

PDF-Centric Retrieval

Uses ChromaDB with `mxbai-embed-large` embeddings to retrieve semantically relevant chunks from parsed regulatory PDFs (e.g., Auckland Unitary Plan).
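Indexing a parsed PDF into that store could be done along these lines; the fixed-size chunking, file path, and collection name are illustrative assumptions.

```python
# Sketch: parse a regulatory PDF, chunk the text, embed each chunk with
# mxbai-embed-large, and store everything in a persistent ChromaDB collection.
import chromadb
import ollama
from pypdf import PdfReader


def index_pdf(path: str = "auckland_unitary_plan.pdf", chunk_size: int = 1000) -> None:
    text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_or_create_collection("unitary_plan")
    for i, chunk in enumerate(chunks):
        emb = ollama.embeddings(model="mxbai-embed-large", prompt=chunk)["embedding"]
        collection.add(ids=[f"chunk-{i}"], documents=[chunk], embeddings=[emb])
```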

Grounding & Confidence Filter

Combines document grading, hallucination detection, and reflection to assess factual grounding before accepting a response as complete and reliable.

System Architecture

The architecture components below, shown in simplified form, enable self-correction and intelligent local retrieval through a state-graph workflow.

Vector Store

Local persistent database for storing and retrieving document embeddings.

  • Custom Ollama embeddings using mxbai-embed-large
  • Stored and accessed via ChromaDB (PersistentClient)
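One way to express the custom embedding wrapper is through ChromaDB's EmbeddingFunction protocol, so the collection can embed queries and documents itself; the class below is an illustrative assumption rather than the project's actual wrapper.

```python
# Sketch of a custom Ollama embedding function plugged into a persistent
# ChromaDB collection.
import chromadb
import ollama
from chromadb import Documents, EmbeddingFunction, Embeddings


class OllamaEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        return [
            ollama.embeddings(model="mxbai-embed-large", prompt=text)["embedding"]
            for text in input
        ]


client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    "unitary_plan", embedding_function=OllamaEmbeddingFunction()
)
```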

Retrieval Engine

Fetches relevant PDF chunks using semantic similarity and filters with relevance grading.

  • Vector similarity search using custom wrapper class
  • Includes document-level relevance grading via local LLM

LLM Interface

Local LLM interface using llama3.1 via Ollama for both generation and classification.

  • PromptTemplate-based structured generation
  • Used for answer generation, grading, and reflection
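A single structured-generation helper can serve grading as well as generation; the string.Template stand-in, helper name, and JSON shape below are assumptions for illustration.

```python
# Sketch of a prompt-template helper that asks the local model for JSON output.
import json
from string import Template

import ollama

GRADE_TEMPLATE = Template(
    "Grade whether the passage is relevant to the question. "
    'Respond with JSON of the form {"relevant": "yes"} or {"relevant": "no"}.\n'
    "Question: $question\nPassage: $passage"
)


def grade_passage(question: str, passage: str) -> bool:
    prompt = GRADE_TEMPLATE.substitute(question=question, passage=passage)
    response = ollama.chat(
        model="llama3.1",
        messages=[{"role": "user", "content": prompt}],
        format="json",  # ask Ollama to constrain the reply to valid JSON
    )
    return json.loads(response["message"]["content"]).get("relevant") == "yes"
```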

Validation Layer

Ensures answer quality by checking factual grounding and content completeness.

  • Hallucination detection via local model scoring
  • Reflection module compares generation against documents

Correction Module

Implements self-correction through a LangGraph state machine with conditional transitions.

  • Reflection → Hallucination check → Conditional regeneration
  • Fully rule-based; no feedback learning or fine-tuning

Web Search Integration

Fallback mechanism using Tavily API when local documents are insufficient or irrelevant.

  • Triggered by topic-based indicator or low relevance in retrieved documents
  • Search results converted into synthetic documents for downstream generation
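A fallback node of that kind might flatten Tavily results into plain-text documents; the environment variable name and result-to-document mapping are illustrative assumptions.

```python
# Sketch of the web-search fallback: Tavily results become synthetic documents
# that the downstream generation node can treat like retrieved chunks.
import os

from tavily import TavilyClient


def web_search(state: dict) -> dict:
    client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
    results = client.search(state["question"], max_results=3)
    synthetic_docs = [r["content"] for r in results["results"]]
    return {**state, "documents": synthetic_docs}
```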

Tools & Technologies

  • Models – Local LLM (llama3.1) via Ollama for generation, grading, and validation
  • Vector Database – ChromaDB, local persistent storage using PersistentClient with collection-based access
  • Embeddings – Custom embeddings with the `mxbai-embed-large` model served via Ollama
  • Framework – Pure Python with LangGraph for state-machine control; no LangChain chains used
  • Search Engine – Optional Tavily Search API, invoked when local relevance or context checks fail
  • ML Libraries – Ollama, ChromaDB, and LangGraph; Hugging Face used only for auxiliary tasks