AI & Computer Vision

Vision Model RAG

An intelligent building code retrieval system with dual-mode processing that combines visual and text-based retrieval for comprehensive building code compliance checks.

Vision Model RAG System

Project Overview

CodeVision NZ is a fully local, multimodal Retrieval-Augmented Generation (RAG) application designed to make the New Zealand Building Code accessible through intelligent, chat-based search. Leveraging Ollama-powered local LLMs and ChromaDB vector storage, the system uses unstructured to extract structured text and high-resolution architectural diagrams from official NZBC PDFs. Users can then query the building code via a sleek Streamlit interface, receiving grounded answers drawn from both clauses and figures, with context and citations included. All processing occurs locally, ensuring data privacy and fast retrieval without cloud dependencies.



Key Objectives

  • Multimodal Retrieval – Extract both text and images using vision and language models.
  • Precise, Cited Answers – Return accurate responses with references to relevant clauses and diagrams.
  • Natural Language Querying – Ask questions in plain English and get clear, contextual answers.
  • Local & Secure Deployment – Run entirely offline using open-source tools to ensure full privacy.
  • Faster Compliance Checks – Accelerate code research for architects, engineers, and consultants.

System Workflow

The workflow combines a vision processing pipeline with a RAG retrieval system:

1. PDF Content Extraction

Extracts both text and diagrams from NZ Building Code PDFs using high-resolution parsing via unstructured. Images are converted to base64 and stored with metadata.
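The base64 step can be sketched as follows. This is a minimal illustration of how extracted image bytes might be wrapped with provenance metadata before indexing; the field names, page number, and PDF filename are assumptions, not the project's actual schema, and in the real pipeline the bytes would come from unstructured's hi-res PDF partitioning rather than a stand-in payload.

```python
import base64
from typing import TypedDict


class ImageRecord(TypedDict):
    """Metadata stored alongside each extracted diagram (field names are illustrative)."""
    image_b64: str
    page_number: int
    source_pdf: str


def encode_image(image_bytes: bytes, page_number: int, source_pdf: str) -> ImageRecord:
    """Convert raw image bytes to base64 text and attach provenance metadata."""
    return {
        "image_b64": base64.b64encode(image_bytes).decode("ascii"),
        "page_number": page_number,
        "source_pdf": source_pdf,
    }


# Stand-in bytes; the real pipeline supplies the diagram extracted from the PDF.
record = encode_image(b"\x89PNG...", page_number=12, source_pdf="E2-external-moisture.pdf")
```

Storing the image as base64 text keeps the record JSON-serializable, so it can sit in the same metadata store as the clause text.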

2. Query Embedding

User questions are embedded with mxbai-embed-large via Ollama to represent intent for both text and image retrieval.
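A sketch of this step, under stated assumptions: Ollama exposes a local embeddings endpoint (`/api/embeddings`, taking a `model` and `prompt`), and the request body below reflects that. Since no Ollama server is assumed to be running here, the actual model call is replaced by a deterministic hash-based stub that only demonstrates the shape of the output vector.

```python
import hashlib
import json

# Default local Ollama endpoint (assumes a standard local install).
OLLAMA_URL = "http://localhost:11434/api/embeddings"


def embedding_request(query: str) -> str:
    """Build the JSON body sent to Ollama's embeddings endpoint."""
    return json.dumps({"model": "mxbai-embed-large", "prompt": query})


def stub_embed(query: str, dim: int = 8) -> list[float]:
    """Deterministic stand-in for the real model, for offline demonstration:
    derives floats in [0, 1] from a SHA-256 digest of the query text."""
    digest = hashlib.sha256(query.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]


payload = embedding_request("What is the minimum apron width for roof cladding?")
vector = stub_embed("What is the minimum apron width for roof cladding?")
```

The same query vector is reused against both the document and image-description collections, which is what lets one question drive text and image retrieval at once.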

3. Text & Image Retrieval

ChromaDB returns top-matching clauses, tables, and image metadata—using separate collections for documents and image descriptions.
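In practice ChromaDB's `collection.query` performs the nearest-neighbour search itself; the pure-Python toy below (with made-up clause IDs and 2-D vectors) is only meant to illustrate the dual-collection pattern this step describes, where the same query vector is scored against documents and image descriptions separately.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], collection: list[dict], k: int = 2) -> list[dict]:
    """Return the k nearest entries; stands in for collection.query() in ChromaDB."""
    ranked = sorted(collection, key=lambda e: cosine(query_vec, e["embedding"]), reverse=True)
    return ranked[:k]


# Toy stand-ins for the two collections (documents vs. image descriptions).
docs = [
    {"id": "E2/AS1-8.1", "embedding": [1.0, 0.0], "text": "Roof cladding clause"},
    {"id": "B1/VM1-2.3", "embedding": [0.0, 1.0], "text": "Structural loads clause"},
]
images = [
    {"id": "fig-21", "embedding": [0.9, 0.1], "text": "Apron flashing diagram"},
]

query_vec = [1.0, 0.1]
clause_hits = top_k(query_vec, docs, k=1)
image_hits = top_k(query_vec, images, k=1)
```

Keeping the two collections separate means text chunks and image descriptions can be ranked and truncated independently before being merged into the LLM context.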

4. Answer Generation

A local LLM (e.g. LLaMA 3 via Ollama) synthesizes the retrieved content into a concise, clause-backed response with references.
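The grounding step amounts to assembling the retrieved clauses and figure descriptions into a single cited prompt. The template below is an illustrative sketch, not the project's actual prompt; the clause and figure IDs are invented for the example.

```python
def build_prompt(question: str, clauses: list[dict], figures: list[dict]) -> str:
    """Assemble retrieved context into a grounded, citation-friendly prompt
    for the local LLM. Structure is illustrative only."""
    clause_block = "\n".join(f"[{c['id']}] {c['text']}" for c in clauses)
    figure_block = "\n".join(f"[{f['id']}] {f['text']}" for f in figures)
    return (
        "Answer using ONLY the context below. Cite clause and figure IDs.\n\n"
        f"Clauses:\n{clause_block}\n\n"
        f"Figures:\n{figure_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "What flashing is required at roof penetrations?",
    clauses=[{"id": "E2/AS1-8.1", "text": "Flashings shall be provided at all roof penetrations."}],
    figures=[{"id": "fig-21", "text": "Apron flashing detail"}],
)
```

Embedding the IDs inline is what lets the model quote them back, so the final answer arrives clause-backed rather than free-floating.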

Key Features

Visual Understanding

Extracts and processes diagrams from NZ Building Code PDFs, using LLaMA 3.2 Vision to generate searchable image descriptions stored alongside base64 metadata.
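Ollama's chat API accepts base64-encoded images alongside the message for multimodal models, which is how a diagram can be captioned locally. The request body below is a hedged sketch: the model tag and prompt wording are assumptions, and only the payload construction is shown, not the network call.

```python
import base64
import json


def caption_request(image_bytes: bytes) -> str:
    """JSON body for Ollama's /api/chat endpoint asking a local vision model
    to describe one extracted diagram. Prompt wording is illustrative."""
    return json.dumps({
        "model": "llama3.2-vision",
        "messages": [{
            "role": "user",
            "content": "Describe this building code diagram, including any labelled dimensions.",
            # Ollama expects images as base64 strings in the message.
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
        "stream": False,
    })


body = caption_request(b"\x89PNG...")
```

The returned description, not the raw image, is what gets embedded and indexed, so diagram content becomes searchable with the same text embeddings as the clauses.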

Intelligent Retrieval

Dual-mode RAG retrieves relevant clauses and image metadata from ChromaDB using custom embeddings generated via Ollama.

Natural Language Interface

Ask any building code question in plain English. Responses are generated locally by a conversational LLM with grounded references.

Compliance Focused

Quickly identifies relevant measurements, constraints, and figures for NZ Building Code compliance validation.

Real-Time Local Pipeline

Everything, from parsing to retrieval to LLM reasoning, runs offline with no cloud APIs, ensuring privacy and fast responses.

NZBC-Focused Database

All extracted clauses, diagrams, and metadata are stored and indexed locally in ChromaDB for fast, structured access.

Tools & Technologies

  • Vision Models – LLaMA 3.2 Vision via Ollama for image captioning, metadata extraction, and diagram understanding
  • Language Models – LLaMA 3.1/3.2 via Ollama for local question answering and final answer generation
  • Embeddings – mxbai-embed-large via Ollama for text and image embedding
  • Vector Database – ChromaDB for storing and retrieving embeddings (text and image metadata)
  • Framework – LangGraph for workflow logic and LangChain Agents for modular tool-based reasoning
  • Image Processing – Pillow (PIL) for base64 encoding, previews, and unstructured PDF image handling
  • PDF Parsing – unstructured with hi-res + OCR fallback for extracting text, tables, and figures
  • User Interface – Streamlit for local chat interface with styled chat bubbles and file uploads