Production RAG Pipeline for Document Intelligence
Building an enterprise-grade Retrieval-Augmented Generation system with dbt, Qdrant, and LangChain
The Problem
Enterprise knowledge is trapped in PDFs, wikis, and Slack threads. Traditional search returns documents — but users want answers. This project builds a RAG (Retrieval-Augmented Generation) pipeline that ingests 10K+ documents and provides intelligent Q&A with citations.
Architecture Overview
Documents → Chunking → Embeddings → Vector DB → Retrieval → Reranking → LLM → Answer
Each component is designed with production reliability in mind; the sections below walk through those choices.
The Data Pipeline
Document Ingestion
- Unstructured.io handles complex PDFs with tables, images, and multi-column layouts
- Extracted text is cleaned and normalized before chunking
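For concreteness, here's a minimal ingestion sketch. The `partition` call is Unstructured's actual entry point; the `ingest` helper and the metadata fields are my simplifications, not the production code:

```python
# Minimal ingestion sketch using Unstructured's auto-partitioner.
from unstructured.partition.auto import partition

def ingest(path: str) -> list[dict]:
    """Parse one document into text elements tagged with source metadata."""
    elements = partition(filename=path)  # handles PDFs, tables, multi-column layouts
    return [
        {
            "text": el.text,
            "source": path,
            "page": getattr(el.metadata, "page_number", None),
        }
        for el in elements
        if el.text and el.text.strip()  # drop empty elements before chunking
    ]
```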
Chunking Strategy
After extensive testing, I settled on recursive character splitting (sketched after this list) with:
- 512 tokens per chunk
- 50-token overlap
- Preserved metadata: source file, page number, section headers
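Since LangChain is in the stack, this likely maps to its `RecursiveCharacterTextSplitter`; here's a sketch under that assumption (section-header extraction is elided):

```python
# Chunking sketch: LangChain's recursive splitter, measured in tokens via tiktoken.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512,    # tokens per chunk
    chunk_overlap=50,  # token overlap between consecutive chunks
)

def chunk_element(element: dict) -> list[dict]:
    """Split one parsed element, carrying its metadata onto every chunk."""
    return [
        {"text": piece, "source": element["source"], "page": element["page"]}
        for piece in splitter.split_text(element["text"])
    ]
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side.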
dbt Transformation Layer
One of the key production additions was a dbt transformation layer (illustrated after this list) for:
- Document metadata aggregation (chunks per doc, avg chunk length)
- Evaluation metrics computation (faithfulness scores over time)
- Data quality checks (null chunks, encoding issues)
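The dbt models themselves are SQL; to keep this post's examples in one language, here is the per-document aggregation one of them computes, expressed as equivalent pandas (column names are hypothetical):

```python
# Illustrative only: the real transformation is a dbt SQL model.
import pandas as pd

def doc_metadata(chunks: pd.DataFrame) -> pd.DataFrame:
    """One row per chunk in, one row per document out: chunk count and avg length."""
    return (
        chunks.assign(chunk_len=chunks["text"].str.len())
        .groupby("source")
        .agg(n_chunks=("text", "size"), avg_chunk_len=("chunk_len", "mean"))
        .reset_index()
    )
```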
Retrieval: Hybrid is the Answer
Pure semantic search misses exact keyword matches. Pure BM25 misses semantic similarity. I implemented hybrid retrieval (see the sketch after this list):
- Dense retrieval: Sentence-Transformers embeddings → Qdrant ANN search
- Sparse retrieval: BM25 keyword matching
- Fusion: Reciprocal Rank Fusion (RRF) to combine results
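A condensed sketch of all three legs. The collection name, embedding model, and Qdrant URL are placeholders, and the BM25 index is rebuilt per call only for brevity; a production system would persist it:

```python
# Hybrid retrieval sketch: dense ANN search in Qdrant + BM25, fused via RRF.
from qdrant_client import QdrantClient
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # embedding model (placeholder)
client = QdrantClient(url="http://localhost:6333")  # local Qdrant (placeholder)

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, corpus: dict[str, str], top_k: int = 50) -> list[str]:
    # Dense leg: ANN search over Sentence-Transformers embeddings
    dense_hits = client.search(
        collection_name="docs",
        query_vector=encoder.encode(query).tolist(),
        limit=top_k,
    )
    dense_ids = [str(hit.id) for hit in dense_hits]

    # Sparse leg: BM25 over a whitespace-tokenized corpus (tokenizer simplified)
    ids, texts = zip(*corpus.items())
    bm25 = BM25Okapi([t.lower().split() for t in texts])
    scores = bm25.get_scores(query.lower().split())
    sparse_ids = [ids[i] for i in sorted(range(len(ids)), key=lambda i: -scores[i])[:top_k]]

    # Fusion: combine the two rankings by rank position alone
    return rrf([dense_ids, list(sparse_ids)])[:top_k]
```

RRF only needs ranks, not comparable scores, which is exactly why it's a good fit for fusing BM25 scores with cosine similarities.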
Cross-Encoder Reranking
The hybrid retriever returns 50 candidates. A cross-encoder (ms-marco-MiniLM) reranks them to the final top 5, improving answer precision by 25%.
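A reranking sketch (I'm assuming the ms-marco-MiniLM-L-6-v2 checkpoint and an `(id, text)` candidate format; the post only specifies "ms-marco-MiniLM"):

```python
# Cross-encoder reranking: score every (query, passage) pair jointly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed variant

def rerank(query: str, candidates: list[tuple[str, str]], top_k: int = 5) -> list[tuple[str, str]]:
    """candidates are (id, text) pairs from the hybrid retriever."""
    scores = reranker.predict([(query, text) for _, text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_k]]
```

Unlike the bi-encoder used for retrieval, the cross-encoder attends over query and passage together, which is what buys the precision gain at the cost of only being able to score a small candidate set.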
Evaluation: RAGAS Framework
You can’t improve what you don’t measure. I integrated RAGAS for automated evaluation:
- Faithfulness: Does the answer stay grounded in context?
- Answer Relevance: Does it actually address the question?
- Context Precision: Are the retrieved chunks useful?
This runs nightly, catching regressions before they reach users.
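The nightly job boils down to something like this (RAGAS 0.1-era API, which calls an LLM judge under the hood; the sample record is invented purely to show the expected schema, and the real pipeline supplies its own evaluation set):

```python
# Nightly evaluation sketch with RAGAS.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# One invented record illustrating the required columns.
eval_set = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are issued within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
    "ground_truth": ["30 days from purchase."],
})

report = evaluate(eval_set, metrics=[faithfulness, answer_relevancy, context_precision])
print(report)  # per-metric scores; alert if any drops below its threshold
```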
Key Learnings
- Chunking is underrated — bad chunks = bad retrieval = bad answers
- Hybrid search beats pure dense for enterprise docs with technical terminology
- dbt for ML pipelines works surprisingly well for metrics and metadata management
Tech Stack
LangChain, Qdrant, dbt, Sentence-Transformers, FastAPI, RAGAS, Airflow