
Retrieval Augmented Generation from YouTube for Long-Form QA

LangChain, LLMs, Python, RAG — Transform Video Data into Searchable Knowledge


📋 Project Overview

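Built a RAG pipeline that turns unstructured video into a searchable knowledge base: audio is transcribed with OpenAI Whisper, the transcript is split into chunks, each chunk is embedded with LLaMA, and the top-ranked chunks retrieved for a question are refined by GPT into an answer. Combining vector search with LLM-based reasoning keeps answers grounded in the retrieved transcript passages, which reduced hallucinations and improved factual grounding.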

Delivered a production-ready QA system that answers complex questions over multi-hour video content.
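
As an illustration of the chunking step, here is a minimal sketch, not the project's actual code. It assumes the transcript arrives as a list of segments with `start`, `end`, and `text` fields (the shape Whisper's segment output takes). The names `Chunk`, `chunk_segments`, `max_words`, and `overlap_segments` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    start: float  # seconds into the video where the chunk begins
    end: float    # seconds into the video where the chunk ends
    text: str


def _to_chunk(window):
    # Merge a run of consecutive segments into one chunk, keeping its time span.
    text = " ".join(seg["text"].strip() for seg in window)
    return Chunk(window[0]["start"], window[-1]["end"], text)


def chunk_segments(segments, max_words=200, overlap_segments=1):
    """Group consecutive segments into chunks of roughly max_words words.

    The last `overlap_segments` segments of each chunk are repeated at the
    start of the next one, so a sentence that straddles a boundary still
    appears whole in at least one chunk. (Hypothetical helper and defaults.)
    """
    chunks = []
    window = []
    fresh = 0  # segments in the window that were not carried over as overlap
    for seg in segments:
        window.append(seg)
        fresh += 1
        words = sum(len(s["text"].split()) for s in window)
        if words >= max_words:
            chunks.append(_to_chunk(window))
            window = window[-overlap_segments:] if overlap_segments > 0 else []
            fresh = 0
    if fresh > 0:
        # Emit whatever is left, including the carried-over overlap.
        chunks.append(_to_chunk(window))
    return chunks
```

Keeping `start` and `end` on every chunk means an answer can point back to the moment in the video it came from, which helps when the source runs for several hours.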

⚡ Key Highlights

  • Video-to-Knowledge Pipeline: Transcription, chunking, embedding, and ranked retrieval
  • OpenAI Whisper: High-quality transcription for video content
  • LLaMA Embeddings: Dense vector representations for semantic search
  • GPT Refinement: LLM-based reasoning for grounded answers
  • Reduced Hallucinations: Vector search + LLM reasoning for factual grounding
  • Production-Ready: Answers complex questions over multi-hour video content
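
The retrieval and grounding steps listed above can be sketched roughly as follows. This is an illustrative example under stated assumptions, not the project's implementation: `embed` is a toy bag-of-words stand-in for the LLaMA embedding model, and `cosine`, `retrieve`, and `build_prompt` are hypothetical helper names. The call that would send the prompt to GPT is omitted.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a dense embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k (score, chunk) pairs most similar to the question."""
    q = embed(question)
    scored = [(cosine(q, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question, ranked):
    """Assemble a prompt that restricts the LLM to the retrieved excerpts."""
    context = "\n".join(f"[{i + 1}] {text}" for i, (_, text) in enumerate(ranked))
    return (
        "Answer using only the numbered transcript excerpts below. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


chunks = [
    "whisper transcribes the audio track",
    "embeddings map text to vectors",
    "the host talks about coffee",
]
question = "how is audio transcribed by whisper"
prompt = build_prompt(question, retrieve(question, chunks, k=2))
```

Telling the model to answer only from the numbered excerpts is one common way to get the factual grounding described above. In a real pipeline the toy `embed` would be replaced by calls to the embedding model and a vector index.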

Skills Demonstrated

LangChain, RAG, LLMs, OpenAI Whisper, GPT, LLaMA, Vector Search, Python, NLP

More details and images coming soon.