El‑Mal El‑Halal Arabic Podcast Semantic Search
Date: August 2025
Objective
Build a reproducible pipeline that converts every episode of the popular Arabic podcast El‑Mal El‑Halal into a searchable knowledge base. The system should fetch audio automatically, transcribe speech to text, create semantic embeddings, and provide sub‑second similarity search with deep‑links to the original audio.
Approach
Data acquisition: Episodes are downloaded directly from the RSS feed using yt‑dlp with automatic MP3 extraction.
Audio preprocessing: Files are normalized and converted to WAV through PyDub + FFmpeg, then chunked for efficient ASR.
Transcription: Whisper‑large‑v3 (multilingual) is forced into Arabic decoding to produce high‑quality transcripts.
Cleaning: A short prompt to ALLaM-AI/ALLaM-7B-Instruct-preview removes filler words, speaker tags, hesitations such as “uhh”, and Whisper hallucinations before indexing.
Segmentation & embedding: Transcripts are sentence‑segmented with NLTK, grouped into 3‑4 sentence paragraphs, and embedded with the Sentence‑Transformers model omarelshehy/Arabic‑STS‑Matryoshka‑V2 (768‑dimensional vectors).
Indexing & search: Paragraph vectors are L2‑normalized and inserted into a FAISS IndexFlatIP for cosine similarity search; metadata is stored in a Hugging Face Dataset alongside paragraph text and start‑time deep‑links.
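The indexing step works because the inner product of two unit‑length vectors equals their cosine similarity. The following is a minimal sketch of that idea, assuming `faiss` and `numpy` are installed; the helper names (`l2_normalize`, `build_index`, `search`) are illustrative and are not taken from the project code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def inner_product(a, b):
    """Plain dot product; on unit vectors this is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def build_index(embeddings):
    """Build a FAISS IndexFlatIP over L2-normalized paragraph embeddings.

    `embeddings` is assumed to be array-like with shape (n_paragraphs, dim).
    """
    import faiss  # imported lazily so the pure-Python helpers work without it
    import numpy as np

    # Copy into a C-contiguous float32 array so the caller's data is untouched.
    vectors = np.array(embeddings, dtype="float32", order="C")
    faiss.normalize_L2(vectors)  # in-place, row-wise normalization
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index


def search(index, query_embedding, k=5):
    """Return (paragraph_id, cosine_score) pairs for the k nearest paragraphs."""
    import faiss
    import numpy as np

    query = np.array([query_embedding], dtype="float32", order="C")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```

The returned paragraph ids can then be used as row positions into the metadata table to recover the paragraph text and its start time.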
Results & Metrics
- Episodes processed: 18 (≈ 20 hrs audio)
- ASR segments: 13 970
- Paragraph embeddings: ≈ 7652
- Median search latency: 50 ms on CPU
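A median latency figure like the one above can be measured with a small timing harness. This is a sketch under assumptions: `search_fn` stands for any callable that takes a query string and runs one search, and the warm‑up count is an arbitrary choice; neither comes from the project code.

```python
import statistics
import time


def median_latency_ms(search_fn, queries, warmup=1):
    """Return the median wall-clock time per query in milliseconds."""
    # Warm-up calls are not timed, so one-off setup costs do not skew the result.
    for query in queries[:warmup]:
        search_fn(query)

    timings = []
    for query in queries:
        started = time.perf_counter()
        search_fn(query)
        timings.append((time.perf_counter() - started) * 1000.0)
    return statistics.median(timings)
```

The median is less sensitive than the mean to an occasional slow query, which is why it is a common choice for reporting search latency.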
How It Works
[Audio RSS]
│ yt‑dlp
▼
(MP3 files)
│ PyDub / FFmpeg (mono • 16 kHz)
▼
{Silence‑based chunks}
│ Whisper‑large‑v3‑turbo (ar)
▼
[Raw transcripts]
│ ALLaM‑7B cleanup
▼
[Polished sentences]
│ NLTK + heuristics
▼
[Paragraphs (semantic similarity)]
│ Sentence‑Transformers (omarelshehy/Arabic‑STS‑Matryoshka‑V2)
▼
(768‑D embeddings)
│ FAISS FlatIP
▼
(Index + metadata)
│ Gradio UI
▼
Semantic search & deep‑links
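The step from polished sentences to paragraphs with deep‑links can be sketched as follows. This is an illustration only: fixed‑size grouping stands in for the project's actual heuristics, each sentence is assumed to be a dict with `text` and `start` (seconds) keys, and the `#t=` media‑fragment link format is an assumption about how the deep‑links are built.

```python
def group_into_paragraphs(sentences, audio_url, size=4):
    """Group consecutive sentences into paragraphs of at most `size` sentences.

    Each paragraph keeps the start time of its first sentence so that a search
    hit can link back to the matching position in the audio.
    """
    paragraphs = []
    for i in range(0, len(sentences), size):
        chunk = sentences[i:i + size]
        start = chunk[0]["start"]
        paragraphs.append({
            "text": " ".join(s["text"].strip() for s in chunk),
            "start": start,
            # Hypothetical deep-link format: media fragment in whole seconds.
            "deep_link": f"{audio_url}#t={int(start)}",
        })
    return paragraphs
```

With this simple grouping the last paragraph of an episode may hold fewer sentences than the others.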
Project Visualizations
Live Demo & Repository
Technical Skills
Learnings/Takeaways
- Whisper provides strong Arabic ASR out‑of‑the‑box.
- Paragraph‑level embeddings balance context and granularity, boosting retrieval quality.
- Pairing FAISS with lightweight metadata yields production‑ready search on commodity hardware.
- Meticulous cleaning of ASR output (batch post‑processing + LLM enhancement) significantly reduces WER and improves user trust.
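For the ASR takeaways above, forcing Arabic decoding and keeping segment timestamps might look like the sketch below. It assumes the `openai-whisper` package; the default model name is a placeholder (this write‑up mentions both large‑v3 and large‑v3‑turbo), and `segments_to_rows` is an illustrative helper rather than project code.

```python
def transcribe_arabic(audio_path, model_name="large-v3"):
    """Transcribe one audio file with the decoding language fixed to Arabic."""
    import whisper  # openai-whisper, imported lazily

    model = whisper.load_model(model_name)
    # language="ar" skips language detection; task="transcribe" avoids translation.
    return model.transcribe(audio_path, language="ar", task="transcribe")


def segments_to_rows(result, episode_id):
    """Flatten a Whisper result into rows, dropping empty segments."""
    rows = []
    for seg in result.get("segments", []):
        text = seg["text"].strip()
        if not text:
            continue
        rows.append({
            "episode": episode_id,
            "start": float(seg["start"]),
            "end": float(seg["end"]),
            "text": text,
        })
    return rows
```

Keeping `start` and `end` on every row is what later allows a search hit to point back to a position in the audio.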