El‑Mal El‑Halal Podcast Semantic Search

Objective

Build a reproducible pipeline that converts every episode of the popular Arabic podcast El‑Mal El‑Halal into a searchable knowledge base. The system should fetch audio automatically, transcribe speech to text, create semantic embeddings, and provide sub‑second similarity search with deep‑links to the original audio.

Approach

Data acquisition: Episodes are downloaded directly from the RSS feed using yt‑dlp with automatic MP3 extraction.
Audio preprocessing: Files are normalized and converted to WAV through PyDub + FFmpeg, then chunked for efficient ASR.
Transcription: Whisper‑large‑v3 (multilingual) is forced into Arabic decoding to produce high‑quality transcripts.
Cleaning: A short prompt to ALLaM-AI/ALLaM-7B-Instruct-preview removes filler words ("uhh"), speaker tags, and Whisper hallucinations before indexing.
Segmentation & embedding: Transcripts are sentence‑segmented with NLTK, grouped into paragraphs of 3 to 4 sentences, and embedded with the Sentence‑Transformers model omarelshehy/Arabic‑STS‑Matryoshka‑V2 (768‑D vectors).
Indexing & search: Paragraph vectors are L2‑normalized and inserted into a FAISS IndexFlatIP for cosine similarity search; metadata is stored in a Hugging Face Dataset alongside paragraph text and start‑time deep‑links.
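The indexing step relies on the fact that the inner product of two L2‑normalized vectors equals their cosine similarity. The sketch below illustrates that idea in pure Python with toy 3‑D vectors; it is a stand‑in for a flat inner‑product index, not the project's FAISS code, and the class name, metadata fields, and sample vectors are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class FlatInnerProductIndex:
    """Brute-force inner-product index (illustrative stand-in, not FAISS)."""

    def __init__(self):
        self.vectors = []
        self.metadata = []

    def add(self, vec, meta):
        # Normalizing at insert time makes inner product equal cosine similarity.
        self.vectors.append(l2_normalize(vec))
        self.metadata.append(meta)

    def search(self, query, k=3):
        q = l2_normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), m)
            for v, m in zip(self.vectors, self.metadata)
        ]
        # Highest similarity first; sort on the score only.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Toy data: hypothetical paragraph vectors with episode and start-time metadata.
index = FlatInnerProductIndex()
index.add([1.0, 0.0, 0.0], {"episode": 1, "start_s": 12})
index.add([0.0, 2.0, 0.0], {"episode": 1, "start_s": 95})
index.add([3.0, 3.0, 0.0], {"episode": 2, "start_s": 40})

results = index.search([2.0, 0.1, 0.0], k=2)
for score, meta in results:
    print(round(score, 3), meta)
```

Because every stored vector has unit length, vector magnitude does not affect ranking: the second and third toy vectors are scored purely by direction.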

Results & Metrics

Episodes processed: 18 (≈ 20 hrs audio)
ASR segments: 13 970
Paragraph embeddings: ≈ 7 652
Median search latency: 50 ms on CPU
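A median latency figure like the one above can be obtained by timing repeated queries and taking the median. The following is a minimal sketch of such a measurement, not the procedure actually used for this project; `median_latency_ms`, `dummy_search`, and the sample queries are hypothetical placeholders for the real search call.

```python
import statistics
import time


def median_latency_ms(search_fn, queries, repeats=5):
    """Time each query several times and return the median in milliseconds."""
    timings = []
    for query in queries:
        for _ in range(repeats):
            t0 = time.perf_counter()
            search_fn(query)
            timings.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(timings)


def dummy_search(query):
    # Placeholder workload standing in for embedding + index lookup.
    return sorted(query)


latency = median_latency_ms(dummy_search, ["budget", "savings", "investing"])
print(f"median latency: {latency:.4f} ms")
```

The median is less sensitive than the mean to occasional slow calls, such as the first query after a model is loaded.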

How It Works

[Audio RSS]
    │   yt‑dlp
    ▼
(MP3 files)
    │   PyDub / FFmpeg  (mono • 16 kHz)
    ▼
{Silence‑based chunks}
    │   Whisper‑large‑v3‑turbo  (ar)
    ▼
[Raw transcripts]
    │   ALLaM‑7B cleanup
    ▼
[Polished sentences]
    │   NLTK  +  heuristics
    ▼
[Paragraphs  (semantic similarity)]
    │   Sentence‑Transformers  (omarelshehy/Arabic‑STS‑Matryoshka‑V2)
    ▼
(768‑D embeddings)
    │   FAISS FlatIP
    ▼
(Index + metadata)
    │   Gradio UI
    ▼
Semantic search  &  deep‑links
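The paragraph and deep‑link stages can be illustrated with a small sketch. It assumes each sentence carries a start time in seconds, groups sentences into fixed‑size paragraphs (a simplification of the similarity‑based heuristics named in the diagram), and builds links with a `#t=` temporal fragment. The `Sentence` type, the function names, and the example URL are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    start_s: float  # start time of the sentence within the episode, in seconds


def group_into_paragraphs(sentences, size=4):
    """Group consecutive sentences into paragraphs of at most `size` sentences."""
    paragraphs = []
    for i in range(0, len(sentences), size):
        chunk = sentences[i:i + size]
        paragraphs.append({
            "text": " ".join(s.text for s in chunk),
            # A paragraph starts where its first sentence starts.
            "start_s": chunk[0].start_s,
        })
    return paragraphs


def deep_link(audio_url, start_s):
    """Append a temporal fragment so a player can seek to the paragraph start."""
    return f"{audio_url}#t={int(start_s)}"


# Toy input: seven sentences spaced five seconds apart.
sentences = [Sentence(f"Sentence {n}.", start_s=n * 5.0) for n in range(7)]
paragraphs = group_into_paragraphs(sentences, size=4)
links = [
    deep_link("https://example.com/episode-01.mp3", p["start_s"])
    for p in paragraphs
]
```

Storing the start time alongside each paragraph is what lets a search hit jump straight to the matching moment in the audio.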

Live Demo & Repository

Live Demo · Dataset Card

Technical Skills

PyTorch, Transformers, Whisper, SentenceTransformers, FAISS, Hugging Face Datasets, NLP, ASR

Learnings/Takeaways

- Whisper provides strong Arabic ASR out‑of‑the‑box.
- Paragraph‑level embeddings balance context and granularity, boosting retrieval quality.
- Pairing FAISS with lightweight metadata yields production‑ready search on commodity hardware.
- Meticulous cleaning of ASR output (batch post‑processing + LLM enhancement) significantly reduces WER and improves user trust.
