
Building an AI Profanity Filter with Vocal Separation
I built an online tool that automatically detects and bleeps profanity in video and audio files. Here's the high-level architecture.

The problem

Censoring profanity by hand takes 45+ minutes for a 10-minute video. You have to listen all the way through, find each word, razor (cut) the audio at that spot, and drop in a beep effect. For songs, it's nearly impossible to do without destroying the music.

The solution

AI speech recognition plus neural vocal separation.

How it works

1. The user uploads a file or pastes a YouTube URL.
2. The audio is extracted with FFmpeg.
3. AI speech-to-text (AssemblyAI / Deepgram) transcribes the audio.
4. Profanity is detected using morphological analysis (lemmatization).
5. Each flagged word is replaced with a beep, silence, or a custom sound via FFmpeg.
6. For songs, Demucs first separates the vocals from the instruments.

Song mode: the hard part

Demucs, from Meta AI, does the heavy lifting: it splits the audio into a vocal track and an instrumental track. Profanity detection runs only on the vocal track, and the censored vocals are then mixed back with the original instrum…
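The detection and replacement steps can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code. It assumes the speech-to-text output has already been normalized into word records with start and end times in seconds (AssemblyAI and Deepgram each return their own response formats). The `Word` record, the placeholder `BLOCKLIST`, and the helper names are hypothetical, and `lemma` is a crude suffix-stripping stand-in for a real lemmatizer, not the morphological analysis described above. Only silence mode is shown: each padded span is muted with FFmpeg's `volume` filter, gated by a timeline `enable` expression.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word record: text plus start/end time in seconds."""
    text: str
    start: float
    end: float


# Placeholder blocklist of base forms; a real list would be far larger.
BLOCKLIST = {"darn", "heck"}

# Crude suffix stripping, standing in for a real lemmatizer.
SUFFIXES = ("ing", "ed", "s")


def lemma(token: str) -> str:
    """Lowercase, trim punctuation, and strip one common suffix."""
    w = token.lower().strip(".,!?;:\"'")
    for suffix in SUFFIXES:
        # Keep at least 3 characters so words like "thing" survive intact.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def censor_spans(words: list[Word], pad: float = 0.05) -> list[tuple[float, float]]:
    """Return padded (start, end) spans for blocked words, merging overlaps.

    Assumes `words` is sorted by start time.
    """
    spans: list[tuple[float, float]] = []
    for w in words:
        if lemma(w.text) not in BLOCKLIST:
            continue
        start, end = max(0.0, w.start - pad), w.end + pad
        if spans and start <= spans[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans


def silence_filter(spans: list[tuple[float, float]]) -> str:
    """Build an FFmpeg audio filter that mutes every span (silence mode)."""
    if not spans:
        return "anull"  # pass-through filter when there is nothing to censor
    expr = "+".join(f"between(t,{s:.3f},{e:.3f})" for s, e in spans)
    # The quotes keep the commas inside between() away from the filtergraph parser.
    return f"volume=enable='{expr}':volume=0"


def ffmpeg_silence_cmd(src: str, dst: str, spans: list[tuple[float, float]]) -> list[str]:
    """Argument list for subprocess.run; no shell quoting is needed."""
    return ["ffmpeg", "-y", "-i", src, "-af", silence_filter(spans), dst]


words = [
    Word("Well", 0.00, 0.30),
    Word("darned", 0.35, 0.80),
    Word("thing", 0.85, 1.20),
    Word("heck!", 2.00, 2.40),
]
print(silence_filter(censor_spans(words)))
# volume=enable='between(t,0.300,0.850)+between(t,1.950,2.450)':volume=0
```

Padding each word by a few tens of milliseconds hides clipped consonants at the edges, and merging overlapping spans keeps the filter expression short when flagged words come back to back. A beep or custom sound would be overlaid on the same spans. That would need a second audio source and a mixing step, which this sketch leaves out.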
Continue reading on Dev.to Python