emSad - Speech activity detector

About the tool

What is it good for? What does it do?

The Speech Activity Detection (SAD) module splits audio files into three kinds of segments: speech, silence and noise. Speech activity detection is the first step preceding any further speech processing.

What is the input?

An audio file in .wav, .mp3 or .raw format. In case of a .raw file one has to provide the appropriate parameters (16 kHz, 16 bit, little endian).
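A .raw file carries no header, so the sample format has to be supplied from outside. As a rough illustration only (this is not part of emSad), the sketch below wraps headerless PCM data in a WAV container using Python's standard wave module. The function name raw_to_wav and the single-channel (mono) setting are assumptions; the text above only states 16 kHz and 16 bit little endian.

```python
import wave

SAMPLE_RATE = 16000   # 16 kHz, as stated above
SAMPLE_WIDTH = 2      # 16 bit = 2 bytes per sample
CHANNELS = 1          # assumption: mono; the channel count is not stated above


def raw_to_wav(raw_path, wav_path):
    """Copy headerless little-endian PCM samples into a WAV file."""
    with open(raw_path, "rb") as raw_file:
        pcm = raw_file.read()
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH)
        wav_file.setframerate(SAMPLE_RATE)
        # WAV stores 16-bit PCM as little endian, so no byte swap is needed.
        wav_file.writeframes(pcm)
```

Calling raw_to_wav("input.raw", "input.wav") would produce a file that can then be handled like any other .wav input.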

What is the output?

The module can create three kinds of output: a segment file in SHOUT format (listing the segments and their lengths), the audio file cut into segments, and three files that merge the segments by type: a merged speech file, a noise file and a silence file.
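To make the third kind of output more concrete, the sketch below groups segments by type and concatenates the audio of each group. It is an illustration, not the code in sad.py: the function name merge_by_type, the (start, duration, label) tuple layout and the use of the standard wave module are all assumptions.

```python
import wave
from collections import defaultdict


def merge_by_type(wav_path, segments):
    """Return {label: bytes} with the audio of all segments of a label joined.

    segments: iterable of (start_seconds, duration_seconds, label) tuples.
    """
    merged = defaultdict(bytes)
    with wave.open(wav_path, "rb") as wav_file:
        rate = wav_file.getframerate()
        for start, duration, label in segments:
            # Jump to the first frame of the segment, then read its frames.
            wav_file.setpos(round(start * rate))
            frames = wav_file.readframes(round(duration * rate))
            merged[label] += frames
    return dict(merged)
```

Each value in the returned dictionary could then be written out with wave.open in "wb" mode, using the same channel count, sample width and frame rate as the input, to obtain one merged file per segment type.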

An example.

Input: radio broadcast
Output:
SPEAKER SpeechNonSpeech 5 1.220 1.040 <NA> <NA> SPEECH <NA>
SPEAKER SpeechNonSpeech 5 2.260 3.950 <NA> <NA> SOUND <NA>
SPEAKER SpeechNonSpeech 5 6.210 0.750 <NA> <NA> SPEECH <NA>
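In the example output, each record has nine whitespace-separated fields; the fourth and fifth look like the start time and the length of the segment in seconds (1.220 + 1.040 gives the next start, 2.260), and the eighth is the segment type. Under that reading, a minimal parsing sketch could look as follows. The names Segment and parse_segments are hypothetical and not part of the tool.

```python
from collections import namedtuple

Segment = namedtuple("Segment", "start duration label")


def parse_segments(text):
    """Parse whitespace-separated segment records like the ones shown above."""
    fields = text.split()
    segments = []
    # Each record has 9 fields; this also copes with records joined on one line.
    for i in range(0, len(fields) - 8, 9):
        record = fields[i:i + 9]
        if record[0] != "SPEAKER":
            raise ValueError("unexpected record type: " + record[0])
        segments.append(Segment(float(record[3]), float(record[4]), record[7]))
    return segments


sample = """SPEAKER SpeechNonSpeech 5 1.220 1.040 <NA> <NA> SPEECH <NA>
SPEAKER SpeechNonSpeech 5 2.260 3.950 <NA> <NA> SOUND <NA>
SPEAKER SpeechNonSpeech 5 6.210 0.750 <NA> <NA> SPEECH <NA>"""

# Print the type, start and end time of every segment.
for seg in parse_segments(sample):
    print(seg.label, seg.start, seg.start + seg.duration)
```

Splitting on field count rather than on line breaks is a deliberate choice here, so that the sketch still works if the records arrive on a single line.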


For developers

Source https://github.com/juditacs/hunspeech/blob/master/speech_activity_detection/sad.py
Source language Python 3
Input .wav, .mp3, or any other audio format supported by the SoX (Sound Exchange) tool.
Output Two RTTM-compatible (Rich Transcription Time Marked) files created as the output of the SHOUT tool, and/or one audio file (.wav) per segment, and/or one merged audio file per segment type (.wav).
Execution python3 sad.py -i input.wav -m shout.sad (see also --help)
Licence GPL
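For scripted use, the command from the Execution line can be assembled in Python. This is only a sketch: the helper names build_sad_command and run_sad are hypothetical, only the -i and -m options shown above are used, the -m value is passed through unchanged because its meaning is not described here, and sad.py is assumed to be in the working directory.

```python
import subprocess


def build_sad_command(input_path, m_value="shout.sad", script="sad.py"):
    """Build the argument list for the command shown in the Execution line."""
    return ["python3", script, "-i", input_path, "-m", m_value]


def run_sad(input_path):
    """Run the detector on one file; raises CalledProcessError on failure."""
    subprocess.run(build_sad_command(input_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.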