emDia - Speaker diariser
About the tool
What is it good for? What does it do?
When analysing a recording with multiple speakers, the program emDia can tell "who is speaking when". This is called speaker diarisation. It distinguishes speech from non-speech sounds and recognises when speakers take turns.
What is the input?
A sound file (e.g. in .wav or .mp3 format).
What is the output?
A text file conforming to the standard format used in this field, RTTM (Rich Transcription Time Marked), which lists, row by row, which speaker is talking in each section of the recording. The algorithm only detects speaker changes, not speaker identities.
An excerpt from an output file (a speaker change at the 47th second of the recording, where a new speaker takes the turn):
SPEAKER SpeechNonSpeech 1 46.670 0.300 <NA> <NA> SPK01 <NA>
SPKR-INFO SpeechNonSpeech 1 <NA> <NA> <NA> unknown SPK16 <NA>
SPEAKER SpeechNonSpeech 1 46.970 2.220 <NA> <NA> SPK16 <NA>
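Rows like these can be read with a few lines of Python. The following is a minimal sketch, not part of emDia: the names `Turn`, `parse_rttm_turns` and `speaker_changes` are hypothetical, and the field positions (start time in column 4, duration in column 5, speaker label in column 8) are assumed from the excerpt above. Only `SPEAKER` rows are read; `SPKR-INFO` rows are skipped.

```python
from typing import List, NamedTuple, Tuple


class Turn(NamedTuple):
    start: float      # start time in seconds
    duration: float   # length of the segment in seconds
    speaker: str      # anonymous label such as "SPK01"


def parse_rttm_turns(text: str) -> List[Turn]:
    """Collect the SPEAKER rows of an RTTM text as Turn records."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, SPKR-INFO rows and anything too short to parse.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        turns.append(Turn(float(fields[3]), float(fields[4]), fields[7]))
    return turns


def speaker_changes(turns: List[Turn]) -> List[Tuple[float, str, str]]:
    """Return (time, previous speaker, new speaker) for every change of label."""
    changes = []
    previous = None
    for turn in turns:
        if previous is not None and turn.speaker != previous:
            changes.append((turn.start, previous, turn.speaker))
        previous = turn.speaker
    return changes


sample = """\
SPEAKER SpeechNonSpeech 1 46.670 0.300 <NA> <NA> SPK01 <NA>
SPKR-INFO SpeechNonSpeech 1 <NA> <NA> <NA> unknown SPK16 <NA>
SPEAKER SpeechNonSpeech 1 46.970 2.220 <NA> <NA> SPK16 <NA>
"""

turns = parse_rttm_turns(sample)
changes = speaker_changes(turns)
```

Because the labels only mark that the speaker changed, the sketch compares neighbouring labels and does not try to attach names to them.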
|Input|.wav, .mp3, or any other audio format supported by the SoX (Sound Exchange) tool.|
|Output|Two RTTM-compatible files produced by the SHOUT tool, containing the speech/silence/noise segmentation and the audio segments assigned to the respective speakers.|
python em-dia.py [-h] [-m SHOUT_MODEL] [-s SAD_FN] input_fn output_dir shout_dir
The arguments are described in the output of the em-dia.py --help command.
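For scripted use, the command line shown above can be assembled from Python. This is a minimal sketch under stated assumptions: `build_emdia_command` and `run_emdia` are hypothetical helpers, `em-dia.py` is assumed to be in the current working directory, and the values passed to `-m` and `-s` are placeholders whose exact meaning should be checked with `--help`.

```python
import subprocess
import sys
from typing import List, Optional


def build_emdia_command(
    input_fn: str,
    output_dir: str,
    shout_dir: str,
    shout_model: Optional[str] = None,
    sad_fn: Optional[str] = None,
) -> List[str]:
    """Build the argument list in the order given by the usage line."""
    cmd = [sys.executable, "em-dia.py"]
    if shout_model is not None:
        cmd += ["-m", shout_model]   # optional SHOUT_MODEL argument
    if sad_fn is not None:
        cmd += ["-s", sad_fn]        # optional SAD_FN argument
    cmd += [input_fn, output_dir, shout_dir]
    return cmd


def run_emdia(*args, **kwargs) -> None:
    """Run emDia and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_emdia_command(*args, **kwargs), check=True)


# Hypothetical paths, for illustration only.
cmd = build_emdia_command("meeting.wav", "out", "shout", sad_fn="meeting.sad.rttm")
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.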