emDia - Speaker diariser

About the tool

What is it good for? What does it do?

When analysing a recording that contains multiple speakers, the program emDia can tell "who is speaking when". This is called speaker diarisation. The program distinguishes speech from non-speech and recognises when speakers take turns.

What is the input?

A sound file (e.g. in .wav or .mp3 format).

What is the output?

A text file conforming to the standard format used in this field, RTTM (Rich Transcription Time Marked), which lists, row by row, which speaker is talking in a given section of the recording. The algorithm only detects speaker changes, not speaker identities.

An example:

An excerpt from an output file (a speaker change at second 47 of the recording, where a new speaker takes the turn):

SPEAKER SpeechNonSpeech 1 46.670 0.300 <NA> <NA> SPK01 <NA>
SPKR-INFO SpeechNonSpeech 1 <NA> <NA> <NA> unknown SPK16 <NA>
SPEAKER SpeechNonSpeech 1 46.970 2.220 <NA> <NA> SPK16 <NA>
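To illustrate how the output can be consumed, here is a minimal, hypothetical Python sketch (not part of emDia itself) that reads SPEAKER rows from an RTTM file into (speaker, onset, duration) tuples, assuming the field layout shown in the excerpt above:

```python
def parse_rttm(lines):
    """Return (speaker, onset, duration) tuples for SPEAKER rows.

    Field positions follow the RTTM layout in the excerpt above:
    row type, recording ID, channel, onset (s), duration (s),
    two <NA> placeholders, speaker label, <NA>.
    """
    segments = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip SPKR-INFO and other row types
        onset = float(fields[3])
        duration = float(fields[4])
        speaker = fields[7]
        segments.append((speaker, onset, duration))
    return segments


# The three example rows from above:
example = """\
SPEAKER SpeechNonSpeech 1 46.670 0.300 <NA> <NA> SPK01 <NA>
SPKR-INFO SpeechNonSpeech 1 <NA> <NA> <NA> unknown SPK16 <NA>
SPEAKER SpeechNonSpeech 1 46.970 2.220 <NA> <NA> SPK16 <NA>
"""

print(parse_rttm(example.splitlines()))
# → [('SPK01', 46.67, 0.3), ('SPK16', 46.97, 2.22)]
```

Note how the speaker change at second 47 appears in the parsed data: SPK01's segment ends at 46.670 + 0.300 = 46.970, exactly where SPK16's segment begins.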


For developers:

Source https://github.com/juditacs/hunspeech/blob/master/speaker_diarization/em-dia.py
Source language Python
Input .wav, .mp3, or any other audio format supported by the SoX (Sound eXchange) tool.
Output Two RTTM-compatible files, created from the output of the SHOUT tool: one containing speech/silence/noise information, the other the audio segments assigned to the respective speakers.
Execution python em-dia.py [-h] [-m SHOUT_MODEL] [-s SAD_FN] input_fn output_dir shout_dir
The meaning of each argument can be displayed by running python em-dia.py --help.
Licence GPL