Dataset Description

The VOiCES corpus is a collaboration between SRI International and Lab41, In-Q-Tel, presenting audio recorded in acoustically challenging conditions. Recordings took place in real rooms of various sizes, capturing different background and reverberation profiles for each room. Various types of distractor noise were simultaneously played with clean speech. Audio was recorded at a distance using various microphones placed throughout the room. To imitate human behavior during conversation, the foreground loudspeaker was placed on a motorized platform that rotated over a range of angles during recordings.

Three hundred distinct speakers from LibriSpeech’s “clean” data subset were selected as the source audio, ensuring a 50-50 female-male split. In preparation for upcoming data challenges, the first release of the VOiCES corpus will include 200 speakers only. The remaining 100 speakers will be reserved for model validation; the full corpus (300 speakers) will be released once the data challenge is closed.

Source audio references

Source audio references, per LibriSpeech, are provided in three different tables as follows:

Information on the speaker ID, book ID, and chapter ID

Lab41-SRI-VOiCES-speaker-book-chapter.tbl

Speaker ID, gender, and LibriSpeech data subset

Lab41-SRI-VOiCES-speaker-gender-dataset.tbl

Orthographic transcription of all audio files

Lab41-SRI-VOiCES.refs

Data format

Audio files are provided in WAV format at a 16 kHz sample rate with 16-bit precision. All file names begin with the corpus name Lab41-SRI-VOiCES. Source audio file names specify the speaker, chapter, and chapter segment identification numbers. The naming format is shown below:

Lab41-SRI-VOiCES-src-sp< speaker_ID >-ch< chapter_ID >-sg< segment_ID >.wav

The naming convention for audio recorded at a distance includes all the above information, with additional descriptors for room, distractor noise, microphone type, microphone location, and position of foreground loudspeaker in degrees. The file naming format is shown below:

Lab41-SRI-VOiCES-< room >-< distractor_noise >-sp< speaker_ID >-ch< chapter_ID >-sg< segment_ID >-mc< mic_ID >-< mic_type >-< mic_location >-dg< degree >.wav

Audio files to characterize the room response are also available:

Lab41-SRI-VOiCES-< room >-< signal >-mc< mic_ID >-< mic_type >-< mic_location >.wav

As are recordings of distractor noise only or ambient room background only:

Lab41-SRI-VOiCES-< distractor_noise >-mc< mic_ID >-< mic_type >-< mic_location >.wav
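
Before processing a file, it can be useful to confirm that it matches the stated format (16 kHz, 16-bit). The following is a minimal sketch using Python's standard wave module; the function name check_wav_format and its parameters are illustrative and not part of the corpus tooling.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_sample_width=2):
    """Return True if the WAV file has the expected sample rate (Hz)
    and sample width (bytes per sample; 2 bytes = 16-bit)."""
    with wave.open(str(path), "rb") as wav:
        return (
            wav.getframerate() == expected_rate
            and wav.getsampwidth() == expected_sample_width
        )
```

Calling check_wav_format on any file from the corpus should return True if the file follows the format described above.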


Possible descriptors for room, distractor noise, microphone type, and microphone location are shown in the table below.

File Code  Type              Definition
rm1        Room              Room-1: dimensions 146” x 107” (x 107” height)
rm2        Room              Room-2: dimensions 225” x 158” (x 109” height)
src        Source audio      Source audio for foreground speaker
none       Distractor noise  No distractor noise played
musi       Distractor noise  Music distractor noise played
tele       Distractor noise  Television distractor noise played
babb       Distractor noise  Babble distractor noise played
stu        Mic type          Cardioid dynamic studio microphone
lav        Mic type          Omnidirectional condenser lavalier microphone
clo        Mic location      Closest to foreground speaker - on table
mid        Mic location      Mid-distance to foreground speaker - on table
far        Mic location      Farthest from foreground speaker - on stand
beh        Mic location      Behind foreground speaker - on stand
cec        Mic location      Overhead on ceiling, clear
ceo        Mic location      Overhead on ceiling, fully obstructed
tbo        Mic location      Partially obstructed - table
wal        Mic location      Fully obstructed - wall
impulse    Signal            Two seconds with transient sound in middle, for room response
swoop      Signal            Rising tone for 20 seconds, for room response
tone       Signal            Steady tone for 15 seconds, for room response
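
The naming convention for distant recordings and the descriptor codes in the table above can be combined into a small file name parser. The sketch below is one possible approach, not an official tool: the name parse_distant_filename is hypothetical, the widths of the numeric IDs are not stated in this document (so any run of digits is accepted), and the pattern tolerates both sg and seg as the segment prefix as a precaution. The example file name is made up for illustration.

```python
import re

# Pattern for distant recordings. Numeric ID widths are an unknown here,
# so each ID is matched as one or more digits.
DISTANT_RE = re.compile(
    r"^Lab41-SRI-VOiCES"
    r"-(?P<room>rm\d+)"
    r"-(?P<distractor_noise>none|musi|tele|babb)"
    r"-sp(?P<speaker_id>\d+)"
    r"-ch(?P<chapter_id>\d+)"
    r"-se?g(?P<segment_id>\d+)"  # accepts "sg" or "seg" (assumption)
    r"-mc(?P<mic_id>\d+)"
    r"-(?P<mic_type>stu|lav)"
    r"-(?P<mic_location>clo|mid|far|beh|cec|ceo|tbo|wal)"
    r"-dg(?P<degree>\d+)"
    r"\.wav$"
)


def parse_distant_filename(name):
    """Return a dict of descriptor fields, or None if the name does not match."""
    match = DISTANT_RE.match(name)
    return match.groupdict() if match else None


# Illustrative file name only; the ID values are invented.
fields = parse_distant_filename(
    "Lab41-SRI-VOiCES-rm1-babb-sp0032-ch021625-sg0005-mc01-stu-clo-dg080.wav"
)
print(fields["room"], fields["mic_location"], fields["degree"])
```

All fields are returned as strings, so leading zeros in the IDs are preserved and can be compared directly against the reference tables.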


All the data is contained in two main folders: distant-16k, containing all the audio recordings, and source-16k, containing the audio files used from LibriSpeech, corrected for DC offset and normalized to each file’s peak amplitude. The WAV files for the source audio are organized in subdirectories by speaker ID. The distant-16k folder has three main subdirectories:

The directory hierarchy is shown below:


Microphone Details

Microphone identification numbers are unique to a specific microphone location and type, defined below.

Mic_ID  Location  Model       Type
01      clo       SHURE SM58  stu
02      clo       AKG 417L    lav
03      mid       SHURE SM58  stu
04      mid       AKG 417L    lav
05      far       SHURE SM58  stu
06      far       AKG 417L    lav
07      beh       SHURE SM58  stu
08      beh       AKG 417L    lav
09      tbo       AKG 417L    lav
10      cec       AKG 417L    lav
11      ceo       AKG 417L    lav
12      wal       SHURE SM11  lav
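
Because each Mic_ID implies a fixed location and type, the table above can serve as a consistency check on the descriptors found in a file name. The following sketch transcribes the table into a dictionary; the names MICROPHONES and mic_consistent are hypothetical and chosen for illustration.

```python
# Mic_ID -> (location, model, type), transcribed from the table above.
MICROPHONES = {
    "01": ("clo", "SHURE SM58", "stu"),
    "02": ("clo", "AKG 417L", "lav"),
    "03": ("mid", "SHURE SM58", "stu"),
    "04": ("mid", "AKG 417L", "lav"),
    "05": ("far", "SHURE SM58", "stu"),
    "06": ("far", "AKG 417L", "lav"),
    "07": ("beh", "SHURE SM58", "stu"),
    "08": ("beh", "AKG 417L", "lav"),
    "09": ("tbo", "AKG 417L", "lav"),
    "10": ("cec", "AKG 417L", "lav"),
    "11": ("ceo", "AKG 417L", "lav"),
    "12": ("wal", "SHURE SM11", "lav"),
}


def mic_consistent(mic_id, mic_type, mic_location):
    """Return True if the type and location agree with the Mic_ID's table entry."""
    entry = MICROPHONES.get(mic_id)
    if entry is None:
        return False
    location, _model, kind = entry
    return location == mic_location and kind == mic_type
```

For example, mic_consistent("01", "stu", "clo") holds, while a file name pairing mc02 with stu would be flagged as inconsistent.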


Distance (inches) between microphones and loudspeakers or floor, for Room-1 and Room-2 recordings.

        Foreground     Distractor 1   Distractor 2   Distractor 3   Floor
Mic_ID   rm-1   rm-2    rm-1   rm-2    rm-1   rm-2    rm-1   rm-2    rm-1   rm-2
01         38     80      71    112      71     84      53     64      42     39
02         38     80      71    112      71     84      53     64      42     39
03         72    131      35     81      56     58      52     95      42     39
04         72    131      35     81      56     58      52     95      42     39
05        119    228      72    101      33    104      83    186      70     70
06        119    228      72    101      33    104      83    186      70     70
07         29     29     115    193     133    170      94     94      70     70
08         29     29     115    193     133    170      94     94      70     70
09         58    109      64     98      60     65      49     82      28     25
10         75    128      90    107     108    103     106    104     105    105
11         75    128      90    107     108    103     106    104     106    106
12        130    116     861    116      40    115      81    164      12     10
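
The distances above are given in inches. As a worked example, the sketch below converts the foreground loudspeaker distances to metres and estimates the direct-path propagation delay. The speed of sound (343 m/s, roughly dry air at 20 °C) is an assumption and not a value given for these rooms; the names FOREGROUND_IN and direct_path are hypothetical.

```python
INCH_TO_M = 0.0254          # exact, by definition of the inch
SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at roughly 20 degrees C

# Mic_ID -> (Room-1, Room-2) distance to the foreground loudspeaker, in inches,
# transcribed from the "Foreground" columns of the table above.
FOREGROUND_IN = {
    "01": (38, 80), "02": (38, 80), "03": (72, 131), "04": (72, 131),
    "05": (119, 228), "06": (119, 228), "07": (29, 29), "08": (29, 29),
    "09": (58, 109), "10": (75, 128), "11": (75, 128), "12": (130, 116),
}


def direct_path(mic_id, room):
    """Return (distance in metres, direct-path delay in ms) for room 1 or 2."""
    inches = FOREGROUND_IN[mic_id][room - 1]
    metres = inches * INCH_TO_M
    return metres, 1000.0 * metres / SPEED_OF_SOUND_M_S


metres, delay_ms = direct_path("05", 2)
print(f"{metres:.2f} m, {delay_ms:.1f} ms")
```

Under these assumptions, the farthest microphone in Room-2 (mic 05, 228 inches) sits about 5.79 m from the foreground loudspeaker, which corresponds to a direct-path delay of roughly 17 ms.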

Licensing

VOiCES is publicly available under the Creative Commons BY 4.0 license, free for commercial, academic, and government use. Please cite VOiCES when using the data in publications.