The VOiCES corpus is a collaboration between SRI International and Lab41, In-Q-Tel, presenting audio recorded in acoustically challenging conditions. Recordings took place in real rooms of various sizes, capturing different background and reverberation profiles for each room. Various types of distractor noise were simultaneously played with clean speech. Audio was recorded at a distance using various microphones placed throughout the room. To imitate human behavior during conversation, the foreground loudspeaker was placed on a motorized platform that rotated over a range of angles during recordings.
Three hundred distinct speakers from LibriSpeech’s “clean” data subset were selected as the source audio, ensuring a 50-50 female-male split. In preparation for upcoming data challenges, the first release of the VOiCES corpus will include 200 speakers only. The remaining 100 speakers will be reserved for model validation; the full corpus (300 speakers) will be released once the data challenge is closed.
Source audio references
Source audio references, per LibriSpeech, are provided in three different tables as follows:
- Information on the speaker ID, book ID, and chapter ID
- Speaker ID, gender, and LibriSpeech data subset
- Orthographic transcription of all audio files
Audio files are available in WAV format with a 16 kHz sample rate and 16-bit precision. All file names begin with the corpus name Lab41-SRI-VOiCES. Source audio file names specify the speaker, chapter, and chapter segment identification numbers. The file naming format is shown below:
Lab41-SRI-VOiCES-src-sp< speaker_ID >-ch< chapter_ID >-sg< segment_ID >.wav
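Because every file is described as 16 kHz, 16-bit WAV, a quick format check can catch a truncated or resampled copy before it reaches a training pipeline. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav_format` and its default arguments are illustrative and not part of the corpus tooling.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_bytes=2):
    """Compare a WAV file's header against the documented corpus format.

    Returns (ok, details). `expected_bytes=2` corresponds to 16-bit samples.
    """
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()  # bytes per sample
        channels = wav.getnchannels()
        frames = wav.getnframes()

    details = {
        "sample_rate_hz": rate,
        "bits_per_sample": 8 * width,
        "channels": channels,
        "duration_s": frames / rate if rate else 0.0,
    }
    ok = rate == expected_rate and width == expected_bytes
    return ok, details
```

The channel count is reported but not checked, since the text above does not state it.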
The naming convention for audio recorded at a distance includes all the above information, with additional descriptors for room, distractor noise, microphone type, microphone location, and position of foreground loudspeaker in degrees. The file naming format is shown below:
Lab41-SRI-VOiCES-< room >-< distractor_noise >-sp< speaker_ID >-ch< chapter_ID >-sg< segment_ID >-mc< mic_ID >-< mic_type >-< mic_location >-dg< degree >.wav
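Since all recording conditions are encoded in the file name, they can be recovered without a separate metadata file. The sketch below splits a distant-recording name into its fields with a regular expression. It assumes that the numeric IDs consist only of digits and that no descriptor contains a hyphen; as a precaution it accepts either `sg` or `seg` as the segment tag. The function name `parse_distant_name` and the field names are illustrative.

```python
import re
from pathlib import Path
from typing import Optional

# One named group per field of the distant-recording naming convention.
DISTANT_NAME = re.compile(
    r"^Lab41-SRI-VOiCES"
    r"-(?P<room>[^-]+)"
    r"-(?P<distractor>[^-]+)"
    r"-sp(?P<speaker_id>\d+)"
    r"-ch(?P<chapter_id>\d+)"
    r"-se?g(?P<segment_id>\d+)"  # tolerate "sg" or "seg"
    r"-mc(?P<mic_id>\d+)"
    r"-(?P<mic_type>[^-]+)"
    r"-(?P<mic_location>[^-]+)"
    r"-dg(?P<degree>\d+)"
    r"\.wav$"
)


def parse_distant_name(filename: str) -> Optional[dict]:
    """Return the fields of a distant-recording file name, or None if no match.

    Values are kept as strings so that any zero padding is preserved.
    """
    match = DISTANT_NAME.match(Path(filename).name)
    return match.groupdict() if match else None


# Hypothetical file name, made up for illustration only.
example = "Lab41-SRI-VOiCES-rm1-babb-sp0032-ch021625-sg0005-mc01-stu-clo-dg080.wav"
fields = parse_distant_name(example)
# fields["room"] == "rm1", fields["mic_location"] == "clo", fields["degree"] == "080"
```

Names that follow one of the other conventions, such as source audio, do not match and yield `None`.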
Audio files to characterize the room response are also available:
Lab41-SRI-VOiCES-< room >-< signal >-mc< mic_ID >-< mic_type >-< mic_location >.wav
Recordings of distractor noise only, or of ambient room background only, are also available:
Lab41-SRI-VOiCES-< distractor_noise >-mc< mic_ID >-< mic_type >-< mic_location >.wav
Possible descriptors for room, distractor noise, microphone type, and microphone location are shown in the table below.
| Descriptor | Category | Description |
|---|---|---|
| rm1 | Room | Room-1: dimensions 146” x 107” (x 107” height) |
| rm2 | Room | Room-2: dimensions 225” x 158” (x 109” height) |
| src | Source audio | Source audio for foreground speaker |
| none | Distractor noise | No distractor noise played |
| musi | Distractor noise | Music distractor noise played |
| tele | Distractor noise | Television distractor noise played |
| babb | Distractor noise | Babble distractor noise played |
| stu | Mic type | Cardioid dynamic studio microphone |
| lav | Mic type | Omnidirectional condenser lavalier microphone |
| clo | Mic location | Closest to foreground speaker, on table |
| mid | Mic location | Mid-distance to foreground speaker, on table |
| far | Mic location | Farthest from foreground speaker, on stand |
| beh | Mic location | Behind foreground speaker, on stand |
| cec | Mic location | Overhead on ceiling, clear |
| ceo | Mic location | Overhead on ceiling, fully obstructed |
| tbo | Mic location | Partially obstructed, on table |
| wal | Mic location | Fully obstructed, on wall |
| impulse | Signal | Two seconds with transient sound in middle, for room response |
| swoop | Signal | Rising tone for 20 seconds, for room response |
| tone | Signal | Steady tone for 15 seconds, for room response |
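The descriptor codes in the table can also serve as a sanity check on fields extracted from file names. The following sketch stores the room, distractor, microphone, and signal codes as sets and reports any field whose value is not listed. The dictionary keys (`room`, `distractor`, `mic_type`, `mic_location`, `signal`) and the function name `unknown_codes` are illustrative choices, not names defined by the corpus.

```python
# Descriptor codes copied from the table, grouped by category.
KNOWN_CODES = {
    "room": {"rm1", "rm2"},
    "distractor": {"none", "musi", "tele", "babb"},
    "mic_type": {"stu", "lav"},
    "mic_location": {"clo", "mid", "far", "beh", "cec", "ceo", "tbo", "wal"},
    "signal": {"impulse", "swoop", "tone"},
}


def unknown_codes(fields: dict) -> dict:
    """Return the subset of `fields` whose value is not a listed code.

    Keys that are not descriptor categories (for example numeric IDs)
    are ignored rather than reported.
    """
    return {
        key: value
        for key, value in fields.items()
        if key in KNOWN_CODES and value not in KNOWN_CODES[key]
    }


# An empty result means every descriptor was recognized.
print(unknown_codes({"room": "rm1", "distractor": "babb", "mic_type": "stu"}))
```

A non-empty result may simply indicate a later release with additional rooms or microphones, so it is better treated as a warning than as an error.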
All the data is contained in two main folders: distant-16k, which contains all the audio recordings, and source-16k, which contains the audio files used from LibriSpeech, corrected for DC offset and normalized to each file’s peak amplitude. The WAV files for the source audio are organized in subdirectories by speaker ID. The distant-16k folder has three main subdirectories:
- distractors : distractor noise recordings with no foreground audio for all rooms
- room-response : recorded sound to determine room-response for all rooms
- speech : for each room, recordings of foreground audio with babble, music, television or no distractor noise, arranged by speaker ID in each subfolder.
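As a quick check that a download is complete, the WAV files under each of the three distant-16k subdirectories can be counted. The sketch below assumes only the folder names given above; `corpus_root` is a hypothetical path to wherever the corpus was unpacked, and the search is recursive so that no particular nesting of room or speaker folders is assumed.

```python
from pathlib import Path

# Subdirectories of distant-16k named in the list above.
DISTANT_SUBDIRS = ("distractors", "room-response", "speech")


def count_distant_wavs(corpus_root) -> dict:
    """Count .wav files under each distant-16k subdirectory.

    A missing subdirectory is reported with a count of 0.
    """
    distant = Path(corpus_root) / "distant-16k"
    counts = {}
    for name in DISTANT_SUBDIRS:
        folder = distant / name
        if folder.is_dir():
            # rglob descends into any nested room or speaker folders.
            counts[name] = sum(1 for _ in folder.rglob("*.wav"))
        else:
            counts[name] = 0
    return counts
```

Calling `count_distant_wavs("/path/to/VOiCES")` returns a dictionary keyed by subdirectory name; the path shown is a placeholder.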
The directory hierarchy is shown below :
Microphone identification numbers are unique to a specific microphone location and type, defined below.
Distance (inches) between microphones and loudspeakers or floor, for Room-1 and Room-2 recordings.
|Foreground||Distractor 1||Distractor 2||Distractor 3||Floor|
VOiCES is publicly available under the Creative Commons BY 4.0 license, free for commercial, academic, and government use. Please reference VOiCES when using the data in publications.