Conrad Sanderson - VidTIMIT dataset

VidTIMIT Audio-Video Dataset

Overview

The VidTIMIT dataset comprises video and corresponding audio recordings of 43 people reciting short sentences. It can be useful for research on topics such as automatic lip reading, multi-view face recognition, multi-modal speech recognition and person identification.
The dataset was recorded in 3 sessions, with a mean delay of 7 days between Sessions 1 and 2, and 6 days between Sessions 2 and 3. The sentences were chosen from the test section of the TIMIT corpus. There are 10 sentences per person. The first six sentences (sorted alphanumerically by filename) are assigned to Session 1. The next two sentences are assigned to Session 2, and the remaining two to Session 3.
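
The session assignment described above can be sketched in Python. This is a minimal illustration, not code shipped with the dataset: the function name assign_sessions is hypothetical, and it assumes that a plain string sort matches the filename ordering (it does reproduce the grouping listed in the Examples section).

```python
def assign_sessions(sentence_ids):
    """Map each of one person's 10 sentence IDs to a session number (1, 2 or 3)."""
    ordered = sorted(sentence_ids)  # plain string sort, so "si2028" comes before "si768"
    if len(ordered) != 10:
        raise ValueError("expected 10 sentences per person")
    sessions = {}
    for index, sentence_id in enumerate(ordered):
        if index < 6:        # first six sentences
            sessions[sentence_id] = 1
        elif index < 8:      # next two sentences
            sessions[sentence_id] = 2
        else:                # remaining two sentences
            sessions[sentence_id] = 3
    return sessions


# Sentence IDs taken from the Examples section, deliberately shuffled.
ids = ["sx48", "sa1", "sx408", "sa2", "si1398",
       "sx318", "si2028", "si768", "sx138", "sx228"]
sessions = assign_sessions(ids)
print(sessions["sa1"], sessions["sx228"], sessions["sx48"])  # prints: 1 2 3
```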

The first two sentences for all persons are the same, with the remaining eight generally different for each person.

In addition to the sentences, each person performed a head rotation sequence in each session. The sequence consists of the person moving their head to the left, right, back to the center, up, then down, and finally returning to the center.

The recording was done in an office environment using a broadcast-quality digital video camera. The video of each person is stored as a numbered sequence of JPEG images with a resolution of 512 x 384 pixels. A 90% quality setting was used when creating the JPEG images. The corresponding audio is stored as a mono, 16-bit, 32 kHz WAV file.
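
As a quick sanity check after unpacking, the audio parameters stated above can be verified with Python's standard wave module. This is only a sketch: the function name check_wav_format is hypothetical, and the location of the WAV files inside each archive is not described here, so the path has to be supplied by the reader.

```python
import wave


def check_wav_format(path):
    """Return True if the WAV file at `path` is mono, 16 bit, 32 kHz."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1            # mono
            and wav.getsampwidth() == 2        # 16 bit = 2 bytes per sample
            and wav.getframerate() == 32000    # 32 kHz sampling rate
        )
```

The function would be called as check_wav_format("sa1.wav"), where the filename is a placeholder for wherever the audio file ends up on disk. Checking the 512 x 384 image size would need an image library and is not shown.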

Examples

Entries are grouped by session ID. Each entry gives the sentence ID or head rotation ID, the sentence text, and any example files.

Session 1

head (head rotation)
• MPEG1 video preview [320x240]
• JPEG image sequence (.tar.gz)
• JPEG image sequence (.zip)

sa1: She had your dark suit in greasy wash water all year
• MPEG1 video preview [320x240]
• WAV audio
• JPEG image sequence (.tar.gz)
• JPEG image sequence (.zip)

sa2: Don't ask me to carry an oily rag like that

si1398: Do they make class-biased decisions?

si2028: He took his mask from his forehead and threw it, unexpectedly, across the deck

si768: Make lid for sugar bowl the same as jar lids, omitting design disk

sx138: The clumsy customer spilled some expensive perfume

Session 2

head2 (head rotation)

sx228: The viewpoint overlooked the ocean
• MPEG1 video preview [320x240]
• WAV audio
• JPEG image sequence (.tar.gz)
• JPEG image sequence (.zip)

sx318: Please dig my potatoes up before frost

Session 3

head3 (head rotation)

sx408: I'd ride the subway, but I haven't enough change
• MPEG1 video preview [320x240]
• WAV audio
• JPEG image sequence (.tar.gz)
• JPEG image sequence (.zip)

sx48: Grandmother outgrew her upbringing in petticoats

Downloads

PLEASE READ BEFORE DOWNLOADING

LICENSE
The VidTIMIT dataset is Copyright © 2001 Conrad Sanderson.
Distribution and research usage of this dataset is permitted under the following conditions:

This notice is left intact and not modified in any way.

The dataset is provided as is. There is no warranty as to the fitness for any particular purpose.

The author of the dataset is not responsible for any direct or indirect losses resulting from the use of the dataset.

Any publication (eg. conference paper, journal article, technical report, book chapter, etc) resulting from the usage of VidTIMIT must cite the following paper:

C. Sanderson and B.C. Lovell
Multi-Region Probabilistic Histograms for Robust and Scalable Identity Inference.
Lecture Notes in Computer Science (LNCS), Vol. 5558, pp. 199-208, 2009.

NOTES

The VidTIMIT dataset consists of 44 files, taking up about 3 GB in total. Each zip file is 71 MB on average.

Please download only one file at a time, so that the server is not overloaded.

FILES

vidtimit_documentation.pdf

fadg0.zip

faks0.zip

fcft0.zip

fcmh0.zip

fcmr0.zip

fcrh0.zip

fdac1.zip

fdms0.zip

fdrd1.zip

fedw0.zip

felc0.zip

fgjd0.zip

fjas0.zip

fjem0.zip

fjre0.zip

fjwb0.zip

fkms0.zip

fpkt0.zip

fram1.zip

mabw0.zip

mbdg0.zip

mbjk0.zip

mccs0.zip

mcem0.zip

mdab0.zip

mdbb0.zip

mdld0.zip

mgwt0.zip

mjar0.zip

mjsw0.zip

mmdb1.zip

mmdm2.zip

mpdf0.zip

mpgl0.zip

mrcz0.zip

mreb0.zip

mrgg0.zip

mrjo0.zip

msjs1.zip

mstk0.zip

mtas1.zip

mtmr0.zip

mwbt0.zip
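
For readers who script their downloads, the one-file-at-a-time request in the notes above could be honoured with a simple sequential loop. The sketch below uses only the Python standard library. The function name, the base_url parameter and the pause between requests are assumptions made for illustration; the actual download location is not given here and must be supplied by the reader.

```python
import time
import urllib.request
from pathlib import Path


def download_one_at_a_time(base_url, filenames, dest_dir, pause_seconds=5.0):
    """Fetch each file in turn; a download starts only after the previous one ends."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for name in filenames:
        target = dest / name
        if target.exists():
            continue  # already present from an earlier run, so skip it
        url = f"{base_url.rstrip('/')}/{name}"
        urllib.request.urlretrieve(url, str(target))  # blocks until the file is saved
        time.sleep(pause_seconds)  # assumed courtesy delay, not a stated requirement
```

A call might look like download_one_at_a_time(BASE_URL, ["fadg0.zip", "faks0.zip"], "vidtimit"), where BASE_URL is a placeholder for the real download location.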

Related Datasets

DeepfakeTIMIT (modified VidTIMIT where faces are swapped between people via deep learning / GAN-based approach)

VB100 Bird Dataset (for experiments in fine-grained classification)

ChokePoint Dataset (for experiments in person recognition under real-world video surveillance conditions)

LFW-crop (cropped version of Labeled Faces in the Wild)

Related Publications

C. Sanderson. Biometric Person Recognition: Face, Speech and Fusion. VDM-Verlag, 2008. ISBN 978-3-639-02769-3.

P.S. Aleksic, A.K. Katsaggelos. Audio-Visual Biometrics. Proceedings of the IEEE. Vol. 94, No. 11, 2006.

R. Goecke. Current Trends in Joint Audio-Video Signal Processing: A Review. IEEE 8th International Symposium on Signal Processing and its Applications, Sydney, 2005.

G. Chetty, M. Wagner. Liveness Detection Using Cross-Modal Correlations in Face-Voice Person Authentication. Proc. 9th European Conference on Speech Communication and Technology, Lisboa, 2005.

N.A. Fox. Audio and Video Based Person Identification. PhD Thesis. University College Dublin, 2005.

J. Ortega-Garcia, C. Bousono-Crespo. Report on existing biometric databases. BioSecure Deliverable D1.1.1, European Commission, 2005.

G. Chetty, M. Wagner. Audio-Video Person Authentication Based on 3D Facial Feature Warping. Proc. Digital Image Computing: Techniques and Applications, Cairns, 2005.

T. Lehn-Schioler, L.K. Hansen, J. Larsen. Mapping from Speech to Images Using Continuous State Space Models. Proc. International Workshop on Machine Learning for Multimodal Interaction, Martigny, Switzerland, 2004.

C. Sanderson, K.K. Paliwal. Identity Verification Using Speech and Face Information. Digital Signal Processing Vol. 14, No. 5, 2004.

K.M. Kryszczuk, A. Drygajlo. Color Correction for Face Detection Based on Human Visual Perception Metaphor. Proc. Workshop on Multi-Modal User Authentication, Santa Barbara, 2003.

M. Grgic, K. Delac. Face Recognition Homepage (Databases)

C. Sanderson. Automatic Person Verification Using Speech and Face Information. PhD Thesis, Griffith University, 2003.