Face2 Module

Multimodal Extraction Pipeline

This module provides a pipeline for synchronized extraction of facial regions and audio features (MFCC) from video files. It uses multiprocessing to process videos in parallel and supports three operational modes: video-only, audio-only, and synchronized multimodal extraction.

Key Components:

Face Detection: Uses MTCNN to locate faces, selecting the largest face per frame with optional color space conversion (RGB, Gray, HSV, etc.).
Audio Processing: Extracts audio using MoviePy and generates spectrogram-like MFCC features using torchaudio’s Kaldi integration.
Sync Logic: When both video and audio are enabled, each saved face crop is paired with an audio snippet centered on that frame’s timestamp.
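The centering described under Sync Logic can be sketched as follows. This is a minimal illustration and not the module's own code: the helper name audio_window_for_frame, the window length, and the clamping at the track boundaries are all assumptions.

```python
def audio_window_for_frame(frame_index, fps, sample_rate, window_seconds, num_samples):
    """Return (start, end) sample indices of a window centered on a frame.

    Hypothetical helper: the real module may choose the window differently.
    """
    timestamp = frame_index / fps                        # frame time in seconds
    center = int(round(timestamp * sample_rate))         # matching audio sample
    half = int(round(window_seconds * sample_rate / 2))  # half window in samples
    start = max(0, center - half)                        # clamp at track start
    end = min(num_samples, center + half)                # clamp at track end
    return start, end


# Frame 50 of a 25 fps video sits at t = 2.0 s. With 16 kHz audio and an
# assumed 1-second window, the slice covers samples 24000 to 40000.
start, end = audio_window_for_frame(
    50, fps=25.0, sample_rate=16000, window_seconds=1.0, num_samples=160000
)
```

Frames near the start or end of the video get a shorter window under this clamping scheme; padding the snippet to a fixed length would be an alternative design.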

Operational Workflow:

Scans input directory for video files.
Initializes a pool of workers; MTCNN is only loaded if video is enabled.
Processes videos in parallel, saving outputs as compressed .npz files.
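The scanning step and the mapping from a video to its output file might look like the sketch below. The extension list, the recursive search, and the one-file-per-video naming scheme are assumptions made for illustration; find_videos and output_path_for are hypothetical names, not part of face2.

```python
from pathlib import Path

# Assumed set of video extensions; the module's actual list is not documented.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv"}


def find_videos(input_dir, extensions=VIDEO_EXTENSIONS):
    """Recursively collect video files under input_dir, sorted for stable order."""
    root = Path(input_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )


def output_path_for(video_path, output_root):
    """Map a video to a .npz path under output_root (hypothetical naming)."""
    return Path(output_root) / (Path(video_path).stem + ".npz")
```

Each resulting path could then be handed to a worker, which would write its arrays with numpy.savez_compressed.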

face2.init_worker(video_enabled)

Initializes a worker process for parallel execution.

Parameters: video_enabled – Boolean flag. If True, loads the MTCNN detector into global memory.
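The initializer pattern described here can be illustrated with the standard library alone. In this sketch, _load_detector is a hypothetical stand-in for constructing MTCNN, and detector_loaded and run_pool are illustrative helpers; none of them is taken from the module.

```python
import multiprocessing as mp

_detector = None  # per-process global, filled in by init_worker


def _load_detector():
    """Hypothetical stand-in for building the MTCNN detector."""
    return object()


def init_worker(video_enabled):
    """Pool initializer: load the detector once per worker, only if needed."""
    global _detector
    _detector = _load_detector() if video_enabled else None


def detector_loaded(_task):
    """Trivial task that reports whether this worker holds a detector."""
    return _detector is not None


def run_pool(tasks, video_enabled, processes=2):
    """Run tasks on a pool whose workers are prepared by init_worker."""
    with mp.Pool(processes, initializer=init_worker,
                 initargs=(video_enabled,)) as pool:
        return pool.map(detector_loaded, tasks)
```

Loading the detector in the initializer means each worker pays the model-loading cost once instead of once per video, and audio-only runs skip it entirely. When calling run_pool from a script, place the call under an `if __name__ == "__main__":` guard so it also works with the spawn start method.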

face2.get_audio_waveform(video_path, target_sr=16000)

Extracts the raw audio signal from a video file.

Parameters:

video_path – Path to the source video file.
target_sr – Target sample rate for the output audio.

Returns:

A tuple of (audio_array, sample_rate). Returns (None, 0) on failure.

face2.generate_mfcc(audio_array, target_dtype=numpy.uint8)

Transforms a raw audio waveform into a standardized MFCC spectrogram.

The process includes signal normalization, Kaldi-based MFCC extraction, statistical standardization (mean/std), and channel replication to create a 3-channel image-like tensor.

Parameters:

audio_array – 1D NumPy array of the audio signal.
target_dtype – Desired NumPy data type for the output.

Returns:

A 3D NumPy array (H, W, 3) representing the MFCC spectrogram.
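The post-processing steps (standardization and channel replication) can be sketched in NumPy as shown below. The Kaldi-based MFCC extraction is replaced here by an input matrix that is assumed to be already computed. The min-max rescaling used to fit an integer dtype is also an assumption, since the documentation does not state how values are mapped to target_dtype. mfcc_to_image is a hypothetical name.

```python
import numpy as np


def mfcc_to_image(mfcc, target_dtype=np.uint8):
    """Turn a 2D MFCC matrix into an (H, W, 3) image-like array.

    Sketch only: the scaling to target_dtype is an assumed design choice.
    """
    mfcc = np.asarray(mfcc, dtype=np.float32)

    # Statistical standardization: zero mean, unit variance.
    standardized = (mfcc - mfcc.mean()) / (mfcc.std() + 1e-8)

    # Assumed step: min-max rescale to [0, 1], then stretch to the dtype range.
    lo, hi = standardized.min(), standardized.max()
    scaled = (standardized - lo) / (hi - lo + 1e-8)
    if np.issubdtype(target_dtype, np.integer):
        scaled = scaled * np.iinfo(target_dtype).max

    # Channel replication: copy the single plane into three identical channels.
    image = np.repeat(scaled[:, :, np.newaxis], 3, axis=2)
    return image.astype(target_dtype)


# Random data standing in for a real MFCC matrix (13 coefficients, 40 frames).
fake_mfcc = np.random.default_rng(0).normal(size=(13, 40))
image = mfcc_to_image(fake_mfcc)
```

Replicating the plane into three channels lets the result be fed to image models that expect RGB-shaped input.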

face2.process_video_wrapper(video_path, output_root, flags, params)

Orchestrates the extraction process for a single video.

Depending on the flags, it extracts either the full audio MFCC, specific video frames (face crops), or a synchronized combination where each face is paired with a corresponding audio window.

Parameters:

video_path – Path to the video file to be processed.
output_root – Directory where the .npz files will be stored.
flags – Dictionary containing ‘video’ and ‘audio’ boolean activation flags.
params – Dictionary containing ‘frames’, ‘size’, ‘color’, and ‘audio_dtype’.

Returns:

A tuple containing (number_of_faces_saved, number_of_audios_saved).
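A sketch of how the two flags could map onto the three modes, together with example flags and params dictionaries. The keys come from the parameter descriptions above. The values, the select_mode helper, and the error raised when both flags are off are hypothetical.

```python
def select_mode(flags):
    """Pick the extraction mode from the 'video' and 'audio' flags (illustrative)."""
    video = bool(flags.get("video", False))
    audio = bool(flags.get("audio", False))
    if video and audio:
        return "synchronized"
    if video:
        return "video-only"
    if audio:
        return "audio-only"
    raise ValueError("at least one of 'video' or 'audio' must be enabled")


# Example inputs; every value here is a placeholder, not a documented default.
flags = {"video": True, "audio": True}
params = {"frames": 16, "size": 224, "color": "RGB", "audio_dtype": "uint8"}

mode = select_mode(flags)
```

With inputs shaped like these, a call would take the form process_video_wrapper(video_path, output_root, flags, params) and return the (faces, audios) counts described above.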

face2.main()