Face2 Module
Multimodal Extraction Pipeline
This module extracts face crops and MFCC audio features from video files. It processes videos in parallel using multiprocessing and supports three operational modes: video-only, audio-only, and synchronized multimodal extraction.
Key Components:
Face Detection: Uses MTCNN to locate faces, selecting the largest face per frame with optional color space conversion (RGB, Gray, HSV, etc.).
Audio Processing: Extracts audio using MoviePy and generates spectrogram-like MFCC features using torchaudio’s Kaldi integration.
Sync Logic: When both video and audio are enabled, each saved face crop is paired with an audio snippet centered on the frame’s timestamp.
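The sync logic can be illustrated with a minimal sketch that maps a frame index to a window of audio samples centered on that frame's timestamp. The function name, the `window_seconds` parameter, and the clamping at the clip boundaries are assumptions made for illustration; the module's actual window length and edge handling are not documented here.

```python
def centered_audio_window(frame_index, fps, sample_rate, window_seconds, total_samples):
    """Return (start, end) sample indices for a window centered on a frame.

    Hypothetical helper: the names and the clamping policy are assumptions.
    """
    timestamp = frame_index / fps                         # frame time in seconds
    center = int(round(timestamp * sample_rate))          # matching audio sample
    half = int(round(window_seconds * sample_rate / 2))   # half-window in samples
    start = max(0, center - half)                         # clamp at clip start
    end = min(total_samples, center + half)               # clamp at clip end
    return start, end


# Frame 25 of a 25 fps video sits at t = 1.0 s; with 16 kHz audio and a
# 1-second window the slice covers samples 8000 to 24000.
print(centered_audio_window(25, 25.0, 16000, 1.0, 160000))
```

Windows near the start or end of the clip come out shorter under this policy; padding them to a fixed length would be an equally plausible choice.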
Operational Workflow:
Scans input directory for video files.
Initializes a pool of workers; MTCNN is only loaded if video is enabled.
Processes videos in parallel, saving outputs as compressed .npz files.
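The workflow above can be sketched with the standard library's `multiprocessing.Pool` and its `initializer` hook. This is a minimal illustration, not the module's code: `find_videos`, `init_worker_sketch`, `process_one`, `run_pool`, and the extension list are hypothetical, the directory scan is assumed to be non-recursive, and a plain `object()` stands in for the MTCNN detector.

```python
import multiprocessing as mp
from pathlib import Path

# Assumed extension list; the module's actual filter is not documented.
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv"}

_detector = None  # per-process global, set by the pool initializer


def find_videos(input_dir):
    """Collect video files directly under input_dir, sorted for stable order."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() in VIDEO_EXTENSIONS
    )


def init_worker_sketch(video_enabled):
    """Run once per worker; build the expensive detector only when needed."""
    global _detector
    _detector = object() if video_enabled else None  # placeholder for MTCNN


def process_one(video_path):
    """Stand-in for per-video work; reports whether a detector is loaded."""
    return str(video_path), _detector is not None


def run_pool(input_dir, video_enabled, workers=2):
    """Scan the directory, start the workers, and process videos in parallel."""
    videos = find_videos(input_dir)
    with mp.Pool(workers, initializer=init_worker_sketch,
                 initargs=(video_enabled,)) as pool:
        return pool.map(process_one, videos)
```

On platforms that start workers with the spawn method (Windows, macOS), `run_pool` should be called from under an `if __name__ == "__main__":` guard. Loading the detector in the initializer means each worker pays the model-loading cost once rather than once per video.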
- face2.init_worker(video_enabled)
Initializes a worker process for parallel execution.
- Parameters:
video_enabled – Boolean flag. If True, loads the MTCNN detector into global memory.
- face2.get_audio_waveform(video_path, target_sr=16000)
Extracts the raw audio signal from a video file.
- Parameters:
video_path – Path to the source video file.
target_sr – Target sample rate for the output audio.
- Returns:
A tuple of (audio_array, sample_rate). Returns (None, 0) on failure.
- face2.generate_mfcc(audio_array, target_dtype=numpy.uint8)
Transforms a raw audio waveform into a standardized MFCC spectrogram.
The process includes signal normalization, Kaldi-based MFCC extraction, statistical standardization (mean/std), and channel replication to create a 3-channel image-like tensor.
- Parameters:
audio_array – 1D NumPy array of the audio signal.
target_dtype – Desired NumPy data type for the output.
- Returns:
A 3D NumPy array (H, W, 3) representing the MFCC spectrogram.
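The last two stages described above, standardization and channel replication, can be sketched in NumPy. This illustration skips the Kaldi-based MFCC step and takes any 2D feature matrix as input. The function name is hypothetical, and the min-max rescaling applied for integer dtypes is an assumption: the documentation does not say how standardized values are mapped into `uint8`.

```python
import numpy as np


def to_three_channel_image(features, target_dtype=np.uint8, eps=1e-8):
    """Standardize a 2D feature matrix and replicate it into 3 channels.

    Hypothetical helper; the integer rescaling policy is an assumption.
    """
    features = np.asarray(features, dtype=np.float32)
    # Statistical standardization: zero mean, unit variance.
    standardized = (features - features.mean()) / (features.std() + eps)
    if np.issubdtype(np.dtype(target_dtype), np.integer):
        # Assumption: integer outputs are min-max scaled to the dtype's range.
        lo, hi = standardized.min(), standardized.max()
        scaled = (standardized - lo) / (hi - lo + eps)
        standardized = scaled * np.iinfo(target_dtype).max
    # Channel replication: (H, W) -> (H, W, 3), all channels identical.
    image = np.stack([standardized] * 3, axis=-1)
    return image.astype(target_dtype)


mfcc_like = np.random.default_rng(0).normal(size=(13, 40))  # stand-in features
image = to_three_channel_image(mfcc_like)
print(image.shape, image.dtype)
```

Replicating the single channel three times lets the result be fed to models that expect RGB-shaped input, at the cost of storing the same values three times.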
- face2.process_video_wrapper(video_path, output_root, flags, params)
Orchestrates the extraction process for a single video.
Depending on the flags, it extracts either the full audio MFCC, specific video frames (face crops), or a synchronized combination where each face is paired with a corresponding audio window.
- Parameters:
video_path – Path to the video file to be processed.
output_root – Directory where the .npz files will be stored.
flags – Dictionary containing ‘video’ and ‘audio’ boolean activation flags.
params – Dictionary containing ‘frames’, ‘size’, ‘color’, and ‘audio_dtype’.
- Returns:
A tuple containing (number_of_faces_saved, number_of_audios_saved).
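As an illustration of the calling convention, the sketch below builds `flags` and `params` dictionaries with the documented keys and totals the returned tuples across several videos. All values are hypothetical (the accepted types for 'frames', 'size', 'color', and 'audio_dtype' are not documented here), and `fake_process_video` is a stand-in with the same signature, not the real function. Its counting rules (one audio output per face in synchronized mode, one per video in audio-only mode) are assumptions drawn from the description above.

```python
from functools import partial

flags = {"video": True, "audio": True}
# Hypothetical values; only the key names come from the documentation.
params = {"frames": 10, "size": 224, "color": "RGB", "audio_dtype": "uint8"}


def fake_process_video(video_path, output_root, flags, params):
    """Stand-in returning (faces_saved, audios_saved) without touching disk."""
    faces = params["frames"] if flags["video"] else 0
    if flags["video"] and flags["audio"]:
        audios = faces   # assumed: one audio snippet per saved face
    elif flags["audio"]:
        audios = 1       # assumed: one full-clip MFCC per video
    else:
        audios = 0
    return faces, audios


def summarize(results):
    """Total the per-video (faces, audios) tuples."""
    return sum(f for f, _ in results), sum(a for _, a in results)


# Bind the shared arguments once so only the video path varies per call.
worker = partial(fake_process_video, output_root="out", flags=flags, params=params)
results = [worker(path) for path in ["a.mp4", "b.mp4"]]
print(summarize(results))
```

Binding the shared arguments with `functools.partial` leaves a one-argument callable, which is a convenient shape for mapping over a list of video paths in a worker pool.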