Face2 Module

Multimodal Extraction Pipeline

This module provides a pipeline for synchronized extraction of face crops and audio features (MFCCs) from video files. It scales across CPU cores using multiprocessing and supports three operational modes: video-only, audio-only, and synchronized multimodal extraction.

Key Components:

  • Face Detection: Uses MTCNN to locate faces, selecting the largest face per frame with optional color space conversion (RGB, Gray, HSV, etc.).

  • Audio Processing: Extracts audio using MoviePy and generates spectrogram-like MFCC features using torchaudio’s Kaldi integration.

  • Sync Logic: When both modes are active, it ensures each saved face has a corresponding audio snippet centered on the frame’s timestamp.
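The sync logic can be illustrated with a minimal sketch of how a frame index maps to an audio window. The helper name audio_window, the 1-second window length, and the clamping at the clip boundaries are assumptions made for illustration; the module may use a different window size or pad the signal instead.

```python
def audio_window(frame_time_s, sample_rate, window_s, total_samples):
    """Return (start, end) sample indices of a window centered on frame_time_s.

    Hypothetical helper: the window is clamped to the signal bounds, so it
    can be shorter than window_s near the start or end of the clip.
    """
    half = int(round(window_s * sample_rate / 2))
    center = int(round(frame_time_s * sample_rate))
    start = max(0, center - half)
    end = min(total_samples, center + half)
    return start, end


# Frame 50 of a 25 fps video sits at t = 2.0 s.
fps = 25.0
frame_index = 50
start, end = audio_window(frame_index / fps, 16000, 1.0, 16000 * 10)
print(start, end)  # 24000 40000
```

Slicing the waveform with audio_array[start:end] would then give the snippet that accompanies the face crop from that frame.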

Operational Workflow:

  1. Scans the input directory for video files.

  2. Initializes a pool of workers; MTCNN is only loaded if video is enabled.

  3. Processes videos in parallel, saving outputs as compressed .npz files.

face2.init_worker(video_enabled)

Initializes a worker process for parallel execution.

Parameters:

video_enabled – Boolean flag. If True, loads the MTCNN detector into global memory.

face2.get_audio_waveform(video_path, target_sr=16000)

Extracts the raw audio signal from a video file.

Parameters:
  • video_path – Path to the source video file.

  • target_sr – Target sample rate for the output audio.

Returns:

A tuple of (audio_array, sample_rate). Returns (None, 0) on failure.

face2.generate_mfcc(audio_array, target_dtype=numpy.uint8)

Transforms a raw audio waveform into a standardized MFCC spectrogram.

The process includes signal normalization, Kaldi-based MFCC extraction, statistical standardization (mean/std), and channel replication to create a 3-channel image-like tensor.

Parameters:
  • audio_array – 1D NumPy array of the audio signal.

  • target_dtype – Desired NumPy data type for the output.

Returns:

A 3D NumPy array (H, W, 3) representing the MFCC spectrogram.
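The last two stages, standardization and channel replication, can be sketched in NumPy. This is not the module's implementation: the function name to_image_like, the epsilon term, and the min-max rescaling used for integer dtypes are assumptions, and the Kaldi-based MFCC step is replaced by a ready-made 2D feature matrix.

```python
import numpy as np


def to_image_like(features, target_dtype=np.uint8):
    """Standardize a 2D feature matrix and replicate it into 3 channels.

    Hypothetical helper. `features` has shape (H, W); the result has
    shape (H, W, 3) with three identical channels.
    """
    features = np.asarray(features, dtype=np.float32)
    # Zero mean, unit variance; the epsilon guards against a constant input.
    out = (features - features.mean()) / (features.std() + 1e-8)
    if np.issubdtype(target_dtype, np.integer):
        # Assumed step: rescale to [0, max] of the integer type before casting.
        lo, hi = out.min(), out.max()
        out = np.round((out - lo) / (hi - lo + 1e-8) * np.iinfo(target_dtype).max)
    # Copy the single channel three times along a new last axis.
    return np.stack([out, out, out], axis=-1).astype(target_dtype)


mfcc = np.arange(12, dtype=np.float32).reshape(3, 4)  # stand-in MFCC matrix
image = to_image_like(mfcc)
print(image.shape, image.dtype)  # (3, 4, 3) uint8
```

A 3-channel layout lets the result feed into image models that expect RGB-shaped input, which is presumably why the channel is replicated.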

face2.process_video_wrapper(video_path, output_root, flags, params)

Orchestrates the extraction process for a single video.

Depending on the flags, it extracts either the full audio MFCC, specific video frames (face crops), or a synchronized combination where each face is paired with a corresponding audio window.

Parameters:
  • video_path – Path to the video file to be processed.

  • output_root – Directory where the .npz files will be stored.

  • flags – Dictionary containing ‘video’ and ‘audio’ boolean activation flags.

  • params – Dictionary containing ‘frames’, ‘size’, ‘color’, and ‘audio_dtype’.

Returns:

A tuple containing (number_of_faces_saved, number_of_audios_saved).
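A call might be set up as in the sketch below. The dictionary keys come from the parameter descriptions above, but every value shown (16 frames, 224, "RGB", "uint8"), the file paths, and the summarize helper are illustrative assumptions; the actual expected types and defaults are not documented here.

```python
# Keys follow the documented parameters; the values are assumed examples.
flags = {"video": True, "audio": True}
params = {"frames": 16, "size": 224, "color": "RGB", "audio_dtype": "uint8"}

# With the real module this would be roughly (paths are hypothetical):
#   counts = face2.process_video_wrapper("clip.mp4", "out/", flags, params)


def summarize(results):
    """Sum per-video (faces_saved, audios_saved) tuples into totals."""
    total_faces = sum(faces for faces, _ in results)
    total_audios = sum(audios for _, audios in results)
    return total_faces, total_audios


# Stand-in results for three videos, the last of which yielded nothing.
results = [(16, 16), (12, 12), (0, 0)]
print(summarize(results))  # (28, 28)
```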

face2.main()