Vision and Language Models for Enhanced Archive Video Management
Archival video collections contain a wealth of historical and cultural information. Managing and analyzing this data can be challenging due to the lack of metadata and inconsistent formatting across sources. In particular, identifying and separating the individual stories within a single archived tape is critical for efficient indexing, analysis, and retrieval, yet manual segmentation is time-consuming and prone to human error. To address this challenge, we propose a novel approach that combines vision and language models to automatically detect transition frames and segment archive videos into distinct stories. A vision model is used to cluster the video's frames. Recent robust automatic speech recognition and large language models are then used to generate a transcript, a summary, and a title for each story. By reusing the features computed during transition frame detection, we also propose a fine-grained chaptering of the segmented stories. We conducted experiments on a dataset of 50 hours of archival video footage. The results demonstrate high accuracy in detecting and segmenting videos into distinct stories: we achieve a precision of 93% at an Intersection over Union threshold of 90%. Furthermore, our approach offers significant sustainability benefits, as it filters out approximately 20% of the content from the 50 hours of video tested. This reduction in the amount of data that needs to be managed, analyzed, and stored can lead to substantial cost savings and environmental benefits by lowering the energy consumption and carbon emissions associated with data processing and storage.
Khalil Guetari, Yannis Tevissen, Frederic Petitpont | Moments Lab Research | Boulogne-Billancourt, France
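The reported evaluation uses segment-level precision at a temporal Intersection over Union (IoU) threshold of 90%. The Python snippet below is a minimal sketch of what such a metric can look like: it matches predicted story segments to annotated ones greedily and one-to-one, then counts a prediction as correct when its IoU reaches the threshold. The greedy matching strategy and all function names are our own illustrative assumptions, not the paper's exact evaluation protocol.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_time, end_time) in seconds

def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over Union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def precision_at_iou(predicted: List[Segment],
                     ground_truth: List[Segment],
                     threshold: float = 0.9) -> float:
    """Fraction of predicted story segments that match some annotated
    segment with IoU >= threshold (each annotated segment used once)."""
    unmatched = list(ground_truth)
    hits = 0
    for pred in predicted:
        best_idx, best_iou = -1, 0.0
        for i, gt in enumerate(unmatched):
            iou = temporal_iou(pred, gt)
            if iou > best_iou:
                best_idx, best_iou = i, iou
        if best_iou >= threshold:
            hits += 1
            unmatched.pop(best_idx)  # one-to-one matching
    return hits / len(predicted) if predicted else 0.0

# Hypothetical example: two predicted stories against two annotated stories.
pred = [(0.0, 118.0), (118.0, 305.0)]
gt = [(0.0, 120.0), (120.0, 300.0)]
print(precision_at_iou(pred, gt, threshold=0.9))  # -> 1.0
```

Under this definition, a predicted story counts as correct only if it covers essentially the same time span as an annotated story, which is what makes a 90% IoU threshold a strict criterion.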