Video context classification by VideoBERT and DA-Ada adapters
DOI: https://doi.org/10.15276/ict.02.2025.19

Keywords: Video context classification, encoder, decoder, neural networks, adapter, transformer

Abstract
This paper proposes an architecture for contextual video classification that combines the strengths of the pre-trained video-language encoder VideoBERT, the adaptive module DA-Ada (Domain-Aware Adapter), and an autoregressive transformer decoder. The main goal is to build a system capable of generating textual descriptions of actions in videos with a high degree of generalization to new domains. The architecture is designed with scalability, flexible adaptation, and reduced future pre-training time in mind. The input video is divided into a sequence of frames, and each frame is converted into a feature vector by a ResNet-50 pre-trained on ImageNet. The frame vectors are then projected into the visual token space and passed to the VideoBERT module. This encoder, built on the BERT transformer architecture, contextualizes features across the entire video sequence, modeling long-term temporal dependencies between frames. All VideoBERT parameters remain frozen, which reduces the need for additional training resources. After encoding, each representation is passed to the DA-Ada adaptation module, which consists of two parallel branches: a DIA (Domain-Invariant Adapter) and a DSA (Domain-Specific Adapter). The DIA learns to extract common, invariant features characteristic of most videos, while the DSA focuses on features inherent to a specific domain (e.g., household scenes or industrial objects). The outputs of the two adapters are combined using a scalar coefficient that sets the balance between universality and specialization. The result of this fusion is a sequence of adapted vectors, which is fed to the transformer decoder to generate action descriptions. Generation is carried out by a transformer decoder of six layers, each comprising a self-attention mechanism over the partially generated text, a cross-attention mechanism over the video context, and a standard feedforward block. Starting from the start token, the decoder incrementally forms a textual description of the action, stopping when the end-of-sequence token is generated or the maximum length is reached. The proposed architecture provides modularity, a limited number of trainable parameters, and applicability across different domains. Future work will implement a full model training cycle on the Something-Something V2 dataset.
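
To make the adapter fusion step concrete, the following is a minimal sketch of how the DIA and DSA branches could be combined with a scalar coefficient. It assumes PyTorch, bottleneck-style adapters with residual connections, and a learnable sigmoid-gated scalar; the class names, bottleneck size, and hidden dimension of 768 are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Small bottleneck adapter: down-project -> nonlinearity -> up-project."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen encoder features intact.
        return x + self.up(self.act(self.down(x)))


class DAAdaFusion(nn.Module):
    """Fuses the domain-invariant (DIA) and domain-specific (DSA) branches
    with a scalar coefficient, as described in the abstract (sketch only)."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.dia = BottleneckAdapter(dim, bottleneck)  # domain-invariant branch
        self.dsa = BottleneckAdapter(dim, bottleneck)  # domain-specific branch
        # Learnable scalar; sigmoid keeps the balance coefficient in (0, 1).
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.alpha)
        return a * self.dia(h) + (1.0 - a) * self.dsa(h)


# Usage on a batch of contextualized frame features of shape
# (batch, num_frames, hidden_dim); the values below are illustrative.
features = torch.randn(2, 32, 768)
fusion = DAAdaFusion(dim=768)
adapted = fusion(features)   # same shape, fed on to the transformer decoder
print(adapted.shape)         # torch.Size([2, 32, 768])
```

The sigmoid gate is one simple way to realize the "balance between universality and specialization" mentioned above; a fixed hyperparameter would be an equally valid reading of the abstract.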
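
The decoding stage could be sketched in a similar spirit, assuming PyTorch's nn.TransformerDecoder and greedy autoregressive decoding; the vocabulary size, token ids, head count, and maximum length are illustrative assumptions rather than settings reported in the paper.

```python
import torch
import torch.nn as nn


class CaptionDecoder(nn.Module):
    """Six-layer transformer decoder: self-attention over the partial caption,
    cross-attention over the adapted video features, feedforward blocks."""

    def __init__(self, vocab_size: int, dim: int = 768, layers: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.out = nn.Linear(dim, vocab_size)

    @torch.no_grad()
    def generate(self, memory, bos_id, eos_id, max_len=20):
        # Greedy decoding: start from the start token and stop once the
        # end-of-sequence token is produced or max_len is reached.
        tokens = torch.full((memory.size(0), 1), bos_id, dtype=torch.long)
        for _ in range(max_len):
            length = tokens.size(1)
            # Causal mask so each position attends only to earlier tokens.
            mask = torch.triu(torch.full((length, length), float("-inf")), diagonal=1)
            h = self.decoder(self.embed(tokens), memory, tgt_mask=mask)
            next_tok = self.out(h[:, -1]).argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, next_tok], dim=1)
            if (next_tok == eos_id).all():
                break
        return tokens


# Illustrative usage: adapted video features of shape (batch, frames, dim).
memory = torch.randn(1, 32, 768)
decoder = CaptionDecoder(vocab_size=10000)
caption_ids = decoder.generate(memory, bos_id=1, eos_id=2)
```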