Boosting Memory Network for Video Object Segmentation in Complex Scenes

Title: Boosting Memory Network for Video Object Segmentation in Complex Scenes

Conference: CGI 2025

Tags: Global and Local Modeling, Memory Management, Multi-scale Memory Reading, Space-time Memory Network, Video Object Segmentation

Abstract:

Memory-based methods are the leading solutions for video object segmentation. However, due to incorrect feature matching, single-scale memory reading, and inefficient memory management, these methods often struggle in complex scenarios with similar distractors, small objects, or large deformations. In this paper, we propose a novel boosting memory network (BMN), which consists of a context-aware module (CAM), a multi-scale memory readout module (MMRM), and a reinforcement learning-based memory refinement module (RL-MRM) to enhance segmentation performance in complex scenarios. Specifically, we introduce the CAM into the encoding process; it employs transformer blocks with global and local branches to model long-range and short-range spatial dependencies, respectively. Additionally, we employ the MMRM to perform efficient multi-scale memory readout, generating multi-scale memory features that deliver strong cues for the final segmentation. Furthermore, during inference, we use the RL-MRM to construct a query-specific refined memory that provides precise segmentation guidance for each query frame, mitigating temporal drift and avoiding error accumulation. Extensive experiments on five challenging benchmarks demonstrate that our BMN achieves performance competitive with state-of-the-art methods. Our source code and models are available at: https://github.com/csustYyh/BMN.
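
As a rough illustration of what "memory readout" means in space-time memory networks, the sketch below implements generic soft-attention matching: each query location takes a softmax-weighted average of the memory values, weighted by the dot-product similarity of its key to the memory keys. This is not the BMN implementation. The function names, the flat list-of-vectors layout, and the per-scale `(keys, values)` pairing are all assumptions made for illustration, and the multi-scale variant simply repeats the same readout independently at each scale.

```python
import math


def memory_read(query_keys, memory_keys, memory_values):
    """Generic attention readout (illustrative, not the BMN module).

    query_keys:    list of key vectors, one per query location
    memory_keys:   list of key vectors, one per memory location
    memory_values: list of value vectors, aligned with memory_keys
    """
    dim = len(memory_values[0])
    readout = []
    for q in query_keys:
        # Dot-product affinity between this query location and every memory location.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in memory_keys]
        # Numerically stable softmax over memory locations.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the memory values.
        readout.append(
            [sum(w * v[d] for w, v in zip(weights, memory_values)) for d in range(dim)]
        )
    return readout


def multi_scale_read(query_pyramid, memory_pyramid):
    """Hypothetical multi-scale wrapper: one independent readout per scale.

    query_pyramid:  list of query_keys, one entry per scale
    memory_pyramid: list of (memory_keys, memory_values), one entry per scale
    """
    return [
        memory_read(q, keys, values)
        for q, (keys, values) in zip(query_pyramid, memory_pyramid)
    ]
```

A query key that strongly matches one memory key retrieves essentially that location's value, while a query that matches all memory keys equally retrieves their average. This is why incorrect matches, such as those caused by similar distractors, can feed misleading cues into the segmentation.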