Distillation-Based Model Compression Framework for Swin Transformer

EasyChair Preprint 15433
7 pages • Date: November 16, 2024

Abstract

Vision Transformers (ViTs) have gained significant attention in computer vision due to their exceptional modeling capabilities. However, most ViT models suffer from high complexity, with a large number of parameters that demand considerable memory and inference time, limiting their applicability on resource-constrained devices. To address this issue, we propose a distillation-based framework for compressing large models for smaller datasets. The framework leverages fine-tuning and knowledge distillation to accelerate the training of compressed models. To evaluate its effectiveness, two compressed Swin Transformer models, Swin-N and Swin-M, were introduced and tested on the CIFAR-100 dataset. Experimental results demonstrate that, when trained with the proposed framework, both Swin-N and Swin-M achieve significant accuracy improvements over their counterparts trained from scratch, with Swin-N gaining 18.89% and Swin-M gaining 20.10%. Additionally, Swin-M closely approaches the accuracy of the Swin-T teacher model, further validating the effectiveness of the framework.

Keyphrases: Knowledge Distillation, Model Compression, Swin, ViT, Fine-Tuning
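The abstract does not spell out the distillation objective used to train Swin-N and Swin-M. A common formulation, and one plausible reading of the framework, is the standard soft-target knowledge-distillation loss that combines cross-entropy on ground-truth labels with a temperature-scaled KL term against the Swin-T teacher's logits. The sketch below illustrates that objective in PyTorch; the function names, temperature, and weighting factor alpha are illustrative assumptions, not the paper's reported settings.

```python
# Minimal knowledge-distillation training step (PyTorch).
# The paper's exact loss, temperature, and Swin-N / Swin-M architectures are
# not given in the abstract; this uses the standard soft-target KD objective
# with hypothetical hyperparameters.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and soft-target KL divergence."""
    # Hard-label term: cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-target term: KL divergence between temperature-scaled distributions,
    # rescaled by T^2 to keep gradient magnitudes comparable.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1.0 - alpha) * kd


def train_step(student, teacher, images, labels, optimizer):
    """One distillation step: the frozen teacher provides soft targets."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(images)
    student_logits = student(images)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, the teacher would be a Swin-T model fine-tuned on CIFAR-100 and kept frozen, while the compressed student (Swin-N or Swin-M) is updated by the combined loss.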