The Need for Speed: Pruning Transformers with One Recipe

1University of Toronto
ICLR 2024
Main Pruning Pipeline

OPTIN prunes transformers in one shot across multiple modalities.

Abstract

We introduce the One-shot Pruning Technique for Interchangeable Networks (OPTIN) framework as a tool to increase the efficiency of pre-trained transformer architectures, across many domains, without requiring re-training. Recent works have explored improving transformer efficiency; however, they often incur computationally expensive re-training procedures or depend on architecture-specific characteristics, impeding practical wide-scale adoption across multiple modalities. To address these shortcomings, the OPTIN framework leverages intermediate feature distillation, capturing the long-range dependencies of model parameters (coined trajectory), to produce state-of-the-art results on natural language, image classification, transfer learning, and semantic segmentation tasks. Our motivation stems from the need for a generalizable model compression framework that scales well across different transformer architectures and applications. Given a FLOP constraint, the OPTIN framework compresses the network while maintaining competitive accuracy and improved throughput. In particular, we show ≤ 2% accuracy degradation from NLP baselines and a 0.5% improvement over state-of-the-art methods on image classification at competitive FLOP reductions. We further demonstrate generalization across tasks and architectures with comparative performance on Mask2Former for semantic segmentation and on CNN-style networks. OPTIN presents one of the first one-shot efficient frameworks for compressing transformer architectures that generalizes well across multiple class domains, in particular natural language and image-related tasks, without re-training.

Method

OPTIN quantifies parameter importance by analyzing each parameter's downstream (depth-wise) effect on the model. By leveraging a proxy distillation loss on both the intermediate manifold and the final logits, the OPTIN framework provides a gradient-free method for estimating parameter importance using only forward passes. Finally, the mask search phase partitions the search space by adding parameters in descending order of importance, keeping the search tractable (polynomial-time).
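The scoring-and-search loop described above can be sketched on a toy network. This is an illustrative reconstruction under simplifying assumptions, not the released implementation: it ablates one hidden unit at a time, scores each unit by a proxy loss combining intermediate-feature distortion (the "manifold" term) and a KL term on the final softmax, then greedily keeps the highest-scoring units under a unit budget standing in for a FLOP constraint. The unstructured per-unit granularity, equal loss weighting, and budget value are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))   # toy "layer 1" weights
W2 = rng.normal(size=(3, 8))   # toy "layer 2" weights
x = rng.normal(size=(16, 4))   # small calibration batch

def forward(keep):
    """Forward pass with a binary mask over hidden units (no gradients needed)."""
    h = np.maximum(x @ W1.T, 0) * keep            # masked intermediate features
    z = h @ W2.T
    p = np.exp(z - z.max(-1, keepdims=True))
    return h, p / p.sum(-1, keepdims=True)        # features, softmax logits

h_ref, p_ref = forward(np.ones(8))                # unpruned reference trajectory

def importance(i):
    """Proxy distillation loss induced by ablating hidden unit i."""
    keep = np.ones(8)
    keep[i] = 0.0
    h, p = forward(keep)
    manifold = np.mean((h - h_ref) ** 2)          # intermediate-feature term
    kd = np.mean(np.sum(p_ref * (np.log(p_ref + 1e-9) - np.log(p + 1e-9)), -1))
    return manifold + kd                          # higher = more important

scores = np.array([importance(i) for i in range(8)])

# Greedy mask search: keep units in descending importance until the budget is met.
budget = 5                                        # stand-in for a FLOP constraint
mask = np.zeros(8)
mask[np.argsort(-scores)[:budget]] = 1.0
```

Because every score comes from forward passes alone, no backward pass or fine-tuning step is needed, which is what makes the one-shot, retraining-free setting possible.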

OPTIN Algorithm

Results

We introduce implementation and evaluation details to ensure reproducibility, and benchmark our method against the state of the art in natural language and image classification to illustrate the potential of our one-shot framework. In particular, the majority of our experiments use the OPTIN framework to improve off-the-shelf models without re-training. We further investigate applications in transfer learning, alternate architectures, and downstream tasks to show the generalizability of our method across tasks and architectures.

Natural Language Processing

For Natural Language Processing, OPTIN is evaluated on the GLUE benchmark using the BERT-Base architecture. Although competing methods include an additional re-training phase, the OPTIN framework retains competitive test performance over a variety of compression ratios, establishing a compelling argument for retraining-free pipelines.

Natural Language Results

Image Classification

For Image Classification, ImageNet-1K is used to benchmark DeiT-Ti/S architectures, demonstrating the OPTIN framework's robustness across modalities. To place our performance in the context of a wider FLOPs spectrum and more compression methods, the figure below benchmarks against some of the most recent transformer compression techniques, several of which rely on re-training or training-adjacent procedures. Despite forgoing re-training, the OPTIN framework produces competitive results over various FLOP ratios.

Image Classification Results

Semantic Segmentation

To demonstrate the OPTIN framework's generalizability to complex architectures and downstream tasks, we apply model compression to the Mask2Former architecture with the Swin-Tiny backbone on the Cityscapes dataset. Qualitatively, the compressed network's predictions closely resemble the original's, with a small discrepancy towards the bottom right of the frame in an already difficult-to-segment region (as evidenced by the unclear segmentation in the original prediction) and on the traffic sign towards the top left.

Semantic Segmentation

BibTeX (Updated Version Coming Soon)


@InProceedings{Khaki_2024_ICLR,
  author    = {Khaki, Samir and Plataniotis, Konstantinos N.},
  title     = {The Need for Speed: Pruning Transformers with One Recipe},
  booktitle = {Proceedings of the Twelfth International Conference on Learning Representations},
  year      = {2024},
  url       = {https://openreview.net/forum?id=MVmT6uQ3cQ}
}