Dataset Pruning: Enhancing Deep Learning Efficiency on Large Scale Datasets

data-centric-ai
dataset-pruning
Author

Yiming Lin

Published

2023-08-30

Dataset pruning (Yang et al. 2023; Xia et al. 2023) is an emerging technique for data-efficient learning of deep neural networks. The idea of dataset pruning revolves around identifying and removing redundant samples from the training set, creating a reduced training set (the coreset) that can be used to train a model with performance similar to that of a model trained on the original dataset. This technique is useful in many scenarios, such as reducing training time, reducing storage cost, and even improving model performance. In this article, we introduce the basic concepts of dataset pruning and review recent work in this area.

Dataset Pruning Image Credit: (Yang et al. 2023)

Introduction

Large-scale datasets have been key to the success of deep learning, but they also bring many challenges. Training deep neural networks on them is time-consuming, and storing them is expensive. They also raise privacy and security concerns. Furthermore, noisy samples, duplicated samples, skewed class distributions, and other data-quality issues become more common as dataset size increases. It is therefore important to reduce the training set to an appropriate size while maintaining model performance.

Dataset pruning, also known as data selection or coreset selection, has been proposed to mitigate the abovementioned issues. Dataset pruning methods typically compute a scalar score for each training example and select the subset of training samples whose scores meet certain criteria.
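The score-then-select recipe described above can be sketched in a few lines. The function below is a hypothetical helper (not from any of the cited papers): given precomputed per-sample scores, it keeps a fraction of samples with either the highest or the lowest scores.

```python
import numpy as np

def prune_dataset(scores, keep_fraction=0.5, keep="highest"):
    """Return indices of the samples to keep, given one scalar score
    per training example. Which extreme to keep depends on the method:
    some keep the highest-scoring (hardest) samples, others the lowest."""
    n_keep = int(len(scores) * keep_fraction)
    order = np.argsort(scores)  # indices sorted by ascending score
    if keep == "highest":
        return order[-n_keep:]
    return order[:n_keep]

# Toy scores for six samples; keep the top-scoring half as the coreset.
scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2, 0.8])
coreset_idx = prune_dataset(scores, keep_fraction=0.5, keep="highest")
```

The downstream model is then trained only on the rows of the dataset indexed by `coreset_idx`.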

Different Dataset Pruning Strategies

They can be categorized into the following categories based on the score criteria (Yang et al. 2023):

Geometric-based Score Criteria

The geometric-based dataset pruning methods use geometric distances in the feature space to score the training samples. A recent example is Moderate Coreset (Xia et al. 2023), which measures the distance of each data point to its class center and selects the points whose distances are close to the median distance as the coreset.
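A minimal sketch of this median-distance idea, assuming features have already been extracted (e.g. from a trained network's penultimate layer); this is an illustrative simplification, not the authors' implementation:

```python
import numpy as np

def moderate_coreset(features, labels, keep_fraction=0.5):
    """For each class, compute the class center in feature space,
    then keep the samples whose distance to the center is closest
    to the per-class median distance (the idea of Xia et al. 2023)."""
    selected = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        center = features[idx].mean(axis=0)
        dists = np.linalg.norm(features[idx] - center, axis=1)
        median = np.median(dists)
        # Rank samples by how close their distance is to the median.
        order = np.argsort(np.abs(dists - median))
        n_keep = max(1, int(len(idx) * keep_fraction))
        selected.extend(idx[order[:n_keep]].tolist())
    return np.array(sorted(selected))
```

Samples near the median distance are neither trivially easy (close to the center) nor likely outliers (far from it), which is the intuition behind selecting "moderate" examples.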

Confidence-based Score Criteria

The confidence-based dataset pruning methods use the prediction confidence/uncertainty of the model's probabilistic outputs as the score. The coreset typically consists of samples that the model has the least confidence in (Coleman et al. 2020), or those that lie near the decision boundary where prediction variability is high (Margatina et al. 2021; Chang, Learned-Miller, and McCallum 2017).
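A simple instance of this family is least-confidence scoring over softmax outputs. The sketch below is generic, not tied to any one cited method:

```python
import numpy as np

def least_confidence_scores(probs):
    """Score each sample as 1 - max predicted probability.
    `probs` is an (N, C) array of softmax outputs; higher score
    means the model is less confident about that sample."""
    return 1.0 - probs.max(axis=1)

# Toy softmax outputs for three samples over two classes.
probs = np.array([[0.90, 0.10],
                  [0.55, 0.45],
                  [0.70, 0.30]])
scores = least_confidence_scores(probs)
# Keep the half of the samples the model is least sure about.
n_keep = len(scores) // 2
coreset_idx = np.argsort(scores)[-n_keep:]
```

Sample 1, with nearly uniform probabilities, lies closest to the decision boundary and is the one retained here.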

Loss-based Score Criteria

The loss-based dataset pruning methods focus on the samples that contribute the most to the loss function, for example by selecting the samples with the largest loss values or the largest gradient norms. Representative methods include GraNd and EL2N (Paul, Ganguli, and Dziugaite 2021) and Forgetting scores (Toneva et al. 2019; Wei et al. 2020; Jiang et al. 2018).
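As a concrete example from this family, the EL2N score of Paul, Ganguli, and Dziugaite (2021) is the L2 norm of the error vector between the predicted class probabilities and the one-hot label (averaged over early-training checkpoints in the paper; a single-checkpoint sketch is shown here):

```python
import numpy as np

def el2n_scores(probs, labels, num_classes):
    """EL2N score: ||p(x) - y||_2, where p(x) is the (N, C) array of
    predicted probabilities and y is the one-hot encoding of `labels`.
    Larger scores indicate harder, more error-prone examples."""
    onehot = np.eye(num_classes)[labels]
    return np.linalg.norm(probs - onehot, axis=1)
```

Samples with the largest EL2N scores are kept, on the grounds that examples the model gets badly wrong early in training carry the most learning signal.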

References

Chang, Haw-Shiuan, Erik Learned-Miller, and Andrew McCallum. 2017. “Active Bias: Training More Accurate Neural Networks by Emphasizing High Variance Samples.” In Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/hash/2f37d10131f2a483a8dd005b3d14b0d9-Abstract.html.
Coleman, Cody, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. 2020. “Selection via Proxy: Efficient Data Selection for Deep Learning.” In ICLR. https://openreview.net/forum?id=HJg2b0VYDr.
Jiang, Lu, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. 2018. “MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels.” In Proceedings of the 35th International Conference on Machine Learning, 2304–13. PMLR. https://proceedings.mlr.press/v80/jiang18c.html.
Margatina, Katerina, Giorgos Vernikos, Loïc Barrault, and Nikolaos Aletras. 2021. “Active Learning by Acquiring Contrastive Examples.” In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 650–63. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.51.
“NeurIPS 2023 Machine Unlearning Challenge.” n.d. NeurIPS 2023 Machine Unlearning Challenge. Accessed August 30, 2023. https://unlearning-challenge.github.io/unlearning-challenge.github.io/.
Paul, Mansheej, Surya Ganguli, and Gintare Karolina Dziugaite. 2021. “Deep Learning on a Data Diet: Finding Important Examples Early in Training.” In Advances in Neural Information Processing Systems, 34:20596–607. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2021/hash/ac56f8fe9eea3e4a365f29f0f1957c55-Abstract.html.
Toneva, Mariya, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. 2019. “An Empirical Study of Example Forgetting During Deep Neural Network Learning.” In International Conference on Learning Representations. https://openreview.net/forum?id=BJlxm30cKm.
Wei, Hongxin, Lei Feng, Xiangyu Chen, and Bo An. 2020. “Combating Noisy Labels by Agreement: A Joint Training Method with Co-Regularization.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13726–35. https://openaccess.thecvf.com/content_CVPR_2020/html/Wei_Combating_Noisy_Labels_by_Agreement_A_Joint_Training_Method_with_CVPR_2020_paper.html.
Xia, Xiaobo, Jiale Liu, Jun Yu, Xu Shen, Bo Han, and Tongliang Liu. 2023. “Moderate Coreset: A Universal Method of Data Selection for Real-World Data-Efficient Deep Learning.” In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=7D5EECbOaf9.
Yang, Shuo, Zeke Xie, Hanyu Peng, Min Xu, Mingming Sun, and Ping Li. 2023. “Dataset Pruning: Reducing Training Data by Examining Generalization Influence.” In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=4wZiAXD29TQ.
Yu, Ruonan, Songhua Liu, and Xinchao Wang. 2023. “Dataset Distillation: A Comprehensive Review.” arXiv. https://doi.org/10.48550/arXiv.2301.07014.
Zhang, Haibo, Toru Nakamura, Takamasa Isohara, and Kouichi Sakurai. 2023. “A Review on Machine Unlearning.” SN Computer Science 4 (4): 337. https://doi.org/10.1007/s42979-023-01767-4.
Zhao, Bo, Konda Reddy Mopuri, and Hakan Bilen. 2020. “Dataset Condensation with Gradient Matching.” In International Conference on Learning Representations. https://openreview.net/forum?id=mSAKhLYLSsl.

Citation

BibTeX citation:
@online{lin2023,
  author = {Lin, Yiming},
  title = {Dataset {Pruning:} {Enhancing} {Deep} {Learning} {Efficiency}
    on {Large} {Scale} {Datasets}},
  date = {2023-08-30},
  url = {https://yiminglin-ai.github.io//blog/posts/dataset-pruning},
  langid = {en}
}
For attribution, please cite this work as:
Lin, Yiming. 2023. “Dataset Pruning: Enhancing Deep Learning Efficiency on Large Scale Datasets.” August 30, 2023. https://yiminglin-ai.github.io//blog/posts/dataset-pruning.