Dataset Pruning: Enhancing Deep Learning Efficiency on Large Scale Datasets
data-centric-ai
dataset-pruning
Author
Yiming Lin
Published
2023-08-30
Dataset pruning (Yang et al. 2023; Xia et al. 2023) is an emerging technique for data-efficient learning of deep neural networks. The idea revolves around identifying and removing redundant samples from the training set, creating a reduced training set (the coreset) on which a model can be trained to similar performance as one trained on the original dataset. This is useful in many scenarios: it can shorten training time, cut storage costs, and even improve model performance. In this article, we introduce the basic concepts of dataset pruning and review recent work in this area.
Large-scale datasets have been key to the success of deep learning, but they also bring many challenges. Training deep neural networks on them usually takes a long time, and storing them is expensive. They also raise privacy and security concerns, and as datasets grow, noisy samples, duplicates, skewed class distributions, and other quality issues become more common. It is therefore important to reduce the training set to an appropriate size while maintaining model performance.
Dataset pruning, also known as data selection or coreset selection, has been proposed to mitigate these issues. Dataset pruning methods typically compute a scalar score for each training example and select the subset of training samples whose scores meet certain criteria.
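To make this concrete, here is a minimal sketch of the score-then-select pipeline. The scoring function `score_fn` is a hypothetical placeholder (concrete scores follow in the next section), and keeping the top-scoring fraction is one common criterion, not the only one:

```python
import numpy as np

def prune_dataset(scores: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Return indices of the keep_ratio fraction of samples with the
    highest scores (one common selection criterion among several)."""
    n_keep = int(len(scores) * keep_ratio)
    # Sort indices by descending score and keep the top n_keep.
    return np.argsort(scores)[::-1][:n_keep]

# Usage sketch, with a hypothetical per-sample scoring function:
# scores = score_fn(model, dataset)
# coreset_indices = prune_dataset(scores, keep_ratio=0.3)
# coreset = torch.utils.data.Subset(dataset, coreset_indices)
```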
Different Dataset Pruning Strategies
Based on the score criteria, dataset pruning methods can be grouped into the following categories (Yang et al. 2023):
Geometric-based Score Criteria
Geometric-based dataset pruning methods use geometric distances in the feature space to score the training samples. A recent example, Moderate Coreset (Xia et al. 2023), measures the distance of each data point to its class center and selects the points whose distances are close to the median distance as the coreset.
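A minimal sketch of this median-distance selection, assuming per-sample features have already been extracted (e.g., from the penultimate layer of a trained network); the function name and interface are mine, not the authors' released code:

```python
import numpy as np

def moderate_coreset(features, labels, keep_ratio=0.5):
    """Per class, keep samples whose feature distance to the class
    center is closest to the median distance (the 'moderate' ones)."""
    features, labels = np.asarray(features), np.asarray(labels)
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        center = features[idx].mean(axis=0)                    # class center
        dists = np.linalg.norm(features[idx] - center, axis=1)
        gap = np.abs(dists - np.median(dists))                 # closeness to median
        n_keep = int(len(idx) * keep_ratio)
        keep.extend(idx[np.argsort(gap)[:n_keep]])
    return np.array(keep)
```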
Confidence-based Score Criteria
Confidence-based dataset pruning methods use the prediction confidence (or uncertainty) of the model's probabilistic outputs as the score. The coreset typically consists of samples the model is least confident about (Coleman et al. 2020), or those that lie near the decision boundary where prediction variability is high (Margatina et al. 2021; Chang, Learned-Miller, and McCallum 2017).
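As an illustration, a simple least-confidence score can be derived from the softmax outputs. This is a generic sketch rather than the exact scoring used in the cited works:

```python
import torch

@torch.no_grad()
def least_confidence_scores(model, loader, device="cpu"):
    """Score each sample as 1 - max softmax probability:
    higher score means the model is less confident."""
    model.eval()
    scores = []
    for x, _ in loader:
        probs = torch.softmax(model(x.to(device)), dim=1)
        scores.append(1.0 - probs.max(dim=1).values)
    return torch.cat(scores)
```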
Loss-based Score Criteria
Loss-based dataset pruning methods focus on the samples that contribute the most to the loss function, for example selecting the samples with the largest loss values or the largest gradients. Representative methods include GraNd and EL2N (Paul, Ganguli, and Dziugaite 2021) and Forgetting (Toneva et al. 2019; Wei et al. 2020; Jiang et al. 2018).
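For example, the EL2N score is the expected L2 norm of the error between the softmax prediction and the one-hot label. The sketch below computes it with a single model, whereas Paul, Ganguli, and Dziugaite (2021) average over several independently initialized models early in training:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def el2n_scores(model, loader, num_classes, device="cpu"):
    """EL2N: L2 norm of (softmax prediction - one-hot label).
    Larger values flag harder, more loss-relevant samples."""
    model.eval()
    scores = []
    for x, y in loader:
        probs = torch.softmax(model(x.to(device)), dim=1)
        onehot = F.one_hot(y.to(device), num_classes).float()
        scores.append((probs - onehot).norm(dim=1))
    return torch.cat(scores)
```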
Related Fields
Don’t confuse dataset pruning with the following related fields:
Model Pruning: Model pruning aims to reduce the size of the model by removing redundant parameters.
Dataset Distillation (Yu, Liu, and Wang 2023) or Dataset Condensation (Zhao, Mopuri, and Bilen 2020): Dataset distillation aims to synthesize a compact but informative dataset such that models trained on these synthetic samples have test performance similar to those trained on the original dataset.
Machine Unlearning (Zhang et al. 2023; “NeurIPS 2023 Machine Unlearning Challenge” n.d.): Machine unlearning aims to remove the influence of a subset (the forget set) of training samples on the trained model. A naive way to achieve this is to retrain the model from scratch on the remaining samples.
Coleman, Cody, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia. 2020. “Selection via Proxy: Efficient Data Selection for Deep Learning.” In ICLR. https://openreview.net/forum?id=HJg2b0VYDr.
Jiang, Lu, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. 2018. “MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels.” In Proceedings of the 35th International Conference on Machine Learning, 2304–13. PMLR. https://proceedings.mlr.press/v80/jiang18c.html.
Margatina, Katerina, Giorgos Vernikos, Loïc Barrault, and Nikolaos Aletras. 2021. “Active Learning by Acquiring Contrastive Examples.” In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 650–63. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.51.
Toneva, Mariya, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. 2019. “An Empirical Study of Example Forgetting During Deep Neural Network Learning.” In International Conference on Learning Representations. https://openreview.net/forum?id=BJlxm30cKm.
Xia, Xiaobo, Jiale Liu, Jun Yu, Xu Shen, Bo Han, and Tongliang Liu. 2023. “Moderate Coreset: A Universal Method of Data Selection for Real-World Data-Efficient Deep Learning.” In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=7D5EECbOaf9.
Yang, Shuo, Zeke Xie, Hanyu Peng, Min Xu, Mingming Sun, and Ping Li. 2023. “Dataset Pruning: Reducing Training Data by Examining Generalization Influence.” In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=4wZiAXD29TQ.
Zhao, Bo, Konda Reddy Mopuri, and Hakan Bilen. 2020. “Dataset Condensation with Gradient Matching.” In International Conference on Learning Representations. https://openreview.net/forum?id=mSAKhLYLSsl.
Citation
BibTeX citation:
@online{lin2023,
author = {Lin, Yiming},
title = {Dataset {Pruning:} {Enhancing} {Deep} {Learning} {Efficiency}
on {Large} {Scale} {Datasets}},
date = {2023-08-30},
url = {https://yiminglin-ai.github.io//blog/posts/dataset-pruning},
langid = {en}
}