Analysing Semi-Supervised ConvNet Model Performance with Computation Processes
Abstract
1. Introduction
- We propose a framework for connecting contemporary DNNs with SSML and illustrate its computation process, highlighting the most critical functions.
- We outline essential preprocessing steps when implementing these networks across varying labeled/unlabeled data ratios.
- We combine the preprocessing-focused SSCP (PF-SSCP) with the classifier-focused SSCP (CF-SSCP), offering insights on how to maximize performance.
- We present the anatomy of SSML algorithms, furnishing guidelines for future model design rooted in an understanding of this anatomy.
2. Related Work
3. Survey Analysis of SOTA Models with Modular Computation Components
3.1. Temporal Ensembling
- Since the network undergoes evaluation once for each input, the training process is expedited.
- With a training target characterized by reduced noise compared to the Π model, Temporal Ensembling can potentially realize superior accuracy.
- It necessitates the storage of auxiliary data and the introduction of a new hyperparameter, α, thereby demanding additional memory resources.
- Similar to the Π model, the Temporal Ensembling model lacks dynamic learning capabilities, necessitating system retraining for effective functionality.
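To make the target-maintenance step concrete, the following is a minimal sketch of the per-epoch ensemble update described in [4]; the variable names and the function itself are our illustrative choices, not code from the original work.

```python
import numpy as np

def update_targets(Z: np.ndarray, z: np.ndarray, alpha: float, t: int):
    """One epoch of Temporal Ensembling target maintenance.

    Z: accumulated ensemble predictions (N x C); z: this epoch's
    predictions (N x C); alpha: ensembling momentum; t: epoch index (>= 1).
    """
    Z = alpha * Z + (1.0 - alpha) * z   # exponential moving average of predictions
    z_tilde = Z / (1.0 - alpha ** t)    # correct the startup bias toward zero
    return Z, z_tilde
```

The auxiliary matrix `Z` is exactly the additional memory cost noted above, and `alpha` is the new hyperparameter α.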
3.2. Π Model
- Dropout regularization, infusing an element of randomness into these twin architectures.
- The types of augmentations employed, which expand the dataset and render identical labeled and unlabeled data as distinct inputs.
- The network output is computed twice for the same input data, resulting in additional computational costs.
- The model is vulnerable to errors when fed incorrectly labeled data, negatively impacting its accuracy.
- This model cannot learn dynamically as it relies on batch-learning, in contrast with online learning neural networks.
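A minimal sketch of the Π model's double evaluation, assuming a PyTorch classifier with active dropout and a hypothetical stochastic `augment` function; the two passes differ only in their noise, which is what the consistency term penalizes.

```python
import torch.nn.functional as F

def pi_model_loss(model, augment, x, y, labeled_mask, w_t):
    # Evaluate the same batch twice; dropout and augmentation noise make
    # the outputs differ even for identical inputs.
    z1 = model(augment(x))
    z2 = model(augment(x))
    consistency = F.mse_loss(z1.softmax(dim=-1), z2.softmax(dim=-1))
    # Classification cost only over labeled examples (y holds dummy values
    # for unlabeled entries; labeled_mask marks the valid ones).
    supervised = F.cross_entropy(z1[labeled_mask], y[labeled_mask])
    return supervised + w_t * consistency
```

The two forward passes per batch are the source of the extra computational cost listed above.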
3.3. Mean Teacher
The parent (teacher) weights are updated as an exponential moving average of the child (student) weights, $\theta'_t = \alpha\,\theta'_{t-1} + (1 - \alpha)\,\theta_t$, where:

- $\theta'_t$ specifies the parent weights;
- $\theta'_{t-1}$ signifies the parent weights of the previous run;
- $\theta_t$ specifies the child weights; and
- $\alpha$ specifies the smoothing parameter.
- Both networks perform forward propagation using different parameters; however, only the student model performs backward propagation to compute the gradients.
- The Mean Teacher model uses considerably more memory than the other two models, since the dropout conditions ($\eta$ and $\eta'$) must be preserved during each epoch.
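The EMA update itself is compact; a sketch in PyTorch, under the assumption that teacher and student share an architecture (the decorator, loop, and names are ours):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha: float = 0.999):
    # theta'_t = alpha * theta'_{t-1} + (1 - alpha) * theta_t, per tensor.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)
```

Called once per training step, this realizes the smoothing controlled by $\alpha$ without any gradient flow through the teacher.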
3.4. MixMatch
- p is an input categorical distribution (in MixMatch, the average of the model's predictions across augmentations);
- T is the sharpening temperature, a hyperparameter that needs to be tuned; and
- L is the number of labeled classes.
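For reference, the sharpening step these symbols describe can be written directly; a minimal sketch (function name ours):

```python
import torch

def sharpen(p: torch.Tensor, T: float) -> torch.Tensor:
    # Sharpen(p, T)_i = p_i^(1/T) / sum_j p_j^(1/T).
    # Lower T pushes p toward one-hot, raising pseudo-label confidence.
    p_pow = p ** (1.0 / T)
    return p_pow / p_pow.sum(dim=-1, keepdim=True)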
3.5. ReMixMatch
4. SSML Model Analysis Framework
- One DNN: Here, a single DNN is employed in the classifier, conducting both backward and forward propagations.
- Two DNNs: This configuration utilizes two nearly identical DNNs, differing slightly in their dropout conditions and parameter initialization. Forward propagation occurs in both DNNs; however, backward propagation might not occur in one of these DNNs in some designs.
- Three or more DNNs: Here, three or more DNNs are employed, potentially with varying initialization conditions. Additionally, dropout conditions may or may not be integrated into these DNNs.
- Unlabeled data: For unlabeled data, the consistency cost among two or more DNN instances is calculated and scaled by the inverse of the product of the number of classes and the batch size.
- Labeled data: In this scenario, both the consistency cost among two or more DNN instances and the classification cost for one or more models are computed.
- The design’s emphasis is placed on preprocessing steps rather than the network classifier.
- The classification cost is always calculated, because pseudo-labels are generated for the unlabeled datasets.
The consistency cost takes the form
$$\text{loss} = w(t)\,\frac{1}{C\,|B|}\sum_{i \in B}\lVert z_i - \tilde{z}_i\rVert^2,$$
where:

- $z_i$ represents the output of the network $f_\theta$, and $\theta$ denotes the function's training parameters.
- $\tilde{z}_i$ denotes the consistency target produced by a second network instance, and $\theta'$ represents that instance's training parameters.
- $|B|$ is the batch size.
- $w(t)$ signifies the time-dependent weighting function (discussed subsequently).
- $C$ represents the number of classes, with $C = 10$ (e.g., cat, tree, dog, car, etc., for CIFAR-10).
The total cost for labeled data adds the classification term:
$$\text{loss} = -\frac{1}{|B|}\sum_{i \in B \cap L}\log z_i[y_i] + w(t)\,\frac{1}{C\,|B|}\sum_{i \in B}\lVert z_i - \tilde{z}_i\rVert^2,$$
where:

- $f_\theta$ specifies the network function, and $\theta$ specifies the function's training parameters.
- $|B|$ represents the batch size.
- $L$ includes the labeled input indices.
- $y_i$ is the label for input $x_i$.
- $w(t)$ specifies the time-dependent weighting function (discussed later).
- $C$ specifies the number of different classes (cat, tree, dog, car, etc.).
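Combining the two cost terms, a sketch of a per-batch objective consistent with the formulas above (PyTorch; `z` is assumed to be logits, `z_tilde` an already-softmaxed target, and `F.cross_entropy` averages over the labeled subset rather than the full batch, a common implementation choice):

```python
import torch
import torch.nn.functional as F

def ssml_batch_loss(z, z_tilde, y, labeled_idx, w_t):
    # Classification cost over labeled indices: -log z_i[y_i].
    supervised = F.cross_entropy(z[labeled_idx], y[labeled_idx])
    # Consistency cost: squared error averaged over batch and classes,
    # which realizes the 1/(C|B|) scaling.
    consistency = F.mse_loss(z.softmax(dim=-1), z_tilde)
    return supervised + w_t * consistency
```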
5. Empirical Evaluation Method
5.1. Datasets
5.2. Preprocessing
5.3. SSML and DNN Combination
- Performance, gauged through accuracy and loss metrics.
- The duration of the training process.
- Constraints imposed by hardware.
- The initial selection of hyperparameters.
- The appropriateness of the loss function.
5.4. Ramp-Up and Ramp-Down Functions
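A common choice for the time-dependent weighting $w(t)$ is the Gaussian ramp-up of [4]; a minimal sketch, with the ramp length and maximum weight as assumed tunables:

```python
import math

def ramp_up(epoch: int, ramp_length: int = 80, w_max: float = 1.0) -> float:
    # w(t) = w_max * exp(-5 * (1 - t)^2) with t = epoch / ramp_length
    # clipped to [0, 1]; a mirrored exp(-12.5 * t^2) factor is often used
    # to ramp the learning rate down over the final epochs.
    t = min(max(epoch / ramp_length, 0.0), 1.0)
    return w_max * math.exp(-5.0 * (1.0 - t) ** 2)
```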
5.5. SSML Performance Measurements
- Performance accuracy and training loss, evaluated against varying proportions of unlabeled to labeled data.
- The training time required by each combination of SSML and network classifier, including time spent on the preprocessing and feature selection stages.
- Parameter utilization and the complexity intrinsic to various SSML algorithms, detailing the distinct parameters required in each scenario and their influence on complexity.
5.6. Experimental Design and Framework Specifications
- DenseNet-121 (Dense Convolutional Network—121 layers): Known for its dense connectivity, DenseNet-121 optimizes parameter efficiency and facilitates feature propagation and reuse. Its design is particularly beneficial for learning with limited labeled data, which is a common challenge in semi-supervised learning scenarios [29].
- Shake-Shake Regularization Model (Shake-Shake26): The Shake-Shake regularization approach, exemplified by the Shake-Shake26 model, introduces stochasticity [30] into the training process. This method has been shown to enhance generalization on image classification tasks, presenting a unique advantage in semi-supervised learning frameworks [31] (a forward-pass sketch follows this list).
- Wide Residual Networks (WRN-40-2): The WRN-40-2 architecture augments the network’s width, offering an optimal trade-off between depth and width. This expanded network capacity allows it to represent more complex functions and data relationships, benefiting from the additional unlabeled data in semi-supervised learning setups [8].
- WRN-28-10: The WRN-28-10 extends the width of traditional residual networks even further, targeting the rigorous demands of high-complexity classification tasks. The architecture is designed to capitalize on the unlabeled data that are more prevalent in semi-supervised learning contexts [8].
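To illustrate the stochastic mixing referenced for Shake-Shake26 above, a sketch of the forward pass of one shake-shake residual block (PyTorch; the branch modules and names are ours, and the full method's backward-pass re-randomization is omitted):

```python
import torch

def shake_shake_forward(x, branch1, branch2, training: bool):
    # Mix two residual branches with a random convex weight per sample
    # during training; use a deterministic 0.5/0.5 average at evaluation.
    y1, y2 = branch1(x), branch2(x)
    if training:
        alpha = torch.rand(x.size(0), 1, 1, 1, device=x.device)
        return x + alpha * y1 + (1.0 - alpha) * y2
    return x + 0.5 * (y1 + y2)
```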
6. Evaluation Results and Analysis
6.1. CF-SSCP with Various Network Classifiers
6.1.1. Results of Temporal Ensembling
Summary:
6.1.2. Results of Π Model and Mean Teacher
- WideResNet Core Network with Mean Teacher: Figure 11 depicts the accuracy achieved by the Mean Teacher model using WideResNet as our core network. The graph indicates that the highest validation accuracy was achieved using the Stochastic Gradient Descent (SGD) [32] algorithm with 44,000 labels, the maximum in these experiments. Notably, accuracy diminishes as the label count decreases. This trend was anticipated, since accuracy correlates with the total label count when hyperparameters are held fixed. However, the primary observation here concerns the discrepancy between the SGD and Adam optimizers: the SGD optimizer [33] clearly surpassed Adam in accuracy. Examining the loss in Figure 12, the Adam optimizer performs better, with lower test loss. Another notable aspect is both networks' initial struggle to reduce loss smoothly, resulting in a highly noisy loss trajectory; we attribute this to keeping hyperparameter values fixed across different volumes of labeled data, which is not always suitable. A final critical point pertains to initialization. Initialization parameters matter, as is evident from the 10,000-label run starting at a significantly lower loss than the 44,000-label run, whose higher initial loss hinders its performance.
- WideResNet core network with Π model: For the Adam optimizer, accuracy does not directly correlate with the number of labels, as is distinctly seen in Figure 11 and Figure 13. Here, high training accuracy is noted with 1000, 4000, and 44,000 labels. Among these, the 1000-label dataset provides the highest training accuracy. However, while the 10,000-label dataset's training accuracy is 20% lower than the others, its testing accuracy, at 88.56%, surpasses the rest. Additionally, the 1000-label set improves in accuracy up to 58% before declining, suggesting overfitting, as evidenced in Figure 14. Comparing this network across label quantities, the loss is significantly lower at 10,000 labels, a result of hyperparameter tuning and network initialization. Even with 44,000 labels, the network struggles to reduce loss because of an unfavorable start. Moreover, the network's starting behavior varies across datasets, except for the 10,000-label set, where the loss increases from epochs 0 to 30. Significant fluctuations are particularly noticeable for the 1000-label data. These issues could be addressed through hyperparameter tuning using strategies such as Random Search or Grid Search.
- Comparison of Mean Teacher with the Π model using WideResNet as the core network: Figure 15 compares the testing and validation accuracy of both networks, demonstrating the Mean Teacher model's superiority over the Π model on the 4000-label dataset. Furthermore, as shown in Figure 16, the Mean Teacher model excels in terms of loss, even with suboptimal hyperparameter tuning, highlighting the efficacy of the network design given the same core network and identical hyperparameters.
- Mean Teacher with DenseNet and Shake-Shake26: We deemed it essential to test the state-of-the-art SSML Mean Teacher model with a core network other than Shake-Shake26, hence the choice of DenseNet-121. Figure 17 shows that the training loss for both student networks is predictably lower, as the teacher network guides them using the EMA formula. These outcomes suggest that the teacher's parameters are optimized for the network's best overall performance; moreover, the loss from the teacher represents the most optimal value attainable. Additionally, the Shake-Shake26 network significantly outperforms DenseNet-121. As seen in Figure 18, this superiority is also reflected in accuracy. The primary reason is Shake-Shake26's wider and deeper network compared to DenseNet-121. However, it requires more computational time for training and hyperparameter tuning; system designers must weigh this overhead, particularly when prioritizing performance over flexibility across datasets. Figure 19 provides a few intriguing observations about the relationship between the teacher and the student models. First, with 4000-label data, the Shake-Shake26 training process experiences a large surge in the student network's loss, which in turn impacts the teacher network's progress. Second, the accuracy of a given neural network differs across datasets with different proportions of labeled data, implying that the network design must be adapted to dataset variations. Moreover, tuning the EMA hyperparameter can yield better teacher targets in certain scenarios.
Summary:
6.2. PF-SSCP with Different Network Classifiers
Results of MixMatch and ReMixMatch Analysis
- MixMatch Compared to ReMixMatch with WideResNet as a Core Network: Given PF-SSCP's emphasis on performance with the increased use of unlabeled data, we compared the outcomes from both frameworks using the same WideResNet-28-2, which has a depth of 28, a width of 2, and incorporates batch normalization [34]. As depicted in Figure 20, ReMixMatch slightly outperforms MixMatch with 4000 labels. This discrepancy widens with fewer labeled data. With 1000 labels, as shown in Figure 21, the performance gap becomes more pronounced. An intriguing observation is the convergence of the two semi-supervised machine learning (SSML) models within a limited number of epochs. ReMixMatch achieves higher initial accuracy, indicating its strength with fewer labeled data. However, overall performance differences are marginal, as MixMatch eventually catches up by the end of training.
- MixMatch and ReMixMatch with Shake-Shake26 as a core network: Employing different core networks allows us to visualize the performance contributions of the SSMLs, independent of the underlying networks. By switching to Shake-Shake26, we sought to discern any performance variations. A close examination with 1000 labels, as seen in Figure 22, reveals an insignificant difference compared to WideResNet-28-2. Shake-Shake26 converges more rapidly, with marginally better accuracy, but this advantage is not as pronounced as when these Deep Neural Networks (DNNs) are used independently. In Figure 23, with 4000 labels, ReMixMatch paired with Shake-Shake26 shows the highest accuracy. Notably, ReMixMatch with WideResNet and MixMatch with Shake-Shake26 are closely matched; the accuracy of ReMixMatch with WideResNet is slightly higher, but only by a narrow margin. Another key observation is the divergence in convergence patterns between MixMatch with Shake-Shake26 and MixMatch with WideResNet, confirmed through five consecutive tests to ensure the finding's reliability.
Summary:
6.3. Training Time
7. Discussion
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
SSML | Semi-supervised machine learning
DNNs | Deep neural networks
XAI | Explainable artificial intelligence
CNN | Convolutional neural network
GNNs | Graph neural networks
SSCP | Semi-supervised computation process
PF-SSCP | Preprocessing-focused semi-supervised computation process
CF-SSCP | Network classifier-focused semi-supervised computation process
ResNet | Residual neural network
WRN | Wide residual network
DenseNet | Densely connected convolutional network
PCC | Pearson correlation coefficient
Appendix A
- Experiment 1 (Temporal Ensembling and Shake-Shake26): This trial involved processing 32 × 32 × 3 images with ZCA and training on two Shake-Shake26 networks. A batch size of 128 was used over 300 epochs, with a dropout rate of 0.2, a learning rate of 0.2 (reduced at 50% and 75% of epochs), a momentum of 0.86, and a weight decay of 0.0002. (A sketch of this configuration as a settings dictionary follows this list.)
- Experiment 2 (Temporal Ensembling and DenseNet-121): Similar to Experiment 1 but utilizing two DenseNet-121 networks with a batch size of 64 and an initial learning rate of 0.1, which is reduced at the midpoint and three-quarter mark of epochs, alongside a momentum of 0.9 and a weight decay of 0.0001.
- Experiment 3 (Temporal Ensembling and WRN-40-2): Similar to Experiment 1 but using two WRN-40-2 networks, with a dropout of 0.1, learning rate of 0.1 (decreased at pre-set epochs), and a weight decay of 0.0005.
- Experiment 4 (Temporal Ensembling and WRN-28-10): Similar to Experiment 3 but with two WRN-28-10 networks and an increased dropout of 0.3.
- Experiment 5 (Π model and Shake-Shake26): Similar to Experiment 1 but under the Π model methodology with the same dropout rate and learning rate reduction schedule.
- Experiment 6 (Π model and DenseNet-121): Similar to Experiment 2 but following the Π model framework with identical dropout and learning rate scheduling.
- Experiment 7 (Π model and WRN-40-2): Similar to Experiment 3, applying the Π model approach with a dropout of 0.3.
- Experiment 8 (Π model and WRN-28-10): Similar to Experiment 7 but using WRN-28-10 networks.
- Experiment 9 (Mean Teacher and Shake-Shake26): Similar to Experiment 1 with the Mean Teacher model, applying an EMA of 0.999 and using the Adam optimizer with specified parameters.
- Experiment 10 (Mean Teacher and DenseNet-121): Similar to Experiment 2, adhering to the Mean Teacher methodology.
- Experiment 11 (Mean Teacher and WRN-40-2): Similar to Experiment 3 with the Mean Teacher framework and the same dropout rate.
- Experiment 12 (Mean Teacher and WRN-28-10): Similar to Experiment 7, utilizing the Mean Teacher strategy.
- Experiment 13 (MixMatch and Shake-Shake26): Similar to Experiment 1 but using the MixMatch approach, with specific augmentations and no dropout regularization.
- Experiment 14 (MixMatch and DenseNet-121): Similar to Experiment 2, employing the MixMatch method with a weight decay of 0.997.
- Experiment 15 (MixMatch and WRN-40-2): Similar to Experiment 3, following the MixMatch technique with an optimal weight decay.
- Experiment 16 (MixMatch and WRN-28-10): Similar to Experiment 7, incorporating the MixMatch framework.
- Experiment 17 (ReMixMatch and Shake-Shake26): Similar to Experiment 13 but applying the ReMixMatch methodology.
- Experiment 18 (ReMixMatch and DenseNet-121): Similar to Experiment 14, using the ReMixMatch strategy with a preferred weight decay.
- Experiment 19 (ReMixMatch and WRN-40-2): Similar to Experiment 15, employing the ReMixMatch technique with a specific weight decay.
- Experiment 20 (ReMixMatch and WRN-28-10): Similar to Experiment 16, following the ReMixMatch approach.
- Experiment 21 (Temporal Ensembling and Shake-Shake26 with reduced dropout): Similar to Experiment 1 but with a reduced dropout of 0.023.
- Experiment 22 (Temporal Ensembling and DenseNet-121 with reduced dropout): Similar to Experiment 2, with a lowered dropout rate of 0.045.
- Experiment 23 (Temporal Ensembling and WRN-40-2 with reduced dropout): Similar to Experiment 3, with a dropout rate adjusted to 0.087.
- Experiment 24 (Temporal Ensembling and WRN-28-10 with reduced dropout): Similar to Experiment 4, with a decreased dropout of 0.083.
- Experiment 25 (Π model and Shake-Shake26 with reduced dropout): Similar to Experiment 5, maintaining a dropout rate of 0.2.
- Experiment 26 (Π model and DenseNet-121 with reduced dropout): Similar to Experiment 6, with a lowered dropout of 0.022.
- Experiment 27 (Π model and WRN-40-2 with reduced dropout): Similar to Experiment 7, with a reduced dropout of 0.072.
- Experiment 28 (Π model and WRN-28-10 with reduced dropout): Similar to Experiment 8, with a dropout rate of 0.068.
- Experiment 29 (Mean Teacher and Shake-Shake26 with reduced dropout): Similar to Experiment 9, with a dropout of 0.03.
- Experiment 30 (Mean Teacher and DenseNet-121 with reduced dropout): Similar to Experiment 10, with a reduced dropout rate of 0.02.
- Experiment 31 (Mean Teacher and WRN-40-2 with reduced dropout): Similar to Experiment 11, with a decreased dropout of 0.082.
- Experiment 32 (Mean Teacher and WRN-28-10 with reduced dropout): Similar to Experiment 12, with a lowered dropout rate of 0.075.
- Experiment 33 (MixMatch and Shake-Shake26 with four augmentations): Similar to Experiment 13 but with an increased number of augmentations set to 4.
- Experiment 34 (MixMatch and DenseNet-121 with reduced weight decay): Similar to Experiment 14, with a modified weight decay of 0.997.
- Experiment 35 (MixMatch and WRN-40-2 with optimal weight): Similar to Experiment 15, with a tuned weight decay of 0.999.
- Experiment 36 (MixMatch and WRN-28-10 with optimal weight): Similar to Experiment 16, applying a fine-tuned weight.
- Experiment 37 (MixMatch and WRN-16-10 with optimal weight): Similar to Experiment 14, but using WRN-16-10 networks.
- Experiment 38 (ReMixMatch and Shake-Shake26 with dropout): Same as Experiment 17, but with a dropout of 0.953.
- Experiment 39 (ReMixMatch and DenseNet-121 with dropout): Similar to Experiment 18, with a dropout rate of 0.238.
- Experiment 40 (ReMixMatch and WRN-40-2 reduced weight decay): Similar to Experiment 19, with a lowered weight decay of 0.877.
- Experiment 41 (ReMixMatch and WRN-16-10 with optimal weight): Similar to Experiment 14, using WRN-16-10 networks.
- Experiment 42 (ReMixMatch and WRN-16-10 with dropout): Similar to Experiment 41, with a dropout rate of 0.38.
- Experiment 43 (ReMixMatch and WRN-16-10 with weight decay): Identical to Experiment 42, with a weight decay of 0.997.
- Experiment 44 (ReMixMatch and WRN-28-10 with dropout): Similar to Experiment 20, with a dropout rate of 0.233.
- Experiment 45 (ReMixMatch and WRN-28-10 with very low dropout): Similar to Experiment 20, with a very low dropout of 0.0233.
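For orientation, a hypothetical encoding of Experiment 1's settings as a plain configuration dictionary; the field names are illustrative and do not come from the authors' code.

```python
experiment_1 = {
    "ssml": "temporal_ensembling",
    "core_network": "Shake-Shake26",
    "num_networks": 2,
    "input_shape": (32, 32, 3),
    "preprocessing": "ZCA",
    "batch_size": 128,
    "epochs": 300,
    "dropout": 0.2,
    "learning_rate": 0.2,    # reduced at 50% and 75% of the epochs
    "momentum": 0.86,
    "weight_decay": 2e-4,
}
```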
References
- Chapelle, O.; Schölkopf, B.; Zien, A. (Eds.) Semi-Supervised Learning; The MIT Press: Cambridge, MA, USA, 2006.
- Rudin, C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. 2019. Available online: http://xxx.lanl.gov/abs/1811.10154 (accessed on 3 April 2023).
- Adadi, A.; Berrada, M. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 2018, 6, 52138–52160.
- Laine, S.; Aila, T. Temporal Ensembling for Semi-Supervised Learning. CoRR 2016, abs/1610.02242. Available online: http://xxx.lanl.gov/abs/1610.02242 (accessed on 14 February 2022).
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. CoRR 2015, abs/1512.03385. Available online: http://xxx.lanl.gov/abs/1512.03385 (accessed on 17 February 2022).
- Neghawi, E.; Liu, Y. Evaluation of Parameter Update Effects in Deep Semi-Supervised Learning Algorithms. In Proceedings of the 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC), Madrid, Spain, 13–17 July 2020; pp. 351–360.
- Lin, Y.; Han, S.; Mao, H.; Wang, Y.; Dally, W.J. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. CoRR 2017, abs/1712.01887. Available online: http://xxx.lanl.gov/abs/1712.01887 (accessed on 3 March 2023).
- Zagoruyko, S.; Komodakis, N. Wide Residual Networks. 2017. Available online: http://xxx.lanl.gov/abs/1605.07146 (accessed on 19 March 2023).
- Baldassarre, F.; Azizpour, H. Explainability Techniques for Graph Convolutional Networks. CoRR 2019, abs/1905.13686. Available online: http://xxx.lanl.gov/abs/1905.13686 (accessed on 22 February 2023).
- Yuan, H.; Yu, H.; Gui, S.; Ji, S. Explainability in Graph Neural Networks: A Taxonomic Survey. CoRR 2020, abs/2012.15445. Available online: http://xxx.lanl.gov/abs/2012.15445 (accessed on 3 March 2023).
- Xie, N.; Ras, G.; van Gerven, M.; Doran, D. Explainable Deep Learning: A Field Guide for the Uninitiated. CoRR 2020, abs/2004.14545. Available online: http://xxx.lanl.gov/abs/2004.14545 (accessed on 7 March 2023).
- Shen, W.; Wei, Z.; Huang, S.; Zhang, B.; Fan, J.; Zhao, P.; Zhang, Q. Interpretable Compositional Convolutional Neural Networks. CoRR 2021, abs/2107.04474. Available online: http://xxx.lanl.gov/abs/2107.04474 (accessed on 8 February 2023).
- Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608.
- Yuan, H.; Tang, J.; Hu, X.; Ji, S. XGNN: Towards Model-Level Explanations of Graph Neural Networks. CoRR 2020, abs/2006.02587. Available online: http://xxx.lanl.gov/abs/2006.02587 (accessed on 20 December 2022).
- Pope, P.E.; Kolouri, S.; Rostami, M.; Martin, C.E.; Hoffmann, H. Explainability Methods for Graph Convolutional Neural Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 10764–10773.
- Kim, G. Recent Deep Semi-supervised Learning Approaches and Related Works. CoRR 2021, abs/2106.11528. Available online: http://xxx.lanl.gov/abs/2106.11528 (accessed on 17 February 2022).
- Tarvainen, A.; Valpola, H. Weight-averaged consistency targets improve semi-supervised deep learning results. CoRR 2017, abs/1703.01780. Available online: http://xxx.lanl.gov/abs/1703.01780 (accessed on 22 December 2022).
- Berthelot, D.; Carlini, N.; Goodfellow, I.J.; Papernot, N.; Oliver, A.; Raffel, C. MixMatch: A Holistic Approach to Semi-Supervised Learning. CoRR 2019, abs/1905.02249. Available online: http://xxx.lanl.gov/abs/1905.02249 (accessed on 22 December 2022).
- Berthelot, D.; Carlini, N.; Cubuk, E.D.; Kurakin, A.; Sohn, K.; Zhang, H.; Raffel, C. ReMixMatch: Semi-Supervised Learning with Distribution Alignment and Augmentation Anchoring. arXiv 2020, arXiv:1911.09785.
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
- Pal, K.K.; Sudeep, K.S. Preprocessing for image classification by convolutional neural networks. In Proceedings of the 2016 IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT), Bangalore, India, 20–21 May 2016.
- Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2013.
- Arrieta, A.B.; Díaz-Rodríguez, N.; Ser, J.D.; Bennetot, A.; Tabik, S.; Barbado, A.; García, S.; Gil-López, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI. 2019. Available online: http://xxx.lanl.gov/abs/1910.10045 (accessed on 17 April 2023).
- Chen, J.; Song, L.; Wainwright, M.J.; Jordan, M.I. Learning to Explain: An Information-Theoretic Perspective on Model Interpretation. CoRR 2018, abs/1802.07814. Available online: http://xxx.lanl.gov/abs/1802.07814 (accessed on 22 February 2023).
- Lipton, Z.C. The mythos of model interpretability. Queue 2018, 16, 30.
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. 2020. Available online: http://xxx.lanl.gov/abs/1911.05722 (accessed on 22 February 2023).
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. CoRR 2015, abs/1512.00567. Available online: http://xxx.lanl.gov/abs/1512.00567 (accessed on 5 April 2023).
- Hadjis, S.; Zhang, C.; Mitliagkas, I.; Ré, C. Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs. CoRR 2016, abs/1606.04487. Available online: http://xxx.lanl.gov/abs/1606.04487 (accessed on 18 February 2022).
- Huang, G.; Liu, Z.; Weinberger, K.Q. Densely Connected Convolutional Networks. CoRR 2016, abs/1608.06993. Available online: http://xxx.lanl.gov/abs/1608.06993 (accessed on 17 February 2022).
- Mania, H.; Pan, X.; Papailiopoulos, D.; Recht, B.; Ramchandran, K.; Jordan, M.I. Perturbed Iterate Analysis for Asynchronous Stochastic Optimization. 2016. Available online: http://xxx.lanl.gov/abs/1507.06970 (accessed on 23 August 2023).
- Gastaldi, X. Shake-Shake regularization. CoRR 2017, abs/1705.07485. Available online: http://xxx.lanl.gov/abs/1705.07485 (accessed on 23 August 2023).
- Tieleman, T.; Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. Coursera Neural Netw. Mach. Learn. 2012, 4, 26–31.
- Goyal, P.; Dollár, P.; Girshick, R.B.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; He, K. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. CoRR 2017, abs/1706.02677. Available online: http://xxx.lanl.gov/abs/1706.02677 (accessed on 7 March 2023).
- Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. CoRR 2015, abs/1502.03167. Available online: http://xxx.lanl.gov/abs/1502.03167 (accessed on 7 March 2023).
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Vapnik, V.N. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995.
Labeled Data | Unlabeled Data | Labeled Data per Class | Unlabeled Data per Class |
---|---|---|---|
1000 | 49,000 | 100 | 4900 |
4000 | 46,000 | 400 | 4600 |
50,000 | 0 | 5000 | 0 |
Framework | Model | Experiment and Architecture | Summarized Purpose
---|---|---|---
CF-SSCP | Temporal Ensembling (TE) | Exp 1: TE and Shake-Shake26 | Ensemble predictions over time for stability and consistency
 | | Exp 2: TE and DenseNet-121 | Utilizing ensemble learning with a focus on network depth
 | | Exp 3: TE and WRN-40-2 | Applying ensembles to widen and deepen network architectures
 | | Exp 4: TE and WRN-28-10 | Ensemble methods combined with a wider network model
 | Π Model | Exp 5: Π model and Shake-Shake26 | Ensuring consistent network predictions without skip connections
 | | Exp 6: Π model and DenseNet-121 | Consistency of prediction with depth-oriented network architectures
 | | Exp 7: Π model and WRN-40-2 | Deeper networks under consistency constraints
 | | Exp 8: Π model and WRN-28-10 | Wider networks maintaining prediction consistency
 | Mean Teacher (MT) | Exp 9: MT and Shake-Shake26 | Teacher–student model consistency without skip connections
 | | Exp 10: MT and DenseNet-121 | Depth and skip connections in a teacher–student setup
 | | Exp 11: MT and WRN-40-2 | Deeper architecture in a mean teacher framework
 | | Exp 12: MT and WRN-28-10 | Enhanced width in the teacher–student model's architecture
PF-SSCP | MixMatch (MM) | Exp 13: MM and Shake-Shake26 | Augmentation and mixing strategies for semi-supervised learning
 | | Exp 14: MM and DenseNet-121 | Deep architecture applied to advanced mix-and-match techniques
 | | Exp 15: MM and WRN-40-2 | Widening and deepening networks with semi-supervised mix–matching
 | | Exp 16: MM and WRN-28-10 | Wide network structures in advanced mix–match learning scenarios
 | ReMixMatch (RM) | Exp 17: RM and Shake-Shake26 | Refinement of mix–match techniques with a preprocessing focus
 | | Exp 18: RM and DenseNet-121 | Application of preprocessing strategies in deep learning models
 | | Exp 19: RM and WRN-40-2 | Preprocessing alignment in wider and deeper network structures
 | | Exp 20: RM and WRN-28-10 | Extensive preprocessing in a widened network scenario
Network Classifier | SSML Algorithm | Unlabeled/Labeled Data | Training Time (min) | PCC | Average PCC
---|---|---|---|---|---
Shake-Shake26 | Temporal E. | 49,000/1000 | 1020 | p-value | p-value
 | | 46,000/4000 | 824 | |
 | | 0/50,000 | 120 | |
 | Mean Teacher | 49,000/1000 | 1161 | |
 | | 46,000/4000 | 1094 | |
 | | 0/50,000 | 158 | |
 | Π model | 49,000/1000 | 1351 | |
 | | 46,000/4000 | 1272 | |
 | | 0/50,000 | 168 | |
 | MixMatch | 49,000/1000 | 1211 | |
 | | 46,000/4000 | 1103 | |
 | | 0/50,000 | 163 | |
 | ReMixMatch | 49,000/1000 | 1239 | |
 | | 46,000/4000 | 1142 | |
 | | 0/50,000 | 149 | |
DenseNet-121 | Temporal E. | 49,000/1000 | 323 | p-value |
 | | 46,000/4000 | 276 | |
 | | 0/50,000 | 110 | |
 | Mean Teacher | 49,000/1000 | 392 | |
 | | 46,000/4000 | 337 | |
 | | 0/50,000 | 124 | |
 | Π model | 49,000/1000 | 443 | |
 | | 46,000/4000 | 407 | |
 | | 0/50,000 | 168 | |
 | MixMatch | 49,000/1000 | 404 | |
 | | 46,000/4000 | 353 | |
 | | 0/50,000 | 131 | |
 | ReMixMatch | 49,000/1000 | 411 | |
 | | 46,000/4000 | 367 | |
 | | 0/50,000 | 139 | |
WideResNet | Temporal E. | 49,000/1000 | 387 | p-value |
 | | 46,000/4000 | 324 | |
 | | 0/50,000 | 132 | |
 | Mean Teacher | 49,000/1000 | 537 | |
 | | 46,000/4000 | 485 | |
 | | 0/50,000 | 158 | |
 | Π model | 49,000/1000 | 553 | |
 | | 46,000/4000 | 512 | |
 | | 0/50,000 | 179 | |
 | MixMatch | 49,000/1000 | 551 | |
 | | 46,000/4000 | 512 | |
 | | 0/50,000 | 149 | |
 | ReMixMatch | 49,000/1000 | 567 | |
 | | 46,000/4000 | 517 | |
 | | 0/50,000 | 151 | |