DOI: 10.1145/3340531.3412704

Ensembled CTR Prediction via Knowledge Distillation

Published: 19 October 2020

Abstract

Recently, deep learning-based models have been widely studied for click-through rate (CTR) prediction and have led to improved prediction accuracy in many industrial applications. However, current research focuses primarily on building complex network architectures to better capture sophisticated feature interactions and dynamic user behaviors. The increased model complexity may slow down online inference and hinder adoption in real-time applications. Instead, our work targets a new model training strategy based on knowledge distillation (KD). KD is a teacher-student learning framework that transfers knowledge learned by a teacher model to a student model. The KD strategy not only allows us to simplify the student model to a vanilla DNN, but also achieves significant accuracy improvements over the state-of-the-art teacher models. These benefits motivate us to further explore the use of a powerful ensemble of teachers for more accurate student model training. We also propose novel techniques to facilitate ensembled CTR prediction, including teacher gating and early stopping by distillation loss. We conduct comprehensive experiments against 12 existing models and across three industrial datasets. Both offline and online A/B testing results show the effectiveness of our KD-based training strategy.
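To make the training strategy concrete, below is a minimal PyTorch-style sketch of distilling a single teacher into a student CTR model. All names here are hypothetical, and the specific objective (matching the teacher's sigmoid probabilities under a fixed weight alpha) is an assumption for illustration, not necessarily the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def kd_training_loss(student, teacher, features, labels, alpha=0.5):
        """Ground-truth CTR loss plus a distillation term on the teacher's soft predictions."""
        with torch.no_grad():  # the teacher is frozen during distillation
            teacher_prob = torch.sigmoid(teacher(features))
        student_logit = student(features)
        # Standard CTR objective: binary cross-entropy against the observed click labels.
        ce_loss = F.binary_cross_entropy_with_logits(student_logit, labels)
        # Distillation objective: match the teacher's soft click probabilities.
        kd_loss = F.binary_cross_entropy(torch.sigmoid(student_logit), teacher_prob)
        return ce_loss + alpha * kd_loss

Here the student can be any lightweight model (e.g., a vanilla DNN), while the teacher is a larger pretrained CTR model; only the relative weighting of the two terms needs tuning.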

Supplementary Material

MP4 File (3340531.3412704.mp4)
This is the video for the paper "Ensembled CTR Prediction via Knowledge Distillation". In this paper, we apply KD to ensembled CTR prediction. Our KD-based training strategy enables the production use of a powerful ensemble of teacher models and yields large accuracy improvements. Surprisingly, we demonstrate that a vanilla DNN trained with KD can even surpass the ensemble of teacher models. Our key contributions include an intensive evaluation of the KD-based training strategy, teacher gating for sample-wise teacher selection, and early stopping by KD loss, which increases the utilization of validation data. Both offline and online A/B testing results demonstrate the effectiveness of our approach. We hope that our encouraging results will attract more research effort to the study of training strategies for CTR prediction.
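The two techniques named above can be sketched as follows, under one plausible reading: teacher gating is approximated as a per-sample softmax over teacher reliabilities, and early stopping by KD loss as monitoring the distillation loss on validation data rather than a label-based metric. The gating rule and all names are assumptions for illustration, not the paper's definitions.

    import torch
    import torch.nn.functional as F

    def gated_soft_targets(teachers, features, labels, eps=1e-8):
        """Blend an ensemble of teachers into per-sample soft targets for the student."""
        with torch.no_grad():
            probs = torch.stack([torch.sigmoid(t(features)) for t in teachers])  # (T, batch)
        # Gate each teacher per sample: a lower per-sample BCE against the true
        # label yields a higher gate weight (one plausible gating heuristic).
        bce = -(labels * torch.log(probs + eps) + (1 - labels) * torch.log(1 - probs + eps))
        gates = F.softmax(-bce, dim=0)
        return (gates * probs).sum(dim=0)

    def stop_by_kd_loss(val_kd_losses, patience=3):
        """Stop training once the best validation KD loss is more than `patience` epochs old."""
        best_epoch = min(range(len(val_kd_losses)), key=val_kd_losses.__getitem__)
        return len(val_kd_losses) - 1 - best_epoch >= patience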



Published In

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
October 2020
3619 pages
ISBN: 9781450368599
DOI: 10.1145/3340531

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. ctr prediction
  2. knowledge distillation
  3. model ensemble
  4. online advertising
  5. recommender systems

Qualifiers

  • Research-article

Conference

CIKM '20

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
