Attention-Based Bi-Prediction Network for Versatile Video Coding (VVC) over 5G Network
Abstract
1. Introduction
- We propose an attention-based bi-prediction network (ABPN). Unlike the existing bi-prediction methods in VVC, the proposed method enhances the quality of the bi-prediction block using a CNN. Because the proposed ABPN reduces the bit rate while providing higher visual quality than the original VVC, it can increase the efficiency of transmission over 5G networks.
- The proposed network is deeper than the networks in existing deep learning-based bi-prediction studies. Because VVC has more bi-prediction modes than HEVC, a deeper network is needed to replace all of these modes. To limit the resulting complexity, we use knowledge distillation (KD), a learning technique that transfers the knowledge of a larger deep neural network into a smaller one. This reduces the number of parameters while keeping the quality of the result similar.
- The proposed ABPN is integrated into the VTM-11.0 NNVC-1.0 anchor used for JVET neural network-based video coding (NNVC). The experimental results demonstrate that the proposed method achieves superior coding performance compared with the VTM anchor.
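The KD idea in the second contribution can be illustrated with a short sketch. The exact loss used by the authors is not given in this section, so the following assumes a common output-level formulation for regression networks: the student is penalized both for deviating from the original block and for deviating from the teacher's output. The names `mse`, `kd_loss`, and the mixing weight `alpha` are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally sized sample lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kd_loss(student_out, teacher_out, target, alpha=0.5):
    """Assumed distillation loss: blend of ground-truth and teacher-imitation terms.

    alpha weights the loss against the original block; (1 - alpha) weights
    the loss against the output of the larger teacher network.
    """
    return alpha * mse(student_out, target) + (1.0 - alpha) * mse(student_out, teacher_out)


# Toy example with two samples per block (illustrative values only).
loss = kd_loss(student_out=[0.0, 0.0], teacher_out=[1.0, 1.0], target=[2.0, 2.0])
```

With `alpha = 1.0` the expression reduces to ordinary supervised training, which is one way to compare a student trained with and without KD under otherwise identical settings.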
2. Related Works
2.1. Traditional Bi-Prediction Method in Video Coding
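As background, conventional bi-prediction combines two motion-compensated prediction blocks with fixed weights. The following is a minimal sample-level sketch, assuming integer blending in units of one eighth as in VVC's bi-prediction with CU-level weights (BCW). The function name `bi_predict`, the flat-list block representation, and the 10-bit default are illustrative assumptions, and real codecs apply the blend to higher-precision intermediate samples.

```python
def bi_predict(p0, p1, w=4, bit_depth=10):
    """Blend two prediction blocks sample by sample.

    p0, p1: flat lists of integer samples predicted from reference lists L0 and L1.
    w: weight of p1 in eighths (w=4 is plain averaging).
    """
    max_val = (1 << bit_depth) - 1
    out = []
    for a, b in zip(p0, p1):
        # Weighted sum with a rounding offset, then division by 8 via shift.
        s = ((8 - w) * a + w * b + 4) >> 3
        # Clip to the valid sample range (needed for weights outside 0..8).
        out.append(min(max(s, 0), max_val))
    return out


averaged = bi_predict([100, 200], [110, 180])        # equal weights
weighted = bi_predict([100, 200], [110, 180], w=5)   # p1 weighted slightly more
```

A learned bi-prediction network replaces this fixed linear blend with a nonlinear mapping from the two prediction blocks to the final block.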
2.2. Deep Learning-Based Inter Prediction in Video Coding
3. Methodology
3.1. Architecture of the Proposed ABPN
3.2. KD-Based Lightweighted Network Design
3.3. Integration into VVC Reference Software
3.3.1. The Scope of the Proposed ABPN in VVC
3.3.2. The Strategy of the Proposed ABPN in VVC
4. Experimental Results and Discussion
4.1. Generation of Training Dataset
4.2. Training Settings
4.3. KD-Based Training
4.4. Encoding Configurations
4.5. Comparisons with VVC Standard
4.6. Analysis on Performance of KD-Based Strategy and Attention Mechanism
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Ding, A.Y.; Janssen, M. Opportunities for applications using 5G networks: Requirements, challenges, and outlook. In Proceedings of the Seventh International Conference on Telecommunications and Remote Sensing, New York, NY, USA, 8–9 October 2018; pp. 27–34.
2. Tang, Y.; Dananjayan, S.; Hou, C.; Guo, Q.; Luo, S.; He, Y. A survey on the 5G network and its impact on agriculture: Challenges and opportunities. Comput. Electron. Agric. 2021, 180, 105895.
3. Saha, A.; Lee, Y.W.; Hwang, Y.S.; Psannis, K.E.; Kim, B.G. Context-aware block-based motion estimation algorithm for multimedia internet of things (IoT) platform. Pers. Ubiquitous Comput. 2018, 22, 163–172.
4. Kim, B.G. Fast coding unit (CU) determination algorithm for high-efficiency video coding (HEVC) in smart surveillance application. J. Supercomput. 2017, 73, 1063–1084.
5. Choi, Y.J.; Lee, Y.W.; Kim, B.G. Residual-based graph convolutional network for emotion recognition in conversation for smart Internet of Things. Big Data 2021, 9, 279–288.
6. Sjöberg, R.; Ström, J.; Litwic, Ł.; Andersson, K. Versatile Video Coding explained—The Future of Video in a 5G World. Ericsson Technol. Rev. 2020, 2020, 2–12.
7. Tang, G.; Hu, Y.; Xiao, H.; Zheng, L.; She, X.; Qin, N. Design of Real-time video transmission system based on 5G network. In Proceedings of the IEEE Conference on Industrial Electronics and Applications (ICIEA), Kristiansand, Norway, 9–13 November 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 522–526.
8. Wiegand, T.; Sullivan, G.J.; Bjontegaard, G.; Luthra, A. Overview of the H.264/AVC video coding standard. IEEE Trans. Circuits Syst. Video Technol. 2003, 13, 560–576.
9. Sullivan, G.J.; Ohm, J.R.; Han, W.J.; Wiegand, T. Overview of the high efficiency video coding (HEVC) standard. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 1649–1668.
10. Bross, B.; Wang, Y.K.; Ye, Y.; Liu, S.; Chen, J.; Sullivan, G.J.; Ohm, J.R. Overview of the versatile video coding (VVC) standard and its applications. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3736–3764.
11. Chien, W.J.; Zhang, L.; Winken, M.; Li, X.; Liao, R.L.; Gao, H.; Hsu, C.W.; Liu, H.; Chen, C.C. Motion Vector Coding and Block Merging in the Versatile Video Coding Standard. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3848–3861.
12. Yang, H.; Chen, H.; Chen, J.; Esenlik, S.; Sethuraman, S.; Xiu, X.; Alshina, E.; Luo, J. Subblock-Based Motion Derivation and Inter Prediction Refinement in the Versatile Video Coding Standard. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 3862–3877.
13. Liu, S.; Segall, A.; Alshina, E.; Liao, R.L. JVET common test conditions and evaluation procedures for neural network-based video coding technology. In Proceedings of the Document JVET-X2016, 24th JVET Meeting, Virtual, 6–15 October 2021; pp. 1–10.
14. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 184–199.
15. Chen, Y.S.; Wang, Y.C.; Kao, M.H.; Chuang, Y.Y. Deep photo enhancer: Unpaired learning for image enhancement from photographs with gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6306–6314.
16. Li, F.; Bai, H.; Zhao, Y. Learning a deep dual attention network for video super-resolution. IEEE Trans. Image Process. 2020, 29, 4474–4488.
17. Choi, Y.J.; Lee, Y.W.; Kim, B.G. Wavelet Attention Embedding Networks for Video Super-Resolution. In Proceedings of the IEEE International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 7314–7320.
18. Jia, C.; Wang, S.; Zhang, X.; Wang, S.; Liu, J.; Pu, S.; Ma, S. Content-aware convolutional neural network for in-loop filtering in high efficiency video coding. IEEE Trans. Image Process. 2019, 28, 3343–3356.
19. Ding, D.; Kong, L.; Chen, G.; Liu, Z.; Fang, Y. A switchable deep learning approach for in-loop filtering in video coding. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1871–1887.
20. Wang, M.Z.; Wan, S.; Gong, H.; Ma, M.Y. Attention-based dual-scale CNN in-loop filter for Versatile Video Coding. IEEE Access 2019, 7, 145214–145226.
21. Yang, R.; Xu, M.; Liu, T.; Wang, Z.; Guan, Z. Enhancing quality for HEVC compressed videos. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 2039–2054.
22. Guan, Z.; Xing, Q.; Xu, M.; Yang, R.; Liu, T.; Wang, Z. MFQE 2.0: A new approach for multi-frame quality enhancement on compressed video. arXiv 2019, arXiv:1902.09707.
23. Jin, Z.; An, P.; Yang, C.; Shen, L. Post-processing for intra coding through perceptual adversarial learning and progressive refinement. Neurocomputing 2020, 394, 158–167.
24. Schiopu, I.; Huang, H.; Munteanu, A. CNN-based intra-prediction for lossless HEVC. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1816–1828.
25. Zhu, L.; Kwong, S.; Zhang, Y.; Wang, S.; Wang, X. Generative adversarial network-based intra prediction for video coding. IEEE Trans. Multimed. 2019, 22, 45–58.
26. Yan, N.; Liu, D.; Li, H.; Li, B.; Li, L.; Wu, F. Convolutional neural network-based fractional-pixel motion compensation. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 840–853.
27. Jin, D.; Lei, J.; Peng, B.; Li, W.; Ling, N.; Huang, Q. Deep Affine Motion Compensation Network for Inter Prediction in VVC. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 3923–3933.
28. Zhao, Z.; Wang, S.; Wang, S.; Zhang, X.; Ma, S.; Yang, J. Enhanced bi-prediction with convolutional neural network for high-efficiency video coding. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 3291–3301.
29. Mao, J.; Yu, L. Convolutional neural network based bi-prediction utilizing spatial and temporal information in video coding. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1856–1870.
30. Lim, B.; Son, S.H.; Kim, H.W.; Nah, S.J.; Lee, K.M. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 136–144.
31. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
32. Huo, S.; Liu, D.; Wu, F.; Li, H. Convolutional neural network-based motion compensation refinement for video coding. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–4.
33. Wang, Y.; Fan, X.; Jia, C.; Zhao, D.; Gao, W. Neural network based inter prediction for HEVC. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–6.
34. Zhao, L.; Wang, S.; Zhang, X.; Wang, S.; Ma, S.; Gao, W. Enhanced motion-compensated video coding with deep virtual reference frame generation. IEEE Trans. Image Process. 2019, 28, 4832–4844.
35. Liu, J.; Xia, S.; Yang, W. Deep reference generation with multi-domain hierarchical constraints for inter prediction. IEEE Trans. Multimed. 2019, 22, 2497–2510.
36. Huo, S.; Liu, D.; Li, B.; Ma, S.; Wu, F.; Gao, W. Deep network-based frame extrapolation with reference frame alignment. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 1178–1192.
37. Choi, H.; Bajić, I.V. Affine transformation-based deep frame prediction. IEEE Trans. Image Process. 2021, 30, 3321–3334.
38. Alshin, A.; Alshina, E. Bi-directional optical flow for future video codec. In Proceedings of the Data Compression Conference (DCC), Snowbird, UT, USA, 29 March–1 April 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 83–90.
39. Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 624–632.
40. Ma, D.; Zhang, F.; Bull, D. BVI-DVC: A training database for deep video compression. arXiv 2021, arXiv:2003.13552.
41. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2015; pp. 1–15.
42. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983.
43. Bjøntegaard, G. Calculation of average PSNR differences between RD-curves. In Proceedings of the VCEG-M33, Austin, TX, USA, 2–4 April 2001; pp. 1–4.
44. Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning convolutional neural networks for resource efficient inference. arXiv 2016, arXiv:1611.06440.
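Categories | Problem Formulation | Methods | Coding Standards |
---|---|---|---|
Uni-prediction block enhancement | Refine the uni-prediction block | CNNMCR [32] | HEVC |
Uni-prediction block enhancement | Refine the uni-prediction block | NNIP [33] | HEVC |
Bi-prediction block generation | Generate the bi-prediction block | CNN-based bi-prediction [28] | HEVC |
Bi-prediction block generation | Generate the bi-prediction block | STCNN [29] | HEVC |
Frame extrapolation and interpolation | Estimate additional reference frame from existing reference frames | VECNN [34] | HEVC |
Frame extrapolation and interpolation | Estimate additional reference frame from existing reference frames | MQ-FKCNN [35] | HEVC and VVC |
Frame extrapolation and interpolation | Estimate additional reference frame from existing reference frames | Deep network-based frame extrapolation [36] | HEVC and VVC |
Frame extrapolation and interpolation | Estimate additional reference frame from existing reference frames | Affine transformation-based deep frame prediction [37] | HEVC |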
CU size | BQTerrace (1920 × 1080, Class B): Number of CUs | BQTerrace: Area ratio | BasketballDrive (1920 × 1080, Class B): Number of CUs | BasketballDrive: Area ratio |
---|---|---|---|---|
128 × 128 | 28.22% | 78.58% | 13.28% | 59.96% |
64 × 64 | 8.25% | 5.75% | 11.15% | 12.59% |
32 × 32 | 5.38% | 0.94% | 10.37% | 2.93% |
16 × 16 | 2.68% | 0.12% | 3.45% | 0.24% |
8 × 8 | 0.81% | 0.01% | 0.38% | 0.01% |
2N × N, N × 2N | 27.15% | 11.53% | 33.03% | 18.96% |
Others | 27.50% | 3.08% | 28.33% | 5.32% |

CU size | PartyScene (832 × 480, Class C): Number of CUs | PartyScene: Area ratio | BQSquare (416 × 240, Class D): Number of CUs | BQSquare: Area ratio |
---|---|---|---|---|
128 × 128 | 3.08% | 32.57% | 3.62% | 26.59% |
64 × 64 | 11.52% | 30.48% | 15.34% | 28.19% |
32 × 32 | 13.16% | 8.70% | 9.68% | 4.45% |
16 × 16 | 8.15% | 1.35% | 4.85% | 0.56% |
8 × 8 | 3.92% | 0.16% | 2.16% | 0.06% |
2N × N, N × 2N | 36.05% | 20.90% | 39.93% | 35.59% |
Others | 24.13% | 5.84% | 24.43% | 7.56% |
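
The relationship between the "Number of CUs" and "Area ratio" columns can be checked with a little arithmetic: for square CUs, the area covered is proportional to the CU count times the squared side length. The sketch below assumes this proportionality and uses the BQTerrace 128 × 128 row as a reference to predict the area ratios of the smaller square sizes; the helper name `predicted_area_share` is hypothetical.

```python
def predicted_area_share(count_share, side, ref_count_share, ref_side, ref_area_share):
    """Scale the reference CU size's area share to another square CU size."""
    return ref_area_share * (count_share * side * side) / (ref_count_share * ref_side * ref_side)


# BQTerrace rows from the table: side -> (share of CUs in %, area ratio in %).
bqterrace = {
    128: (28.22, 78.58),
    64: (8.25, 5.75),
    32: (5.38, 0.94),
    16: (2.68, 0.12),
    8: (0.81, 0.01),
}

ref_count, ref_area = bqterrace[128]
predictions = {
    side: predicted_area_share(count, side, ref_count, 128, ref_area)
    for side, (count, _) in bqterrace.items()
}
```

The predicted values agree with the tabulated area ratios to within rounding, which shows why a modest share of 128 × 128 CUs by count still covers most of the picture area.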
| CU Size | QP | Number of Pairs of Blocks | Number of Epochs (Number of Iterations = 600,000 and Batch Size = 64) |
|---|---|---|---|
| | 22 | 549,936 | 70 |
| | 27 | 641,210 | 60 |
| | 32 | 501,868 | 77 |
| | 37 | 237,373 | 162 |
| | 42 | 174,378 | 221 |
| | 22 | 1,500,840 | 26 |
| | 27 | 767,004 | 51 |
| | 32 | 424,074 | 91 |
| | 37 | 206,211 | 187 |
| | 42 | 141,052 | 273 |
| | 22 | 2,351,966 | 17 |
| | 27 | 866,121 | 45 |
| | 32 | 376,900 | 102 |
| | 37 | 159,092 | 242 |
| | 42 | 74,471 | 516 |
| Overall | | 8,972,496 | |
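
The epoch counts in the table follow from the fixed training budget stated in its header: 600,000 iterations at a batch size of 64 correspond to 38.4 million sampled block pairs, and dividing by the dataset size and rounding up gives the number of epochs. The short check below reproduces the tabulated values; the function name `epochs_for` is an illustrative choice.

```python
def epochs_for(num_pairs, iterations=600_000, batch_size=64):
    """Epochs needed to draw iterations * batch_size samples, rounded up."""
    samples_seen = iterations * batch_size
    return -(-samples_seen // num_pairs)  # integer ceiling division


# (number of block pairs, epochs reported in the table), in table order.
rows = [
    (549_936, 70), (641_210, 60), (501_868, 77), (237_373, 162), (174_378, 221),
    (1_500_840, 26), (767_004, 51), (424_074, 91), (206_211, 187), (141_052, 273),
    (2_351_966, 17), (866_121, 45), (376_900, 102), (159_092, 242), (74_471, 516),
]

computed = [epochs_for(pairs) for pairs, _ in rows]
total_pairs = sum(pairs for pairs, _ in rows)
```

Because the iteration budget is constant, models for higher QPs, which have fewer training pairs, see each pair many more times than models for lower QPs.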
Class | Sequence | RA BD-Rate | RA EncT | RA DecT | LDB BD-Rate | LDB EncT | LDB DecT |
---|---|---|---|---|---|---|---|
Class A1 (3840 × 2160) | Tango2 | −0.97% | 2494% | 108% | −0.42% | 2889% | 504% |
Class A1 (3840 × 2160) | FoodMarket4 | −0.89% | 2641% | 235% | −1.58% | 3558% | 635% |
Class A1 (3840 × 2160) | Average | −0.93% | 2568% | 172% | −1.00% | 3224% | 569% |
Class A2 (3840 × 2160) | CatRobot | −2.02% | 2523% | 160% | −1.73% | 2153% | 224% |
Class A2 (3840 × 2160) | DaylightRoad2 | −2.04% | 2627% | 198% | −1.39% | 2211% | 370% |
Class A2 (3840 × 2160) | Average | −2.03% | 2575% | 179% | −1.56% | 2182% | 297% |
Class B (1920 × 1080) | MarketPlace | −1.93% | 2035% | 147% | −1.13% | 2204% | 227% |
Class B (1920 × 1080) | RitualDance | −1.00% | 1481% | 79% | −0.54% | 1727% | 234% |
Class B (1920 × 1080) | Cactus | −1.64% | 2334% | 173% | −1.70% | 2504% | 136% |
Class B (1920 × 1080) | BasketballDrive | −1.42% | 1819% | 191% | −0.78% | 2675% | 160% |
Class B (1920 × 1080) | BQTerrace | −2.50% | 3177% | 128% | −2.14% | 4795% | 127% |
Class B (1920 × 1080) | Average | −1.70% | 2169% | 144% | −1.26% | 2781% | 177% |
Class C (832 × 480) | BasketballDrill | −1.12% | 2882% | 133% | −0.86% | 2327% | 232% |
Class C (832 × 480) | BQMall | −1.86% | 2563% | 156% | −1.19% | 2000% | 191% |
Class C (832 × 480) | PartyScene | −1.70% | 2645% | 327% | −1.38% | 2312% | 523% |
Class C (832 × 480) | RaceHorses | −0.66% | 1076% | 84% | −0.18% | 769% | 98% |
Class C (832 × 480) | Average | −1.34% | 2291% | 175% | −0.90% | 1852% | 261% |
Class D (416 × 240) | BasketballPass | −2.15% | 1414% | 111% | −0.46% | 972% | 126% |
Class D (416 × 240) | BQSquare | −8.21% | 2769% | 430% | −5.37% | 2551% | 510% |
Class D (416 × 240) | BlowingBubbles | −1.83% | 1839% | 224% | −0.77% | 1075% | 191% |
Class D (416 × 240) | RaceHorses | −1.09% | 917% | 172% | 0.17% | 557% | 188% |
Class D (416 × 240) | Average | −3.32% | 1735% | 234% | −1.61% | 1289% | 254% |
Class E (1280 × 720) | FourPeople | - | - | - | −2.63% | 2302% | 160% |
Class E (1280 × 720) | Johnny | - | - | - | −2.63% | 2246% | 97% |
Class E (1280 × 720) | KristenAndSara | - | - | - | −2.06% | 2113% | 82% |
Class E (1280 × 720) | Average | - | - | - | −2.44% | 2220% | 113% |
All Sequences | Overall | −1.94% | 2190% | 180% | −1.44% | 2197% | 251% |

Class | Sequence | RA BD-Rate | RA EncT | RA DecT | LDB BD-Rate | LDB EncT | LDB DecT |
---|---|---|---|---|---|---|---|
Class A1 (3840 × 2160) | Tango2 | −0.17% | 2455% | 147% | −0.38% | 2636% | 453% |
Class A1 (3840 × 2160) | FoodMarket4 | −0.57% | 3508% | 314% | −1.42% | 3652% | 718% |
Class A1 (3840 × 2160) | Average | −0.37% | 2981% | 230% | −0.90% | 3144% | 586% |
Class A2 (3840 × 2160) | CatRobot | −1.14% | 2803% | 174% | −1.57% | 1680% | 245% |
Class A2 (3840 × 2160) | DaylightRoad2 | −0.85% | 2426% | 218% | −1.48% | 1971% | 362% |
Class A2 (3840 × 2160) | Average | −0.99% | 2615% | 196% | −1.53% | 1826% | 304% |
Class B (1920 × 1080) | MarketPlace | −0.63% | 2418% | 151% | −1.21% | 1931% | 226% |
Class B (1920 × 1080) | RitualDance | −0.11% | 1266% | 78% | −0.63% | 1638% | 223% |
Class B (1920 × 1080) | Cactus | −0.76% | 1818% | 172% | −1.63% | 2150% | 138% |
Class B (1920 × 1080) | BasketballDrive | −0.35% | 1572% | 204% | −0.70% | 2568% | 170% |
Class B (1920 × 1080) | BQTerrace | −1.17% | 3650% | 132% | −1.33% | 4121% | 126% |
Class B (1920 × 1080) | Average | −0.60% | 2145% | 147% | −1.10% | 2482% | 177% |
Class C (832 × 480) | BasketballDrill | −0.51% | 2007% | 121% | −0.86% | 2311% | 167% |
Class C (832 × 480) | BQMall | −0.48% | 2025% | 177% | −1.22% | 2002% | 196% |
Class C (832 × 480) | PartyScene | −0.85% | 1548% | 194% | −1.39% | 2298% | 411% |
Class C (832 × 480) | RaceHorses | −0.12% | 776% | 89% | −0.02% | 832% | 99% |
Class C (832 × 480) | Average | −0.49% | 1589% | 145% | −0.87% | 1861% | 218% |
Class D (416 × 240) | BasketballPass | −0.45% | 1423% | 106% | −0.55% | 1126% | 128% |
Class D (416 × 240) | BQSquare | −5.89% | 2538% | 454% | −4.91% | 2783% | 484% |
Class D (416 × 240) | BlowingBubbles | −0.50% | 1783% | 248% | −0.78% | 1062% | 185% |
Class D (416 × 240) | RaceHorses | −0.13% | 835% | 172% | −0.09% | 483% | 175% |
Class D (416 × 240) | Average | −1.74% | 1645% | 245% | −1.58% | 1364% | 243% |
Class E (1280 × 720) | FourPeople | - | - | - | −2.43% | 2423% | 163% |
Class E (1280 × 720) | Johnny | - | - | - | −2.57% | 2753% | 100% |
Class E (1280 × 720) | KristenAndSara | - | - | - | −2.13% | 2086% | 81% |
Class E (1280 × 720) | Average | - | - | - | −2.38% | 2421% | 115% |
All Sequences | Overall | −0.86% | 2050% | 185% | −1.37% | 2125% | 243% |
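
The BD-Rate columns report the average bitrate difference at equal quality between the tested codec and the anchor, so negative values mean bitrate savings. As a rough illustration of how such a figure is obtained, the sketch below follows the classic Bjøntegaard approach: fit a third-order polynomial to log-bitrate as a function of PSNR through four rate points per codec, integrate both curves over the overlapping PSNR range, and convert the mean log difference to a percentage. It is a simplified sketch, not the evaluation tool used for the tables; the function names and the rate points are hypothetical.

```python
import math


def lagrange_eval(xs, ys, x):
    """Evaluate the polynomial through the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def integrate_cubic(xs, ys, a, b):
    """Integrate the cubic through four points over [a, b].

    Simpson's rule is exact for polynomials up to degree three.
    """
    mid = 0.5 * (a + b)
    f = lambda x: lagrange_eval(xs, ys, x)
    return (b - a) / 6.0 * (f(a) + 4.0 * f(mid) + f(b))


def bd_rate(anchor, test):
    """anchor, test: four (bitrate, psnr) points each. Returns BD-rate in percent."""
    psnr_a = [p for _, p in anchor]
    psnr_t = [p for _, p in test]
    log_a = [math.log10(r) for r, _ in anchor]
    log_t = [math.log10(r) for r, _ in test]
    lo = max(min(psnr_a), min(psnr_t))
    hi = min(max(psnr_a), max(psnr_t))
    diff = integrate_cubic(psnr_t, log_t, lo, hi) - integrate_cubic(psnr_a, log_a, lo, hi)
    return (10.0 ** (diff / (hi - lo)) - 1.0) * 100.0


# Hypothetical rate points (kbps, dB); the test codec needs 10% fewer bits everywhere.
anchor_points = [(1000.0, 32.0), (2000.0, 35.0), (4000.0, 38.0), (8000.0, 41.0)]
test_points = [(0.9 * r, p) for r, p in anchor_points]
saving = bd_rate(anchor_points, test_points)
```

In this constructed case the result is a 10% saving by design; real rate-distortion curves differ in both rate and PSNR at each QP, which is why the integration over the shared PSNR range is needed.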

Model | PSNR (dB) | Params. | GFLOPs |
---|---|---|---|
ABPN-T | 36.306 | 1,295,169 | 10.615 |
ABPN | 36.260 | 231,905 | 1.900 |
ABPN without KD | 36.227 | 231,905 | 1.900 |
ABPN without Attention | 36.221 | 139,617 | 1.150 |
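
The trade-off in this table can be made explicit with a few lines of arithmetic. The sketch below assumes that ABPN-T denotes the larger teacher network used for KD and computes the complexity reduction of ABPN relative to it, along with the PSNR differences between rows; the variable and function names are illustrative.

```python
models = {
    "ABPN-T": {"psnr": 36.306, "params": 1_295_169, "gflops": 10.615},
    "ABPN": {"psnr": 36.260, "params": 231_905, "gflops": 1.900},
    "ABPN without KD": {"psnr": 36.227, "params": 231_905, "gflops": 1.900},
    "ABPN without Attention": {"psnr": 36.221, "params": 139_617, "gflops": 1.150},
}


def reduction_pct(small, large):
    """Percentage by which `small` is lower than `large`."""
    return 100.0 * (1.0 - small / large)


teacher, student = models["ABPN-T"], models["ABPN"]

param_cut = reduction_pct(student["params"], teacher["params"])   # about 82.1%
flop_cut = reduction_pct(student["gflops"], teacher["gflops"])    # about 82.1%
psnr_gap = teacher["psnr"] - student["psnr"]                      # dB lost vs. ABPN-T
kd_gain = student["psnr"] - models["ABPN without KD"]["psnr"]     # dB gained from KD
attention_params = student["params"] - models["ABPN without Attention"]["params"]
```

Read this way, ABPN keeps roughly 18% of the parameters and operations of ABPN-T at a cost of 0.046 dB, and KD recovers 0.033 dB over the same architecture trained without it.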
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Choi, Y.-J.; Lee, Y.-W.; Kim, J.; Jeong, S.Y.; Choi, J.S.; Kim, B.-G. Attention-Based Bi-Prediction Network for Versatile Video Coding (VVC) over 5G Network. Sensors 2023, 23, 2631. https://doi.org/10.3390/s23052631