Research article · DOI: 10.1145/3609395.3610597

LiveAE: Attention-based and Edge-assisted Viewport Prediction for Live 360° Video Streaming

Published: 26 September 2023

Abstract

Viewport prediction plays a crucial role in live 360° video streaming as it determines which tiles should be prefetched in high quality, thereby significantly impacting the user experience. However, the current approach to viewport prediction, which integrates content-level visual features with the viewer's head movement trajectory, faces the challenge of striking a balance between prediction accuracy and computational complexity. In this paper, we propose LiveAE, a novel attention-based and edge-assisted viewport prediction framework for live 360° video streaming. Specifically, we employ a pre-trained video encoder called Vision Transformer (ViT) for general visual feature extraction and a cross-attention mechanism for user-specific interest tracking. To address the computational complexity issue, we offload the aforementioned content-level operations to an edge server while retaining trajectory-related functions on the client side. Extensive experiments show that our proposed method not only outperforms state-of-the-art algorithms but also ensures the real-time requirements of live 360° video streaming.
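The fusion step described above — a cross-attention mechanism in which the viewer's head-movement trajectory attends to visual features extracted by a pre-trained ViT — can be sketched as a single scaled dot-product cross-attention layer. This is an illustrative numpy sketch, not the paper's implementation; the dimensions (8 trajectory steps, 196 ViT patch tokens, 64-d embeddings) and the function names are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    Here the trajectory embeddings act as queries and the ViT patch
    features act as keys/values, so each head-pose step selects the
    image regions most relevant to the viewer's interest.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (T_q, N_patches)
    weights = softmax(scores, axis=-1)       # attention over patches
    return weights @ values                  # (T_q, d) fused features

# Hypothetical shapes: 8 past head-pose steps, 14x14 = 196 ViT patch tokens.
rng = np.random.default_rng(0)
traj_emb = rng.normal(size=(8, 64))
patch_feats = rng.normal(size=(196, 64))
fused = cross_attention(traj_emb, patch_feats, patch_feats)
print(fused.shape)  # (8, 64)
```

In the edge-assisted split the paper describes, the ViT encoding (producing `patch_feats`) would run on the edge server, while the lightweight trajectory embedding and final prediction head stay on the client.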


    Published In

    EMS '23: Proceedings of the 2023 Workshop on Emerging Multimedia Systems
    September 2023
    65 pages
    ISBN: 9798400703034
    DOI: 10.1145/3609395


    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. viewport prediction
    2. 360° videos
    3. live video streaming




    Acceptance Rates

    Overall acceptance rate: 9 of 15 submissions, 60%
