Towards a comprehensive computational model for aesthetic assessment of videos
Proceedings of the 21st ACM International Conference on Multimedia, 2013
In this paper we propose a novel aesthetic model that emphasizes psycho-visual statistics extracted at multiple levels, in contrast to earlier approaches that rely only on descriptors suited to image recognition or based on photographic principles. At the lowest level, we compute dark-channel, sharpness, and eye-sensitivity statistics over rectangular cells within a frame. At the next level, we extract SentiBank features (1,200 pre-trained visual classifiers) on a given frame; these classifiers detect concepts that evoke specific sentiments, such as "colorful clouds" and "smiling face", and their responses are collected as frame-level statistics. At the topmost level, we extract trajectories from video shots; using priors on viewer fixation, the trajectories are labeled as foreground or background/camera, and statistics are computed on each group. Additionally, spatio-temporal local binary patterns are computed to capture texture variations within a shot. Classifiers are trained independently on each feature representation. After a thorough evaluation of nine feature types, we select the best feature from each level: dark-channel, affect, and camera-motion statistics. The corresponding classifier scores are then integrated in a low-rank fusion framework to improve the final prediction scores. Our approach demonstrates strong correlation with human judgments on 1,000 broadcast-quality videos released by NHK as an aesthetic evaluation dataset.
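To make the lowest-level feature concrete, the sketch below computes dark-channel statistics over a grid of rectangular cells, as the abstract describes. This is a minimal illustration, not the authors' code: the 4x4 grid size, the per-pixel (rather than patch-based) channel minimum, and the use of the cell mean as the statistic are all assumptions.

```python
# Minimal sketch (assumptions noted above) of cell-wise dark-channel statistics.
import numpy as np

def dark_channel_cell_stats(frame, grid=(4, 4)):
    """frame: H x W x 3 RGB array in [0, 1]; returns one statistic per cell."""
    dark = frame.min(axis=2)            # dark channel: per-pixel min over RGB
    h, w = dark.shape
    gh, gw = grid
    stats = np.empty(grid)
    for i in range(gh):
        for j in range(gw):
            cell = dark[i * h // gh:(i + 1) * h // gh,
                        j * w // gw:(j + 1) * w // gw]
            stats[i, j] = cell.mean()   # assumed statistic: mean dark-channel value
    return stats.ravel()                # flatten cells into a feature vector
```

Analogous sharpness and eye-sensitivity statistics would be gathered over the same cell grid and concatenated into the frame's low-level descriptor.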
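The final fusion step combines the per-feature classifier scores. As a rough stand-in for the paper's low-rank fusion framework (whose exact formulation is not given in this abstract), the toy sketch below stacks the score vectors of the selected classifiers into a matrix and recovers a shared consensus via a rank-1 SVD approximation; treat it as an illustration of low-rank score fusion, not the authors' algorithm.

```python
# Toy rank-1 late fusion of classifier scores (a simplified stand-in,
# not the paper's low-rank fusion method).
import numpy as np

def low_rank_fuse(score_matrix):
    """score_matrix: (n_classifiers, n_videos) decision scores; returns fused scores."""
    u, s, vt = np.linalg.svd(score_matrix, full_matrices=False)
    fused = s[0] * vt[0]                          # rank-1 consensus across classifiers
    sign = 1.0 if u[:, 0].sum() >= 0 else -1.0    # resolve SVD sign ambiguity
    return sign * fused
```

Here the fused score for each video is driven by the direction along which the dark-channel, affect, and camera-motion classifiers agree, which is the intuition behind low-rank late fusion.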