Research article
DOI: 10.1145/3581783.3611969

AesCLIP: Multi-Attribute Contrastive Learning for Image Aesthetics Assessment

Published: 27 October 2023

Abstract

Image aesthetics assessment (IAA) aims to predict the aesthetic quality of images. Recently, large pre-trained vision-language models such as CLIP have shown impressive performance on various visual tasks. For IAA, a straightforward approach is to fine-tune the CLIP image encoder on aesthetic images. However, this achieves only limited success because it ignores the uniqueness of multimodal data in the aesthetics domain. People usually assess image aesthetics according to fine-grained visual attributes, e.g., color, light, and composition, yet how to learn aesthetics-aware attributes from the CLIP-based semantic space has not been addressed before. Motivated by this, this paper presents a CLIP-based multi-attribute contrastive learning framework for IAA, dubbed AesCLIP. Specifically, AesCLIP consists of two major components: aesthetic attribute-based comment classification and attribute-aware learning. The former classifies aesthetic comments into different attribute categories. The latter then learns an aesthetic attribute-aware representation by contrastive learning, aiming to mitigate the domain shift from the general visual domain to the aesthetics domain. Extensive experiments with the pre-trained AesCLIP on four popular IAA databases demonstrate its advantage over state-of-the-art methods. The source code will be made public at https://github.com/OPPOMKLab/AesCLIP.
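
To make the two components concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say how comments are classified or which contrastive loss is used. The sketch assumes a simple keyword-prefix rule for comment classification (the `ATTRIBUTE_KEYWORDS` table is hypothetical) and the symmetric cross-entropy contrastive loss used by CLIP, with an assumed temperature of 0.07.

```python
import math
import re

# Hypothetical keyword table. The abstract names color, light and composition
# as example attributes but does not describe the classification rule.
ATTRIBUTE_KEYWORDS = {
    "color": ("color", "colour", "hue", "saturation", "tone"),
    "light": ("light", "exposure", "shadow", "bright"),
    "composition": ("composition", "framing", "crop", "balance", "symmetry"),
}


def classify_comment(comment):
    """Return the attribute whose keywords prefix the most words, or None."""
    words = re.findall(r"[a-z]+", comment.lower())
    counts = {
        attr: sum(any(w.startswith(k) for k in kws) for w in words)
        for attr, kws in ATTRIBUTE_KEYWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """CLIP-style loss: pair i (image i, comment i) is the only positive."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity of every image with every comment, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # comment-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


if __name__ == "__main__":
    print(classify_comment("Lovely colors and warm tones"))
    images = [[1.0, 0.0], [0.0, 1.0]]
    print(symmetric_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]]))
    print(symmetric_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]]))
```

In a real pipeline the embeddings would come from the CLIP image and text encoders, and the comments grouped under one attribute would supply the text side of each pair. The toy vectors only show that matched pairs yield a much lower loss than mismatched ones.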

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. aesthetics attributes
    2. clip
    3. contrastive learning
    4. image aesthetics assessment

    Qualifiers

    • Research-article

    Conference

    MM '23
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
