DOI: 10.1145/3394171.3413830

Diverter-Guider Recurrent Network for Diverse Poems Generation from Image

Published: 12 October 2020

Abstract

Poem generation from image aims to automatically generate poetic sentences that present an image's content or overtone. Previous works focus on one-to-one image-to-poem generation under the demands of poeticness and content relevance. This paper proposes the paradigm of generating multiple poems from a single image, which is closer to human poetizing but more challenging: the key problem is to guarantee the diversity of the poems while simultaneously preserving poeticness and relevance. To this end, we propose an end-to-end probabilistic Diverter-Guider Recurrent Network (DG-Net), a context-based encoder-decoder generative model with hierarchical stochastic variables. Specifically, the diverter variable represents the decoding context inferred from the input image and diversifies the poem themes, while the guider variable is introduced as an attribute decoder that restricts the word choice with supervised information. Extensive experiments with automatic evaluations and human judgments demonstrate the superior performance of DG-Net over existing poem generation methods. Qualitative studies show that our model can generate diverse poems with poeticness and relevance.
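For intuition, the architecture described in the abstract can be read as a conditional encoder-decoder with two chained stochastic latents: one sampled per poem to pick a theme (the diverter), and one conditioning every decoding step to steer word choice (the guider). The sketch below is illustrative only, not the authors' implementation: the module names, dimensions, simple Gaussian parameterizations, and the way the two latents are chained and fed to the decoder are all assumptions, and the paper's attribute supervision and training objective are omitted.

```python
# Minimal sketch of a hierarchical latent-variable encoder-decoder in the
# spirit of DG-Net. All names and dimensions are hypothetical; the paper's
# attribute supervision and KL objectives are omitted for brevity.
import torch
import torch.nn as nn

class DGNetSketch(nn.Module):
    def __init__(self, vocab_size, img_dim=2048, hid=512, z_dim=128):
        super().__init__()
        # Diverter: infers a stochastic decoding context z_d from the image
        # feature; different samples of z_d correspond to different themes.
        self.diverter = nn.Linear(img_dim, 2 * z_dim)  # -> (mu, logvar)
        # Guider: a second latent z_g conditioned on z_d that constrains
        # per-step word choice (attribute-supervised in the paper).
        self.guider = nn.Linear(z_dim, 2 * z_dim)
        self.embed = nn.Embedding(vocab_size, hid)
        self.decoder = nn.GRU(hid + 2 * z_dim, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    @staticmethod
    def sample(stats):
        # Reparameterization trick: z = mu + sigma * eps.
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, img_feat, tokens):
        z_d = self.sample(self.diverter(img_feat))  # theme-level latent
        z_g = self.sample(self.guider(z_d))         # word-level latent
        # Condition every decoding step on both latents.
        z = torch.cat([z_d, z_g], -1).unsqueeze(1).expand(-1, tokens.size(1), -1)
        h, _ = self.decoder(torch.cat([self.embed(tokens), z], -1))
        return self.out(h)  # per-step vocabulary logits

model = DGNetSketch(vocab_size=10000)
img_feat = torch.randn(1, 2048)            # e.g. a CNN image feature
tokens = torch.randint(0, 10000, (1, 12))  # teacher-forced poem tokens
logits = model(img_feat, tokens)           # shape (1, 12, 10000)
```

Under this reading, sampling several diverter variables z_d for the same image feature and decoding from each is what yields multiple differently themed poems for one image.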

Supplementary Material

MP4 File (3394171.3413830.mp4)



Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN: 9781450379885
DOI: 10.1145/3394171

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. poem generation
  2. recurrent network
  3. stochastic latent variables

Qualifiers

  • Research-article

Funding Sources

  • National Key R&D Program of China
  • National Natural Science Foundation of China

Conference

MM '20

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

