Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

brian ichter, Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalashnikov, Sergey Levine, Yao Lu, Carolina Parada, Kanishka Rao, Pierre Sermanet, Alexander T Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Mengyuan Yan, Noah Brown, Michael Ahn, Omar Cortes, Nicolas Sievers, Clayton Tan, Sichun Xu, Diego Reyes, Jarek Rettinghouse, Jornell Quiambao, Peter Pastor, Linda Luu, Kuang-Huei Lee, Yuheng Kuang, Sally Jesmonth, Nikhil J. Joshi, Kyle Jeffrey, Rosario Jauregui Ruano, Jasmine Hsu, Keerthana Gopalakrishnan, Byron David, Andy Zeng, Chuyuan Kelly Fu
Proceedings of The 6th Conference on Robot Learning, PMLR 205:287-318, 2023.

Abstract

Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project’s website, video, and open source can be found at say-can.github.io.
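The abstract describes a scoring scheme in which the language model rates how useful each pretrained skill is for the instruction ("say"), while the skill's value function rates how feasible it is in the current state ("can"), and the robot executes the skill maximizing the combined score. A minimal sketch of that idea, with entirely made-up placeholder scores and a toy value function (none of the names or numbers come from the paper):

```python
def saycan_select(skills, lm_score, value_fn, state):
    """Pick the skill maximizing p_LM(skill | instruction) * value(skill, state).

    lm_score: language-model usefulness score per skill ("say").
    value_fn: affordance/value function estimating feasibility ("can").
    """
    best_skill, best_score = None, float("-inf")
    for skill in skills:
        combined = lm_score[skill] * value_fn(skill, state)
        if combined > best_score:
            best_skill, best_score = skill, combined
    return best_skill


# Hypothetical example: the instruction is "clean up the spill".
skills = ["find sponge", "pick up apple", "go to table"]
# Placeholder LM scores: the LM thinks a sponge is most relevant to spills.
lm_score = {"find sponge": 0.8, "pick up apple": 0.05, "go to table": 0.15}

# Toy value function: skills listed as feasible in the current state score high.
def value_fn(skill, state):
    return 0.9 if skill in state["feasible"] else 0.1

state = {"feasible": {"find sponge", "go to table"}}
print(saycan_select(skills, lm_score, value_fn, state))  # find sponge
```

The grounding step is the multiplication: a skill the LM loves but the robot cannot currently perform (low value) is suppressed, which is how the value functions "connect this knowledge to a particular physical environment."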

Cite this Paper


BibTeX
@InProceedings{pmlr-v205-ichter23a,
  title     = {Do As I Can, Not As I Say: Grounding Language in Robotic Affordances},
  author    = {ichter, brian and Brohan, Anthony and Chebotar, Yevgen and Finn, Chelsea and Hausman, Karol and Herzog, Alexander and Ho, Daniel and Ibarz, Julian and Irpan, Alex and Jang, Eric and Julian, Ryan and Kalashnikov, Dmitry and Levine, Sergey and Lu, Yao and Parada, Carolina and Rao, Kanishka and Sermanet, Pierre and Toshev, Alexander T and Vanhoucke, Vincent and Xia, Fei and Xiao, Ted and Xu, Peng and Yan, Mengyuan and Brown, Noah and Ahn, Michael and Cortes, Omar and Sievers, Nicolas and Tan, Clayton and Xu, Sichun and Reyes, Diego and Rettinghouse, Jarek and Quiambao, Jornell and Pastor, Peter and Luu, Linda and Lee, Kuang-Huei and Kuang, Yuheng and Jesmonth, Sally and Joshi, Nikhil J. and Jeffrey, Kyle and Ruano, Rosario Jauregui and Hsu, Jasmine and Gopalakrishnan, Keerthana and David, Byron and Zeng, Andy and Fu, Chuyuan Kelly},
  booktitle = {Proceedings of The 6th Conference on Robot Learning},
  pages     = {287--318},
  year      = {2023},
  editor    = {Liu, Karen and Kulic, Dana and Ichnowski, Jeff},
  volume    = {205},
  series    = {Proceedings of Machine Learning Research},
  month     = {14--18 Dec},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v205/ichter23a/ichter23a.pdf},
  url       = {https://proceedings.mlr.press/v205/ichter23a.html},
  abstract  = {Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project’s website, video, and open source can be found at say-can.github.io.}
}
Endnote
%0 Conference Paper %T Do As I Can, Not As I Say: Grounding Language in Robotic Affordances %A brian ichter %A Anthony Brohan %A Yevgen Chebotar %A Chelsea Finn %A Karol Hausman %A Alexander Herzog %A Daniel Ho %A Julian Ibarz %A Alex Irpan %A Eric Jang %A Ryan Julian %A Dmitry Kalashnikov %A Sergey Levine %A Yao Lu %A Carolina Parada %A Kanishka Rao %A Pierre Sermanet %A Alexander T Toshev %A Vincent Vanhoucke %A Fei Xia %A Ted Xiao %A Peng Xu %A Mengyuan Yan %A Noah Brown %A Michael Ahn %A Omar Cortes %A Nicolas Sievers %A Clayton Tan %A Sichun Xu %A Diego Reyes %A Jarek Rettinghouse %A Jornell Quiambao %A Peter Pastor %A Linda Luu %A Kuang-Huei Lee %A Yuheng Kuang %A Sally Jesmonth %A Nikhil J. Joshi %A Kyle Jeffrey %A Rosario Jauregui Ruano %A Jasmine Hsu %A Keerthana Gopalakrishnan %A Byron David %A Andy Zeng %A Chuyuan Kelly Fu %B Proceedings of The 6th Conference on Robot Learning %C Proceedings of Machine Learning Research %D 2023 %E Karen Liu %E Dana Kulic %E Jeff Ichnowski %F pmlr-v205-ichter23a %I PMLR %P 287--318 %U https://proceedings.mlr.press/v205/ichter23a.html %V 205 %X Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project’s website, video, and open source can be found at say-can.github.io.
APA
ichter, b., Brohan, A., Chebotar, Y., Finn, C., Hausman, K., Herzog, A., Ho, D., Ibarz, J., Irpan, A., Jang, E., Julian, R., Kalashnikov, D., Levine, S., Lu, Y., Parada, C., Rao, K., Sermanet, P., Toshev, A.T., Vanhoucke, V., Xia, F., Xiao, T., Xu, P., Yan, M., Brown, N., Ahn, M., Cortes, O., Sievers, N., Tan, C., Xu, S., Reyes, D., Rettinghouse, J., Quiambao, J., Pastor, P., Luu, L., Lee, K., Kuang, Y., Jesmonth, S., Joshi, N.J., Jeffrey, K., Ruano, R.J., Hsu, J., Gopalakrishnan, K., David, B., Zeng, A., & Fu, C.K. (2023). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. Proceedings of The 6th Conference on Robot Learning, in Proceedings of Machine Learning Research 205:287-318. Available from https://proceedings.mlr.press/v205/ichter23a.html.
