
ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing

Published: 11 May 2024

Abstract

Evaluating outputs of large language models (LLMs) is challenging, requiring making—and making sense of—many responses. Yet tools that go beyond basic prompting tend to require knowledge of programming APIs, focus on narrow domains, or are closed-source. We present ChainForge, an open-source visual toolkit for prompt engineering and on-demand hypothesis testing of text generation LLMs. ChainForge provides a graphical interface for comparison of responses across models and prompt variations. Our system was designed to support three tasks: model selection, prompt template design, and hypothesis testing (e.g., auditing). We released ChainForge early in its development and iterated on its design with academics and online users. Through in-lab and interview studies, we find that a range of people could use ChainForge to investigate hypotheses that matter to them, including in real-world settings. We identify three modes of prompt engineering and LLM hypothesis testing: opportunistic exploration, limited evaluation, and iterative refinement.
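To make the mechanism the abstract describes concrete, the sketch below illustrates templated prompt "fan-out": filling a prompt template with every combination of variable values and sending each filled prompt to several models, so responses can be compared side by side. This is a minimal, hypothetical Python sketch of the underlying idea, not ChainForge's actual API; the names query_model and fan_out and the model identifiers are illustrative placeholders. (ChainForge itself is open-source and, per its repository, installable via pip install chainforge.)

# Illustrative sketch only (not ChainForge's API): cross-product of
# prompt-template variables x models, the pattern behind comparing
# responses across models and prompt variations.
from itertools import product

TEMPLATE = "What is a common use of {object} in {domain}?"
VARIABLES = {
    "object": ["a transformer", "a decision tree"],
    "domain": ["medicine", "finance"],
}
MODELS = ["model-a", "model-b"]  # hypothetical model names


def query_model(model: str, prompt: str) -> str:
    """Stand-in for a real LLM call; swap in an actual API client."""
    return f"[{model}] response to: {prompt!r}"


def fan_out(template: str, variables: dict, models: list) -> list:
    """Return one (model, filled_prompt, response) row per combination."""
    names = list(variables)
    rows = []
    for values in product(*(variables[n] for n in names)):
        prompt = template.format(**dict(zip(names, values)))
        for model in models:
            rows.append((model, prompt, query_model(model, prompt)))
    return rows


for model, prompt, response in fan_out(TEMPLATE, VARIABLES, MODELS):
    print(model, "|", prompt, "->", response)

A visual toolkit like ChainForge presents each such (model, prompt, response) row graphically, which is what makes side-by-side comparison across models and prompt variations tractable without hand-written evaluation scripts.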

Supplemental Material

MP4 File: Video Presentation (with transcript)



Published In

CHI '24: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems
May 2024, 18,961 pages
ISBN: 9798400703300
DOI: 10.1145/3613904

Publisher

Association for Computing Machinery, New York, NY, United States


Badges

  • Honorable Mention

Author Tags

  1. auditing
  2. language models
  3. prompt engineering
  4. toolkits
  5. visual programming environments

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

CHI '24

Acceptance Rates

Overall acceptance rate: 6,199 of 26,314 submissions (24%)

