Pleasing Everyone: Artificial Teammates, Experiment 1
WE ARE TESTING AN IDEA: that teams perform better than individuals, and that large language models (LLMs) are capable enough to act more like teammates than tools. The future of work looks like hybrid human-AI teams working together. We aren't afraid to ask the hard questions about whether these systems are safe or to what extent they will replace human labor. But answering those questions requires validating these assumptions in the real world.
To test these ideas in the wild, we've designed a series of experiments that escalate in difficulty and complexity. Our first experiment centers on a seemingly simple task – choosing a restaurant for lunch – that hides some tricky coordination problems around time constraints, dietary preferences, and difficult personalities.
THE SETUP: For Experiment 1, we're starting in a "one-player" environment: ChatGPT's native web interface. The prompt asks ChatGPT to behave as though it is participating in a conversation with multiple people, all of whom are written by Emily.
THE PROMPT: "I have a group of five people. We are trying to decide which restaurant to choose for lunch. Please participate in the conversation using short, natural responses. The people in the group are Adam, Beth, Caleb, Danielle, and Ethan.
Ethan: Where do you guys want to eat?
Beth: I don't know; I'm not that hungry. What do you guys think?
ChatGPT:"
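For readers who want to poke at the setup programmatically, here is a minimal sketch of the same prompt sent through the OpenAI Python client instead of the web interface. The client calls, model name, and message structure here are our assumptions for illustration; Experiment 1 itself was run entirely in ChatGPT's web UI.

```python
# Minimal sketch: reproducing the Experiment 1 setup via the OpenAI API.
# Assumptions for illustration only: the "gpt-4" model name, the openai>=1.0
# client interface, and packing the whole dialogue into a single user message.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCENARIO = (
    "I have a group of five people. We are trying to decide which restaurant "
    "to choose for lunch. Please participate in the conversation using short, "
    "natural responses. The people in the group are Adam, Beth, Caleb, "
    "Danielle, and Ethan."
)

# Emily plays every human character; each turn is prefixed with the speaker's
# name so the model can track who is talking.
transcript = [
    "Ethan: Where do you guys want to eat?",
    "Beth: I don't know; I'm not that hungry. What do you guys think?",
]

response = client.chat.completions.create(
    model="gpt-4",  # assumed; the experiment used whichever model backed ChatGPT
    messages=[
        {"role": "system", "content": SCENARIO},
        {"role": "user", "content": "\n".join(transcript) + "\nChatGPT:"},
    ],
)

print(response.choices[0].message.content)
```

Sending the whole dialogue as one user message, with a trailing "ChatGPT:" cue, mirrors how the conversation appears in a single chat box in the web interface.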
THE CHARACTERS: Adam is a vegetarian. He is short-tempered and brusque. Beth is a team player but has limited time to eat. Caleb, Ethan, and Danielle are coworkers with varying opinions – we'll learn more about them in future experiments. Note that all of these personas are played by Emily, who embodies their characteristics in her replies.
THE RESULTS: ChatGPT (underpinned by GPT-4) performed admirably as a discussion moderator, decisively leading a group that frequently disagreed to a mutually agreeable decision.
A few things that went particularly well:
The instruction to "use short, natural responses" worked well and elicited polite, human-sounding replies.
Once ChatGPT identified the group's criteria, it steered the discussion toward a restaurant category and then a specific restaurant.
ChatGPT acknowledged Beth's time constraint and factored that into its recommendations.
When the group was uncertain, ChatGPT solicited opinions from its members in a very natural way; this happened several times.
ChatGPT successfully negotiated a compromise between stubborn members.
In the smoothest and most impressive move, ChatGPT pushed back against one character's suggestion of Arby's to make sure Adam, the group's vegetarian, wasn't left out.
Not everything went smoothly. At first, ChatGPT responded as Adam rather than as the "ChatGPT" character. That was understandable, since the prompt never specified that ChatGPT should respond only as itself, and we've factored this into future prompt designs.
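To illustrate the kind of fix we have in mind (the exact wording here is ours for illustration, not necessarily what future experiments will use), the prompt could add a line such as: "Respond only as yourself, ChatGPT; do not write lines for Adam, Beth, Caleb, Danielle, or Ethan."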
We've also observed (as you can see in the Arby's section) that ChatGPT can be overly agreeable, including apologizing for things that aren't its fault when other group members disagree. This is a common failure mode of agreeable human teammates too, and not a dealbreaker – but eliciting the right conversational tone will be a crucial part of our future work.
THE OUTCOME: We were impressed with ChatGPT's social dexterity and its performance in the mediator/decision-support role. This role carries a low risk of harm from hallucination, since human teammates provide the decision constraints and evaluate the outcome. Setting aside grave security concerns, current models appear capable enough for teams to integrate ChatGPT as a mediator and benefit from doing so.
What does it mean for a chatbot to perform at a near-human level as a moderator? Our ability to connect with others (at least in written conversation) may no longer make us unique as humans. That's heavy, and it demands a great deal of consideration from the engineers designing our models, the leaders who deploy them, and the artists who dream up visions of humanity's future.