OpenAI recently posted on Twitter about teaching computer agents ‘hide and seek’. The agents learned at least six different strategies for playing the game and eventually learned a few cheats:
We’ve observed AIs discovering complex tool use while competing in a simple game of hide-and-seek. They develop a series of six distinct strategies and counter strategies, ultimately using tools in the environment to break our simulated physics.
In the simulations, seekers are incentivized to maintain line of sight to hiders, and hiders are incentivized to stay out of the seekers’ line of sight. The agents’ environments contain various shelters and objects, including cubicles, movable partitions, blocks and ramps. That said, there is no built-in incentive for agents to interact with the objects around them. In the first stages of training, hiders and seekers simply move at random. After many rounds, six different strategies and counter-strategies emerged.
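As a rough sketch of that incentive, the reward can be written as a function of line of sight alone. The team-based, zero-sum ±1 scoring and the names `team_rewards` and `visible` are assumptions made for illustration, not OpenAI's code. Note that nothing in it mentions ramps, boxes or walls.

```python
def team_rewards(visible):
    """Return (hider_reward, seeker_reward) for one timestep.

    visible[i][j] is True when seeker i has line of sight to hider j.
    Assumed scoring: hiders get +1 if no hider is seen, -1 otherwise;
    seekers get the opposite, so the game is zero-sum.
    """
    any_hider_seen = any(any(row) for row in visible)
    hider_reward = -1.0 if any_hider_seen else 1.0
    seeker_reward = -hider_reward
    return hider_reward, seeker_reward


# Two seekers, two hiders: the second seeker can see the first hider.
print(team_rewards([[False, False], [True, False]]))
```

Because the reward depends only on visibility, any tool use has to be discovered as a means to that end.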
As agents train against each other in hide-and-seek, as many as six distinct strategies emerge. Each new strategy creates a previously nonexistent pressure for agents to progress to the next stage. Note that there are no direct incentives for agents to interact with objects or to explore; rather, the emergent strategies shown below are a result of the autocurriculum induced by multi-agent competition and the simple dynamics of hide-and-seek.
- Chasing – The seekers learned to increase line of sight by chasing after the hiders.
- Finding or Building Shelter – The hiders learned to block doors or build shelters to prevent line of sight from seekers. This is akin to locking yourself in a closet when playing in real life: smart, but not necessarily in the spirit of the game.
- Ramp Usage – To level the playing field for the seekers, there are also ramps in the simulation. These ramps allow agents to get over cubicle or partition walls by moving a ramp up to the shelter and climbing over.
- Ramp Defense – Hiders learned to bring the ramps into their hidey-hole with them to prevent seekers from using them. In some cases they would lock the ramps in place instead, which brings us to #5…where AI breaks physics and learns to cheat.
- Cube Surfing – If a ramp is locked in place, the seekers can’t move it to a partition wall to climb over. The solution? Cheat…hard, or SURF A CUBE. The seekers moved an unlocked cube over to the locked ramp, then used the ramp to climb on top of the cube. Then, the agents would ‘surf’ the cube over to the partition wall and climb over. The simulation physics somewhat accidentally allowed agents to create magic carpet cubes…
- Cube Surfing Defense – Hiders countered the cube surfing offensive by locking all boxes, walls and ramps in place so that they could not be used to climb over the partition walls.
OpenAI used the same algorithms and training infrastructure as Dactyl and OpenAI Five. The agent policies were trained with self-play and Proximal Policy Optimization, and each agent acts independently in the simulation environment, with its own observations and hidden memory state.
We use the same training infrastructure and algorithms used to train OpenAI Five and Dactyl. However, in our environment each agent acts independently, using its own observations and hidden memory state. Agents use an entity-centric state-based representation of the world, which is permutation invariant with respect to objects and other agents.
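To make "permutation invariant" concrete, here is a minimal sketch of one common way to get that property: embed every entity with the same weights, then average. The mean pooling, the tiny weight matrix and the function names are assumptions for illustration and are not taken from OpenAI's architecture.

```python
import math


def embed(entity, weights):
    """Apply the same linear map plus tanh to one entity's feature vector."""
    return [math.tanh(sum(w * x for w, x in zip(row, entity))) for row in weights]


def pool_entities(entities, weights):
    """Embed each entity with shared weights, then average over entities.

    Averaging ignores the order of the entities, so listing the boxes,
    ramps and other agents in a different order gives the same summary.
    """
    embedded = [embed(e, weights) for e in entities]
    return [sum(column) / len(embedded) for column in zip(*embedded)]


# Hypothetical 2-feature entities (for example, x and y position).
entities = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
weights = [[0.5, -0.2], [0.1, 0.3]]

original = pool_entities(entities, weights)
reordered = pool_entities(list(reversed(entities)), weights)
print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(original, reordered)))
```

The pooled vector also has a fixed length no matter how many objects are in the scene.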
…Agent policies are trained with self-play and Proximal Policy Optimization. During optimization, agents can use privileged information about obscured objects and other agents in their value function.
We’ve provided further evidence that human-relevant strategies and skills, far more complex than the seed game dynamics and environment, can emerge from multi-agent competition and standard reinforcement learning algorithms at scale.