We present a new behavioral evaluation of the ChatGPT-3.5 and ChatGPT-4 large language models. Groups of three ChatGPT agents were asked to play a collaborative number-guessing game in which each agent submitted a number, and the team succeeded if the sum of its members' numbers matched a target number. ChatGPT-4 performed better than ChatGPT-3.5, but both models fell short of previously reported results from human participants in this game. A deeper analysis of model errors shows that the two models failed for different reasons, and that neither adopted human-like strategies of social coordination.
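The game's success condition can be sketched in a few lines. This is an illustrative simplification, not the paper's actual experimental harness: the function names (`play_round`, `equal_split`) and the equal-split baseline policy are assumptions introduced here, not details from the study.

```python
def equal_split(target, n_agents=3):
    # Hypothetical baseline policy: each agent assumes an even
    # division of the target among the three teammates.
    return target // n_agents

def play_round(policies, target):
    # Each agent independently submits a number; the team succeeds
    # iff the submissions sum exactly to the target.
    guesses = [policy(target) for policy in policies]
    return guesses, sum(guesses) == target

# Three agents all using the equal-split convention on target 9.
guesses, success = play_round([equal_split] * 3, 9)
# guesses == [3, 3, 3]; 3 + 3 + 3 == 9, so the round succeeds.
```

Under this framing, the coordination challenge is that agents cannot communicate their intended numbers, so success depends on converging on a shared convention such as the equal split above.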