Assessing GenAI Models in Retrieving Official NBA Game Data: Capabilities and Limitations

Source: Unsplash - tjdragotta

In this post we present a comparative test of several generative AI models, focusing on their ability to retrieve official sports statistics. The task involved collecting NBA game data through a single prompt, with the expectation that the models would accurately locate, extract, and structure the requested information. While the aim was straightforward, the outcomes varied considerably, reflecting differences in how each model handles data collection tasks.

Performance Comparison

Model            Grade
Claude Sonnet 4  A
Grok 4           A
Gemini 2.5 Pro   A
GPT-5            C
Mistral          D
GPT-4o           E
DeepSeek-V3      E
Qwen3-Max        E
Copilot          E

Claude Sonnet 4, Grok 4, and Gemini 2.5 Pro (grade A) completed the task and returned accurate NBA data. GPT-5 (grade C) declined to provide results, citing copyright restrictions. Mistral (grade D) did not gather the requested dataset either, but it avoided fabrication and provided a link to the relevant official site. GPT-4o (grade E) produced “illustrative” figures that turned out to be incorrect, and DeepSeek-V3, Qwen3-Max, and Copilot (all grade E) hallucinated data outright, making their outputs unreliable.

Gemini 2.5 Pro, Claude Sonnet 4, Grok 4

Gemini 2.5 Pro's performance (accessed on 10 September 2025)

GPT-5

GPT-5's performance (accessed on 10 September 2025)

GPT-4o

GPT-4o's performance (accessed on 10 September 2025)

DeepSeek-V3, Qwen3-Max, Copilot

DeepSeek-V3's performance (accessed on 10 September 2025)

Mistral

Mistral's performance (accessed on 10 September 2025)

Recommendations

The results underline how unpredictable generative AI models can be when applied to data collection tasks. Even when the instruction is simple and unambiguous, performance varies widely across models, ranging from accurate retrieval to complete fabrication. For this reason, any outputs must always be cross-checked against the official source to ensure reliability and to guard against the risk of using hallucinated data.
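
The cross-check recommended above can be partly automated once the official figures have been obtained by hand from the official site. The following is a minimal Python sketch, not the procedure used in this test: the record layout (date, home, away, home_pts, away_pts), the function names, and the sample values are all hypothetical placeholders chosen for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: one dict per game with the keys
# "date", "home", "away", "home_pts", "away_pts".
GameKey = Tuple[str, str, str]


def index_games(rows: Iterable[dict]) -> Dict[GameKey, dict]:
    """Index game records by (date, home team, away team)."""
    return {(r["date"], r["home"], r["away"]): r for r in rows}


def cross_check(
    model_rows: List[dict],
    official_rows: List[dict],
    fields: Tuple[str, ...] = ("home_pts", "away_pts"),
) -> Dict[str, list]:
    """Compare model-reported games against official reference records."""
    official = index_games(official_rows)
    model = index_games(model_rows)
    report: Dict[str, list] = {
        "matched": [],             # game found, all checked fields agree
        "mismatched": [],          # game found, at least one field differs
        "not_in_official": [],     # game reported by the model only
        "missing_from_model": [],  # official game the model left out
    }
    for key, row in model.items():
        ref = official.get(key)
        if ref is None:
            report["not_in_official"].append(key)
            continue
        # Collect (model value, official value) for every differing field.
        diffs = {
            f: (row.get(f), ref.get(f))
            for f in fields
            if row.get(f) != ref.get(f)
        }
        if diffs:
            report["mismatched"].append((key, diffs))
        else:
            report["matched"].append(key)
    report["missing_from_model"] = [k for k in official if k not in model]
    return report


# Invented placeholder records, not real NBA results.
official_rows = [
    {"date": "2025-01-01", "home": "Team A", "away": "Team B",
     "home_pts": 100, "away_pts": 95},
    {"date": "2025-01-02", "home": "Team C", "away": "Team D",
     "home_pts": 110, "away_pts": 108},
]
model_rows = [
    {"date": "2025-01-01", "home": "Team A", "away": "Team B",
     "home_pts": 100, "away_pts": 95},
    {"date": "2025-01-02", "home": "Team C", "away": "Team D",
     "home_pts": 112, "away_pts": 108},
    {"date": "2025-01-03", "home": "Team E", "away": "Team F",
     "home_pts": 99, "away_pts": 101},
]

report = cross_check(model_rows, official_rows)
for category, items in report.items():
    print(category, items)
```

Any entry under mismatched or not_in_official is a candidate hallucination and should be checked by hand. The comparison is exact, so values need to be normalized first (for example, scores parsed as integers and team names spelled consistently) before the report is meaningful.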