model comparison

Assessing GenAI Models in Retrieving Official NBA Game Data: Capabilities and Limitations

In this post, we present a comparative test of several generative AI models, focusing on their ability to retrieve official sports statistics. The task involved collecting NBA game data through a single prompt, with the expectation that the models would accurately locate, extract, and structure the requested information. While the

Can OpenAI Agent Support Academic Research? A Practical Comparison with Manus.ai and Perplexity

We tested the new OpenAI Agent to assess its usefulness in academic research tasks, comparing it directly with Manus.ai and Perplexity’s research mode. Our aim was to evaluate how effectively each tool finds relevant scholarly and policy sources, navigates restricted websites (including captchas and Cloudflare protections), and allows

Unlocking Large Language Models via API: Capabilities, Access, and Practical Considerations

Accessing large language models (LLMs) through APIs opens up research and development opportunities that go well beyond the limits of traditional chat interfaces. For researchers, API access enables language models to be built directly into bespoke workflows, analytical tools, or automated processes—supporting tasks such as large-scale text analysis, data
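
The workflow described above can be sketched in a few lines. This is a minimal, illustrative example only, assuming the OpenAI Python SDK and an API key in the environment; the model name and classification prompt are placeholders, not recommendations from the post:

```python
# Minimal sketch of driving an LLM through an API inside an analysis loop.
# Assumes the OpenAI Python SDK (`pip install openai`) and an OPENAI_API_KEY
# in the environment; the model name and prompt are illustrative.

def build_prompt(text: str) -> str:
    """Wrap a document in a fixed classification instruction."""
    return f"Reply with one word, positive or negative: {text}"

def classify_texts(texts, model="gpt-4o-mini"):
    """Label each text with one API call apiece (batch for real workloads)."""
    from openai import OpenAI  # deferred so build_prompt works without the SDK
    client = OpenAI()
    labels = []
    for text in texts:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": build_prompt(text)}],
        )
        labels.append(resp.choices[0].message.content.strip().lower())
    return labels
```

The same loop generalises to any of the tasks the excerpt mentions: swap the prompt template and you have large-scale coding, extraction, or summarisation instead of sentiment labelling.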

Benchmarking GenAI Models for Penguin Species Prediction: Grok 3, DeepSeek-V3, and Qwen2.5-Max Delivered Top Results

How well can today’s leading GenAI models classify real-world biodiversity data—without bespoke code or traditional machine learning pipelines? In this study, we benchmarked a range of large language models on the task of predicting penguin species from tabular ecological measurements, including both numerical and categorical variables. Using a

Comparing the FutureHouse Platform’s Falcon Agent and OpenAI’s o3 for Literature Search on Machine Coding for the Comparative Agendas Project

Having previously explored the FutureHouse Platform’s agents in tasks such as identifying tailor-made laws and generating a literature review on legislative backsliding, we now directly compare its Falcon agent and OpenAI’s o3. Our aim was to assess their performance on a focused literature search task: compiling a ranked

Identifying Primary Texts Where Deleuze and Foucault Critique Marxist Theory: A Literature Search Prompt

Can large language models return accurate results when asked to find real academic texts written by specific philosophers on a specific topic? In this case, the topic was Marxist theory — and the instruction was to list peer-reviewed publications in which Gilles Deleuze or Michel Foucault offer a critique of it.

Current Limitations of GenAI Models in Data Visualisation: Lessons from a Model Comparison Case Study

In earlier explorations, we identified the GenAI models that appeared most promising for data visualisation tasks—models that demonstrated strong code generation capabilities, basic support for data wrangling, and compatibility with popular Python libraries such as Matplotlib and Seaborn. In this follow-up case study, we examine a different dimension: rather

Claude 3.7 Sonnet, GPT-4o and GPT-4.5 as Top Choices for Data Visualisation with GenAI

As data visualisation becomes an increasingly integral part of communicating research and insights, the choice of generative AI models capable of handling both code and chart rendering has never been more important. In this post, we explore which models can be reliably used for visual tasks—such as generating Python-based