In this post we present a comparative evaluation of several generative AI models, focusing on their ability to carry out a straightforward sentiment analysis task. The models were asked to classify customer reviews as either positive or negative, following a simple binary labelling instruction. While the task itself was unambiguous, the results varied considerably, revealing clear differences in accuracy, consistency, and task completion across the models.
Performance Comparison
🏆 Champions as of September 29: DeepSeek-V3.2, Qwen3-Max, and Grok 4.
Model | Grade |
---|---|
DeepSeek-V3.2 | A |
Qwen3-Max | A |
Grok 4 | A |
Mistral | B |
GPT-5 | D |
Gemini 2.5 Pro | E |
Claude Sonnet 4.5 | E |
Copilot | E |
Input file
The input dataset consisted of 200 short customer reviews presented in an Excel file with two columns: Text and Sentiment. The Text column contained the original review entries, while the Sentiment column was intentionally left empty to be filled with binary labels generated by the models. This simple yet representative structure mirrors real-world customer feedback data and provides a suitable testbed for evaluating how reliably different generative AI systems can execute sentiment classification tasks.
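For readers who want to reproduce the setup, loading such a file is straightforward with pandas. The sketch below uses a hypothetical file name (`customer_reviews.xlsx`), but mirrors the two-column layout described above.

```python
import pandas as pd

# Load the review dataset; the file name is a placeholder, but the layout
# matches the one described above: a "Text" column with 200 reviews and an
# empty "Sentiment" column to be filled in by the model.
reviews = pd.read_excel("customer_reviews.xlsx")

print(reviews.shape)                       # expected: (200, 2)
print(reviews.columns.tolist())            # expected: ['Text', 'Sentiment']
print(reviews["Sentiment"].isna().all())   # True: no labels assigned yet
```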
Prompt
You are given a dataset of customer reviews in column A ("Text"). Your task is to perform sentiment analysis on each row, based on the entire content of the text cell. Instructions:
- If the overall impression of the review suggests that the customer was satisfied with the service, product, or circumstances, assign the label "1" (positive).
- If the overall impression of the review suggests that the customer was disappointed or dissatisfied in any way with the service, product, or circumstances, assign the label "0" (negative).
- Always consider the full review text when making the judgement, not just isolated phrases.
- Do not use any other labels, and do not leave any cell empty.
- Place the labels in column B ("Score"), directly aligned with the corresponding review in column A.
- Output the result as a downloadable Excel file with two columns: "Text" and "Score".
Output
DeepSeek-V3.2, Qwen3-Max, and Grok 4 demonstrated a high level of accuracy in performing the sentiment labelling task, with their classifications aligning almost perfectly with the reference data and only occasional errors arising in cases of linguistic ambiguity. Mistral produced weaker results, particularly struggling to correctly interpret informal expressions such as “mmmm” or “yummmy”, which led to misclassifications. GPT-5 adopted an improvised, ad hoc strategy rather than adhering closely to the prompt, resulting in inconsistent and unreliable outputs. In contrast, Gemini 2.5 Pro, Claude Sonnet 4.5, and Copilot failed to produce the required Excel file and thus did not complete the task.
DeepSeek-V3.2, Qwen3-Max, Grok 4

Mistral

GPT-5

Gemini 2.5 Pro, Claude Sonnet 4.5, Copilot

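For anyone wishing to repeat the comparison, a minimal sketch of scoring a returned file against hand-labelled reference data is shown below. It assumes both files follow the "Text"/"Score" layout requested in the prompt and keep rows in the original order; the file names are placeholders.

```python
import pandas as pd

# Hypothetical file names: a reference file with hand-checked labels and a
# file returned by one of the models, both in the "Text" / "Score" layout.
reference = pd.read_excel("reference_labels.xlsx")
model_out = pd.read_excel("model_output.xlsx")

# Rows are assumed to be aligned, as the prompt requires.
matches = model_out["Score"] == reference["Score"]
accuracy = matches.mean()

print(f"Accuracy: {accuracy:.1%} ({matches.sum()} / {len(matches)} correct)")

# List the misclassified reviews for manual inspection.
for text in model_out.loc[~matches, "Text"]:
    print("Misclassified:", text)
```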
Recommendations
The results of this comparative test highlight both the promise and the limitations of current generative AI models in handling sentiment analysis tasks. While some systems were able to execute the instructions with near-perfect accuracy, others either struggled with subtle linguistic cues or failed to complete the task altogether. These findings suggest that even relatively simple binary classification prompts can expose significant variation in model reliability, reinforcing the need for systematic evaluation and careful validation of outputs before applying such tools in practical or research contexts.
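As a concrete example of such validation, a lightweight pre-acceptance check could look like the sketch below. It assumes the "Text"/"Score" layout requested in the prompt and the 200-row dataset described earlier; the file name is a placeholder.

```python
import pandas as pd

def validate_output(path: str, expected_rows: int = 200) -> list[str]:
    """Run basic format checks on a returned labelling file; return a list of problems."""
    df = pd.read_excel(path)
    problems = []
    if list(df.columns) != ["Text", "Score"]:
        problems.append(f"unexpected columns: {list(df.columns)}")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(df)}")
    if "Score" in df.columns:
        if df["Score"].isna().any():
            problems.append("empty cells in the Score column")
        if not df["Score"].dropna().isin([0, 1]).all():
            problems.append("labels other than 0 or 1 in the Score column")
    return problems

# Accept a returned file only if every check passes (the file name is hypothetical).
issues = validate_output("model_output.xlsx")
print("file accepted" if not issues else issues)
```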