Can AI Models Detect AI-Generated Text? Testing GPT-5.2, Claude Sonnet 4.5, and Gemini 3 on Cover Letters

To assess whether leading AI models can reliably distinguish human-written from AI-generated text, we tested GPT-5.2, Claude Sonnet 4.5, and Gemini 3 on a set of cover letters. The models were asked to identify AI-generated content and then to rewrite the letters to appear more human-written. All three models frequently flagged texts as AI-generated, including a genuine human-written letter, and their justifications often mirrored each other regardless of whether the text was actually produced by AI. When asked to rewrite letters to avoid detection, the models reduced the reported probability estimates in some model pairings, though results were inconsistent across evaluators.

Task Description

The experiment involved two phases. In the first phase, we generated three cover letters using GPT-5.2, Claude Sonnet 4.5, and Gemini 3, all responding to the same job advertisement for an IT Project Manager position at poltextLAB. We also manually adapted a fourth, human-written cover letter from a publicly available sample. Each model was then asked to evaluate all four letters and estimate the probability that each was AI-generated.

In the second phase, each of the three models was instructed to rewrite its own original letter to make it appear less AI-generated without reducing quality. The rewritten versions were then evaluated by all three models using the same detection prompt.
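To make the protocol concrete, here is a minimal sketch of both phases in Python. Everything in it is illustrative: ask(model, prompt) is a hypothetical helper standing in for whichever SDK call sends a prompt to the named model and returns its text reply, and the prompt templates are the ones reproduced in the Prompts section below.

```python
MODELS = ["gpt-5.2", "claude-sonnet-4.5", "gemini-3"]

DETECT = (
    "Please determine whether the text below was written using generative "
    "artificial intelligence. Please respond with an exact probability of "
    "how likely the text is to be AI-generated, then briefly justify your "
    "answer.\n\n{letter}"
)

REWRITE = (
    "Please rewrite the following cover letter so that it does not seem "
    "AI-generated. Make sure that this does not decrease the quality of "
    "the letter.\n\n{letter}"
)

def run_experiment(letters, ask):
    """letters maps a source label (the three model names plus 'non-ai')
    to the corresponding cover letter text."""
    # Phase 1: every model scores every original letter.
    phase1 = {
        (judge, source): ask(judge, DETECT.format(letter=text))
        for judge in MODELS
        for source, text in letters.items()
    }
    # Phase 2: each model rewrites its own letter, then all models rescore.
    rewritten = {m: ask(m, REWRITE.format(letter=letters[m])) for m in MODELS}
    phase2 = {
        (judge, source): ask(judge, DETECT.format(letter=text))
        for judge in MODELS
        for source, text in rewritten.items()
    }
    return phase1, phase2
```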

The materials generated for this experiment can be accessed here:

Prompts

The cover letter generation prompt was as follows:

Please write a cover letter for the following job advertisement on behalf of the applicant whose CV you can find attached. The advertisement can be found here: http://linkedin.com/jobs/view/it-projektmenedzser-at-poltextlab-4289904538/?originalSubdomain=hu

The detection prompt instructed:

Please determine whether the text below was written using generative artificial intelligence. Please respond with an exact probability of how likely the text is to be AI-generated, then briefly justify your answer.

The rewriting prompt asked:

Please rewrite the following cover letter so that it does not seem AI-generated. Make sure that this does not decrease the quality of the letter.
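Note that although the detection prompt asks for an exact probability, the replies sometimes contain a range such as 85–90% rather than a point estimate, as Tables 1 and 2 below show. Tabulating the answers automatically therefore needs a small parser. The version below is a sketch written for this post, not the experiment's actual tooling; it collapses a range to its midpoint.

```python
import re

def parse_probability(reply: str) -> float | None:
    """Extract the reported AI-generation probability from a free-text reply.

    Handles both ranges ("85-90%") and point estimates ("43%"); a range is
    collapsed to its midpoint. Returns a value in [0, 1], or None if the
    reply contains no percentage.
    """
    # Look for a range like "85-90%" (hyphen or en dash) first.
    m = re.search(r"(\d{1,3})\s*[-\u2013]\s*(\d{1,3})\s*%", reply)
    if m:
        lo, hi = int(m.group(1)), int(m.group(2))
        return (lo + hi) / 2 / 100
    # Fall back to a single percentage like "43%".
    m = re.search(r"(\d{1,3})\s*%", reply)
    if m:
        return int(m.group(1)) / 100
    return None

assert parse_probability("I estimate an 85-90% likelihood that...") == 0.875
assert parse_probability("Probability: 43%. The presence of...") == 0.43
```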

Model Outputs

Phase 1: Initial Detection

The models' responses are summarized in Table 1.

Table 1: Phase 1 Detection Results (Original Letters)

| Evaluating Model | GPT Letter (AI-generated) | Claude Letter (AI-generated) | Gemini Letter (AI-generated) | Non-AI Letter (Human-written) |
|---|---|---|---|---|
| GPT-5.2 | 43% | 72% | 73% | 78% |
| Claude Sonnet 4.5 | 85–90% | 85–90% | 85–90% | 75–85% |
| Gemini 3 | 15% | 85% | 85% | 85% |

In phase 1, the models mostly assigned high probabilities to the AI-generated texts, although GPT-5.2 and Gemini 3 rated the GPT-generated letter at only 43% and 15%, respectively. Their treatment of the non-AI letter revealed a more critical limitation: none of the models identified it as human-written, and all three rated it at least 75% likely to be AI-generated.

GPT-5.2 assigned probabilities ranging from 43% to 78% across the four letters. For the GPT-generated letter, it estimated 43%, noting that "the presence of specific, verifiable details leans toward human authorship." For the Claude-generated letter, it assigned 72%, citing "uniform tone and optimization-style phrasing." Notably, for the non-AI letter, GPT assigned 78%—the highest probability in its evaluation set—stating that "the smoothness, lack of minor imperfections, and textbook alignment with job-application conventions raise the likelihood of generative AI involvement."

Claude Sonnet 4.5 was more confident in flagging content as AI-generated, with probabilities ranging from 75% to 90%. It assigned 85–90% to the GPT-generated letter, pointing to "overly polished structure" and "generic corporate language." For its own output, it also estimated 85–90%, noting "AI-like smoothness but enough individualized detail to keep the likelihood below certainty." For the non-AI letter, Claude assigned 75–85%, identifying "formulaic structure" and "buzzword density."

Gemini 3 exhibited the most variation in its assessments. It assigned 15% to the GPT-generated letter, arguing that "the way specific institutions are woven into the narrative feels like a genuine career progression." For the Claude-generated letter, Gemini estimated 85%, highlighting "formulaic structure" and "predictive phrasing." For its own output, it assigned 85%, and for the non-AI letter, also 85%, citing "uniformity of tone" and "syntactic smoothing."

A striking pattern emerged: the models frequently used nearly identical reasoning whether evaluating AI-generated or human-written text. Phrases like "formulaic structure," "buzzword density," "overly polished," and "lack of personality" appeared in justifications for both categories, suggesting that these criteria do not reliably distinguish between the two.
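The failure is easy to quantify. The snippet below encodes Table 1 with ranges collapsed to midpoints and computes, for each judge, the gap between its score for the human letter and its mean score for the three AI letters. A working detector would produce a clearly negative gap; here GPT-5.2 and Gemini 3 produce positive gaps, and Claude Sonnet 4.5 only a marginal negative one.

```python
# Table 1 as data, with ranges collapsed to midpoints (e.g. 85-90% -> 0.875).
phase1 = {
    "GPT-5.2":           {"gpt": 0.43,  "claude": 0.72,  "gemini": 0.73,  "human": 0.78},
    "Claude Sonnet 4.5": {"gpt": 0.875, "claude": 0.875, "gemini": 0.875, "human": 0.80},
    "Gemini 3":          {"gpt": 0.15,  "claude": 0.85,  "gemini": 0.85,  "human": 0.85},
}

for judge, scores in phase1.items():
    ai_mean = sum(scores[k] for k in ("gpt", "claude", "gemini")) / 3
    # A positive gap means the judge rated the human letter MORE likely to be
    # AI-generated than the average AI letter: the wrong direction entirely.
    print(f"{judge}: mean AI score {ai_mean:.2f}, human score "
          f"{scores['human']:.2f}, gap {scores['human'] - ai_mean:+.2f}")
```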

Phase 2: Detection After Rewriting

The models' responses for this phase are summarized in Table 2.

Table 2: Phase 2 Detection Results (Rewritten Letters)

| Evaluating Model | GPT-Rewritten (AI-generated) | Claude-Rewritten (AI-generated) | Gemini-Rewritten (AI-generated) |
|---|---|---|---|
| GPT-5.2 | 62% | 18% | 68% |
| Claude Sonnet 4.5 | 75–85% | 85–90% | 85–90% |
| Gemini 3 | 35% | 15–20% | 15–20% |

After each model rewrote the letters to appear more human-like, the reported probabilities shifted—though not uniformly.

GPT-5.2 lowered its estimates for the other models' letters but raised it for its own. For its own rewritten letter, it assigned 62%, up from 43% for the original. For Claude's rewritten version, it estimated 18%, a sharp drop from 72%. For Gemini's rewritten version, it assigned 68%, down slightly from 73%.

Claude Sonnet 4.5 maintained high confidence in detection. For GPT's rewritten letter, it assigned 75–85%, compared to 85–90% for the original. For its own rewritten version, it estimated 85–90%—unchanged from the original assessment. For Gemini's rewritten letter, it assigned 85–90%.

Gemini 3 showed the most variation in its revised assessments. For GPT's rewritten letter, it estimated 35%, citing that "the core substance and flow appear to be human-led." For Claude's rewritten version, it assigned 15–20%, noting "specific hyper-local context" and "atypical structure." For its own rewritten letter, Gemini estimated 15–20%.

The results suggest that rewriting can reduce detection confidence in some model pairings but not others. Notably, Claude remained highly confident in flagging all rewritten letters as AI-generated, while Gemini frequently assigned low probabilities to rewritten versions—including its own.
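To see how non-uniform the shift was, one can difference the two tables directly, again using midpoints for ranged answers. The deltas span from roughly a 0.68 drop (Gemini 3 on the Claude and Gemini rewrites) to a 0.20 rise (Gemini 3 on the GPT rewrite), with GPT-5.2's estimate of its own rewrite also rising by 0.19.

```python
# Phase 1 vs. phase 2 scores per (judge, letter) pairing, range midpoints.
before = {
    ("GPT-5.2", "gpt"): 0.43,            ("GPT-5.2", "claude"): 0.72,
    ("GPT-5.2", "gemini"): 0.73,
    ("Claude Sonnet 4.5", "gpt"): 0.875, ("Claude Sonnet 4.5", "claude"): 0.875,
    ("Claude Sonnet 4.5", "gemini"): 0.875,
    ("Gemini 3", "gpt"): 0.15,           ("Gemini 3", "claude"): 0.85,
    ("Gemini 3", "gemini"): 0.85,
}
after = {
    ("GPT-5.2", "gpt"): 0.62,            ("GPT-5.2", "claude"): 0.18,
    ("GPT-5.2", "gemini"): 0.68,
    ("Claude Sonnet 4.5", "gpt"): 0.80,  ("Claude Sonnet 4.5", "claude"): 0.875,
    ("Claude Sonnet 4.5", "gemini"): 0.875,
    ("Gemini 3", "gpt"): 0.35,           ("Gemini 3", "claude"): 0.175,
    ("Gemini 3", "gemini"): 0.175,
}

for (judge, source), p1 in before.items():
    p2 = after[(judge, source)]
    # Positive deltas mean the rewrite INCREASED the judge's suspicion.
    print(f"{judge} on {source}-rewritten: {p1:.2f} -> {p2:.2f} ({p2 - p1:+.2f})")
```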

Analysis

The experiment highlights several key limitations in AI-based text detection:

First, all three models failed to correctly identify the non-AI letter. GPT-5.2 assigned it 78%, Claude Sonnet 4.5 assigned it 75–85%, and Gemini 3 rated it 85% likely to be AI-written. This suggests that the models rely on stylistic markers, such as professional polish and structural consistency, that are common to both skilled human writers and AI-generated text.

Second, the justifications provided by the models were often indistinguishable regardless of whether the text was AI-generated or human-written. Terms like "formulaic," "overly polished," and "buzzword density" appeared frequently in both contexts, indicating that these criteria lack discriminatory power.

Third, the rewriting task revealed that rewriting can reduce detection confidence, but only inconsistently. Gemini 3 frequently assigned low probabilities to rewritten versions, Claude Sonnet 4.5 maintained high confidence across the board, and GPT-5.2's estimate of its own rewrite actually rose. This inconsistency suggests that the rewritten letters may not be objectively "less AI-like" but rather trigger different heuristics in different models.

Recommendations

Based on these results, relying on AI models for text authenticity verification is not advisable. The models consistently failed to distinguish between high-quality human writing and AI-generated text, and their justifications were often circular or interchangeable. Even when explicitly instructed to rewrite letters to avoid detection, the models produced outputs that were flagged inconsistently depending on which model performed the evaluation.


The authors used GPT-5.2 [OpenAI (2025), GPT-5.2, large language model (LLM), available at: https://openai.com], Claude Sonnet 4.5 [Anthropic (2025), Claude Sonnet 4.5, large language model (LLM), available at: https://www.anthropic.com], and Gemini 3 [Google DeepMind (2025), Gemini 3, large language model (LLM), available at: https://deepmind.google/technologies/gemini/] to generate the outputs.