As large language models become increasingly sophisticated tools for academic writing support and web search, a critical question emerges: can these models also be used to reliably identify plagiarism? To test this capability, we embedded verbatim, unattributed passages from published academic articles into newly generated texts and asked multiple GenAI models to detect plagiarism. The results reveal a significant limitation: most of the models failed to identify the plagiarized content, instead praising the texts as "properly referenced" and as demonstrating "exemplary academic integrity."
Introduction
Academic integrity depends on proper attribution of sources. When text is copied without acknowledgment, it constitutes plagiarism—one of the most serious violations in scholarly work. As GenAI models are increasingly used to support academic writing tasks, from literature reviews to reference formatting, it is natural to ask whether they can also serve as plagiarism checkers.
Unlike dedicated plagiarism detection tools that compare text against massive databases of published work, GenAI models rely on pattern recognition, their training data, and web search tools to assess text. This fundamental difference raises questions about their effectiveness as plagiarism checkers.
In this benchmark study, we test whether leading GenAI models can identify clear cases of plagiarism embedded in otherwise well-formatted academic texts.
Input Data
We created two test documents, each containing a verbatim, unattributed passage from a published academic article embedded within a longer, AI-generated text. This design allowed us to test whether models could detect actual plagiarism—not just assess citation formatting.
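As a sanity check on this design, the ground truth can be verified mechanically: the source passage must appear verbatim in the generated document. A minimal sketch in Python, with hypothetical file names standing in for the generated document and the copied passage:

```python
import re

def normalize(text: str) -> str:
    # Collapse whitespace so line wrapping does not mask an exact match.
    return re.sub(r"\s+", " ", text).strip()

# File names are hypothetical placeholders for the generated test document
# and the passage copied from the published article.
with open("test_document_v1.txt", encoding="utf-8") as f:
    document = normalize(f.read())
with open("source_passage_v1.txt", encoding="utf-8") as f:
    passage = normalize(f.read())

print("verbatim match:", passage in document)
```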
Test Document V1: Illiberal Policy Frames
The first test document was a four-paragraph academic text on illiberal policy frames and crisis exploitation. We used Claude Sonnet 4.5 to generate the surrounding text, with specific instructions to include this verbatim passage from the recently published Sebők et al. (2025) article:
"We understand IPF usage as a strategic and variable response to perceived threats that can be measured and analysed across contexts. Therefore, our research tests whether legislators from across the democratic spectrum use crises to perpetuate illiberalism. Our central hypothesis in this context is crisis exploitation theory. By challenging existing institutional routines, policy instruments, and agendas (Alink et al., 2001), crises present leaders with the opportunity to further their ideological aims while their constituencies are rattled by uncertainty."
Crucially, the prompt instructed Claude to include this passage without any citation, creating a clear case of plagiarism. The surrounding text was formatted with proper academic citations to other sources, making the document appear well-referenced overall—but containing one significant unattributed section. The generated file can be accessed here:
Original source: Miklós Sebők, Áron Buzogány, Julia Fleischer, Theresa Gessler, Anna Takács, Sean M. Theriault & Ákos Holányi (2025): "Crisis-exploitation or fear-mongering? A research agenda for the comparative study of policy crises and illiberal policy frames," Journal of European Public Policy, DOI: 10.1080/13501763.2025.2583176
Test Document V2: Parliamentary Questioning in Hungary
The second test document was a four-paragraph text on parliamentary questions and oversight mechanisms. Again using Claude Sonnet 4.5, we embedded an excerpt from an earlier article, Sebők, Molnár & Kubik (2017), which is more likely to be included in the models' training data than Sebők et al. (2025). The embedded passage was:
"Our research strategy was informed by the rules and conventions governing parliamentary questioning in Hungary. According to these rules, parliamentary questions can be written or oral. Written questions must be answered in writing in 30 days. The three types of oral question – interpellations, 'regular' questions, urgent questions – are presented in a plenary session of the National Assembly. Each MP has the right to introduce various question types; agenda access, however, is controlled by parliamentary and party group leadership for most parliamentary question types."
And the output can be accessed here:
Original source: Miklós Sebők, Csaba Molnár & Bálint György Kubik (2017): "Exercising control and gathering information: the functions of interpellations in Hungary (1990–2014)," The Journal of Legislative Studies, pp. 465–483, DOI: 10.1080/13572334.2017.1394734
Prompt
We used a simple, direct prompt to ask each model to check for plagiarism:
Could you check if the following text is properly referenced? Does it go against plagiarism guidelines?
This prompt was intentionally straightforward and mirrored how users might naturally ask a GenAI model to review their work. We submitted both test documents (V1 and V2) to each model using this identical prompt.
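For anyone wishing to script the same procedure, a minimal sketch using the OpenAI Python SDK; the model identifier and file names are placeholder assumptions, and the other models expose analogous APIs:

```python
from openai import OpenAI

PROMPT = ("Could you check if the following text is properly referenced? "
          "Does it go against plagiarism guidelines?")

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def check_plagiarism(document_text: str, model: str = "gpt-5.2") -> str:
    # Submit the fixed prompt followed by the full test document in one turn.
    response = client.chat.completions.create(
        model=model,  # placeholder identifier; substitute the model under test
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{document_text}"}],
    )
    return response.choices[0].message.content

for path in ("test_document_v1.txt", "test_document_v2.txt"):  # hypothetical names
    with open(path, encoding="utf-8") as f:
        print(f"--- {path} ---\n{check_plagiarism(f.read())}")
```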
Model Outputs
GPT-5.2
Test Document V1 (Illiberal Policy Frames):
GPT-5.2 concluded that "nothing in what you pasted looks 'obviously plagiaristic' on its face" and that the text appeared properly referenced. The model noted that citations matched the reference list and praised the citation coverage. Its primary concerns focused on minor bibliographic formatting issues (e.g., incomplete entries for Albertazzi & McDonnell 2008) and the need to clarify whether "crisis exploitation theory" was the author's own framework.
Critically, GPT-5.2 added a caveat: "I can't verify whether any sentences are too close to the original sources without comparing your wording to those texts." This acknowledgment reveals an important limitation: the model cannot actually detect patchwriting or verbatim copying without access to the original sources. Moreover, despite having access to web search tools, its output shows no clear attempt to search for text matching the supplied content.
Test Document V2 (Parliamentary Questioning):
For V2, GPT-5.2 identified more specific issues, noting that the Hungary-procedure paragraph was "under-referenced" and contained claims that conflicted with Hungarian legal documents. The model correctly observed that "specific, country-specific procedural claims" should be backed by primary sources. However, once again, it did not detect that this entire passage had been copied verbatim from Sebők et al. (2017). Instead, it focused on citation gaps and suggested adding references—not recognizing that the text itself was plagiarized.
Claude Sonnet 4.5
Test Document V1:
Claude Sonnet 4.5 assessed the text as "properly referenced" and concluded it "does not obviously violate plagiarism guidelines." The model praised the "comprehensive citation practice" and "clear attribution" of ideas. It noted that the text "synthesizes ideas rather than copying verbatim passages"—a conclusion that was factually incorrect, as the passage from Sebők et al. (2025) was indeed copied verbatim.
Like GPT-5.2, Claude added the qualifier that "proper paraphrasing" assumes "the wording is your own paraphrase and not copied closely from sources," but it did not detect the actual copying present in the text.
Test Document V2:
For V2, Claude again concluded the text was "properly referenced and does NOT violate plagiarism guidelines." It praised the "comprehensive citations" and "original synthesis." However, it did note that the second paragraph about Hungarian parliamentary rules "contains no citations" and recommended adding them if the information came from specific sources. This observation came closer to identifying the problem—but Claude still did not recognize that the passage was plagiarized text that should have been attributed to Sebők et al. (2017).
Gemini 3 Pro
Test Document V1:
Gemini 3 Pro delivered the most promising response. The model identified that the text "appears to closely match a very recent academic publication by Miklós Sebők et al." published in the Journal of European Public Policy, and flagged the use of first-person plural ("we term 'illiberal policy frames'") as problematic if the user was not one of the original authors.
Gemini correctly noted: "Since this framework was coined and published by Sebők et al., presenting it as 'our research' without being one of those authors is academic misconduct." With this, it essentially carried out the task correctly.
Test Document V2:
For V2, Gemini did not identify the plagiarism from Sebők et al. (2017). Instead, it focused on citation formatting issues and noted that the Hungary-specific paragraph lacked sources. Like other models, it recommended adding citations but did not recognize the verbatim copying.
DeepSeek 3.2 Reasoning
Test Document V1:
DeepSeek offered the strongest praise, calling the text "excellently referenced" and stating it "does not contravene plagiarism guidelines." The model described it as "a model of proper academic sourcing and synthesis" and concluded it was "plagiarism-free" with "exemplary academic integrity."
DeepSeek noted: "The author does not simply copy sentences from the sources. Instead, they synthesize ideas from multiple authors to build their own argument."
Test Document V2:
For V2, DeepSeek again praised the paraphrasing as "adequate" but noted that the Hungarian rules section "lacks any citation, which is a significant omission." Interestingly, DeepSeek suggested citing Sebők, Molnár & Kubik (2017)—the very source from which the passage was plagiarized—but framed this as a recommendation to strengthen the analysis, not as detection of existing plagiarism. The model noted the paper "empirically studies this very topic" and would be "an excellent source," without recognizing that the text had already been copied from it.
Performance Comparison
Detection Results
| Model | V1 Detection | V2 Detection |
|---|---|---|
| GPT-5.2 | Not detected | Not detected |
| Claude Sonnet 4.5 | Not detected | Not detected |
| Gemini 3 Pro | Detected | Not detected |
| DeepSeek 3.2 Reasoning | Not detected | Not detected |
Recommendations
In our experiment, only Gemini 3 Pro identified the verbatim plagiarism, and only in the first test case; even this model missed the plagiarism in the second. Across the board, the models focused primarily on citation formatting, completeness, and stylistic issues.
Based on these observations, GenAI models should not be relied upon as plagiarism detection tools. While they can offer useful feedback on citation formatting and academic writing style, they cannot reliably detect verbatim copying or close paraphrasing without attribution. Traditional plagiarism detection systems that compare text against databases of published work remain essential.
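The gap is stark because verbatim copying of this kind is trivial to catch once the source text is available for comparison. A minimal illustration of the idea behind database-backed detectors, using word n-gram overlap (the input variables are hypothetical; real systems apply this at the scale of full publication databases):

```python
N = 8  # shared runs of 8+ words are a strong signal of verbatim copying

def word_ngrams(text: str, n: int = N) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_ngrams(source: str, candidate: str) -> set[tuple[str, ...]]:
    # Intersection of word n-grams from the two texts.
    return word_ngrams(source) & word_ngrams(candidate)

# source_article_text and test_document_text are hypothetical variables
# holding the published article and the document under review.
overlap = shared_ngrams(source_article_text, test_document_text)
if overlap:
    print(f"{len(overlap)} shared {N}-word sequences: likely verbatim copying")
```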
The authors used Claude Sonnet 4.5 [Anthropic (2025), Claude Sonnet 4.5, Large language model (LLM), available at: https://anthropic.com] to generate the test documents, and GPT-5.2 [OpenAI (2025), GPT-5.2, Large language model (LLM), available at: https://openai.com], Claude Sonnet 4.5, Gemini 3 Pro [Google (2025), Gemini 3 Pro, Large language model (LLM), available at: https://gemini.google.com], and DeepSeek Reasoning [DeepSeek (2025), DeepSeek Reasoning, Large language model (LLM), available at: https://www.deepseek.com] for plagiarism detection testing.