Miklós Sebők - Rebeka Kiss - Dávid László

Testing Winston.AI and AI or Not for AI-Generated Image Detection

As AI image generation tools become increasingly sophisticated, the ability to distinguish between AI-generated and authentic photographs has emerged as a critical challenge. Platforms like Winston.AI and AI or Not claim to address this need by detecting synthetic imagery with high accuracy. But how well do these detection systems

Benchmarking AI Audio Transcription: How Leading Models Handle English and Hungarian Speech

As audio becomes an increasingly common input for AI assistants, it is natural to ask whether these platforms can also transcribe pre-recorded audio accurately. In this benchmark, we tested five leading generative AI models on identical English and Hungarian audio files to assess their transcription accuracy, language support, and practical

Can GenAI Models Detect Plagiarism Reliably? A Multi-Model Benchmark on Academic Text

As large language models become increasingly sophisticated tools for academic writing support and web search, a critical question emerges: can these models also be used to reliably identify plagiarism? To test this capability, we embedded verbatim, unattributed passages from published academic articles into newly generated texts and asked multiple GenAI

Exploring Temperature Settings in Creative Writing: A Haiku Generation Case Study with OpenAI API

Every time you prompt a large language model, you're witnessing the result of thousands of sequential sampling decisions—each token drawn from a probability distribution over the model's vocabulary. In standard chat interfaces, these distributions are sampled with fixed settings that users cannot adjust. But API
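As a plain-Python illustration of what the temperature setting controls (a minimal sketch, not code from the case study): temperature rescales the model's logits before sampling, so low values sharpen the distribution toward the most likely token while high values flatten it toward more diverse output.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits into sampling probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of three candidate tokens with raw logits.
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, 0.2)  # near-greedy: top token dominates
hot = softmax_with_temperature(logits, 2.0)   # flatter: more diverse sampling

# The top token's probability shrinks as temperature rises.
print(cold[0] > hot[0])  # → True
```

At temperature 0.2 the top token takes over 99% of the probability mass here; at 2.0 it drops to roughly half, which is why higher temperatures yield more varied haiku.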

Testing AI Coding Abilities: How GPT 5.2, Opus 4.5, Gemini 3 Pro, and Grok 4.1 Handle Algorithm Problems

As AI models increasingly demonstrate coding capabilities, understanding the reliability and usability of AI-written code becomes essential for researchers and practitioners. In this benchmark study, we evaluated four recent models—GPT 5.2 Thinking, Opus 4.5, Gemini 3 Pro, and Grok 4.1 Expert—on two algorithm problems

Testing PangramLabs for AI Text Detection: Impressive but Imperfect Performance

Can AI detection tools reliably distinguish between human and machine-generated text? In our previous testing, we found that platforms like Originality.ai and ZeroGPT, as well as GenAI models, frequently misclassified both AI-generated and human-written text. Now, PangramLabs has gained attention with claims of near-perfect accuracy, verified by third-party organizations. In this

Understanding Model Confidence Through Log Probabilities: A Practical OpenAI API Implementation

Log probabilities (logprobs) provide a window into how confident a language model is about its predictions. In this technical implementation, we demonstrate how to access and interpret logprobs via the OpenAI API, using a series of increasingly difficult multiplication tasks. Our experiment reveals that declining confidence scores can effectively signal
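For readers who want to try this themselves, here is a minimal sketch of the relevant request parameters and a helper for reading the returned values (the model name and prompt are placeholders, not the exact setup from the experiment). A logprob is the natural log of a token's probability, so exponentiating it recovers the model's confidence.

```python
import math

def logprob_to_confidence(logprob):
    """Exponentiate a token's log probability to recover the model's
    confidence in that token as a value between 0 and 1."""
    return math.exp(logprob)

# Request shape for the OpenAI Chat Completions API; logprobs=True asks the
# server to return per-token log probabilities alongside the generated text.
request = {
    "model": "gpt-4o-mini",  # placeholder model name
    "messages": [{"role": "user", "content": "What is 137 * 241?"}],
    "logprobs": True,
    "top_logprobs": 5,  # also return the 5 most likely alternatives per position
}

# Example: a returned logprob of -0.05 corresponds to ~95% confidence.
print(round(logprob_to_confidence(-0.05), 2))  # → 0.95
```

In practice, each token in the response carries its own logprob, so averaging or inspecting the minimum across an answer's tokens gives a rough signal of how sure the model was.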

Testing Claude Projects' New RAG Feature: Fast Setup, Accurate Retrieval Across 113 Articles

Claude Projects recently introduced a Retrieval-Augmented Generation (RAG) feature that extends its capabilities beyond the standard context window. Unlike traditional approaches that require custom vector databases and API integration, Claude's implementation offers a streamlined, no-code solution for handling large knowledge bases directly within the Projects interface. In this

Building Intelligent Tool Use with the OpenAI API: A Practical Implementation Guide

Tool use (also called function calling) is one of the most powerful capabilities available in modern language models. It allows models to extend beyond text generation by interacting with external systems, databases, or custom logic—making AI agents capable of real-world tasks like checking weather, querying APIs, or running calculations.
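The pattern can be sketched in a few lines (the tool name, schema, and dispatch logic here are illustrative, not the guide's exact code): the application advertises each tool to the model as a JSON schema, the model replies with the name and JSON-encoded arguments of the tool it wants to call, and the application executes the call and returns the result.

```python
import json

# JSON schema advertising one tool to the model, in the shape the
# OpenAI Chat Completions API expects for the `tools` parameter.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Local implementations the application actually runs.
def get_weather(city):
    return {"city": city, "temp_c": 21}  # stubbed data for illustration

DISPATCH = {"get_weather": get_weather}

def handle_tool_call(name, arguments_json):
    """Execute the tool the model asked for and return a JSON result string."""
    args = json.loads(arguments_json)
    return json.dumps(DISPATCH[name](**args))

# Simulate the model requesting a call; the API returns this name/arguments
# pair in `message.tool_calls[i].function`.
print(handle_tool_call("get_weather", '{"city": "Budapest"}'))
```

The result string is then sent back to the model in a follow-up message with role `tool`, letting it weave the live data into its final answer.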

Can AI Models Detect AI-Generated Text? Testing GPT-5.2, Claude Sonnet 4.5, and Gemini 3 on Cover Letters

To assess whether leading AI models can reliably distinguish between human-written and AI-generated text, we tested GPT-5.2, Claude Sonnet 4.5, and Gemini 3 on a set of cover letters. The models were asked to identify AI-generated content and then to rewrite letters to appear more human-written. The results