Testing AI Coding Abilities: How GPT 5.2, Opus 4.5, Gemini 3 Pro, and Grok 4.1 Handle Algorithm Problems


As AI models increasingly demonstrate coding capabilities, understanding the reliability and usability of AI-written code becomes essential for researchers and practitioners. In this benchmark study, we evaluated four recent models—GPT 5.2 Thinking, Opus 4.5, Gemini 3 Pro, and Grok 4.1 Expert—on two algorithm problems from LeetCode, a platform commonly used for programming interview preparation and skill assessment. Each model received identical prompts requesting optimized Python code, and each solution was submitted to LeetCode's automated testing system for objective performance measurement.

Input Problems

We selected two recently published LeetCode problems representing different algorithmic challenges:

Task 1: Maximum Frequency of an Element After Performing Operations I (Medium)

What the problem asks: Imagine you have a list of numbers and can modify some of them by adding or subtracting a value (within a specified range). The goal is to figure out how many times you can make the same number appear after these modifications.

Example:

  • Starting list: [1, 4, 5]
  • You can add/subtract up to 1 from any number
  • You can modify 2 numbers
  • Best result: Change 5 to 4 (subtract 1) and 4 to 4 (add 0), giving you two 4s

Why this is challenging: The algorithm needs to efficiently determine which target value would appear most frequently after modifications, considering both existing values – which count as "free" additions if they are equal to the target value – and potential transformations.
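As a concrete reference point, the task can be stated as a brute-force check that tries every candidate target directly. This is an illustration only (far too slow for LeetCode's limits), not any model's submission:

```python
def brute_max_frequency(nums, k, numOperations):
    """Try every candidate target value directly (illustration only)."""
    best = 0
    for t in range(min(nums) - k, max(nums) + k + 1):
        exact = sum(1 for x in nums if x == t)                      # already equal: free
        near = sum(1 for x in nums if x != t and abs(x - t) <= k)   # one operation each
        best = max(best, exact + min(near, numOperations))
    return best

print(brute_max_frequency([1, 4, 5], k=1, numOperations=2))  # 2, per the example above
```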

Task 2: Total Waviness of Numbers in Range II (Hard)

What the problem asks: For every number in a given range (which could span trillions of numbers), calculate its "waviness"—a measure of how much its digits go up and down. A digit is a "peak" if it's larger than both neighbors, and a "valley" if it's smaller than both neighbors. The task is to add up the total waviness across all numbers in the range.

Example:

  • For numbers 120 to 130:
  • 120: The middle digit (2) is higher than both 1 and 0, so it's a peak. Waviness = 1.
  • 121: The middle digit (2) is higher than both 1s. Waviness = 1.
  • 122-129: The digits are increasing, so no digits are peaks or valleys. Waviness = 0.
  • 130: The middle digit (3) is higher than both 1 and 0. Waviness = 1.
  • Total waviness for this range: 3
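The definition can be pinned down in a few lines. This helper (an illustration written for this article, not any model's code) reproduces the example above:

```python
def waviness(x):
    """Count peaks and valleys among the interior digits of x."""
    s = str(x)
    return sum(
        1
        for i in range(1, len(s) - 1)
        if s[i - 1] < s[i] > s[i + 1] or s[i - 1] > s[i] < s[i + 1]
    )

print(waviness(120))                                  # 1: the 2 is a peak
print(sum(waviness(x) for x in range(120, 131)))      # 3, matching the example
```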

Why this is challenging: With ranges potentially spanning up to 10^15 numbers, checking each individual number would take far too long. The solution requires a technique called "digit dynamic programming" that efficiently counts patterns across large numerical ranges without examining every single number.

Prompt

Each model received standardized prompts designed to elicit optimized solutions. We explicitly requested both speed and memory efficiency, allowing models to make their own optimization decisions.

For Task 1:

Please write Python code that solves the task below. Try to make your solution as fast and memory-efficient as possible.

[Problem statement for Maximum Frequency of an Element After Performing Operations I]

Start your solution with:

class Solution:
    def maxFrequency(self, nums: List[int], k: int, numOperations: int) -> int:

For Task 2:

Please write Python code that solves the task below. Try to make your solution as fast and memory-efficient as possible.

[Problem statement for Total Waviness of Numbers in Range II]

Start your solution with:

class Solution:
    def totalWaviness(self, num1: int, num2: int) -> int:

Output

All four models successfully generated working solutions on their first attempt for both problems. Each model's code was submitted to LeetCode's automated testing system, which runs the solution against numerous test cases and measures both correctness and performance. Exact performance results are displayed below for both tasks, with models ranked by their code's runtime. In the following sections, we discuss potential strategies for these tasks and describe the models' approaches in detail. For the first task, our main conclusion is that while GPT 5.2 employed a strategy significantly faster than the alternatives, that strategy is also riskier and could run into memory or runtime issues on different test cases. This highlights the continued need to carefully review and supervise AI-written code. For the second task, Gemini found an impressively fast and memory-efficient solution that gave its code an edge over the other models' code.

Task 1: Maximum Frequency of an Element (Medium)

Performance Results

Model            | Runtime | Faster than        | Memory   | Less than          | Success
GPT 5.2 Thinking | 295 ms  | 97% of submissions | 34.56 MB | 27% of submissions | ✅ Yes
Opus 4.5         | 565 ms  | 55% of submissions | 34.18 MB | 31% of submissions | ✅ Yes
Grok 4.1 Expert  | 648 ms  | 31% of submissions | 34.76 MB | 26% of submissions | ✅ Yes
Gemini 3 Pro     | 1273 ms | 9% of submissions  | 56.51 MB | 6% of submissions  | ✅ Yes

Possible Approaches:

There are a few effective ways to solve this first task. The shared idea is that each number x can be adjusted to land anywhere in the small interval [x−k, x+k]. The best target value is usually one where many of these intervals overlap, because that's where lots of numbers could be made equal. Some distinct possible strategies are:

  • Range marking: First, find the smallest and largest values in the list and extend them by the allowed adjustment range, which gives the overall span where any target value could lie. For each original number x, treat the values it can reach as an interval [x−k, x+k] and mark where that interval starts and ends in this target-value span. Then sweep through the span while keeping a running count of how many intervals cover the current value—this tells you how many numbers could be turned into that target.
    To account for the difference between targets that already exist and targets that don’t, you also keep a simple count of how often each value appears in the original list. When you consider a particular target t, any numbers already equal to t count immediately (no operation needed).
    Practical considerations: This strategy has the considerable advantage of arriving at the right answer without any sorting step, which is generally a costly operation. However, it is also risky: when values are large or spread out, the overall target span contains a huge number of candidates, making the strategy infeasible due to memory/time limits.
  • Sort + window: Sort the numbers and use two pointers to efficiently count how many values can be adjusted to match a target. To handle targets that already exist, pick a target value t from the array and look at all numbers that lie within k of it—that is, all values between t−k and t+k. In the sorted list, the two pointers are moved so that they bracket exactly this range, and the number of elements inside it tells you how many values could be turned into t. To handle targets that don’t already exist, use a different two-pointer scan: maintain the largest window where the smallest and largest values differ by at most 2k. Any group like this can be shifted to some common target value, but if that target isn’t already present, every matched element would require an operation.
    Practical considerations: Sorting takes time, and it can be tricky to correctly account for duplicates and the limit of changing only a given number of values.
  • Event sweep: Instead of checking every possible target value one by one, you focus only on the points where something changes. For each original number x, it can be adjusted to any target in [x−k, x+k], so you create two checkpoints: where this “reachability” starts (x−k) and where it ends (x+k). You collect all checkpoints, sort them, and scan from left to right. As you pass a start point, you add 1 to a running count; as you pass an end point, you subtract 1. This running count tells you, for the current stretch of values, how many numbers could be turned into targets in that stretch.
    To account for the difference between targets that already exist and targets that don’t, you add one extra ingredient: you also keep track of how many times each exact value appears in the original list. When you evaluate a specific target value t, the running count tells you how many numbers could reach t, but only the numbers already equal to t are “free.”
    Practical considerations: You create and sort many checkpoints (typically two per number), which adds overhead and can make this approach slower in practice.
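The range-marking idea can be sketched with a difference array. This is a hypothetical standalone function for illustration, not any model's actual submission; note how the array's size scales with the value span, which is exactly the risk discussed above:

```python
from collections import Counter

def max_frequency_range_marking(nums, k, numOperations):
    lo, hi = min(nums) - k, max(nums) + k
    diff = [0] * (hi - lo + 2)          # memory grows with the value span!
    exact = Counter(nums)               # how often each value already appears
    for x in nums:                      # mark where [x-k, x+k] starts and ends
        diff[x - k - lo] += 1
        diff[x + k - lo + 1] -= 1
    best = cover = 0
    for i in range(hi - lo + 1):        # sweep every candidate target t
        cover += diff[i]                # how many intervals cover t
        t = lo + i
        free = exact[t]                 # already equal to t: no operation needed
        best = max(best, free + min(cover - free, numOperations))
    return best

print(max_frequency_range_marking([1, 4, 5], 1, 2))  # 2
```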

GPT 5.2 Thinking


Performance: Runtime 295 ms (faster than 97% of submissions), Memory 34.56 MB (less than 27% of submissions)
Strategy used: GPT 5.2 used range marking. As mentioned, this strategy avoids sorting, which explains its exceptionally strong runtime. The strategy is risky, however, and could have backfired with a different set of test cases.

Opus 4.5


Performance: Runtime 565 ms (faster than 55% of submissions), Memory 34.18 MB (less than 31% of submissions)
Strategy used: Opus 4.5 used a sort + window strategy with two complementary cases: targets that already exist in the array and targets that don't. The sorting step and the extra bookkeeping from the two passes add overhead compared to a single unified range-marking sweep.
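The two-case sort + window idea can be sketched as follows (a reconstruction for illustration, not Opus 4.5's actual submission):

```python
import bisect
from collections import Counter

def max_frequency_sort_window(nums, k, numOperations):
    nums = sorted(nums)
    exact = Counter(nums)
    best = 0
    # Case 1: the target already exists in the array.
    for t in exact:
        lo = bisect.bisect_left(nums, t - k)    # first value >= t - k
        hi = bisect.bisect_right(nums, t + k)   # one past last value <= t + k
        reachable = hi - lo - exact[t]          # others that can be shifted to t
        best = max(best, exact[t] + min(reachable, numOperations))
    # Case 2: the target need not exist; any window with max - min <= 2k
    # can be shifted to some common value, one operation per element.
    left = 0
    for right in range(len(nums)):
        while nums[right] - nums[left] > 2 * k:
            left += 1
        best = max(best, min(right - left + 1, numOperations))
    return best

print(max_frequency_sort_window([1, 4, 5], 1, 2))  # 2
```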

Gemini 3 Pro


Performance: Runtime 1273 ms (faster than 9% of submissions), Memory 56.51 MB (less than 6% of submissions)
Strategy used: Gemini 3 Pro implemented an event sweep: store the points where each interval starts and ends, sort them, and track how overlap changes as you move along the number line.
The solution built a relatively large list of events (including extra “checkpoint” style entries) and processed them in sorted order. Creating and sorting these additional records increases both memory use and constant-time overhead, which likely explains the slower runtime despite a solid underlying idea.
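The event-sweep idea in sketch form (again a reconstruction for illustration, not Gemini's actual code); note the extra zero-weight checkpoints that make every existing value a candidate target:

```python
from collections import Counter

def max_frequency_event_sweep(nums, k, numOperations):
    exact = Counter(nums)
    events = Counter()
    for x in nums:
        events[x - k] += 1       # interval [x-k, x+k] starts covering targets here
        events[x + k + 1] -= 1   # ...and stops covering just past x+k
        events[x] += 0           # checkpoint: existing values must be evaluated too
    best = cover = 0
    for t in sorted(events):     # sweep only where coverage can change
        cover += events[t]
        free = exact[t]          # values already equal to t need no operation
        best = max(best, free + min(cover - free, numOperations))
    return best

print(max_frequency_event_sweep([1, 4, 5], 1, 2))  # 2
```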

Grok 4.1 Expert


Performance: Runtime 648 ms (faster than 31% of submissions), Memory 34.76 MB (less than 26% of submissions)
Strategy used: Grok uses a two-step hybrid. It first employs a strategy similar to Opus's to find the largest cluster of values close enough that they could all be shifted to one common target. This addresses the case where the target is a non-existing value (so each adjustment requires an operation). Then, to cover the case where the target is an existing value, it scans through each value v that already appears in the array: using binary search, it counts how many numbers fall within [v−k, v+k] (so they could become v) and combines that with how many v's are already there. The strategy is correct and fairly efficient, but the sorting and per-value binary searches add overhead compared to the range-marking sweep.

Task 2: Total Waviness of Numbers in Range II (Hard)

Performance Results

Model            | Runtime | Faster than        | Memory   | Less than          | Success
Gemini 3 Pro     | 403 ms  | 82% of submissions | 18.71 MB | 82% of submissions | ✅ Yes
Opus 4.5         | 706 ms  | 71% of submissions | 21.34 MB | 64% of submissions | ✅ Yes
GPT 5.2 Thinking | 723 ms  | 69% of submissions | 21.11 MB | 67% of submissions | ✅ Yes
Grok 4.1 Expert  | 891 ms  | 31% of submissions | 20.79 MB | 68% of submissions | ✅ Yes

Strategic Approach:

This problem needs digit dynamic programming (“digit DP”), which is a way to count properties of numbers in a huge range without looping through every number. Since the range can contain trillions of values, checking each one directly would be far too slow. Instead, digit DP “constructs” numbers one digit at a time (from left to right), and keeps track of just enough information about the digits seen so far to know how much waviness they will produce.

A useful trick is to avoid handling the range [num1, num2] all at once. You compute:

  • total waviness for all numbers from 0 to num2, and
  • total waviness for all numbers from 0 to num1−1.

Subtracting the second from the first gives the total waviness in [num1, num2].
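For small ranges this identity is easy to sanity-check by brute force (illustration only; the `waviness` helper here simply re-applies the peak/valley definition):

```python
def waviness(x):
    """Count peaks and valleys among the interior digits of x."""
    s = str(x)
    return sum(1 for i in range(1, len(s) - 1)
               if s[i-1] < s[i] > s[i+1] or s[i-1] > s[i] < s[i+1])

def prefix_total(n):
    """Total waviness of 0..n, computed by brute force."""
    return sum(waviness(x) for x in range(n + 1))

num1, num2 = 120, 130
direct = sum(waviness(x) for x in range(num1, num2 + 1))
via_prefix = prefix_total(num2) - prefix_total(num1 - 1)
print(direct, via_prefix)  # 3 3
```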

All four models recognized that this kind of digit-by-digit counting is required here, but they differed in what “state” they tracked (what information they remembered while building the number), which affects speed and memory.

GPT 5.2 Thinking


Performance: Runtime 723 ms (faster than 69% of submissions), Memory 21.11 MB (less than 67% of submissions)
Strategy used: GPT 5.2 counts waviness without listing every number in the range. It builds possible numbers digit by digit, and it keeps track of the last two real digits it has placed so far. This is useful because you can only decide whether a digit is a peak or valley once you know its left neighbor and its right neighbor. So when the algorithm places a new digit, it can immediately check whether the previous digit just became a peak or a valley.
The code also carefully handles the beginning of numbers so it doesn’t accidentally count peaks/valleys before the number has at least three real digits. To stay fast, it reuses results whenever it reaches the same situation again (same position in the number and same recent digits), instead of recalculating from scratch.
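A compact sketch of this kind of digit DP, with the last two real digits as state and the sentinel 10 meaning "no digit yet" (our own reconstruction of the approach, not GPT 5.2's actual submission):

```python
from functools import lru_cache

def total_waviness(num1, num2):
    def prefix(N):                      # total waviness over 0..N
        if N < 0:
            return 0
        digits = list(map(int, str(N)))
        n = len(digits)

        @lru_cache(maxsize=None)
        def go(pos, p2, p1, started, tight):
            # Returns (how many numbers, their total waviness) from this state.
            if pos == n:
                return 1, 0
            limit = digits[pos] if tight else 9
            cnt = wav = 0
            for d in range(limit + 1):
                if not started and d == 0:          # still in leading zeros
                    c, w = go(pos + 1, 10, 10, False, tight and d == limit)
                else:
                    peak_or_valley = (p2 < 10 and p1 < 10 and
                                      (p2 < p1 > d or p2 > p1 < d))
                    c, w = go(pos + 1, p1, d, True, tight and d == limit)
                    if peak_or_valley:              # p1 just became a peak/valley
                        w += c                      # ...once per number completed below
                cnt += c
                wav += w
            return cnt, wav

        return go(0, 10, 10, False, True)[1]

    return prefix(num2) - prefix(num1 - 1)

print(total_waviness(120, 130))  # 3
```

The memoization (`lru_cache`) is what makes this fast: once the prefix stops being "tight" against N, only the position and the last two digits matter, so the same sub-results are reused many times.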

Opus 4.5


Performance: Runtime 706 ms (faster than 71% of submissions), Memory 21.34 MB (less than 64% of submissions)
Strategy used: Opus 4.5 uses almost the same core idea as GPT 5.2. It builds numbers digit by digit, keeps track of whether the number has actually started (so leading zeros don’t confuse the logic), and remembers the last two real digits. That way, when a new digit is added, it can score peaks/valleys as soon as enough digits are available.
The implementation keeps a clear distinction between “still in the leading zeros” and “real digits have begun.” Like GPT 5.2, it avoids repeating the same work by storing and reusing intermediate results for repeated situations, which is why the speed and memory are similar.

Gemini 3 Pro


Performance: Runtime 403 ms (faster than 82% of submissions), Memory 18.71 MB (less than 82% of submissions)
Strategy used: Gemini 3 Pro uses a simpler way to remember the past. Instead of keeping the last two digits, it keeps the previous digit and a small piece of information about the direction the number is moving: are the digits going up, going down, or staying the same? This works because a peak or valley is basically a “turn”:

  • up then down → peak
  • down then up → valley

Because it stores less information about the past, it has fewer “situations” to keep track of overall. That reduces overhead and helps explain why it runs faster and uses less memory than approaches that always store two previous digits.
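The turn-based view of peaks and valleys can be demonstrated on single numbers (an illustration of the state reduction, not Gemini's code). One subtlety: a flat step must reset the direction, so runs of equal digits never produce a false peak:

```python
def waviness_by_turns(x):
    """Count peaks/valleys by watching the direction of consecutive digit steps."""
    s = str(x)
    w = 0
    last = 0                                  # 1 = rising, -1 = falling, 0 = flat/none
    for i in range(1, len(s)):
        step = (s[i] > s[i-1]) - (s[i] < s[i-1])
        if last == 1 and step == -1:          # up then down: previous digit was a peak
            w += 1
        elif last == -1 and step == 1:        # down then up: previous digit was a valley
            w += 1
        last = step                           # an equal-digit step resets the direction
    return w

print(waviness_by_turns(121), waviness_by_turns(1221))  # 1 0
```

In a digit DP, this lets the state be (previous digit, direction) instead of (two previous digits), which is the smaller state space described above.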

Grok 4.1 Expert


Performance: Runtime 891 ms (faster than 31% of submissions), Memory 20.79 MB (less than 68% of submissions)
Strategy used: Grok follows a strategy similar to GPT 5.2's and Opus 4.5's: it remembers the last two digits while building numbers so it can check peaks and valleys exactly as the definition states—by comparing a digit to both neighbors once the next digit is chosen. The performance gap comes down to implementation efficiency: Grok tracks slightly more state than the other two models, which explains its poorer runtime.

Recommendations

This benchmark demonstrates that all four tested models successfully solved both algorithmic challenges on their first attempt—an encouraging sign of AI coding maturity. However, their solution strategies and resulting performance varied in instructive ways.

GPT 5.2 Thinking consistently delivered strong performance, particularly excelling on the medium-difficulty problem with its efficient range-marking approach. That approach, however, was risky and could have led to costly or even infeasible runs on alternative test cases. Gemini 3 Pro demonstrated the ability to find notably optimized solutions for complex problems, achieving top-tier performance on the hard digit DP challenge through its slope-based state reduction. Opus 4.5 provided reliable, well-performing solutions across both problems, making it a solid general-purpose choice. While these solutions all passed LeetCode's tests, GPT 5.2's solution in particular highlights that users should always validate AI-generated code against their specific requirements and constraints before production use.


The authors used GPT 5.2 Thinking [OpenAI (2025) GPT 5.2 Thinking, Large language model (LLM), available at: https://openai.com], Opus 4.5 [Anthropic (2025) Claude Opus 4.5, Large language model (LLM), available at: https://claude.ai], Gemini 3 Pro [Google (2025) Gemini 3 Pro, Large language model (LLM), available at: https://gemini.google.com], and Grok 4.1 Expert [xAI (2025) Grok 4.1 Expert, Large language model (LLM), available at: https://x.ai/grok] to generate the outputs.