Prompt-Driven Video Analysis of Animal Behaviour Using Gemini 2.5 Pro in Google AI Studio and via API

Source: Getty Images For Unsplash+

Can state-of-the-art multimodal models analyse animal behaviour directly from video footage? In this study, we tested Google’s Gemini models (Gemini 2.5 Pro Preview in Google AI Studio and Gemini 1.5 Pro via the API) to assess whether they can produce structured ethological descriptions from short animal videos alone. By applying a consistent prompt across multiple clips, we explored their potential for automating video-based behavioural analysis in ethological and other scientific research contexts.

Input files

The input data consist of 18 short video clips, each roughly 10 seconds long, which were automatically segmented from longer animal-related YouTube videos using a custom script. Each clip centres on a clearly observable behaviour performed by a single focal animal. However, the clips vary in quality and often show other animals as well, which makes the task intentionally more challenging. This setup was designed to test the limits of large multimodal models on real-world footage of varying visual clarity and complexity.
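The segmentation script itself is not included in the article. The sketch below is only an illustration of how such clips could be cut, assuming fixed 10-second windows and the `ffmpeg` command-line tool; `fixed_windows`, `ffmpeg_cut_command`, and `cut_video` are hypothetical names, and the authors’ script may instead have placed clip boundaries around specific behaviours.

```python
import subprocess
from pathlib import Path


def fixed_windows(duration_s: float, clip_len_s: float = 10.0,
                  min_len_s: float = 5.0) -> list[tuple[float, float]]:
    """Split [0, duration_s) into consecutive windows of clip_len_s seconds.

    A trailing remainder shorter than min_len_s is dropped, since very short
    fragments rarely contain a complete behaviour (an assumption of this sketch).
    """
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + clip_len_s, duration_s)
        if end - start >= min_len_s:
            windows.append((start, end))
        start += clip_len_s
    return windows


def ffmpeg_cut_command(src: Path, start: float, end: float, dst: Path) -> list[str]:
    """Build an ffmpeg command that re-encodes the [start, end) segment of src."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.2f}",        # seek to the segment start
        "-i", str(src),
        "-t", f"{end - start:.2f}",   # segment duration in seconds
        "-c:v", "libx264", "-an",     # re-encode video, drop the audio track
        str(dst),
    ]


def cut_video(src: Path, duration_s: float, out_dir: Path) -> list[Path]:
    """Cut src into fixed-length clips in out_dir (requires ffmpeg on PATH)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    clips = []
    for i, (start, end) in enumerate(fixed_windows(duration_s), start=1):
        dst = out_dir / f"{src.stem}_clip{i:02d}.mp4"
        subprocess.run(ffmpeg_cut_command(src, start, end, dst), check=True)
        clips.append(dst)
    return clips
```

The source duration could be read with `ffprobe`; it is passed in here to keep the sketch short. Dropping the audio track reflects the prompt’s focus on visible cues, but it is a choice of this sketch, not something the study reports.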

The full set of video clips is available in the following ZIP archive:

Prompt

To evaluate how well large multimodal models perform structured behavioural analysis, we formulated a zero-shot prompt that casts the model as an ethologist. The model was asked to watch 18 short video clips and extract key ethological information from visual observation alone. For each clip, it had to give the animal’s common name and scientific genus, followed by a one-sentence behavioural description grounded strictly in observable features such as body posture, movement, or interaction with the environment.

You are an ethologist analysing 18 short wildlife video clips. Each video shows a single animal. Carefully examine the visible behaviour in each clip and provide a structured analysis for each.

For each video (numbered 1 to 18), provide:

  • Video number (e.g. Video 1, Video 2, …)
  • Common name of the species (in English)
  • Scientific genus (in Latin)
  • Behavioural description
     Describe the animal’s behaviour in one concise sentence, using appropriate ethological terms.

Base your behavioural descriptions strictly on visible cues only, such as:

  • Posture and body position
  • Movement and orientation
  • Head direction and focus
  • Use of mouth, limbs, tail, or other body parts

Do not infer emotions, intentions, or hidden context. Only describe what is clearly and directly visible in the footage.
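The prompt fixes the four fields per video but not the exact layout of the reply. To turn a reply into table rows, a small parser can help. The sketch below assumes, hypothetically, that each entry is written as a "Video N" line followed by labelled "Common name:", "Genus:" and "Behaviour:" lines (optionally bulleted); replies in other layouts, or with markdown bold around the labels, would need a different pattern. `ClipRecord` and `parse_reply` are illustrative names.

```python
import re
from dataclasses import dataclass


@dataclass
class ClipRecord:
    video: int
    common_name: str
    genus: str
    behaviour: str


# Hypothetical reply layout: a "Video N" header line, then three labelled lines.
_ENTRY = re.compile(
    r"Video\s+(\d+)\s*:?\s*\n"
    r"\s*(?:[-*•]\s*)?Common name:\s*(.+?)\s*\n"
    r"\s*(?:[-*•]\s*)?(?:Scientific )?genus:\s*(.+?)\s*\n"
    r"\s*(?:[-*•]\s*)?Behaviou?r(?:al description)?:\s*(.+?)\s*(?:\n|$)",
    re.IGNORECASE,
)


def parse_reply(text: str) -> list[ClipRecord]:
    """Extract one ClipRecord per 'Video N' entry, sorted by video number."""
    records = [
        ClipRecord(int(n), name, genus, behaviour)
        for n, name, genus, behaviour in _ENTRY.findall(text)
    ]
    return sorted(records, key=lambda r: r.video)
```

Checking that the parser returns exactly 18 records is a cheap way to notice when the model skipped or merged a clip.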

Running the Prompt in Google AI Studio (Gemini 2.5 Pro Preview)

Gemini 2.5 Pro Preview's performance (accessed on 3 June 2025)

We tested the task in Google AI Studio using Gemini 2.5 Pro Preview, which allowed us to upload and analyse all 18 video clips simultaneously. The interface handled the batch input smoothly and returned a structured output for each video, based on our ethology-focused prompt. The model responded with a numbered list that included species labels and behaviour descriptions derived solely from the visible content of each clip. In many cases, Gemini provided surprisingly relevant and accurate behavioural summaries.

Below, we highlight two example results alongside the corresponding videos. In the first, the model showed strong ethological accuracy: it correctly identified the genus Ursus and described the polar bears’ posture and spatial arrangement in precise observational detail. In the second, it performed equally well, recognising the genus Loxodonta and giving a clear behavioural account of grazing, trunk use, and ear-flapping, all based strictly on visible cues and without speculative interpretation.

Gemini 2.5 Pro Preview's performance (accessed on 3 June 2025)

Automated Behaviour Analysis via Gemini 1.5 Pro API

In the API-based version of the task, we used Gemini 1.5 Pro to process the same 18 animal behaviour clips. The prompt closely mirrored the one tested in Google AI Studio: the model was instructed to identify the species and genus visible in each video and to summarise the observed behaviour using ethological terms. The responses were returned in a structured format with one row per clip and saved automatically to an Excel file, allowing efficient comparison, documentation, and further analysis.
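The authors’ script is not reproduced in this text. As a rough illustration of the same pipeline, the sketch below assumes the `google-generativeai` Python SDK with its File API, an API key in the `GOOGLE_API_KEY` environment variable, a folder of `.mp4` clips, and a single request containing all clips (as in AI Studio). The helper names (`natural_key`, `interleave_labels`, `wait_until_active`, `analyse_clips`, `export_rows`) are hypothetical, and the Excel step assumes pandas with `openpyxl` installed.

```python
import os
import re
import time
from pathlib import Path

# Paste the full ethologist prompt from the "Prompt" section here.
PROMPT = "You are an ethologist analysing 18 short wildlife video clips. ..."


def natural_key(path: Path) -> list:
    """Sort key that orders clip_2.mp4 before clip_10.mp4."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", path.name)]


def interleave_labels(files: list) -> list:
    """Precede each uploaded clip with a 'Video N:' label so numbering is explicit."""
    contents = []
    for number, f in enumerate(files, start=1):
        contents.extend([f"Video {number}:", f])
    return contents


def wait_until_active(genai, uploaded, poll_s: float = 5.0):
    """Poll the File API until an uploaded video has finished processing."""
    while uploaded.state.name == "PROCESSING":
        time.sleep(poll_s)
        uploaded = genai.get_file(uploaded.name)
    if uploaded.state.name != "ACTIVE":
        raise RuntimeError(f"{uploaded.name} ended in state {uploaded.state.name}")
    return uploaded


def analyse_clips(clip_dir: Path, model_name: str = "gemini-1.5-pro") -> str:
    """Upload all .mp4 clips in numeric order and send them with PROMPT in one request."""
    # Imported lazily so the helpers above work without the SDK installed
    # (pip install google-generativeai).
    import google.generativeai as genai

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    clips = sorted(clip_dir.glob("*.mp4"), key=natural_key)
    uploaded = [wait_until_active(genai, genai.upload_file(path=str(p))) for p in clips]
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(interleave_labels(uploaded) + [PROMPT],
                                      request_options={"timeout": 600})
    return response.text


def export_rows(rows: list[dict], xlsx_path: Path) -> None:
    """Write one row per clip to an Excel sheet (pandas needs openpyxl for .xlsx)."""
    import pandas as pd

    pd.DataFrame(rows).to_excel(xlsx_path, index=False)
```

For example, `analyse_clips(Path("clips"))` would return the model’s numbered reply, which could then be split into one dictionary per clip and written with `export_rows(rows, Path("results.xlsx"))`. Numeric sorting and explicit "Video N:" labels keep the clip order unambiguous; sending one request per clip is an alternative that keeps each answer independent at the cost of more calls.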

The full script used for the API-based processing — including video loading, prompt definition, Gemini 1.5 Pro calls, and Excel export — is available below:

The final output was highly accurate across all 18 clips: the model successfully identified each animal and provided concise, appropriate behavioural descriptions using ethological terms. Species recognition and genus assignment were correct in all cases, and the observed actions—whether affiliative, agonistic, foraging, or resting—were clearly and consistently described. This demonstrates the Gemini 1.5 Pro model’s strong ability to analyse animal behaviour in short video segments via API interaction.
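The article does not describe how the outputs were checked against the true identities of the animals. One way to make such a check reproducible is to compare predicted genera with a hand-labelled reference keyed by clip number. The sketch below assumes that setup; `genus_accuracy` is a hypothetical helper, and it counts a missing prediction as an error.

```python
def genus_accuracy(predicted: dict[int, str],
                   reference: dict[int, str]) -> tuple[float, list[int]]:
    """Return the share of reference clips whose predicted genus matches,
    plus the clip numbers that disagree (missing predictions count as errors).

    Comparison ignores case and surrounding whitespace.
    """
    mismatches = [
        clip for clip, genus in sorted(reference.items())
        if predicted.get(clip, "").strip().lower() != genus.strip().lower()
    ]
    accuracy = 1 - len(mismatches) / len(reference)
    return accuracy, mismatches
```

Listing the mismatching clip numbers, rather than only a score, makes it easy to go back to the footage and see whether an error stems from poor visibility or from several animals being in frame.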

Recommendations

These results suggest that Gemini models, both Gemini 2.5 Pro Preview in AI Studio and Gemini 1.5 Pro accessed via the API, can produce accurate and structured ethological descriptions from short video clips. Their ability to identify animal species and generate behaviourally precise summaries highlights their potential value for research in ethology and animal behaviour. We used Google AI Studio for Gemini 2.5 Pro because, at the time of testing, the Gemini chat interface did not yet support file uploads, which limited its usefulness for general users. Overall, the results indicate that large multimodal models can meaningfully support the automated analysis of animal behaviour in video-based research.

The authors used Gemini 2.5 Pro Preview [Google DeepMind (2025) Gemini 2.5 Pro Preview (accessed on 3 June 2025), Large language model (LLM), available at: https://deepmind.google/technologies/gemini/] to generate the output.