Argania Detection from Sentinel-2 Spectral Data: DeepSeek-V3 Excels with Prompt-Based Labelling of Structured Data

Source: Unsplash - mana5280

How far can today’s large language models go in scientific data analysis without bespoke coding or deep learning pipelines? In this experiment, we explore the ability of DeepSeek-V3 to perform pixel-level detection of Argania trees (i.e., binary classification for each pixel) using only tabular Sentinel-2 spectral data. By providing the model with 100 labelled examples and 25 unlabelled test cases (an 80-20 train–test split), we show that DeepSeek-V3 achieves 100% pixel-level classification accuracy within seconds and without any traditional model training. This approach opens the door to rapid, reproducible remote sensing tasks using nothing but natural language instructions and spreadsheet data.

Input files

The data used in this experiment are drawn from a publicly available Kaggle dataset focused on Argania tree detection using Sentinel-2 spectral bands. For demonstration purposes, we selected a subset of 100 labelled examples as our training set (isArgania_train.xlsx) and 25 unlabelled samples as our test set (isArgania_test.xlsx), both in Excel format.

In this experiment, “detection” refers to a binary classification task: for each Sentinel-2 pixel, the goal is to determine whether it contains an Argania tree (isArgania = 1) or not (isArgania = 0). Each row in both files represents a single Sentinel-2 pixel, with one column per spectral band (B1–B12, including B8A, i.e. 13 bands in total); the training file additionally carries the target column "isArgania" (1 for presence, 0 for absence), which the test file omits.
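For readers who want to inspect the spreadsheets before prompting, a minimal Python sketch (assuming pandas with openpyxl installed, and the file names and column layout described here) loads both files and checks their shape, column order, and class balance:

```python
import pandas as pd

# Load the labelled training pixels and the unlabelled test pixels.
# File names follow the article; openpyxl is required for .xlsx reading.
train = pd.read_excel("isArgania_train.xlsx")
test = pd.read_excel("isArgania_test.xlsx")

# Quick sanity checks on shape, column layout, and class balance.
print(train.shape)                        # expected: (100, 14) -> 13 bands + target
print(test.shape)                         # expected: (25, 13)  -> 13 bands, no target
print(list(train.columns))                # B1, B10, B11, B12, B2, ..., B8A, B9, isArgania
print(train["isArgania"].value_counts())  # distribution of presence (1) vs absence (0)
```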

Prompt

The DeepSeek-V3 model was provided with a structured prompt that included 100 labelled training examples (tabular Sentinel-2 spectral data) and a further 25 unlabelled test cases. The prompt instructed the model to learn exclusively from the supplied examples—identifying patterns and thresholds in the band values associated with the presence or absence of Argania trees. The model was then required to predict the correct label (isArgania) for each test sample, inserting the result as a new column in the output table. Crucially, the model was not permitted to use any external knowledge or general reasoning, but had to rely strictly on the logic present in the 100 labelled cases. This approach allowed us to evaluate the model’s ability to perform “in-prompt” learning and generalisation on structured, real-world data.

You have received two attached Excel spreadsheets:

  • isArgania_train.xlsx — This file contains 100 labelled examples for supervised learning. Each row represents a Sentinel-2 pixel, with columns for the spectral bands (B1, B10, B11, B12, B2, B3, B4, B5, B6, B7, B8, B8A, B9) and the target column ("isArgania", with values 1 or 0).
  • isArgania_test.xlsx — This file contains 25 unlabelled rows in the exact same format, but without the "isArgania" column.

Your task:

  • Study the 100 labelled examples in isArgania_train.xlsx to learn how the values of the spectral bands relate to the presence of Argania trees (isArgania = 1) or their absence (isArgania = 0).
  • Rely strictly and exclusively on the statistical relationships, boundaries, and patterns you observe in the training data.
  • Do not use any outside knowledge or assumptions—use only the information from the 100 labelled rows.
  • For each row in isArgania_test.xlsx, predict whether "isArgania" should be 1 or 0.
  • Write your prediction in a new final column labelled "isArgania" at the end of each row, so the output is a complete table with all original columns plus your prediction.

Output only the completed table (with all columns, including your predictions) for the 25 test rows.

No explanations, comments, lists, or changes in row order—only the tabular output with the new column.

Excel input format summary:

  • Both files are in standard Excel (.xlsx) format with column headers.
  • Column order is: B1, B10, B11, B12, B2, B3, B4, B5, B6, B7, B8, B8A, B9 — and, in the training file, the final column is "isArgania".
  • The test file does not have the "isArgania" column; you must add it with your predicted value for each row.

Your output:
Return the test table with all original columns, and your predicted "isArgania" in the last column for each row.
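In our experiment the two spreadsheets were attached directly in the chat interface. For a scripted, reproducible variant, the sketch below shows one possible way to embed both tables in a single prompt and send it through DeepSeek’s OpenAI-compatible API; the condensed instructions, the model name deepseek-chat, and the placeholder API key are illustrative assumptions rather than the exact workflow used here.

```python
import pandas as pd
from openai import OpenAI  # DeepSeek exposes an OpenAI-compatible endpoint

train = pd.read_excel("isArgania_train.xlsx")
test = pd.read_excel("isArgania_test.xlsx")

# Serialize both tables as CSV text so the tabular data travels inside the prompt.
# The instructions below are a condensed paraphrase of the prompt shown above.
prompt = (
    "Study the 100 labelled examples below and learn how the spectral bands "
    "relate to isArgania (1 = Argania present, 0 = absent). Use only the "
    "patterns in these examples. Then predict isArgania for each test row and "
    "return the full test table with the new column appended. Output only the table.\n\n"
    "TRAINING DATA (CSV):\n" + train.to_csv(index=False) +
    "\nTEST DATA (CSV):\n" + test.to_csv(index=False)
)

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")
response = client.chat.completions.create(
    model="deepseek-chat",          # DeepSeek-V3 chat model (check current model names)
    messages=[{"role": "user", "content": prompt}],
    temperature=0,                  # deterministic output is preferable for labelling
)
print(response.choices[0].message.content)  # the completed 25-row table
```

Setting the temperature to 0 keeps the labelling deterministic, which makes repeated runs directly comparable.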

Output

The model classified every case correctly, achieving 100% accuracy on the test set. Each prediction in the isArgania column matched the ground truth exactly, with no errors or discrepancies. This level of performance is notable given the modest size of the training sample—just 100 labelled examples—and the fact that DeepSeek-V3 relied entirely on in-prompt numerical data, without any external training or fine-tuning. Completing the task in seconds, using nothing but prompt engineering, underscores the practical potential of these models for scientific and applied data analysis.

DeepSeek-V3's performance (accessed on 29 May 2025)
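To verify such a result independently, the model’s returned table can be parsed and compared against the held-back labels. The sketch below assumes the predictions and the ground truth have been saved to two hypothetical files, isArgania_pred.xlsx and isArgania_truth.xlsx, with rows in the original order:

```python
import pandas as pd

# Hypothetical file names: the model's completed table and the held-back labels.
pred = pd.read_excel("isArgania_pred.xlsx")["isArgania"].astype(int)
truth = pd.read_excel("isArgania_truth.xlsx")["isArgania"].astype(int)

# Overall agreement between predictions and ground truth.
accuracy = (pred.values == truth.values).mean()
print(f"Accuracy on the 25 test pixels: {accuracy:.0%}")  # 100% in this experiment

# Per-row check to surface any mismatches explicitly.
mismatches = (pred.values != truth.values).nonzero()[0]
print("Misclassified rows:", mismatches.tolist())  # empty list -> perfect agreement
```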

Recommendations

These findings indicate that prompt-based classification using large language models like DeepSeek-V3 can achieve highly accurate results on well-structured scientific data, provided the prompt is clear and the training examples are representative. For researchers, this approach offers a fast and code-free alternative for specific, rule-based tasks, as long as sufficient labelled data are available. However, it is crucial to recognise that the model’s performance is entirely dependent on the quality and coverage of the examples supplied, and it may not generalise to more complex or ambiguous cases. We recommend careful validation on held-out data and transparent reporting of accuracy to ensure reliable conclusions.
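One practical way to follow this recommendation is to benchmark the prompt-based labels against a simple classical baseline trained on the same 100 rows. The sketch below (a suggestion, not part of the original experiment) cross-validates a shallow decision tree with scikit-learn to check whether the classes are separable by a few band thresholds:

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

train = pd.read_excel("isArgania_train.xlsx")
X = train.drop(columns=["isArgania"])
y = train["isArgania"]

# 5-fold cross-validation of a shallow decision tree on the 100 labelled pixels.
# If a depth-2 tree already scores near 100%, the classes are separable by a few
# band thresholds, which helps explain why in-prompt labelling can succeed.
scores = cross_val_score(DecisionTreeClassifier(max_depth=2, random_state=0), X, y, cv=5)
print("Cross-validated accuracy:", scores.mean().round(3))
```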

The authors used DeepSeek-V3 [DeepSeek (2025) DeepSeek-V3 (accessed on 29 May 2025), Large language model (LLM), available at: https://www.deepseek.com] to generate the output.