Classifying gender based on names might initially appear straightforward, yet AI models vary significantly in how accurately and effectively they handle this task. This blog post compares the performance of several models given a structured prompt for name-based gender classification on an uploaded .xlsx file. The evaluation considers both classification accuracy and the usability of the generated output, and it reveals notable differences among the tested models.
Input File
The Input File Metadata & Structure
The .xlsx file used for testing contained 682 rows, each with a unique name, and had the following structure:
- Column A: Names. This column contained full names formatted in various ways, including:
  - "Last Name, First Name" format (e.g., "Ziegler, Jeffrey").
  - "Initials Last Name, First Name" format (e.g., "B. Gruber, Johannes").
  - No-comma format, where the leading word(s) are assumed to be the first name (e.g., "Salsabil M. Abdalbaki").
  - Multi-name structures where all given names should be used for classification (e.g., "Bartels, Jan-Eric").
- Output Column Handling:
  - No predefined output column was included in the original file.
  - The AI models were expected to generate an additional column for classification.
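The name formats listed above can also be handled deterministically before any model is involved. The following is a minimal sketch, not part of the original experiment: the helper name `extract_first_name` is hypothetical, and treating the last word of a comma-free name as the surname is an assumption about how "first word(s)" should be read.

```python
def extract_first_name(full_name: str) -> str:
    """Return the given-name part of a name string (illustrative helper)."""
    name = full_name.strip()
    if "," in name:
        # "Last Name, First Name": keep everything after the first comma.
        return name.split(",", 1)[1].strip()
    # No comma. Assumption: the last word is the surname and all
    # preceding words (including initials) are given names.
    parts = name.split()
    return " ".join(parts[:-1]) if len(parts) > 1 else name


if __name__ == "__main__":
    for example in ["Ziegler, Jeffrey", "B. Gruber, Johannes",
                    "Salsabil M. Abdalbaki", "Bartels, Jan-Eric"]:
        print(example, "->", extract_first_name(example))
```

Hyphenated and multi-part given names pass through unchanged, so the whole given name remains available for classification.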
Prompt
Gender Classification Task
You are a highly accurate AI specialising in name-based gender classification. Your task is to analyse the names in Column A of the uploaded .xlsx file and classify each name strictly as either 'Male' or 'Female', ensuring that the results are placed in Column B, while keeping the original file structure intact.
Instructions:
- Strictly assign only 'Male' or 'Female'.
  - Do not return 'Unknown', 'Uncertain', or any alternative labels.
- For each row, refresh these instructions before classification.
  - Do not rely on previous rows for decision-making.
  - Every row is an independent classification.
- Extract the first name correctly:
  - If the name is formatted as "Last Name, First Name", extract the first name from the part after the comma.
    - Example: "Ziegler, Jeffrey" → first name = Jeffrey.
    - Example: "B. Gruber, Johannes" → first name = Johannes.
  - If there is no comma, assume the first word(s) is the first name.
    - Example: "Salsabil M. Abdalbaki" → first name = Salsabil M.
- If multiple first names are present, use all first names for classification.
  - Example: "Bartels, Jan-Eric" → use Jan-Eric as a full first name.
  - Example: "Baumann, Daniel Matthias" → use Daniel Matthias as the full first name.
- Use only well-known name patterns to determine gender:
  - Classify names according to widely recognised gender associations.
- Do not default unknown names to 'Male'.
  - Instead, make an informed classification based on name recognition.
- Ensure that the gender classification is stored in Column B:
  - Keep Column A unchanged.
  - Place the corresponding gender classification in Column B.
  - The output .xlsx file must retain the same structure as the input file.
- Before classifying each name, reapply these instructions to ensure accuracy.
Examples:
Male:
- Ziegler, Jeffrey
- Cho, Moohyung
- Ono, Yoshikuni
- Tanaka, Hiroshi
- Kim, Jisoo
- Watanabe, Takashi
- Nguyen, Minh
- Bauer, Wolfgang
- Singh, Arjun
- Angiolillo, Fabio
Female:
- Yamamoto, Sakura
- Aboud, Nihad
- Park, Sooyoung
- Amelotti, Luiza Vilela
- Chen, Mei
- Ana Hatizr, Noa
- Al-Gaddooa, Denise
- Martinez, Gabriela
- Ivanova, Anastasia
- Schmidt, Laura
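The prompt's requirement that every row receive exactly 'Male' or 'Female' can be checked mechanically once a model's output is available as (name, label) pairs. The sketch below is one possible check, not something used in the original test; `find_invalid_labels` is a hypothetical helper, and the 'Uncertain' label in the sample data is deliberately invalid for illustration.

```python
ALLOWED_LABELS = {"Male", "Female"}


def find_invalid_labels(rows):
    """Return (row_number, name, label) for every label outside ALLOWED_LABELS."""
    problems = []
    for row_number, (name, label) in enumerate(rows, start=1):
        # Exact match only: 'male', 'Unknown', '' and similar are all flagged.
        if label not in ALLOWED_LABELS:
            problems.append((row_number, name, label))
    return problems


if __name__ == "__main__":
    sample_rows = [
        ("Ziegler, Jeffrey", "Male"),
        ("Yamamoto, Sakura", "Female"),
        ("Chen, Mei", "Uncertain"),  # deliberately invalid label for illustration
    ]
    for problem in find_invalid_labels(sample_rows):
        print(problem)
```

An empty result means the output respects the label constraint; it says nothing about whether the labels themselves are correct.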
Performance Comparison
Model | Performance | Output Issues | Overall Assessment |
---|---|---|---|
GPT-4.5, GPT-4o, GPT-4 | Inconsistent; some classifications appeared random. | Hit syntax errors several times before eventually generating the file. | Unreliable: the output contained 'Uncertain' labels despite explicit prompt instructions, and some labels appeared random. |
Grok-3 | Accurate classification. (The results were manually validated on a randomly selected sample of 25 entries.) | Did not generate the requested file. The text output was copyable, but its columns were not properly separated when pasted into Excel. | Accurate classification, but no file output, and the copied text did not separate into columns in Excel. |
DeepSeek | Accurate and consistent across multiple runs. (The results were manually validated on a randomly selected sample of 25 entries.) | Struggled to output an .xlsx file directly, but copying to Excel worked well. | Reliable classification; manual intervention required for file output. |
Claude 3.7 Sonnet, Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus | All versions failed to generate usable output. | Every run failed with the error "window.fs.writeFile is not a function". | Not functional for this task. |
Qwen 2.5 Plus | Accurate classification. (The results were manually validated on a randomly selected sample of 25 entries.) | Struggled to output an .xlsx file directly, but copying to Excel worked well. | Reliable classification; manual intervention required for file output. |
Mistral | Delivered a correct classification within 5 seconds. | Only model that correctly generated a downloadable .xlsx file. | Most effective, as it was the only model that successfully generated the requested output file. |
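Several rows in the table rely on manual validation of a randomly selected sample of 25 entries. How that sample was drawn is not described, so the following is only a sketch of one reproducible way to do it. The function name `draw_validation_sample` and the fixed seed are assumptions, and the generated rows are placeholders rather than data from the test file.

```python
import random


def draw_validation_sample(rows, sample_size=25, seed=0):
    """Pick sample_size rows without replacement, reproducibly for a given seed."""
    if sample_size >= len(rows):
        # Fewer rows than requested: validate everything.
        return list(rows)
    rng = random.Random(seed)  # local generator, leaves global random state alone
    return rng.sample(rows, sample_size)


if __name__ == "__main__":
    # Placeholder rows standing in for the 682 classified names.
    classified = [(f"placeholder-{i}", "Female") for i in range(682)]
    for name, label in draw_validation_sample(classified):
        print(name, label)
```

Fixing the seed means the same 25 rows can be re-checked later or compared across models.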
Output File
The final output file contains the classification results generated by Mistral (https://chat.mistral.ai/), as it was the only model that successfully produced an .xlsx file with correctly formatted output.
Recommendation
For name-based gender classification with .xlsx output generation, Mistral emerged as the most effective model so far. It was the only model that both classified names accurately and directly generated the requested .xlsx output file without issues. While Grok-3, DeepSeek, and Qwen 2.5 Plus achieved strong classification accuracy, they struggled with file output generation, which limits their immediate practical applicability. The GPT models exhibited inconsistent performance, and the Claude models encountered critical errors, making both families unreliable for this particular task.
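For models that return accurate labels only as plain text, a small script can rebuild the two-column structure before the data reaches a spreadsheet. The raw text format those models produced is not shown here, so the sketch below assumes each line ends with the label, optionally preceded by whitespace or a separator character. `text_to_csv` and `LINE_PATTERN` are hypothetical names, and the output is CSV rather than .xlsx so that only the Python standard library is needed.

```python
import csv
import io
import re

# Assumed line shape: "<name><optional separators><Male|Female>".
LINE_PATTERN = re.compile(
    r"^(?P<name>.+?)[\s,;:|\-]*\b(?P<label>Male|Female)\s*$"
)


def text_to_csv(pasted_text: str) -> str:
    """Convert pasted 'name ... label' lines into two-column CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Name", "Gender"])
    for line in pasted_text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            # csv.writer quotes names that contain commas, so
            # "Ziegler, Jeffrey" stays in a single column.
            writer.writerow([match["name"].strip(), match["label"]])
        # Lines without a trailing label (headers, blanks) are skipped.
    return buffer.getvalue()


if __name__ == "__main__":
    pasted = "Ziegler, Jeffrey Male\nYamamoto, Sakura - Female\n"
    print(text_to_csv(pasted))
```

Because the label is matched at the end of the line, commas inside the name do not interfere with the split.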