Data pre-processing is a crucial but often tedious stage in research, particularly when standardising complex CSV datasets. This blog demonstrates how GenAI can systematically structure geographic location data directly from an input CSV file without coding. Our primary objective was to ensure consistent and accurate categorisation, preparing the data for reliable analyses.
Our task involved standardising the 'Location' column from the input CSV by separating entries into 'Country', 'State', 'County_Province', and 'Municipality_City'. We designed a targeted GenAI prompt to effectively handle irregularities, irrelevant information, and missing data.
Input file
The input file contained 1,000 rows of pathogenetic research data featuring a 'Location' column with various inconsistencies, non-standard formatting, and irrelevant information. Locations were presented in multiple formats, often mixing countries, states, cities, counties, and medical or non-geographical entities within the same field.
Prompt
Retrieve all unique values from the 'Location' column, store them in a 'unique_locations' python list, and sort the full list in alphabetical order. Ensure that each unique value appears only once.
We first retrieved all unique values from the 'Location' column to clearly understand the types and variations of data present. This initial step helped us identify specific patterns, irregularities, and special cases, directly informing the examples we later used in our few-shot prompt for accurate data categorisation and standardisation.
Objective:
Your task is to analyse the provided 'unique_locations' list from the CSV file and split the 'Location' column into four new columns: 'Country', 'State', 'County_Province', and 'Municipality_City'. The classification must be accurate, following the guidelines provided, and handle special cases, including missing data, non-standard orders, and irrelevant information. For any missing or invalid data, fill the fields with '9999'. The final output should be a modified table saved as a CSV file, with the new columns positioned directly next to the 'Location' column in the original dataset.Examples (patterns for recognition):
1. Country, State, and County/Province
'USA: Curry County, Oregon' should be split as:
Country: USA
State: Oregon
County_Province: Curry County
Municipality_City: 9999
Note: The values for 'Country', 'State', and 'County_Province' are not always in a consistent order, and not all fields are available in every case. When a value is missing, use '9999' to indicate its absence.
2. Country and State
'USA:WA' should be split as:
Country: USA
State: WA
County_Province: 9999
Municipality_City: 9999
3. Country with irrelevant information (discard irrelevant details)
'Denmark: WHO Reference Center' should be split as:
Country: Denmark
State: 9999
County_Province: 9999
Municipality_City: 9999
4. Medical facility or irrelevant details combined with valid location
'USA: UC Davis Medical Center, Davis, Ca' should be split as:
Country: USA
State: CA
County_Province: 9999
Municipality_City: Davis
5. Non-geographical information
Entries like 'not collected', 'not determined', etc. should be sorted as:
Country: 9999
State: 9999
County_Province: 9999
Municipality_City: 9999
6. Location with irrelevant information
'United Kingdom: None' should be split as:
Country: United Kingdom
State: 9999
County_Province: 9999
Municipality_City: 9999
Instructions:
Pattern Recognition and Flexibility: Use the examples above to develop recognition patterns for different types of location data. These examples should guide how to categorise each element, even if the elements are presented in a non-standard order. Note that variations beyond the provided examples are possible, so flexibility is crucial in identifying and categorising different types of location data accurately. The elements in the 'Location' field may not always follow a consistent order, but it is important to correctly interpret the available data and assign it to the appropriate field.
Reordering Columns: Position the newly created columns right next to the 'Location' column (Column D).
Double-check your work:
General Validation: Ensure that each location entry is split accurately based on the rules and examples provided. Handle missing or irrelevant data by filling the corresponding fields with '9999'.
Special Case Handling: Specifically validate the handling of special cases from the examples provided, ensuring that data patterns, such as reversed orders or non-geographical information, are properly handled and split. Make sure that:
- Country, State, County, and Municipality elements are correctly identified even if in non-standard order.
- Values like 'USA: UC Davis Medical Center', 'WHO Reference Center' etc., properly split, disregarding irrelevant details.
- Non-geographical information such as 'not determined', 'not collected', 'None', etc. is correctly categorised as '9999'.
Column Positioning: Verify that the new columns—'Country', 'State', 'County_Province', and 'Municipality_City'—are inserted directly next to the original 'Location' column, following the required order. Specifically:
'Country' should be in column E.
'State' should be in column F.
'County_Province' should be in column G.
'Municipality_City' should be in column H.
Output: Save the modified table as a CSV file.
Output file
We tested the provided data preprocessing prompt using the GPT-4.5 model, successfully standardising the 'Location' column in a 1,000-row input CSV file according to the specified instructions. The model accurately handled diverse formatting patterns, special cases, and irrelevant details, consistently separating entries into 'Country', 'State', 'County_Province', and 'Municipality_City'. The resulting structured CSV output met all specified requirements and effectively demonstrated the prompt's robustness and flexibility.
The authors used GPT-4.5 [OpenAI (2025), GPT-4.5 (accessed on 25 March 2025), Large language model (LLM), available at: https://openai.com] to generate the output.