Linking Headlines to Article Bodies for Stance Detection: A Structured Pre-processing Workflow Using GPT-4o

Linking Headlines to Article Bodies for Stance Detection: A Structured Pre-processing Workflow Using GPT-4o
Sources: Getty Images For Unsplash+

Working with real-world text data often means dealing with structures that are not yet analysis-ready. In our case, the dataset included headlines and full article texts stored separately, across two different tables. The only link between them was a shared identifier field: Body ID. Before we could begin any further analysis or model development, we needed to merge these two sources — connecting each headline with its corresponding article body. This post outlines how we approached that preprocessing task using GPT-4o, and why such a structured merge is worth doing carefully.

Input files

The dataset we worked with originates from the Fake News Challenge (FNC-1), which was designed to support stance detection tasks — that is, determining the relationship between a short headline and the content of a full news article. Rather than providing ready-made pairs, the organisers released the data in two separate files. One file contains headlines and their stance labels. The other stores the corresponding article texts. These two tables are linked only by a shared field called Body ID, which must be used to match each headline with the appropriate article.

  • train_stances.csv contains 49,972 rows. Each row includes a short Headline, a Body ID indicating which article it refers to, and a Stance label (such as agree, disagree, discuss, or unrelated).
  • train_bodies.csv includes 1,684 article texts, each uniquely identified by a Body ID. These articles are often long-form texts with line breaks and multiple paragraphs.

Prompt

Our goal was to create a single, structured table linking each headline to its full article body and stance label. The challenge was that these elements were stored in separate files, connected only by a shared Body ID. Since many headlines referred to the same article, the merge required precise matching. To ensure clarity and traceability, we also introduced a custom ID column in the format <Body ID>_<N>, where N indicates the order of the headline within each article group.

Task:
You are given two CSV files: train_stances.csv and train_bodies.csv. These must be merged into a single Excel table by linking them via the Body ID column.

Input Files:

  • train_stances.csv: 49,972 rows with the columns:
    – Headline
    – Body ID
    – Stance
  • train_bodies.csv: 1,684 rows with the columns:
    – Body ID (each is unique)
    – articleBody

Objective:
For each row in train_stances.csv, use the Body ID to find the matching article from train_bodies.csv. Merge the matching data into a new row, and assign a unique custom ID for each merged record.

ID Formatting Logic:
Instead of using a single numeric ID, generate a compound identifier in the following format:
<Body ID>_<N>
where:

  • <Body ID> is the ID shared by multiple headlines
  • <N> is a sequential number starting from 1, representing the order of appearance of the headline within that Body ID group

Example:
If three headlines share Body ID = 712, then their IDs must be:

  • 712_1
  • 712_2
  • 712_3

This applies to every group of headlines sharing the same Body ID.

Step-by-step Instructions:

  • For each row in train_stances.csv, extract the Body ID.
  • Find the matching articleBody in train_bodies.csv where Body ID is the same.
  • Keep track of how many rows you have already matched with each Body ID — this is the suffix number.
  • Construct a new row with the following columns:
    • ID (format: <Body ID>_<N>)
    • Body ID
    • Headline
    • articleBody
    • Stance
  • After all rows are processed, sort the final table by Body ID (numerically), then by N.

Output Requirements:

  • File format: Excel, named merged_output.xlsx
  • Number of rows: exactly 49,972
  • ID values must follow the <Body ID>_<N> format and be unique
  • articleBody must be preserved with all line breaks inside a single Excel cell
  • Column order: ID, Body ID, Headline, articleBody, Stance

Output

We used GPT-4o to perform this structured merge based on a detailed prompt, ensuring consistent formatting and correct alignment across nearly 50,000 rows. The resulting table fully met our expectations. Each row correctly links a headline to its corresponding article body and stance label, based on an exact Body ID match. More importantly, the custom ID column — introduced to uniquely identify each headline–body pair — was generated as intended. It follows the required format (BodyID_1, BodyID_2, etc.), preserving both uniqueness and logical grouping.

The article body content is preserved in full, including paragraph breaks, and remains intact within a single cell — a critical requirement for downstream processing. The merged table is now well-structured, analysis-ready, and suitable for tasks such as stance classification, claim verification, or content coherence research.

GPT-4o's performance (accessed on 28 May 2025)

Recommendations

Researchers working with multi-table text datasets should be prepared to invest time in careful preprocessing, especially when elements like headlines and article bodies are stored separately. Merging based on a shared identifier may sound straightforward, but ensuring consistency, traceability, and clean formatting is essential for downstream use — particularly in supervised NLP tasks.Using a structured prompt with a capable model like GPT-4o proved effective for this kind of join operation. While scripting such a merge in code is certainly possible, prompting can be a valuable alternative in cases where logic must be expressed clearly, but flexibility and natural language understanding are also beneficial.

The authors used GPT-4o [OpenAI (2025), GPT-4o (accessed on 28 May 2025), Large language model (LLM), available at: https://openai.com] to generate the output.