Normalising inconsistent, messy, or incomplete data is tedious and time-consuming, but essential. AI can handle the grunt work, but editorial decisions remain with the journalist.
As a visual journalist, I often find stories in data. Sometimes those stories hide in messy datasets: armed group names are spelt five different ways; date formats don't match; location names are inconsistent across sources.
At the Global Investigative Journalism Conference (GIJC25), co-hosted by the Global Investigative Journalism Network and Malaysiakini, a session called "Building an AI Assistant for Investigative Journalists" offered a useful approach. Presented by Reinaldo Chaves and Rune Ytreberg, the workshop demonstrated how AI can handle the grunt work of data normalisation, freeing journalists to focus on what the data actually reveals.
As an iMEdD fellow attending the conference, I was interested in how these techniques apply to the kind of data-driven analysis I work on regularly: cross-referencing conflict databases, standardising entity names across sources, and preparing datasets for visualisation.
The Problem with Messy Data
Data normalisation is the process of standardising inconsistent formats in datasets. It's the difference between reporting an accurate story and missing it entirely. Here's a familiar scenario: working with conflict data to track armed group activity.
In one dataset, a militia appears as "Rapid Support Forces." In another source, the same group is listed as "RSF," "R.S.F.," "Rapid Support Forces (RSF)," or even transliterated variations from Arabic. A simple database join may miss these matches. This is precisely where AI can help. It excels at identifying which variations refer to the same entity, standardising formats across datasets, and flagging entries that need human review.
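Once the AI (plus human review) has produced that mapping of variants to canonical names, applying it to a dataset is straightforward. Here's a minimal sketch in Python, using the RSF variants from the example above; the `ALIASES` table and `standardise` helper are illustrative, not part of any particular workflow:

```python
# Hypothetical alias table: the verified output of an AI first pass.
# Keys are lowercased variants; values are the canonical names you chose.
ALIASES = {
    "rapid support forces": "Rapid Support Forces (RSF)",
    "rsf": "Rapid Support Forces (RSF)",
    "r.s.f.": "Rapid Support Forces (RSF)",
    "rapid support forces (rsf)": "Rapid Support Forces (RSF)",
}

def standardise(name: str) -> str:
    """Map a name variant to its canonical form; pass unknowns through."""
    key = name.strip().lower()
    # Unmatched names are returned unchanged so they surface for review
    return ALIASES.get(key, name)
```

The point of the pass-through default is that anything the table doesn't cover stays visible in your data rather than silently disappearing.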
Why Use AI For Cleaning?
Data cleaning is one of the tasks best suited for AI, alongside document summarisation, data extraction, and cross-referencing. These tasks require pattern recognition across large volumes of data, and they're time-consuming for humans.
AI can process thousands of entity names in minutes, identifying likely matches based on spelling variations, abbreviation patterns, and contextual clues. It can standardise date formats from "January 5, 2024", "5/1/24", and "2024-01-05" into a consistent format. It can flag location names that appear inconsistent across your sources.
Obviously, the AI output isn't final. It's a first pass that still requires verification. But that verification is far faster than starting from scratch.
Before Using AI: The Uncomfortable Questions
- Who owns my data?
Read the Terms of Service. Does the platform claim any rights to your prompts or uploaded data?
- Is my data used for training?
Most free AI tools use your data to train their models. Look for an opt-out, or use a paid tier that contractually guarantees your data won't be used for training.
- Where is my data stored?
Is it processed on a server, and if so, in what legal jurisdiction? When working with data from sensitive sources or related to conflict zones, this can be a matter of security.
For sensitive datasets, consider paid enterprise tiers with contractual data protections, or local models running on your own hardware.
Prompting for Cleaning Data
Here's a framework for effective prompting that applies to data cleaning tasks.
- Define the role AI needs to assume:
"You are a data journalism assistant specialising in conflict data analysis."
- Specify the task:
"Standardise these armed group names and identify which variations refer to the same organisation."
- Provide context:
"I'm working with incident data from multiple sources covering the Sudan conflict. Armed group names may include variations in transliteration from Arabic, abbreviations, and inconsistent use of full organisational names."
- Request the output format:
"Create a table with four columns: Original Name, Standardised Name, Confidence Level (High/Medium/Low), and Reasoning."
- Demand explanation:
"For each proposed match, explain why you believe these variations refer to the same group."
- Set rules:
"Do not merge entities unless you are confident they refer to the same organisation. When uncertain, flag for human review rather than guessing."
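If you reuse this framework often, it can help to keep it as a fill-in template so every prompt covers all six parts. A small sketch; the template wording and field values are illustrative, not prescribed:

```python
# Hypothetical reusable template covering the six framework elements:
# role, task, context, output format, explanation, and rules.
PROMPT_TEMPLATE = """\
You are {role}.
Task: {task}
Context: {context}
Output format: {output_format}
For each proposed match, explain why you believe the variations match.
Rules: {rules}
"""

prompt = PROMPT_TEMPLATE.format(
    role="a data journalism assistant specialising in conflict data analysis",
    task="standardise these armed group names and identify duplicates",
    context="incident data from multiple sources covering the Sudan conflict",
    output_format="a table with columns Original Name, Standardised Name, "
                  "Confidence Level (High/Medium/Low), and Reasoning",
    rules="do not merge entities unless confident; flag uncertain cases "
          "for human review rather than guessing",
)
```

Filling the same template each time makes it harder to forget the rules section, which is the part that keeps the AI from over-merging.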
How Does This Work in Practice?
Say you're analysing conflict incidents and your dataset contains variations of names. Paste the data into Claude, ChatGPT, or another AI assistant and write this prompt:
"You are a data journalism assistant specialising in conflict data from Sudan. I need to standardise armed group names from multiple sources.
Here are the names I need to reconcile:
- Rapid Support Forces
- RSF
- R.S.F.
- Rapid Support Forces (RSF)
- قوات الدعم السريع
- Sudanese Armed Forces
- SAF
- Sudan Armed Forces
- Janjaweed
- Janjaweed militia
For each group of names that refer to the same organisation, provide:
- The original variations found
- A standardised name I should use
- Your confidence level (High/Medium/Low)
- A brief explanation of why these are the same entity
Important:
Do not merge groups that are actually distinct organisations. The RSF and SAF are opposing forces. Flag any ambiguous cases rather than guessing."
The AI will return something like:
Group 1: Rapid Support Forces
- Variations: "Rapid Support Forces," "RSF," "R.S.F.," "Rapid Support Forces (RSF)," "قوات الدعم السريع"
- Standardised name: Rapid Support Forces (RSF)
- Confidence: High
- Reasoning: These are all references to the same Sudanese paramilitary force. The Arabic text (قوات الدعم السريع) is the direct translation. RSF is the standard English abbreviation used in international reporting.
Group 2: Sudanese Armed Forces
- Variations: "Sudanese Armed Forces," "SAF," "Sudan Armed Forces"
- Standardised name: Sudanese Armed Forces (SAF)
- Confidence: High
- Reasoning: These refer to Sudan's national military, distinct from the RSF.
Flagged for review: Janjaweed
- Reasoning: The term "Janjaweed" historically refers to Arab militias in Darfur, some of which were later integrated into the RSF. However, "Janjaweed" and "RSF" are not synonymous in all contexts. Recommend human review to determine whether your dataset uses these terms interchangeably or distinctly.
That flagged entry is what you want from this prompt: the AI recognising the limits of its confidence and deferring to your editorial judgement.
Standardising Location Names
Conflict and humanitarian datasets often draw from multiple sources: UN agencies, local NGOs, government reports, and news wire services, each with different transliteration conventions. The same town might appear as "El Fasher," "Al-Fashir," "El Fashir," or the Arabic "الفاشر." These inconsistencies can cause you to undercount violence in a region simply because sources spell the place differently. The AI prompt follows a similar structure to the armed groups example.
Provide your list of location names, specify the region and context ("I'm working with humanitarian data from Darfur, Sudan"), and ask the AI to group variations that refer to the same place.
Request that it include coordinates when confident, and flag cases where two similarly named locations might actually be distinct places. A town called "Al-Fashir" in North Darfur is not the same as a village with a similar name in a different state, and an AI that explains its reasoning will help you catch these distinctions before they corrupt your analysis.
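For transliteration variants specifically, a cheap string-similarity pass can pre-group candidates before you send them to the AI, or sanity-check its output afterwards. A minimal sketch using the standard library's `difflib`; the threshold is an assumption you would tune, and a high score is only a signal to review, never proof two places are the same:

```python
from difflib import SequenceMatcher

def simplify(name: str) -> str:
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    cleaned = "".join(c if c.isalnum() else " " for c in name.lower())
    return " ".join(cleaned.split())

def likely_same_place(a: str, b: str, threshold: float = 0.75) -> bool:
    """Rough first-pass signal; flagged pairs still need human review."""
    return SequenceMatcher(None, simplify(a), simplify(b)).ratio() >= threshold
```

With this, "El Fasher" and "Al-Fashir" score as likely matches while unrelated names fall below the threshold. It will not catch abbreviations or translations, which is exactly the gap the AI pass fills.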
Catching the Errors
AI systems can confidently generate incorrect output, known as hallucinations. In data cleanup, this typically means incorrect merges (treating two distinct entities as the same) or missed matches (failing to recognise variations of the same entity). There's a technique for catching these errors: demand explanations.
When the AI must articulate why it believes two names refer to the same entity, you can evaluate its reasoning.
- "Merged because both contain 'Sudan'" is weak reasoning that might indicate a false match.
- "Merged because both are abbreviations of 'Rapid Support Forces,' the paramilitary group commanded by Mohamed Hamdan Dagalo" is specific and verifiable.
Other verification strategies include spot-checking a random sample of merges against your original sources and looking for patterns in the AI's errors (does it consistently confuse certain abbreviations?). Use the confidence levels to prioritise review, starting with the "Medium" and "Low" confidence matches. When AI is involved, this always bears repeating: never trust, always verify. The AI's output is a first draft, not a final answer.
Scaling Up with Customised AI Bots
If you're doing this kind of work repeatedly, you may benefit from creating a custom AI agent. The GIJC session walked through how to set these up without any coding.
Both Google Gemini and ChatGPT allow you to save custom instructions as reusable "Gems" or "GPTs." They can be set up without coding, and their advantage is consistency. Instead of rewriting a prompt each time, you create an agent with your instructions baked in. Name it something like "Conflict Data Normaliser," paste the prompt into the instructions field, and save it. From then on, you can simply upload new data, and the agent already knows the context, the format you want, and the rules you've established.
You can also upload reference material (e.g., a master list of armed groups with their standard names) and instruct the agent to match new data against it. The agent becomes a persistent tool rather than a one-off conversation.
Editorial Judgement Stays Human
Cleaning datasets seems like a technical task, but it contains editorial decisions. When you standardise "Janjaweed" and "RSF" as the same entity, you're making an analytical choice that shapes what patterns emerge from your data. AI can propose these decisions. Only you can make them.
AI is a research tool, not a replacement for journalistic judgement. The pattern-recognition power is genuinely useful: processing thousands of entries to surface likely matches is work that would take hours manually. But the final determination of what's true, what's relevant, and what's publishable remains the journalist's responsibility. Your byline means you're accountable for the analysis, not the AI that helped you get there.
Resources
Session materials: http://reichaves.github.io/building-ai-assistants/
GIJN Resource Center: http://gijn.org/resource/
Session: "Building an AI Assistant for Investigative Journalists" by Reinaldo Chaves (Abraji) and Rune Ytreberg (iTromso Datajournalism Lab)
Conference: Global Investigative Journalism Conference (GIJC25), co-hosted by the Global Investigative Journalism Network and Malaysiakini
Fellowship support: iMEdD (Incubator for Media Education and Development)