Normalising inconsistent, messy, or incomplete data is tedious and time-consuming, but essential. AI can handle the grunt work, but editorial decisions remain with the journalist.
As a visual journalist, I often find stories in data. Sometimes those stories hide in messy datasets: armed group names are spelt five different ways; date formats don't match; location names are inconsistent across sources.
At the Global Investigative Journalism Conference (GIJC25), co-hosted by the Global Investigative Journalism Network and Malaysiakini, a session called "Building an AI Assistant for Investigative Journalists" offered a useful approach. Presented by Reinaldo Chaves and Rune Ytreberg, the workshop demonstrated how AI can handle the grunt work of data normalisation, freeing journalists to focus on what the data actually reveals.
As an iMEdD fellow attending the conference, I was interested in how these techniques apply to the kind of data-driven analysis I work on regularly: cross-referencing conflict databases, standardising entity names across sources, and preparing datasets for visualisation.
The Problem with Messy Data
Data normalisation is the process of standardising inconsistent formats in datasets. It's the difference between reporting an accurate story and missing it entirely. Here's a familiar scenario: working with conflict data to track armed group activity.
In one dataset, a militia appears as "Rapid Support Forces." In another source, the same group is listed as "RSF," "R.S.F.," "Rapid Support Forces (RSF)," or even transliterated variations from Arabic. A simple database join may miss these matches. This is precisely where AI can help. It excels at identifying which variations refer to the same entity, standardising formats across datasets, and flagging entries that need human review.
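Once the AI (plus human review) has produced that mapping of variants to canonical names, applying it to a dataset is straightforward. Here's a minimal sketch in Python, using the RSF variants from the example above; the `ALIASES` table and `standardise` helper are illustrative, not part of any particular workflow:

```python
# Hypothetical alias table: the verified output of an AI first pass.
# Keys are lowercased variants; values are the canonical names you chose.
ALIASES = {
    "rapid support forces": "Rapid Support Forces (RSF)",
    "rsf": "Rapid Support Forces (RSF)",
    "r.s.f.": "Rapid Support Forces (RSF)",
    "rapid support forces (rsf)": "Rapid Support Forces (RSF)",
}

def standardise(name: str) -> str:
    """Map a name variant to its canonical form; pass unknowns through."""
    key = name.strip().lower()
    # Unmatched names are returned unchanged so they surface for review
    return ALIASES.get(key, name)
```

The point of the pass-through default is that anything the table doesn't cover stays visible in your data rather than silently disappearing.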
Why Use AI For Cleaning?
Data cleaning is one of the tasks best suited for AI, alongside document summarisation, data extraction, and cross-referencing. These tasks require pattern recognition across large volumes of data, and they're time-consuming for humans.
AI can process thousands of entity names in minutes, identifying likely matches based on spelling variations, abbreviation patterns, and contextual clues. It can standardise date formats from "January 5, 2024", "5/1/24", and "2024-01-05" into a consistent format. It can flag location names that appear inconsistent across your sources.
Obviously, the AI output isn't final. It's a first pass that still requires verification. But that verification is far faster than starting from scratch.
Before Using AI: The Uncomfortable Questions
- Who owns my data?
Read the Terms of Service. Does the platform claim any rights to your prompts or uploaded data?
- Is my data used for training?
Most free AI tools use your data to train their models. Look for an opt-out, or use a paid tier that contractually guarantees your data won't be used for training.
- Where is my data stored?
Is it processed on a server, and if so, in what legal jurisdiction? When working with data from sensitive sources or related to conflict zones, this can be a matter of security.
For sensitive datasets, consider paid enterprise tiers with contractual data protections, or local models running on your own hardware.
Prompting for Cleaning Data
Here's a framework for effective prompting that applies to data cleaning tasks.
- Define the role AI needs to assume:
"You are a data journalism assistant specialising in conflict data analysis."
- Specify the task:
"Standardise these armed group names and identify which variations refer to the same organisation."
- Provide context:
"I'm working with incident data from multiple sources covering the Sudan conflict. Armed group names may include variations in transliteration from Arabic, abbreviations, and inconsistent use of full organisational names."
- Request the output format:
"Create a table with four columns: Original Name, Standardised Name, Confidence Level (High/Medium/Low), and Reasoning."
- Demand explanation:
"For each proposed match, explain why you believe these variations refer to the same group."
- Set rules:
"Do not merge entities unless you are confident they refer to the same organisation. When uncertain, flag for human review rather than guessing."
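If you reuse this framework often, it can help to keep it as a fill-in template so every prompt covers all six parts. A small sketch; the template wording and field values are illustrative, not prescribed:

```python
# Hypothetical reusable template covering the six framework elements:
# role, task, context, output format, explanation, and rules.
PROMPT_TEMPLATE = """\
You are {role}.
Task: {task}
Context: {context}
Output format: {output_format}
For each proposed match, explain why you believe the variations match.
Rules: {rules}
"""

prompt = PROMPT_TEMPLATE.format(
    role="a data journalism assistant specialising in conflict data analysis",
    task="standardise these armed group names and identify duplicates",
    context="incident data from multiple sources covering the Sudan conflict",
    output_format="a table with columns Original Name, Standardised Name, "
                  "Confidence Level (High/Medium/Low), and Reasoning",
    rules="do not merge entities unless confident; flag uncertain cases "
          "for human review rather than guessing",
)
```

Filling the same template each time makes it harder to forget the rules section, which is the part that keeps the AI from over-merging.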
How Does This Work in Practice?
Say you're analysing conflict incidents and your dataset contains variations of names. Paste the data into Claude, ChatGPT, or another AI assistant and write this prompt:
"You are a data journalism assistant specialising in conflict data from Sudan. I need to standardise armed group names from multiple sources.
Here are the names I need to reconcile:
- Rapid Support Forces
- RSF
- R.S.F.
- Rapid Support Forces (RSF)
- قوات الدعم السريع
- Sudanese Armed Forces
- SAF
- Sudan Armed Forces
- Janjaweed
- Janjaweed militia
For each group of names that refer to the same organisation, provide:
- The original variations found
- A standardised name I should use
- Your confidence level (High/Medium/Low)
- A brief explanation of why these are the same entity
Important:
Do not merge groups that are actually distinct organisations. The RSF and SAF are opposing forces. Flag any ambiguous cases rather than guessing."
The AI will return something like:
Group 1: Rapid Support Forces
- Variations: "Rapid Support Forces," "RSF," "R.S.F.," "Rapid Support Forces (RSF)," "قوات الدعم السريع"
- Standardised name: Rapid Support Forces (RSF)
- Confidence: High
- Reasoning: These are all references to the same Sudanese paramilitary force. The Arabic text (قوات الدعم السريع) is the direct translation. RSF is the standard English abbreviation used in international reporting.
Group 2: Sudanese Armed Forces
- Variations: "Sudanese Armed Forces," "SAF," "Sudan Armed Forces"
- Standardised name: Sudanese Armed Forces (SAF)
- Confidence: High
- Reasoning: These refer to Sudan's national military, distinct from the RSF.
Flagged for review: Janjaweed
- Reasoning: The term "Janjaweed" historically refers to Arab militias in Darfur, some of which were later integrated into the RSF. However, "Janjaweed" and "RSF" are not synonymous in all contexts. Recommend human review to determine whether your dataset uses these terms interchangeably or distinctly.
That flagged entry is what you want from this prompt: the AI recognising the limits of its confidence and deferring to your editorial judgement.
Standardising Location Names
Conflict and humanitarian datasets often draw from multiple sources: UN agencies, local NGOs, government reports, and news wire services, each with different transliteration conventions. The same town might appear as "El Fasher," "Al-Fashir," "El Fashir," or the Arabic "الفاشر." These inconsistencies can cause you to undercount violence in a region simply because sources spell the place differently. The AI prompt follows a similar structure to the armed groups example.
Provide your list of location names, specify the region and context ("I'm working with humanitarian data from Darfur, Sudan"), and ask the AI to group variations that refer to the same place.
Request that it include coordinates when confident, and flag cases where two similarly named locations might actually be distinct places. A town called "Al-Fashir" in North Darfur is not the same as a village with a similar name in a different state, and an AI that explains its reasoning will help you catch these distinctions before they corrupt your analysis.
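For transliteration variants specifically, a cheap string-similarity pass can pre-group candidates before you send them to the AI, or sanity-check its output afterwards. A minimal sketch using the standard library's `difflib`; the threshold is an assumption you would tune, and a high score is only a signal to review, never proof two places are the same:

```python
from difflib import SequenceMatcher

def simplify(name: str) -> str:
    """Lowercase, turn punctuation into spaces, collapse whitespace."""
    cleaned = "".join(c if c.isalnum() else " " for c in name.lower())
    return " ".join(cleaned.split())

def likely_same_place(a: str, b: str, threshold: float = 0.75) -> bool:
    """Rough first-pass signal; flagged pairs still need human review."""
    return SequenceMatcher(None, simplify(a), simplify(b)).ratio() >= threshold
```

With this, "El Fasher" and "Al-Fashir" score as likely matches while unrelated names fall below the threshold. It will not catch abbreviations or translations, which is exactly the gap the AI pass fills.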
Catching the Errors
AI systems can confidently generate incorrect output, known as hallucinations. In data cleanup, this typically means incorrect merges (treating two distinct entities as the same) or missed matches (failing to recognise variations of the same entity). There's a technique for catching these errors: demand explanations.
When the AI must articulate why it believes two names refer to the same entity, you can evaluate its reasoning.
- "Merged because both contain 'Sudan'" is weak reasoning that might indicate a false match.
- "Merged because both are abbreviations of 'Rapid Support Forces,' the paramilitary group commanded by Mohamed Hamdan Dagalo" is specific and verifiable.
Other verification strategies include spot-checking a random sample of merges against your original sources and looking for patterns in the AI's errors (does it consistently confuse certain abbreviations?). Use the confidence levels to prioritise review, starting with the "Medium" and "Low" confidence matches. When AI is involved, this always bears repeating: never trust, always verify. The AI's output is a first draft, not a final answer.
Scaling Up with Customised AI Bots
If you're doing this kind of work repeatedly, you may benefit from creating a custom AI agent. The GIJC session walked through how to set these up without any coding.
Both Google Gemini and ChatGPT allow you to save custom instructions as reusable "Gems" or "GPTs." They can be set up without coding, and their advantage is consistency. Instead of rewriting a prompt each time, you create an agent with your instructions baked in. Name it something like "Conflict Data Normaliser," paste the prompt into the instructions field, and save it. From then on, you can simply upload new data, and the agent already knows the context, the format you want, and the rules you've established.
You can also upload reference material (e.g., a master list of armed groups with their standard names) and instruct the agent to match new data against it. The agent becomes a persistent tool rather than a one-off conversation.
Editorial Judgement Stays Human
Cleaning datasets seems like a technical task, but it contains editorial decisions. When you standardise "Janjaweed" and "RSF" as the same entity, you're making an analytical choice that shapes what patterns emerge from your data. AI can propose these decisions. Only you can make them.
AI is a research tool, not a replacement for journalistic judgement. The pattern-recognition power is genuinely useful: processing thousands of entries to surface likely matches is work that would take hours manually. But the final determination of what's true, what's relevant, and what's publishable remains the journalist's responsibility. Your byline means you're accountable for the analysis, not the AI that helped you get there.
Resources
Session materials: http://reichaves.github.io/building-ai-assistants/
GIJN Resource Center: http://gijn.org/resource/
Session: "Building an AI Assistant for Investigative Journalists" by Reinaldo Chaves (Abraji) and Rune Ytreberg (iTromso Datajournalism Lab)
Conference: Global Investigative Journalism Conference (GIJC25), co-hosted by the Global Investigative Journalism Network and Malaysiakini
Fellowship support: iMEdD (Incubator for Media Education and Development)