Al Jazeera Journalism Review


How AI Can Clean Messy Data, and Where It Can't

Normalising inconsistent, messy, or incomplete data is tedious and time-consuming, but essential. AI can handle grunt work, but editorial decisions remain with the journalist.

 

As a visual journalist, I often find stories in data. Often that data is messy: armed group names are spelt five different ways, date formats don't match, and location names are inconsistent across sources.

At the Global Investigative Journalism Conference (GIJC25), co-hosted by the Global Investigative Journalism Network and Malaysiakini, a session called "Building an AI Assistant for Investigative Journalists" offered a useful approach. Presented by Reinaldo Chaves and Rune Ytreberg, the workshop demonstrated how AI can handle the grunt work of data normalisation, freeing journalists to focus on what the data actually reveals. 

As an iMEdD fellow attending the conference, I was interested in how these techniques apply to the kind of data-driven analysis I work on regularly: cross-referencing conflict databases, standardising entity names across sources, and preparing datasets for visualisation. 

 

The Problem with Data

Data normalisation is the process of standardising inconsistent formats in datasets. Done well, it can be the difference between reporting an accurate story and missing it entirely. Here's a familiar scenario: working with conflict data to track armed group activity.

In one dataset, a militia appears as "Rapid Support Forces." In another source, the same group is listed as "RSF," "R.S.F.," "Rapid Support Forces (RSF)," or even transliterated variations from Arabic. A simple database join may miss these matches. This is precisely where AI can help. It excels at identifying which variations refer to the same entity, standardising formats across datasets, and flagging entries that need human review. 
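To see why a simple join falls short, here is a minimal Python sketch. The `normalise` helper and the lowercase variant are illustrative assumptions, not part of the session materials. It shows that exact matching catches none of the variants, and that basic rule-based cleaning still misses the abbreviations.

```python
import re

# Canonical name and variants, mirroring the example above.
# The lowercase variant is a hypothetical addition for illustration.
CANONICAL = "Rapid Support Forces"
variants = ["RSF", "R.S.F.", "Rapid Support Forces (RSF)", "rapid support forces"]

def normalise(name: str) -> str:
    """Lowercase, drop parenthetical notes and punctuation, collapse spaces."""
    name = re.sub(r"\(.*?\)", "", name.lower())
    name = re.sub(r"[^\w\s]", "", name)
    return " ".join(name.split())

# What an exact database join would find.
exact_matches = [v for v in variants if v == CANONICAL]

# What simple rule-based cleaning would find.
rule_matches = [v for v in variants if normalise(v) == normalise(CANONICAL)]

# What is left over: abbreviations that need knowledge of the group itself.
unmatched = [v for v in variants if v not in rule_matches]

print(exact_matches)
print(rule_matches)
print(unmatched)
```

The leftovers are the cases where recognising an abbreviation or a transliteration matters, which is the gap the AI pass is meant to fill.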

 

Why Use AI For Cleaning? 

Data cleaning is one of the tasks best suited for AI, alongside document summarisation, data extraction, and cross-referencing. These tasks require pattern recognition across large volumes of data, and they're time-consuming for humans. 

AI can process thousands of entity names in minutes, identifying likely matches based on spelling variations, abbreviation patterns, and contextual clues. It can standardise date formats from "January 5, 2024", "5/1/24", and "2024-01-05" into a consistent format. It can flag location names that appear inconsistent across your sources. 
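Date standardisation can also be cross-checked locally. The sketch below is one possible approach, assuming only the three formats mentioned above plus a month-first variant; `parse_date` and the `FORMATS` list are hypothetical names. It converts a value to ISO format only when a single reading is possible, and flags values such as "5/1/24" that could be either 5 January or 1 May.

```python
from datetime import datetime
from typing import Optional, Tuple

# Candidate formats. Day-first and month-first are both listed on purpose,
# so that ambiguous values produce two different readings.
FORMATS = ["%Y-%m-%d", "%B %d, %Y", "%d/%m/%y", "%m/%d/%y"]

def parse_date(raw: str) -> Tuple[Optional[str], bool]:
    """Return (ISO date or None, needs_review)."""
    readings = set()
    for fmt in FORMATS:
        try:
            readings.add(datetime.strptime(raw.strip(), fmt).date().isoformat())
        except ValueError:
            continue  # this format does not fit; try the next one
    if len(readings) == 1:
        return readings.pop(), False
    # Zero readings (unparseable) or several (ambiguous): leave it to a human.
    return None, True

for value in ["January 5, 2024", "5/1/24", "2024-01-05"]:
    print(value, parse_date(value))
```

Which reading of an ambiguous date is correct depends on the source's convention, so that decision stays with the journalist.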

Obviously, the AI output isn't final. It's a first pass that still requires verification. But that verification is far faster than starting from scratch. 

 

Before Using AI: The Uncomfortable Questions 

  • Who owns my data? 

    Read the Terms of Service. Does the platform claim any rights to your prompts or uploaded data? 

  • Is my data used for training? 

    Many free AI tools may use your data to train their models by default. Look for an opt-out, or use a paid tier that contractually guarantees your data won't be used for training. 

  • Where is my data stored? 

    Is it processed on a server, and if so, in what legal jurisdiction? When working with data from sensitive sources or related to conflict zones, this can be a matter of security. 

For sensitive datasets, consider paid enterprise tiers with contractual data protections, or local models running on your own hardware. 

 

Prompting for Cleaning Data

Here's a framework for effective prompting that applies to data cleaning tasks. 

  • Define the role AI needs to assume: 

    "You are a data journalism assistant specialising in conflict data analysis." 

  • Specify the task: 

    "Standardise these armed group names and identify which variations refer to the same organisation." 

  • Provide context: 

    "I'm working with incident data from multiple sources covering the Sudan conflict. Armed group names may include variations in transliteration from Arabic, abbreviations, and inconsistent use of full organisational names." 

  • Request the output format: 

    "Create a table with four columns: Original Name, Standardised Name, Confidence Level (High/Medium/Low), and Reasoning." 

  • Demand explanation: 

    "For each proposed match, explain why you believe these variations refer to the same group." 

  • Set rules: 

    "Do not merge entities unless you are confident they refer to the same organisation. When uncertain, flag for human review rather than guessing."
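If you reuse this framework often, the parts can be kept as separate pieces and joined when needed. The following is a minimal sketch; `build_prompt` and its parameter names are hypothetical, and the function only assembles text without calling any AI service.

```python
def build_prompt(role, task, context, output_format, explanation, rules, names):
    """Join the framework's parts into one prompt, listing names one per line."""
    name_lines = "\n".join(f"- {n}" for n in names)
    sections = [
        role,
        task,
        f"Context: {context}",
        f"Names to reconcile:\n{name_lines}",
        f"Output format: {output_format}",
        explanation,
        f"Rules: {rules}",
    ]
    return "\n\n".join(sections)

prompt = build_prompt(
    role="You are a data journalism assistant specialising in conflict data analysis.",
    task="Standardise these armed group names and identify which variations "
         "refer to the same organisation.",
    context="Incident data from multiple sources covering the Sudan conflict.",
    output_format="A table with four columns: Original Name, Standardised Name, "
                  "Confidence Level (High/Medium/Low), and Reasoning.",
    explanation="For each proposed match, explain why you believe these "
                "variations refer to the same group.",
    rules="Do not merge entities unless you are confident. When uncertain, "
          "flag for human review rather than guessing.",
    names=["Rapid Support Forces", "RSF", "R.S.F."],
)
print(prompt)
```

Keeping the rules and output format fixed while swapping only the list of names makes results easier to compare from one dataset to the next.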

 

How Does This Work in Practice? 

Say you're analysing conflict incidents and your dataset contains variations of names. Paste the data into Claude, ChatGPT, or another AI assistant and write this prompt: 

"You are a data journalism assistant specialising in conflict data from Sudan. I need to standardise armed group names from multiple sources. 

Here are the names I need to reconcile: 

  • Rapid Support Forces 
  • RSF 
  • R.S.F. 
  • Rapid Support Forces (RSF) 
  • قوات الدعم السريع 
  • Sudanese Armed Forces 
  • SAF 
  • Sudan Armed Forces 
  • Janjaweed 
  • Janjaweed militia 

For each group of names that refer to the same organisation, provide: 

  • The original variations found 
  • A standardised name I should use 
  • Your confidence level (High/Medium/Low) 
  • A brief explanation of why these are the same entity 

Important: 

Do not merge groups that are actually distinct organisations. The RSF and SAF are opposing forces. Flag any ambiguous cases rather than guessing." 

The AI will return something like: 

Group 1: Rapid Support Forces 

  • Variations: "Rapid Support Forces," "RSF," "R.S.F.," "Rapid Support Forces (RSF)," "قوات الدعم السريع" 
  • Standardised name: Rapid Support Forces (RSF) 
  • Confidence: High 
  • Reasoning: These are all references to the same Sudanese paramilitary force. The Arabic text (قوات الدعم السريع) is the direct translation. RSF is the standard English abbreviation used in international reporting. 

Group 2: Sudanese Armed Forces 

  • Variations: "Sudanese Armed Forces," "SAF," "Sudan Armed Forces" 
  • Standardised name: Sudanese Armed Forces (SAF) 
  • Confidence: High 
  • Reasoning: These refer to Sudan's national military, distinct from the RSF. 

Flagged for review: Janjaweed 

  • The term "Janjaweed" historically refers to Arab militias in Darfur, some of which were later integrated into the RSF. However, "Janjaweed" and "RSF" are not synonymous in all contexts. Recommend human review to determine whether your dataset uses these terms interchangeably or distinctly. 

That flagged entry is what you want from this prompt: the AI recognising the limits of its confidence and deferring to your editorial judgement. 
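Once you have reviewed the AI's answer, one cautious way to use it is to apply only the high-confidence matches and leave everything else unchanged. The sketch below assumes you have copied the AI's table into a list of dictionaries; that structure, the `apply_mapping` function, and the "Flagged" label are illustrative assumptions.

```python
# Hypothetical hand-copied version of the AI's answer.
proposed = [
    {"original": "RSF", "standardised": "Rapid Support Forces (RSF)", "confidence": "High"},
    {"original": "R.S.F.", "standardised": "Rapid Support Forces (RSF)", "confidence": "High"},
    {"original": "SAF", "standardised": "Sudanese Armed Forces (SAF)", "confidence": "High"},
    {"original": "Janjaweed", "standardised": None, "confidence": "Flagged"},
]

def apply_mapping(names, proposed):
    """Return (pairs of original and cleaned name, names left for human review)."""
    accepted = {
        p["original"]: p["standardised"]
        for p in proposed
        if p["confidence"] == "High" and p["standardised"]
    }
    cleaned, review = [], []
    for name in names:
        if name in accepted:
            cleaned.append((name, accepted[name]))
        else:
            # Keep the original value untouched and queue it for a human decision.
            cleaned.append((name, name))
            review.append(name)
    return cleaned, review

cleaned, review = apply_mapping(["RSF", "Janjaweed", "SAF"], proposed)
print(cleaned)
print(review)
```

Keeping the original value alongside the cleaned one means every change can be traced back and reversed.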

Standardising Location Names 

Conflict and humanitarian datasets often draw from multiple sources (UN agencies, local NGOs, government reports, news wire services), each with different transliteration conventions. The same town might appear as "El Fasher," "Al-Fashir," "El Fashir," or the Arabic "الفاشر." These inconsistencies can cause you to undercount violence in a region simply because sources spell the place differently. The AI prompt follows a similar structure to the armed groups example. 

Provide your list of location names, specify the region and context ("I'm working with humanitarian data from Darfur, Sudan"), and ask the AI to group variations that refer to the same place. 

Request that it include coordinates when confident, and flag cases where two similarly named locations might actually be distinct places. A town called "Al-Fashir" in North Darfur is not the same as a village with a similar name in a different state, and an AI that explains its reasoning will help you catch these distinctions before they corrupt your analysis. 
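A simple local check can back up that request. The sketch below assumes each row carries the standardised name proposed by the AI and the state recorded by the original source; the rows, the placeholder "State X", and the function name are hypothetical. It lists any standardised name that appears in more than one state, which may signal that two distinct places were merged.

```python
from collections import defaultdict

# Hypothetical rows: (standardised name, state recorded in the source).
# "State X" is a placeholder, not a real location.
rows = [
    ("El Fasher", "North Darfur"),
    ("El Fasher", "North Darfur"),
    ("El Fasher", "State X"),
    ("Nyala", "South Darfur"),
]

def names_in_multiple_states(rows):
    """Map each standardised name to its states, keeping only names with several."""
    states_by_name = defaultdict(set)
    for name, state in rows:
        states_by_name[name].add(state)
    return {
        name: sorted(states)
        for name, states in states_by_name.items()
        if len(states) > 1
    }

print(names_in_multiple_states(rows))
```

A hit is not proof of an error, since the source's state field may itself be wrong, but it tells you where to look first.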

 

Catching the Errors 

AI systems can confidently generate incorrect output, often called hallucinations. In data cleanup, the errors to watch for are incorrect merges (treating two distinct entities as the same) and missed matches (failing to recognise variations of the same entity). There's a technique for catching these errors: demand explanations. 

When the AI must articulate why it believes two names refer to the same entity, you can evaluate its reasoning. 

  • "Merged because both contain 'Sudan'" is weak reasoning that might indicate a false match. 
  • "Merged because both are abbreviations of 'Rapid Support Forces,' the paramilitary group commanded by Mohamed Hamdan Dagalo" is specific and verifiable. 

Other verification strategies include spot-checking a random sample of merges against your original sources and looking for patterns in the AI's errors (does it consistently confuse certain abbreviations?). Use the confidence levels to prioritise review, starting with the "Medium" and "Low" confidence matches. When AI is involved, this always bears repeating: never trust, always verify. The AI's output is a first draft, not a final answer. 
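The review order described above can be sketched in a few lines of Python. This assumes the AI's merges have been copied into a list of dictionaries with a "confidence" field; the `review_queue` function, its parameters, and the sample rows are hypothetical. Low and Medium matches come first, followed by a random sample of High matches to spot-check against the original sources.

```python
import random

def review_queue(merges, sample_size=2, seed=0):
    """Order merges for checking: Low, then Medium, then a sample of High."""
    priority = {"Low": 0, "Medium": 1}
    uncertain = sorted(
        (m for m in merges if m["confidence"] in priority),
        key=lambda m: priority[m["confidence"]],
    )
    high = [m for m in merges if m["confidence"] == "High"]
    # A fixed seed makes the spot-check sample reproducible for colleagues.
    rng = random.Random(seed)
    spot_check = rng.sample(high, min(sample_size, len(high)))
    return uncertain + spot_check

# Hypothetical merges for illustration.
merges = [
    {"original": "RSF", "confidence": "High"},
    {"original": "Sudan Armed Forces", "confidence": "Medium"},
    {"original": "R.S.F.", "confidence": "High"},
    {"original": "Janjaweed militia", "confidence": "Low"},
    {"original": "SAF", "confidence": "High"},
]

for item in review_queue(merges):
    print(item["confidence"], item["original"])
```

The sample size is a judgement call: the more the spot-checks turn up errors, the larger the sample should be.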

 

Scaling Up with Customised AI Bots 

If you're doing this kind of work repeatedly, you may benefit from creating a custom AI agent. The GIJC session walked through how to set these up without any coding. 

Google Gemini and ChatGPT let you save custom instructions as reusable "Gems" and "GPTs" respectively. Their main advantage is consistency. Instead of rewriting a prompt each time, you create an agent with your instructions baked in. Name it something like "Conflict Data Normaliser," paste the prompt into the instructions field, and save it. From then on, you can simply upload new data, and the agent already knows the context, the format you want, and the rules you've established. 

You can also upload reference material (e.g., a master list of armed groups with their standard names) and instruct the agent to match new data against it. The agent becomes a persistent tool rather than a one-off conversation. 

 

Editorial Judgement Stays Human 

Cleaning datasets seems like a technical task, but it contains editorial decisions. When you standardise "Janjaweed" and "RSF" as the same entity, you're making an analytical choice that shapes what patterns emerge from your data. AI can propose these decisions. Only you can make them. 

AI is a research tool, not a replacement for journalistic judgement. The pattern-recognition power is genuinely useful: processing thousands of entries to surface likely matches is work that would take hours manually. But the final determination of what's true, what's relevant, and what's publishable remains the journalist's responsibility. Your byline means that you, not the AI that helped you get there, are accountable for the analysis. 

 

Resources 

Session materials: http://reichaves.github.io/building-ai-assistants/  

GIJN Resource Center: http://gijn.org/resource/  

Session: "Building an AI Assistant for Investigative Journalists" by Reinaldo Chaves (Abraji) and Rune Ytreberg (iTromso Datajournalism Lab) 

Conference: Global Investigative Journalism Conference (GIJC25), co-hosted by the Global Investigative Journalism Network and Malaysiakini 

Fellowship support: iMEdD (Incubator for Media Education and Development) 
