Automated Data Pipeline: Towards data quality standards

Automating a country’s basic data to gain time in a disaster involves many complex variables. Since 2022, MapAction, the United Nations Office for the Coordination of Humanitarian Affairs (UN OCHA) and the University of Georgia, Information Technology Outreach Services (ITOS) have been working on strengthening data quality for what are referred to as Common Operational Datasets (COD): ‘best available’ shared datasets that ensure consistency and simplify the discovery and exchange of key data among humanitarian organisations. In this article, we briefly present the methodology.

Common Operational Datasets, or CODs, are authoritative reference datasets needed to support operations and decision-making for all actors in a humanitarian response. CODs are ‘best available’ datasets that ensure consistency and simplify the discovery and exchange of key data. The data is typically geo-spatially linked using a coordinate system and has unique geographic identification codes. These datasets are often derived from data collected by local authorities and international partners to ensure quality, but most vitally, local ownership. CODs can be collected on administrative boundaries, population and more.

The three main datasets available within the CODs for most countries are COD-AB (Administrative Boundaries), COD-PS (Population Statistics) and COD-HP (Humanitarian Profile). Country-specific CODs, such as the road network and health facilities, may also be available. They can be easily found on HDX (Humanitarian Data Exchange) on the COD page or through the service API managed by ITOS.

Moreover, OCHA routinely evaluates the quality and availability of Common Operational Datasets. The results of this analysis can be found on the COD Portal. The Portal documents:

  1. The quality of administrative boundary (COD-AB) layers and population statistics (COD-PS) tables; and
  2. Their availability on HDX and, for COD-AB, as ITOS geoservices.

The portal is generally refreshed after any COD-AB or COD-PS is added or updated.

Keeping humanitarian data fresh

This project focused on the COD-AB. As shown in the picture below, each available country has administrative boundaries at several levels, together with metadata and P-codes.

READ ALSO: Accelerating humanitarian response: Inside MapAction’s Automated Data Pipeline

The proposed methodology evaluates both the geospatial features and the metadata.

The objective of this project was to create a quantitative assessment mechanism that enables the prioritisation of work to update or enhance existing COD-AB datasets. The output is a quality index for each COD-AB dataset based on tests of its features (geographical, metadata, etc.), targeted at both the OCHA and ITOS teams and available to the public as an open-source tool.

The main deliverable of this project is the COD-AB Data Quality dashboard. Through this webpage hosted on FieldMaps — Humanitarian Maps & Data, the quality multi-index per country can be visualised through a dedicated dashboard, downloaded as a spreadsheet, or downloaded as individual country PDF reports detailing the methodology. Moreover, the Python code is publicly available on a GitHub code repository for advanced users.

Methodology

The methodology for computing each country's COD-AB quality index is based on a multi-index approach in which separate scores are computed for each category, covering both geospatial features and metadata such as languages and P-codes.

The overall score for a given country is a value between 0% and 100% and is computed as the average value of 10 different categories. For each category, the score is computed as the proportion of layers matching a set of quality criteria. Within the dashboard, users can select which categories they wish to use when computing the overall score.
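As a brief illustration of the scoring described above (using hypothetical category results, not the project's actual code), the average-of-proportions calculation can be sketched in a few lines of Python:

```python
# Sketch of the multi-index scoring: each category score is the proportion
# of a country's admin layers passing that category's checks; the overall
# score is the mean of the selected category scores. All data is illustrative.

def category_score(layer_results):
    """Proportion of layers passing a category's checks (0.0-1.0)."""
    return sum(layer_results) / len(layer_results)

def overall_score(categories, selected=None):
    """Average the chosen category scores into a 0-100% index."""
    names = selected or list(categories)
    return 100 * sum(category_score(categories[n]) for n in names) / len(names)

# Example: a country with 4 admin layers (ADM0-ADM3); True = layer passes.
results = {
    "Valid Geometry": [True, True, True, False],   # 75%
    "P-Codes":        [True, True, False, False],  # 50%
    "Languages":      [True, True, True, True],    # 100%
}
print(overall_score(results))  # 75.0
```

Passing a `selected` list mirrors the dashboard feature that lets users choose which categories count towards the overall score.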

Scores computed for each country, as well as a detailed country report, can be found on the publicly available dashboard webpage hosted on FieldMaps.

Methodology description

The complete list of indicators used within each category can be found on the dedicated country reports on the project dashboard. Below is a list of descriptions for each category used in the overall score calculation.

Valid Geometry: Valid geometry is defined as having no empty geometries, containing only polygons (no points or lines), no self-intersecting rings, the WGS84 CRS (EPSG:4326), and a valid bounding box.
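A simplified sketch of some of these checks, working on a GeoJSON-like feature dictionary, might look as follows (a production implementation would use a geometry library such as Shapely, which also covers the self-intersection test):

```python
# Illustrative subset of the "Valid Geometry" checks in pure Python.
# The feature layout and check names are hypothetical.

def valid_bbox(bbox):
    """WGS84 bounding box sanity check: lon/lat ranges and min <= max."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return (-180 <= min_lon <= max_lon <= 180
            and -90 <= min_lat <= max_lat <= 90)

def check_geometry(feature):
    """Return a list of failed checks for one feature."""
    failures = []
    geom = feature.get("geometry")
    if not geom or not geom.get("coordinates"):
        failures.append("empty geometry")
        return failures
    if geom["type"] not in ("Polygon", "MultiPolygon"):
        failures.append("not a polygon")
    if not valid_bbox(feature["bbox"]):
        failures.append("invalid bounding box")
    return failures

feature = {
    "geometry": {"type": "Polygon",
                 "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]},
    "bbox": [0, 0, 1, 1],
}
print(check_geometry(feature))  # []
```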

Valid Topology: Valid topology is defined as having no triangle polygons, sliver gaps or overlaps within a layer, with each polygon fully contained within its parent.

Equal Area: Layers which all share the same area. Layers not sharing the same area may have empty areas representing water bodies whereas other layers have them filled out.

Sq. km: Layers which have an area attribute in square kilometres whose value matches the area calculated using NASA EASE-Grid 2.0.

P-Codes: Layers which have all required P-Code columns (ADM2_PCODE), with no empty cells, only alphanumeric characters, codes starting with a valid ISO-2 code, no duplicate codes, all codes within a column having the same length, and codes nesting hierarchically within their parent layer.
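These column-level rules lend themselves to a compact validation function. The sketch below is a hypothetical helper, not the project's actual code; `codes` is one layer's P-Code column and `parents` the parent layer's column for the nesting check:

```python
import re

# Hedged sketch of the P-Code checks listed above. The P-Code values are
# invented for illustration ("YE" is the ISO-2 code for Yemen).

def check_pcodes(codes, iso2, parents=None):
    failures = []
    if any(not c for c in codes):
        failures.append("empty cells")
    if any(not re.fullmatch(r"[A-Za-z0-9]+", c) for c in codes if c):
        failures.append("non-alphanumeric characters")
    if any(not c.startswith(iso2) for c in codes if c):
        failures.append("missing ISO-2 prefix")
    if len(set(codes)) != len(codes):
        failures.append("duplicate codes")
    if len({len(c) for c in codes if c}) > 1:
        failures.append("inconsistent code length")
    if parents and any(not any(c.startswith(p) for p in parents)
                       for c in codes if c):
        failures.append("codes do not nest within parent layer")
    return failures

adm1 = ["YE11", "YE12"]
adm2 = ["YE1101", "YE1102", "YE1201"]
print(check_pcodes(adm2, "YE", parents=adm1))  # []
```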

Names: Layers which have all required name columns (ADM2_EN), with no empty cells, no duplicate rows, no double, leading or trailing spaces, no columns entirely uppercase or lowercase, no cells lacking alphabetic characters, and all characters matching the language code.

Languages: Layers which have at least one language column detected, with all language codes valid, a romanised language featured first, and no layer having more languages than its parent.

Date: Layers which have a valid date value for their source.

Valid On: Layers which have been validated within the last 12 months.

Other: Layers which have no fields other than the expected ones.

READ ALSO: MapAction Anticipatory Action Sustainability Methodology

Conclusion and way forward

The main target audiences for this project are the OCHA FISS, OCHA HDX and ITOS teams directly involved in evaluating and improving the quality of the CODs. This project was tailored to their needs, and we hope it will streamline the initial quality-assessment process. By making both the results and methodology publicly available online, we hope other COD stakeholders will integrate quality analysis into their decision-making.

This project has also been a great opportunity for collaboration between the MapAction team and Maxym Malynowsky, a humanitarian data expert responsible for, among other projects, fieldmaps.io. Maxym joined the project together with MapAction volunteers and played an essential role in designing, implementing and promoting this work. In his new role as Data Engineer Advisor at the OCHA Centre for Humanitarian Data, Maxym will be able to link this work with the quality needs of HDX datasets.

MapAction will continue to be involved in COD data quality, both by taking part in sector discussions with key stakeholders and as a direct user in the context of emergency responses.

The dashboard will continue to be hosted by FieldMaps — Humanitarian Maps & Data and Maxym Malynowsky will be the main point of contact.

This work is funded by USAID’s Bureau for Humanitarian Assistance (BHA).

Find out more about MapAction’s data work in another field, Anticipatory Action, in the video below.

Accelerating humanitarian response: Inside MapAction’s Automated Data Pipeline

In times of crisis, timely and accurate geospatial data is crucial for effective humanitarian response. This GIS Day 2024, we discuss a new project: an automated data pipeline to streamline the collection and preparation of essential geospatial datasets for emergencies. By replicating the data scramble process our GIS teams typically perform during emergencies, the MapAction Automated Data Pipeline aims to expedite the delivery of critical information to those who need it most.

By: Evangelos Diakatos, MapAction Data Engineer

By automating the acquisition of these datasets, the pipeline aims to improve efficiency by reducing the time required to gather and prepare data during emergencies. It enhances accuracy by providing up-to-date and consistent datasets for mapping and analysis, enabling the GIS team to focus on critical analysis and map production rather than manual data collection. This supports a more rapid and effective humanitarian response.

Data Sources

The pipeline integrates data from several key sources. One of the primary sources is the Humanitarian Data Exchange (HDX), a platform hosted by OCHA that offers a wide range of humanitarian datasets. HDX provides access to critical information necessary for planning and coordinating emergency responses.

Another important source is Google Earth Engine (GEE), a cloud-based platform that facilitates the processing of satellite imagery and other geospatial datasets.  Additionally, the pipeline retrieves data from OpenStreetMap (OSM), a collaborative project aimed at creating a free, editable map of the world. OSM provides detailed geographical information, including roads, buildings, and points of interest.

Figure: Example of a baseline map made by MapAction during past emergency responses. Our new data pipeline aims to automate the acquisition and processing of data used in these maps, such as administrative boundaries, transport infrastructure and geographic features.

The datasets collected and processed by the pipeline are mainly those needed in the first moments after the onset of an emergency. They describe the country or region’s situation before the emergency and form the baseline of our maps, which are then enriched with situational information as the emergency develops. Examples include administrative boundaries, geographic features such as rivers and lakes, population distribution, and infrastructure (e.g., roads, airports, hospitals).

Technology stack

All of these datasets are gathered mainly through APIs. APIs, or Application Programming Interfaces, are sets of rules that allow different software applications to communicate with each other. By interfacing with various APIs, the pipeline is able to fetch the latest data directly from the source. This ensures that the information used in analyses is both up-to-date and consistent, providing a reliable foundation for emergency response efforts.
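As an illustration of this kind of API access, the snippet below queries the public OpenStreetMap Overpass API for hospitals within a bounding box; the pipeline's actual endpoints, queries and helper names may differ, so treat this as a sketch:

```python
import urllib.parse
import urllib.request

# Illustrative fetch against the public Overpass API (real endpoint);
# the query parameters and bounding box here are example values only.
OVERPASS_URL = "https://overpass-api.de/api/interpreter"

def build_overpass_query(key, value, bbox):
    """Overpass QL for all nodes/ways/relations with key=value inside
    bbox given as (south, west, north, east)."""
    s, w, n, e = bbox
    return f"[out:json][timeout:25];nwr[{key}={value}]({s},{w},{n},{e});out center;"

def fetch(query):
    """POST the query and return the raw JSON response body as text."""
    data = urllib.parse.urlencode({"data": query}).encode()
    with urllib.request.urlopen(OVERPASS_URL, data=data, timeout=60) as resp:
        return resp.read().decode()

query = build_overpass_query("amenity", "hospital", (15.2, 44.0, 15.5, 44.4))
# fetch(query) would return matching OSM features as JSON.
```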

Pipeline architecture showing the different steps, from data acquisition to storage.

The MapAction data pipeline is constructed using a combination of Python and Bash scripts. Python is a versatile programming language known for its readability and extensive libraries, making it ideal for data processing tasks. Bash scripts facilitate the automation of command-line operations in a Linux environment.

To ensure portability and consistency across different computing environments, the pipeline operates within a Linux Docker container. Docker is a platform that uses containers to package applications and their dependencies, allowing for seamless deployment across various systems.

Process orchestration is handled by Apache Airflow, an open-source workflow management platform. It enables the scheduling and monitoring of workflows, managing task dependencies, and ensuring that data processing steps occur in the correct order.
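The core idea Airflow manages, running tasks only after their dependencies have finished, can be illustrated with Python's standard-library topological sorter (a deliberately simplified sketch; Airflow adds scheduling, retries and monitoring on top, and the task names here are hypothetical):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must complete before it runs.
# Task names are invented for illustration, not the pipeline's real DAG.
tasks = {
    "download_cod_ab": set(),
    "download_osm": set(),
    "transform": {"download_cod_ab", "download_osm"},
    "upload_to_storage": {"transform"},
}

# static_order() yields tasks so that every task appears after all of
# its dependencies -- the ordering guarantee Airflow enforces at runtime.
order = list(TopologicalSorter(tasks).static_order())
print(order)
# The two downloads come first (in either order), then "transform",
# then "upload_to_storage".
```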

Next steps

The next phase for the MapAction Automated Data Pipeline involves rigorous validation of the results and testing during actual emergency responses. By integrating the pipeline into live operations, we can assess its effectiveness and make necessary adjustments. Initially, the tool will be made available for internal use within MapAction, allowing our GIS team to benefit from its capabilities while we continue to refine its functionality.

In the future, we aim to adopt an event-driven pipeline approach, enabling automatic initiation of data processing in response to specific triggers such as GDACS disaster alerts. Additionally, we plan to develop an interactive dashboard that allows for manual configuration of pipeline runs, giving users greater control over data collection parameters.

Ultimately, after thorough internal testing and refinement, we hope to make the pipeline available to the broader humanitarian community. By sharing this tool publicly, we aim to support other organisations in enhancing their emergency response efforts through improved data accessibility and efficiency.

MapAction’s work in humanitarian response is funded by USAID’s Bureau for Humanitarian Assistance (BHA) and the German Federal Foreign Office’s Programme for Humanitarian Assistance.

Strengthening data quality for shared humanitarian data sets can reduce human suffering

Since 2022, MapAction, the United Nations Office for the Coordination of Humanitarian Affairs (UN OCHA) and the University of Georgia, Information Technology Outreach Services (ITOS) have been working on strengthening data quality for what are referred to as Common Operational Datasets (COD): ‘best available’ shared datasets that ensure consistency and simplify the discovery and exchange of key data among humanitarian organisations.

Bad data equals more human suffering. Each time a disaster or a health epidemic strikes, data is at the heart of the first response. Yet public data is often full of gaps. If a population census is old, inaccurate or non-existent, people fall off the map. If a certain area is contested and the admin boundaries are disputed, people pay the price. Data is crucial to estimate people’s vulnerability and wellbeing in an emergency.

Why data quality is important

People overlooked due to data gaps will struggle to gain access to emergency services such as food, water, housing, health or parachute financial aid packages. Emergency service providers need good data and data management tools to make the best decisions under extreme pressure.


READ ALSO: MapAction Emergency Humanitarian Mapping Response Appeal – £105k needed ASAP

Poor data quality can undermine the effectiveness of evidence-based decision-making in the humanitarian sector. The path to improvement starts with identifying the challenges. 

Meet MapAction Head of Data Science Daniel Soares.

Humanitarian data highway stakeholders convene

In April 2024, MapAction hosted a panel with other organisations working to build ‘the humanitarian data highway’ to address the challenges on data quality. The panel was hosted at Humanitarian Networks and Partnerships Week (HNPW) in Geneva, Switzerland, and featured panellists from OCHA’s Centre for Humanitarian Data, Flowminder, ACAPS, Impact, CartONG, Heidelberg Institute for Geoinformation Technology (HeiGIT) and Start Network.


The panel, titled ‘Data quality challenges and their impact on the humanitarian response cycle’, explored data quality for emergency response, preparedness and anticipatory action. All phases of a data project were covered, from data collection to visualisation and accessibility, but also the specifics of working with open data and quality frameworks. The panellists also noted that limited resources for COD-related maintenance projects make it hard to fill this gap.

Data analytics ecosystem

More data is needed, although not all data is good data. If we don’t compile and process data by age or gender, we’ll fail to understand the specific needs of women and mothers; miscalculating the trajectory of a storm or its geographic boundaries could mean communities remain displaced, stranded and exposed. If we can’t automate the proximity of communities to vital services and support mechanisms, our partners cannot maximise the impact of emergency aid delivery. These are just some of many considerations that affect the quality of data and, ultimately, the quality of disaster response efforts.


READ ALSO: Why we must address the gender gap in humanitarian data

Where are the borders?

Within this context, MapAction has been working together with OCHA and the University of Georgia Information Technology Outreach Services (ITOS) on data quality analysis for the Common Operational Datasets (COD) for Administrative Boundaries (COD-AB).

Common Operational Datasets, or COD, are authoritative reference datasets needed to support operations and decision-making for all actors in a humanitarian response. COD are ‘best available’ datasets that ensure consistency and simplify the discovery and exchange of key data. The data is typically geo-spatially linked using a coordinate system (especially administrative boundaries) and has unique geographic identification codes (P-codes). 

These data sets are often derived from data collected by local authorities and international partners to ensure quality, but most vitally, local ownership. COD can be collected on administrative boundaries, population and more. 

LEARN MORE: Still confused as to what a COD is? This brief song should help! 

After a first discussion with OCHA and ITOS partners in 2022, a preliminary methodology proposal was developed by MapAction to assess the geospatial quality of the Common Operational Datasets for Administrative Boundaries (COD-AB). COD-AB sets can become outdated or less effective if they are not regularly updated. A country’s administrative boundaries can be redrawn more than a dozen times in a tumultuous year; an outdated dataset becomes a means of exclusion.


Through this partnership with OCHA and ITOS, we aim to create a quantitative assessment mechanism that enables the prioritisation of work to update or enhance existing COD-AB datasets. The expected output is a quality index for each COD-AB dataset based on tests of its features (geographical, metadata, etc.). A diverse team of MapAction volunteers has been formed to tackle this project, with GIS, data engineering, data science and software volunteers.

How does this partnership work?

ITOS analyses and enhances the quality of the COD-AB for countries proposed by OCHA. Recent discussions have highlighted, in particular, the need for support on a prioritisation queue. This queue would combine a geospatial quality index and a risk index per country, helping OCHA select priority countries and ITOS focus their high-level analysis on the most important ones.

Blossoming InnovationHub

As a fledgling branch of the organisation, our InnovationHub is growing wings and beginning to expand its focus areas. The number of partnerships with other organisations working on similar goals is growing.

As climate scientists continue to predict that natural disasters will increase in severity in the coming years, our goal is to make sure no one is left behind and falls off the map. “Any sufficiently advanced technology is indistinguishable from magic,” wrote Arthur C. Clarke. Only by pushing the boundaries and committing resources can we come up with ‘magic’ models and frameworks to combat the climate emergency and mitigate seasonal hazards and conflicts. Our InnovationHub marks our sustainable commitment to that cause.

This work is funded by USAID’s Bureau for Humanitarian Assistance (BHA).

MapAction would like to acknowledge the generous contribution that Max Malynowsky is making to the COD Quality project.  Max has dedicated years to contributing to administrative boundaries in the humanitarian sector.  His multiple projects, hosted at fieldmaps.io, include significant work towards making it easier to access global boundary data sets, including COD-AB.

Background: About MapAction and building the ‘humanitarian data highway’

There are dozens of agencies and organisations working in what is known as the ‘humanitarian data analytics’ sector: essentially the delivery of data sets, information management systems and in MapAction’s case, geospatial data and expertise, as well as anticipatory action risk models, applied to disaster reduction. They range from large UN agencies to small local civil society organisations. The International Organization for Migration Displacement Tracking Matrix alone brings together 7,000 data collectors and over 600 technical experts serving in over 80 countries.

Since MapAction’s inception in 2002, the organisation has mapped for people in crises in more than 150 emergencies. We have created thousands of maps used by emergency service providers.

In 2022, MapAction launched an InnovationHub to tackle some of the biggest challenges confronting data scientists, data analysts and humanitarian responders working with data in emergency relief. We still don’t have all the answers, but we believe the InnovationHub will find solutions by posing the right questions. Our work is increasingly in anticipatory action: supporting countries to co-build risk models, with the right data, to help mitigate future hazards.

LEARN MORE: MapAction and anticipatory action