By: ABW Data Scientist Dr. Claudia Orellana-Rodriguez, ABW Outreach and Event Manager Irina Ioniță, Pipple Data Scientist Sanne van den Bogaart & MapAction Head of Data Science Daniel Soares
The key step of any data and analytics project? Data, data, data, and did we mention collecting the right data? Easier typed than executed! A new partnership between MapAction and Netherlands-based Analytics for a Better World (ABW) sought to give new life to data trapped in PDFs, removing another data barrier to humanitarian response.
When data sources come in well-structured, actionable and, most importantly, accessible formats, collecting data is simple. Otherwise, it can be as time-consuming as working with Internet Explorer in 2024. So how do you keep data collection a task that boosts your efficiency (and enthusiasm) for a data and analytics project?
To find out, MapAction, Analytics for a Better World and Pipple joined forces in 2024 on an information extraction pilot project aimed at speeding up data collection for risk projects.
How does MapAction take action?
MapAction is an international charity based in the United Kingdom specialising in Information Management for Disaster Response and Preparedness. Since the charity was founded in 2002, its staff and 75+ volunteers have deployed to more than 140 emergency responses and 500 preparedness and capacity-building missions.
MapAction has a hybrid make-up, with 32 staff members and 75+ highly skilled GIS and data volunteers. In recent years, MapAction has developed new areas of work, including Disaster Risk Reduction, Anticipatory Action, Health, and Technology and Innovation. In recent months we have, for example, deployed to emergency responses in The Gambia, Belize, Grenada and St Vincent and the Grenadines; developed risk and anticipatory action projects in Eswatini, Madagascar, Ecuador and Colombia; worked with child vaccination information in several countries in West and Central Africa; and supported data quality standards with OCHA.
Collaborations for Impact
Analytics for a Better World is a Netherlands-based non-profit organisation, set up as a joint effort between ORTEC and the University of Amsterdam. Their vision centres on analytics as a powerful tool for reaching the Sustainable Development Goals (SDGs). Analytics for a Better World brings together the combined strengths of nonprofits, academia and the business world around the theme of SDG-related analytics in several activities. To empower nonprofits, they educate their C-level executives, management and specialists in how to use analytics to further their objectives. To deepen their support for nonprofits in creating impact with analytics, they build analytical roadmaps and jointly deploy analytics projects. Some of their most inspiring collaborations have been with The Ocean Cleanup, the 510 Dutch Red Cross Initiative, and their annual fellowship bringing together NGO workers and data specialists from all over the world. In addition, they conduct and stimulate applied research on analytics aimed at contributing to the SDGs. Because accessibility of knowledge matters, they share everything they create as open source through their repository.
Pipple is a data & AI agency based in Eindhoven, the Netherlands. They are a team of creative mathematicians and engineers, specialised in solving complex issues through data & AI. Pipple was founded in 2016 and has since delivered more than 200 successful solutions to its customers. They have also had ongoing collaborations with various nonprofit organisations such as the Red Cross, The Ocean Cleanup and, more recently, Analytics for a Better World.
Working with the INFORM Subnational Risk Index
Firstly, what exactly is the INFORM Subnational Risk Index?
An INFORM Subnational Risk Index shows a detailed picture of risk and its components within a single region or country. It covers not only hazard exposure (e.g. earthquakes, floods and conflicts) but also a country’s vulnerabilities, such as disease prevalence and poverty, as well as its coping capacity.
Since July 2023, in partnership with the European Commission’s INFORM Risk Index, MapAction has been working to support national and subnational disaster managers in updating or rebuilding their disaster forecasts, mitigation tools and risk atlases. During this period, it has worked on projects in Eswatini, Saint Kitts and Nevis, Niger, Lebanon and Madagascar.
An important part of an INFORM Risk project is data collection on hazards, vulnerability and coping capacity. Because we are looking for subnational data (region, department, district, etc.), we sometimes find this data in a PDF text report instead of a spreadsheet or a geospatial file. Data often finds its death in PDFs because the format makes it so hard to access. A basic tool that can scan big reports in search of tables and/or key indicators would therefore save a lot of time.
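To make the idea of scanning a big report for key indicators concrete, here is a minimal sketch. It is not the project's code: the keyword lists, the `normalise` helper and the `find_indicator_pages` function are all hypothetical, and the sketch assumes the text of each page has already been extracted from the PDF by some other tool.

```python
import re
import unicodedata

# Hypothetical indicator keywords per language; a real project would keep
# its own list for each INFORM component.
KEYWORDS = {
    "en": ["poverty", "malnutrition", "flood"],
    "fr": ["pauvreté", "malnutrition", "inondation"],
    "es": ["pobreza", "desnutrición", "inundación"],
}


def normalise(text):
    """Lowercase and strip accents so 'Pauvreté' matches 'pauvrete'."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_indicator_pages(page_texts, keywords=KEYWORDS):
    """Return {page_number: sorted matched keywords} for pages with a hit.

    page_texts is a list of strings, one per page; page numbers start at 1.
    """
    terms = {normalise(word) for words in keywords.values() for word in words}
    hits = {}
    for number, text in enumerate(page_texts, start=1):
        words = set(re.findall(r"\w+", normalise(text)))
        matched = sorted(terms & words)
        if matched:
            hits[number] = matched
    return hits
```

Only the pages returned by `find_indicator_pages` would then need to be passed to a table extractor, which keeps the expensive step away from the bulk of a long report.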
A new life for data trapped in PDFs
For this pilot project between MapAction, ABW and Pipple, the target was to develop a tool that takes one PDF file as input and returns one or several spreadsheets containing the data indicators per administrative division, plus any relevant metadata. The tool should be able to accommodate multiple languages and be written as a Python script.
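As an illustration of that target output, the sketch below turns one already-extracted table into long-format records (one row per administrative division and indicator) and writes them to a CSV file. It is an assumption about what such a spreadsheet could look like, not the tool's actual schema: the function names and the column names (`admin_division`, `indicator`, `value`, `source_file`, `page`) are hypothetical, and the sketch assumes the first column of the table holds the division name.

```python
import csv
from pathlib import Path

# Hypothetical output columns; the last two carry the metadata.
FIELDS = ["admin_division", "indicator", "value", "source_file", "page"]


def table_to_records(rows, source_file, page):
    """Turn a wide table into long-format records.

    rows[0] is the header: its first cell labels the administrative
    division column and the remaining cells name the indicators. Every
    other row holds one division followed by its indicator values.
    """
    header, *body = rows
    indicators = header[1:]
    records = []
    for row in body:
        division, values = row[0], row[1:]
        for indicator, value in zip(indicators, values):
            records.append({
                "admin_division": division,
                "indicator": indicator,
                "value": value,
                "source_file": source_file,
                "page": page,
            })
    return records


def write_records(records, out_path):
    """Write the records to a CSV file and return its path."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)
    return Path(out_path)
```

Keeping the source file and page number next to every value makes it possible to trace a figure in the risk model back to the report it came from.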
READ ALSO: Accelerating humanitarian response: Inside MapAction’s Automated Data Pipeline
This project was expected to have the following impacts:
- Reduce the time needed to collect data from national or subnational reports,
- Enable the exploration of a larger set of reports and sources,
- Increase risk model completeness and enable national and regional disaster management agencies to make more informed decisions.
After the initial development phase by ABW and Pipple, MapAction now enters the utilisation phase, in which its team will adapt the scripts to its workflow and ongoing projects.
Because accessibility of knowledge betters the world, the Python scripts and user instructions are open source and available on ABW’s public GitHub repository.
How did we make this happen?
Upon first inspection of the sample PDF files, it became clear that we needed to explore different open-source Python libraries to see how they handled non-standard table structures within the documents.
Among the first packages that we explored were:
- Camelot – https://camelot-py.readthedocs.io/en/master/index.html
- Tesseract-ocr – https://github.com/tesseract-ocr/tesseract
Both libraries worked well with standard tables, e.g., vertical tables with a single-row header and well-defined columns; however, when it came to more varied formats, they did not manage to identify the tables correctly.
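For readers who want to try a similar first pass, the sketch below shows one way to run Camelot over a PDF and set aside tables it parsed with low confidence. It is not the code used in the pilot: the function names and the 90% accuracy threshold are hypothetical, and it assumes the `camelot-py` package is installed. Camelot offers a `lattice` flavor for tables with ruled lines and a `stream` flavor for tables separated by whitespace, and it attaches a parsing report to each table it finds.

```python
def extract_with_camelot(pdf_path, pages="1-end"):
    """Run both Camelot flavors and collect each table with its report."""
    # Third-party import (camelot-py), kept inside the function so the
    # helper below can be used without Camelot installed.
    import camelot

    found = []
    for flavor in ("lattice", "stream"):
        for table in camelot.read_pdf(pdf_path, pages=pages, flavor=flavor):
            report = table.parsing_report
            found.append({
                "flavor": flavor,
                "page": report["page"],
                "accuracy": report["accuracy"],  # percentage, 0 to 100
                "rows": table.df.values.tolist(),
            })
    return found


def split_by_accuracy(found, min_accuracy=90.0):
    """Separate tables to keep from tables that need manual review.

    min_accuracy is a hypothetical threshold, not a recommended value.
    """
    keep = [t for t in found if t["accuracy"] >= min_accuracy]
    review = [t for t in found if t["accuracy"] < min_accuracy]
    return keep, review
```

A call such as `split_by_accuracy(extract_with_camelot("report.pdf"))` gives a quick impression of how many tables in a report are simple enough for this kind of extraction.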
Continuing our exploration, we tested one more library: GMFT.
GMFT is a toolkit for converting PDF tables to many formats. It is lightweight, modular, and performant. While still under development, it already works very well and outperformed the packages we had tried before, so we chose it as the final approach for the project. The package works out of the box; however, small alterations were required to improve performance for our specific project.
What comes next?
The information extraction code developed during this pilot project will have its first application in MapAction’s support to the Southern African Development Community (SADC) regional subnational INFORM Risk Index. SADC is composed of 16 countries, home to over 360 million people, with over 200 level-1 administrative divisions, which is the granularity of the model. Given the scale and level of detail of this model, information will be assembled from different national reports, and the tool developed during this project will be very useful for processing a large amount of data efficiently. The project will run from September 2024 to July 2025 as a collaboration between the SADC DRR unit, MapAction, GIZ and UNDP, building on the experience gathered in recent projects in Eswatini and Madagascar.
READ ALSO: How MapAction is using data to reduce human suffering in Madagascar
VIEW ALSO: Madagascar (Video): Impact of MapAction Anticipatory Action programmes
Better Together – MapAction & Analytics for a Better World
We hope this first pilot project will be the beginning of a long-term collaboration between MapAction and Analytics for a Better World. Both organisations share the same core values of improving lives and reducing suffering through data and technical expertise. This project speaks directly to ABW’s vision of connecting the private sector with the non-profit one, with Pipple’s Data Scientist, Sanne van den Bogaart, playing a key role in the development of the tool.
In June 2024, MapAction held its own annual disaster simulation exercise, MapEx, in the Peak District in the UK. This was the 18th and largest ever edition, featuring partners from the British Red Cross, UN agencies and more. More than 100 MapAction staff, volunteers, partners, donors and observers took part in the two-day ‘emergency datathon’ simulation, with some teams working on anticipatory action components for the first time.
READ MORE: Simulating anticipatory actions as part of disaster management: MapEx 2024
After the success of the 2023 edition, the ABW annual conference held its 2024 edition on May 14th. The conference gathered an array of speakers and panellists representing ABW’s key stakeholders: nonprofits, researchers, and companies. Together, we reflected on the impact and progress of ABW, shared achievements, and outlined future plans. Engaging discussions delved into pressing topics in analytics, including the challenges posed by AI.
READ MORE: The ABW Annual Conference
The MapAction side of this work was part of a broader programme on Anticipatory Action funded by the German Federal Foreign Office’s Humanitarian Assistance programme.