A US journalist who founded a not-for-profit news site is to launch an open-source programme that will turn scanned documents – such as receipts – into structured data.
Charles C. Duncan Pardo, founding editor of Raleigh Public Record, plans to launch the tool, DocHive, next month.
Duncan, who launched the online-only site dedicated to covering the capital city of North Carolina four years ago, told Journalism.co.uk he started developing the tool to speed up the process of extracting data from scanned campaign finance returns.
Faced with manually entering the data from receipts into Excel "one row at a time", Duncan decided he needed to find a more efficient way of pulling data from scanned documents and converting it to a spreadsheet.
Raleigh Public Record, which has an annual budget of $80,000, has three part-time members of staff and "relies very heavily on freelancers".
"We don't have a lot of resources, but we do have a lot of friends and connections", Duncan said, including his brother Edward, a software developer, who was keen to help him tackle the problem.
Duncan then successfully applied for grant funding to "really get rolling".
In an announcement post on the Reporters' Lab site, Duncan explains how it works. The programme converts the PDF into an image file and then "uses a template to break a page up into smaller sections".
"For example, in the campaign finance documents, DocHive will make separate sections for donor name, occupation, donation amount and all the other fields. Then, the programme will take each of those sections and turn it into a separate image file.
"The software takes that small image and uses optical character recognition technology to read the words or numbers and insert them into a CSV file."
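The steps Duncan describes (template, section, character recognition, CSV) can be illustrated with a rough Python sketch. This is not DocHive's code: the template layout, the function names and the `fake_ocr` stand-in are all assumptions made for illustration, and the "page" is a grid of text characters rather than a real scanned image. A working tool would hand each cropped section to a real optical character recognition engine.

```python
import csv
import io

# Hypothetical template: field name -> (left, top, right, bottom) box.
# Coordinates are character columns and rows in this toy example;
# a real template would use pixel coordinates on the page image.
TEMPLATE = {
    "donor_name": (0, 0, 12, 1),
    "occupation": (12, 0, 24, 1),
    "amount": (24, 0, 32, 1),
}


def crop(page, box):
    """Cut one section out of the page, as the template dictates."""
    left, top, right, bottom = box
    return [row[left:right] for row in page[top:bottom]]


def fake_ocr(section):
    """Stand-in for a real OCR engine: the 'image' is already text."""
    return " ".join(row.strip() for row in section).strip()


def page_to_row(page, template, ocr):
    """Split a page into sections and read each one into a field."""
    return {field: ocr(crop(page, box)) for field, box in template.items()}


def write_csv(rows, fieldnames, out):
    """Write the extracted fields as CSV to any file-like object."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# A toy "scanned page": one fixed-width line of a campaign finance form.
page = ["Jane Doe".ljust(12) + "Teacher".ljust(12) + "250.00".ljust(8)]

row = page_to_row(page, TEMPLATE, fake_ocr)
buffer = io.StringIO()
write_csv([row], list(TEMPLATE), buffer)
print(buffer.getvalue())
```

Because the template is kept separate from the extraction code, handling a different form would in principle only mean supplying a different set of boxes, which is the kind of sharing Duncan hopes to encourage.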
DocHive is due to launch at the NICAR conference next month (28 February), and Duncan hopes journalists and developers working for news sites elsewhere will be able to adapt it to suit specific purposes.
"We are creating a wiki for the documentation," he said. "The hope is as other people tackle different documents, they will share those templates with others who are facing similar problems with documents."
Duncan will post updates on DocHive to Reporters' Lab and Raleigh Public Record.