Parsing PDFs is a process of extracting, analyzing, and structuring data.
At DataOx, we divide PDF documents into two types according to how structured they are. The first type, structured, covers PDF files whose text and tables are stored as electronic text in the native PDF format. The second type, unstructured, covers PDF files whose text and tables were placed in the document as photos or images.
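As a rough illustration of this split, the sketch below labels a PDF by checking whether its pages carry an electronic text layer. It is not our production code: the `pypdf` library, the function names, the `min_chars` threshold, and the extra "mixed" label for documents that contain both kinds of pages are all assumptions made for the example.

```python
from typing import Iterable


def classify_page_texts(page_texts: Iterable[str], min_chars: int = 20) -> str:
    """Label a PDF from the text extracted from each of its pages.

    min_chars is an assumed threshold: a page whose text layer holds fewer
    non-whitespace characters is treated as an image-only page.
    """
    texts = list(page_texts)
    if not texts:
        return "unstructured"
    # Count pages that have a meaningful amount of electronic text.
    with_text = sum(
        1 for t in texts if len("".join((t or "").split())) >= min_chars
    )
    if with_text == len(texts):
        return "structured"
    if with_text == 0:
        return "unstructured"
    return "mixed"  # hypothetical third label, not one of the two types above


def classify_pdf(path: str) -> str:
    """Read each page's text layer with pypdf (third-party, imported lazily)."""
    from pypdf import PdfReader

    reader = PdfReader(path)
    return classify_page_texts(page.extract_text() for page in reader.pages)
```

A call such as `classify_pdf("report.pdf")` (a placeholder file name) would then indicate whether the file can go straight to text and table extraction or needs OCR first.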
Extracting data from structured PDF files is not technically complex, but it does require a lot of manual quality assurance work. We use Tabula to extract the text and tables from these PDFs.
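To give a sense of what this looks like in practice, here is a minimal sketch that pulls tables out of a PDF and flags rows a reviewer should look at. It assumes the `tabula-py` Python wrapper around Tabula (which needs Java installed); the helper names and the two checks (row width and empty cells) are illustrative choices, not a description of our full QA process.

```python
import math
from typing import List, Sequence


def _is_blank(cell: object) -> bool:
    """Treat None, NaN, and whitespace-only strings as empty cells."""
    if cell is None:
        return True
    if isinstance(cell, float) and math.isnan(cell):
        return True
    return str(cell).strip() == ""


def qa_flags(rows: Sequence[Sequence[object]]) -> List[str]:
    """Return human-readable warnings for one extracted table."""
    if not rows:
        return ["table is empty"]
    flags = []
    width = len(rows[0])  # the first row (header) sets the expected width
    for i, row in enumerate(rows):
        if len(row) != width:
            flags.append(f"row {i}: expected {width} cells, got {len(row)}")
        blanks = sum(1 for cell in row if _is_blank(cell))
        if blanks:
            flags.append(f"row {i}: {blanks} empty cell(s)")
    return flags


def extract_tables(path: str) -> List[List[list]]:
    """Extract every table with tabula-py (third-party, imported lazily)."""
    import tabula

    frames = tabula.read_pdf(path, pages="all", multiple_tables=True)
    # Convert each DataFrame to a list of rows, header first.
    return [[list(df.columns)] + df.values.tolist() for df in frames]
```

Running `qa_flags` over each table returned by `extract_tables` gives a short list of rows to check by hand, which narrows the manual work without replacing it.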
Unstructured PDFs, as noted above, must be processed with optical character recognition (OCR) before their data can be parsed. OCR is a fairly complex solution that takes experimentation to recognize data correctly, and even then it does not guarantee 100% accuracy. It is therefore very important to set up a quality assurance (QA) process to make sure that data is extracted correctly and that no piece of information is missed. In addition to basic QA, the data often has to be cleansed, because OCR output may include a lot of “garbage data.”
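The cleansing step can be sketched as a simple filter over OCR output. The example assumes Tesseract through the `pytesseract` wrapper, with Pillow to load the image; neither tool is named above, and the `min_conf` cutoff of 60 and the "must contain a letter or digit" rule are illustrative values that would need tuning on real documents.

```python
from typing import List, Sequence


def clean_ocr_words(
    words: Sequence[str], confs: Sequence[object], min_conf: float = 60.0
) -> List[str]:
    """Drop empty, low-confidence, and symbol-only tokens from OCR output."""
    kept = []
    for word, conf in zip(words, confs):
        token = word.strip()
        if not token:
            continue  # layout boxes with no text
        if float(conf) < min_conf:
            continue  # the engine itself was unsure about this token
        if not any(ch.isalnum() for ch in token):
            continue  # stray rules, borders, and specks read as symbols
        kept.append(token)
    return kept


def ocr_page(image_path: str) -> List[str]:
    """Run Tesseract on one page image (third-party imports are lazy)."""
    from PIL import Image
    import pytesseract

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    # "text" and "conf" are parallel lists, one entry per detected box.
    return clean_ocr_words(data["text"], data["conf"])
```

A filter like this only removes the obvious noise. Tokens that are confidently wrong still get through, which is why the QA process remains necessary.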