Friday, January 9, 2015

Reading fields from PDF forms


On a recent project we needed to set up a template to gather information from participants. At first I thought I’d just set up a Microsoft Word form and put together a small script to extract the information. A while ago I wrote a lot of scripts for extracting information from Word documents so I felt that this would be easy. However, most of the team are on Macs and the Word forms on Mac are the old variety. I also had a go at a Groovy+docx4j script to extract the form data but I didn’t get very far within my time box, so I gave it away as too much effort.

I then looked at the Forms Central app that comes with Adobe Acrobat Pro 11. I’d not used it before but it was quite straightforward to set up a form and export it as a PDF. I then grabbed the Apache PDFBox library and used it to extract the fields. All in all, it was a pretty straightforward bit of work.



No major design here - it’s really just a script.


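The following components are utilised in the solution:

  • Adobe Forms Central (from Acrobat Pro 11) to build the PDF form
  • Groovy for the script
  • Apache PDFBox (1.8.8) to read the form fields
  • SnakeYAML (1.14) to output the record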


The code for this article is located in the Workbench Bitbucket repository. The code is really one script (<50 lines) and a sample PDF form that gets read in.

The script (extract.groovy) is as follows:

/*
 * A basic script that extracts form field data from a PDF form
 */

@Grab(group='org.apache.pdfbox', module='pdfbox', version='1.8.8')
import org.apache.pdfbox.pdmodel.PDDocument

@Grab(group='org.yaml', module='snakeyaml', version='1.14')
import org.yaml.snakeyaml.Yaml

//Load the document
def pdf = PDDocument.load(new File('DataGulcher.pdf'), null)

//Get the form data
def form = pdf.getDocumentCatalog().getAcroForm()

//Field name -> field value
def record = [:]

//Process the form
if (form) {
    for (field in form.getFields()) {
        def name = field.getPartialName()
        if (name ==~ /fc-int01-.*/) {
            //Just ignore these as they're control fields
        } else {
            //Small tidy for the keys - make sure we replace the spaces with underscores
            def key = name.replaceAll(' ', '_')
            //Use normalize to tidy up multi-line fields
            def val = field.getValue()?.normalize()
            //Empty fields come back as null - store an empty string instead
            val = val ?: ''
            record[key] = val
        }
    }
} else {
    println 'The PDF doesn\'t appear to contain a form.'
}

//Close the document
pdf.close()

//Output a YAML record
Yaml yaml = new Yaml()
print yaml.dump(record)


The code I’ve included is pretty straightforward. I output the data using the YAML format as an example but I could just as easily have pushed out XML or CSV.
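
As a rough idea of what the CSV option might look like, here is a minimal sketch that turns a record map into a header line and a data line. It isn't part of the script in the repository: the sample record and the csvEscape closure are made up for illustration, and it takes the simple approach of quoting every value and doubling any embedded quotes.

```groovy
//Hypothetical sample record - in the real script this map is built from the PDF fields
def record = ['Name': 'Jane Citizen', 'Notes': 'Likes "PDF" forms, mostly']

//Quote every value and double any embedded quotes so commas and quotes survive
def csvEscape = { value -> '"' + value.toString().replace('"', '""') + '"' }

//Groovy map literals keep insertion order, so keys and values line up
def header = record.keySet().collect(csvEscape).join(',')
def row = record.values().collect(csvEscape).join(',')

def csv = header + '\n' + row
println csv
```

With several forms to process you would write the header once and then one row per PDF.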

You may notice that the PDF field names are a little odd (Name_uVH8IPMbm6VsY*FfF09oJg) - I’ve kept these as-is in my script but it’d be easy to strip out those identifiers at the end (_uVH8IPMbm6VsY*FfF09oJg).
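
If you did want to drop those identifiers, something like the sketch below could do it. It assumes, based only on the one sample name above, that the identifier is always an underscore followed by 22 non-space characters at the very end of the field name. The stripId closure is a hypothetical helper and isn't in the repository script.

```groovy
//Hypothetical helper: removes a trailing "_" plus 22 non-space characters.
//The 22-character length is an assumption taken from the single sample field name.
def stripId = { String name -> name.replaceFirst(/_\S{22}$/, '') }

println stripId('Name_uVH8IPMbm6VsY*FfF09oJg')

//Names that don't end in an identifier are left alone
println stripId('Comments')
```

In the script this would be applied to name before the spaces are swapped for underscores. If Forms Central generates identifiers in other shapes the pattern would need adjusting.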

Lastly, I got a few questions as to why I focussed on a file-based format and I thought I’d note my answers below:

  • Google Forms
    • This would have been ideal but these weren’t one-shot forms. It was likely that the interviewer would revisit a form to add further information, and Google Forms doesn’t appear to provide this functionality.
  • Survey Monkey
    • As for Google Forms
  • A small web application (e.g. in Grails)
    • I really didn’t see the need for this level of effort on this project.

There’s a huge amount of form-building software out there with heaps of features. For this project, however, I just needed a file-based format that works across platforms. PDF suits this well and the extraction code is pretty straightforward.