Sunday, February 22, 2015

A note on strings and Groovy methods

Just a very brief post about Groovy methods and using strings both in the method name and in the method call.

Firstly, you can use sentences for method names. Both double and single quotes work in the definition, but the name has to be a plain string literal: you can’t use a GString with interpolation there.

def "check the temperature in Brisbane"() {
    return 31
}

def city = 'Brisbane'
assert "check the temperature in $city"() == 31
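
For comparison, here is a small sketch of the single-quoted form. The method name and its return value are made up for illustration and aren't from any real code; the last line shows that the call site, unlike the definition, can still interpolate.

```groovy
// Hypothetical method: a single-quoted sentence used as the method name
def 'report the forecast for Brisbane'() {
    return 'sunny'
}

// Call it with a plain string literal, qualified with `this`
assert this.'report the forecast for Brisbane'() == 'sunny'

// The call (not the definition) may use a GString with interpolation
def town = 'Brisbane'
assert "report the forecast for $town"() == 'sunny'
```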

Secondly, as that previous assert hints at, it’s possible to call a method using strings and interpolation:

def runProjectX() {

}

def method = 'runProjectX'
"$method"()

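One place this can be handy is picking a method at runtime from a list of names. The sketch below assumes two hypothetical methods, greet and farewell, that exist only for this example; it also shows that arguments are passed as in any other call.

```groovy
// Hypothetical methods, defined only to have something to dispatch to
String greet(String name) { "Hello, $name" }
String farewell(String name) { "Goodbye, $name" }

// Look each method up by its name and call it with an argument
def results = ['greet', 'farewell'].collect { action ->
    "$action"('Brisbane')
}

assert results == ['Hello, Brisbane', 'Goodbye, Brisbane']
```
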
That’s all - just a quick one.

Friday, January 9, 2015

Reading fields from PDF forms

Introduction

On a recent project we needed to set up a template to gather information from participants. At first I thought I’d just set up a Microsoft Word form and put a small script together to extract the information. A while ago I wrote a lot of scripts for extracting information from Word documents, so I felt that this would be easy. However, most of the team are on Macs and the Word forms on Mac are the old variety. I also had a go at a Groovy+docx4j script to extract the form data, but I didn’t get very far within my time box, so I gave up on it as too much effort.

I then looked at the FormsCentral app that comes with Adobe Acrobat Pro 11. I’d not used it before but it was quite straightforward to set up a form and export it as a PDF. I then grabbed the Apache PDFBox library and used it to extract the fields. In all it was a pretty straightforward bit of work.

Implementation

Design

No major design here - it’s really just a script.

Components

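The following components are utilised in the solution:

  • Groovy
  • Apache PDFBox (version 1.8.8)
  • SnakeYAML (version 1.14)
  • Adobe FormsCentral (to build the PDF form)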

Code

The code for this article is located in the Workbench Bitbucket repository. It is really just one script (under 50 lines) and a sample PDF form that the script reads in.

The script (extract.groovy) is as follows:

/*
 * A basic script that extracts form field data from a PDF form
 */

@Grab(group='org.apache.pdfbox', module='pdfbox', version='1.8.8')
import org.apache.pdfbox.pdmodel.PDDocument

//Load the document
def pdf = PDDocument.load(new File('DataGulcher.pdf'), null)

//Get the form data
def form = pdf.getDocumentCatalog().getAcroForm()

def record = [:]

//Process the form
if (form) {
    for (field in form.getFields()) {
        def name = field.getPartialName()
        if (name ==~ /fc-int01-.*/) {
            //Just ignore these as they're control fields
        } else {
            //Small tidy for the keys - make sure we replace the spaces with underscores
            def key = name.replaceAll(' ', '_')
            
            //Use normalize to tidy up multi-line fields
            def val = field.getValue()?.normalize()
            val = val?: '' 
            record[key] = val
        }        
    }
} else {
    println 'The PDF doesn\'t appear to contain a form.'
}

//Close the document
pdf.close()

//Output a YAML record
@Grab(group='org.yaml', module='snakeyaml', version='1.14')
import org.yaml.snakeyaml.Yaml
Yaml yaml = new Yaml()
print yaml.dump(record)

Discussion

The code I’ve included is pretty straightforward. I output the data in YAML as an example, but I could just as easily have pushed out XML or CSV.
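
As a rough sketch of those alternatives, the snippet below writes a record map out as CSV and as XML. The sample record is a hypothetical stand-in for the map built by the script, and the form and field element names are my own choice rather than any particular schema.

```groovy
import groovy.xml.MarkupBuilder

// Hypothetical stand-in for the record map built by the extraction script
def record = [Name: 'Jane Citizen', Notes: 'Prefers "email"']

// CSV: wrap every value in double quotes and double any embedded quotes
def quote = { value -> '"' + value.toString().replace('"', '""') + '"' }
def header = record.keySet().collect(quote).join(',')
def row = record.values().collect(quote).join(',')
def csv = header + '\n' + row
println csv

// XML: one <field> element per entry, with the key as a name attribute
def writer = new StringWriter()
new MarkupBuilder(writer).form {
    record.each { key, value -> field(name: key, value) }
}
println writer
```

The CSV half sticks to plain string handling to avoid pulling in another library; for anything beyond a one-off export, a proper CSV library would be a safer choice.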

You may notice that the PDF field names are a little odd (Name_uVH8IPMbm6VsY*FfF09oJg) - I’ve kept these as-is in my script but it’d be easy to strip out those identifiers at the end (_uVH8IPMbm6VsY*FfF09oJg).
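
A minimal sketch of that tidy-up is below. It assumes the identifier is always the last underscore-delimited segment of the name, which I’ve only inferred from the sample name above; a field name with underscores but no identifier would lose its last word. The stripId closure is a hypothetical helper and isn’t part of the script.

```groovy
// Hypothetical helper: drop the final "_<identifier>" segment from a field name.
// Assumes the identifier itself contains no underscores.
def stripId = { String name -> name.replaceFirst(/_[^_]+$/, '') }

assert stripId('Name_uVH8IPMbm6VsY*FfF09oJg') == 'Name'

// Names without any underscore are left untouched
assert stripId('Name') == 'Name'
```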

Lastly, I got a few questions as to why I focussed on a file-based format and I thought I’d note my answers below:

  • Google Forms
    • This would have been ideal, but these weren’t one-shot forms: the interviewer would likely revisit a form to add further information, and Google Forms doesn’t appear to support that.
  • Survey Monkey
    • As for Google Forms
  • A small web application (e.g. in Grails)
    • I really didn’t see the need for this level of effort in this project

There are a huge number of form-building tools out there with heaps of features. For this project, however, I just needed a file-based format that works across platforms. PDF suits this well and the extraction code is pretty straightforward.