20 Jan 2012

Grails & HtmlUnit

I was looking for a Java HTML screen scraping library to use with Grails. One post on StackOverflow was particularly helpful and I decided to try HtmlUnit based largely on that.

There's a brief example here and a good introduction on the HtmlUnit page.

Install and Configure HtmlUnit for your Grails project

There are lots of posts on StackOverflow and other Grails discussion forums about how to add JARs to a Grails project. For HtmlUnit, here are the steps.

First, download the dependency JAR files listed here and place them in your project's lib directory (the default on Linux is something like /home/Grails/<project>/lib).

Next, edit the BuildConfig.groovy file and add this line in the dependencies section:

dependencies {
        compile 'net.sourceforge.htmlunit:htmlunit:2.9'
}

You should now be able to build your Grails project and use HtmlUnit.
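
For context, here is a minimal sketch of where that line might sit inside BuildConfig.groovy. The surrounding layout follows the default file that Grails 2.x generates, which is an assumption about your project; only the compile line comes from the steps above. With mavenCentral() enabled, Grails should be able to resolve HtmlUnit and its transitive dependencies on its own, in which case the manually copied JARs may turn out to be redundant.

```groovy
// grails-app/conf/BuildConfig.groovy (sketch; layout assumed from the Grails 2.x default)
grails.project.dependency.resolution = {
    // inherit the default Grails dependencies
    inherits("global") {
    }
    log "warn"

    repositories {
        grailsPlugins()
        grailsHome()
        grailsCentral()
        // needed so the HtmlUnit artifact can be resolved remotely
        mavenCentral()
    }

    dependencies {
        // the line from the step above
        compile 'net.sourceforge.htmlunit:htmlunit:2.9'
    }
}
```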

HtmlUnit WebClient Settings

I also found that the following settings helped resolve (and suppress) a number of JavaScript-related issues I encountered on the sites I wanted to scrape:

def webClient = new WebClient(BrowserVersion.FIREFOX_3)
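
As a rough, self-contained sketch of how a client like this might be set up and used with HtmlUnit 2.9: the three setter calls below are commonly used ways to quiet script and status-code errors in that version, but they are an assumption here and not necessarily the exact settings referred to above. The URL is a placeholder.

```groovy
import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.html.HtmlPage

def webClient = new WebClient(BrowserVersion.FIREFOX_3)

// Assumed settings (HtmlUnit 2.9 API, where these setters live on WebClient itself)
webClient.setThrowExceptionOnScriptError(false)        // don't abort on JavaScript errors in the page
webClient.setThrowExceptionOnFailingStatusCode(false)  // don't throw on 4xx/5xx responses
webClient.setCssEnabled(false)                         // skip CSS processing, usually not needed for scraping

try {
    // placeholder URL; replace with the page you want to scrape
    HtmlPage page = webClient.getPage('http://example.com/')
    println page.titleText
} finally {
    // release the windows and resources held by the client
    webClient.closeAllWindows()
}
```

Later HtmlUnit releases moved these switches to webClient.getOptions(), so the setter names above apply to the 2.9 line used in this post.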