
Data Scraping To Give Us the Inside Running
I’ve recently had a bash at data scraping and crawling – a mysterious black art whose objective is to hoover up information from public websites so that some analytics can be performed on it, using Tableau of course.
I was guided in my first scraping adventure by this Tableau Public blog, which pointed me in the direction of a free product/service called import.io.
Essentially, I downloaded and installed an ‘io browser’ locally. It lets me navigate to a web page, nominate the bits and pieces of information I want to capture, then repeat on a few more pages until the tool learns exactly what data I’m after. Once I’m happy, I kick off a crawling process that automatically looks for similar data across the website in question and collects it all in a table, which I then download into a spreadsheet. Quite clever, hey.
For my first exercise, I chose to capture some horse racing stats from RaceNet.com.au, which I wanted to put to use for me and my league in the Star Stable Spring Carnival tipping competition. (Join league 673657 if you dare.)
I have over 500 horses for which I wanted to grab data like horse name, trainer, starts, wins, prizemoney, distance won at and so on. My first attempt, crawling the site from the horse stats home page, scraped only 90 rows of data, many of them duplicates – but it did this in an impressive 15 minutes or so. So, with much effort, lots of trials and even more errors, I set the crawl up again and configured a much ‘deeper’ crawl, starting at the base site. This crawl ran for many hours, but came back with over 1800 horses.
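Duplicate rows like the ones in that first crawl are easy enough to weed out once the table has been downloaded. Here is a minimal Python sketch of one way to do it, not something import.io does for you as far as I know. The column name "horse" and the sample rows are made up for illustration, and it assumes two rows are duplicates when the horse names match after ignoring case and stray spaces.

```python
def dedupe_rows(rows, key="horse"):
    """Keep the first row seen for each horse name.

    Names are compared after collapsing whitespace and ignoring case.
    The column name "horse" is an assumption, not the real export header.
    """
    seen = set()
    unique = []
    for row in rows:
        name = " ".join(row[key].split()).casefold()
        if name not in seen:
            seen.add(name)
            unique.append(row)
    return unique


# Made-up sample rows standing in for the scraped table.
rows = [
    {"horse": "Example Runner", "starts": 10, "wins": 3},
    {"horse": "example runner ", "starts": 10, "wins": 3},
    {"horse": "Another Horse", "starts": 5, "wins": 1},
]
unique_rows = dedupe_rows(rows)
```

Keeping the first occurrence is an arbitrary choice; if later rows carried fresher stats, keeping the last one would make more sense.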
This gave me enough info to analyse the top tiers of horses in my potential stable – the dashboard turned out a treat – here is a thumbnail:
but even better – here is a link to the original.
Unfortunately, only 200 of the scraped horses were on my list of 500!! So the next step is to restart my crawler from a page containing just the URLs of the 500 horses I want, with only one layer of depth. That is what pushed me to create this blog page. I’m going to add the list of hyperlinks below, kick off a crawler on my very own first blog and see what happens. Maybe there will be some more blogging – stay tuned.
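Working out which of the wanted horses a crawl actually found is a simple list comparison. A rough Python sketch is below; the function names and the sample horse names are invented for illustration, and it assumes the names in both lists differ only by case or spacing.

```python
def normalise(name):
    """Collapse whitespace and ignore case so near-identical names match."""
    return " ".join(name.split()).casefold()


def split_by_coverage(wanted, scraped):
    """Return (found, missing): wanted names that were / were not scraped."""
    scraped_keys = {normalise(n) for n in scraped}
    found = [n for n in wanted if normalise(n) in scraped_keys]
    missing = [n for n in wanted if normalise(n) not in scraped_keys]
    return found, missing


# Made-up names standing in for the wanted list and the crawl results.
wanted = ["Example Runner", "Mister O'Example", "Third Horse"]
scraped = ["example runner", "Third  Horse", "Unlisted Horse"]
found, missing = split_by_coverage(wanted, scraped)
```

The `missing` list is then exactly the set of horses that still needs a targeted crawl.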
27-10-2013 Update!! Well – the ‘recrawl’ worked. It only took around 20 minutes to get just the list of horses I wanted, using this page as the kick-off. Unfortunately, any URLs with apostrophes did not get picked up, even though I hard-wired the URL to use %27 instead of the apostrophe. I will be sure to let the import.io folks know. About to do another recrawl and refresh the data.
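For anyone building a similar kick-off page, the %27 encoding does not have to be hard-wired by hand. The Python sketch below percent-encodes each name with the standard library and wraps it in a link. The base URL and the name-to-URL pattern are placeholders I have assumed for illustration, not the real RaceNet URL scheme, and this does nothing to explain why the crawler skipped the apostrophe URLs.

```python
from html import escape
from urllib.parse import quote

# Hypothetical base URL; the real site's URL pattern is not shown here.
BASE_URL = "https://example.com/horse/"


def horse_url(name):
    """Build a URL for a horse name, percent-encoding unsafe characters.

    Spaces become hyphens (an assumed convention); an apostrophe becomes %27.
    """
    slug = name.strip().replace(" ", "-")
    return BASE_URL + quote(slug, safe="-")


def horse_link(name):
    """Wrap the URL in an HTML anchor, escaping the visible text."""
    return '<a href="{}">{}</a>'.format(horse_url(name), escape(name))


# Made-up names, one of them with an apostrophe.
names = ["Example Runner", "Mister O'Example"]
links = [horse_link(n) for n in names]
```

Pasting the resulting anchors into a page gives the crawler a flat list of links to follow at a depth of one.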