When you want to spend as little money as possible on as many Lego sets as possible, it’s useful to get updates when things go on sale, or to be able to snag a set before it is retired.
So I decided to throw together a small collection of scripts to scrape https://shop.lego.com and send me an email every time a price changes, a set gets marked as retired, or new sets get released. I’ve also thrown together a small page so I can check out the contents of the database and do some simple filtering.
Now, while I do not like scraping, The Lego Group made it fairly easy to do so. Their site is decently laid out and there is a clear structure.
I developed the initial scripts on my personal machine. Everything was scraping fine. The plan was to host this here and have the results available online. However, my current webhost didn’t have lxml (IIRC), and since it’s a shared host I’m not allowed to install my own modules. So I decided to take a Raspberry Pi I had lying around and put it to work.
The main grabber file is legofinder.py. It’s run by a cron job at midnight every night. It checks three sections of Lego’s website: the sales section, the retiring section, and the new section. It stores every set checked in a dictionary. Each of these sections can span multiple pages. We could parse the first page to get the last page number and then iterate from 1 to N, but a page number that is too high just returns an error. So we can start at page 1 and increment until a page returns an error. This means we don’t have to treat the first page as a special case.
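Here is a minimal sketch of that loop. It is not the code from legofinder.py: the function name scrape_all_pages, the fetch_page callback, and the fake page data are all made up for illustration, and the callback is assumed to return None when the site returns an error.

```python
from typing import Callable, Dict, Optional


def scrape_all_pages(
    fetch_page: Callable[[int], Optional[Dict[str, dict]]]
) -> Dict[str, dict]:
    """Collect sets from page 1 onward until fetch_page signals an error.

    fetch_page is a hypothetical callback: it returns a dict of
    {set_id: details} for a valid page, or None once the page number
    is past the end (the "error" case).
    """
    sets: Dict[str, dict] = {}
    page = 1
    while True:
        result = fetch_page(page)
        if result is None:  # too-high page number: we are done
            break
        sets.update(result)
        page += 1
    return sets


# Stand-in for the real site; IDs and prices are invented.
FAKE_PAGES = {
    1: {"1001": {"price": 9.99}},
    2: {"1002": {"price": 19.99}},
}


def fake_fetch(page: int) -> Optional[Dict[str, dict]]:
    return FAKE_PAGES.get(page)


all_sets = scrape_all_pages(fake_fetch)
```

Page 1 goes through exactly the same path as every other page, which is the whole point of not looking up the last page number first.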
Then we get a list of all the sets in our database that we haven’t checked yet and check those as well, so we know if a set is no longer on sale, etc. Once we’ve checked all the sets, we take all the sets that have changed and email the results.
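Roughly, those two steps could look like the sketch below. The helper names unchecked_ids and build_report, the subject line, and the shape of the changes dictionary are my assumptions for illustration, and the message is only built here, not sent.

```python
from email.message import EmailMessage
from typing import Dict, Iterable, List


def unchecked_ids(db_ids: Iterable[str], checked: Dict[str, dict]) -> List[str]:
    """Return the IDs that are in the database but were not seen tonight."""
    return sorted(set(db_ids) - set(checked))


def build_report(changes: Dict[str, str]) -> EmailMessage:
    """Build (but do not send) a plain-text summary of the changed sets.

    changes maps a set ID to a short description of what changed.
    """
    msg = EmailMessage()
    msg["Subject"] = f"Lego updates: {len(changes)} set(s) changed"
    lines = [f"{set_id}: {what}" for set_id, what in sorted(changes.items())]
    msg.set_content("\n".join(lines))
    return msg


# Invented example data.
leftover = unchecked_ids(["1001", "1002", "1003"], {"1002": {}})
report = build_report({"1001": "price 9.99 -> 7.99"})
```

Sending the message would then be a job for smtplib or whatever mail setup the machine has.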
The legoset.py file is mostly a helper file. It handles saving the result to the database, can spit out a JSON representation of the class, and can tell whether the current object is different from the version held in the database.
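To give an idea of the shape of such a class, here is a small sketch using a dataclass. The field names and the method names to_json and differs_from are assumptions, not the actual contents of legoset.py, and the database saving part is left out.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LegoSet:
    # Hypothetical fields; the real class may track different ones.
    set_id: str
    name: str
    price: float
    retiring: bool = False

    def to_json(self) -> str:
        """Return a JSON representation of this set."""
        return json.dumps(asdict(self), sort_keys=True)

    def differs_from(self, stored: Optional["LegoSet"]) -> bool:
        """True if there is no stored version or any field has changed."""
        return stored is None or self != stored


# Invented example data.
scraped = LegoSet("1001", "Example Set", 7.99)
stored = LegoSet("1001", "Example Set", 9.99)
changed = scraped.differs_from(stored)
```

The dataclass generates the field-by-field equality check, so the comparison against the stored version is a one-liner.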
legoSettings.py handles all of the database and settings access. I’ll probably want to split off the database stuff into its own module. At the very least, I can reduce insert, update, delete into a more generic execute. When mocking it up, I believed the differences between each would be more significant. After implementing them, I was able to see they’re all essentially the same function.
Honestly, the biggest difference between the functions is whether each one calls fetchall, fetchone, or commit. I can probably just use fetchall and deal with single-row lists.
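A generic execute along those lines might look like this. I’m using sqlite3 here purely as an assumption so the sketch is runnable; the table and column names are made up too.

```python
import sqlite3
from typing import List, Sequence, Tuple


def execute(conn: sqlite3.Connection, sql: str, params: Sequence = ()) -> List[Tuple]:
    """Run any statement, always fetchall, always commit.

    For INSERT/UPDATE/DELETE the fetchall simply returns an empty list,
    and for SELECT the commit is harmless, so one function covers all cases.
    """
    cur = conn.execute(sql, params)
    rows = cur.fetchall()
    conn.commit()
    return rows


# Throwaway in-memory database with invented data.
conn = sqlite3.connect(":memory:")
execute(conn, "CREATE TABLE sets (set_id TEXT PRIMARY KEY, price REAL)")
execute(conn, "INSERT INTO sets VALUES (?, ?)", ("1001", 9.99))
execute(conn, "UPDATE sets SET price = ? WHERE set_id = ?", (7.99, "1001"))
rows = execute(conn, "SELECT set_id, price FROM sets WHERE set_id = ?", ("1001",))
```

A caller that expects a single row just takes rows[0] after checking the list isn’t empty, which replaces fetchone.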
lego.py is the main script that handles interactions between the viewing web site and the database. It can handle GET and POST requests. Most GET requests are basically filters. However, if you pass it an action of check and a set id, it will check Lego’s site for that set. This is good for getting information on sets that haven’t been in any of the three categories the cron job checks.
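The GET side of that can be pictured as a small dispatcher. This is only a sketch: the function name route_get and the query parameter names action and id are my guesses for illustration, not necessarily what lego.py uses.

```python
from typing import Dict, Tuple, Union
from urllib.parse import parse_qs


def route_get(query_string: str) -> Tuple[str, Union[str, Dict[str, str]]]:
    """Decide what a GET request is asking for.

    Returns ("check", set_id) for an explicit lookup on Lego's site,
    otherwise ("filter", params) to filter what is already in the database.
    """
    # parse_qs gives lists of values; keep the first value of each key.
    params = {key: values[0] for key, values in parse_qs(query_string).items()}
    if params.get("action") == "check" and "id" in params:
        return ("check", params["id"])
    params.pop("action", None)  # anything left over is treated as a filter
    return ("filter", params)


check_request = route_get("action=check&id=1001")
filter_request = route_get("retiring=1")
```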
The POST request allows you to mark a set as either Have or Track. The idea behind this is that I’ll be able to limit which sets I get price and retiring updates for. It’s a job for future me.
The link to the repository is below.