sitemaps_parser, a gem for parsing XML sitemaps

Permalink: https://accidental.cc/notes/2016/sitemaps-rb/

sitemaps_parser is a simple gem for discovering, fetching and parsing XML sitemaps according to the spec at sitemaps.org. There are a ton of gems out there that can generate a sitemap file, but I couldn’t find any that supported parsing.

Current Version

require 'sitemaps'

# parse a sitemap from a string
Sitemaps.parse("<xml ns=\"...")

# fetch and parse a sitemap from a known url
sitemap = Sitemaps.fetch("http://termscout.com/sitemap.xml")

# fetch and parse sitemaps, excluding paths matching a filter, and limiting to the top 200
sitemap = Sitemaps.fetch("https://www.digitalocean.com/sitemaps.xml.gz", max_entries: 200) do |entry|
  entry.loc.path !~ /blog/i
end

# attempt to discover sitemaps for a site without a known sitemap location. Checks robots.txt and some common locations.
sitemap = Sitemaps.discover("https://www.digitalocean.com", max_entries: 200) do |entry|
  entry.loc.path !~ /blog/i
end

# sitemap usage
sitemap.entries.first
#> Sitemaps::Entry(loc: 'http://example.com/page', lastmod: DateTime.utc, changefreq: :monthly, priority: 0.5)

urls = sitemap.entries.map(&:loc)
#> ['http://example.com/page', 'http://example.com/page2', ...]

Why?

I’ve been working on a project at work lately that involves a lot of web crawling. Given a domain, we crawl the site looking for information we can index: sort of like a search engine, except we crawl on demand, rather than storing a huge index perpetually. Our particular use-case involves finding a handful of “important” pages to extract a specific kind of data - imagine a maps crawler trying to find the locations page of a pizza chain, or a calendar for extracting events.

There’s an inherent trade-off in those requirements: finding “important” pages when you don’t know the structure of the site is hard and slow. I actually think it might be impossible in general, given its similarity to the Halting Problem, but even in the relatively normal case, you have to crawl a good bit of the site before you can say with confidence that you either found the most important pages or that they aren’t present. And since the only way to increase that confidence is to crawl more, the process only gets slower. But when it’s slow, it’s less useful for the on-demand use-case.1

Yay Sitemaps

Luckily for us, many sites implement XML sitemaps based on the spec at sitemaps.org, which can be submitted to search engines to help them better crawl your site for indexing. A sitemap presents a set of urls, with metadata about the frequency of updates and indexing priority. That means we can make an initial pass over all the urls a website advertises, trying to preselect pages we might be interested in before we even start crawling. If we don’t find any urls that look like they might be event pages, for example, we can ignore the domain altogether, or kick it to a background job.
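In practice, that preselection pass might look something like the sketch below. It’s only an illustration built on the gem’s API shown above; the example domain, the /event/ pattern, and the 500-entry cap are placeholders, not anything prescribed by the gem.

require 'sitemaps'

# discover the site's sitemap(s) and keep only urls whose path looks event-like
sitemap = Sitemaps.discover("https://www.example.com", max_entries: 500) do |entry|
  entry.loc.path =~ /event/i
end

candidate_urls = sitemap.entries.map(&:loc)

if candidate_urls.empty?
  # nothing advertised looks like an event page: skip the domain, or hand it
  # off to a slower background crawl instead of the on-demand path
  puts "no event-like urls found, skipping"
end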

Unfortunately, there’s no standard, well-known location for a sitemap file to live on a host. Sitemaps are usually submitted directly to a search engine, and while there’s some non-standard support for a Sitemap: directive in a robots.txt file, in practice it’s not used very often. That being said, there’s some commonality: if nothing is listed in robots.txt, look in /sitemaps.xml, for example.2
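Conceptually, the discovery step boils down to something like the following sketch: read robots.txt for Sitemap: lines, then fall back to a handful of common paths. This is a simplified illustration, not the gem’s actual implementation, and the exact list of fallback paths is an assumption on my part.

require 'net/http'
require 'uri'

# return a list of candidate sitemap urls for a host, in rough priority order
def candidate_sitemap_urls(host)
  base = URI.parse("https://#{host}")

  # 1. any "Sitemap: <url>" lines advertised in robots.txt
  robots      = Net::HTTP.get(base + "/robots.txt") rescue ""
  from_robots = robots.scan(/^\s*Sitemap:\s*(\S+)/i).flatten

  # 2. a few common fallback locations (an assumed, non-exhaustive list)
  common = ["/sitemaps.xml", "/sitemap.xml", "/sitemap_index.xml"].map { |p| (base + p).to_s }

  from_robots + common
end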

The Gem

sitemaps_parser exists because I couldn’t find anything else that does this. I’m attempting to keep dependencies small: activesupport is its only required dependency, though it supports nokogiri as well, which significantly speeds up parsing over REXML. REXML is a dog in benchmarks3, but it’s in the standard library.
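If you want the faster parsing path, something like this in a Gemfile should do it. The gem name and require path come from the usage example above; pulling in nokogiri alongside it is optional, and this is just my assumption about how you’d opt in.

# Gemfile
gem 'sitemaps_parser', require: 'sitemaps'

# optional: if nokogiri is available, parsing is significantly faster than REXML
gem 'nokogiri'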

So yeah, give it a try!

Footnotes

  1. Either it doesn’t get used, or it gets complained about endlessly 😉

  2. It’s hard to tell if this is a good heuristic, since I have no idea what percentage of sites actually have a sitemap that I can’t find, but it’s been helpful so far.

  3. Yes, I understand that this benchmark is 8 years old, but it’s what I could find and anecdotally it’s still that slow.