sitemaps_parser, a gem for parsing XML sitemaps
sitemaps_parser is a simple gem for discovering, fetching, and parsing XML sitemaps according to the spec at sitemaps.org. There are a ton of gems out there that can generate a sitemap file, but I couldn’t find any that supported parsing.
require 'sitemaps'
# parse a sitemap from a string
Sitemaps.parse("<xml ns=\"...")
# fetch and parse a sitemap from a known url
sitemap = Sitemaps.fetch("http://termscout.com/sitemap.xml")
# fetch and parse sitemaps, excluding paths matching a filter, and limiting to the top 200
sitemap = Sitemaps.fetch("https://www.digitalocean.com/sitemaps.xml.gz", max_entries: 200) do |entry|
entry.loc.path !~ /blog/i
end
# attempt to discover sitemaps for a site without a known sitemap location. Checks robots.txt and some common locations.
sitemap = Sitemaps.discover("https://www.digitalocean.com", max_entries: 200) do |entry|
  entry.loc.path !~ /blog/i
end
# sitemap usage
sitemap.entries.first
#> Sitemaps::Entry(loc: 'http://example.com/page', lastmod: DateTime.utc, changefreq: :monthly, priority: 0.5)
urls = sitemap.entries.map(&:loc)
#> ['http://example.com/page', 'http://example.com/page2', ...]
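
The Entry fields shown above (loc, lastmod, changefreq, priority) make it easy to slice the list further after fetching. A small sketch, assuming lastmod is a DateTime as shown and that lastmod and priority can be nil when the sitemap omits them; the priority threshold and 30-day cutoff are just illustrations:

require 'date'
require 'sitemaps'

sitemap = Sitemaps.fetch("http://example.com/sitemap.xml")

# keep only high-priority entries touched in the last 30 days
recent = sitemap.entries.select do |entry|
  priority = entry.priority || 0.5
  fresh    = entry.lastmod.nil? || entry.lastmod > (DateTime.now - 30)
  priority >= 0.7 && fresh
end

recent.map(&:loc)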
Why?
I’ve been working on a project at work lately that involves a lot of web crawling. Given a domain, we crawl the site looking for information we can index: sort of like a search engine, except we crawl on demand, rather than storing a huge index perpetually. Our particular use-case involves finding a handful of “important” pages to extract a specific kind of data - imagine a maps crawler trying to find the locations page of a pizza chain, or a calendar for extracting events.
There’s an inherent trade-off in those requirements: finding “important” pages when
you don’t know the structure of the site is hard and slow. I actually think it might
be impossible in general, given its similarity to the Halting Problem, but even in
the relatively normal case, you have to crawl a good bit of the site before you can
say with confidence that you either found the most important pages or they’re not
present. And since the only way to increase that confidence is to crawl more, the
process only gets slower. But when it’s slow, it’s less useful for the on-demand use-case.1
Yay Sitemaps
Luckily for us, many sites implement XML sitemaps based on the spec at sitemaps.org, which can be submitted to search engines to help them better crawl your site for indexing. A sitemap presents a set of urls, with metadata about frequency of updates, and indexing priority. That means that we can make an initial pass over all the urls a website advertises, trying to preselect pages we might be interested in before we even start. If we don’t find any urls that look like they might be event pages, for example, we can even ignore the domain altogether, or kick it to a background job.
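
For example (a sketch only: the domain, the url patterns, and the crawl/enqueue helpers here are hypothetical, not from the actual project):

require 'sitemaps'

# Preselect urls that look like event or location pages before crawling.
sitemap = Sitemaps.discover("https://pizza-chain.example", max_entries: 500) do |entry|
  entry.loc.path =~ %r{/(events?|locations?)(/|$)}i
end

if sitemap.entries.empty?
  # nothing advertised looks relevant: skip the domain, or defer it
  enqueue_background_crawl("pizza-chain.example") # hypothetical helper
else
  sitemap.entries.map(&:loc).each { |url| crawl(url) } # hypothetical crawl step
end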
Unfortunately, there’s no standard well-known location for a sitemaps.xml to live
on a host. Sitemaps are usually submitted directly to a search engine, and while
there’s some non-standard support for a Sitemap: directive in a robots.txt file,
in practice it’s not used very often. That being said, there’s some commonality:
if a sitemap isn’t listed in robots.txt, look in /sitemaps.xml, for example.2
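
The discovery heuristic looks roughly like this (a sketch of the idea, not the gem’s actual implementation; the fallback paths beyond /sitemaps.xml are my own guesses):

require 'net/http'
require 'uri'

# Check robots.txt for a Sitemap: directive, then try a few common paths.
def guess_sitemap_url(host)
  robots = Net::HTTP.get(URI("https://#{host}/robots.txt")) rescue nil
  if robots && (match = robots.match(/^Sitemap:\s*(\S+)/i))
    return match[1]
  end

  %w[/sitemaps.xml /sitemap.xml /sitemap_index.xml].each do |path|
    uri = URI("https://#{host}#{path}")
    response = Net::HTTP.get_response(uri) rescue next
    return uri.to_s if response.is_a?(Net::HTTPSuccess)
  end

  nil
end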
The Gem
sitemaps_parser exists because I couldn’t find anything else to do this. I’m
attempting to keep dependencies small, with activesupport as its only required
dependency, though it supports nokogiri as well, which significantly speeds up
parsing over rexml. REXML is a dog in benchmarks3, but it’s in
the standard library.
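
The optional-speedup pattern looks roughly like this (a sketch of the general technique, not necessarily the gem’s exact code):

# Prefer Nokogiri when it's installed, otherwise fall back to REXML
# from the standard library.
begin
  require 'nokogiri'
  FAST_PARSER = true
rescue LoadError
  require 'rexml/document'
  FAST_PARSER = false
end

def parse_xml(source)
  FAST_PARSER ? Nokogiri::XML(source) : REXML::Document.new(source)
end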
So yeah, give it a try!
Footnotes
1. Either it doesn’t get used, or it gets complained about endlessly 😉
2. It’s hard to tell if this is a good heuristic, since I have no idea what percentage of sites actually have a sitemap that I can’t find, but it’s been helpful so far.
3. Yes, I understand that this benchmark is 8 years old, but it’s what I could find and anecdotally it’s still that slow.