sitemaps_parser, a gem for parsing XML sitemaps
sitemaps_parser is a simple gem for discovering, fetching and parsing XML sitemaps according to the spec at sitemaps.org. There are a ton of gems out there that can generate a sitemap file, but I couldn't find any that supported parsing.
require 'sitemaps'

# parse a sitemap from a string
Sitemaps.parse("<xml ns=\"...")

# fetch and parse a sitemap from a known url
sitemap = Sitemaps.fetch("http://termscout.com/sitemap.xml")

# fetch and parse sitemaps, excluding paths matching a filter, and limiting to the top 200
sitemap = Sitemaps.fetch("https://www.digitalocean.com/sitemaps.xml.gz", max_entries: 200) do |entry|
  entry.loc.path !~ /blog/i
end

# attempt to discover sitemaps for a site without a known sitemap location. Checks robots.txt and some common locations.
sitemap = Sitemaps.discover("https://www.digitalocean.com", max_entries: 200) do |entry|
  entry.loc.path !~ /blog/i
end

# sitemap usage
sitemap.entries.first
#> Sitemaps::Entry(loc: 'http://example.com/page', lastmod: DateTime.utc, changefreq: :monthly, priority: 0.5)

urls = sitemap.entries.map(&:loc)
#> ['http://example.com/page', 'http://example.com/page2', ...]
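Since each entry carries the optional metadata from the spec, you can rank or filter on it directly. A small sketch of that, assuming lastmod and priority may be nil when a sitemap omits them:

# keep entries updated in the last 30 days and rank them by priority,
# falling back to the spec defaults when the fields are missing
recent = sitemap.entries.select { |e| e.lastmod.nil? || e.lastmod > (DateTime.now - 30) }
ranked = recent.sort_by { |e| -(e.priority || 0.5) }
top_urls = ranked.first(10).map(&:loc)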
Why?
I've been working on a project at work lately that involves a lot of web crawling. Given a domain, we crawl the site looking for information we can index: sort of like a search engine, except we crawl on demand, rather than storing a huge index perpetually. Our particular use-case involves finding a handful of "important" pages to extract a specific kind of data - imagine a maps crawler trying to find the locations page of a pizza chain, or a calendar crawler extracting events.
There's an inherent trade-off in those requirements: finding "important" pages when you don't know the structure of the site is hard and slow. I actually think it might be impossible in general, given its similarity to the Halting Problem, but even in the relatively normal case, you have to crawl a good bit of the site before you can say with confidence that you either found the most important pages or they're not present. And since the only way to raise that confidence is to crawl more, the process only gets slower. But when it's slow, it's less useful for the on-demand use-case.1
Yay Sitemaps
Luckily for us, many sites implement XML sitemaps based on the spec at sitemaps.org, which can be submitted to search engines to help them better crawl your site for indexing. A sitemap presents a set of urls, with metadata about frequency of updates and indexing priority. That means we can make an initial pass over all the urls a website advertises, trying to preselect pages we might be interested in before we even start. If we don't find any urls that look like they might be event pages, for example, we can even ignore the domain altogether, or kick it to a background job.
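For example, a crawler looking for event pages might use the sitemap as a cheap pre-filter before fetching anything else. A rough sketch of that idea; the domain, the /event|calendar/ pattern, and the entry limit are all illustrative:

# pre-select candidate pages from the sitemap before doing any real crawling
sitemap = Sitemaps.discover("https://example.com", max_entries: 500)
candidates = sitemap.entries.select { |e| e.loc.path =~ /event|calendar/i }

if candidates.empty?
  # nothing event-like advertised: skip the domain, or kick it to a background job
else
  # crawl only the preselected urls on the on-demand path
  candidates.map(&:loc)
end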
Unfortunately, there's no standard well-known location for a sitemaps.xml to live on a host. Sitemaps are usually submitted directly to a search engine, and while there's some non-standard support for a Sitemap: directive in a robots.txt file, in practice it's not used very often. That being said, there's some commonality: if it's not present in the robots.txt, look in /sitemaps.xml, for example.2
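In code, the heuristic amounts to: collect any Sitemap: lines from robots.txt, and fall back to a few common paths when there aren't any. A minimal sketch of that approach with net/http - an illustration of the heuristic, not the gem's internals, and the fallback list here is a guess:

require 'net/http'
require 'uri'

# candidate sitemap urls for a host: robots.txt directives first, common paths second
def candidate_sitemaps(root)
  robots = Net::HTTP.get(URI.join(root, "/robots.txt")) rescue ""
  from_robots = robots.scan(/^\s*Sitemap:\s*(\S+)/i).flatten
  fallbacks = ["/sitemaps.xml", "/sitemap.xml"].map { |path| URI.join(root, path).to_s }
  from_robots.empty? ? fallbacks : from_robots
end

candidate_sitemaps("https://example.com")
#> ["https://example.com/sitemaps.xml", "https://example.com/sitemap.xml"]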
The Gem
sitemaps_parser exists because I couldn't find anything else to do this. I'm attempting to keep dependencies small, with activesupport as its only required dependency, though it supports nokogiri as well, which significantly speeds up parsing over rexml. REXML is a dog in benchmarks3, but it's in the standard library.
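That speedup comes from opting into nokogiri only when it's installed and quietly falling back to the stdlib otherwise; the pattern looks roughly like this (a sketch of the approach, not the gem's actual source):

# prefer Nokogiri when it's available, otherwise fall back to stdlib REXML
begin
  require 'nokogiri'
  XML_BACKEND = :nokogiri
rescue LoadError
  require 'rexml/document'
  XML_BACKEND = :rexml
end

def parse_xml(source)
  XML_BACKEND == :nokogiri ? Nokogiri::XML(source) : REXML::Document.new(source)
end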
So yeah, give it a try!
Footnotes
1. Either it doesn't get used, or it gets complained about endlessly.
2. It's hard to tell if this is a good heuristic, since I have no idea what percentage of sites actually have a sitemap that I can't find, but it's been helpful so far.
3. Yes, I understand that this benchmark is 8 years old, but it's what I could find, and anecdotally it's still that slow.