Fetching and serving billions of URLs with Aragog
Every week, we process billions of URLs to create a rich, relevant and safe experience for Pinners. The web pages, linked through Pins, contain rich signals that enable us to display useful information on Pins (like recipe ingredients, a product’s price and location data), infer better recommendations and fight spam. To take full advantage of these signals, not only do we need to fetch, store and process the page content, but we also need to serve the processed content at low latencies.
The answer to our needs was building Aragog, a suite of systems that fetch, store, process and serve content from our growing repository of URLs, at low latencies to Pinners.
There are several important considerations that arise when building infrastructure that deals with billions of URLs:
- Normalization/canonicalization: The same URL can be represented in many different forms, and several URLs may eventually redirect to the same URL. URL normalization (deduplication of different URL representations) and canonicalization (deduplication of URLs pointing to the same page) play a significant role in reducing the amount of data storage required for serving.
- Crawl politeness: At this scale, it’s important to rate limit and smooth out the traffic going out to each particular domain. Furthermore, robots.txt rules need to be respected appropriately.
- Modeling URL data: One may want to store pieces of extracted metadata associated with a URL or store and update the inlinks and outlinks associated with a URL.
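As a rough illustration of the normalization step, here's a minimal Python sketch. The rules shown (lowercasing the scheme and host, dropping default ports, sorting query parameters) are common normalization examples, not Aragog's actual rule set:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url: str) -> str:
    """Collapse common textual variants of a URL into one form."""
    scheme, netloc, path, query, _fragment = urlsplit(url.strip())
    scheme = scheme.lower()
    netloc = netloc.lower()
    # Drop default ports (http:80, https:443).
    if (scheme, netloc.rsplit(":", 1)[-1]) in (("http", "80"), ("https", "443")):
        netloc = netloc.rsplit(":", 1)[0]
    # Sort query parameters so ordering differences don't create new keys.
    query = urlencode(sorted(parse_qsl(query, keep_blank_values=True)))
    path = path or "/"
    return urlunsplit((scheme, netloc, path, query, ""))
```

Canonicalization (mapping distinct pages like redirect sources to one target URL) requires fetch-time information, so it can't be a pure string transformation like the above.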
Aragog is composed of two services: the Aragog Fetcher, which fetches the web pages, respecting appropriate rate limits and canonicalizing URLs appropriately, and the Aragog UrlStore, which stores and serves all of the processed metadata and signals about the URLs. The figure below depicts some of the many interactions between our crawl pipelines/frontend and Aragog.
The Aragog Fetcher is a Thrift service responsible for fetching the URLs politely. Aragog Fetcher issues the HTTP requests, follows redirects and retrieves the page content and the HTTP headers. The fetcher returns a Thrift struct enclosing the page content, HTTP headers, fetch latency, the redirect chain and other data.
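The post doesn't show the actual Thrift definition, but the fetcher's response can be pictured as a struct along these lines (the field names and types here are illustrative, not the real schema):

```python
from dataclasses import dataclass, field

@dataclass
class FetchResult:
    """Illustrative stand-in for the Thrift struct the fetcher returns."""
    url: str                          # final URL after following redirects
    content: bytes                    # raw page body
    headers: dict                     # HTTP response headers
    status_code: int                  # final HTTP status
    fetch_latency_ms: int             # time taken to fetch the page
    redirect_chain: list = field(default_factory=list)  # URLs traversed
```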
Implementing crawl politeness requires two things:
- Respecting the rules in robots.txt
- Smoothing and rate limiting traffic to a particular domain
The Aragog Fetcher retrieves the robots.txt file from a particular domain, caching its contents for seven days. When a request is made to fetch a URL, it applies the fetch/don't-fetch rules from robots.txt. If robots.txt allows fetching, the fetcher calls out to the rate limiter with the URL's domain.
The rate limiter may allow the request immediately, instruct the fetcher to delay the request by some number of milliseconds to smooth out URL fetching, or fail the request because the rate has been exceeded. To ensure the Aragog Fetcher doesn't overburden a domain with too many requests, the rate limiters allow up to 10 QPS to a single domain. We override this limit for some popular or more frequently crawled domains as necessary; the overrides are propagated to the pool of rate limiters as a configuration file through our config management system.
The rate limiter is served by a pool of machines sharded by URL domain using consistent hashing, so a single machine is responsible for the rate limiting decisions of a given domain. Consistent hashing also minimizes the amount of rate limiting state that moves around when a rate limiter process/machine is added or decommissioned.

Each rate limiter machine stores a mapping from the domain to the timestamp at which a fetch was last scheduled. The rate limiter retrieves this timestamp (let's call it lastScheduledFetchTimeMs) and schedules the next fetch after it. For example, if the allowed QPS is 10, the rate limiter will schedule the next fetch for that domain at lastScheduledFetchTimeMs + 100, since requests must be spaced 100ms apart. The rate limiter uses a CAS operation to optimistically update the last scheduled fetch time, retrying if the CAS fails, and calculates the delay by subtracting the current time from the newly scheduled fetch time. During a large burst of requests this delay grows beyond one second; when that happens, the rate limiter throws an exception back to the fetcher instead of scheduling the fetch. Since only an 8-byte timestamp is stored, the per-domain overhead is tiny.
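A single-shard sketch of this scheduling logic is below. A local lock stands in for the CAS loop over shared state, and the constants follow the 10 QPS and one-second figures described above:

```python
import time
import threading

class DomainRateLimiter:
    """Sketch of per-domain fetch scheduling for a single shard."""
    MAX_DELAY_MS = 1000               # reject bursts that would wait > 1s

    def __init__(self, qps: int = 10):
        self.interval_ms = 1000 // qps    # 100ms between fetches at 10 QPS
        self.last_scheduled = {}          # domain -> lastScheduledFetchTimeMs
        self.lock = threading.Lock()      # stands in for the CAS update loop

    def schedule(self, domain: str) -> int:
        """Return how long the fetcher should wait, in ms, or raise."""
        now_ms = int(time.time() * 1000)
        with self.lock:
            last = self.last_scheduled.get(domain, 0)
            scheduled = max(now_ms, last + self.interval_ms)
            delay = scheduled - now_ms
            if delay > self.MAX_DELAY_MS:
                raise RuntimeError(f"rate exceeded for {domain}")
            self.last_scheduled[domain] = scheduled
        return delay
```

Back-to-back calls for the same domain return increasing delays, and a burst of more than roughly a second's worth of requests raises the exception that the fetcher turns into a reschedule.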
Whenever a URL is rate limited, the client simply reschedules the fetch to a later time, a feature intrinsically supported by our asynchronous task execution system called PinLater.
Every time you see a Rich Pin, you’re looking at data served from the Aragog UrlStore. The UrlStore is the storage and serving system that holds the metadata extracted from pages fetched. It holds the page content itself, semi-structured data extracted from the page and web graph metadata such as inlinks and outlinks. We created this shared system so that product teams can rapidly build functionality that uses this metadata without the burden of building their own scalable serving infrastructure.
There were a couple of key design considerations we made when designing the system. First, we wanted a one-stop shop for all URL metadata across the organization. Second, we wanted to serve Pinterest’s full online read traffic from our API tier at acceptably low latencies, as well as read-write traffic from our offline processing systems which are a combination of batch and real-time processing.
To accomplish this, we built a federated storage system that provides a comprehensive data model while storing metadata efficiently in systems that have an appropriate size, latency, durability and consistency.
Here are a few examples of how we made the tradeoff between latency, durability and consistency.
We store the full content of fetched web pages. These are large blobs of data that are retrieved infrequently, and only by offline processing pipelines. We chose to store this data in S3 because affordable, large-scale storage was more important than low latency.
Each web page is stored as a separate S3 file. We use a hash of the URL (normalized and canonicalized) as the key because we found that S3 is susceptible to key hotspots: when many keys share a long common prefix, individual servers within the S3 cluster can become overloaded, degrading performance for some of the keys in the bucket, and raw URLs used as keys create exactly these hotspots. We initially tried keys based on the URL with a reversed domain (imagine a million keys in a single S3 bucket that all begin with "com.etsy/...") but ended up receiving hotspotting complaints from Amazon.
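A hashed key scheme like the one described can be sketched as follows; the `pages/` prefix and the choice of SHA-1 are illustrative assumptions, not Aragog's actual layout:

```python
import hashlib

def s3_key_for(normalized_url: str) -> str:
    """Derive an evenly distributed S3 key from a normalized, canonical URL."""
    digest = hashlib.sha1(normalized_url.encode("utf-8")).hexdigest()
    # The random-looking hex prefix spreads keys across S3's index partitions,
    # unlike reverse-domain keys where millions share a "com.etsy/" prefix.
    return f"pages/{digest}"
```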
Metadata for Pins
The dynamic data shown on Rich Pins is retrieved, processed and stored when the object is initially Pinned, and refreshed periodically as long as the Pin and the underlying page exist. When the Pin appears on a board or in search results, this data is retrieved and displayed to the Pinner.
The list of data generated is always evolving, so URL metadata has a flexible data model in the Aragog UrlStore: a map of field names to values. For example, a product Pin may have a "product_name" field and a "price" field. Application teams are responsible for the field names and the values' binary format. In most cases, the metadata for a single URL is updated incrementally (a few fields at a time) by our offline processing systems, and subsets of the metadata are served to users under varying circumstances.
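For illustration, the field map and an incremental update can be modeled like this. The field names and the binary encoding here are hypothetical; each team defines its own:

```python
# Hypothetical fields for a product Pin; values are opaque bytes whose
# encoding is owned by the application team (here, a big-endian price).
product_metadata = {
    "product_name": b"Ceramic travel mug",
    "price": (1200).to_bytes(8, "big"),   # e.g. price in cents
}

def update_fields(stored: dict, updates: dict) -> dict:
    """Incremental update: merge a few fields without rewriting the rest."""
    merged = dict(stored)
    merged.update(updates)
    return merged
```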
Given the low-latency requirements for accessing URL metadata, and the fact that URLs with inlinks/outlinks can be conveniently modeled as a graph, we use Zen (backed by HBase) as the underlying storage system. Zen is Pinterest's graph storage service, which allows defining nodes and linking them through edges. Zen properties can be used for storing metadata associated with nodes, and Zen edges can be used to model inlinks/outlinks. Zen provides fast, efficient CRUD operations and indexing over the nodes and edges.
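Zen's API isn't shown in this post, but the node-property and edge model it provides can be approximated in miniature. This toy in-memory class is only a stand-in for a distributed, HBase-backed service:

```python
from collections import defaultdict

class UrlGraph:
    """Toy stand-in for Zen's node/edge model applied to URLs."""

    def __init__(self):
        self.properties = defaultdict(dict)   # url -> {property: value}
        self.outlinks = defaultdict(set)      # url -> URLs it links to
        self.inlinks = defaultdict(set)       # url -> URLs linking to it

    def set_property(self, url: str, name: str, value):
        """Attach a metadata property to a URL node."""
        self.properties[url][name] = value

    def add_link(self, src: str, dst: str):
        """Record a hyperlink as an edge, indexed in both directions."""
        self.outlinks[src].add(dst)
        self.inlinks[dst].add(src)
```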
The UrlStore is responsible for performing URL normalization/canonicalization before issuing CRUD operations against Zen for metadata or S3 for page content (as illustrated below). The canonical URL information is stored in a Zen node property named "canonical_url" (also updated by the crawl pipeline). URL normalization and canonicalization is a complex topic that's beyond the scope of this post.
In the following figure, the client issues a request to perform a CRUD operation on abc.com. We normalize the URL, look up the Zen node and determine the canonical URL. Once the canonical URL is determined, we issue the corresponding CRUD operations against the canonical URL (xyz.com in this example). These can include adding/removing Zen edges (inlinks/outlinks) or adding/removing Zen node properties (metadata).
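That resolution flow can be sketched as follows. The dictionaries and the trivial lowercasing normalization are simplified stand-ins for Zen lookups and the real normalizer; only the "canonical_url" property name comes from the description above:

```python
def resolve_canonical(node_properties: dict, url: str) -> str:
    """Normalize the URL, then follow its stored 'canonical_url' property.

    node_properties maps a normalized URL to that node's properties, as the
    crawl pipeline would have written them into Zen.
    """
    normalized = url.strip().lower()          # placeholder normalization
    props = node_properties.get(normalized, {})
    return props.get("canonical_url", normalized)

def write_metadata(store: dict, node_properties: dict, url: str, fields: dict):
    """All CRUD operations are issued against the canonical URL."""
    canonical = resolve_canonical(node_properties, url)
    store.setdefault(canonical, {}).update(fields)
```

A request against any variant of abc.com thus lands on the single record keyed by xyz.com.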
Since the launch of Aragog, various pipelines have used it for fetching, processing and serving URL content in online/offline scenarios. We fetch millions of URLs and serve billions of online URL requests through Aragog every day.
Acknowledgements: This work is a joint effort by Jonathan Hess, Jiacheng Hong and Varun Sharma.