ZooKeeper Resilience at Pinterest
Apache ZooKeeper is an open source distributed coordination service that’s popular for use cases like service discovery, dynamic configuration management and distributed locking. While it’s versatile and useful, it has failure modes that can be hard to prepare for and recover from, and if used for site critical functionality, can have a significant impact on site availability.
It’s important to structure the usage of ZooKeeper in a way that prevents outages and data loss, so it doesn’t become a single point of failure (SPoF). Here, you’ll learn how Pinterest uses ZooKeeper, the problems we’ve dealt with, and a creative solution to benefit from ZooKeeper in a fault-tolerant and highly resilient manner.
Service discovery and dynamic configuration
Like many large scale web sites, Pinterest’s infrastructure consists of servers that communicate with backend services composed of a number of individual servers for managing load and fault tolerance. Ideally, we’d like the configuration to reflect only the active hosts, so clients don’t need to deal with bad hosts as often. ZooKeeper provides a well known pattern to solve this problem.
Each backend service host registers an ephemeral node in ZooKeeper in a path specific to that service. Clients can watch that path to determine the list of hosts. Since each node is ephemeral, it will automatically be removed by ZooKeeper if the host registering fails or shuts down. New hosts brought up automatically register themselves at the correct path, so clients will notice within a few seconds. Thus the list stays up to date and reflects the hosts that are active, which addresses the issues mentioned above.
Another use case is any type of application configuration that needs to be updated dynamically and applied within a few seconds to multiple machines.
Imagine a distributed database fronted by a service layer, referred to as ‘Data Service’. Let’s assume the database is partitioned by a user. When a user request comes in, it needs to know which database holds the information for that user. This information could be deployed statically with the application, but that has the same problems as described above with service discovery. Additionally, it’s important that database configuration changes converge quickly across all the Data Service machines. Otherwise, we wouldn’t be able to apply the update quickly.
Here again, ZooKeeper is of help. The configuration can be stored in a node in ZooKeeper that all the Data Service servers watch. A command line tool or GUI can update the node, and within a few seconds, all the Data Service machines will reflect the update. Voila!
ZooKeeper Failure Modes
While ZooKeeper can play a useful role in a backend infrastructure stack as shown above, like all software systems, it can fail. Here are some possible reasons:
- Too many connections: Let’s say someone brought up a large Hadoop job that needs to communicate with some of the core Pinterest services. For service discovery, the workers need to connect to ZooKeeper. If not properly managed, this could temporarily overload the ZooKeeper hosts with a huge volume of incoming connections, causing it to get slow and partially unavailable.
- Too many transactions: When there’s a surge in ZooKeeper transactions, such as a large number of servers restarting in a short period and attempting to re-register themselves with ZooKeeper (a variant of the thundering herd problem). In this case, even if the number of connections isn’t too high, the spike in transactions could take down ZooKeeper.
- Protocol bugs: Occasionally under high load, we’ve run into protocol bugs in ZooKeeper that result in data corruption. In this case, recovery usually involves taking down the cluster, bringing it back up from a clean slate and then restoring data from backup.
- Human errors: In all software systems, there’s a possibility of human error. For example, we’ve had a manual replacement of a bad ZooKeeper host unintentionally take the whole ZooKeeper quorum offline for a short time due to erroneous configuration being put in place.
- Network partitions: While relatively rare, network connectivity issues resulting in a network partition of the quorum hosts can result in downtime till the quorum can be restored.
While site outages due to service failures are never completely unavoidable when running a web service as large and fast moving as Pinterest, there were a few reasons why ZooKeeper issues were particularly problematic:
- There was no easy immediate mitigation. If bad code is deployed, we can rollback immediately to resolve the issue and investigate what happened later. If a particular part of the site fails, we can disable that service temporarily to mitigate impact. There wasn’t such an option available for ZooKeeper failures.
- Recovery would take multiple hours in some cases, due to a combination of the thundering herd problem and having to restart from a clean slate and recover from backup.
- The centrality of ZooKeeper in our stack meant a pretty wide impact on the site. That is, the outages weren’t isolated to a particular part of the site or feature, but rather impacted the site as a whole.
Initial Mitigation Attempts
A few strategies to mitigate the problem were suggested, but would only provided limited relief.
- Add capacity: The simplest strategy is to tackle load issues is to add capacity. Unfortunately, this doesn’t quite work since the ZooKeeper design and protocol is such that the quorum (voting members) cannot scale out very well. In practice, we have observed that having more than 7-10 hosts tends to make write performance significantly worse.
- Add observers: ZooKeeper observers offer a solution to the scaling problem. These are non-voting members of the ZooKeeper ensemble that otherwise function like any other ZooKeeper host, i.e. can accept watches and other types of requests and proxy them back to the quorum. Adding a fleet of observers and shunting off traffic to them indeed substantially alleviated the load on the ZooKeeper cluster. However, this only helped with watches and other reads, not with writes. It partially addressed failure cause (1) mentioned in the previous section, but not the others.
- Use multiple ZooKeeper clusters for isolation: A mitigation we attempted was to use different ZooKeeper clusters for each use case, e.g. our deploy system uses a separate cluster from the one used for service discovery, and our HBase clusters each use independent ZooKeeper clusters. Again, this helped alleviate some of the problems, but wasn’t a complete solution.
- Fallback to static files: Another possibility is to use ZooKeeper for service discovery and configuration in normal operation, but if it fails, fallback to static host lists and configuration files. This is reasonable, but becomes problematic when keeping the static data up-to-date and accurate. This is particularly difficult for services that are auto scaled and have high churn in the host set. Another option is to fallback to a different storage system (e.g. MySQL) in case of failure, but that too is a manageability nightmare, with multiple sources of truth and another system to be provisioned to handle the full query and spike in the event of ZooKeeper failing.
We also considered a couple more radical steps:
- Look for an alternative technology that is more robust: There are a few alternative distributed coordinator implementations available through open source, but we found them to generally be less mature than ZooKeeper and less tested in production systems at scale. We also realized these problems are not necessarily implementation specific, rather, any distributed coordinator can be prone to most of the above issues.
- Don’t use central coordination at all: Another radical option was to not use ZooKeeper at all, and go back to using static host lists for service discovery and an alternate way to deploy and update configuration. But doing this would mean compromising on latency of update propagation, ease of use and/or accuracy.
Narrowing in on the solution
We ultimately realized that the fundamental problem here was not ZooKeeper itself. Like any service or component in our stack, ZooKeeper can fail. The problem was our complete reliance on it for overall functioning of our site. In essence, ZooKeeper was a Single Point of Failure (SPoF) in our stack. SPoF terminology usually refers to single machines or servers, but in this case, it was a single distributed service.
How could we continue to take advantage of the conveniences that ZooKeeper provides while tolerate its unavailability? The solution was actually quite simple: decouple our applications from ZooKeeper.
The drawing on the left represents how we were using ZooKeeper originally, and the one on the right shows what we decided to do instead.
Applications that were consuming service information and dynamic configuration from ZooKeeper connected to it directly. They cached data in memory but otherwise relied on ZooKeeper as the source of truth for the data. If there were multiple processes running on a single machine, each would maintain a separate connection to ZooKeeper. ZooKeeper outages directly impacted the applications.
Instead of this approach, we moved to a model where the application is in fact completely isolated from ZooKeeper. Instead, a daemon process running on each machine connects to ZooKeeper, establishes watches on the data it is configured to monitor, and whenever the data changes, downloads it into a file on local disk at a well known location. The application itself only consumes the data from the local file and reloads when it changes. It doesn’t need to care that ZooKeeper is involved in propagating updates.
Looking at scenarios
Let’s see how this simple idea helps our applications tolerate the ZooKeeper failure modes. If ZooKeeper is down, no applications are impacted since they read data from the file on local disk. The daemons indefinitely attempt to reestablish connection to ZooKeeper. Until ZooKeeper is back up, configuration updates cannot take place. In the event of an emergency, the files can always be pushed directly to all machines manually. The key point is that there is no availability impact whatsoever.
Now let’s consider the case when ZooKeeper experiences a data corruption and there’s partial or complete data loss till data can be restored from backup. It’s possible a single piece of data in ZooKeeper (referred to as znode) containing important configuration is wiped out temporarily. We protect against this case by building a fail-safe rule into the daemon: it rejects data that is suspiciously different from the last known good value. A znode containing configuration information entirely disappearing is obviously suspicious and would be rejected, but so would more subtle changes like the set of hosts for a service reducing suddenly by 50%. With carefully tuned thresholds, we’ve been able to effectively prevent disastrous configuration changes from propagating to the application. The daemon logs such errors and we can alert an on-call operator so the suspicious change can be investigated.
One might ask, why build an external daemon? Why not implement some sort of disk backed cache and fail-safe logic directly into the application? The main reason is that our stack isn’t really homogenous: we run applications in multiple languages like Python, Java, and Go. Building a robust ZooKeeper client and fault tolerant library in each language would be a non-trivial undertaking, and worse still, painful to manage in the long run since bug fixes and updates would need to be applied to all the implementations. Instead, by building this logic into a single daemon process, we can have the application itself only perform simple file read coupled with periodic check/reload, which is trivial to implement. Another advantage of the daemon is it reduces overall connections into ZooKeeper since we only need one connection per machine rather than one per process. This is particularly a big win for our large Python service fleet, where we typically run 8 to 16 processes per machine.
How about services registering with ZooKeeper for discovery purposes? That too can in theory be outsourced to an external daemon, with some work to ensure the daemon correctly reflects the server state in all cases. However, we ended up keeping the registration path in the application itself and instead spent some time taking care to harden the code path to make it resilient to ZooKeeper outages and equally importantly, be guaranteed to re-register after an outage. This was considered sufficient since individual machines failing to register does not typically have a large impact on the site. On the other hand, if a large group of machines suddenly lose connection to ZooKeeper thereby getting de-registered, the fail-safe rules in the daemon on the consumption side will trigger and result in a rejection of the update, thereby protecting the service and its clients.
Rolling out the final product
We implemented the daemon and modified our base machine configuration to ensure it was installed on all machine groups that needed to use ZooKeeper. The daemon operation is controlled by a configuration file.
Here’s a sample configuration snippet for the daemon:
type = service_discovery
zk_cluster = default
zk_path = /discovery/dataservice/prod
command = /usr/local/bin/zk_download_data.py -f /var/service
/discovery.dataservice.prod -p /discovery/dataservice/prod
Here we’re telling the daemon to monitor the /discovery/dataservice/prod path in the default ZooKeeper cluster for service discovery use, and when ZooKeeper notifies it of a change within that path, to run the zk_download_data script. This script reads all the children znodes of that path, retrieves the host and port information for each server and writes it to the specified file path on local disk, one host:port per line. A configuration file can contain several snippets like the above, comprising service discovery, dynamic configuration and other use cases. The format is designed to be generic enough to be able to extend to other use cases in future.
Next, our client libraries that deal with configuration or service discovery were modified to use files on local disk instead of ZooKeeper as the source of data. We attempted to make this change as opaque to the application code as possible. For example, some of our Java services use Twitter’s ServerSet libraries for service discovery. In this case, we built a ServerSet implementation that is backed by host:port combinations retrieved from a file on local disk (e.g. /var/service/discovery.dataservice.prod mentioned above). This implementation takes care of reloading data when the file content changes. Consumers of the ServerSet interface, including RPC client libraries, don’t need to be aware of the change at all. Since our services use a common set of RPC libraries, it was fairly easy to roll this change out across all our services. Similar relatively simple changes were made to our Python and other language libraries.
Validating the new design
To be sure our infrastructure was resilient to ZooKeeper failure, we ran a number of test scenarios until the real thing happened. A couple of weeks after roll out, there was another ZooKeeper outage, triggered by load introduced by an unrelated bug in one of our client libraries. We were happy to see that this outage caused no site impact, and in fact, went unnoticed till ZooKeeper monitoring alerts themselves fired. Since the outage happened late in the evening, a decision was made to do the cluster restoration the next morning. Thus the entire site functioned normally all night despite ZooKeeper being down, thanks to our resilience changes being in place.
Want to work on interesting distributed systems problems like this? We’re hiring!
Raghavendra Prabhu is a software engineer at Pinterest.
Acknowledgements: This project was a joint effort between the infrastructure and technical operations teams at Pinterest. The core project members were Michael Fu, Raghavendra Prabhu and Yongsheng Wu, but a number of other folks across these teams provided very useful feedback and help along the way to make the project successful.