As we focus on building a great user experience for the tens of millions of existing Pinners, it’s equally important to engage and retain new Pinners through the new user experience (NUX).

We recently rebuilt our new user experience and created a new framework to power it. Through the process, we determined the best content to show that would educate without overwhelming. Here you’ll learn how we arrived at a NUX that performs significantly better than the previous experience across all of our core engagement metrics.

Rethinking NUX from the ground up

We started by conducting qualitative and quantitative research to better understand new Pinners. The user experience research team interviewed a group of inactive Pinners to understand major pain points, while the data team analyzed a large sample of existing Pinners and determined the core set of actions that would increase the likelihood of retaining a new person joining the site.

After looking at the insights and iterating on dozens of versions, we gathered new learnings about retaining new Pinners:

Demonstrate a simple value proposition that clearly shows off utility. A Pin is our primary value proposition, so we immediately educate the person about how Pins work and their value.

Actualize the value proposition immediately. Searching and discovering Pins is a core feature, so immediately following the Pin step, we give education on how to find and save interesting Pins.

Educate new Pinners at their own pace. The previous Pin and Search steps are mandatory for new Pinners because we’ve found they lead to increased long-term engagement. However, if a Pinner doesn’t seem to get it the first time, we’ll gently re-educate them on subsequent visits. For example, if they still haven’t saved a Pin by their second visit, we’ll provide re-education, and we follow the same process for board creation, following, and other features.

Encourage immediate action. Understanding early on what it means to Pin substantially increases the likelihood of retaining a new Pinner. New Pinners get a simplified experience where Pinning is highlighted and other advanced features are hidden until they save their first Pin. We call this the First Pin Experience (more on that below).

After becoming active for the first time and saving a Pin, the Pinner will graduate to a richer experience.

The need for a framework

The updated NUX is a multi-session experience that differs based on Pinner state such as what they’ve done, how long they’ve seen an experience, etc. Therefore we needed a system that could control what the Pinner experiences based on those variables. We also needed a way to easily run experiments to test different NUX steps, messaging, and educational units.

Similar experiences already existed, such as new feature tutorials and education. We realized these existing experiences were built as standalone pieces yet shared similar logic, so we created an Experience Framework to build NUX and power both new and existing experiences.

You can think of an experience as any feature on Pinterest, each of which requires logic to determine when it needs to be shown, persistence logic for when it’s dismissed or completed, and logging (e.g., impressions vs. completions).

Here’s how the logic was laid out in our client and backend:

Here’s how the logic looks with the Experience Framework:

Each experience is configured in one place, the client delegates display logic to the backend, and persistence, logging, and experimentation are all powered by the framework.

Boiling down to solutions

The Experience Framework answers one simple question: what experiences should a Pinner see on a given view within the app?

Every time the client renders a view, the Experience Framework will tell the client what experience the Pinner should see. For example, when rendering the home page, the Experience Framework tells the client whether to show a specific step in NUX, a tutorial, or a feed of Pins. How the decision is made is opaque to the client.
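
To make that contract concrete, here’s a minimal Python sketch of what the client side could look like. The names (ExperienceClient, decide, the return shape) are illustrative assumptions, not Pinterest’s actual API; only the placement name comes from the example later in this post.

# Hypothetical sketch of the client/framework contract: the client asks the
# framework what to show for a placement and renders whatever comes back,
# without knowing how the decision was made.
from typing import Optional


class ExperienceClient:
    def __init__(self, backend):
        self.backend = backend

    def decide(self, user_id: int, placement: str) -> Optional[str]:
        # The display logic is fully delegated to the backend decision engine.
        return self.backend.decide(user_id=user_id, placement=placement)


def render_home(client: ExperienceClient, user_id: int) -> str:
    experience = client.decide(user_id, placement="WEB_HOME_TAKEOVER")
    if experience is None:
        return "default_home_feed"    # nothing eligible: show the regular feed
    return "render:" + experience     # e.g. a NUX step or a tutorial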

The decision engine

The core of the framework is the decision engine, responsible for determining the experience a Pinner should see by considering configured and eligible instances for all potential views. The best experience is then decided upon based on static configuration (start date, seconds_to_expire, max_display_count, etc), the Pinner’s state (such as number of Pins created, level of engagement, and features experienced), the experience state (enabled, expired, view count, etc), the client type, experiment group, and many other properties. These decision parameters allow us to build complicated experiences like our First Pin Experience.

To recap, the First Pin Experience is shown to new Pinners who’ve never saved a Pin, and it lasts for no longer than 24 hours. The configuration is simple: set the seconds_to_expire to 24 hours and write a handler that will ensure it’s only enabled for new users with no Pins. In this experience we’ve also configured an experiment to further test whether 24 hours is really the best duration for this experience.

Experience.WEB_FIRST_PIN_EXP: {
    'description': 'First Pin Experience.',
    'start_date': '2013-11-01',
    'seconds_to_expire': 60*60*24,
    'handler': autobahn.FirstPin,
    'experiment': {
        'name': 'first_pin_duration',
        'groups': {
            '1_day': {'seconds_to_expire': 60*60*24},
            '2_days': {'seconds_to_expire': 60*60*24*2},
            '7_days': {'seconds_to_expire': 60*60*24*7}
        }
    }
}

We then associate the experience with a unique placement in the client (web home page).

Placement.WEB_HOME_TAKEOVER: {
    'experiences': [
        Experience.WEB_MANDATORY_AUTOBAHN,
        Experience.WEB_FIRST_PIN_EXP,
        Experience.WEB_FIRST_PIN_USER_ED,
        Experience.WEB_YOUR_BOARDS_USER_ED,
        Experience.WEB_FAST_FOLLOW_USER_ED,
        Experience.WEB_FIND_FRIENDS_USER_ED
    ],
    'cooldown': SESSION_LENGTH  # 2 hour cooldown - session length
}

There could be many eligible experiences on the home page placement. The framework resolves this by guaranteeing only one experience can be shown on that view.
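
As a rough sketch of that resolution step (our own simplified checks, not the production engine), the placement’s ordered experience list can be scanned and the first experience that passes its configuration and state checks wins:

# Simplified sketch of resolving a placement to a single experience. The
# config/state fields mirror the properties mentioned above; the real engine
# considers many more.
import time


def decide(placement, experiences, state, now=None):
    """placement: {'experiences': [...]}; experiences: id -> static config;
    state: id -> {'enabled': bool, 'first_seen': ts, 'view_count': int}."""
    now = now if now is not None else time.time()
    for exp_id in placement["experiences"]:
        cfg = experiences[exp_id]
        st = state.get(exp_id, {})
        if not st.get("enabled", False):
            continue                              # handler ruled it out for this Pinner
        ttl = cfg.get("seconds_to_expire")
        if ttl and now - st.get("first_seen", now) > ttl:
            continue                              # experience expired for this Pinner
        if st.get("view_count", 0) >= cfg.get("max_display_count", float("inf")):
            continue                              # impression cap reached
        return exp_id                             # first eligible experience wins
    return None                                   # nothing to show; render the default UI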

Be fast or fail fast

The Experience Framework is the gatekeeper in determining what experience a Pinner should see, so at its peak it sees about 50k decision requests/second and growing. Since in some cases the view needs to synchronously call the Experience Framework before rendering, it needs to be fast, or at the very least fail fast.

The major bottleneck in our case is I/O, i.e., accessing persistent user and experience state data. We addressed this by:

  • Storing all our data in an HBase cluster that’s highly optimized for retrieving state data
  • Minimizing the number of calls to HBase
  • Making use of gevent

Luckily, we were able to apply much of what we’d previously learned about optimizing HBase for fast online reads. As a result, we achieved an upper-90th-percentile latency of 30-40ms.

Even with fast response times, it’s important to avoid making unnecessary backend calls. To achieve this, our clients periodically pull down and cache all displayable experiences, and use that cache to decide and render an experience whenever possible. However, keeping this “state of the world” cache up to date can be tricky.
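
A minimal sketch of such a cache, assuming a hypothetical bulk all_displayable backend call and a simple TTL-based refresh:

# Hedged sketch of a client-side "state of the world" cache: pull all
# displayable experiences periodically, decide locally when possible, and only
# fall back to the backend on a miss.
import time


class ExperienceCache:
    def __init__(self, backend, ttl_seconds=300):
        self.backend = backend
        self.ttl = ttl_seconds
        self._decisions = {}        # placement -> experience id (or None)
        self._fetched_at = 0.0

    def _refresh_if_stale(self, user_id):
        if time.time() - self._fetched_at > self.ttl:
            # One bulk call instead of a backend call per rendered view.
            self._decisions = self.backend.all_displayable(user_id)
            self._fetched_at = time.time()

    def decide(self, user_id, placement):
        self._refresh_if_stale(user_id)
        if placement in self._decisions:
            return self._decisions[placement]
        return self.backend.decide(user_id, placement)   # cache miss: ask the backend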

As a last resort, if the decision engine takes too long to respond, we fail fast. This means in the worst case the Pinner will experience the default user experience, which is not ideal, but allows for the Pinner to continue using Pinterest.

Today the Experience Framework powers the majority of experiences on our website. The framework is also steadily powering experiences on our mobile apps, which is exciting because it enables us to dynamically render experiences as we run experiments without pushing a new release. You can expect to see better and improved experiences coming to a Pinterest app near you.

If problems like this interest you, the Pinterest Growth Team is hiring product-minded hackers to help billions of users worldwide discover the things they love and inspire them to go do those things.

Daniel Chu is an engineer at Pinterest.


With tens of millions of Pinners and tens of billions of Pins, there’s a massive supply of data to power discovery features on Pinterest, such as search, recommendations, and interests, providing unlimited exploration opportunities. Here we describe how these data are calculated, stored, and eventually served online.

Building the data model for the interest graph

Pinterest gives people a way to organize information on the web in a way that makes sense to them. Every Pin exists because someone thought it was important enough to add to a collection. The discovery part of Pinterest comes in when we can connect people to related Pins and Pinners they may be interested in. The more a person pins, the more connections can be created.

Discovery data builds on basic objects that share connections, aggregates them, and adds more complex information. We call these aggregations PinJoin, BoardJoin, and UserJoin. A PinJoin is no longer a single Pin; instead, we use image signatures to group all Pins with the same image. BoardJoins and UserJoins still group by board ID and user ID, respectively.

PinJoin, BoardJoin, and UserJoin each contain three types of information:

  • Raw data: information input from users.
  • Derived features: information we learn from the raw data.
  • Other joins: PinJoin, BoardJoin, and UserJoin each contain references to the other two joins.

The benefit of aggregating all this information is twofold. First, it helps us correct inaccurate user input. For example, a board category may be wrongly assigned by its owner, but its underlying Pins will be repinned to many other boards that are accurately assigned; utilizing the aggregated data and connections, we can identify the wrongly assigned board. Second, additional features can be derived from raw data. For example, PinJoin contains all repin chains, which lets us construct a repin graph and run the PageRank algorithm on it to calculate the importance of each board. All derived features are fed back into PinJoins, BoardJoins, and UserJoins.
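
As an illustration only (the field names are ours, not Pinterest’s actual schema), the three join types might be modeled roughly like this:

# Illustrative data model sketch: each join aggregates raw user input, derived
# features, and references to the other two join types.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PinJoin:
    image_signature: str                                      # grouped by image, not pin id
    pin_ids: List[int] = field(default_factory=list)          # raw data
    derived: Dict[str, float] = field(default_factory=dict)   # e.g. PageRank-style scores
    board_ids: List[int] = field(default_factory=list)        # links to BoardJoins
    user_ids: List[int] = field(default_factory=list)         # links to UserJoins


@dataclass
class BoardJoin:
    board_id: int                                             # still keyed by board id
    pin_signatures: List[str] = field(default_factory=list)
    derived: Dict[str, float] = field(default_factory=dict)
    user_ids: List[int] = field(default_factory=list)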

Data calculation

Discovery data are calculated offline using a Hadoop cluster. This figure illustrates a typical workflow of all jobs, with more than 200 jobs scheduled to run every day. Job dependencies are handled by Pinball, and data are stored in Amazon S3. We use the date and an iteration number for version control.
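
For example, the date-plus-iteration versioning could translate into output paths along these lines (the bucket name and layout here are made up for illustration):

# Hypothetical helper showing the versioning convention: outputs addressed by
# dataset, date, and iteration number.
def output_path(dataset: str, date: str, iteration: int) -> str:
    return "s3://discovery-data/%s/dt=%s/iteration=%d/" % (dataset, date, iteration)

# output_path("pin_join", "2014-02-20", 3)
#   -> "s3://discovery-data/pin_join/dt=2014-02-20/iteration=3/"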

Access and storage

Discovery data are loaded into different storage systems after creation, based on data size and importance. We support the following four types of access:

Production random access using keys:

  • Random access with very high QPS
  • Response time is less than 20 milliseconds at P99
  • Limited space

Redshift batch access:

  • Fast batch access
  • Response time for simple aggregation operations is around one second
  • Limited space

Hive/cascading batch access:

  • Slow batch access
  • Response time for simple aggregation can be up to multiple minutes
  • Unlimited space

Debugging random access using keys:

  • Random access with very low QPS
  • Response time is multiple seconds for a single key
  • Unlimited space

We use the HFile format to store all data, which enables us to randomly access it on S3 without further indexing (so all data are debuggable offline). We create a light version of each data set, where unimportant yet large fields are unset so that it can fit into space-limited storage. We use both HBase and in-house storage systems to hold production random-access data.

Discovery data serving

We design our online high QPS serving system to achieve three high-level objectives:

  • It should be able to support multiple external storages.
  • It should be able to support multiple data sets independently.
  • It should be able to support arbitrary mixing of different data.

The serving system architecture is illustrated in the figure below. It’s a Finagle Thrift service that responds to queries. A Response is a list of object lists, so different types of objects can be put into different lists inside a single Response object.

struct Response {
  1: optional list responses;
}

Queries are handled by a scheduler. Upon receiving a query, the scheduler picks Processors from a processor pool. Processors share a common API and return a Future of a response object, which allows processors to be chained together arbitrarily.

public abstract class Processor {
  public abstract Future process(Request request, ResponseData prev);
}

An execution plan is an execution relationship over a set of processors. A valid plan contains no cycles (i.e., it’s a DAG). Execution plans are either defined by a query or loaded from configuration. After receiving a query, the scheduler verifies the validity of its execution plan, performs a topological sort, and issues the tasks one by one. Top-level processors fetch data from external storage or services. The data are further fed to other processors and eventually transformed into a Response.
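
As a simplified, synchronous Python sketch of that scheduling step (the real system chains Finagle Futures asynchronously, as described next), an execution plan can be validated and walked in topological order like this:

# Minimal sketch: validate that the processor graph is a DAG, topologically
# sort it, and run processors in dependency order, feeding each the outputs of
# its parents. Assumes a single final processor producing the Response.
from graphlib import TopologicalSorter   # Python 3.9+; raises CycleError on cycles


def run_plan(plan, processors, request):
    """plan: processor name -> set of upstream processor names;
    processors: name -> callable(request, prev_results) -> result."""
    order = list(TopologicalSorter(plan).static_order())
    results = {}
    for name in order:
        prev = [results[dep] for dep in plan.get(name, ())]   # top-level processors get []
        results[name] = processors[name](request, prev)
    return results[order[-1]]                                 # the final processor's result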

The Finagle service uses asynchronous Future calls. When issuing tasks, the scheduler doesn’t have to wait for each processor to complete. Instead, it operates directly on Future objects in three ways:

  • Chain one Future object to the next Future object.
  • Merge multiple Future objects into a single Future object.
  • Split a single Future object into multiple ones.

Lessons in data

Data are the fundamental units for all discovery projects, so it’s important to keep them fresh and accessible. We learned several lessons, including:

  • Applications often have different needs. Offline analytics and online products have different access patterns. Creating data in multiple tiers is necessary.
  • A flexible serving system can quickly power experiments and products, and largely accelerate product development.

This project was a joint effort of the whole discovery team at Pinterest. If you are interested in being part of it, join our team!

Tao Tao is a software engineer at Pinterest.


Apache ZooKeeper is an open source distributed coordination service that’s popular for use cases like service discovery, dynamic configuration management and distributed locking. While it’s versatile and useful, it has failure modes that can be hard to prepare for and recover from, and if used for site critical functionality, can have a significant impact on site availability.

It’s important to structure the usage of ZooKeeper in a way that prevents outages and data loss, so it doesn’t become a single point of failure (SPoF). Here, you’ll learn how Pinterest uses ZooKeeper, the problems we’ve dealt with, and a creative solution to benefit from ZooKeeper in a fault-tolerant and highly resilient manner.

Service discovery and dynamic configuration

Like many large scale web sites, Pinterest’s infrastructure consists of servers that communicate with backend services composed of a number of individual servers for managing load and fault tolerance. Ideally, we’d like the configuration to reflect only the active hosts, so clients don’t need to deal with bad hosts as often. ZooKeeper provides a well known pattern to solve this problem.

Each backend service host registers an ephemeral node in ZooKeeper in a path specific to that service. Clients can watch that path to determine the list of hosts. Since each node is ephemeral, it will automatically be removed by ZooKeeper if the registering host fails or shuts down. New hosts brought up automatically register themselves at the correct path, so clients notice within a few seconds. Thus the list stays up to date and reflects the hosts that are active, which addresses the issues mentioned above.
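
A minimal sketch of this pattern using the kazoo Python client (the hosts, path, and payload are examples, not our exact setup):

# Sketch of ZooKeeper-based service discovery with kazoo.
from kazoo.client import KazooClient

SERVICE_PATH = "/discovery/dataservice/prod"

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Server side: register an ephemeral, sequential node; ZooKeeper removes it
# automatically if this process dies or disconnects.
zk.ensure_path(SERVICE_PATH)
zk.create(SERVICE_PATH + "/server-", b"host1.example.com:8000",
          ephemeral=True, sequence=True)

# Client side: watch the children of the path so the host list stays current.
@zk.ChildrenWatch(SERVICE_PATH)
def on_hosts_changed(children):
    hosts = [zk.get(SERVICE_PATH + "/" + child)[0].decode() for child in children]
    print("active hosts:", sorted(hosts))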

Another use case is any type of application configuration that needs to be updated dynamically and applied within a few seconds to multiple machines.

Imagine a distributed database fronted by a service layer, referred to as the ‘Data Service’. Let’s assume the database is partitioned by user. When a user request comes in, the Data Service needs to know which database holds the information for that user. This information could be deployed statically with the application, but that has the same problems described above with service discovery. Additionally, it’s important that database configuration changes converge quickly across all the Data Service machines; otherwise, we wouldn’t be able to apply updates quickly.

Here again, ZooKeeper is of help. The configuration can be stored in a node in ZooKeeper that all the Data Service servers watch. A command line tool or GUI can update the node, and within a few seconds, all the Data Service machines will reflect the update. Voila!

ZooKeeper Failure Modes

While ZooKeeper can play a useful role in a backend infrastructure stack as shown above, like all software systems, it can fail. Here are some possible reasons:

  • Too many connections: Let’s say someone brought up a large Hadoop job that needs to communicate with some of the core Pinterest services. For service discovery, the workers need to connect to ZooKeeper. If not properly managed, this could temporarily overload the ZooKeeper hosts with a huge volume of incoming connections, causing it to get slow and partially unavailable.
  • Too many transactions: There can be a surge in ZooKeeper transactions, such as a large number of servers restarting in a short period and attempting to re-register themselves with ZooKeeper (a variant of the thundering herd problem). In this case, even if the number of connections isn’t too high, the spike in transactions can take down ZooKeeper.
  • Protocol bugs: Occasionally under high load, we’ve run into protocol bugs in ZooKeeper that result in data corruption. In this case, recovery usually involves taking down the cluster, bringing it back up from a clean slate and then restoring data from backup.
  • Human errors: In all software systems, there’s a possibility of human error. For example, we’ve had a manual replacement of a bad ZooKeeper host unintentionally take the whole ZooKeeper quorum offline for a short time due to erroneous configuration being put in place.
  • Network partitions: While relatively rare, network connectivity issues resulting in a network partition of the quorum hosts can result in downtime till the quorum can be restored.

While site outages due to service failures are never completely avoidable when running a web service as large and fast moving as Pinterest, there were a few reasons why ZooKeeper issues were particularly problematic:

  • There was no easy immediate mitigation. If bad code is deployed, we can rollback immediately to resolve the issue and investigate what happened later. If a particular part of the site fails, we can disable that service temporarily to mitigate impact. There wasn’t such an option available for ZooKeeper failures.
  • Recovery would take multiple hours in some cases, due to a combination of the thundering herd problem and having to restart from a clean slate and recover from backup.
  • The centrality of ZooKeeper in our stack meant a pretty wide impact on the site. That is, the outages weren’t isolated to a particular part of the site or feature, but rather impacted the site as a whole.

Initial Mitigation Attempts

A few strategies to mitigate the problem were suggested, but they would only provide limited relief.

  • Add capacity: The simplest strategy to tackle load issues is to add capacity. Unfortunately, this doesn’t quite work, since the ZooKeeper design and protocol are such that the quorum (voting members) cannot scale out very well. In practice, we have observed that having more than 7-10 hosts tends to make write performance significantly worse.
  • Add observers: ZooKeeper observers offer a solution to the scaling problem. These are non-voting members of the ZooKeeper ensemble that otherwise function like any other ZooKeeper host, i.e., they can accept watches and other types of requests and proxy them back to the quorum. Adding a fleet of observers and shunting traffic off to them substantially alleviated the load on the ZooKeeper cluster. However, this only helped with watches and other reads, not with writes. It partially addressed the first failure cause mentioned in the previous section (too many connections), but not the others.
  • Use multiple ZooKeeper clusters for isolation: A mitigation we attempted was to use different ZooKeeper clusters for each use case, e.g. our deploy system uses a separate cluster from the one used for service discovery, and our HBase clusters each use independent ZooKeeper clusters. Again, this helped alleviate some of the problems, but wasn’t a complete solution.
  • Fallback to static files: Another possibility is to use ZooKeeper for service discovery and configuration in normal operation, but fall back to static host lists and configuration files if it fails. This is reasonable, but it becomes problematic to keep the static data up to date and accurate, particularly for services that are auto-scaled and have high churn in their host set. Another option is to fall back to a different storage system (e.g. MySQL) in case of failure, but that too is a manageability nightmare, with multiple sources of truth and another system that must be provisioned to handle the full query load and traffic spike in the event of ZooKeeper failing.

We also considered a couple more radical steps:

  • Look for an alternative technology that is more robust: There are a few alternative distributed coordinator implementations available through open source, but we found them to generally be less mature than ZooKeeper and less tested in production systems at scale. We also realized these problems are not necessarily implementation specific, rather, any distributed coordinator can be prone to most of the above issues.
  • Don’t use central coordination at all: Another radical option was to not use ZooKeeper at all, and go back to using static host lists for service discovery and an alternate way to deploy and update configuration. But doing this would mean compromising on latency of update propagation, ease of use and/or accuracy.

Narrowing in on the solution

We ultimately realized that the fundamental problem here was not ZooKeeper itself. Like any service or component in our stack, ZooKeeper can fail. The problem was our complete reliance on it for overall functioning of our site. In essence, ZooKeeper was a Single Point of Failure (SPoF) in our stack. SPoF terminology usually refers to single machines or servers, but in this case, it was a single distributed service.

How could we continue to take advantage of the conveniences ZooKeeper provides while tolerating its unavailability? The solution was actually quite simple: decouple our applications from ZooKeeper.

The drawing on the left represents how we were using ZooKeeper originally, and the one on the right shows what we decided to do instead.

Applications that were consuming service information and dynamic configuration from ZooKeeper connected to it directly. They cached data in memory but otherwise relied on ZooKeeper as the source of truth for the data. If there were multiple processes running on a single machine, each would maintain a separate connection to ZooKeeper. ZooKeeper outages directly impacted the applications.

Instead of this approach, we moved to a model where the application is in fact completely isolated from ZooKeeper. Instead, a daemon process running on each machine connects to ZooKeeper, establishes watches on the data it is configured to monitor, and whenever the data changes, downloads it into a file on local disk at a well known location. The application itself only consumes the data from the local file and reloads when it changes. It doesn’t need to care that ZooKeeper is involved in propagating updates.
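
Here’s a rough sketch of what such a daemon could look like for a single service-discovery path, again using kazoo (the paths match the sample configuration shown later in this post, but the code itself is illustrative). Writing to a temporary file and renaming it keeps readers from ever seeing a partially written file:

# Sketch of a ZooKeeper-mirroring daemon: watch a path and atomically rewrite
# a local file whenever it changes; applications only ever read the file.
import os
import tempfile

from kazoo.client import KazooClient

ZK_PATH = "/discovery/dataservice/prod"
LOCAL_FILE = "/var/service/discovery.dataservice.prod"


def write_atomically(path, lines):
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    os.rename(tmp, path)                 # atomic swap on the same filesystem


zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

@zk.ChildrenWatch(ZK_PATH)
def on_change(children):
    # One host:port per line, matching what the application-side reader expects.
    hosts = [zk.get(ZK_PATH + "/" + child)[0].decode() for child in children]
    write_atomically(LOCAL_FILE, sorted(hosts))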

Looking at scenarios

Let’s see how this simple idea helps our applications tolerate the ZooKeeper failure modes. If ZooKeeper is down, no applications are impacted since they read data from the file on local disk. The daemons indefinitely attempt to reestablish connection to ZooKeeper. Until ZooKeeper is back up, configuration updates cannot take place. In the event of an emergency, the files can always be pushed directly to all machines manually. The key point is that there is no availability impact whatsoever.

Now let’s consider the case when ZooKeeper experiences a data corruption and there’s partial or complete data loss till data can be restored from backup. It’s possible a single piece of data in ZooKeeper (referred to as znode) containing important configuration is wiped out temporarily. We protect against this case by building a fail-safe rule into the daemon: it rejects data that is suspiciously different from the last known good value. A znode containing configuration information entirely disappearing is obviously suspicious and would be rejected, but so would more subtle changes like the set of hosts for a service reducing suddenly by 50%. With carefully tuned thresholds, we’ve been able to effectively prevent disastrous configuration changes from propagating to the application. The daemon logs such errors and we can alert an on-call operator so the suspicious change can be investigated.
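
A sketch of one such fail-safe check, with a made-up 50% shrink threshold; the daemon keeps serving the last known good data when an update looks suspicious:

# Illustrative fail-safe: reject updates that look suspiciously different from
# the last known good value instead of propagating them to applications.
def is_suspicious(old_hosts, new_hosts, max_shrink_ratio=0.5):
    if not new_hosts:
        return True                                   # znode wiped out entirely
    if old_hosts and len(new_hosts) < len(old_hosts) * max_shrink_ratio:
        return True                                   # host set shrank by more than half
    return False


def maybe_apply_update(old_hosts, new_hosts, apply_fn, alert_fn):
    if is_suspicious(old_hosts, new_hosts):
        alert_fn("rejected suspicious update: %d -> %d hosts"
                 % (len(old_hosts), len(new_hosts)))
        return old_hosts                              # keep the last known good data
    apply_fn(new_hosts)
    return new_hosts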

One might ask, why build an external daemon? Why not implement some sort of disk-backed cache and fail-safe logic directly into the application? The main reason is that our stack isn’t really homogeneous: we run applications in multiple languages like Python, Java, and Go. Building a robust ZooKeeper client and fault-tolerant library in each language would be a non-trivial undertaking, and worse still, painful to manage in the long run since bug fixes and updates would need to be applied to all the implementations. Instead, by building this logic into a single daemon process, the application itself only performs a simple file read coupled with a periodic check and reload, which is trivial to implement. Another advantage of the daemon is that it reduces the overall number of connections into ZooKeeper, since we only need one connection per machine rather than one per process. This is a particularly big win for our large Python service fleet, where we typically run 8 to 16 processes per machine.

How about services registering with ZooKeeper for discovery purposes? That too can in theory be outsourced to an external daemon, with some work to ensure the daemon correctly reflects the server state in all cases. However, we ended up keeping the registration path in the application itself and instead spent some time taking care to harden the code path to make it resilient to ZooKeeper outages and equally importantly, be guaranteed to re-register after an outage. This was considered sufficient since individual machines failing to register does not typically have a large impact on the site. On the other hand, if a large group of machines suddenly lose connection to ZooKeeper thereby getting de-registered, the fail-safe rules in the daemon on the consumption side will trigger and result in a rejection of the update, thereby protecting the service and its clients.

Rolling out the final product

We implemented the daemon and modified our base machine configuration to ensure it was installed on all machine groups that needed to use ZooKeeper. The daemon operation is controlled by a configuration file.

Here’s a sample configuration snippet for the daemon:

[dataservice]
type = service_discovery
zk_cluster = default
zk_path = /discovery/dataservice/prod
command = /usr/local/bin/zk_download_data.py -f /var/service/discovery.dataservice.prod -p /discovery/dataservice/prod

Here we’re telling the daemon to monitor the /discovery/dataservice/prod path in the default ZooKeeper cluster for service discovery use, and when ZooKeeper notifies it of a change within that path, to run the zk_download_data script. This script reads all the children znodes of that path, retrieves the host and port information for each server and writes it to the specified file path on local disk, one host:port per line. A configuration file can contain several snippets like the above, comprising service discovery, dynamic configuration and other use cases. The format is designed to be generic enough to be able to extend to other use cases in future.

Next, our client libraries that deal with configuration or service discovery were modified to use files on local disk instead of ZooKeeper as the source of data. We attempted to make this change as opaque to the application code as possible. For example, some of our Java services use Twitter’s ServerSet libraries for service discovery. In this case, we built a ServerSet implementation that is backed by host:port combinations retrieved from a file on local disk (e.g. /var/service/discovery.dataservice.prod mentioned above). This implementation takes care of reloading data when the file content changes. Consumers of the ServerSet interface, including RPC client libraries, don’t need to be aware of the change at all. Since our services use a common set of RPC libraries, it was fairly easy to roll this change out across all our services. Similar relatively simple changes were made to our Python and other language libraries.
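
Our Python libraries follow the same idea; here’s a minimal sketch of a file-backed host list that re-reads the file only when its modification time changes (a simplified stand-in, not our actual client library):

# Sketch of the consumer side: the application reads hosts from the local file
# written by the daemon and reloads only when the file changes.
import os


class FileBackedServerSet:
    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._hosts = []

    def hosts(self):
        mtime = os.path.getmtime(self.path)
        if mtime != self._mtime:                      # the daemon rewrote the file
            with open(self.path) as f:
                self._hosts = [line.strip() for line in f if line.strip()]
            self._mtime = mtime
        return self._hosts


# e.g. FileBackedServerSet("/var/service/discovery.dataservice.prod").hosts()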

Validating the new design

To be sure our infrastructure was resilient to ZooKeeper failure, we ran a number of test scenarios. Then the real thing happened: a couple of weeks after rollout, there was another ZooKeeper outage, triggered by load introduced by an unrelated bug in one of our client libraries. We were happy to see that this outage caused no site impact, and in fact, it went unnoticed until the ZooKeeper monitoring alerts themselves fired. Since the outage happened late in the evening, a decision was made to do the cluster restoration the next morning. The entire site functioned normally all night despite ZooKeeper being down, thanks to our resilience changes being in place.

Want to work on interesting distributed systems problems like this? We’re hiring!

Raghavendra Prabhu is a software engineer at Pinterest.

Acknowledgements: This project was a joint effort between the infrastructure and technical operations teams at Pinterest. The core project members were Michael Fu, Raghavendra Prabhu and Yongsheng Wu, but a number of other folks across these teams provided very useful feedback and help along the way to make the project successful.


With a strong culture of making, we’re always putting our heads together to build the next great thing. We regularly hold Make-a-thons, all-night events where we work on projects we’ve been wanting to crank out but may not have had time for during the normal work day. This is how products such as Send a Pin and Price Drop Notifications (built by a Pintern) got their start.

At our last Make-a-thon, a small team, including another engineer, a designer, and a localization manager, brought animated GIFs to Pinterest. After building the product overnight, we quickly launched on web and worked on mobile in parallel. Starting today, GIFs are available to everyone on Android and iOS!

A night of GIFs

By Richard Perez of skinnyships.com

The idea to build GIFs occurred to us around midnight the night of our Make-a-thon, while working on a separate project. We had been hearing from Pinners that they wanted a way to save GIFs, and quite frankly, I also wanted a way to make my Cats board more playable.

And so, we ran with it.

Challenges and wins

One of the biggest challenges, and the reason GIF support had been missing from Pinterest for so long, was that we wanted to make the experience elegant and seamless, yet highly functional.

From Memeplaza

We ran a number of tests, first playing GIFs directly in the grid, then transitioning to playing on hover, and even experimenting with putting a bigger Play button on the Pin itself. Ultimately, we settled on a treatment for animated content with a small Play button on the bottom left of the Pin.

In the spirit of Make-a-thons and dogfooding, we started collecting feedback. As much feedback as we could. It was great to watch Pinployees find and enjoy funny GIFs in the middle of the night and navigate the UI of this newly formed product.

The positive response was a clear indicator that we were on the right path to a great solution, and so we raced to launch with the following goals:

  • Make GIFs playable in the grid and Pin close-up.
  • Maintain an elegant user experience by only playing GIFs when the Pinner clicks them.
  • Minimize wasted bandwidth by using the click-to-play design, so that GIFs, which are generally large files, are only loaded when there’s intent to play them.
  • Build with an easy-to-consume API so we could extend to mobile clients.
  • Work closely with design to differentiate experiences on mobile, where it doesn’t make sense to play in the grid. Design was also crucial in helping us quickly build a seamless experience that Pinners would love.

Shipping GIFs

With the qualitative results looking good, we needed to make sure the quantitative results held up too. We ran an experiment to see the impact on activity such as repinning, viewing and sending Pins, and following Pinners, and saw encouraging results.

Stretching into the wee hours of the morning, we launched GIFs on web to the company, and later to all Pinners. We began working on mobile immediately in parallel, and with the latest mobile releases, GIFs are now available for your enjoyment across all platforms!

Now you’ll find GIFs all over Pinterest, including our board of Valentines.

Ludo is an engineer on the Pinterest Growth team.


Pinterest is a data-driven company where we’re constantly using data to inform decisions, so it was important to develop an efficient system to analyze that information quickly.

We ultimately landed on Redshift, a data warehouse service on Amazon Web Services, to power our interactive analysis, and we import billions of records every day to make the core data available as soon as possible. Redshift answers questions in seconds, enabling interactive data analysis and quick prototyping (whereas Hadoop and Hive process tens of terabytes of data per day, but only return answers in minutes to hours).

Here’s a look at our experience with Redshift, including challenges and wins, while scaling to tens of billions of Pins in the system.

Redshift is built to be easy to set up and fairly reliable. However, with petabytes of data and a rapidly growing organization, there were some interesting challenges in using Redshift in production.

Challenge 1: building 100-terabyte ETL from Hive to Redshift

With petabytes of data in Hive, it took us some time to figure out the best practices for importing more than 100 terabytes of core data into Redshift. We have various data formats in Hive, including raw JSON, Thrift, and RCFile, all of which need to be transformed into text files with a flat schema. We write schema-mapping scripts in Python to generate the Hive queries that do the heavy-lifting ETL.
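
As a simplified illustration of such a schema-mapping script (the table, columns, and staging path are invented), the generated Hive query might look like:

# Sketch: generate a Hive query that flattens a source table into plain text
# rows staged on S3, ready to be COPYed into Redshift.
def hive_export_query(src_table, dst_dir, columns, dt):
    """columns: list of (redshift_column, hive_expression) pairs."""
    select_list = ",\n  ".join("%s AS %s" % (expr, name) for name, expr in columns)
    return ("INSERT OVERWRITE DIRECTORY '%s'\n"
            "SELECT\n  %s\nFROM %s\nWHERE dt = '%s'" % (dst_dir, select_list, src_table, dt))


print(hive_export_query(
    "pins_thrift",
    "s3://etl-staging/db_pins/dt=2014-01-15",
    [("pin_id", "id"),
     ("board_id", "board_id"),
     ("description", "COALESCE(description, '')")],   # flatten/clean as needed
    "2014-01-15"))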

Most of our tables in Hive are time-series data partitioned by date. For the best results, we use the date as the sortkey and append data to each table daily to avoid expensive VACUUM operations. Another approach is using one table per day and connecting them with a view, but we found Redshift did not optimize queries with views very well (e.g., it did not push down LIMIT).

Loading a big snapshot table is also a challenge. Our biggest table, db_pins, which holds 20 billion Pins, is more than 10TB in size. Loading it in one shot results in expensive partitioning and sorting, so we do the heavy partitioning in Hive and load it into Redshift in chunks.

Since Redshift has limited storage, we implemented table retention for big time-series tables by periodically running an INSERT query in Redshift to clone a new table with less data, which was much faster than deleting rows followed by an expensive VACUUM, or dropping the entire table and re-importing.
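
A rough sketch of that retention step, with illustrative SQL generated from Python (not our exact statements):

# Sketch: clone the recent slice of a time-series table into a new table and
# swap it in, instead of DELETE followed by an expensive VACUUM.
def retention_sql(table, date_column, keep_days):
    return """
CREATE TABLE {t}_new (LIKE {t});
INSERT INTO {t}_new
  SELECT * FROM {t}
  WHERE {c} >= DATEADD(day, -{d}, GETDATE());
ALTER TABLE {t} RENAME TO {t}_old;
ALTER TABLE {t}_new RENAME TO {t};
DROP TABLE {t}_old;
""".format(t=table, c=date_column, d=keep_days)


print(retention_sql("impressions_daily", "dt", 90))   # keep the last 90 days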

Perhaps the biggest challenge was S3’s eventual consistency. We found that data was sometimes missing from Redshift and traced it back to this S3 issue. We reduced the number of files on S3 by combining small files, which minimized the missing data. We also added audits at each step of the ETL, and the data loss rate is now usually under 0.0001%, which is acceptable.

Challenge 2: getting the 100x speedup over Hive out of Redshift

Redshift is built to be high performing, and during prototyping we were able to achieve a 50-100x speedup over Hive for some queries without much effort. But once put into production, it did not always give the expected performance out of the box.

In our early testing we observed a few hour-long queries. Debugging the performance issues was tricky, requiring us to collect query plans, query execution statistics, and so on, and ultimately there weren’t that many performance tricks to apply.

A lesson we learned here was to prepare the data well and update the system statistics whenever necessary, as this can tremendously affect how well the optimizer works. Choose a good sortkey and distkey for each table: a sortkey is always good to have, but a bad distkey may introduce skew and hurt performance.

Below are the benchmark results comparing our Hive and Redshift clusters, based on db_pins (~20 billion rows with 50 columns, total size 10TB) and other core tables. Keep in mind that these comparisons don’t account for cluster sizes, contention, and other possible optimizations, so they are by no means scientific.

We also observed common mistakes and summarized our best practices, and built utilities to monitor slow queries in real time. Queries that take more than 20 minutes are considered suspicious, and engineers will receive a reminder to review our following best practices:

Perhaps the most common mistake is using “SELECT *” in the select clause, which works against columnar storage because it scans all columns unnecessarily.

Challenge 3: managing the contention among growing number of queries/users

With its impressive performance, Redshift was widely adopted at Pinterest soon after we set it up. We’re a fast-growing organization, and with an increasing number of people using Redshift concurrently, queries can easily slow down due to contention. An expensive query can occupy a lot of resources and slow down other concurrent queries, so we came up with our own policies to minimize contention.

We learned to avoid heavy ETL queries during peak hours (9am-5pm). ETL queries like COPY use a lot of I/O and network bandwidth, and should be avoided during these hours in favor of users’ interactive queries. We optimize our ETL pipeline to finish before peak hours, or we pause the pipeline and resume afterward. Additionally, we time out users’ interactive queries during peak hours, since long-running queries likely contain mistakes and should be cut off as soon as possible instead of wasting resources.

Current status

We’ve settled on a 16-node hs1.8xlarge cluster. Almost 100 people use Redshift regularly at Pinterest, running 300-500 interactive queries each day. The overall performance has exceeded our expectations, as most queries finish in a few seconds. Looking at the duration percentiles of all interactive queries in the last week, 75% of queries finish within 35 seconds.

Because of the success we’ve had with Redshift, we’re continuing to integrate it with our next generation tools. If you’re interested in taking on these types of challenges and coming up with quick, scalable solutions, join our team!

Jie Li is an engineer at Pinterest.


Almost every data-driven company depends on a workflow management system. At Pinterest, we built Pinball, our own customizable platform for creating workflow managers before constructing a manager on top of it.

The birth of project Pinball

Hadoop is the technology of choice for large-scale distributed data processing, Redis is the go-to in-memory key-value store, and ZooKeeper handles synchronization of distributed processes. So why isn’t there a standard for workflow management?

What is it that makes workflow management special? The workflow manager operates at a higher level than other systems. It usually comes as a package including a workflow configuration language, UI, executors with adapters for specific computing platforms (e.g., barebone OS, Hadoop, Hive, Pig), a scheduler, etc. The broadness of the scope makes it challenging to come up with a one-size-fits-all solution, and flexibility is a desired virtue of a well-designed workflow manager.

Many of the workflow managers available in the open source space fail to satisfy the requirements of the problem domain. Built on fixed skeletons, they often follow a monolithic design, and adding custom components requires drilling into the system core. Merging local changes with external releases becomes challenging, and pushing local changes to the external repository is often not possible.

Tokens get the (Pin)ball rolling

The key to flexibility is abstraction. In Pinball, the finest piece of system state that’s atomically updated is the token. Tokens may be owned for a limited time, and only owners are allowed to modify a token’s content. Tokens are versioned with identifiers that are unique across time and space: a version is updated every time a token is modified, and it’s never reused, even across different tokens. Version numbers are included in modification requests, effectively implementing transactional updates.

To keep track of a token’s identity across modifications, a token is assigned an immutable name - an identifier unique across space - at creation time. At a given point in time, only one token may use a given name.
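
To make the token semantics concrete, here’s an illustrative Python sketch (not Pinball’s actual classes) of a token with an immutable name, a never-reused version, and compare-and-set updates:

# Illustrative token model: updates succeed only if the caller holds the
# current version, giving transactional, compare-and-set semantics.
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Token:
    name: str                        # immutable; unique at any point in time
    data: bytes = b""
    version: str = field(default_factory=lambda: uuid.uuid4().hex)
    lease_owner: Optional[str] = None
    lease_expiration: float = 0.0


class Master:
    def __init__(self):
        self.tokens = {}

    def update(self, name, expected_version, new_data):
        token = self.tokens[name]
        if token.version != expected_version:
            raise ValueError("stale version; re-read the token and retry")
        token.data = new_data
        token.version = uuid.uuid4().hex   # new version, never reused
        return token.version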

One master to rule them all

Pinball’s core is built on top of the master-worker paradigm. It’s generic, scalable, resilient, and above all, simple, which makes the system’s concepts easier to grasp, debug, and build on top of.

Workers periodically contact the master to claim tokens and perform the tasks they represent. Every state (token) change goes through the master and gets committed to the persistent store before the worker request returns. Consequently, any component, including the master, can die at any time and recover without compromising state consistency. Expiring token leases take care of disappearing workers. Workers can be added and removed at will, practically at any point in time, which becomes useful when dynamic, load-based resizing is an objective.
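
A hypothetical worker loop built on that token model (the claim_runnable_token call and the run_job stub are assumptions for illustration, not Pinball’s API):

# Sketch of a worker: claim a token under a lease, do the work it represents,
# and commit the result through the master before moving on.
import time


def run_job(data):
    """Placeholder for executing whatever job the token represents."""
    return data


def worker_loop(master, worker_id, poll_seconds=5, lease_seconds=60):
    while True:
        token = master.claim_runnable_token(owner=worker_id,
                                            lease_seconds=lease_seconds)
        if token is None:
            time.sleep(poll_seconds)       # nothing claimable right now
            continue
        result = run_job(token.data)
        # If our lease expired meanwhile, the compare-and-set update fails and
        # another worker will redo the job; state stays consistent either way.
        master.update(token.name, token.version, result)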

Newly created tokens are labeled as active and kept in the master’s memory for efficient access and updates. Irrelevant tokens are either deleted or archived. The transition from active to archived is one way: archived tokens become read-only and are pushed out of memory.

Consequently, workers can read archived tokens directly from the persistent storage, bypassing the master, greatly improving system scalability.

Workflow layer and the master-worker paradigm

The master-worker paradigm is applied to workflow management not as a horizontal extension, but rather as a vertical layer on top of the core described in the previous section. The following components constitute the Pinball workflow management framework.

The configuration parser converts a workflow description into a set of tokens. Typically, tokens are defined at the granularity of the individual jobs forming the workflow topology.

Workers impersonate application-specific clients. Some workers, the executors, are responsible for running jobs or delegating execution to an external system (such as a Hadoop cluster) and monitoring progress. The scheduler worker makes sure that workflows are instantiated at predefined time intervals.

The UI visualizing execution progress and providing access to job logs is also a worker. The UI may need to visit arbitrarily old workflows, introducing the need to keep tokens of finished jobs around. This is where the token archival mechanism and direct reads from the persistent store come in handy: the UI can go as far back in history as needed without putting load on the operational components.

Any component can be replaced without affecting the remaining ones. For instance, supporting a new workflow definition language is as simple as replacing the workflow parser translating the configuration to a set of tokens. Similarly, adding a new type of executor does not require changes in other parts of the system.

Selected features

  • Python-based workflow definition language and an accompanying parser, and an alternative, full-UI workflow and job editor
  • Github-backed storage of workflow configurations
  • Job executors interfacing with local OS, Hive, and Hadoop
  • Workflow visualization and tracking UI
  • Job log explorer
  • Auto-retries of failed jobs
  • Email notifications
  • User authentication
  • Scheduler governing workflow execution timelines and supporting various overrun policies
  • Ability to retry failed jobs, abort running workflows, drain the system, resume the execution from where it left off
  • No-downtime releases, as the system can be upgraded to a new software version without disrupting running workflows
  • Dynamic resizing, so workers can be added or removed without taking the system down

Each of these features can be removed, altered, or replaced with minimal effort and without the need to understand all the intricate details of the system core.

Coming soon

We’ll be open sourcing Pinball soon. Keep an eye on this blog and our Pinterest Engineering Facebook Page for updates.

Pawel Garbacki is a software engineer at Pinterest.


An HTML sitemap makes it easier for search engines and people visiting your site to find content. It should also let them get to that content in as few clicks as possible, with a minimum number of page loads. With a site as big as Pinterest (we’re talking hundreds of millions of pages), though, building a sitemap can present some interesting challenges. Here’s the story of how Pinterest engineers tackled it.

Divide and conquer

To make the colossal job of mapping our site a bit more manageable, we split up our sitemap based on the different types of pages on our site—one sub-sitemap for boards, a separate sub-sitemap for Pinner profiles, and so on.

We then divided the Pinner profiles based on the first letter of the Pinner’s username, which allowed us to divide the sitemap creation task between multiple processes (one for each letter). While it didn’t guarantee uniform distribution of the workload, it kept the whole system intuitive to use.

Since we have millions of usernames for each letter of the alphabet, simply listing them all on consecutive pages wasn’t really an option. Instead, we divided the Pinner profiles into ranges, based on usernames:

Each range links to another sitemap page, where the usernames are broken down into smaller ranges. Leaf pages include usernames as entries, and each username links to a Pinner profile on Pinterest.

By organizing things this way, our visitors (including search engine bots) can now get to any profile page with a minimum number of page loads.

How we did it

We sorted all our Pinner profiles into alphabetical lists using secondary sort in Hadoop MapReduce. Each mapper processed user data and emitted the first letter of the username as the primary key, and the username as the secondary key. With this setup, all usernames starting with the same letter ended up in the same reducer, sorted alphabetically.

At that point, we were ready to generate the sitemap pages for each letter, going from the bottom up. First we split a sorted list of usernames (U0, …, UK) into HTML pages (P0, …, PM), each containing up to N usernames:

Each page in the list represents a range of usernames. For example for P0 it is (U0, …, UN-1). We split this list of leaf pages again into N-sized chunks to create a list of higher-level pages (P’0, …, P’M’). We repeated this process until the resulting page list had N or fewer elements. From this list we created the top level page (P’’0).

We were almost done, except that the last page on the list of leaf and lower-level pages (e.g., PM) often contained just a few entries with a lot of space around them. We fixed this issue by borrowing entries from the previous page on the list (PM-1), which had N entries. After the fix, PM-1 and PM had roughly the same number of entries.

In the final step, the top-level page (P’’0) had too few entries. We borrowed entries from the first child page (P’0) as shown above. We also updated the first entry in the original page (before the fix) to represent the range for new page P’0 (after the fix).
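
Here’s a minimal Python sketch of the bottom-up page construction under the assumptions above (N entries per page); the borrowing fix and the Hadoop plumbing are left out for brevity:

# Sketch: split sorted usernames into leaf pages, then repeatedly group pages
# into higher-level range pages until a single top-level page remains.
def chunk(items, n):
    return [items[i:i + n] for i in range(0, len(items), n)]


def first_entry(page):
    # A leaf page holds usernames; a higher-level page holds (username, child) ranges.
    return page[0] if isinstance(page[0], str) else page[0][0]


def build_sitemap_levels(sorted_usernames, n):
    levels = [chunk(sorted_usernames, n)]            # leaf pages P0..PM
    while len(levels[-1]) > 1:
        parents = [[(first_entry(page), page) for page in group]
                   for group in chunk(levels[-1], n)]
        levels.append(parents)                       # P'0..P'M', then finally P''0
    return levels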

And that’s it! It took some time, and some intriguing problem-solving, but now Pinterest has an efficient sitemap of its very own. If you’re interested in solving problems like these, join our team to help people discover the things they love.

Anna Majkowska is a software engineer at Pinterest.


Here’s a look at how we defined release engineering at Pinterest, including how we developed efficient ways of fixing bugs, created design principles, streamlined tools to be useful across teams, and built release tools.

Tools of the Trade

We use a variety of tools that are fairly accessible to most companies.

  • Github Enterprise is our version-control overlay; it manages code reviews, facilitates code merging, and has a great API.
  • Jenkins is our continuous integration system for packaging builds and running unit tests after each check-in.
  • Zookeeper manages our state, and tells each node what version of code it should be running.
  • Amazon S3 is where we keep our builds. It’s a simple way to share data and scales with no intervention on our end.

The build pipeline

In practice, Pinterest is a continuous delivery shop, meaning the master branch of code is always deployable.

Here’s our build pipeline, from the inception of a change, to being served in production:

  • An engineer makes a git branch.
  • She pushes it to a fork in Github Enterprise and submits a Pull Request.
  • Jenkins runs automated tests against her pull request (for services in our main repository).
  • The request is merged after it’s approved by the original engineer.
  • The newly integrated master branch triggers a Jenkins job that runs the same automated tests in step 3.
  • We then run a build task that creates a tarball of the files and pushes it to S3. We also branch this build in git as jenkins-stable.
  • Some systems (e.g., an internal copy of the site) are deployed to automatically.
  • A deploy is then manually initiated. We usually choose the same build that’s currently jenkins-stable, but we can choose anything that’s available in S3.
  • The deploy:

    A build can be deployed to a set of canary hosts, or to our entire fleet. We record this state in Zookeeper. “Canarying” to a few hosts gives us time to validate that everything is working as planned. Each node has a robust Zookeeper client called deployd that listens to either enabled or enabled/canary depending on its role (which is also defined in Zookeeper).

    Graceful restarts are somewhat complicated, so we employ different strategies for different services. In the most advanced form, multiple copies of the same service run on a machine, and through iptables we’re able to turn off traffic to a few instances while we restart them. For most services, we define a restart concurrency that determines how many nodes will be restarted at any given time. In some cases we can restart almost all the nodes at once with no user impact.

    We can then monitor our deploy in our state-of-the-art deploy monitoring tool.

    We had the help of Erik Rose’s blessings module, which makes light work of the terminal.

Design principles

Don’t touch the deployed code directly: Initially our deploy scripts lived in the same repository as the code we were deploying. Even after we moved our tools to their own repository, we still did repository manipulations. We eventually moved all the git operations into Github API calls; any pull request that removed a subprocess.call(['git', ...]) and replaced it with a call to the API was welcome. As a result, our deploys sped up and we had fewer dependencies on where we could deploy from.

Keep a consistent interface: We wanted the deploy process to be seamless for all engineers at the company, so we made services conform to our requirements so that each follows roughly the same flow. Whether a service is in our main repository or not, or whether it’s Python or Java, is irrelevant to the tool; everything operates exactly the same. Our goal is to automate as much as possible, and at every step we ask: how can we make the deploy easier for the deployer?

Crash early, crash often: Our deploy daemon attempts to catch all exceptions inside eventlets and the main thread and exit as quickly as possible. Rather than try to write complicated code to fix things, we wrote code that crashes early. We rely on a tool called Supervisor to restart our daemons, which usually has the side effect of fixing everything. The last thing we wanted our code to do was get stuck in a loop and need to be killed. Thanks to health checks that close this loop, we can quickly verify if a node isn’t coming up correctly and fix the bug (though this is rarely necessary).

Biggest challenges

Provisioning: We use puppet as part of our provisioning process. Many of our classes were coupled so that almost every box had rsync'd a git checkout of our code base. We solved this by refactoring our puppet code, decoupling where we could, and having a two-provider system for delivering our software. We continue to have many checks in place that determine whether we'll use rsync or let deployd do its thing.

Configuration: Initially configuration was bundled with the tools, but we moved service configuration into a configuration file to make it easier to update. The goal was to have as few moving parts for a given system as possible, so taking puppet out of the equation made a lot of sense.

Python packaging: We learned pip didn’t work well with system packages, and found an edge case where cached copies of files were sometimes bad. Many of these problems went away with pip 1.4, and we’ve since changed our entire fleet to use it (rather than a mix of pip versions from 0.8 to 1.3.x). We were building a lot of system tools that had their own dependencies, but we now have the option of using virtualenv for our Python services so they don’t conflict with our tools.

New services: We were moving to SOA fast, and new services were being created while we were scrambling to fix bugs in our Zookeeper system and to transition older services. One of our earlier services forced us to document how to create a Pinterest service and integrate it into the deploy tool, which made it clear how fragile our frameworks were. Later on, some of our services were written in Java, and we were forced to write service configuration. While it wasn’t the code we wanted to write, it was the code we needed. The service-level configuration made it easier for other teams to use our tools, and the configuration made its way into Zookeeper, lowering the barrier to deployability even further. Now we have a tool that’s more robust and useful for other teams.

Alternate routes

We went with Python because it’s how the original scripts were written; however, if we were to start over, we might consider something like Go. For downloads, we might look to a tool like Herd, which uses BitTorrent to distribute code rather than S3. We might even decide that changing machines in place isn’t serving us well, and add baked AMIs to our deploy infrastructure.

We’ll continue to collect data, optimize, and repeat. While some of this process suffers from NIH, existing deploy tools didn’t seem to meet our needs.

If you like solving problems like these, join our team!

Dave Dash is a software engineer at Pinterest.


At WWDC this past June, Apple unveiled iOS7 and redefined the platform with a new visual design, including changes to UIKit and a set of new APIs. A team of four Pinterest engineers collaborated with the design team to rebuild the Pinterest iOS app with easier navigation, custom transitions, and more ways to discover related Pins through gestures, such as swiping.


With more than ¾ of all Pinterest usage occurring on mobile, it was important that the 3.0 update responded to Pinner requests for simpler, faster ways to engage with more Pins, and to provide an overall improved experience. Our main challenge was to reimagine the app while maintaining the aesthetics of our brand. We focused on three areas:

  • Migrating to UICollectionView and dropping iOS5 support
  • Adopting newer iOS7 APIs including UIViewController transitions, background fetch, and UIKit Dynamics
  • Building a new gestural interface

Embracing UICollectionView

At the core of the Pinterest app you’ll find UICollectionView. Before we dropped iOS5 support, we maintained our own UIScrollView subclass for building grids; it was modeled after UITableView and handled all of the cell reuse and layout. Now that we support iOS6+, we’ve migrated the app over to UICollectionView, which all of the main views use. We wrote a UICollectionViewLayout subclass to manage the layout of the grid, with support for multiple sections, header/footer views, and floating headers.

Adopting new iOS7 APIs

iOS7 provides powerful new view controller transition APIs. We used them for the transition from the grid into Pin closeup when a Pin is tapped:

[image: transition from the grid into Pin closeup]

We wanted a transition that helped the user understand where they were at all times and reinforced that they can swipe left and right to explore more Pins. To do this, we implemented UINavigationControllerDelegate’s navigationController:animationControllerForOperation:fromViewController:toViewController: method to provide a UIViewControllerAnimatedTransitioning object that performs the transition:

- (id<UIViewControllerAnimatedTransitioning>)navigationController:(UINavigationController *)navigationController
                                   animationControllerForOperation:(UINavigationControllerOperation)operation
                                                fromViewController:(UIViewController *)fromVC
                                                  toViewController:(UIViewController *)toVC {
    if (operation == UINavigationControllerOperationPush && [toVC isKindOfClass:[CBLPinViewController class]]) {
        // Pushing from the grid into Pin closeup.
        return [[CBLPinViewTransition alloc] init];
    } else if (operation == UINavigationControllerOperationPop && [fromVC isKindOfClass:[CBLPinViewController class]]) {
        // Popping from Pin closeup back to the grid.
        return [[CBLGridViewTransition alloc] init];
    }
    // Anything else uses the default UINavigationController transition.
    return nil;
}

On iOS6, we fall back to the default UINavigationController slide transition.

We also adopted the new background multitasking modes in iOS7 to surface newer content without requiring Pinners to manually refresh their home feeds. iOS7 offers two new modes: the app can periodically fetch new content in the background, or it can be woken by a silent push notification. We implemented the “fetch” background mode.

Apple states that when the OS permits your app to download new content, it will launch or resume the app in the background and give it a small amount of time to do work. When this happens, the system calls UIApplicationDelegate’s application:performFetchWithCompletionHandler:. We offload the fetching logic to our rootViewController, which then calls the completion handler with UIBackgroundFetchResultNewData, UIBackgroundFetchResultNoData, or UIBackgroundFetchResultFailed.

// Hand background fetches off to the root view controller, which reports the result.
- (void)application:(UIApplication *)application performFetchWithCompletionHandler:(void (^)(UIBackgroundFetchResult))completionHandler
{
    [self.rootViewController performBackgroundFetchWithCompletion:completionHandler];
}

If new Pins are available, the home feed view is updated to indicate there are new Pins. Since iOS7 also allows you to update the screenshot for your app in the multitasking view, this was a great opportunity to grab the user’s attention to open up the app:

[image: updated app snapshot in the iOS7 multitasking view]

New Gestural Design

With our 3.0 release, we wanted to completely rethink how gestures were used. Earlier this year, we introduced iOS Pinners to an animated “contextual menu” that appears when long-pressing a Pin, making it faster to Pin, like, and send inline from the grid. We plan to add this menu to boards and users in the future.

All of the animations and interactions for the menu are driven by a combination of CADisplayLink and Core Animation.

We wanted to teach Pinners another gestural interface that was engaging yet easy to understand. We looked at the data to better understand how Pinners were navigating the app, and we noticed a few things:

  • Pinners spent a lot of time in closeup view of a Pin where they could see a larger image and more metadata
  • Many performed core actions, such as Pinning, from closeup
  • They were often hitting the back button to get back to the grid to tap on the next Pin in the feed

We determined that adding the ability for Pinners to swipe left and right to discover more Pins would be simpler, faster and a more engaging experience.

[image: swiping left and right in Pin closeup to discover more Pins]

This view was built with a horizontal UICollectionView whose cells each contain a UIScrollView. Inside each scroll view is another UICollectionView that presents the related Pins below.

More mobile to come!

In re-architecting the app, the team has laid the foundation to continue innovating on mobile, as many of these changes will be seen soon on iPad. As mobile is the leading platform for Pinterest, the team is focused on providing the best experience possible for Pinners on the go, regardless of the device they’re using. There are a lot of exciting projects we have planned for next year—if you’re interested in joining us, check out our careers page!

Steven Ramkumar is a software engineer at Pinterest and works on Mobile.

How we use gevent to go fast

November 1, 2013

Not too long ago we took on the challenge of fixing a legacy Python system: converting a two-year-old, single-threaded codebase of hundreds of thousands of lines so it could serve many requests concurrently. To avoid rewriting everything from scratch, we went with gevent and made the program greenlet-safe.

With the update, Pinners can spend less time waiting and more time collecting and discovering the things they love on the site.

Here’s a look at how it all went down.

Lessons from the early days

In the first few years, we took the simplest scaling approach and built our web and API servers in single-threaded mode. We developed features quickly and kept up with fast-growing traffic, but as traffic and code size continued to grow, running many single-threaded processes started to show its limits:

  • As more and more features were added, the footprint of each server process kept growing.
  • As we added more backend servers to keep up with growth, the risk of backend issues or slowdowns grew with them. Slow requests could take a large number of processes out of the pool, significantly shrinking the degree of concurrency and causing 500 errors or skyrocketing site latency.
  • As more logic was added to the code, we wanted to parallelize work (e.g. network IO) to reduce latency, but we were stuck with single-threaded servers.

Building high performance servers

The solution called for parallelized servers capable of handling multiple requests at the same time, and gevent was the answer.

Gevent is a library based on non-blocking IO (libevent/libev) and lightweight greenlets (essentially Python coroutines). Non-blocking IO means requests waiting for network IO won’t block other requests; greenlets mean we can continue to write code in synchronous style natural to Python. Together they can efficiently support a large number of connections without incurring the usual overhead (e.g. call stacks) associated with threads.
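
As a rough illustration (not our production code, and shown with Python 3’s urllib for brevity), here’s what that looks like in practice: patch the standard library so blocking calls yield, then spawn a greenlet per request and keep writing synchronous-looking code.

from gevent import monkey
monkey.patch_all()  # socket, ssl, time.sleep, ... now cooperate instead of blocking

import gevent
from urllib.request import urlopen


def fetch(url):
    body = urlopen(url).read()  # synchronous in style, yields under the hood
    return url, len(body)


jobs = [gevent.spawn(fetch, "https://www.pinterest.com/") for _ in range(3)]
gevent.joinall(jobs, timeout=10)
print([job.value for job in jobs])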

It’s easier to make code greenlet-safe than thread-safe because of the cooperative scheduling of greenlets. Unlike threads, greenlets are non-preemptive; unless the running greenlet voluntarily yields, no other greenlets can run. Keep in mind that the critical sections must not yield; if they do, they must be synchronized.

Go time: running the code

Here’s the approach we took:

1. Make blocking operations yield

Greenlets can’t be preempted, so unless the running greenlet yields, no other greenlets can execute. Gevent comes with a monkey-patching utility that patches the common libraries (e.g. socket, threading, time.sleep) so they yield before they block. But not every library can be monkey-patched: if a library binds to an external C library that does blocking operations (e.g. MySQL-python, pylibmc, zc.zk), the patching doesn’t reach it. For these cases we had to replace them with pure-Python implementations (e.g. pymysql, python-memcached, kazoo).
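
As a small sketch of this step (the module names are examples, not a prescription): the patching has to run before anything else imports the modules being patched, and C-extension clients get swapped for pure-Python counterparts.

# Patch as early as possible, before other modules grab references to socket.
from gevent import monkey
monkey.patch_all()

# A pure-Python client goes through the patched socket module and cooperates...
import pymysql
# ...while a C-extension client such as MySQLdb (MySQL-python) would block the
# whole process, because its IO never touches the patched Python socket code.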

2. Make code greenlet-safe

Because greenlets are non-preemptive, there’s usually no need to synchronize critical sections as long as they don’t yield. If a critical section does yield (i.e. we need data consistency across a yielding operation), we have to make it greenlet-safe, either by synchronizing it or by changing the implementation so it doesn’t yield.
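
Here’s a contrived example of that failure mode and one way to fix it (the counter and the sleep standing in for a network call are purely illustrative): without the lock, every greenlet reads the same stale value across the yield.

import gevent
from gevent.lock import BoundedSemaphore

counter = 0
counter_lock = BoundedSemaphore()


def increment_across_a_yield():
    global counter
    with counter_lock:        # remove this lock and the result collapses to 1
        current = counter
        gevent.sleep(0.01)    # stand-in for a yielding network call
        counter = current + 1


gevent.joinall([gevent.spawn(increment_across_a_yield) for _ in range(100)])
print(counter)  # 100 with the lock; without it, every greenlet reads 0 first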

All yielding operations themselves need to be made greenlet-safe. The most common examples are classes that make network connections (e.g. thrift, memcache, redis, S3, HTTP clients). If two greenlets access the same socket at the same time, the conflict causes undefined behavior. The solutions are connection pooling, per-request connections, or synchronizing access to the shared connection.
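
One hedged sketch of the pooling approach (the class and names here are illustrative, not our actual client code): hand connections out from a gevent queue so no two greenlets ever share a socket.

import contextlib

import gevent
from gevent.queue import Queue


class ConnectionPool(object):
    """Hands out exclusive connections; greenlets block (yield) when the pool is empty."""

    def __init__(self, factory, size=10):
        self._pool = Queue()
        for _ in range(size):
            self._pool.put(factory())

    @contextlib.contextmanager
    def connection(self):
        conn = self._pool.get()   # yields until a connection is available
        try:
            yield conn
        finally:
            self._pool.put(conn)  # return the connection for the next greenlet


pool = ConnectionPool(factory=dict, size=2)  # dict() stands in for a real client


def worker(i):
    with pool.connection() as conn:
        gevent.sleep(0.01)        # pretend to do IO on conn


gevent.joinall([gevent.spawn(worker, i) for i in range(5)])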

3. Testing, testing, testing

A project like this won’t be successful without comprehensive tests. By leveraging the well-defined interface of our API server, we wrote concurrent tests that not only helped identify some subtle issues early, but also helped us fine-tune concurrency settings in production.
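
For example, a hypothetical concurrent test (the endpoint and concurrency numbers below are stand-ins, not our actual suite) might hammer one API endpoint from many greenlets at once and assert that every response comes back intact:

from gevent import monkey
monkey.patch_all()

from gevent.pool import Pool
from urllib.request import urlopen

ENDPOINT = "http://localhost:8080/v1/health"   # stand-in for a real API endpoint
CONCURRENCY = 50


def check(_):
    with urlopen(ENDPOINT, timeout=5) as response:
        assert response.status == 200


pool = Pool(CONCURRENCY)
pool.map(check, range(500))     # 500 requests, at most 50 in flight at a time
print("all requests succeeded")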

4. Deal with unfair scheduling

While the cooperative scheduling of greenlets makes it easy to deal with critical sections, it can introduce other problems, such as unfair scheduling. If a greenlet is doing pure CPU work and doesn’t yield, other greenlets have to wait. The solution is to explicitly yield (by calling gevent.sleep(0)) during heavy processing. Running more processes also helps, since each process handles fewer concurrent requests and scheduling across processes is fair because it’s done by the OS.
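
A minimal sketch of that explicit yield (the batch size is arbitrary): a CPU-bound loop that periodically hands control back to the hub so request-handling greenlets aren’t starved.

import gevent


def crunch(items, yield_every=1000):
    total = 0
    for i, item in enumerate(items):
        total += item * item       # pure CPU work that never blocks on IO
        if i % yield_every == 0:
            gevent.sleep(0)        # cooperative yield back to the gevent hub
    return total


job = gevent.spawn(crunch, range(1000000))
print(job.get())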

A faster Pinterest

As of early 2013, all of our Python servers (including web servers, API servers, and the thrift servers that power important services such as the follow service) are running on gevent. With gevent, we significantly reduced the number of processes on each machine while getting lower latency, higher throughput, and much better resilience to spiky traffic and network problems.

Here’s to continuing to make Pinterest more efficient!

To connect with the Pinterest Engineering team, like our Facebook Page!

Xun Liu is a software engineer at Pinterest
