The official Pinterest engineering blog.

A core part of Pinterest is giving people the ability to discover content related to their interests. Many times, a Pin can be discovered when people share it with one another, through features like Group Boards and messages.

To make the experience of finding a person as seamless as possible, we recently rebuilt our user typeahead. The goal was to quickly surface the most relevant contacts for the user given what we know about that person’s social connections and following activity on Pinterest.

The legacy typeahead was often slow and unresponsive, and limited the number of contacts a user could store. Additionally, all of the logic for the typeahead layer resided in our internal API. We set out to not only make the typeahead faster, but to also split it into a separate service that could be deployed and maintained independently of our API and web application. Building our typeahead in line with our service-oriented architecture would improve testability, ease the process of adding and deploying new features, and make our code more maintainable.

Developing the Contact Book signal

To surface the most relevant contacts, we leveraged the Pinterest following graph, and social graphs from Pinners’ social networks, such as Twitter, Facebook, and Gmail. Still, many users don’t link their social networks to Pinterest, so additional signals based on mobile contacts were used.

Asking for permission to access contacts on mobile can be tricky if intentions aren’t clear. Simply showing users a native dialog can result in a confusing user experience and a low acceptance rate. To build a seamless permissions experience, we integrated learnings from other companies, such as that the best way to ask for permissions is to explain the value of connecting upfront before asking for native permissions.

We also leveraged the Experience Framework for our mobile permissions flow, which allowed us swap out the text of the mobile permissions flow without changing client code. It also meant we could easily experiment with different flows to further optimize our success rate.

Building the new backend

We built a separate backend service (we call Contacts Service) to store the data for the new user typeahead. The new system consisted of three major components:

  • A modifiable, real-time online “index” for fast access of any user’s contacts from different sources.
  • A Thrift server (which we call Contacts Service) exposing interfaces to manage this index. This allows clients to update the index and look up top contacts using prefix-matching.
  • A set of PinLater tasks that keeps the index in sync with the contacts sources using the Thrift interface.

We chose HBase as the storage solution for the contacts index because it met all of our requirements:

  • Speed: The primary requirement of the typeahead is to look up names quickly. If names are sorted, the operation is simply a binary search using the prefix combo to locate the position then scan until we have enough results. This is a typical HBase scan operation. Our performance data shows that a scan of 20 rows in HBase takes less than two milliseconds — fast enough to meet our requirements.
  • Scalability: HBase is horizontally scalable, and adding more machines to a cluster is easy with minimal manual intervention, and gives linear throughput increase.
  • Fault tolerance: HBase supports auto failover. If a box dies, the HBase cluster moves the data to other live servers and the whole process takes a few minutes to finish with no change required on client side.
  • Writable: We chose to maintain an updatable index instead of two-layer system solution involving base index and fresh index to simplify the implementation and maintenance. HBase provided us with great write performance.

Supporting millions of contacts for one user

We support and aggregate contacts from various sources for each user in the contacts service. Therefore, it’s extensible if we want to add new sources, and flexible if we want to query contacts from certain sources. In most cases, the number of contacts from each source is under two hundred. However, some Pinners have millions of followers, so to tackle this challenge, we used two kinds of schemas to store contacts: wide schema and tall schema.

Wide schema: This is the default schema, which is expected to fit most sources. Contacts in one source for one user are stored together in one row. Each name token is stored as the prefix of column name. With our own implementation of ColumnPaginationFilter (which provides similar feature as Scan but inside one row), we are able to batch these GET requests to all sources in the wide schema in one RPC to do a prefix lookup.

Tall schema: This schema is specifically designed for sources (e.g. Pinterest follower) with potentially large number of contacts. The wide schema cannot support this use case because data in one row cannot be split across regions in HBase. In the tall schema, for each user we store contacts from one source in nearby rows, where the name token is part of the row key. Then for each source, one Scan request can achieve the prefix lookup among the contacts in this source.

Ranking and de-duping contacts

As we provide contacts lookup from different sources, it’s important that we display the most relevant contact at the top. It could also be annoying if we returned the same contact multiple times because it appears as a contact in multiple of your sources. We found an accurate de-duping logic to filter out contacts was highly beneficial to a user’s typeahead experience. To further improve on the above two areas, we relied on many of our existing services to provide real-time data access and enable Contacts Service to return de-duped contacts in the pre-defined ranking order.

  • Ranking: We have configurable ranking order based on different sources. For example, mutual followers is a strong signal of relevancy, so we always want to boost mutual follower contacts at top. This ranking order is configurable, which can be easily replaced with another one.
  • De-duping: Contacts Service talks to Friend Service for social network (including Facebook, Twitter, G+) id to Pinterest id lookup, and Data Service (with our own positive and negative caching) for email to Pinterest id lookup. With this information, we can easily tell whether these contacts are actually the same person based on their Pinterest id.

In order to store the data for efficient lookups, we tokenized each of the contacts names, and store each token with a reference to the original contact. We use the same algorithm to tokenize the contact’s full name, and the query string. This guarantees consistent results. We were able to use this tokenization to help score each match based on the proximity of the terms within the contact name and query string. For example, searching for “John Smith” should yield “John Smith” as a stronger match (higher in the results) compared to “John Michael Smith.”

How to enable real-time results?

We considered running an update job daily, to refresh the index with all of the new connections, but decided real-time results were far more useful. For example, when you connect your Facebook account to your Pinterest account, you should be able to send pins to your Facebook friends almost immediately.

In order to have real-time results, we hooked in updates to the new backend service for every time a user’s contact information changes. There are a few places this can happen:

  • Pinterest name change: When a Pinner changes his/her name, it’s reflected in all of their contacts. That way, others can search based on this new name, and the old name will no longer yield this contact as a result. It can get tricky for those with millions of connections for a particular source, though. We wrote a chained PinLater task that uses pagination to help fan out the updates without overloading the system.
  • Connecting / updating social network source: When a user connects their Pinterest account to Facebook, Twitter, etc., we want them to be able to send pins, and so we created a new PinLater task for this user, to lookup their new connections and update the backend Contacts Service accordingly. We also hooked up these updates in the existing social network refresh tasks.
  • Updating follower relationships: When a Pinterest user follows or unfollows another Pinterest user, those changes should be reflected immediately.

Getting all that data loaded

Since we wrote a new backend service for the new user typeahead, we had to populate it somehow. The initial set of data was not small by any means.

To upload the initial data, we wrote a task to upload all of a user’s connections for each supported source type. We ran this task for every Pinterest user, which took about three days to complete, even with a dedicated PinLater cluster.

Timing was critical. We needed to write all of the real-time updates before doing the initial upload so the corresponding updates would be done on top of the base set of data. So, first we added the real-time updates logic, and then we did the uploaded all of the data. Any actions, thereafter, that changed a user’s contacts updated the backend automatically.

Introducing a faster typeahead

The new user typeahead is markedly faster than the original implementation, with a server-side p99 of 25ms. When we A/B tested our new typeahead implementation against the old version, we found that message sends and Pinner interactions increased significantly.

The next improvement is to compute second-order connections to expand the breadth of the typeahead. Stay tuned!

Devin Finzer, Jiacheng Hong, and Kelsey Stemmler are software engineers at Pinterest.

Acknowledgements: Dannie Chu, Xun Liu, Varun Sharma and Eusden Shing

Read More

Every person sees each page of Pinterest differently, based on factors such as their interests and who they follow. Each pageload is computed anew by our back-end servers, with JavaScript and XHR powering client-side rendering for interactive content and infinite scrolling. This process provides Pinners with unique experiences, but at the scale of our operations, it can create friction.

Here I’ll discuss the atomic deploy system, a solution to some of the challenges that occur during deployment, and a path to successful and ongoing deployments.

Introducing the atomic deploy system

When a Pinner first visits the site, the backend server instructs the web browser to load a particular JavaScript bundle, and ensures the bundle matches the version of the backend software running on that server. So far, everything’s in lock-step, but when we deploy an update, we create a problem for ourselves.

We deploy software at Pinterest in waves, taking 10% of our servers offline in a batch, replacing the software, and putting them back into service. This allows us to have continuous service during our deployments, but it also means we’re running a mixed fleet of old- and new-version servers.

The atomic deploy system is our answer, which grew out of our desire to balance rapid innovation with a consistent, seamless user experience. We aim to develop our technology as quickly as possible, so this system was designed to avoid the intricate dance of backward-compatible updates. At the same time, we wanted to avoid the jarring experience of a page reloading by itself (potentially even losing the user’s context), or forcing the user to click on something to force the site to reload itself.

We hadn’t heard of other cases of doing deploys this way, which meant it could be a terrible idea, or a great one. We set out to find out for ourselves.

Managing “flip flops”

Say for example you visit the website on a Monday morning. You’re likely to view a page generated by a version of our front-end software, version “A”. If you hit reload, you’ll get pages generated by version “A”. Furthermore, we use XHR, so when a you interact with the web app, you’ll be served by dozens of requests in the background. All of these requests are powered by version “A.”

Later, you might wish to deploy “B”. Our standard model for deployments is to roll through our fleet 10-15% at a time slowly converting one web server from serving “A” to “B”.

Now a request has a chance of being served by “A” or “B” with no guarantees. With standard page-reloads this is not a problem, but much of Pinterest is XHR-based, meaning only part of a page will reload when a link is clicked. Our web framework can detect when it’s expecting a certain version and it gets something unexpected, in which case it’ll often force a reload.

For example if you go to and it’s served by A and you click a Pin, and the XHR is served by B, you’ll get a page reload. At which point you might click on another Pin which might be served by an mismatched version, which will cause another reload. In fact, you can’t escape a chance of reloads until the deploy is complete, which we call the “flip-flopping effect”. In this case, rather than browse Pinterest smoothly with nice clean interactions, you’ll get a number of full-page reloads.

Our architecture changes

When you visit the site, you talk to a load balancer which chooses a varnish front-end which in turn talks to our web front-ends which used to run nine python processes. Each of these processes are serving the exact same version on any given web front-end.

What we really wanted was a system that would only force a user to reload at most once during a deploy. The best way to ensure this was to ensure atomicity, meaning if we’re running version “A” and you’re deploying “B”, we flip a switch to version “B” and all users are on “B.”

We decided the best way to achieve this was to support serving two different versions of Pinterest and have Varnish intelligently decide which version to use. We created beefier web front-end boxes (c3.8xlarges from c1.xlarges), which could not only handle more load, but easily run 64 Python processes where half were serving the current version of Pinterest and the other serving the previous. The new and old versions were backed behind nginx with a unique port per each version of the site being served. For example, port 8000 might serve version “A” on one host, and port 8001 might serve version “B”.

Varnish will happily serve either the current version of the site or the previous version if you specify which you want (presumably you wouldn’t, but our JavaScript framework would). If you make a request without specifying a version you’ll get the current version of the site. Varnish will route to the right host/port which happens to serve the desired version.

Coordination and deploys

In order to inform Varnish what we should do, we developed a series of barriers, which tell Varnish what version to serve and when. Additionally we created “server sets” in ZooKeeper that let Varnish know which upstream nginx are serving.

Let’s imagine a steady state where “A” is our previous version, “B” is our current version. Users can reach either version “A” or “B”, and within a page load, they will always stay on either “A” or “B” and not switch unless they reload their browser. If they reload their browser they will get version “B”.

If we decide to roll out version C we do the following:

  • Through ZooKeeper we tell Varnish to no longer serve version “A”.
  • Varnish responds when it’s no longer serving version “A”.
  • We roll through our web fleet and uninstall “A” and install “C” in it’s place.
  • When all the web has “C” available we let varnish know that it’s ok to serve.
  • Varnish responds when all the varnish nodes can serve “C”.
  • We switch the default version from “B” to “C”.

By using these barriers, it’s not until the second step that people who were on “A” are now being forced onto “B”. At step 6 we allow new users to be on “C” by default, and users who were on “B” stay on “B” until the next deploy.

A look at the findings

The absolute values are redacted, but you can see the relative effect. Note the dips correspond with weekends, which is when we tend not to deploy our web app. In mid-April, we switched completely to the new atomic deploy system.

We found that the new atomic deployments reduced annoyances for Pinners and contributed to an overall improved user experience. This ultimately means that deploys are stealthier and can we can reasonably do more deploys throughout the day or as the business might require.

Nick Taylor is a software engineer at Pinterest.

Acknowledgements: Jeremy Stanley and Dave Dash, whose contributions helped make this technology a reality.

Read More

Last week we hosted a Tech Talk on Pinterest’s logging infrastructure and Apache Mesos, with engineers Roger Wang of Pinterest, Connor Doyle of Mesosphere, and Bernardo Gomez Palacio of Guavus. You can find a recap of the discussion below, and slides on SlideShare.

Logging Infrastructure at Pinterest (Roger Wang)

Roger joined Pinterest as a software engineer on the data engineering team last year, after working in engineering at Amazon, Google, and Microsoft. Over the years, he’s had the opportunity to work on persistence services, search features and infrastructure. For the talk, Roger shared insights into Pinterest’s new high performance logging agent, Singer, which plays a crucial role in our data architecture. Singer is currently in production to ship logs from application hosts to Kafka, and we plan to open-source it soon.

Deploying Docker Containers at Scale with Mesos & Marathon (Connor Doyle)

Connor joined Mesosphere as an engineer after most recently working at Gensler and One Orange Software. His current work focuses on Marathon - an Apache Mesos framework, which was the topic of his presentation. Connor started with a high level overview of Mesos before diving into new features of Marathon, including deploying Docker containers at scale.

Scaling Big Data with Hadoop & Mesos (Bernardo Gomez Palacio)

Bernardo joined Guavus from HashGo and IGN Entertainment. With a previous focus on backend APIs and infrastructure, he’s now working closer to big data. His presentation focused on using Mesos as a resource management layer for Spark and Hadoop clusters. He also shared case studies from personal experiences managing the Mesos clusters for these two frameworks, as well as comparisons between Mesos with Yarn.

Interested in keeping an eye out for future events? Follow us on Facebook and LinkedIn .

Krishna Gade is a software engineer at Pinterest.

Read More

As part of an ongoing series, engineers will share a bit of what life is like at Pinterest. Here, engineer Wendy Lu talks about discovering her love for Computer Science, building the Pinterest iOS app, and escaping Alcatraz.

How did you get involved with Computer Science?

I started college with an interest in sociology and statistics, specifically large scale relationship theory, social networks, and graph theory. I took a class as a freshman, Advanced Social Network Analysis, where we used a Python package called NetworkX. I decided that to truly learn this type of analysis, I should take some more classes in the CS department. I ended up enjoying the department so much that I decided to switch majors.

What are you working on right now?

I work on the mobile team, focusing on the iOS app. I’m currently experimenting with making notifications a more engaging and attractive experience. I’m also working on monitoring and logging performance in our apps.

How would you describe Pinterest’s engineering culture?

It’s a super collaborative culture! We work very closely with designers, community, communications, and all of the other engineering teams to solve problems together.

What’s your favorite Pinterest moment?

I really enjoyed our last Make-a-thon (our version of a hackathon). It’s inspiring to see people across different teams working together and how much teams can accomplish in such a short time frame.

My team worked on a yet-to-be-released feature that involved people from the design, discovery, mobile, BlackOps, and web engineering teams. We won two awards: “Pinployees Choice” and “Pinners Most Wanted”, for building a feature that’s highly requested by the community. To keep with tradition, we capped off a night of hacking with a 5am Denny’s trip.

How do you spend your time outside of work?

After work I can usually be found at the dance studio or swimming laps at the YMCA. Then I’ll listen to Sci-fi audiobooks until I fall asleep. Right now I’m liking the series written by Orson Scott Card (author of Ender’s Game). I’m also always exploring all that San Francisco has to offer. The numerous brunch spots are my favorite!

What do you do to get inspired?

I try to surround myself with people who I admire, which is easy to do that at Pinterest! I also find it extremely healthy to daydream- whether that’s through a scenic road trip, while listening or playing music, or through a good book that makes you think.

What’s your latest interest?

I’m doing a swim from Alcatraz to San Francisco in September with a couple of friends. It’s something I’ve wanted to do since moving to San Francisco, but it will only be my second open water race, so something tells me that I should start training in the Bay soon.

Fun fact?

I competed as a synchronized swimmer for 10 years, and was a member of the U.S. Junior National team and the Stanford Varsity team.

Interested in working with engineers like Wendy? Join us!

Read More

Last October, I posed the question: "Where are the numbers?". It was a call to action for the tech industry to share metrics on diversity in the workplace. Without measurement and transparency, it’s impossible to have honest conversations about making tech more inclusive. Since then, more than 150 startups have shared their women in engineering numbers, and some of the largest and most prominent tech companies have published their stats.

Today we’re taking our latest step by giving a more holistic look at our demographics across the company. We’re not close to where we want to be, but we’re working on it.

Our vision is to help people live inspired lives—people across the world, from all walks of life. We only stand to improve the quality and impact of our products if the people building them are representative of the user base and reflect the same diversity of demography, culture, life experiences and interests that makes our community so vibrant.

As we look ahead, we’ve put particular focus on inclusion efforts in hiring earlier in the engineering pipeline, recruiting a 29% female inaugural engineering intern class last year and 32% female this year. Beyond hiring, we’re mindful of processes and practices that may affect success and retention of employees coming from less represented backgrounds.

We’re also working with organizations that are effecting real change, including:

While we’ve made some progress in diversifying gender at the company, we haven’t done as well in representing different ethnicities, and we’re focused on getting better. We still have a lot of work ahead of us to make Pinterest a global company, as we build a global product. However, we’re excited to be a part of a broader movement in the tech industry to make it a more diverse and inclusive place.

*Gender and ethnicity data are global and include hires starting through September 2014. This is not based on EEO-1 reports; however, ethnicity refers to the EEO-1 categories which we know are imperfect categorizations of race and ethnicity, but reflect the U.S. government reporting requirements.
**Other includes Biracial, American Indian, Alaskan Native, Native Hawaiian and Pacific Islander.
***Tech includes Engineering, Product Management, and Design. Business includes all disciplines outside of Tech.

Tracy Chou is a software engineer and tech lead at Pinterest.

Read More

Big data plays a big role at Pinterest. With more than 30 billion Pins in the system, we’re building the most comprehensive collection of interests online. One of the challenges associated with building a personalized discovery engine is scaling our data infrastructure to traverse the interest graph to extract context and intent for each Pin.

We currently log 20 terabytes of new data each day, and have around 10 petabytes of data in S3. We use Hadoop to process this data, which enables us to put the most relevant and recent content in front of Pinners through features such as Related Pins, Guided Search, and image processing. It also powers thousands of daily metrics and allows us to put every user-facing change through rigorous experimentation and analysis.

In order to build big data applications quickly, we’ve evolved our single cluster Hadoop infrastructure into a ubiquitous self-serving platform.

Building a self-serve platform for Hadoop

Though Hadoop is a powerful processing and storage system, it’s not a plug and play technology. Because it doesn’t have cloud or elastic computing, or non-technical users in mind, its original design falls short as a self-serve platform. Fortunately there are many Hadoop libraries/applications and service providers that offer solutions to these limitations. Before choosing from these solutions, we mapped out our Hadoop setup requirements.

1. Isolated multitenancy: MapReduce has many applications with very different software requirements and configurations. Developers should be able to customize their jobs without impacting other users’ jobs.

2. Elasticity: Batch processing often requires burst capacity to support experimental development and backfills. In an ideal setup, you could ramp up to multi-thousand node clusters and scale back down without any interruptions or data loss.

3. Multi-cluster support: While it’s possible to scale a single Hadoop cluster horizontally, we’ve found that a) getting perfect isolation/elasticity can be difficult to achieve and b) business requirements such as privacy, security and cost allocation make it more practical to support multiple clusters.

4. Support for ephemeral clusters: Users should be able to spawn clusters and leave them up for as long as they need. Clusters should spawn in a reasonable amount of time and come with full blown support for all Hadoop jobs without manual configuration.

5. Easy software package deployment: We need to provide developers simple interfaces to several layers of customization from the OS and Hadoop layers to job specific scripts.

6. Shared data store: Regardless of the cluster, it should be possible to access data produced by other clusters

7. Access control layer: Just like any other service oriented system, you need to be able to add and modify access quickly (i.e. not SSH keys). Ideally, you could integrate with an existing identity (e.g. via OAUTH).

Tradeoffs and implementation

Once we had our requirements down, we chose from a wide range of home-brewed, open source and proprietary solutions to meet each requirement.

Decoupling compute and storage: Traditional MapReduce leverages data locality to make processing faster. In practice, we’ve found network I/O (we use S3) is not much slower than disk I/O. By paying the marginal overhead of network I/O and separating computation from storage, many of our requirements for a self-serve Hadoop platform became much easier to achieve. For example, multi-cluster support was easy because we no longer needed to worry about loading or synchronizing data, instead any existing or future clusters can make use of the data across a single shared file system. Not having to worry about data meant easier operations because we could perform a hard reset or abandon a problematic cluster for another cluster without losing any work. It also meant that we could use spot nodes and pay a significantly lower price for compute power without having to worry about losing any persistent data.

Centralized Hive metastore as the source of truth: We chose Hive for most of our Hadoop jobs primarily because the SQL interface is simple and familiar to people across the industry. Over time, we found Hive had the added benefit of using metastore as a data catalog for all Hadoop jobs. Much like other SQL tools, it provides functionality such as “show tables”, “describe table” and “show partitions.” This interface is much cleaner than listing files in a directory to determine what output exists, and is also much faster and consistent because it’s backed by a MySQL database. This is particularly important since we rely on S3, which is slow at listing files, doesn’t support moves and has eventual consistency issues.

We orchestrate all our jobs (whether Hive, Cascading, HadoopStreaming or otherwise) in such a way that they keep the HiveMetastore consistent with what data exists on disk. This makes is possible to update data on disk across multiple clusters and workflows without having to worry about any consumer getting partial data.

Multi-layered package/configuration staging: Hadoop applications vary drastically and each application may have a unique set of requirements and dependencies. We needed an approach that’s flexible enough to balance customizability and ease of setup/speed.

We took a three layered approach to managing dependencies and ultimately cut the time it takes to spawn and invoke a job on a thousand node cluster from 45 minutes to as little as five.

1. Baked AMIs:

For dependencies that are large and take a while to install, we preinstall them on the image. Examples of this are Hadoop Libraries and a NLP library package we needed for internationalization. We refer to this process as “baking an AMI.” Unfortunately, this approach isn’t available across many Hadoop service providers.

2. Automated Configuration (Masterless Puppet):

The majority of our customization is managed by Puppet. During the bootstrap stage, our cluster installs and configures Puppet on every node and, within a matter of minutes, Puppet keeps all our nodes with all of the dependencies we specify within our Puppet configurations.

Puppet had one major limitation for our use case: when we add new nodes to our production systems, they simultaneously contact the Puppet master to pull down new configurations and often overwhelm the master node, causing several failure scenarios. To get around this single point of failure, we made Puppet clients “masterless,” by allowing them to pull their configuration from S3 and set up a service that’s responsible for keeping S3 configurations in sync with the Puppet master.

3. Runtime Staging (on S3): Most of the customization that happens between MapReduce jobs involves jars, job configurations and custom code. Developers need to be able to modify these dependencies in their development environment and make them available on any one of our Hadoop clusters without affecting other jobs. To balance flexibility, speed and isolation, we created an isolated working directory for each developer on S3. Now, when a job is executed, a working directory is created for each developer and its dependencies are pulled down directly from S3.

Executor abstraction layer

Early on, we used Amazon’s Elastic MapReduce to run all of our Hadoop jobs. EMR played well with S3 and Spot Instances, and was generally reliable. As we scaled to a few hundred nodes, EMR became less stable and we started running into limitations of EMR’s proprietary versions of Hive. We had already built so many applications on top of EMR that it was hard for us to migrate to a new system. We also didn’t know what we wanted to switch to because some of the nuances of EMR had creeped into the actual job logic. In order to experiment with other flavors of Hadoop, we implemented an executor interface and moved all the EMR specific logic into the EMRExecutor. The interface implements a handful of methods such as “run_raw_hive_query(query_str)” and “run_java_job(class_path)”. This gave us the flexibility to experiment with a few flavors of Hadoop and Hadoop service providers, while enabling us to do a gradual migration with minimal downtime.

Deciding on Qubole

We ultimately migrated our Hadoop jobs to Qubole, a rising player in the Hadoop as a Service space. Given that EMR had become unstable at our scale, we had to quickly move to a provider that played well with AWS (specifically, spot instances) and S3. Qubole supported AWS/S3 and was relatively easy to get started on. After vetting Qubole and comparing its performance against alternatives (including managed clusters), we decided to go with Qubole for a few reasons:

1) Horizontally scalable to 1000s of nodes on a single cluster

2) Responsive 24/7 data infrastructure engineering support

3) Tight integration with Hive

4) Google OAUTH ACL and a Hive Web UI for non-technical users

5) API for simplified executor abstraction layer + multi-cluster support

6) Baked AMI customization (available with premium support)

7) Advanced support for spot instances - with support for 100% spot instance clusters

8) S3 eventual consistency protection

9) Graceful cluster scaling and autoscaling

Overall, Qubole has been a huge win for us, and we’ve been very impressed by the Qubole team’s expertise and implementation. Over the last year, Qubole has proven to be stable at Petabyte scale and has given us 30%-60% higher throughput than EMR. It’s also made it extremely easy to onboard non-technical users.

Where we are today

With our current setup, Hadoop is a flexible service that’s adopted across the organization with minimal operational overhead. We have over 100 regular Mapreduce users running over 2,000 jobs each day through Qubole’s web interface, ad-hoc jobs and scheduled workflows.

We have six standing Hadoop clusters comprised of over 3,000 nodes, and developers can choose to spawn their own Hadoop cluster within minutes. We generate over 20 billion log messages and process nearly a petabyte of data with Hadoop each day.

We’re also experimenting with managed Hadoop clusters, including Hadoop 2, but for now, using cloud services such as S3 and Qubole is the right choice for us because they free us up from the operational overhead of Hadoop and allow us to focus our engineering efforts on big data applications.

If you’re interested in working with us on big data, join our team!

Acknowledgements: Thanks to Dmitry Chechik, Pawel Garbacki, Jie Li, Chunyan Wang, Mao Ye and the rest of the Data Infrastructure team for their contributions.

Mohammad Shahangian is a data engineer at Pinterest.

Read More

A lot goes on in the backend when a person clicks the Pin It button. Thumbnails of all sizes are generated, the board thumbnail is updated, and a Pin is fanned out to those who follow the Pinner or the board. We also evaluate if a Pin should be added to a category feed, check for spam, index for search, and so on.

These jobs are critically important but don’t all need to happen before we can acknowledge success back to the user. This is where an asynchronous job execution system comes in, where we need to enqueue one or more jobs to execute these actions at a later time and rest assured they will eventually be executed. Another use case is when a large batch of jobs needs to be scheduled and executed with retries for resiliency toward temporary backend system unavailability, such as a workflow to generate and send emails to millions of Pinners each week. Here’s a look at how we developed an asynchronous job execution system in-house, which we call PinLater.

Evaluating options

We had originally implemented a solution based on Pyres for this purpose, however it had several limitations:

  • Job execution was best effort, i.e. there was no success acknowledgement (ACK) mechanism.
  • There was a lack of visibility into the status of individual job types, since jobs were all clubbed into a single set of nine priority queues.
  • The system wasn’t entirely configurable or manageable, e.g. no ability to throttle job execution or configure retries.
  • It was tied to Redis as the storage backend, and only worked for jobs written in Python, both of which were restrictions that would not continue to be acceptable for us.
  • It didn’t have built-in support for scheduled execution of jobs at a specific time in the future, a feature that some of our jobs needed.

We looked at a few other open source queue or publish/subscribe system implementations, but none provided the minimum feature set we needed, such as time-based scheduling with priorities and reliable ACKs, or could properly scale. Amazon Simple Queue Service (SQS) would likely meet many of our requirements, but for such a critical piece of infrastructure, we wanted to operate it ourselves and extend the feature set as needed, which is why we developed PinLater.

Designing for execution of asynchronous jobs

In building PinLater, we kept the following design points in mind:

  • PinLater is a Thrift service to manage scheduling and execution of asynchronous jobs. It provides three actions via its API: enqueue, dequeue and ACK that make up the core surface area.
  • PinLater is agnostic to the details of a job. From its point of view, the job body is just an opaque sequence of bytes. Each job is associated with a queue and a priority level, as well as a timestamp called run_after that defines the minimum time at which the job is eligible to run (by default, jobs are eligible to run immediately, but this can be overridden to be a time in the future).
  • When a job is enqueued, PinLater sends it to a backend store to keep track of it. When a dequeue request comes in, it satisfies the request by returning the highest priority jobs that are eligible to run at that time, based on run_after timestamps. Typically there are one or more worker pools associated with each PinLater cluster, which are responsible for executing jobs belonging to some subset of queues in that cluster. Workers continuously grab jobs, execute them and then reply to PinLater with a positive or negative ACK, depending on whether the execution succeeded or failed.
  • In our use of PinLater, each job type maps 1:1 to a specific queue. The interpretation of the job body is a contract between the enqueuing client(s) and the worker pool responsible for that queue. This 1:1 mapping isn’t mandated by PinLater, but we have found it to be operationally very useful in terms of managing jobs and having good visibility into their states.

Job state machine

A newly enqueued job starts in state PENDING. When it becomes eligible for execution (based on priority and its run_after timestamp), it can be dequeued by a worker, at which point its state changes to RUNNING.

If the worker completed the execution successfully, it will send a success ACK back, and the job will move to a terminal SUCCEEDED state. Succeeded jobs are retained in PinLater for diagnostics purposes for a short period of time (usually a day) and then garbage collected.

If the job execution failed, the worker will send a failure ACK back, at which point PinLater will check if the job has any retries available. If so, it will move the job back to PENDING. If not, the job goes into a terminal FAILED state. Failed jobs stay around in PinLater for diagnostics purposes (and potentially manual retries) for a few days. When a job is first enqueued, a numAttemptsAllowed parameter is set to control how many retries are allowed. PinLater allows the worker to optionally specify a delay when it sends a failure ACK. This delay can be used to implement arbitrary retry policies per job, e.g. constant delay retry, exponential backoff, or a combination thereof.

If a job was dequeued (claimed) by a worker and it didn’t send back an ACK within a few minutes, PinLater considers the job lost and treats it as a failure. At this point, it will automatically move the job to PENDING or FAILED state depending on whether retries are available.

The garbage collection of terminal jobs as well as the claim timeout handling is done by a scheduled executor within the PinLater thrift server. This executor also logs statistics for each run, as well as exports metrics for longer term analysis.

PinLater’s Python worker framework

In addition to the PinLater service, we provide a Python worker framework that implements the PinLater dequeue/ACK protocol and manages execution of python jobs. Adding a new job involves a few lines of configuration to tell the system which PinLater cluster the job should run in, which queue it should use, and any custom job configuration (e.g. retry policy, number of execution attempts). After this step, the engineer can focus on implementing the job logic itself.

While the Python framework has enabled smooth transition of jobs from the earlier system and continues to support the vast majority of new jobs, some of our clients have implemented PinLater workers in other languages like Java and C++. PinLater’s job agnostic design and simple Thrift protocol have made this relatively straight forward to do.

Implementation details

The PinLater Thrift server is written in Java and leverages Twitter’s Finagle RPC framework. We currently provide two storage backends: MySQL and Redis. MySQL is used for relatively low throughput use cases and those that schedule jobs over long periods and thus can benefit from storing jobs on disk rather than purely in memory. Redis is used for high throughput job queues that are normally drained in real time.

MySQL was chosen for the disk-backed backend since it provides the transactional querying capability needed to implement a scheduled job queue. As one might expect, lock contention is an issue and we use several strategies to mitigate it including a separate table for each priority level , use of UPDATE … LIMIT instead of SELECT FOR UPDATE for the dequeue selection query, and carefully tuned schemas and secondary indexes to fit this type of workload.

Redis was chosen for the in-memory backend due to the sophisticated support it has for data structures like sorted sets. Being single threaded, lock contention is not an issue with Redis, but we did have to implement optimizations to make this workload efficient, including the use of Lua scripting to reduce unnecessary round trips.

Horizontal scaling is provided by sharding the backend stores across a number of servers. Both backend implementations use a “free” sharding scheme (shards are chosen at random when enqueueing jobs). This makes adding new shards trivial and ensures well balanced load across shards. We implement a shard health monitor that keeps track of the health of each individual shard and pulls out of rotation shards that are misbehaving either due to machine failure, network issues or even deadlock (in the case of MySQL). This monitor has proven invaluable in automatically handling operational issues that could otherwise result in high error rates and paging an on-call operator.

Production experience

PinLater has been in use in production for months now, and our legacy Pyres based system was fully deprecated in Q1 2014. PinLater runs hundreds of job types at aggregate processing rates of over 100,000 per second. These jobs vary significantly on multiple parameters including running time, frequency, CPU vs. network intensive, job body size, programming language, enqueued online vs. offline, and needing near real time execution instead being scheduled hours in advance. It would be fair to say nearly every action taken on Pinterest or notification sent relies on PinLater at some level. The service has grown to be one of Pinterest’s most mission critical and widely used pieces of infrastructure.

Our operational model for PinLater is to deploy independent clusters for each engineering team or logical groupings of jobs. There are currently around 10 clusters, including one dedicated for testing and another for ad hoc one-off jobs. The cluster-per-team model allows better job isolation and, most importantly, allows each team to configure alerting thresholds and other operational parameters as appropriate for their use case. Nearly every operational issue that arises with PinLater tends to be job specific or due to availability incidents with one of our backend services. Thus having alerts handled directly by the teams owning the jobs usually leads to faster resolution.

Observability and manageability

One of the biggest pain points of our legacy job queuing system was that it was hard to manage and operate. As a result, when designing PinLater, we paid considerable attention to how we could improve on that aspect.

Like every service at Pinterest, PinLater exports a number of useful stats about the health of the service that we incorporate into operational dashboards and graphs. In addition, PinLater has a cluster status dashboard that provides a quick snapshot of how the cluster is doing.

PinLater also provides two features that have greatly helped improve manageability: per-queue rate limiting and configurable retry policies. Per-queue rate limiting allows an operator to limit the dequeue rate on any queue in the system, or even stop dequeues completely, which can help alleviate load quickly on a struggling backend system, or prevent a slow high priority job from starving other jobs. Support for configurable retry policies allows deployment of a policy that’s appropriate to each use case. Our default policy allows 10 retries, with the first five using linear delay, and the rest using exponential backoff. This policy allows the system to recover automatically from most types of sustained backend failures and outages. Job owners can configure arbitrary other policies as suitable to their use case as well.

We hope to open source PinLater this year. Stay tuned!

Want an opportunity to build and own large scale systems like this? We’re hiring!

Raghavendra Prabhu is a software engineer at Pinterest.

Acknowledgements: The core contributors to PinLater were Raghavendra Prabhu, Kevin Lo, Jiacheng Hong and Cole Rottweiler. A number of engineers across the company provided useful feedback, either directly about the design or indirectly through their usage, that was invaluable in improving the service.

Read More

As part of an ongoing series, engineers will share a bit of what life is like at Pinterest. Here, Engineering Manager Makinde Adeagbo talks about his early years as an engineer, recent projects, and how he spends his time outside of work.

How did you get involved with CS?

I first started programming on my graphing calculator in middle school—just​ simple games or programs to solve math equations. Later on in high school, I got hooked on building games in C++. It was a great feeling—a​ll you needed was a computer and determination…with that, the sky’s the limit.

How would you describe Pinterest’s engineering culture?

We GO! If you have an idea, go build and show it to people. The best way to end a discussion is to put the working app in someone’s hand and show that it’s possible.

What’s your favorite Pinterest moment?

Alongside a team, I launched Place Pins in November ​2013​. We had an event at the office to show off the result of lots of hard work by engineers, designers, and others from across the company. The launch went smoothly and we were able to get some sleep after many long nights.

How do you use Pinterest? What are your favorite things to Pin?

I Pin quite a few DIY projects. A recent one was a unique mix of a coding challenge and wood glue to make some nice looking coasters.

How do you spend your time outside of work?

I’m a runner, and have been since elementary school. Over the years I’ve progressed from sprinting to endurance running. It’s a great way to relax and reflect on the day. All I need is some open road and my running shoes.

What’s your latest interest?

I’ve recently started learning about free soloing, a form of free climbing where the climber forgoes ropes and harnesses. It’s spectacular to watch. There’s also deep water soloing, which involves climbing cliffs over bodies of water so falling off is fun, and you can just climb back on the cliffs.

Fun fact?

I’ve been known to jump over counter tops from a standstill.

Interested in working with engineers like Makinde? Join us!

Read More

We launched Place Pins a little over six months ago, and in that time we’ve been gathering feedback from Pinners and making product updates along the way, such as adding thumbnails of the place image on maps and the ability to filter searches by Place Boards. The newest feature is a faster, smarter search for Web and iOS that makes it easier to add a Place Pin to the map.

There are now more than one billion travel Pins on Pinterest, more than 300 unique countries and territories are represented in the system, and more than four million Place Boards have been created by Pinners.

Here’s the story of how the Place Pins team built the latest search update.

Supercharging place search

People have been mapping Pins for all types of travel plans, such as trips to Australia, places to watch the World Cup, cycling trips, a European motorcycle adventure, best running spots, and local guides and daycations.

Even with the growth in usage of Place Pins, we knew we needed to make the place search experience more intuitive. In the beginning, the place search interface was based on two distinct inputs: one for the place’s name (the “what”) and another for the search’s geospatial constraint (the “where”). We supported searching within a named city, within the bounds of the current map view, and globally around the world. While powerful, many Pinners found this interface to be non-intuitive. Our research showed Pinners were often providing both the “what” and the “where” in the first input box, just like they do when using our site-wide search interface. With that in mind, we set out to build a more natural place search interface based on just a single text input field.

The result is our one-box place search interface:

We start by attempting to identify any geographic names found within the query string. This step is powered by Twofishes, an open source geocoder written by our friends at Foursquare. Twofishes tokenizes the query string and uses a Geonames -based index to identify named geographic features. These interpretations are ranked based on properties such as geographic bounds, population, and overall data quality.

This process breaks down the original query string into two parts: one that defines the “what”, and one that defines the “where”. It also lets us discard any extraneous connector words like “in” and “near”. For example, given the query string “city hall in san francisco”, the top-ranked interpretation would return “city hall” as the “what” and “san francisco” as the “where” while completely dropping the connector word “in”.

Some geographic names are ambiguous, in which case Twofishes returns multiple possible interpretations. By default, we use the top-ranked result, but we also provide a user interface affordance that allows Pinners to easily switch between the alternatives.

Configuring place search

We use the result of the query splitting pass to configure our place search. Foursquare is our primary place data provider, and Foursquare venue search requests can be parameterized to search globally or within a set of geospatial constraints.

A single query can produce multiple venue search requests. Continuing with our example, we would issue one search for “city hall” within the bounds of “san francisco” and as well as a global search for the entire original query string “city hall san francisco”. This approach helps us find places that have geographic names in their place names, like “Boston Market” and “Pizza Chicago”.

We experimented with performing a third search for the full query string within the bounds of the geographic feature (“city hall san francisco” near “san francisco”), but in practice that didn’t yield significantly different results from those returned by the other two searches.

If we don’t identify a geographic feature (e.g. “the white house”), we only issue the global search request.

Blending and ranking results

We gather the results of those multiple search requests and blend them into a single ranked list. This is an important step because Pinners will judge the quality of our place search results based on what’s included in this list and whether their intended place appears near the top. Our current approach takes the top three “global” results, adds the top seven unique “local” results, and then promotes some items closer to the top (based on attributes like venue categorization).

More to come

In early tests, the new one-box Place search interface has been well-received by Pinners, and Place Pin creation is higher than ever. The updated place search is now available in the Pinterest iOS app and our web site, and look for it to make its appearance in our Android app soon.

One-box place search was built by engineers Jon Parise, Connor Montgomery (web) and Yash Nelapati (iOS), and Product Designer Rob Mason, with Product Manager Michael Yamartino.

If you’re interested in working on search and discovery projects like this, join us!

Jon Parise is an engineer at Pinterest.

Read More

The security of Pinners is one of our highest priorities, and to keep Pinterest safe, we have teams dedicated to solving issues and fixing bugs. We even host internal fix-a-thons where employees across the company search for bugs so we can patch them before they affect Pinners.

Even with these precautions, bugs get into code. Over the years, we’ve worked with external researchers and security experts who’ve alerted us to bugs. Starting today, we’re formalizing a bug bounty program with Bugcrowd and updating our responsible disclosure, which means we can tap into the more than 9,000 security researchers on the Bugcrowd platform. We hope these updates will allow us to learn more from the security community and respond faster to Whitehats.

This is just the first step. As we gather feedback from the community, we have plans to turn the bug bounty into a paid program, so we can reward experts for their efforts with cash. In the meantime, Whitehats can register, report and get kudos using Bugcrowd. We anticipate a much more efficient disclosure process as a result, and an even stronger and bug-free environment for Pinners!

Paul Moreno is a security engineer at Pinterest.

Read More