The official Pinterest engineering blog.

Last October, I posed the question: "Where are the numbers?". It was a call to action for the tech industry to share metrics on diversity in the workplace. Without measurement and transparency, it’s impossible to have honest conversations about making tech more inclusive. Since then, more than 150 startups have shared their women in engineering numbers, and some of the largest and most prominent tech companies have published their stats.

Today we’re taking our latest step by giving a more holistic look at our demographics across the company. We’re not close to where we want to be, but we’re working on it.

Our vision is to help people live inspired lives—people across the world, from all walks of life. We only stand to improve the quality and impact of our products if the people building them are representative of the user base and reflect the same diversity of demography, culture, life experiences and interests that makes our community so vibrant.

As we look ahead, we’ve put particular focus on inclusion efforts in hiring earlier in the engineering pipeline, recruiting a 29% female inaugural engineering intern class last year and 32% female this year. Beyond hiring, we’re mindful of processes and practices that may affect success and retention of employees coming from less represented backgrounds.

We’re also working with organizations that are effecting real change, including:

While we’ve made some progress in diversifying gender at the company, we haven’t done as well in representing different ethnicities, and we’re focused on getting better. We still have a lot of work ahead of us to make Pinterest a global company, as we build a global product. However, we’re excited to be a part of a broader movement in the tech industry to make it a more diverse and inclusive place.

*Gender and ethnicity data are global and include hires starting through September 2014. This is not based on EEO-1 reports; however, ethnicity refers to the EEO-1 categories which we know are imperfect categorizations of race and ethnicity, but reflect the U.S. government reporting requirements.
**Other includes Biracial, American Indian, Alaskan Native, Native Hawaiian and Pacific Islander.
***Tech includes Engineering, Product Management, and Design. Business includes all disciplines outside of Tech.

Tracy Chou is a software engineer and tech lead at Pinterest.


Big data plays a big role at Pinterest. With more than 30 billion Pins in the system, we’re building the most comprehensive collection of interests online. One of the challenges associated with building a personalized discovery engine is scaling our data infrastructure to traverse the interest graph to extract context and intent for each Pin.

We currently log 20 terabytes of new data each day, and have around 10 petabytes of data in S3. We use Hadoop to process this data, which enables us to put the most relevant and recent content in front of Pinners through features such as Related Pins, Guided Search, and image processing. It also powers thousands of daily metrics and allows us to put every user-facing change through rigorous experimentation and analysis.

To let teams build big data applications quickly, we've evolved our single-cluster Hadoop infrastructure into a ubiquitous self-serve platform.

Building a self-serve platform for Hadoop

Though Hadoop is a powerful processing and storage system, it's not a plug-and-play technology. Because it wasn't designed with cloud computing, elasticity, or non-technical users in mind, it falls short as a self-serve platform. Fortunately, there are many Hadoop libraries, applications and service providers that offer solutions to these limitations. Before choosing among them, we mapped out our Hadoop setup requirements.

1. Isolated multitenancy: MapReduce has many applications with very different software requirements and configurations. Developers should be able to customize their jobs without impacting other users’ jobs.

2. Elasticity: Batch processing often requires burst capacity to support experimental development and backfills. In an ideal setup, you could ramp up to multi-thousand node clusters and scale back down without any interruptions or data loss.

3. Multi-cluster support: While it’s possible to scale a single Hadoop cluster horizontally, we’ve found that a) getting perfect isolation/elasticity can be difficult to achieve and b) business requirements such as privacy, security and cost allocation make it more practical to support multiple clusters.

4. Support for ephemeral clusters: Users should be able to spawn clusters and leave them up for as long as they need. Clusters should spawn in a reasonable amount of time and come with full-blown support for all Hadoop jobs without manual configuration.

5. Easy software package deployment: We need to provide developers simple interfaces to several layers of customization from the OS and Hadoop layers to job specific scripts.

6. Shared data store: Regardless of the cluster, it should be possible to access data produced by other clusters.

7. Access control layer: Just like any other service-oriented system, you need to be able to add and modify access quickly (i.e., not via SSH keys). Ideally, you could integrate with an existing identity provider (e.g., via OAuth).

Tradeoffs and implementation

Once we had our requirements down, we chose from a wide range of home-brewed, open source and proprietary solutions to meet each requirement.

Decoupling compute and storage: Traditional MapReduce leverages data locality to make processing faster. In practice, we've found network I/O (we use S3) is not much slower than disk I/O. By paying the marginal overhead of network I/O and separating computation from storage, many of our requirements for a self-serve Hadoop platform became much easier to achieve. For example, multi-cluster support was easy because we no longer needed to worry about loading or synchronizing data; instead, any existing or future cluster can make use of the data through a single shared file system. Not having to worry about data also meant easier operations: we could hard-reset or abandon a problematic cluster for another without losing any work. It also meant we could use spot nodes and pay a significantly lower price for compute power without having to worry about losing any persistent data.

Centralized Hive metastore as the source of truth: We chose Hive for most of our Hadoop jobs primarily because the SQL interface is simple and familiar to people across the industry. Over time, we found Hive had the added benefit of using the metastore as a data catalog for all Hadoop jobs. Much like other SQL tools, it provides functionality such as "show tables", "describe table" and "show partitions." This interface is much cleaner than listing files in a directory to determine what output exists, and is also much faster and more consistent because it's backed by a MySQL database. This is particularly important since we rely on S3, which is slow at listing files, doesn't support moves and has eventual consistency issues.

We orchestrate all our jobs (whether Hive, Cascading, Hadoop Streaming or otherwise) in such a way that they keep the Hive metastore consistent with what data exists on disk. This makes it possible to update data on disk across multiple clusters and workflows without having to worry about any consumer getting partial data.

Multi-layered package/configuration staging: Hadoop applications vary drastically and each application may have a unique set of requirements and dependencies. We needed an approach that’s flexible enough to balance customizability and ease of setup/speed.

We took a three layered approach to managing dependencies and ultimately cut the time it takes to spawn and invoke a job on a thousand node cluster from 45 minutes to as little as five.

1. Baked AMIs:

For dependencies that are large and take a while to install, we preinstall them on the image. Examples include Hadoop libraries and an NLP library package we needed for internationalization. We refer to this process as "baking an AMI." Unfortunately, this approach isn't available across many Hadoop service providers.

2. Automated Configuration (Masterless Puppet):

The majority of our customization is managed by Puppet. During the bootstrap stage, our cluster installs and configures Puppet on every node, and within a matter of minutes Puppet brings each node up to date with all of the dependencies we specify in our Puppet configurations.

Puppet had one major limitation for our use case: when we add new nodes to our production systems, they simultaneously contact the Puppet master to pull down new configurations, often overwhelming the master node and causing several failure scenarios. To get around this single point of failure, we made Puppet clients "masterless" by having them pull their configuration from S3, and we set up a service that keeps the S3 configurations in sync with the Puppet master.

3. Runtime Staging (on S3): Most of the customization that happens between MapReduce jobs involves jars, job configurations and custom code. Developers need to be able to modify these dependencies in their development environment and make them available on any one of our Hadoop clusters without affecting other jobs. To balance flexibility, speed and isolation, we created an isolated working directory for each developer on S3. Now, when a job is executed, a working directory is created for each developer and its dependencies are pulled down directly from S3.

Executor abstraction layer

Early on, we used Amazon's Elastic MapReduce to run all of our Hadoop jobs. EMR played well with S3 and Spot Instances, and was generally reliable. As we scaled to a few hundred nodes, EMR became less stable and we started running into limitations of EMR's proprietary versions of Hive. We had already built so many applications on top of EMR that it was hard for us to migrate to a new system. We also didn't know what we wanted to switch to because some of the nuances of EMR had crept into the actual job logic. In order to experiment with other flavors of Hadoop, we implemented an executor interface and moved all the EMR-specific logic into the EMRExecutor. The interface exposes a handful of methods such as "run_raw_hive_query(query_str)" and "run_java_job(class_path)". This gave us the flexibility to experiment with a few flavors of Hadoop and Hadoop service providers, while enabling us to do a gradual migration with minimal downtime.
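
The executor abstraction can be sketched as a thin interface. The method names run_raw_hive_query and run_java_job come from the post; the class structure and return values below are illustrative only, not Pinterest's actual implementation.

```python
from abc import ABC, abstractmethod

class HadoopExecutor(ABC):
    """Executor interface: job logic talks to this, never to a
    specific Hadoop provider directly."""

    @abstractmethod
    def run_raw_hive_query(self, query_str: str) -> str: ...

    @abstractmethod
    def run_java_job(self, class_path: str) -> str: ...

class EMRExecutor(HadoopExecutor):
    def run_raw_hive_query(self, query_str):
        # All EMR-specific submission logic is confined here.
        return f"EMR ran hive: {query_str}"

    def run_java_job(self, class_path):
        return f"EMR ran java: {class_path}"

class QuboleExecutor(HadoopExecutor):
    def run_raw_hive_query(self, query_str):
        return f"Qubole ran hive: {query_str}"

    def run_java_job(self, class_path):
        return f"Qubole ran java: {class_path}"

def run_nightly_report(executor: HadoopExecutor) -> str:
    # Job logic depends only on the interface, so migrating between
    # providers means swapping the executor, not rewriting jobs.
    return executor.run_raw_hive_query("SELECT COUNT(*) FROM pins")
```

Because callers depend only on the interface, a gradual migration becomes a matter of switching which executor a given job receives.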

Deciding on Qubole

We ultimately migrated our Hadoop jobs to Qubole, a rising player in the Hadoop as a Service space. Given that EMR had become unstable at our scale, we had to quickly move to a provider that played well with AWS (specifically, spot instances) and S3. Qubole supported AWS/S3 and was relatively easy to get started on. After vetting Qubole and comparing its performance against alternatives (including managed clusters), we decided to go with Qubole for a few reasons:

1) Horizontally scalable to 1000s of nodes on a single cluster

2) Responsive 24/7 data infrastructure engineering support

3) Tight integration with Hive

4) Google OAUTH ACL and a Hive Web UI for non-technical users

5) API for simplified executor abstraction layer + multi-cluster support

6) Baked AMI customization (available with premium support)

7) Advanced support for spot instances, including 100% spot instance clusters

8) S3 eventual consistency protection

9) Graceful cluster scaling and autoscaling

Overall, Qubole has been a huge win for us, and we’ve been very impressed by the Qubole team’s expertise and implementation. Over the last year, Qubole has proven to be stable at Petabyte scale and has given us 30%-60% higher throughput than EMR. It’s also made it extremely easy to onboard non-technical users.

Where we are today

With our current setup, Hadoop is a flexible service that's adopted across the organization with minimal operational overhead. We have over 100 regular MapReduce users running over 2,000 jobs each day through Qubole's web interface, ad-hoc jobs and scheduled workflows.

We have six standing Hadoop clusters comprised of over 3,000 nodes, and developers can choose to spawn their own Hadoop cluster within minutes. We generate over 20 billion log messages and process nearly a petabyte of data with Hadoop each day.

We’re also experimenting with managed Hadoop clusters, including Hadoop 2, but for now, using cloud services such as S3 and Qubole is the right choice for us because they free us up from the operational overhead of Hadoop and allow us to focus our engineering efforts on big data applications.

If you’re interested in working with us on big data, join our team!

Acknowledgements: Thanks to Dmitry Chechik, Pawel Garbacki, Jie Li, Chunyan Wang, Mao Ye and the rest of the Data Infrastructure team for their contributions.

Mohammad Shahangian is a data engineer at Pinterest.


A lot goes on in the backend when a person clicks the Pin It button. Thumbnails of all sizes are generated, the board thumbnail is updated, and a Pin is fanned out to those who follow the Pinner or the board. We also evaluate if a Pin should be added to a category feed, check for spam, index for search, and so on.

These jobs are critically important, but they don't all need to happen before we can acknowledge success back to the user. This is where an asynchronous job execution system comes in: we enqueue one or more jobs to execute these actions at a later time and rest assured they will eventually be executed. Another use case is when a large batch of jobs needs to be scheduled and executed with retries, for resilience against temporary backend unavailability, such as a workflow that generates and sends emails to millions of Pinners each week. Here's a look at how we developed an asynchronous job execution system in-house, which we call PinLater.

Evaluating options

We had originally implemented a solution based on Pyres for this purpose; however, it had several limitations:

  • Job execution was best effort, i.e. there was no success acknowledgement (ACK) mechanism.
  • There was a lack of visibility into the status of individual job types, since jobs were all clubbed into a single set of nine priority queues.
  • The system wasn’t entirely configurable or manageable, e.g. no ability to throttle job execution or configure retries.
  • It was tied to Redis as the storage backend, and only worked for jobs written in Python, both of which were restrictions that would not continue to be acceptable for us.
  • It didn’t have built-in support for scheduled execution of jobs at a specific time in the future, a feature that some of our jobs needed.

We looked at a few other open source queue or publish/subscribe system implementations, but none provided the minimum feature set we needed, such as time-based scheduling with priorities and reliable ACKs, or could properly scale. Amazon Simple Queue Service (SQS) would likely meet many of our requirements, but for such a critical piece of infrastructure, we wanted to operate it ourselves and extend the feature set as needed, which is why we developed PinLater.

Designing for execution of asynchronous jobs

In building PinLater, we kept the following design points in mind:

  • PinLater is a Thrift service to manage scheduling and execution of asynchronous jobs. It provides three actions via its API: enqueue, dequeue and ACK, which make up the core surface area.
  • PinLater is agnostic to the details of a job. From its point of view, the job body is just an opaque sequence of bytes. Each job is associated with a queue and a priority level, as well as a timestamp called run_after that defines the minimum time at which the job is eligible to run (by default, jobs are eligible to run immediately, but this can be overridden to be a time in the future).
  • When a job is enqueued, PinLater sends it to a backend store to keep track of it. When a dequeue request comes in, it satisfies the request by returning the highest priority jobs that are eligible to run at that time, based on run_after timestamps. Typically there are one or more worker pools associated with each PinLater cluster, which are responsible for executing jobs belonging to some subset of queues in that cluster. Workers continuously grab jobs, execute them and then reply to PinLater with a positive or negative ACK, depending on whether the execution succeeded or failed.
  • In our use of PinLater, each job type maps 1:1 to a specific queue. The interpretation of the job body is a contract between the enqueuing client(s) and the worker pool responsible for that queue. This 1:1 mapping isn’t mandated by PinLater, but we have found it to be operationally very useful in terms of managing jobs and having good visibility into their states.
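
The enqueue/dequeue contract described above can be sketched in a few lines. This is a toy in-memory model, not the real Thrift service (which is backed by MySQL or Redis); jobs are opaque bytes ordered by priority and a run_after timestamp, as in the post.

```python
import time

class ToyPinLater:
    """In-memory sketch of PinLater's core API: enqueue and dequeue.
    Class and variable names are illustrative."""

    def __init__(self):
        self._queues = {}  # queue name -> list of (priority, run_after, body)

    def enqueue(self, queue, body, priority=1, run_after=None):
        # Jobs are opaque bytes; run_after defaults to "eligible now".
        job = (priority, run_after if run_after is not None else time.time(), body)
        self._queues.setdefault(queue, []).append(job)

    def dequeue(self, queue, now=None):
        # Return the highest-priority job whose run_after has passed.
        now = time.time() if now is None else now
        eligible = [j for j in self._queues.get(queue, []) if j[1] <= now]
        if not eligible:
            return None
        job = min(eligible)  # lowest priority number wins, then oldest run_after
        self._queues[queue].remove(job)
        return job[2]
```

The linear scan here is for clarity; the production backends use indexed queries (MySQL) and sorted sets (Redis) to find eligible jobs efficiently.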

Job state machine

A newly enqueued job starts in state PENDING. When it becomes eligible for execution (based on priority and its run_after timestamp), it can be dequeued by a worker, at which point its state changes to RUNNING.

If the worker completed the execution successfully, it will send a success ACK back, and the job will move to a terminal SUCCEEDED state. Succeeded jobs are retained in PinLater for diagnostics purposes for a short period of time (usually a day) and then garbage collected.

If the job execution failed, the worker will send a failure ACK back, at which point PinLater will check if the job has any retries available. If so, it will move the job back to PENDING. If not, the job goes into a terminal FAILED state. Failed jobs stay around in PinLater for diagnostics purposes (and potentially manual retries) for a few days. When a job is first enqueued, a numAttemptsAllowed parameter is set to control how many retries are allowed. PinLater allows the worker to optionally specify a delay when it sends a failure ACK. This delay can be used to implement arbitrary retry policies per job, e.g. constant delay retry, exponential backoff, or a combination thereof.

If a job was dequeued (claimed) by a worker and it didn’t send back an ACK within a few minutes, PinLater considers the job lost and treats it as a failure. At this point, it will automatically move the job to PENDING or FAILED state depending on whether retries are available.

The garbage collection of terminal jobs as well as the claim timeout handling is done by a scheduled executor within the PinLater thrift server. This executor also logs statistics for each run, as well as exports metrics for longer term analysis.
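
The state machine above can be written out directly. numAttemptsAllowed is the parameter named in the post; the class shape and method names below are illustrative.

```python
from enum import Enum

class JobState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"  # terminal
    FAILED = "FAILED"        # terminal

class Job:
    """Sketch of the PinLater job lifecycle described above."""

    def __init__(self, num_attempts_allowed=3):
        self.state = JobState.PENDING
        self.attempts_left = num_attempts_allowed

    def dequeue(self):
        assert self.state is JobState.PENDING
        self.state = JobState.RUNNING
        self.attempts_left -= 1

    def ack(self, success):
        assert self.state is JobState.RUNNING
        if success:
            self.state = JobState.SUCCEEDED
        elif self.attempts_left > 0:
            self.state = JobState.PENDING  # retry available
        else:
            self.state = JobState.FAILED   # out of retries

    def claim_timeout(self):
        # No ACK within a few minutes: the job is lost, treated as a failure.
        self.ack(success=False)
```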

PinLater’s Python worker framework

In addition to the PinLater service, we provide a Python worker framework that implements the PinLater dequeue/ACK protocol and manages execution of Python jobs. Adding a new job involves a few lines of configuration to tell the system which PinLater cluster the job should run in, which queue it should use, and any custom job configuration (e.g. retry policy, number of execution attempts). After this step, the engineer can focus on implementing the job logic itself.
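
A job under such a framework might look roughly like the following. The registration shape, names and values are hypothetical; only the configuration knobs (cluster, queue, retry settings) come from the post.

```python
# Hypothetical job registration -- the real framework's API is not
# shown in the post, only the knobs it exposes.
JOB_CONFIG = {
    "cluster": "pinlater-growth",      # which PinLater cluster to run in
    "queue": "send_weekly_email",      # 1:1 job-type-to-queue mapping
    "num_attempts_allowed": 5,
    "retry_policy": "exponential_backoff",
}

def process(job_body: bytes) -> str:
    """The only code a job owner writes; the framework handles the
    dequeue/ACK protocol against the PinLater service."""
    user_id = job_body.decode()
    return f"sending weekly email to user {user_id}"
```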

While the Python framework has enabled smooth transition of jobs from the earlier system and continues to support the vast majority of new jobs, some of our clients have implemented PinLater workers in other languages like Java and C++. PinLater's job-agnostic design and simple Thrift protocol have made this relatively straightforward to do.

Implementation details

The PinLater Thrift server is written in Java and leverages Twitter’s Finagle RPC framework. We currently provide two storage backends: MySQL and Redis. MySQL is used for relatively low throughput use cases and those that schedule jobs over long periods and thus can benefit from storing jobs on disk rather than purely in memory. Redis is used for high throughput job queues that are normally drained in real time.

MySQL was chosen for the disk-backed backend since it provides the transactional querying capability needed to implement a scheduled job queue. As one might expect, lock contention is an issue, and we use several strategies to mitigate it, including a separate table for each priority level, use of UPDATE … LIMIT instead of SELECT FOR UPDATE for the dequeue selection query, and carefully tuned schemas and secondary indexes to fit this type of workload.
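
The single-statement claim can be illustrated with SQLite standing in for MySQL. The schema and table name below are made up (one table per priority level, per the post); production MySQL can use UPDATE … LIMIT directly, while the subquery form shown here expresses the same idea and also runs on SQLite.

```python
import sqlite3
import uuid

# SQLite is only a stand-in for MySQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE jobs_p1 (  -- one table per priority level
    id INTEGER PRIMARY KEY,
    state TEXT NOT NULL DEFAULT 'PENDING',
    claim TEXT,
    body BLOB)""")
conn.executemany("INSERT INTO jobs_p1 (body) VALUES (?)", [(b"a",), (b"b",)])

def claim_one(conn):
    # Claim a job in a single UPDATE rather than SELECT FOR UPDATE,
    # reducing lock contention between competing dequeuers.
    token = uuid.uuid4().hex
    conn.execute("""UPDATE jobs_p1 SET state = 'RUNNING', claim = ?
                    WHERE id = (SELECT id FROM jobs_p1
                                WHERE state = 'PENDING'
                                ORDER BY id LIMIT 1)""", (token,))
    return conn.execute("SELECT id, body FROM jobs_p1 WHERE claim = ?",
                        (token,)).fetchone()
```

Each caller tags the row it claimed with a unique token, so reading back its own claim never races with other workers.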

Redis was chosen for the in-memory backend due to its sophisticated support for data structures like sorted sets. Because Redis is single-threaded, lock contention is not an issue, but we did have to implement optimizations to make this workload efficient, including the use of Lua scripting to reduce unnecessary round trips.

Horizontal scaling is provided by sharding the backend stores across a number of servers. Both backend implementations use a “free” sharding scheme (shards are chosen at random when enqueueing jobs). This makes adding new shards trivial and ensures well balanced load across shards. We implement a shard health monitor that keeps track of the health of each individual shard and pulls out of rotation shards that are misbehaving either due to machine failure, network issues or even deadlock (in the case of MySQL). This monitor has proven invaluable in automatically handling operational issues that could otherwise result in high error rates and paging an on-call operator.
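
The "free" sharding scheme plus the health filter amounts to something like the following sketch; the shard names and the shape of the health check are illustrative.

```python
import random

def pick_shard(shards, is_healthy):
    """'Free' sharding: choose uniformly at random among shards the
    health monitor currently keeps in rotation. Random placement at
    enqueue time keeps load balanced and makes adding shards trivial."""
    in_rotation = [s for s in shards if is_healthy(s)]
    if not in_rotation:
        raise RuntimeError("no healthy shards available")
    return random.choice(in_rotation)
```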

Production experience

PinLater has been in use in production for months now, and our legacy Pyres-based system was fully deprecated in Q1 2014. PinLater runs hundreds of job types at aggregate processing rates of over 100,000 jobs per second. These jobs vary significantly along multiple dimensions, including running time, frequency, CPU vs. network intensity, job body size, programming language, online vs. offline enqueueing, and near-real-time execution vs. being scheduled hours in advance. It would be fair to say nearly every action taken on Pinterest, and nearly every notification sent, relies on PinLater at some level. The service has grown to be one of Pinterest's most mission-critical and widely used pieces of infrastructure.

Our operational model for PinLater is to deploy independent clusters for each engineering team or logical groupings of jobs. There are currently around 10 clusters, including one dedicated for testing and another for ad hoc one-off jobs. The cluster-per-team model allows better job isolation and, most importantly, allows each team to configure alerting thresholds and other operational parameters as appropriate for their use case. Nearly every operational issue that arises with PinLater tends to be job specific or due to availability incidents with one of our backend services. Thus having alerts handled directly by the teams owning the jobs usually leads to faster resolution.

Observability and manageability

One of the biggest pain points of our legacy job queuing system was that it was hard to manage and operate. As a result, when designing PinLater, we paid considerable attention to how we could improve on that aspect.

Like every service at Pinterest, PinLater exports a number of useful stats about the health of the service that we incorporate into operational dashboards and graphs. In addition, PinLater has a cluster status dashboard that provides a quick snapshot of how the cluster is doing.

PinLater also provides two features that have greatly helped improve manageability: per-queue rate limiting and configurable retry policies. Per-queue rate limiting allows an operator to limit the dequeue rate on any queue in the system, or even stop dequeues completely, which can help alleviate load quickly on a struggling backend system, or prevent a slow high priority job from starving other jobs. Support for configurable retry policies allows deployment of a policy that’s appropriate to each use case. Our default policy allows 10 retries, with the first five using linear delay, and the rest using exponential backoff. This policy allows the system to recover automatically from most types of sustained backend failures and outages. Job owners can configure arbitrary other policies as suitable to their use case as well.
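
The default retry policy can be sketched as a delay schedule. The 10-retry, five-linear-then-exponential structure is from the post; the 30-second base delay is an assumed value, not one the post states.

```python
def retry_delay_seconds(attempt, base=30):
    """Default policy: 10 retries, the first five with linear delay,
    the rest with exponential backoff. `base` is an assumed value."""
    if not 1 <= attempt <= 10:
        raise ValueError("the default policy allows 10 retries")
    if attempt <= 5:
        return base * attempt                # linear: 30, 60, 90, ...
    return base * 5 * 2 ** (attempt - 5)     # exponential: 300, 600, ...
```

Because the worker specifies the delay on each failure ACK, any such schedule (or a custom one per job) can be implemented on top of the same mechanism.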

We hope to open source PinLater this year. Stay tuned!

Want an opportunity to build and own large scale systems like this? We’re hiring!

Raghavendra Prabhu is a software engineer at Pinterest.

Acknowledgements: The core contributors to PinLater were Raghavendra Prabhu, Kevin Lo, Jiacheng Hong and Cole Rottweiler. A number of engineers across the company provided useful feedback, either directly about the design or indirectly through their usage, that was invaluable in improving the service.


As part of an ongoing series, engineers will share a bit of what life is like at Pinterest. Here, Engineering Manager Makinde Adeagbo talks about his early years as an engineer, recent projects, and how he spends his time outside of work.

How did you get involved with CS?

I first started programming on my graphing calculator in middle school—just simple games or programs to solve math equations. Later on in high school, I got hooked on building games in C++. It was a great feeling—all you needed was a computer and determination…with that, the sky’s the limit.

How would you describe Pinterest’s engineering culture?

We GO! If you have an idea, go build and show it to people. The best way to end a discussion is to put the working app in someone’s hand and show that it’s possible.

What’s your favorite Pinterest moment?

Alongside a team, I launched Place Pins in November 2013. We had an event at the office to show off the result of lots of hard work by engineers, designers, and others from across the company. The launch went smoothly and we were able to get some sleep after many long nights.

How do you use Pinterest? What are your favorite things to Pin?

I Pin quite a few DIY projects. A recent one was a unique mix of a coding challenge and wood glue to make some nice looking coasters.

How do you spend your time outside of work?

I’m a runner, and have been since elementary school. Over the years I’ve progressed from sprinting to endurance running. It’s a great way to relax and reflect on the day. All I need is some open road and my running shoes.

What’s your latest interest?

I’ve recently started learning about free soloing, a form of free climbing where the climber forgoes ropes and harnesses. It’s spectacular to watch. There’s also deep water soloing, which involves climbing cliffs over bodies of water so falling off is fun, and you can just climb back on the cliffs.

Fun fact?

I’ve been known to jump over counter tops from a standstill.

Interested in working with engineers like Makinde? Join us!


We launched Place Pins a little over six months ago, and in that time we’ve been gathering feedback from Pinners and making product updates along the way, such as adding thumbnails of the place image on maps and the ability to filter searches by Place Boards. The newest feature is a faster, smarter search for Web and iOS that makes it easier to add a Place Pin to the map.

There are now more than one billion travel Pins on Pinterest, more than 300 unique countries and territories are represented in the system, and more than four million Place Boards have been created by Pinners.

Here’s the story of how the Place Pins team built the latest search update.

Supercharging place search

People have been mapping Pins for all types of travel plans, such as trips to Australia, places to watch the World Cup, cycling trips, a European motorcycle adventure, best running spots, and local guides and daycations.

Even with the growth in usage of Place Pins, we knew we needed to make the place search experience more intuitive. In the beginning, the place search interface was based on two distinct inputs: one for the place’s name (the “what”) and another for the search’s geospatial constraint (the “where”). We supported searching within a named city, within the bounds of the current map view, and globally around the world. While powerful, this two-input interface proved unintuitive for many Pinners. Our research showed Pinners were often providing both the “what” and the “where” in the first input box, just like they do when using our site-wide search interface. With that in mind, we set out to build a more natural place search interface based on a single text input field.

The result is our one-box place search interface:

We start by attempting to identify any geographic names found within the query string. This step is powered by Twofishes, an open source geocoder written by our friends at Foursquare. Twofishes tokenizes the query string and uses a GeoNames-based index to identify named geographic features. These interpretations are ranked based on properties such as geographic bounds, population, and overall data quality.

This process breaks down the original query string into two parts: one that defines the “what”, and one that defines the “where”. It also lets us discard any extraneous connector words like “in” and “near”. For example, given the query string “city hall in san francisco”, the top-ranked interpretation would return “city hall” as the “what” and “san francisco” as the “where” while completely dropping the connector word “in”.

Some geographic names are ambiguous, in which case Twofishes returns multiple possible interpretations. By default, we use the top-ranked result, but we also provide a user interface affordance that allows Pinners to easily switch between the alternatives.
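
A toy version of the query-splitting pass might look like this. The real implementation ranks Twofishes interpretations against a GeoNames index; here a small hard-coded set of place names and connector words stands in.

```python
KNOWN_PLACES = {"san francisco", "boston", "chicago"}  # stand-in for GeoNames
CONNECTORS = {"in", "near"}

def split_query(query):
    """Split a one-box query into (what, where): find a known
    geographic name at the end of the query and drop a trailing
    connector word like 'in' or 'near'."""
    tokens = query.lower().split()
    # Try the longest suffix first, mirroring ranked interpretations.
    for i in range(len(tokens)):
        where = " ".join(tokens[i:])
        if where in KNOWN_PLACES:
            what = tokens[:i]
            if what and what[-1] in CONNECTORS:
                what = what[:-1]          # discard the connector word
            return " ".join(what), where
    return query.lower(), None            # no geographic feature found
```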

Configuring place search

We use the result of the query splitting pass to configure our place search. Foursquare is our primary place data provider, and Foursquare venue search requests can be parameterized to search globally or within a set of geospatial constraints.

A single query can produce multiple venue search requests. Continuing with our example, we would issue one search for “city hall” within the bounds of “san francisco” as well as a global search for the entire original query string “city hall san francisco”. This approach helps us find places that have geographic names in their place names, like “Boston Market” and “Pizza Chicago”.

We experimented with performing a third search for the full query string within the bounds of the geographic feature (“city hall san francisco” near “san francisco”), but in practice that didn’t yield significantly different results from those returned by the other two searches.

If we don’t identify a geographic feature (e.g. “the white house”), we only issue the global search request.
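
Putting the above together, the fan-out of search requests is simple to sketch; the request tuples below are illustrative, not the Foursquare venue search API.

```python
def build_searches(what, where):
    """Fan-out described above: a search bounded by the geographic
    feature plus a global search over the full original string, or
    only the global search when no feature was identified."""
    if where is None:
        return [("global", what)]
    return [("local", what, where), ("global", f"{what} {where}")]
```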

Blending and ranking results

We gather the results of those multiple search requests and blend them into a single ranked list. This is an important step because Pinners will judge the quality of our place search results based on what’s included in this list and whether their intended place appears near the top. Our current approach takes the top three “global” results, adds the top seven unique “local” results, and then promotes some items closer to the top (based on attributes like venue categorization).
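
The blending pass described above can be sketched as follows; the promotion predicate is a placeholder for the real attribute-based ranking (e.g. venue categorization).

```python
def blend(global_results, local_results, is_promoted=lambda r: False):
    """Blend search results: top three 'global' results, then the top
    seven 'local' results not already present, then move promoted
    items toward the front of the list."""
    blended = list(global_results[:3])
    for r in local_results:
        if len(blended) >= 10:
            break
        if r not in blended:
            blended.append(r)          # keep only unique local results
    promoted = [r for r in blended if is_promoted(r)]
    return promoted + [r for r in blended if not is_promoted(r)]
```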

More to come

In early tests, the new one-box place search interface has been well received by Pinners, and Place Pin creation is higher than ever. The updated place search is now available in the Pinterest iOS app and on our website, and it will make its appearance in our Android app soon.

One-box place search was built by engineers Jon Parise, Connor Montgomery (web) and Yash Nelapati (iOS), and Product Designer Rob Mason, with Product Manager Michael Yamartino.

If you’re interested in working on search and discovery projects like this, join us!

Jon Parise is an engineer at Pinterest.


The security of Pinners is one of our highest priorities, and to keep Pinterest safe, we have teams dedicated to solving issues and fixing bugs. We even host internal fix-a-thons where employees across the company search for bugs so we can patch them before they affect Pinners.

Even with these precautions, bugs get into code. Over the years, we’ve worked with external researchers and security experts who’ve alerted us to bugs. Starting today, we’re formalizing a bug bounty program with Bugcrowd and updating our responsible disclosure policy, which means we can tap into the more than 9,000 security researchers on the Bugcrowd platform. We hope these updates will allow us to learn more from the security community and respond faster to Whitehats.

This is just the first step. As we gather feedback from the community, we have plans to turn the bug bounty into a paid program, so we can reward experts for their efforts with cash. In the meantime, Whitehats can register, report and get kudos using Bugcrowd. We anticipate a much more efficient disclosure process as a result, and an even stronger and bug-free environment for Pinners!

Paul Moreno is a security engineer at Pinterest.

Read More

Marc Andreessen famously said that for startups, “the only thing that matters is getting to product/market fit.” Product/market fit means providing enough value to enough people that the startup can flourish. We believe the key to sustainable growth is putting Pinners first, and finding ways to increase the value people get from Pinterest. That could mean improving the experience for existing Pinners, more effectively communicating the benefit of Pinterest to new users, or improving content for less engaged people. With tens of millions of Pinners, though, it can be a challenge to understand if we’re reaching our goals.

We measure success with four techniques: user state transitions, Xd28s, cohort heat maps, and conversion funnels. This post covers how to understand these different types of metrics and how we use them to identify problem areas and inform our strategy and decision-making on the Growth Team.

Understanding gains and losses with user state transitions

The metric: We use a simple model with three states to understand the growth of our service: Monthly Active Users (MAUs), dormant Pinners, and new Pinners who just joined. The chart monitors the number of people who move from one state to another on a daily basis. The sum of the four different transitions yields our Net MAU line, which shows the total number of additional MAUs we added that day.

Possible user state transitions are:

  • New signup: A new person joins Pinterest
  • New -> Dormant: A new Pinner doesn’t use Pinterest in the 28 days following signup
  • MAU -> Dormant: A Pinner was an MAU, but didn’t use Pinterest for 28 days
  • Dormant -> MAU: A Pinner used Pinterest after having been inactive for 28+ days
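The transitions above amount to a small per-user state machine. A minimal sketch, assuming a daily activity signal per Pinner (the real pipeline aggregates event logs at much larger scale):

```python
DORMANCY_DAYS = 28  # threshold from the model above

def classify(prev_state, active_today, inactive_streak):
    """One daily step of the state machine.

    prev_state: "new", "mau", or "dormant"
    active_today: did the Pinner use Pinterest today?
    inactive_streak: consecutive inactive days before today
    """
    if active_today:
        # Any activity makes (or keeps) the Pinner an MAU; a dormant
        # Pinner becoming active is a Dormant -> MAU transition.
        return "mau"
    if inactive_streak + 1 >= DORMANCY_DAYS:
        # 28 days without use: New -> Dormant or MAU -> Dormant.
        return "dormant"
    return prev_state

# Daily Net MAU = signups + resurrections - newly dormant Pinners.
```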

How we use it: This is one of the most important graphs for the Growth team because it tells us where to focus. By looking at where we’re losing Pinners, and where we’re gaining them, we can decide where to concentrate our efforts to deliver maximum impact. For instance, if we see an increase in the number of new Pinners transitioning to dormant, we know to focus our efforts on better communicating Pinterest’s value in the new user experience during the person’s first week.

Monitoring engagement through Xd28s

The metric: Xd28s are the number of Pinners who have used Pinterest X days in the past 28 days. For instance, 4d28s+ are the number of Pinners who used Pinterest 4 or more days during the past 28.

How we use it: There are many ways people can use Pinterest, so there’s no one specific thing Pinners do to gain value. We use Xd28s as a proxy for the amount of value a person is getting from the service. We segment into three major categories: 14d28s+ are core Pinners who are deriving a lot of value; 4d28s+ are casual Pinners getting some value; and anyone below 4d28 is a marginal Pinner who’s likely at risk of churning because they’re not receiving much value. By monitoring the ratio between the different groups, we can determine how much value people are getting and see how it changes over time. If one of the less desirable segments (such as marginal users or casual users) begins to grow, we can focus on understanding why that’s happening and determine what we can do to fix it.
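The segmentation might be computed like this; the thresholds come from the post, while the function names and input format are illustrative:

```python
from collections import Counter

def xd28_segment(active_days_last_28):
    """Map a Pinner's trailing-28-day activity count to a segment."""
    if active_days_last_28 >= 14:
        return "core"      # 14d28+: deriving a lot of value
    if active_days_last_28 >= 4:
        return "casual"    # 4d28+: getting some value
    return "marginal"      # below 4d28: at risk of churning

def segment_ratios(activity_counts):
    """Share of Pinners in each segment, to be tracked over time."""
    counts = Counter(xd28_segment(d) for d in activity_counts)
    total = len(activity_counts)
    return {seg: counts[seg] / total for seg in ("core", "casual", "marginal")}
```

Monitoring `segment_ratios` over successive 28-day windows surfaces shifts between the groups.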

Tracking new user retention with cohort heat maps

The metric: The cohort heat map shows the activity level for new Pinners, where red represents high activity and blue represents low activity. The columns along the x-axis represent the day the person joined, and the rows along the y-axis represent the number of days since they joined. The coloring of a specific square in the graph represents what percentage of Pinners who joined on day X were subsequently active on day Y.

How we use it: The foundation for sustainable growth is retaining users. We use graphs like this to see how our new user retention curve changes over time. When the red and yellow extend further up a column, retention is improving. If the blue and green areas begin to decrease, a retention or new user activation problem has been introduced. In the mock example above, something happened around 2013-04-01 that hurt retention. This graph becomes especially powerful when segmented by gender or locale, which allows for easy identification of segments of the user base where retention can be improved. We can then monitor over time to see if retention is indeed improving.
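The matrix behind such a heat map can be computed from day-indexed sets of signups and active users. A simplified sketch of that aggregation (the real pipeline works from event logs, and the schema here is our own):

```python
def cohort_retention(signups, activity):
    """retention[cohort_day][age] = fraction of Pinners who joined on
    `cohort_day` and were active `age` days later.

    signups: {day: set of user ids who joined that day}
    activity: {day: set of user ids active that day}
    """
    retention = {}
    for cohort_day, users in signups.items():
        if not users:
            continue
        row = {}
        for day, active in activity.items():
            age = day - cohort_day
            if age >= 0:
                row[age] = len(users & active) / len(users)
        retention[cohort_day] = row
    return retention
```

Plotting each cohort's row as a column of colored cells yields the heat map; segmenting the input by gender or locale yields the per-segment views described above.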

Understanding Pinner interactions using conversion funnels

The metric: For multi-step flows, conversion funnels measure how many Pinners get to each step of the flow.

How we use it: We use conversion funnels for monitoring landing pages and sharing, invitation, and signup flows. By understanding how people are interacting with the feature and seeing where users are dropping off, we know where to focus our efforts on improving the flow. Sometimes the fix is functional: If someone tries to send a Pin to a friend, but can’t find the friend they are looking for, we can improve the friend recommendations or our typeahead logic. However, Pinners can also drop off in the flow because they don’t understand the value and don’t have enough motivation. At this point, we collaborate with the design team on creative ways to communicate that value. A great example is our current sign up walls on iOS and web, where we show use cases to communicate how people use Pinterest.
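A funnel like this reduces to per-step conversion rates, which make the drop-off point obvious. A minimal sketch with made-up step names and counts:

```python
def funnel_report(step_counts):
    """Per-step conversion for an ordered funnel.

    step_counts: list of (step_name, users_reaching_step) pairs.
    Returns (name, count, conversion_from_previous_step) tuples.
    """
    report = []
    prev = None
    for name, count in step_counts:
        rate = count / prev if prev else 1.0
        report.append((name, count, rate))
        prev = count
    return report

steps = [("landing", 1000), ("signup_form", 400), ("confirmed", 300)]
report = funnel_report(steps)
# Here 75% of signup_form users confirm, but only 40% of landing
# visitors reach the form, so the first step is where to focus.
```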

Putting Pinners first

As you can see, fixing retention issues can be as simple as reminding users what they may be missing out on, or as complicated as rethinking the user experience for a segment of the user base. For us, it always starts and ends with ensuring a great experience for new and existing Pinners. If challenges like this interest you, the Pinterest Growth team is hiring!

John Egan is an engineer on the Growth team.

Read More

The core value of Pinterest is to help people find the things they care about, by connecting them to Pins and people that relate to their interests. We’re building a service that’s powered by people, and supercharged with technology.

The interest graph - the connections that make up the Pinterest index - creates bridges between Pins, boards, and Pinners. It’s our job to build a system that helps people to collect the things they love, and connect them to communities of engaged people who share similar interests and can help them discover more. From categories like travel, fitness, and humor, to more niche areas like vintage motorcycles, craft beer, or Japanese architecture, we’re building a visual discovery tool for all interests.

The interests platform is built to support this vision. Specifically, it’s responsible for producing high quality data on interests, interest relationships, and their association with Pins, boards, and Pinners.

Figure 1: Feedback loop between machine intelligence and human curation

In contrast with conventional methods of generating such data, which rely primarily on machine learning and data mining techniques, our system relies heavily on human curation. The ultimate goal is to build a system that’s both machine and human powered, creating a feedback mechanism by which human curated data helps drive improvements in our machine algorithms, and vice versa.

Figure 2: System components

Raw input to the system includes existing data about Pins, boards, Pinners, and search queries, as well as explicit human curation signals about interests. With this data, we’re able to construct a continuously evolving interest dictionary, which provides the foundation to support other key components, such as interest feeds, interest recommendations, and related interests.

Generating the interest dictionary

From a technology standpoint, interests are text strings that represent entities for which a group of Pinners might have a shared passion.

We generated an initial collection of interests by extracting frequently occurring n-grams from Pin and board descriptions, as well as board titles, and filtering these n-grams using custom-built grammars. While this approach provided a high-coverage set of interests, we found many terms to be malformed phrases. For instance, we would extract phrases such as “lamborghini yellow” instead of “yellow lamborghini”. This proved problematic because we wanted interest terms to represent how Pinners would describe them, so we employed a variety of methods to eliminate malformed interest terms.
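The extraction step might look like this toy version, which counts recurring n-grams across descriptions; the custom grammars and production thresholds are omitted:

```python
from collections import Counter

def frequent_ngrams(descriptions, n=2, min_count=2):
    """Candidate interest terms: n-grams that recur across Pin and
    board descriptions. A sketch only; real tokenization, grammar
    filtering, and count thresholds are more involved."""
    counts = Counter()
    for text in descriptions:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return [ngram for ngram, c in counts.items() if c >= min_count]
```

A term like "yellow lamborghini" surfaces once it appears in enough descriptions, after which the filtering stages described below prune the malformed candidates.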

We first compared candidate terms against search queries repeated by groups of Pinners over a few months. Intuitively, this criterion matches well with the notion that an interest should be an entity a group of Pinners is passionate about.

Later we filtered the candidate set through public domain ontologies like Wikipedia titles. These ontologies were primarily used to validate proper nouns as opposed to common phrases, as all available ontologies represented only a subset of possible interests. This is especially true for Pinterest, where Pinners themselves curate special interests like “mid century modern style.”

Finally, we also maintain an internal blacklist to filter abusive words and x-rated terms, as well as Pinterest-specific stop words like “love”. This filtering is especially important for interest terms that might be recommended to millions of users.

We arrived at a fair-quality collection of interests following the above algorithmic approaches. To understand the quality of our efforts, we gave a 50,000-term subset of our collection to a third-party vendor, which used crowdsourcing to rate our data. To be rigorous, we composed a set of four criteria by which raters would evaluate candidate interest terms:

- Is it English?

- Is it a grammatically valid phrase?

- Is it a standalone concept?

- Is it a proper name?

The crowdsourced ratings were interesting, if somewhat expected. There was a low rate of agreement among raters, with especially high discrepancy in determining whether a term represented a “standalone concept.” Despite the ambiguity, we were able to confirm that 80% of the collection generated using the above algorithms satisfied our interest criteria.

This type of effort, however, is not easy to scale. The real solution is to allow Pinners to provide both implicit and explicit signals that help us determine the validity of an interest. Implicit signals include behaviors like clicking and viewing, while explicit signals involve asking Pinners to specifically provide information (through actions like a thumbs up/thumbs down, starring, or skipping recommendations).

To capture all the signals used in defining the collection of terms, we built a dictionary that stores all the data associated with each interest, including invalid interests and the reason each is invalid. This service plays a key role in human curation by aggregating signals from different people. On top of this dictionary service, we can build review systems at different levels.

Identifying Pinner interests

With the Interests dictionary, we can associate Pins, boards, and Pinners with representative interests. One of the initial ways we experimented with this was launching a preview of a page where Pinners can explore their interests.

Figure 3: Exploring interests

In order to match interests to Pinners, we need to aggregate all the information related to a person’s interests. At its core, our system recommends interests based on the Pins a Pinner interacts with. Every Pin on Pinterest has been collected and given context by someone who thinks it’s important, and in doing so, is helping other people discover great content. Each individual Pin is an incredibly rich source of data. As discussed in a previous blog post on the discovery data model, one Pin often has multiple copies — different people may Pin it from different sources, and the same Pin can be repinned multiple times. During this process, each Pin accumulates numerous unique textual descriptions, which allows us to connect Pins with interest terms with high precision.

However, this conceptually simple process requires non-trivial engineering effort to scale to the number of Pins and Pinners the service has today. The data processing pipeline (managed by Pinball) comprises over 35 Hadoop jobs, and runs periodically to update the user-interest mapping with users’ latest interest information.

The initial feedback on the explore interests page has been positive, proving the capabilities of our system. We’ll continue testing different ways of exposing a person’s interests and related content, based on implicit signals, as well as explicit signals (such as the ability to create custom categories of interests).

Calculating related interests

Related interests are an important way of enabling people to browse interests and discover new ones. To compute related interests, we simply combine the co-occurrence relationships between interests computed at the Pin and board levels.

Figure 4: Computing related interests

The quality of the related interests is surprisingly high given the simplicity of the algorithm. We attribute this to the cleanliness of Pinterest data. Text data on Pins tends to be very concise, and contains less noise than other types of data, like web pages. Also, the related interests calculation already makes use of boards, which are heavily curated by people (vs. machines) to organize related content. We find that utilizing the co-occurrence of interest terms at both the Pin and board levels provides the best tradeoff between high precision and high recall when computing related interests.
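The co-occurrence computation can be sketched at the board level (the same counting applies at the Pin level); the data layout and function names here are illustrative, not the production implementation:

```python
from collections import Counter
from itertools import combinations

def related_interests(boards, top_k=3):
    """Rank related interests by how often two interest terms
    co-occur on the same board.

    boards: list of sets of interest terms, one set per board.
    Returns {term: [most related terms, strongest first]}.
    """
    cooccur = Counter()
    for terms in boards:
        for a, b in combinations(sorted(terms), 2):
            cooccur[(a, b)] += 1
    related = {}
    for (a, b), count in cooccur.items():
        related.setdefault(a, Counter())[b] = count
        related.setdefault(b, Counter())[a] = count
    return {term: [t for t, _ in c.most_common(top_k)]
            for term, c in related.items()}
```

Combining these counts with the analogous Pin-level counts gives the blended signal described above.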

One of the initial ways we began showing people related content was through related Pins. When you Pin an object, you’ll see a recommendation for a related board with that same Pin so you can explore similar objects. Additionally, if you scroll beneath a Pin, you’ll see Pins from other people who’ve also Pinned that original object. At this point, 90% of all Pins have related Pins, and we’ve seen 20% growth in engagement with related Pins in the last six months.

Powering interest feeds

Interest feeds provide Pinners with a continuous feed of highly related Pins. Our feeds are populated using a variety of sources, including search and our annotation pipeline. A key property of a feed is flow: only feeds with decent flow can attract Pinners to come back repeatedly, thereby maintaining high engagement. To achieve this, we utilize a number of real-time indexing and retrieval systems, including real-time search, real-time annotation, and human curation for some interests.

To ensure overall quality, we need to guarantee quality from all sources. For that purpose, we measure the engagement of Pins from each source and address quality issues accordingly.

Figure 5: How interest feeds are generated

More to come

Accurately capturing Pinner interests and interest relationships, and making this data understandable and actionable for tens of millions of people (collecting tens of billions of Pins), is not only an engineering challenge, but also a product design one. We’re just at the beginning, as we continue to improve the data and design ways for people to provide feedback, allowing us to build a hybrid system that combines machine and human curation to power discovery. The results of these efforts will be reflected in future product releases.

If you’re interested in building new ways of helping people discover the things they care about, join our team!

Acknowledgements: The core team members for the interests backend platform are Ningning Hu, Leon Lin, Ryan Shih and Yuan Wei. Many other folks from other parts of the company, especially the discovery team and the infrastructure teams, have provided very useful feedback and help along the way to make the ongoing project successful.

Ningning Hu is an engineer at Pinterest.

Read More

One of the most exciting aspects of working with Pinterest data is the opportunity to connect people with things and ideas they’re interested in. We know that interests change over time, and even day to day. What you’re interested in on Sunday morning when you want an awesome pancake recipe may not align exactly with the travel plans you’re dreaming up on Saturday.

Since one of our goals is to help Pinners find the content that inspires them at any moment, we’re constantly asking ourselves how we can help people discover the things they care about by making the right recommendations at the optimal time. Our answer lies in the data infrastructure we’ve built.

Digging into Pin Trends

We recently looked at aggregate data to see which categories peak throughout the week and which interests were most popular among Pinners at various times.

What we found is that TGIF is real. People start the week off motivated and Pinning mostly to fitness boards on Mondays, technology is popular on Tuesdays, and inspirational quotes see a spike on Wednesdays as people work through hump day. Fashion is big on Thursdays, while people are ready for a laugh on Friday and humor Pins spike. Over the weekend, travel is the top category on Saturday, and the week closes out on Sunday with food and craft ideas.

Improving discovery with context

As new content is created on Pinterest, we can identify the context behind a Pin based on a mix of signals, such as the board to which the Pin was added. Knowing when an individual Pin was created might not give us much information on its own, but because hundreds of others may have saved a similar Pin, we can deduce what that Pin is about. With a timestamp for each action, we can track how popular different categories of Pins are at different times of day or across the days of the week.
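The weekday aggregation behind these trends might look like the following sketch, with an illustrative event schema (a timestamp and an inferred category per save):

```python
from collections import Counter, defaultdict
from datetime import datetime

def top_category_by_weekday(pin_events):
    """For each weekday, the category with the most Pin saves.

    pin_events: iterable of (timestamp, category) pairs.
    """
    per_day = defaultdict(Counter)
    for ts, category in pin_events:
        per_day[ts.strftime("%A")][category] += 1
    return {day: counts.most_common(1)[0][0]
            for day, counts in per_day.items()}

events = [
    (datetime(2014, 6, 2, 9), "fitness"),   # a Monday
    (datetime(2014, 6, 2, 12), "fitness"),
    (datetime(2014, 6, 2, 20), "food"),
    (datetime(2014, 6, 7, 10), "travel"),   # a Saturday
]
top = top_category_by_weekday(events)
```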

We can go a level deeper by looking at the context of an action, such as if it was discovered in home feed, category feed, or search. We can use this information to make the product easier to navigate, as well as to build a more relevant recommendation engine.

Using these different sources, we analyzed Pinners’ propensity to engage with different topics by time of day, day of week, and month of the year. Learn more about these Pin trends on our Pinner blog. If you’re interested in digging into this type of data, join our team!

Andrea Burbank is a data engineer at Pinterest.

Read More

Thanks to everyone who came to our Engineering Tech Talks last week at the Pinterest HQ in San Francisco, where we covered:

Mobile & Growth

Scaling user education on mobile, and a deep dive into the NUX using the Experience Framework, with engineers Dannie Chu and Wendy Lu

Monetization & Data

The open sourcing of Pinterest Secor, and a look at zero data loss log persistence services, with engineer Pawel Garbacki

Developing & Shipping Code at Pinterest

The tools and technologies we use to build and deploy confidently, with engineers Chris Danford and Jeremy Stanley

For those who couldn’t make the talks, or would like a refresher, we’ve posted the slides.

Pinterest Engineering Tech Talks - 4/29/14 by Pinterest_Eng

You can always find more information from Pinterest Engineering right on this blog, or on our Facebook Page, where we’ll keep you posted on future tech talks.

Read More