Building Pinalytics: Pinterest’s data analytics engine
Pinterest is a data-driven company. On the Data Engineering team, we’re always considering how we can deliver data in a meaningful way to the rest of the company. To help employees analyze data quickly and understand metrics more efficiently, we built Pinalytics, our own customizable platform for big data analytics.
We built Pinalytics with the following goals in mind:
- Simple interface to create custom dashboards to analyze and share data
- Scalable backend that supports low latency queries
- Support for persisting and automatically updating report data
- Ability to segment and aggregate high dimensional data
Pinalytics has three main components:
- Web app, which has the rich UI components to help users dig in and discover insights
- Reporter, which generates the data in a report format
- Backend, which consists of the Thrift service and HBase databases that power the tool
The Pinalytics web application stack consists of MySQL, SQLAlchemy and the Flask framework. The frontend is built using the React.js library to create user interface components such as simple, interactive visualizations.
Visualization is the main form of analysis within Pinalytics, with a specific focus on time-series plots that update daily. Everything can be customized, including the data displayed in a chart (by choosing specific built-in or user-defined metrics) as well as the chart itself (by selecting various features of the visualization).
We have more than 100 custom dashboards that offer access to charts and metrics for daily tracking. We offer operations that are applied globally to all charts within a particular dashboard, including segmentation, rollup aggregation and setting the time window or axis origins.
The metrics composer is a tool built on top of Pinalytics that lets users create customized time-series data by combining metrics via composite functions on the frontend. Pinterest teams can call and embed various functions to create a formula that’s evaluated and displayed dynamically. For example, DIVIDE(SUM(M1, M2), M3) would be a valid composition. We support basic arithmetic functions as well as more complex calculations we commonly use, including a simple moving average with a seven-day window, a three-day lag difference and anomaly detection.
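To make the composition model concrete, here is a minimal Python sketch of how composite functions like DIVIDE(SUM(M1, M2), M3) could be evaluated over daily series. The function names mirror the formula syntax, but the implementations are illustrative assumptions, not Pinalytics internals:

```python
# Illustrative sketch: composite metric functions over daily time series.
# Each metric is a list of daily values; functions combine them pointwise.

def SUM(*series):
    """Pointwise sum of one or more equal-length daily series."""
    return [sum(vals) for vals in zip(*series)]

def DIVIDE(numerator, denominator):
    """Pointwise ratio, guarding against division by zero."""
    return [n / d if d else None for n, d in zip(numerator, denominator)]

def moving_average(series, window=7):
    """Simple moving average; the first window-1 days have no value yet."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(series[i + 1 - window:i + 1]) / window)
    return out

def lag_difference(series, lag=3):
    """Difference between each day and the value `lag` days earlier."""
    return [None if i < lag else series[i] - series[i - lag]
            for i in range(len(series))]

# Example composition: DIVIDE(SUM(M1, M2), M3)
m1, m2, m3 = [1, 2, 3], [3, 2, 1], [2, 2, 2]
ratio = DIVIDE(SUM(m1, m2), m3)  # [2.0, 2.0, 2.0]
```

Because each function maps series to series, compositions nest arbitrarily, which is what makes formulas like the one above straightforward to evaluate.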
Custom reporting and metric computation
We wanted to enable employees to customize both the data being visualized and the segmentation of that data, a combination which we call a report. Creating a new customized segment report for Pinalytics that’s updated daily is as easy as a few lines of code or simply writing a Hive query in a separate user interface built for non-technical users. After running the query, the report will automatically appear on Pinalytics for further analysis and update daily.
With the ability to write custom jobs comes the possibility of redundancy, where the same data is processed over and over again. To avoid this problem, we consolidated the core ETLs that compute common metrics. These core metrics are also adjusted to account for the most recently identified spam. Finally, we offer higher dimensionality by segmenting data on important features such as gender, country and application type, when applicable. The core metrics consist of data that tracks user activity, events, retention and signups, as well as unique events and events per application type.
Scalable analytics engine
One big challenge in building a system like Pinalytics at scale was allowing flexible and efficient aggregation over multidimensional data at run time. For example, a single chart segmented by country, app type and event type across different days on fully denormalized data could end up rolling up over 1 billion rows in a single load. To avoid pre-processing of data, and at the same time provide low-latency post-aggregation, we needed a scalable backend engine.
To address this problem, we designed our own backend made up of several components, including a Thrift server, HBase with coprocessors deployed and a secondary index table. The secondary index table contains metadata for every report, such as the metrics, segment keys and encodings of each report. When a report is created, the Thrift Handler automatically invokes the creation of an HBase table for the report, manages the splits of this report table and constructs segment key encodings, which are persisted in the secondary index table. The purpose of segment value encoding is to keep the row key as short as possible. Each report can have multiple metrics and dimensions (segments), and the row key is constructed from the report’s metric and its encoded segment values.
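The post doesn’t spell out the exact row-key layout, but the idea of segment-value encoding can be sketched as follows: each distinct segment value gets a short fixed-width code (the mapping living in the secondary index table), so row keys stay compact no matter how long the raw segment strings are. Everything below (the 2-byte code width, the key layout) is an assumption for illustration:

```python
import struct

# Hypothetical sketch of segment-value encoding for short row keys.
# The code width and key layout are assumptions, not Pinalytics' actual format.

class SegmentEncoder:
    """Maps segment values (e.g. country names) to 2-byte codes,
    the way a secondary index table might persist them."""
    def __init__(self):
        self.codes = {}  # segment value -> integer code

    def encode(self, value):
        code = self.codes.setdefault(value, len(self.codes))
        return struct.pack(">H", code)  # 2 bytes instead of a full string

def make_row_key(metric_id, date, encoded_segments):
    """Row key = metric id + date + fixed-width encoded segment values."""
    return struct.pack(">H", metric_id) + date.encode() + b"".join(encoded_segments)

country = SegmentEncoder()
app_type = SegmentEncoder()
key = make_row_key(7, "20150401",
                   [country.encode("US"), app_type.encode("iPhone")])
# 2 + 8 + 2 + 2 = 14 bytes, regardless of segment-value lengths
```

Fixed-width codes matter for more than size: they give every row key a predictable byte layout, which is what makes the fuzzy-mask filtering described next possible.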
To support flexible roll-up of arbitrary dimensions/segments for each request, the Thrift server takes advantage of the HBase FuzzyRowFilter. The Thrift server constructs a FuzzyRowFilter mask to filter eligible row keys based on the segment specification. The Thrift Handler then initiates a parallel request on each table region with this mask. Each region has a coprocessor deployed to roll up metric values for qualified rows in parallel. The Thrift Handler performs the final aggregation on the results returned from each region to obtain the final value.
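HBase’s FuzzyRowFilter matches row keys against a pattern plus a byte mask, where a mask byte of 0 means “this position must match” and 1 means “any byte is fine.” A minimal Python simulation of that semantics shows how a roll-up over unspecified segments can be expressed (the key layout and values here are invented for illustration):

```python
# Minimal simulation of HBase FuzzyRowFilter semantics:
# mask byte 0 = position must match the pattern, 1 = wildcard.

def fuzzy_match(row_key, pattern, mask):
    """True if row_key matches pattern at every fixed (mask == 0) position."""
    return all(m == 1 or r == p
               for r, p, m in zip(row_key, pattern, mask))

# Roll up metric 7 across ALL countries for app type "ip":
# fix the metric and app-type bytes, wildcard the 2-byte country code.
pattern = b"\x00\x07" + b"\x00\x00" + b"ip"
mask    = [0, 0,        1, 1,         0, 0]

rows = {
    b"\x00\x07\x00\x01ip": 10,  # country 1, iPhone
    b"\x00\x07\x00\x02ip": 5,   # country 2, iPhone
    b"\x00\x07\x00\x01an": 3,   # country 1, Android -- filtered out
}
total = sum(v for k, v in rows.items() if fuzzy_match(k, pattern, mask))
# total == 15: both iPhone rows were rolled up across countries
```

In the real system, this filtering and the per-region summation happen inside the coprocessors, so only partial sums (not raw rows) travel back to the Thrift Handler.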
As mentioned earlier, we faced a variety of challenges while building these metrics. Let’s take activity metrics as an example. Activity metrics depend on multiple days of user activity data. For instance, we have a dataset called xd28_users that tracks users’ different actions during the last 28 days. As it takes some time to accurately catch spammy users, once we learn about a user’s malicious activities we have to recompute past metrics. Recomputing these metrics is expensive both in terms of I/O and computation since a large amount of user activity data over the past X months needs to be processed. Additionally, our data is partitioned by date, which prevents us from efficiently aggregating the activity metrics. A natural way to aggregate user actions for consecutive days or months is to keep a rolling window of metrics for each user over 28 days and partially add or subtract events from these metrics as we move forward day by day. However, keeping the intermediate metrics and event data in memory was prohibitively expensive due to the scale.
We first transformed the data so that each user’s event data and spam definition over months are stored in contiguous file blocks. This way, we only have to keep the intermediate metrics and event data of one user in memory at a given time. The resulting core metrics are then computed by a Cascading job. Cascading’s data-streaming abstraction on Hadoop MapReduce improves developer productivity and provides more flexibility than Hive. The data transformation, along with the rolling-window algorithm, gave us a speedup of 20-50x.
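The rolling-window idea above can be sketched in a few lines: once one user’s events are laid out contiguously and sorted by day, advancing a 28-day window means adding the incoming day and subtracting the day that falls out, rather than re-summing 28 days of events each time. This is a hypothetical sketch of the technique, not the Cascading job itself:

```python
from collections import deque

# Sketch of rolling-window aggregation for one user's contiguous event data.
# Advancing the window is O(1) per day: add the new day, drop the oldest.

def rolling_active_events(daily_events, window=28):
    """For each day, count events in the trailing `window` days."""
    recent = deque()  # per-day event counts currently inside the window
    total = 0
    out = []
    for events in daily_events:
        recent.append(events)
        total += events
        if len(recent) > window:
            total -= recent.popleft()  # subtract the day leaving the window
        out.append(total)
    return out

# Three-day window over one user's daily event counts:
counts = rolling_active_events([1, 0, 2, 5, 0], window=3)
# counts == [1, 1, 3, 7, 7]
```

Because only one user’s window lives in memory at a time, the approach scales to the full user base without the prohibitive memory cost mentioned above.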
More than half of the teams at Pinterest are using Pinalytics to track relevant metrics daily and make fast decisions backed by data to improve the user experience.
If you’re interested in working on large scale data processing and analytics challenges like this one, join our team!
Acknowledgements: Pinalytics was built by Chunyan Wang, Stephanie Rogers, Jooseong Kim and Joshua Inkenbrandt, along with the rest of the data engineering team. We got a lot of useful feedback from the Business Analytics team during the development phase and other engineers from across the company.