Pinterest is a data-driven company where we’re constantly using data to inform decisions, so it was important to develop an efficient system to analyze that information quickly.
We ultimately landed on Redshift, a data warehouse service on Amazon Web Service, to power our interactive analysis, and import billions of records everyday to make the core data source available as soon as possible. Redshift was a great solution to answer questions in seconds to enable interactive data analysis and quick prototyping (whereas Hadoop and Hive were used to process tens of terabytes of data per day, but only give answers in minutes to hours).
Here’s a look at our experience with Redshift, including challenges and wins, while scaling tens of billions of Pins in the system.
Redshift is built to be easy to set up and fairly reliable. However, with petabytes of data and a rapidly growing organization, there are some interesting challenges for us to use Redshift in production.
Challenge 1: building 100-terabyte ETL from Hive to Redshift
With petabytes of data in Hive, it took us some time to figure out the best practice to import more than 100 terabytes of core data into Redshift. We have various data formats in Hive, including raw json, Thrift, RCFile, all of which need to be transformed to text files with a flat schema. We write schema-mapping scripts in Python to generate the Hive queries to do the heavy lifting ETL.
Most of our tables in Hive are time series data partitioned by date. For the best results, we use the date as the sortkey and append data to each table daily to avoid expensive VACUUM operations. Another approach is using a table per day and connecting them with a view, but we found Redshift did not optimize queries with view very well (e.g. it did not push down LIMIT).
Loading a big snapshot table is also a challenge. Our biggest table, db_pins, which holds 20 billions of Pins, is more than 10TB in size. Loading it in one shot results in expensive partitioning and sorting, so we do the heavy partitioning in Hive and load it to Redshift in chunks.
Since Redshift has limited storage size, we implemented table retention for big time-series tables, by periodically running an INSERT query in Redshift to clone a new table with less data, which was much faster than deleting rows and doing expensive VACUUM, or dropping the entire table and re-importing.
Perhaps the biggest challenge was with the S3 Eventual Consistency. We found the data in Redshift got significantly missed sometimes, and traced it down to the S3 issue. We reduced the number of files on S3 by combining small files, and minimized the missing data. We also added the audit to the ETL at each step, and data loss rate is now usually under 0.0001% which is acceptable.
Challenge 2: getting the 100x speedup over Hive out of Redshift
Redshift is built to be high-performing, and as a result, during prototyping we were able to achieve 50-100x speedup over Hive for some queries without much effort. But once put into production, it did not always give the expected performance out of box.
In our early testing we observed a few hour-long queries. Debugging the performance issue was very tricky, requiring collecting the query plan, query execution statistics, etc., but ultimately it didn’t have that many performance tricks.
A lesson we learned here was to prepare the data well, and update the system statistics whenever necessary, as it can tremendously affect how well the optimizer works. Choose a good sortkey and distkey for each table, and note sortkey is always good to have, but bad distkey may contribute to skew and hurt the performance.
Below is the benchmark result between our Hive and Redshift clusters, based on db_pins (~20 billion rows with 50 columns, total size 10TB) and other core tables. Keep in mind that these comparisons don’t account for cluster sizes, contention and other possible optimizations so the comparisons are by no means scientific.
We also observed common mistakes and summarized our best practices, and built utilities to monitor slow queries in real time. Queries that take more than 20 minutes are considered suspicious, and engineers will receive a reminder to review our following best practices:
Perhaps the most common mistake is that many tend to put “SELECT *” in the select clause, which is against the columnar storage as it requires scanning all the columns unnecessarily.
Challenge 3: managing the contention among growing number of queries/users
With the impressive performance, Redshift was widely adopted at Pinterest soon after we set it up. We’re a fast growing organization, and with increasing number of people using Redshift concurrently, queries can slow down easily due to the contention among growing number of queries. An expensive query can occupy lots of resources and slow down other concurrent queries, so we came up with our own policy to minimize the contention.
We learned to avoid heavy ETL queries during peak hours (9am-5pm). ETL queries like COPY use a lot of I/O and network bandwidth, and should be avoided during these hours in favor of users’ interactive queries. We optimize our ETL pipeline to finish before peak hours, or we’ll pause the pipeline and resume afterward. Additionally, timeout users’ interactive queries during peak hours. Long queries likely have some mistakes and should be timed out as soon as possible instead of wasting the resource.
We’ve settled on a 16-node hs1.8xlarge. Almost 100 people use Redshift at Pinterest regularly and we’re running 300-500 interactive queries each day. The overall performance has exceeded our expectations, as most queries can finish in a few seconds. Here’s the duration percentile of all the interactive queries in the last week. We can see that 75% queries can finish in 35 seconds.
Because of the success we’ve had with Redshift, we’re continuing to integrate it with our next generation tools. If you’re interested in taking on these type of challenges and coming up with quick, scalable solutions, join our team!
Jie Li is an engineer at Pinterest.