Increasing demand for higher-granularity data continues to push the boundaries of what big data technologies can process. At Netflix, the Big Data Platform team manages a highly organized and curated data warehouse in Amazon S3 with over 25 petabytes of data. Even with great effort to optimize datasets, our metadata is growing to the point that it is itself big data. With thousands of tables, and partitions per table reaching into the millions, approaches to isolating data and processing it efficiently are breaking down.
The Big Data Platform team recently made additions to Presto and is adding similar functionality to other processing engines like Spark, Hive, and Pig that will allow for new paradigms of efficient storage and optimized processing. The foundation for this effort is the Parquet file format, which is the predominant storage format in our data warehouse. Integrating advanced features of this format with capabilities of modern processing engines allows us to enhance processing by use of the following:
Daniel Weeks takes attendees through how Netflix combines these features with storage patterns and other enhancements the team has made to achieve performance improvements, which can exceed 100x in real-world query cases. Daniel will go into detail about how Netflix’s big data platform collects, processes, and stores data from the data pipeline to take advantage of Parquet and how the team uses Spark on YARN for ETL and Presto for interactive analytics to enhance performance. He will also demonstrate applications of this approach that dramatically speed up the querying of telemetry-service data and A/B-testing-platform data.
Daniel Weeks manages the big data compute team at Netflix and is a Parquet committer. Previously, Daniel focused on research in big data solutions and distributed systems.
©2016, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.