The Big Data Platform team at Netflix maintains a cloud-based data warehouse with over 10 petabytes of data stored predominantly in Parquet format. Our platform has traditionally leveraged Pig for ETL processing, Hive for large analytic workloads, and Presto for interactive and exploratory use cases. For a long time, Spark seemed attractive to complement our platform, but technical gaps prevented effective use at scale in our environment. Recent improvements have allowed us to add Spark to our cloud data architecture and interoperate seamlessly with the other tools and services in our stack.
We will go into detail about our deployment configuration and what it takes to run Spark alongside traditional workloads on YARN. We will share examples of a few of our largest workflows translated to Spark and compare them in terms of both performance and complexity. We also identified cases where big data tools were used to solve problems clearly outside their intended domains, resulting in awkward implementations that Spark replaced with more elegant solutions. Finally, we will share our vision of how Spark will evolve our platform and advance the state of big data processing at Netflix.
Daniel Weeks manages the big data compute team at Netflix and is a Parquet committer. Previously, Daniel focused on research in big data solutions and distributed systems.
©2015, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.