Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Top five mistakes when writing Spark applications

Ted Malaska (Capital One), Mark Grover (Lyft)
1:15pm–1:55pm Wednesday, 09/28/2016
Spark & beyond
Location: Hall 1B
Level: Advanced
Average rating: 3.92 (12 ratings)

Prerequisite knowledge

  • A working knowledge of Spark
What you'll learn

  • Understand common mistakes and how to fix them, in order to write more capable, powerful Spark applications
Description

    In the world of distributed computing, Spark has simplified development and opened the doors for many to start writing distributed programs. Folks with little to no distributed coding experience can now write just a couple of lines of code that will immediately get hundreds or thousands of machines working on creating business value.

    Spark code is easy to write and read, but users still run into long-running, slow-performing jobs and out-of-memory errors. Thankfully, most of these issues have nothing to do with Spark itself but rather with the approach taken when using it. Ted Malaska and Mark Grover cover the top five things that prevent Spark developers from getting the most out of their Spark clusters. When these issues are addressed, it is not uncommon to see the same job running 10x or 100x faster on the same clusters and the same data, using just a different approach.

    Ted Malaska

    Capital One

    Ted Malaska is a director of enterprise architecture at Capital One. Previously, he was the director of engineering in the Global Insight Department at Blizzard; principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem; and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

    Mark Grover

    Lyft

    Mark Grover is a product manager at Lyft. Mark’s a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He’s also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He’s a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

    Comments

    Alex Rivlin
    10/02/2016 8:34pm EDT

    Ted and Mark, can you please share the link to the slides? – thank you

    09/28/2016 10:29am EDT

    The link below was shared in the presentation:
    tiny.cloudera.com/spark-mistakes

    09/28/2016 9:36am EDT

    Will you guys be sharing the slides today?