Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Learn stream processing with Apache Beam

Tyler Akidau (Google), Jesse Anderson (Big Data Institute)
9:00am–12:30pm Tuesday, 09/27/2016
IoT & real-time
Location: 1B 03/04
Level: Beginner
Tags: real-time
Average rating: 4.50 (6 ratings)

Prerequisite knowledge

  • Familiarity with the high-level concepts covered in the O'Reilly Radar posts "The World Beyond Batch: Streaming 101" and "Streaming 102"
  • A working knowledge of Flink, Spark, or Cloud Dataflow

Materials or downloads needed in advance

  • A laptop
  • A GitHub account
  • Any initial setup already completed for the Beam execution engine of your choice (Flink, Spark, or Cloud Dataflow)

What you'll learn

  • Understand the foundations of stream processing and the ease with which portable streaming can be accomplished via the Apache Beam platform

Description

    Stream processing is increasingly relevant in today’s world of big data, thanks to the lower latency, higher-value results, and more predictable resource utilization afforded by stream processing engines. At the same time, without a solid understanding of the necessary building blocks, streaming can feel like a complex and subtle beast. It doesn’t have to be that way. Join Tyler Akidau and Jesse Anderson for a tour of stream processing concepts via a walkthrough of the easiest to use yet most sophisticated stream processing model on the planet, Apache Beam (incubating).

    You’ll explore a series of examples that help shed light on the important topics of windowing, watermarks, and triggers; observe firsthand the different shapes of materialized output made possible by the flexibility of the Beam streaming model; experience the portability afforded by Beam, as you work through examples using the runner of your choice (Apache Flink, Apache Spark, or Google Cloud Dataflow); and interact with engineers who have years of experience with massive-scale stream processing.

    Tyler Akidau

    Google

    Tyler Akidau is a senior staff software engineer at Google Seattle, where he leads technical infrastructure internal data processing teams for MillWheel and Flume. Tyler is a founding member of the Apache Beam PMC and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer that batch and streaming are two sides of the same coin and that the real endgame for data processing systems is the seamless merging of the two. He is the author of the 2015 “Dataflow Model” paper and the “Streaming 101” and “Streaming 102” blog posts. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

    Jesse Anderson

    Big Data Institute

    Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

    Comments

    Eric Sheetz
    09/24/2016 8:24am EDT

    The instructions in the body of the email advised this sequence of tasks:

    $ mvn eclipse:eclipse package

    Please download http://tiny.jesse-anderson.com/beamtutorial. That will
    contain the materials for the tutorial like the slides.

    I suggest reversing this and downloading http://tiny.jesse-anderson.com/beamtutorial first, as it offers more detailed setup instructions and gives different steps for Eclipse vs. IntelliJ.

    Also, if you get a Maven build error because no POM file was found, change to the directory that contains the pom.xml file and rerun the command.

    Jesse Anderson
    09/23/2016 8:33am EDT

    @riccardo There was a bug that was checked in. I added a workaround for it. Do a pull and try again.

    Riccardo Corbella
    09/23/2016 5:20am EDT

    I followed the instructions, but when I execute the command mvn exec:java -Dexec.mainClass="org.apache.beam.examples.tutorial.game.solution.Exercise1" I get the following:
    [WARNING]
    java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:294)
    at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.NoClassDefFoundError: org/apache/spark/api/java/JavaSparkContext
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
    at java.lang.Class.privateGetPublicMethods(Class.java:2902)
    at java.lang.Class.getMethods(Class.java:1615)
    at sun.misc.ProxyGenerator.generateClassFile(ProxyGenerator.java:451)
    at sun.misc.ProxyGenerator.generateProxyClass(ProxyGenerator.java:339)
    at java.lang.reflect.Proxy$ProxyClassFactory.apply(Proxy.java:639)
    at java.lang.reflect.Proxy$ProxyClassFactory.apply(Proxy.java:557)
    at java.lang.reflect.WeakCache$Factory.get(WeakCache.java:230)
    at java.lang.reflect.WeakCache.get(WeakCache.java:127)
    at java.lang.reflect.Proxy.getProxyClass0(Proxy.java:419)
    at java.lang.reflect.Proxy.getProxyClass(Proxy.java:371)
    at org.apache.beam.sdk.options.PipelineOptionsFactory.validateWellFormed(PipelineOptionsFactory.java:615)
    at org.apache.beam.sdk.options.PipelineOptionsFactory.register(PipelineOptionsFactory.java:553)
    at org.apache.beam.sdk.options.PipelineOptionsFactory.initializeRegistry(PipelineOptionsFactory.java:579)
    at org.apache.beam.sdk.options.PipelineOptionsFactory.<clinit>(PipelineOptionsFactory.java:528)
    at org.apache.beam.examples.tutorial.game.solution.Exercise1.main(Exercise1.java:125)
    ... 6 more
    Caused by: java.lang.ClassNotFoundException: org.apache.spark.api.java.JavaSparkContext
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 23 more
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD FAILURE
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 5.992 s
    [INFO] Finished at: 2016-09-23T15:14:34+02:00
    [INFO] Final Memory: 18M/231M
    [INFO] ------------------------------------------------------------------------
    [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.5.0:java (default-cli) on project Tutorial: An exception occured while executing the Java class. null: InvocationTargetException: org/apache/spark/api/java/JavaSparkContext: org.apache.spark.api.java.JavaSparkContext -> [Help 1]
    [ERROR]
    [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
    [ERROR] Re-run Maven using the -X switch to enable full debug logging.
    [ERROR]
    [ERROR] For more information about the errors and possible solutions, please read the following articles:
    [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

    Jesse Anderson
    09/15/2016 4:29pm EDT

    I updated the repository URL too.

    C G
    09/15/2016 1:57pm EDT

    For those who are Maven-incompetent like me: I ended up having to enable snapshots for the repository in my settings.xml. Sorry for the idiocy.
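
    For anyone unsure what that change looks like, here is a minimal sketch of a settings.xml (usually found at ~/.m2/settings.xml) that enables snapshot resolution. It assumes the SNAPSHOT artifacts are served from the Apache snapshots repository at the URL below; the profile id and repository id are hypothetical names chosen for illustration, and the tutorial's own setup instructions take precedence if they differ.

    ```xml
    <settings>
      <profiles>
        <profile>
          <!-- Hypothetical profile id; any unique name works -->
          <id>apache-snapshots</id>
          <repositories>
            <repository>
              <!-- Hypothetical repository id -->
              <id>apache.snapshots</id>
              <!-- Assumed location of the Apache snapshot artifacts -->
              <url>https://repository.apache.org/content/repositories/snapshots/</url>
              <releases>
                <enabled>false</enabled>
              </releases>
              <!-- This is the part that allows -SNAPSHOT versions to resolve -->
              <snapshots>
                <enabled>true</enabled>
              </snapshots>
            </repository>
          </repositories>
        </profile>
      </profiles>
      <!-- Activate the profile for every build -->
      <activeProfiles>
        <activeProfile>apache-snapshots</activeProfile>
      </activeProfiles>
    </settings>
    ```

    With the profile active, rerunning the Maven build should let it look for the 0.3.0-incubating-SNAPSHOT artifacts in that repository.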

    C G
    09/15/2016 12:44pm EDT

    I don’t see the version 0.3.0-incubating-SNAPSHOT artifacts for Apache Beam on repository.apache.org or Maven Central. Is that my fault?