Juliet Hougland and Sean Owen offer a practical overview of the basics of using Python data tools with a Hadoop cluster, presented as an interactive demo with accompanying online materials and data. Juliet and Sean cover HDFS connectivity and working with raw data files; running SQL queries with a SQL-on-Hadoop system such as Apache Hive or Apache Impala (incubating); and the basics of accessing data with Spark and creating new Spark DataFrames. They also implement the two most common modeling workflows: fitting a model on a single node with scikit-learn, saving the model, and applying it in an embarrassingly parallel fashion; and fitting a model to distributed data with Spark MLlib.
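As a rough illustration of the first modeling workflow mentioned above (fit on a single node, save the model, apply it to independent chunks of data in parallel), here is a minimal sketch. It is not the presenters' demo code. To stay self-contained it uses only the Python standard library: a hand-written one-feature least-squares fit stands in for a scikit-learn estimator, a JSON file stands in for the saved model, and a thread pool over in-memory lists stands in for a cluster scoring partitions of a distributed dataset. The names `fit_line`, `score_partition`, and `model.json` are hypothetical.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (single node)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def score_partition(model_path, partition):
    """Load the saved model and score one partition, independently of the others."""
    with open(model_path) as f:
        model = json.load(f)
    return [model["slope"] * x + model["intercept"] for x in partition]


with tempfile.TemporaryDirectory() as tmp:
    # 1. Fit the model on a small training set that fits on one machine.
    model = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

    # 2. Save the fitted model where every worker can read it.
    model_path = os.path.join(tmp, "model.json")
    with open(model_path, "w") as f:
        json.dump(model, f)

    # 3. Apply the model to each partition in parallel; partitions never
    #    need to communicate, which is what makes this embarrassingly parallel.
    partitions = [[4.0, 5.0], [6.0], [7.0, 8.0, 9.0]]
    with ThreadPoolExecutor(max_workers=3) as pool:
        scored = list(pool.map(partial(score_partition, model_path), partitions))

print(scored)
```

On a real cluster, the same shape would typically hold with the stand-ins swapped out: a scikit-learn estimator serialized to shared storage, and a Spark job that loads it once per partition and scores that partition's rows.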
Juliet Hougland answers complex business problems using statistics to tame multiterabyte datasets. She succeeds in applying and explaining the results of mathematical models across a variety of industries including software, industrial energy, retail, and consumer packaged goods. Juliet is currently the head of data science, engineering at Cloudera, where she focuses on using data to help engineering build high-quality products. Juliet’s been sought after by Cloudera’s customers as a field-facing data scientist advising on which tools to use, teaching how to use them, recommending the best approach to bring together the right data to answer the business problem at hand, and building production machine-learning models. For many years, Juliet has been a contributor in the open source community working on projects such as Apache Spark, Scalding, and Kiji. Juliet holds an MS in applied mathematics from the University of Colorado, Boulder and graduated Phi Beta Kappa from Reed College with a BA in math-physics.
Sean Owen is director of data science at Cloudera in London. Before Cloudera, he founded Myrrix Ltd. (now the Oryx project) to commercialize large-scale real-time recommender systems on Hadoop. He is an Apache Spark committer, was a committer and VP for Apache Mahout, and is the coauthor of Advanced Analytics on Spark and Mahout in Action. Previously, Sean was a senior engineer at Google.
©2016, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with, and does not endorse or review, the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.