Brought to you by NumFOCUS Foundation and O’Reilly Media Inc.
The official Jupyter Conference
August 22-23, 2017: Training
August 23-25, 2017: Tutorials & Conference
New York, NY

Data science made easy in Jupyter notebooks using PixieDust and InsightFactory

David Taieb (IBM), Prithwish Chakraborty (IBM Watson Health), Faisal Farooq (IBM Watson Health)
11:55am–12:35pm Thursday, August 24, 2017
Development and community
Location: Sutton Center/Sutton South
Level: Beginner

Who is this presentation for?

  • Data scientists, developers, and line of business users

Prerequisite knowledge

  • A basic understanding of common charting techniques (bar, line, pie, scatter plots, maps, and histograms)
  • No advanced data science experience required

What you'll learn

  • Explore tools and techniques to improve your productivity when working with data inside Jupyter notebooks using PixieDust and InsightFactory

Description

Jupyter notebooks have gained widespread adoption among data scientists due to their extensibility and self-contained, language-agnostic environment. Typically, data scientists follow a number of repeatable steps to collect, analyze, and visualize the data under consideration. David Taieb, Prithwish Chakraborty, and Faisal Farooq offer an overview of PixieDust, a new open source library that speeds data exploration with interactive autovisualizations that make creating charts easy and fun, and InsightFactory, aimed at reuse, collaboration, and productivity improvement across an enterprise.

PixieDust speeds up data manipulation and display with features such as interactive autovisualization of Apache Spark and pandas DataFrames using popular chart engines like Matplotlib, seaborn, Bokeh, or Mapbox; real-time Spark job progress monitoring directly from a notebook; seamless integration with cloud services; and easy creation of sophisticated dashboards. InsightFactory includes an enterprise-ready, standardized, multistep, hierarchical workflow in the interactive sidebar; a prepopulated workflow with standardized code modules for the workflow steps; user-contributed custom code modules beyond the standardized snippets, for individual as well as enterprise use; and quick module access via search bars. Using the two tools in tandem, you can visualize and explore data interactively, all within a standardized and reusable setting. PixieDust can also run on a Scala kernel, so you can use your favorite Python chart engines from a Scala notebook.

David, Prithwish, and Faisal conclude with demos that showcase each tool individually and, more importantly, in conjunction with the other. They explore analytic applications that combine multiple data sources and technologies, including Twitter, Watson Tone Analyzer, Spark Streaming, and some fun real-time visualizations, all running within a notebook. They also demonstrate how a data scientist can start with an empty notebook and, with a few clicks, build a data science experiment for clinical research based on predefined workflows. Finally, they show how combining PixieDust and InsightFactory both speeds up that experiment through faster data exploration and lets you save and share the process as a workflow for reuse in other experiments and by other users, all with little to no coding.

Contributor

Bibo Hao is a research scientist on the Cognitive Healthcare team at IBM Research China, where he builds tools to help data scientists and researchers find insights using machine-learning techniques applied to healthcare data. He received an MS in computer science from the University of Chinese Academy of Sciences in 2015, where his research focused on the application of machine learning in social media and psychology.

Speakers


David Taieb

IBM

David Taieb is the STSM for the Cloud Data Services Developer Advocacy team at IBM, where he leads a team of avid technologists with the mission of educating developers on the art of the possible with cloud technologies. He’s passionate about building open source tools, such as the PixieDust Python library for the Jupyter Notebook and Apache Spark, that help improve developers’ productivity and overall experience. Previously, David was the lead architect for the Watson Core UI and Tooling team based in Littleton, Massachusetts, where he led the design and development of a Unified Tooling Platform to support all the Watson Tools, including accuracy analysis, test experiments, corpus ingestion, and training data generation. Before that, he was the lead architect for the Domino Server OSGi team, responsible for integrating the eXpeditor J2EE Web Container in Domino and building first-class APIs for the developer community. David started with IBM in 1996, working on various globalization technologies and products, including Domino Global Workbench and a multilingual content management system for the WebSphere Application Server. David enjoys sharing his experience by speaking at conferences and meeting as many people as possible. You’ll find him at various events like the Strata Data Conference, Velocity, and IBM InterConnect.


Prithwish Chakraborty

IBM Watson Health

Prithwish Chakraborty is a data scientist on the IBM Watson for Real World Evidence team at IBM Watson Health. His work focuses on applications of data science to patient health characterization and risk modeling. Broadly, his research interests are temporal data mining, machine learning, and image recognition. His work has been published in key data science venues, including KDD, SDM, and AAAI. He presented a tutorial on public health forecasting at AAAI 2016 and gave an invited talk at BCDE 2014. Prithwish holds a patent with HP Labs on forecasting solar photovoltaic output. He holds a PhD in computer science from Virginia Tech, where his research, under the guidance of Naren Ramakrishnan, focused on the applications of data science to public health forecasting.


Faisal Farooq

IBM Watson Health

Faisal Farooq is the principal scientist in the Watson Health group of IBM Watson, where he works on next-generation healthcare software to improve patient care. Faisal is an expert in applying machine learning in the healthcare domain. Previously, he was a senior key expert (distinguished scientist) at Siemens Healthcare, where he successfully delivered the most widely adopted data science product in US healthcare. Faisal has published a number of papers in journals and at conferences in the areas of machine learning, handwriting, biometrics, and text analysis. He holds a PhD in computer science and engineering from the University at Buffalo, where he worked as a graduate research assistant in the Center of Excellence for Document Analysis and Recognition (CEDAR) and the Center for Unified Biometrics and Sensors (CUBS). He also completed multiple research internships at the IBM T.J. Watson Research Center.