Data science for enterprise use cases causes the number of intermediate datasets to explode. One of the upcoming challenges is therefore finding a way through these ever-growing data sources. Andy Petrella proposes a data-science-on-data-science approach that combines behavioral data with the static and runtime metadata of processes.
Andy explores the well-known problems of doing data science, such as finding the right data, connecting to it, and figuring out the content, provenance, and other contextual information you need before reading a dataset. Data science platforms now rely on distributed technologies because the amount of data keeps increasing, which makes this discovery process more expensive.
Andy emphasizes the links between people, data, and processes in an enterprise and dives into how to collect this information in an organic manner (dynamically and implicitly), using a combination of a notebook and a harvester system.
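To make the idea of implicit collection concrete, here is a minimal sketch in Python. It is not taken from the talk and does not reflect any actual product API: the `Harvester` class, the `track` decorator, the `LineageEvent` record, and the dataset names are all hypothetical. It only illustrates how a process could report who ran it, what it read and wrote, and some runtime metadata, without the author of the process writing any documentation by hand.

```python
import functools
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LineageEvent:
    """One observed run of a process (hypothetical record layout)."""
    user: str                  # behavioral data: who ran the process
    process: str               # static metadata: name of the process
    inputs: List[str]          # static metadata: datasets read
    output: str                # static metadata: dataset written
    output_columns: List[str]  # runtime metadata: observed schema
    duration_s: float          # runtime metadata: wall-clock time


@dataclass
class Harvester:
    """Collects lineage events as a side effect of running processes."""
    events: List[LineageEvent] = field(default_factory=list)

    def track(self, inputs: List[str], output: str, user: str) -> Callable:
        """Wrap a function that returns rows (a list of dicts)."""
        def decorator(fn: Callable[..., List[Dict]]) -> Callable:
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                rows = fn(*args, **kwargs)
                elapsed = time.perf_counter() - start
                # Infer the schema from the first row, if there is one.
                columns = sorted(rows[0].keys()) if rows else []
                self.events.append(
                    LineageEvent(user, fn.__name__, list(inputs),
                                 output, columns, elapsed)
                )
                return rows
            return wrapper
        return decorator

    def upstream_of(self, dataset: str) -> List[str]:
        """Return the datasets that were read to produce `dataset`."""
        return sorted({src for e in self.events
                       if e.output == dataset for src in e.inputs})


harvester = Harvester()


@harvester.track(inputs=["raw/orders", "raw/customers"],
                 output="clean/orders_by_customer", user="analyst")
def build_orders_by_customer() -> List[Dict]:
    # Stand-in for a real job; the rows are made up for illustration.
    return [{"customer": "c1", "total": 30.0},
            {"customer": "c2", "total": 12.5}]


result = build_orders_by_customer()
```

After the job runs, `harvester.upstream_of("clean/orders_by_customer")` answers the provenance question for that dataset, and the recorded events link a person, a process, and the data it touched.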
Andy Petrella is the CEO of Kensu, where he also gets his hands dirty in Adalog’s code. Andy is a mathematician turned distributed computing entrepreneur. Besides being a Scala/Spark trainer, Andy has participated in many projects built using Spark, Cassandra, and other distributed technologies in various fields, including geospatial analysis, the IoT, automotive, and smart cities. Andy is the creator of the Spark Notebook, the only reactive and fully Scala notebook for Apache Spark. In 2015, Andy cofounded Data Fellas with Xavier Tordoir around their product, the Agile Data Science Toolkit, which facilitates the productization of data science projects and guarantees their maintainability and sustainability over time. Andy is also a member of the program committee for the O’Reilly Strata, Scala eXchange, Data Science eXchange, and Devoxx events.
©2017, O’Reilly UK Ltd • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.