Creating production-ready analytical pipelines can be a messy, error-prone undertaking. Even in the simplest case, a workflow of heterogeneous components, such as databases, feature enrichment and visualization tools, programming languages, and analytical engines, requires maintaining connections between multiple tools, each of which is subject to its own development cycle. For projects involving big data or analytics over real-time streaming data, the difficulties only increase.
The Trusted Analytics Platform (TAP) is a platform built on open source software that combines elements from popular projects, including Python, Spark Streaming, GearPump, and Docker. TAP enables data scientists to ask bigger questions of their data and carry out principled data science experiments, all while engaging in iterative, collaborative development of production solutions with application developers. Since TAP was introduced in 2015, project contributions have included popular analytics tools and libraries, as well as the ability to “bring your own.” Kyle Ambert offers an overview of these open source project contributions, which include a new Docker-based architecture and improved Spark integration, and explains what they mean to data scientists. Kyle also discusses a machine-learning-based healthcare solution for identifying hospital patients at risk of readmission.
This session is sponsored by Intel.
Kyle Ambert is lead data scientist at Intel’s Artificial Intelligence and Analytics Solutions group, where he uses machine learning and statistical methods to solve real-world big data problems. His current research centers on novel applications of machine learning in the health and life sciences. Kyle contributes to the data science direction of the Trusted Analytics Platform, particularly as it pertains to analytical pipeline and algorithm development. He holds a BA in biological psychology from Wheaton College and a PhD in biomedical informatics from Oregon Health & Science University, where his research focused on text analytics and developing machine-learning optimization solutions for biocuration workflows in the neurosciences.