Enterprises are increasingly looking to deploy big data technologies to transform their business, but the time it takes to generate value from their data often exceeds their worst expectations. Hadoop and Spark provide unprecedented scale and flexibility at a low cost compared to data warehouses. However, the messy and diverse nature of big data means that users have to stitch together disparate systems, resulting in undesirable complexity and inefficiency. Even simple tasks like ingesting or transforming data can be cumbersome, requiring many lines of complex, hand-written code. The sheer volume of data, with petabytes distributed across a cluster, further complicates operations, security, and data governance, and the shortage of skilled people is a major barrier to using and operationalizing this modern data architecture.
Moreover, many of these advantages create downstream issues. Schema-on-read allows more flexibility but is turning data lakes into data swamps. The wide choice of open source technologies promises a rich, diverse ecosystem but in practice yields specialized, divergent options that trigger integration headaches (e.g., multiple storage layers, many processing engines, and various workflow engines and schedulers). Point solutions are limited in scope and either cannot easily be put into production or require custom integration code. Finally, breaking down data silos and democratizing data is not easily achievable, because the platform has severe usability shortcomings for business users, such as its reliance on the command line and on code.
Jonathan Gray explores the standardization, automation, and deep integration technologies in Hadoop and Spark that allow companies, developers, and users to focus on application logic and insights rather than infrastructure and integration.
This session is sponsored by Cask.
Jonathan Gray is the founder and CEO of Cask. Jonathan is an entrepreneur and software engineer with a background in startups, open source, and all things data. Prior to founding Cask, he was a software engineer at Facebook, where he helped drive HBase engineering efforts, including Facebook Messages and several other large-scale projects, from inception to production. An open source evangelist, Jonathan was responsible for helping build the Facebook engineering brand through developer outreach and refocusing the open source strategy of the company. Prior to Facebook, Jonathan founded Streamy.com, where he became an early adopter of Hadoop and HBase. He is now a core contributor and active committer in the community. Jonathan holds a bachelor’s degree in electrical and computer engineering from Carnegie Mellon University.