Data platform architecture principles
Who is this presentation for?
- Engineers, architects, and data scientists
We’re well into the big data era. Most organizations have embraced collecting data and analyzing what’s happening inside their products. This is crucial to their success, not only to understand what works but also to optimize their services and increase their value to their customers. Several industries are being disrupted just by using technology to optimize existing processes. Think about cabs, short-term rentals, or coworking spaces.
There’s also discussion around what constitutes “big” data. Here, we’re not only talking about large volumes of data produced by the likes of Google, Facebook, and other very large companies. We’re also talking about the multitude of various data sources and the many teams using them and producing derived datasets. The concept of a central data team that does all the data-related work is outdated; the entire organization should become an ecosystem where teams depend on each other. Central data teams now become enablers, coaching and providing a safe and flexible environment to move fast while bringing transparency to the increasing complexity of interdependent systems. Data processing and microservices have similar requirements in terms of ownership, monitoring, and dependency management.
Julien Le Dem outlines the principles to follow when building a data platform that enables the entire organization to build data-driven products, whether you’re using insights from the data or using the data directly to build features (for example, recommendations). Every team can consume and produce data using explicit contracts: what they share or don’t, the level of service they provide, and the quality of the data. We need to build visibility across the entire organization and help the dependency graph evolve, with global lineage and schema evolution.
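As a rough illustration of explicit contracts, lineage, and schema evolution, here is a minimal Python sketch. It is not taken from the talk: the `DatasetContract` class, its fields, the helper functions, and the example datasets and team names are all hypothetical, and the compatibility rule (old columns must keep their type, new columns may be added) is just one simple assumption among many possible policies.

```python
from dataclasses import dataclass


@dataclass
class DatasetContract:
    """Hypothetical contract a team publishes for a dataset it owns."""
    name: str
    owner: str
    public: bool                 # what the team shares or doesn't
    freshness_sla_minutes: int   # level of service the team commits to
    schema: dict                 # column name -> type name
    upstream: tuple = ()         # names of datasets this one is derived from


def is_backward_compatible(old: dict, new: dict) -> bool:
    """Assumed rule: every old column survives with the same type;
    added columns are allowed."""
    return all(new.get(col) == typ for col, typ in old.items())


def downstream_of(name: str, contracts: list) -> set:
    """Return every dataset that depends, directly or transitively, on `name`."""
    affected = set()
    frontier = [name]
    while frontier:
        current = frontier.pop()
        for contract in contracts:
            if current in contract.upstream and contract.name not in affected:
                affected.add(contract.name)
                frontier.append(contract.name)
    return affected


# Hypothetical datasets, purely for illustration.
contracts = [
    DatasetContract("bookings", "reservations-team", True, 60,
                    {"id": "long", "city": "string"}),
    DatasetContract("daily_occupancy", "analytics-team", True, 1440,
                    {"city": "string", "rate": "double"},
                    upstream=("bookings",)),
    DatasetContract("recommendations", "product-team", False, 1440,
                    {"user": "long", "city": "string"},
                    upstream=("daily_occupancy",)),
]

# Which datasets would a change to "bookings" reach?
impacted = downstream_of("bookings", contracts)
```

With global lineage like this, a team proposing a schema change can check compatibility first and then see which downstream owners to notify.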
The platform is self-service and gets out of the way to empower users to do the right thing. It provides a safe environment where mistakes can be easily mitigated and the scope of their impact limited. It’s flexible enough to let users pick the best tool for the job while facilitating interdependencies. Streaming and batch processing are complementary and work together. Governance is delegated to the appropriate stewards. Sensitive data is properly annotated and secured, and its usage is tracked and controlled. With the cloud omnipresent, users expect not to have to worry about where their processes run or where their data is stored. The platform is expected to scale transparently and be billed by the minute.
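To make the point about sensitive data concrete, the following Python sketch shows one possible shape for annotation plus usage tracking. Everything in it is an assumption for illustration: the `Column` and `GovernedTable` classes, the `allowed_readers` set, and the example table and team names do not come from the talk or from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    sensitive: bool = False   # annotation set by the data steward


@dataclass
class GovernedTable:
    """Hypothetical table wrapper that controls and records column reads."""
    name: str
    columns: list
    allowed_readers: set = field(default_factory=set)  # teams cleared for sensitive columns
    access_log: list = field(default_factory=list)     # (team, column, granted) tuples

    def read(self, team: str, column_name: str) -> bool:
        """Decide whether `team` may read the column, and log the attempt."""
        column = next(c for c in self.columns if c.name == column_name)
        granted = (not column.sensitive) or team in self.allowed_readers
        self.access_log.append((team, column_name, granted))
        return granted


# Hypothetical table: "email" is annotated as sensitive by its steward.
members = GovernedTable(
    name="members",
    columns=[Column("city"), Column("email", sensitive=True)],
    allowed_readers={"billing-team"},
)
```

Every read attempt, granted or not, lands in `access_log`, so the steward can audit usage as well as control it.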
You’ll explore the best tools for building the data platform and how to build the missing pieces.
Prerequisite knowledge
- A working knowledge of big data concepts, batch and stream processing, and database concepts
What you'll learn
- Understand the key principles, abstractions, and capabilities to expect from a data platform to allow an organization to scale while using data
Julien Le Dem
Julien Le Dem is a principal engineer at WeWork. He’s also the coauthor of Apache Parquet and the PMC chair of the project, and he’s a committer and PMC member on Apache Pig, Apache Arrow, and a few other projects. Previously, he was an architect at Dremio; tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_); and a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.