Data platform architecture principles
Who is this presentation for?
Engineers, architects, and data scientists
We’re well into the big data era. Most organizations have embraced collecting data and analyzing what’s happening inside their products. This is crucial to their success, not only to understand what works but also to optimize their services and increase their value to their customers. Several industries are being disrupted simply by using technology to optimize existing processes: think of cabs, short-term rentals, or co-working spaces.
There’s also discussion around what constitutes “big” data. Here, we’re not only talking about the large volumes of data produced by the likes of Google, Facebook, and other very large companies. We’re also talking about the multitude of data sources and the many teams using them and producing derived datasets. The concept of a central data team that does all the data-related work is outdated; the entire organization should become an ecosystem in which teams depend on each other. Central data teams become enablers, coaching and providing a safe, flexible environment to move fast while bringing transparency to the increasing complexity of interdependent systems. Data processing and microservices have similar requirements in terms of ownership, monitoring, and dependency management.
In this talk, we’ll discuss the principles to follow while building a data platform that enables the entire organization to build data-driven products, whether using insights from the data or using the data directly to build features (for example, recommendations).
Every team can consume and produce data under explicit contracts: what they share or don’t, the level of service they provide, and the quality of the data. We need to build visibility across the entire organization and help the dependency graph evolve through global lineage and schema evolution.
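To make this concrete, here is a minimal sketch in Python of what such an explicit dataset contract could look like. The DatasetContract class and its fields are hypothetical illustrations of the idea, not the API of any particular tool:

```python
# A minimal sketch of an explicit dataset contract; class and field names
# are illustrative assumptions, not part of any specific library.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Field:
    name: str
    type: str          # e.g. "string", "long", "timestamp"
    nullable: bool = True

@dataclass
class DatasetContract:
    name: str                      # globally unique dataset name
    owner: str                     # owning team, accountable for the SLA
    schema: List[Field]            # explicit schema, versioned for evolution
    freshness_sla_minutes: int     # maximum acceptable staleness
    quality_checks: List[str] = field(default_factory=list)

# A team publishing page-view events declares what it shares and the
# level of service its consumers can expect.
page_views = DatasetContract(
    name="events.page_views",
    owner="web-analytics",
    schema=[
        Field("user_id", "long", nullable=False),
        Field("url", "string"),
        Field("viewed_at", "timestamp", nullable=False),
    ],
    freshness_sla_minutes=60,
    quality_checks=["user_id is never null", "viewed_at within last 24h"],
)
```

The point is that ownership, schema, and level of service are declared explicitly, so the platform can surface them to every consumer and feed them into global lineage and schema evolution tooling.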
The platform is self-service and gets out of the way, empowering users to do the right thing. It provides a safe environment where mistakes can easily be mitigated and the scope of their impact limited. It is flexible enough to let users pick the best tool for the job while facilitating interdependencies. Streaming and batch processing are complementary and work together. Governance is delegated to the appropriate stewards. Sensitive data is properly annotated and secured, and its usage is tracked and controlled. With the cloud omnipresent, users expect not to have to worry about where their processes run or where the data is stored; the platform is expected to scale transparently and be billed by the minute.
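As a small illustration of annotating sensitive data and controlling its usage, here is a Python sketch; the tagging scheme and the check_access function are assumptions made for the example, not an existing API:

```python
# A minimal sketch of column-level sensitivity annotations; the tags and
# the enforcement logic are illustrative assumptions.
column_annotations = {
    "events.page_views": {
        "user_id": {"pii"},    # identifies a person: access is controlled
        "url": set(),          # not sensitive
        "viewed_at": set(),
    },
}

def check_access(dataset: str, columns: list, clearances: set) -> None:
    """Deny reads of tagged columns the requester is not cleared for,
    and leave an audit trail for every sensitive access."""
    for col in columns:
        tags = column_annotations[dataset][col]
        missing = tags - clearances
        if missing:
            raise PermissionError(f"{col} requires clearance for {missing}")
        if tags:
            print(f"AUDIT: {dataset}.{col} read with tags {tags}")

# Usage: a job reading user_id must hold the "pii" clearance.
check_access("events.page_views", ["user_id", "url"], {"pii"})
```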
We’ll discuss the best tools for assembling the data platform and how to build the missing pieces.
Prerequisite knowledge
Big data concepts, batch and stream processing, as well as database concepts.
What you'll learn
The principles for building a self-service data platform that lets every team in the organization produce and consume data under explicit contracts and build data-driven products.
Julien Le Dem
Julien Le Dem is the coauthor of Apache Parquet and the PMC chair of the project. He is also a committer and PMC member on Apache Pig, Apache Arrow, and a few other projects. Julien is a principal engineer at WeWork. Previously, he was an architect at Dremio; tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_); and a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.