Jack Gudenkauf explores how organizations have successfully deployed tiered hyperscale architecture for real-time streaming with Spark, Kafka, Hadoop, and Vertica and discusses how advancements in hardware technologies such as nonvolatile memory, SSDs, and accelerators are changing the role of big data and big analytics platforms in an overall enterprise-data-platform strategy.
Joe Goldberg explores how companies like GoPro, Produban, Navistar, and others have taken a platform approach to managing their workflows; how they are using workflows to power data ingest, ETL, and data integration processing; how an end-to-end view of workflows has reduced issue resolution time; and how these companies are achieving success in their data warehouse modernization projects.
Viral Shah explains how enterprises like Asurion Services are leveraging big data management solutions to accelerate enterprise data lake initiatives for business value.
What are the essential components of a data platform? John Akred, Mauricio Vacas, and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.
Shirshanka Das and Yael Garten describe how LinkedIn redesigned its data analytics ecosystem in the face of a significant product rewrite, covering the infrastructure changes, such as client-side activity tracking, a unified reporting platform, and data virtualization techniques to simplify migration, that enable LinkedIn to roll out future product innovations with minimal downstream impact.
Himanshu Gupta explains why Yahoo has been increasingly investing in interactive analytics and how it leverages Druid to power a variety of internal- and external-facing data applications.
Siva Raghupathy demonstrates how to use Hadoop innovations in conjunction with Amazon Web Services (cloud) innovations.
When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma offers lessons learned from the field to get you started.
Alex Bordei walks you through the steps required to build a data lake in the cloud and connect it to on-premises environments, covering best practices in architecting cloud data lakes and key aspects such as performance, security, data lineage, and data maintenance. The technologies presented range from basic HDFS storage to real-time processing with Spark Streaming.
Narasimhan Sampath and Avinash Ramineni share how Choice Hotels International used Spark Streaming, Kafka, Spark, and Spark SQL to create an advanced analytics platform that enables business users to be self-reliant by accessing the data they need from a variety of sources to generate customer insights and property dashboards and enable data-driven decisions with minimal IT engagement.
Ben Sharma uses popular cloud-based use cases to explore how to effectively and safely leverage big data in the cloud to achieve business goals. Now is the time to get the jump on this trend before your competition gets the upper hand.
Todd Lipcon and Marcel Kornacker explain how to simplify Hadoop-based data-centric applications with the CRUD (create, read, update, and delete) and interactive analytic functionality of Apache Impala (incubating) and Apache Kudu (incubating).
Jeff Carpenter describes how data modeling can be a key enabler of microservice architectures for transactional and analytics systems, including service identification, schema design, and event streaming.
The new erasure coding feature in Apache Hadoop (HDFS-EC) reduces the storage cost by ~50% compared with 3x replication. Zhe Zhang and Uma Maheswara Rao G present the first-ever performance study of HDFS-EC and share insights on when and how to use the feature.
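As a rough illustration of where a figure like ~50% can come from, the sketch below compares the raw storage footprint of 3x replication with that of a Reed-Solomon layout using 6 data units and 3 parity units. The RS(6,3) layout is an assumption made for this example (the abstract does not name a coding scheme), and the function names are hypothetical.

```python
def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per byte of user data under n-way replication."""
    return float(replicas)


def erasure_coding_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per byte of user data under a (data, parity) code."""
    return (data_units + parity_units) / data_units


def relative_savings(baseline: float, alternative: float) -> float:
    """Fraction of raw storage saved by moving from baseline to alternative."""
    return 1.0 - alternative / baseline


# Assumed layouts: 3x replication vs. Reed-Solomon with 6 data + 3 parity units.
replication = replication_overhead(3)        # 3.0 raw bytes per user byte
erasure = erasure_coding_overhead(6, 3)      # 1.5 raw bytes per user byte
savings = relative_savings(replication, erasure)

print(f"replication: {replication:.1f}x, erasure coding: {erasure:.1f}x")
print(f"raw storage saved: {savings:.0%}")
```

Under these assumptions the raw footprint drops from 3.0x to 1.5x, which is a 50% reduction. This covers storage only; it says nothing about the read, write, and recovery costs that the performance study addresses.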
Jake Dolezal shares research into the performance of data quality and data management workloads on Hadoop clusters. Jake discusses a YARN-based approach to data management and outlines highly effective IT resource utilization techniques to achieve extreme agility for organizations and performance gains in Hadoop.
Hear the Chief Data Platform Architect of Dell Technologies outline streaming principles.
Bas Geerdink offers an overview of how the Hadoop ecosystem has evolved at ING. Since 2013, ING has invested heavily in a central data lake and data management practice. Bas shares historical lessons and best practices for enterprises that are incorporating Hadoop into their infrastructure landscape.
Jonathan Seidman, Gwen Shapira, Mark Grover, and Ted Malaska demonstrate how to architect a modern, real-time big data platform and explain how to leverage components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics such as real-time ETL, change data capture, and machine learning.
Ready to take a deeper look at how Hadoop and its ecosystem have a widespread impact on analytics? Douglas Liming explains where SAS fits into the open ecosystem, why you no longer have to choose between analytics languages like Python, R, or SAS, and how a single, unified open analytics architecture empowers you to literally have it all.
Yaron Haviv explains how to design real-time IoT and FSI applications, leveraging Spark with advanced data frame acceleration. Yaron then presents a detailed, practical use case, diving deep into the architectural paradigm shift that makes the powerful processing of millions of events both efficient and simple to program.
Jim Scott outlines the core tenets of a message-driven architecture and explains its importance in real-time big data-enabled distributed systems within the realm of finance.
Scott Gnau provides unique insights into the tipping point for data, how enterprises are now rethinking everything from their IT architecture and software strategies to data governance and security, and the cultural shifts CIOs must grapple with when supporting a business using real-time data to scale and grow.
Enterprises are increasingly demanding real-time analytics and insights. Tony Ng offers an overview of Pulsar, an open source real-time streaming system used at eBay. Tony explains how Pulsar integrates Kafka, Kylin, and Druid to provide flexibility and scalability in event and metrics consumption.
Sharing your valuable data internally or with third-party consumers can be risky due to data privacy regulations and IP considerations, but sharing can also generate revenue or help nonprofits succeed at world-changing missions. Steve Touw explores real-world examples of how a proper data architecture enables philanthropic missions and offers ideas for how to better share your data.
Although Python and R promise powerful data science insights, they can also be complex to manage and deploy with Hadoop infrastructure. Peter Wang distills the vast array of Hadoop and data science tools and architectures down to the essentials that deliver a powerful and lightweight stack quickly so that you can accelerate time to value while meeting your data science, governance, and IT needs.
Crystal Valentine draws on lessons learned from companies like Uber and Ericsson to outline the key principles of developing a microservices application. Along the way, Crystal describes how certain next-gen application areas, such as machine learning, are particularly well suited to implementation in a microservices architecture rather than a legacy application paradigm.
The Netflix data platform is constantly evolving, but fundamentally it's an all-cloud platform at a massive scale (40+ PB and over 700 billion new events per day) focused on empowering developers. Kurt Brown dives into the current technology landscape at Netflix and offers some thoughts on what the future holds.