To date, mutable big data storage has primarily been the domain of nonrelational (NoSQL) systems such as Apache HBase. However, demand for real-time analytic architectures has led big data back to a familiar friend: relationally structured data storage systems. Todd Lipcon explores the advantages of relational storage and reviews new developments, including Google Cloud Spanner and Apache Kudu.
Bob Patterson offers an overview of Hewlett Packard Enterprise's enterprise-grade Hadoop solution, which has everything you need to accelerate your big data journey: innovative hardware architectures for diverse workloads certified for all leading distros, infrastructure software, services from HPE and partners, and add-ons like object storage.
Mark Donsky, André Araujo, Syed Rafice, and Manish Ahluwalia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.
Businesses struggle to build applications that harness all their data. RDBMS cannot handle modern data-intensive workloads, and NoSQL doesn't provide the capabilities for diverse applications. Anil Gadre explains how customers using a converged data platform are succeeding at creating breakthrough new apps for the enterprise.
In today’s world of data breaches and hackers, security is one of the most important components for big data systems, but unfortunately, it's usually the area least planned and architected. Matt Bolte and Toni LeTempt share Walmart's authentication journey, focusing on how decisions made early can have significant impact throughout the maturation of your big data environment.
Eclipse IoT is an ecosystem of organizations that are working together to establish an IoT architecture based on open source technologies and standards. Dave Shuman and James Kirkland showcase an end-to-end architecture for the IoT based on open source standards, highlighting Eclipse Kura, an open source stack for gateways and the edge, and Eclipse Kapua, an open source IoT cloud platform.
Endless possibilities emerge when we connect the unconnected. Raghunath Nambiar discusses the magnitude of new challenges and new opportunities across industry segments.
Neelesh Srinivas Salian offers an overview of the data platform used by data scientists at Stitch Fix, based on the Spark ecosystem. Neelesh explains the development process and shares some lessons learned along the way.
Using Customer 360 and the IoT as examples, Jonathan Seidman, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.
Speed and reliability in deploying big data clusters is key for effectiveness in the cloud. Drawing on ideas from his book Moving Hadoop to the Cloud, which covers essential practices like baking images and automating cluster configuration, Bill Havanki explains how you can automate the creation of new clusters from scratch and use metrics gathered using the cloud provider to scale up.
Whether an entity seeks to create trading algorithms or mitigate risk, predicting trade volume is an important task. Focusing on futures trading that relies on Apache Spark for processing the large amount of data, Tobi Bosede considers the use of penalized regression splines for trade volume prediction and the relationship between price volatility and trade volume.
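The abstract does not include code, but the core of a penalized regression spline fits in a short sketch: build a truncated-power spline basis over the predictor and solve a ridge-penalized least squares problem for the coefficients. The data, knot placement, and penalty weight below are illustrative assumptions, not details from the talk.

```python
import numpy as np

def spline_basis(x, knots, degree=3):
    """Truncated-power basis: polynomial terms plus one truncated
    term per interior knot."""
    cols = [x ** d for d in range(degree + 1)]
    cols += [np.clip(x - k, 0, None) ** degree for k in knots]
    return np.column_stack(cols)

def fit_penalized_spline(x, y, knots, lam=1e-2, degree=3):
    """Ridge-penalized least squares solve for spline coefficients."""
    B = spline_basis(x, knots, degree)
    penalty = lam * np.eye(B.shape[1])
    penalty[0, 0] = 0.0  # leave the intercept unpenalized
    return np.linalg.solve(B.T @ B + penalty, B.T @ y)

# Illustrative stand-in: smooth a noisy periodic "volume" curve.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)
knots = np.linspace(0.1, 0.9, 9)
coef = fit_penalized_spline(x, y, knots)
y_hat = spline_basis(x, knots) @ coef
```

The penalty shrinks the coefficients on the truncated terms, trading a little bias for much lower variance than an unpenalized spline with the same knots.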
As big data solutions are rapidly moving to the cloud, it's becoming increasingly important to know how to use Apache Hadoop, Spark, R Server, and other open source technologies in the cloud. Pranav Rastogi walks you through building big data applications on Azure HDInsight and other Azure services.
Luke Han offers an overview of Apache Kylin and its enterprise version KAP and shares a case study of how a top finance company migrated to Apache Kylin on top of Hadoop from its legacy Cognos and DB2 system.
For many enterprises, the internet of things represents an opportunity to transform the business by examining its data from a holistic lifecycle perspective and generating, analyzing, and archiving the data to reengineer the enterprise. Han Yang explores the latest trends and the role of infrastructure in enabling such a transformation.
Statistical learning techniques applied to network data provide a comprehensive view of traffic behavior that would not be possible using traditional descriptive statistics alone. Amie Elcan shares an application of the random forest classification method using network data queried from a big data platform and demonstrates how to interpret the model output and the value of the data insight.
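A minimal sketch of the kind of workflow described above: train a random forest on flow-level features and read off feature importances to interpret the model. The synthetic "network traffic" features and labeling rule here are illustrative assumptions, not the talk's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000
# Hypothetical flow features: bytes, packets, duration, dest port.
bytes_ = rng.lognormal(8, 1, n)
packets = rng.poisson(20, n).astype(float)
duration = rng.exponential(5, n)
port = rng.choice([80, 443, 22, 53], n).astype(float)
X = np.column_stack([bytes_, packets, duration, port])
# Label "bulk transfer" traffic with a simple threshold rule.
y = ((bytes_ > np.median(bytes_)) & (duration > 2)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
# Importances show which features drive the classification.
importances = dict(zip(["bytes", "packets", "duration", "port"],
                       clf.feature_importances_))
```

Because the labels depend only on bytes and duration, those two features should dominate the importance ranking, which is exactly the kind of insight descriptive statistics alone would miss.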
John Hitchingham shares insights into the design and operation of FINRA's data lake in the AWS cloud, where FINRA extracts, transforms, and loads over 75B transactions per day. Users can query across petabytes of data in seconds on AWS S3 using Presto and Spark—all while maintaining security and data lineage.
The advances we see in machine learning would be impossible without hardware improvements, but building a high-performance hardware platform is tricky. It involves hardware choices, an understanding of software frameworks and algorithms, and how they interact. Mike Pittaro shares the secrets of matching the right hardware and tools to the right algorithms for optimal performance.
The data collaborative is a new form of public-private partnership that seeks to create public value for the world’s most marginalized children through the exchange of data and data science expertise. Natalia Adler offers an overview of the Data Collaboratives initiative, led by UNICEF and the GovLab at New York University's Tandon School of Engineering.
The growing availability of data—along with advances in fields such as data science and artificial intelligence—has profoundly changed businesses. Manuel García-Herranz explains how to leverage these advances for the most vulnerable, how to ensure that the existing data divide does not widen the inequality gap, and how to integrate these advances into humanitarian and development systems.
In the last few years, deep learning has achieved significant success in a wide range of domains, including computer vision, artificial intelligence, speech, NLP, and reinforcement learning. However, deep learning in recommender systems has, until recently, received relatively little attention. Nick Pentreath explores recent advances in this area in both research and practice.
AI is moving into the heart of the financial business model. Jike Chong discusses two fundamental business cycles in a financial institution: acquiring customers and sustaining customer relationships, highlighting opportunities in six areas where AI technologies can be readily deployed, along with reference use cases.
It doesn't matter how much data organizations collect; all that matters is how much data they can leverage. Aneesh Karve explores how a Fortune 500 bank leverages data packages to minimize data prep and maximize time spent on analysis—using a technique called source code-inspired data management.
Big data technology is mature, but its adoption by business is slow, due in part to challenges like a lack of resources and the need for a cultural change. Carme Artigas explains why an analytics center of excellence (ACoE), whether internal or outsourced, is an effective way to accelerate adoption and shares an approach to implementing an ACoE.
While the value of data and its role in informing decisions and communications is well known, its meaning can be incorrectly interpreted without data visualizations that provide context and accurate representation of the underlying numbers. Julie Rodriguez shares new approaches and visual design methods that provide a greater perspective of the data.
Notebook interfaces like Apache Zeppelin and Project Jupyter are excellent starting points for sketching out ideas and exploring data-driven algorithms, but where does the process lead after the notebook work has been completed? Michael McCune offers some answers as they relate to cloud-native platforms.
In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Steven Ross and Mark Donsky outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.
Uber's geospatial data is increasing exponentially as the company grows. As a result, its big data systems must also grow in scalability, reliability, and performance to support business decisions, user recommendations, and experiments for geospatial data. Zhenxiao Luo and Wei Yan explain how Uber runs geospatial analysis efficiently in its big data systems, including Hadoop, Hive, and Presto.
How can deep learning be employed to create a system that monitors network traffic, operations data, and system logs to reliably flag risk and unearth potential threats? Satish Dandu, Joshua Patterson, and Michael Balint explain how to bootstrap a deep learning framework to detect risk and threats in operational production systems, using best-of-breed GPU-accelerated open source tools.
There is growing interest in running Spark natively on Kubernetes. Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. Kimoon Kim demonstrates how to run HDFS inside Kubernetes to speed up Spark.
Learn how the largest minority group in the United States is integrating into new careers and making an entrance into the technology sector, overcoming cultural, social, and economic obstacles, and what companies and the government are doing to support that integration.
With more than 4.5 million black boxes, Italian car insurance has the most telematics clients in the world. Riccardo Corbella and Beniamino Del Pizzo explore the data management challenges that occur in a streaming context when the amount of data to process is gigantic and share a data management model capable of providing the scalability and performance needed to support massive growth.
Michelle Casbon explores the machine learning and natural language processing that enables teams to build products that feel native to every user and explains how Qordoba is tackling the underserved domain of localization using open source tools, including Kubernetes, Docker, Scala, Apache Spark, Apache Cassandra, and Apache PredictionIO (incubating).
Data is powering the largest trucks on America’s interstates, the buses that take our children to school, and the military vehicles that help protect our country. Terry Kline and Mike Olson look at how machine learning and predictive analytics keep more than 300,000 connected vehicles rolling.
Deepak Majeti explains why the separation of compute and storage has become critical to maximizing the benefits of cloud economics.
Given the recent demand for data analytics and data science skills, adequately testing and qualifying candidates can be a daunting task. Interviewing hundreds of individuals of varying experience and skill levels requires a standardized approach. Tanya Cashorali explores strategies, best practices, and deceptively simple interviewing techniques for data analytics and data science candidates.
Manuela Veloso explores human-AI collaboration, particularly in terms of robots learning from human sources and robot explanation generation to respond to language-based requests about their autonomous experience. Manuela concludes with a further discussion of general human-AI interaction and the opportunities for transparency and trust building of AI systems.
DHL's Javier Esplugas and Conduce's Kevin Parent explain how the two companies have implemented an IoT pipeline that gives managers and executives real-time insight into warehouse operations, helping them to identify potential hazards, reduce costs, and increase productivity.
Twenty years ago, a company implored us to “think different” about personal computers. Today, Apple continues to live and breathe that legacy. It’s evident in the machine learning and analytics architectures that power many of the company's most innovative applications. Cesar Delgado joins Mike Olson to discuss how Apple is using its big data stack and expertise to solve non-data problems.
When analytics applications become business critical, balancing cost with SLAs for performance, backup, dev, test, and recovery is difficult. Karthikeyan Nagalingam discusses big data architectural challenges and how to address them and explains how to create a cost-optimized solution for the rapid deployment of business-critical applications that meet corporate SLAs today and into the future.
As the largest community college in the US, Ivy Tech ingests over 100M rows of data a day. Brendan Aldrich and Lige Hensley explain how Ivy Tech is applying predictive technologies to establish a true data democracy—a self-service data analytics environment empowering thousands of users each day to improve operations, achieve strategic goals, and support student success.
if(we)'s batch event processing pipeline is different from yours, but the process of migrating it from running in a data center to running in AWS is likely pretty similar. Chris Mills explains what was easier than expected, what was harder, and what the company wished it had known before starting the migration.
Sahaana Suri offers an overview of MacroBase, a new analytics engine from Stanford designed to prioritize the scarcest resource in large-scale, fast-moving data streams: human attention. MacroBase allows reconfigurable, real-time root-cause analyses that have already diagnosed issues in production streams in mobile, data center, and industrial applications.
Modern enterprises produce data at increasingly high volume and velocity. To process data in real time, new types of storage systems have been designed, implemented, and deployed. Matteo Merli and Sijie Guo offer an overview of Apache DistributedLog and Pulsar, real-time storage systems built using Apache BookKeeper and used heavily in production.
Dustin Cote shares his experience troubleshooting Apache Kafka in production environments and explains how to avoid pitfalls like message loss or performance degradation in your environment.
Karthik Ramasamy, Sanjeev Kulkarni, Avrilia Floratau, Ashvin Agrawal, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.
Have you ever wondered why Spotify just seems to know what you want? As a data-first company, Spotify is investing heavily in its analytics and machine learning capabilities to understand and predict user needs. Christine Hung shares how Spotify uses data and algorithms to improve user experience and drive business impact.
Abraham Thomas demonstrates how maritime data can be used to predict physical commodity flows, in a case study that covers every stage of the data lifecycle, from raw data acquisition, data cleansing and structuring, and machine learning and probabilistic modeling to conversion to a tractable format, packaging for the final audience, and commercialization and distribution.
At Visa, the process of optimizing the enterprise data warehouse and consolidating data marts by migrating these analytic workloads to Hadoop has played a key role both in the adoption of the platform and in how data has transformed Visa as an organization. Nandu Jayakumar and Justin Erickson share Visa’s journey along with some best practices for organizations migrating workloads to Hadoop.
New machine learning technologies allow companies to apply better staffing strategies by taking advantage of historical data. Francesca Lazzeri and Hong Lu share a workforce placement recommendation solution that recommends staff with the best professional profile for new projects.
Paco Nathan demonstrates how to use PyTextRank—an open source Python implementation of TextRank that builds atop spaCy, datasketch, NetworkX, and other popular libraries to prepare raw text for AI applications in media and learning—to move beyond outdated techniques such as stemming, n-grams, or bag-of-words while performing advanced NLP on single-server solutions.
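PyTextRank itself runs as a spaCy pipeline component, but the core TextRank idea it builds on fits in a stdlib-only sketch (a deliberate simplification, not the library's implementation): build a co-occurrence graph over candidate words and rank nodes with a PageRank-style iteration. The sample text and stopword list are illustrative assumptions.

```python
import re
from collections import defaultdict

def textrank_keywords(text, window=3, damping=0.85, iters=50, top_n=3):
    """Toy TextRank: co-occurrence graph plus PageRank-style scoring."""
    stop = {"the", "a", "an", "of", "in", "on", "and", "to",
            "is", "are", "for", "from"}
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop]
    # Undirected edges between words co-occurring within the window.
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    scores = {w: 1.0 for w in graph}
    for _ in range(iters):
        scores = {w: (1 - damping) + damping *
                  sum(scores[u] / len(graph[u]) for u in graph[w])
                  for w in graph}
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [w for w, _ in ranked[:top_n]]

text = ("graph algorithms rank text. graph ranking extracts keywords "
        "from text. keywords summarize text documents.")
top = textrank_keywords(text)
```

Unlike stemming or bag-of-words counts, the graph ranking rewards words that co-occur with many other well-connected words, which is the property PyTextRank exploits at scale.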
Can you really use unstructured data as part of your investment process? Why are leading financial services firms building practices to handle all sorts of data from satellite to browsing? Is the IoT a significant data point for financial analysis? Rob Passarella explores current applications and use cases for data, which financial services firms are eagerly gobbling up with alpha in mind.
Josh Patterson, Vartika Singh, David Kale, and Tom Hanlon walk you through interactively developing and training deep neural networks to analyze digital health data using the Cloudera Workbench and Deeplearning4j (DL4J). You'll learn how to use the Workbench to rapidly explore real-world clinical data, build data-preparation pipelines, and launch training of neural networks.
DHL has created an IoT initiative for its supply chain warehouse operations. Javier Esplugas and Kevin Parent explain how DHL has gained unprecedented insight—from the most comprehensive global view across all locations to a unique data feed from a single sensor—to see, understand, and act on everything that occurs in its warehouses with immersive operational data visualization.
Common ETL jobs used for importing log data into Hadoop clusters require a considerable amount of resources, which varies based on the input size. Thiruvalluvan M G shares a set of techniques—involving an innovative use of Spark processing and exploiting features of Hadoop file formats—that not only make these jobs much more efficient but also work well with fixed amounts of resources.
Brooke Wenig introduces you to Apache Spark 2.0 core concepts with a focus on Spark's machine learning library, using text mining on real-world data as the primary end-to-end use case.
Mateusz Dymczyk and Mathieu Dumoulin showcase a working, practical, predictive maintenance pipeline in action and explain how they built a state-of-the-art anomaly detection system using big data frameworks like Spark, H2O, TensorFlow, and Kafka on the MapR Converged Data Platform.
While stream processing is now popular, streaming architectures must be more reliable and scalable than ever before—more like microservice architectures in fact. Dean Wampler defines "stream" based on characteristics for such systems, using specific tools like Kafka, Spark, Flink, and Akka as examples, and argues that big data and microservices architectures are converging.
Although the most widely used language for data analysis, SQL is only slowly being adopted by open source stream processors. One reason is that SQL's semantics and syntax were not designed with streaming data in mind. Fabian Hueske explores Apache Flink's two relational APIs for streaming analytics—standard SQL and the LINQ-style Table API—discussing their semantics and showcasing their usage.
Nikita Shamgunov discusses the future of databases for fast-learning adaptable applications.
Text analytics are advancing rapidly, and new visualization techniques for text are providing new capabilities. Richard Brath and Scott Langevin offer an overview of these new ways to organize massive volumes of text, characterize subjects, score synopses, and skim through lots of documents.
Ben Lorica explores the age of machine learning.
Most analytics tools in use today provide static visuals that don’t reveal the full, real-time picture. Mike Driscoll shows how to take an interactive approach to analytics. From design techniques to discovering new forms of data exploration, he demonstrates how to put the full power of big data into the hands of the people who need it to make key business decisions.
Julien Le Dem explains how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future, how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions, and how standard Arrow-based APIs are paving the way to breaking the silos of big data.
Nick Selby (CJX, Inc. | Midlothian Police Department)
Nick Selby offers an overview of his study on traffic-stop data in Texas, which found evidence that the state targeted low-income residents (a disproportional number of whom are black and Latino) for heightened scrutiny and penalties. The problem is not necessarily an issue of racist cops—which means fixing the criminal justice system isn’t just an issue of addressing racism in uniform.
The IoT can deliver real outcomes that can transform organizations—and societies—for the better. But the IoT is not transformative without the power of big data. Chuck Yarbrough shares examples of where the IoT and big data have combined to solve significant business challenges and take advantage of business opportunities.
Jack Norris shares lessons learned by leading companies leveraging data to transform customer experiences, operational results, and overall growth and details the infrastructure, development, and data management principles used by successful leaders to drive agility regardless of application volume or scale.
While it's clear organizations need to have a comprehensive data strategy, few have actually developed a plan to improve the access, sharing, and usage of data. Evan Levy discusses the five essential components that make up a data strategy and explores the individual attributes of each.
A changing market landscape and open source innovations are having a dramatic impact on the consumability and ease of use of data science tools. Carlo Appugliese examines the impact these trends and changes will have on the future of data science and how machine learning is making data science available to all.
Salesforce recently released Einstein, which brings AI into its core platform to power every business. The secret behind Einstein is an underlying platform that accelerates AI development at scale for both internal and external data scientists. Simon Chan shares his experience building this unified platform for a multitenant, multibusiness cloud enterprise.
Joanna Bryson (University of Bath | Princeton Center for Information Technology Policy)
AI has been with us for hundreds of years; there's no "singularity" step change. Joanna Bryson explains that the main threat of AI is not that it will do anything to us but what we are already doing to each other with it—predicting and manipulating our own and others' behavior.
Data science is key to addressing national challenges with greater agility. At the EPA, the prime challenge is to provide the best value to American citizens in an ever-changing world. Robin Thottungal explains how the EPA addresses this challenge through digital and analytical services.
Advanced data analytics is reshaping the enterprise with new discoveries, better customer experiences, and improved products and services, all enabled by actionable insight. Ziya Ma shares how Intel is driving a holistic approach to powering advanced analytics and artificial intelligence workloads and unleashing intelligent and scalable insights from the edge to the cloud to the enterprise.
Vartika Singh and Jeffrey Shmain walk you through various approaches using the machine learning algorithms available in Spark ML to understand and decipher meaningful patterns in real-world data. Vartika and Jeff also demonstrate how to leverage open source deep learning frameworks to run classification problems on image and text datasets leveraging Spark.
Engaging, teaching, mentoring, and advising mature, mostly employed, often enthusiastic and ambitious adult learners at the University of Toronto has taught Jerrard Gaertner more about analytics in the real world than he ever imagined. Jerrard shares what he's learned about everything from hyped-up expectations and internal sabotage to organizational streamlining and creating transformative insight.
Sam Lavigne offers an overview of White Collar Crime Risk Zones, a predictive policing application that uses industry-standard predictive policing methodologies to predict financial crime at the city-block level with an accuracy of 90.12%. Unlike typical predictive policing apps, which criminalize poverty, White Collar Crime Risk Zones criminalizes wealth.
Robots are going to take our jobs, they say. Tim O'Reilly says, "Only if that's what we ask them to do!" Tim has had his fill of technological determinism. He explains why technology is the solution to human problems and why we won't run out of work till we run out of problems.
The more that we rely on data to train our models and inform our systems, the more that this data becomes a target for those seeking to manipulate algorithmic systems and undermine trust in data. danah boyd explores how systems are being gamed, how data is vulnerable, and what we need to do to build technical antibodies.