The network, new data capabilities, and sensor-rich mobile devices have created fresh and unconventional possibilities for rethinking workflows and processes in the real world. To succeed in creating entirely new services and rethinking old ones, we must first adopt fresh thinking about the design process and about how sensors and algorithms are changing what is possible.
Apache Spark is a popular new paradigm for computation on Hadoop. It's particularly effective for iterative data-science algorithms such as clustering, which can be used to detect anomalies in data. Curious? Get a taste of Spark MLlib, Scala, and k-means clustering in this walkthrough of anomaly detection as applied to network intrusion, using the KDD Cup '99 data set.
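The idea behind the walkthrough can be sketched in miniature. The following is a toy, pure-Python illustration of k-means-based anomaly detection, not the talk's Scala/MLlib code; the points and threshold are invented:

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def mean(pts):
    return (sum(p[0] for p in pts) / len(pts),
            sum(p[1] for p in pts) / len(pts))

def kmeans(points, k, iters=20):
    # Deterministic farthest-first seeding, then Lloyd's iterations.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(p, centroids[i]))].append(p)
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids

def anomalies(points, centroids, threshold):
    # A point far from every learned centroid is flagged as anomalous.
    return [p for p in points if min(dist(p, c) for c in centroids) > threshold]

# Fit on "normal" traffic, then flag a far-away newcomer.
normal = [(0.1, 0.2), (0.2, 0.1), (-0.1, 0.0),
          (10.1, 9.9), (9.8, 10.2), (10.0, 10.1)]
cents = kmeans(normal, k=2)
print(anomalies(normal + [(50.0, 50.0)], cents, threshold=5.0))  # → [(50.0, 50.0)]
```

At KDD Cup scale the same recipe runs distributed, with MLlib handling the clustering; the hard part, as the session covers, is choosing k and the distance threshold.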
Are you looking for a deeper understanding of how to integrate components in
the Apache Hadoop ecosystem to implement data management and processing
solutions? Then this tutorial is for you. We'll walk through a clickstream analytics example
illustrating how to architect solutions with Apache Hadoop, along with best
practices and recommendations for using Hadoop and related tools.
Smartphones carry mighty sensors: GPS, wifi, accelerometer, gyroscope, microphone, magnetic field, and more. They track behavior and environment, answering complex questions like "is the user driving a car or riding a train?" We will show cases from the travel, sports retail, and health industries, and propose how to use such intrusive technology in an ethically sound way.
(IBM Emerging Internet Technologies)
Big Data & Analytics continues to be a disruptive business force. Are we entering a new phase – Big Data & Analytics 3.0?
The first step most organizations need to take to realize the benefits of big data is to invest in a new solution infrastructure. The decision process is not simple: CAPEX vs. OPEX, on-premises vs. in the cloud, reusing existing DWH and BI investments vs. starting from scratch, purpose-built appliances vs. open-source distributions.
Live demo of building an intelligent big data application from a web console. The tools and APIs behind it are built on top of Spark, Shark, Tachyon, Mesos, Aurora, Cassandra, and IPython, and include: an ELT pipeline (ingestion and transformation), a data warehouse explorer, export to NoSQL with generated APIs, predictive model building, training, and publishing, a dashboard UI, and monitoring and instrumentation.
Communicating Data Clearly describes how to draw clear, concise, accurate graphs that are easier to understand than many of the graphs one sees today. The tutorial emphasizes how to avoid common mistakes that produce confusing or even misleading graphs. Graphs for one, two, three, and many variables are covered as well as general principles for creating effective graphs.
The business drivers and objectives
Multi-tenancy concepts and architecture
Multi-tenancy features in EDH
Multi-tenancy configuration in EDH
Camille Fournier, Head of Engineering, Rent the Runway
McLaren Applied Technologies capitalises on the convergence of real-time data management, predictive analytics and simulation to produce high-performance design of products and processes. In this talk we will describe how the approach of data-driven design can transform the way we go about creating and using products that are intrinsically intelligent and capable of adaptation.
Too many big data sets live in walled gardens and thus limit innovation to a few players. Creating open data sets levels the playing field and allows open source hackers to participate.
Ever do something perfectly in practice, only to have it blow up as soon as you try it when it really counts? This little phenomenon sends skaters to the hospital on a regular basis, mainly because controlled environments usually can’t evoke the depths of human responses.
Complex relationships in big data require involved graphical displays, which can be intimidating to users. This talk uses real-world examples to identify confusing elements in online visualizations, and articulates a framework for using animation and storytelling to amplify their impact and usability. Tangible and generalizable techniques applicable across fields will be presented.
Find out how to run real-time analytics over raw data without requiring a manual ETL process targeted at an RDBMS. This talk describes Impala’s approach to on-the-fly data transformation and its support for nested data; examples demonstrate how this can be used to query raw data feeds in formats such as text, JSON and XML, at a performance level commonly associated with specialized engines.
The ggvis package makes it easy to create interactive data graphics with R, with a declarative syntax similar to that of ggplot2. Like ggplot2, ggvis uses concepts from the grammar of graphics, but it also adds the ability to create interactive graphics and deliver them over the web.
How can we change architecture to design more for the people and less for the architects? We present crowd-based solutions with which urban planners can get valuable information about what kind of urban design is attractive to the people. This leads to GPS systems that show you the "most beautiful" path to your destination and to indicators about the beauty of a city.
Your application is outgrowing its database, and you've started shopping for NoSQL options. Maybe you've adopted Hadoop into your data warehouse. You've heard HBase might be an appropriate technology, but you need to know more. This talk is for you. To understand its use, first understand how it works. This talk explores the design of HBase and its critical paths to ground an understanding of its use.
This presentation addresses the geographic scalability of HDFS. It describes unique techniques implemented at WANdisco that allow scaling HDFS over multiple geographically distributed data centers for continuous availability.
“Welcome to the era of big, bad, open information.”
Analysts have predicted huge numbers of Internet-connected devices in our future for years now. We may dispute the number, but it is clear that the Internet of Things (IoT) will produce a colossal amount of data.
By having understandable abstractions for important data objects, Etsy has enabled employees across the whole company to actively take part in the collection and analysis of data. Converting data to objects allows us to more naturally convert analysis questions into code, and enforce business rules and definitions consistently.
The talk will provide insight into how to achieve coordinated technological change in a highly agile IT organisation: an organisational function that supports one of the UK's most recognisable brands. Discover valuable lessons learned and begin to understand how your organisation might take its first steps in proving and implementing Big Data technology.
WANdisco CEO and Co-Founder David Richards will explore ‘mission critical’ applications of Big Data across industry sectors, and highlight the importance of continuous availability, performance, and scalability in its application.
Everyone knows that creating value from big data requires the right skills, but what does this mean in practice? We present findings of a research project where we measure the skills needs of data-driven companies in six sectors, quantify the impact of data talent on company performance, and identify good practices to find, create value from, and retain data talent.
In this session, Alistair Croll, author of the best-selling Lean Analytics and chair of O’Reilly Strata, will share what he’s learned in a year of working with and interviewing intrapreneurs all over the world.
Mike Olson, CSO and Chairman, Cloudera
Breaking news from data that's already published: that's efficient Open Source Intelligence applied to journalism. The tools and methodologies available today make it possible to go big on a budget.
We will discuss requirements for IoT data processing platforms, including stream processing, dealing with raw device data, ensuring business continuity, and enforcing security and privacy. We will dissect a number of IoT applications, such as a manufacturer offering proactive maintenance, optimisation of waste management, and streamlining a supply chain.
Histograms and heatmaps are often used to summarize large data sets. We provide guidelines for using them effectively and efficiently. We illustrate this using the complete Dutch income tax data by looking at distributions in wealth and income. Analysis of this data set is complicated by the large number of variables. We use clustering techniques to automatically find relevant patterns.
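The summarization a histogram performs is just fixed-width binning. A minimal stdlib sketch (the income figures below are invented, not the Dutch tax data):

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values into fixed-width bins keyed by each bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

incomes = [21, 23, 34, 35, 36, 48, 52, 53, 90]
print(histogram(incomes, bin_width=10))
# → {20: 2, 30: 3, 40: 1, 50: 2, 90: 1}
```

The payoff at scale is that the binned counts, not the raw records, are what gets plotted, so the summary stays small no matter how large the data set grows.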
In this presentation Doug Cutting, Cloudera's Chief Architect, will discuss how we might both reap the benefits of data while avoiding its perils.
How can you turn raw data into predictions? How can you take advantage of both cloud scalability and state-of-the-art open source software? This talk shows how we built a model that correctly predicted the outcome of 14 of 16 games in the World Cup using Google's Cloud Platform and tools like IPython and StatsModels.
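As a hedged, stdlib-only sketch of the general idea of turning rates estimated from data into match predictions (the independent-Poisson assumption and the example rates are ours, not necessarily the speakers' model):

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def win_probabilities(lam_a, lam_b, max_goals=10):
    """P(A wins), P(draw), P(B wins), modelling each side's goals
    as independent Poisson counts truncated at max_goals."""
    p_win = p_draw = p_lose = 0.0
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = poisson_pmf(a, lam_a) * poisson_pmf(b, lam_b)
            if a > b:
                p_win += p
            elif a == b:
                p_draw += p
            else:
                p_lose += p
    return p_win, p_draw, p_lose

# A side expected to score 2.0 goals vs. one expected to score 1.0.
print(win_probabilities(2.0, 1.0))
```

The modelling work is in estimating the rates from raw data; that is where libraries like StatsModels and cloud-scale infrastructure come in.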
We will detail the development of a bi-directional event stream recommendation system in RuneScape, a massively multiplayer online game. By capturing a feature rich relationship between player and content we were able to train different 'flavours' of recommendation. Delivered in real-time these 'flavours' balance engagement, monetisation and enjoyment according to shifting business needs.
Apache Mesos, Apache Hadoop, Apache Spark + custom enterprise applications: combined, this stack is greater than the sum of its pieces. Couple it with custom enterprise applications and the data center turns into a well-oiled machine, with the software stack delivering enormous flexibility for the entire data center.
SAMOA is an open-source platform for mining big data streams that runs on several distributed stream processing engines (such as S4 and Storm), and includes streaming algorithms for the most common machine learning tasks such as classification and clustering.
More info at http://samoa-project.net
The need to categorize short text strings arises in many domains: online advertising, search engines, social networking, etc. In this session, we will share strategies for categorizing large volumes of queries and keywords in the advertising space, our successes with open document collections (Wikipedia, DBPedia, Freebase), and details on our solution using Hadoop and Solr.
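The core of any such categorizer is word overlap between a short string and per-category vocabularies. A toy sketch; the categories and vocabularies here are hypothetical, where a real system would derive them from collections such as Wikipedia and index them with Solr:

```python
# Hypothetical category vocabularies for illustration only.
CATEGORIES = {
    "travel": {"flight", "hotel", "beach", "tickets"},
    "finance": {"loan", "mortgage", "rates", "credit"},
}

def categorize(query):
    """Assign a short text string to the category with most word overlap."""
    words = set(query.lower().split())
    scores = {cat: len(words & vocab) for cat, vocab in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(categorize("cheap flight and hotel deals"))  # → travel
```

At advertising scale the same matching runs as a batch job over millions of queries, which is where Hadoop enters the picture.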
Apache Spark: Streaming case studies based on interviews with the dev teams, compared and contrasted with alternative open source projects, plus an open source example that demonstrates integration of Spark Streaming, Spark SQL, and Tachyon within a single app.
This session will show the evolution of big data at UniCredit, from troubleshooting and application monitoring to the real-time analytics of ATMs, mobile banking, transactions and card usage. It will go under the hood of technical decisions in setting up a scalable and reliable architecture and dealing with a heterogeneous, geographically distributed and multi-layered environment.
This talk explores the critical importance of storytelling to science and what we can learn from that relationship.
How can big data make your journey to work better? In this case study we’ll explore how! Trains today are complex systems consisting of many embedded subsystems, which operate together with the overall goal of delivering a high quality transportation service...
As we've moved from simple statistical analyses of big data to decision-making based on big data and data-science models, we face an ironic "dirty secret." It is becoming increasingly difficult to understand why particular decisions have been made. In many applications, data-driven models now take as input massive numbers of "signals", including words in text, locations frequented...
A data strategy is only as good as its execution. In the world of Data Science it has become increasingly apparent that business leaders focus on the technical aspects for success in data projects, when in fact the quality of the data team is key. In this talk I will share my experiences training data scientists and give some key insights into how to build a high-performing Data Science team.
Before Edward Snowden disclosed the US intelligence services’ digital surveillance, marketers had been collecting, aggregating and inferring behavioral profiles on consumers around the world. This talk describes the chief technologies firms use to transform online activities into target audience segments, as well as the current and proposed regulations and public policies being considered.
Drinking from the data lake is tempting, but what is it really? How did we get here, and what lessons can we learn from previous technologies? It’s tempting to see this as the solution to data silos, but what are the costs? Martin Willcox provides a practical guide to help you understand the realities…
Exploiting big data and analytics through the whole organisation is now business as usual for retail and online businesses.
But cities and buildings also create a whole lot of data, which could change lives for better or for worse.
This talk explores what’s happening right now in big data and analytics for cities and buildings, where it might head, and what we might want from it all.
The next generation of MapReduce, YARN, has widely touted job throughput and Apache Hadoop cluster utilization benefits. Less known are the pitfalls littering the migration path to YARN. Learn from our extensive field experience to avoid those pitfalls and get your YARN cluster configured right the first time.
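A common pitfall is leaving container sizing at defaults. As an illustration (the values below are for a hypothetical 64 GB, 16-core worker node, not recommendations), resource limits are declared in yarn-site.xml:

```
<!-- yarn-site.xml: illustrative values for a hypothetical 64 GB / 16-core worker -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>57344</value> <!-- leave headroom for the OS and other daemons -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>14</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value> <!-- cap a single container's memory request -->
</property>
```

Getting these three knobs consistent with each other and with per-job container requests is exactly the kind of first-time configuration the session covers.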
Open data isn't just about waste pickup schedules and reporting pot holes—it can hold real monetary value for everyday business. Whether it's supply chain enhancement or improved customer segmentation, open data holds unexpected value for everyone.