Cultivate explores the business practices that managers, aspiring managers in product development groups, and project team leads need to thrive in the new world: enabling design thinking, collaboration, and agility. The focus is on how corporate cultures have to change to adapt to current trends like rapid release cycles, the use of data to inform discussion, and building environments where everyone, including women and other underrepresented groups, can contribute freely.
Changing culture isn’t about making superficial organizational tweaks; these are significant changes, and they have to be made from the bottom up, as well as from the top down. The companies that can make those changes will prosper; the ones that can’t, won’t.
The command line, although invented decades ago, remains an amazing environment for doing data science. By combining small, yet powerful, command-line tools you can quickly obtain, scrub, explore, visualize, and model your data. In this hands-on tutorial you will gain a solid understanding of how to leverage the power of the command line and integrate it into your existing data science workflow.
Read more.
Are you looking for a deeper understanding of how to integrate components in the Apache Hadoop ecosystem to implement data management and processing solutions? Then this tutorial is for you. We'll provide a clickstream analytics example illustrating how to architect solutions with Apache Hadoop along with providing best practices and recommendations for using Hadoop and related tools.
Read more.
Python has become an increasingly important part of the data engineering and analytics tool landscape. PyData at Strata provides in-depth coverage of the tools and techniques gaining traction with the data audience, including the IPython Notebook, NumPy and matplotlib for visualization, SciPy, scikit-learn, and how to scale Python performance, including how to handle large, distributed data sets.
Read more.
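As a flavor of the kind of tooling this track covers, here is a minimal NumPy sketch (illustrative only, not material from the tutorial itself) showing the vectorized style that makes numeric Python fast: array-wide operations replace explicit Python loops.

```python
# Standardizing a data set with NumPy: a common preprocessing step
# before feeding features to scikit-learn models.
import numpy as np

def zscore(values):
    """Standardize an array: subtract the mean, divide by the std dev."""
    arr = np.asarray(values, dtype=float)
    return (arr - arr.mean()) / arr.std()

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
standardized = zscore(data)
print(standardized.mean())  # ~0.0
print(standardized.std())   # ~1.0
```

The same operation written as a Python `for` loop would be orders of magnitude slower on large arrays, which is the core scaling idea the session title alludes to.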
All-Day: Strata's regular data science track has great talks drawing on real-world experience from leading-edge speakers. But we didn't just stop there—we added the Hardcore Data Science day to give you a chance to go even deeper. The Hardcore day will add new techniques and technologies to your data science toolbox, shared by leading data science practitioners from startups, industry, consulting...
Read more.
Spark Camp, organized by the creators of the Apache Spark project at Databricks, will be a day-long, hands-on introduction to the Spark platform, including Spark Core, the Spark Shell, Spark Streaming, Spark SQL, MLlib, and more.
Read more.
All-Day: For business strategists, marketers, product managers, and entrepreneurs, Data-Driven Business looks at how to use data to make better business decisions faster. Packed with case studies, panels, and eye-opening presentations, this fast-paced day focuses on how to solve today's thorniest business problems with Big Data. It's the missing MBA for a data-driven, always-on business world.
Read more.
Technologists focused on privacy and civil liberties will run through the material in their book. The workshop will cover how to think about privacy, privacy protection properties that a system can have and the architectures that implement them, related issues in information security, and privacy issues in data collection.
Read more.
D3.js has a very steep learning curve. However, there are three main concepts that, once you get your head around them, will make the climb much easier. Focusing on these three concepts, we will walk through many examples to teach the fundamental building blocks of D3.js.
Read more.
From advanced visualization, collaboration, and reproducibility to data manipulation, R Day at Strata covers a raft of current topics that analysts and R users need to pay attention to. The R Day tutorials come from leading luminaries and R committers, the folks keeping the R ecosystem abreast of the challenges facing analysts and others who work with data.
Read more.
Big Data is reaching beyond the Internet and into the machines that drive our world. Visit Industrial Internet day to gain insights from the way that power plants, factories, cars, and airplanes make use of sensors and software intelligence to improve operations and help managers make good decisions.
Read more.
Apache Cassandra has proven to be one of the best solutions for storing and retrieving time series data. Add in Apache Spark and Kafka, and you have an amazing time series solution. We will talk data models and go through deployment and code to build a functional, real-time application. Languages used: Java and Scala.
Read more.
This tutorial will help you get a jump start on HBase development. We’ll start with a quick overview of HBase, the HBase data model, and architecture, and then we’ll dive directly into code to help you understand how to build HBase applications. We will also offer guidelines for good schema design, and will cover a few advanced concepts such as using HBase for transactions.
Read more.
What are the essential components of a data platform? This tutorial will explain how the various parts of the Hadoop and big data ecosystem fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.
Read more.
This tutorial focuses on hands-on data science skills from prototyping to production. Using GraphLab tools, we walk through multiple case studies such as fraud detection, social network analysis, and building personalized recommendation services.
Read more.
Advanced math for business people: “just enough math” to take advantage of new classes of open source frameworks. Many take college math up to calculus, but never learn how to approach sparse matrices, complex graphs, or supply chain optimizations. This tutorial ties these pieces together into a conceptual whole, with use cases and simple Python code, as a new approach to computational thinking.
Read more.
Don't miss Startup Showcase, Strata Conference + Hadoop World's live demo program and competition for startups and early-stage companies. The judges will pick winners from 10 finalist companies selected to present at the showcase. This event is part of NYC Data Week.
Read more.
This talk introduces how Intel is working with scientists and physicians to help improve research, treatment, and drug development for Parkinson’s Disease using data science, and how Intel is enabling the Parkinson's research community to build upon an open platform for big data analytics.
Read more.
Data is an evolving story. It’s not a static snapshot of a point-in-time insight. With data from internal and external sources constantly updating, we are evolving from rear-view-mirror dashboard views into an era of interactive storytelling.
Read more.
Spark represents the next step-function leap in what is possible with Hadoop, but what does that mean for business analysts who are swimming in multi-structured data? This presentation discusses the new workflow required so that business analysts can work with massive volumes of multi-structured data to find new insights today, instead of continually having to wait for IT to make big data small.
Read more.
There are two essential skills for the data scientist: engineering and statistics. A great many data scientists are very strong engineers but feel like impostors when it comes to statistics. In this talk John will argue that the ability to program a computer gives you special access to the deepest and most fundamental ideas in statistics.
Read more.
Bob Mankoff, The New Yorker's cartoon editor, will analyze the lessons we learn from crowdsourced humor. Along the way, he'll explore how cartoons work (and sometimes don't); how he makes decisions about what cartoons to include; and what crowds can tell us about a good joke.
Read more.
Jeffrey Heer (Trifacta | University of Washington)
Interaction and visual design are exacting exercises. Designing for data -- especially in messy and massive forms -- brings a new set of challenges. How can we help people of varying backgrounds effectively transform and understand data at scale?
Read more.
An open-data-in-government love story and case study: how a team of techies overcame political and procedural hurdles to change the financial marketplace.
Read more.
Uber has created an AI city simulation framework to optimize its dispatching system, minimize user wait times, and maximize driver partner earnings. Based on agent-based and swarm intelligence models, this framework generates plausible optimizations across many interacting, dynamic, non-linear parameters on a city-by-city basis.
Read more.
Business problems don’t reveal themselves neatly as data problems. The data community is obsessed with tools and techniques, but the real challenge is understanding how to solve problems with data. How do we bridge the gap? In this talk, we will teach you a methodology for figuring out the right problems to solve and making sure that the work stays smart.
Read more.
Find out how to run real-time analytics over raw data without requiring a manual ETL process targeted at an RDBMS. This talk describes Impala’s approach to on-the-fly data transformation and its support for nested data; examples demonstrate how this can be used to query raw data feeds in formats such as text, JSON and XML, at a performance level commonly associated with specialized engines.
Read more.
The explosion of internal data sources, external public data sources and feeds from the Internet of Things is causing a tsunami of diverse data sources for enterprises. Top-down data-integration tools and data scientist tools won’t scale to meet the demands of the modern enterprise. Learn how a scalable data curation platform can help enterprises connect and enrich their data to leverage it all.
Read more.
Goldman Sachs is a leading global investment banking, securities, and investment management firm that provides a wide range of financial services. Goldman executes hundreds of millions of financial transactions per day, across nearly every market in the world. Learn how Goldman is harnessing knowledge, data, and compute power to maintain and increase its competitive edge.
Read more.
This session will outline Intel’s vision of an E2E Data Analytics Architecture for IoT as well as how we are enabling companies to elevate and transform the way they interact with their customers.
Read more.
Learn the critical success factors for organizational success with Hadoop, and how to build the right team and skill sets for high-performance Hadoop, from a veteran of three successful Hadoop projects.
Read more.
In this session, you will learn why it’s powered by Spark, hear key business use cases from customers across various industries using it, and gain an understanding of the five fundamentals of speeding disparate data analysis.
Read more.
There is a symbiotic relationship between predictive modeling and Big Data. Performance gets better with more data, and predictive models demonstrate like few other techniques the value of Big Data. However, there is a surprising paradox: when you need models most, even all the data is not enough or just not suitable. So in this day and age of Big Data, there remains an art to predictive modeling.
Read more.
Bad press, FTC consent decrees, and White House reports have all put a spotlight on bad data practices. Data scientists and designers have become increasingly aware of how privacy principles should guide their work. So, the geeks have met the wonks. Now, it’s time for the wonks to meet the geeks and use data analytics to keep pace with burgeoning data volumes, velocities, and innovations.
Read more.
How far can we take open data--and where can it take us? Brett Goldstein, who helped pioneer Chicago’s cutting-edge efforts in open data and analytics as CIO and CDO, will speak on how these act as a force multiplier on government efforts and can lead to smarter and more inclusive policy-making, while enhancing the government’s ability to anticipate and react to the needs of the public.
Read more.
This talk highlights William's successes, challenges, and experiences bringing a data-driven operations model to Cisco’s engineering services organization. William highlights the role of data, the need for scale and security, the opportunity for new technology to accelerate business, the role of IT as a guide and partner, and the mindset and cultural changes along the journey.
Read more.
Hyde shows how to quickly build a SQL interface to a NoSQL system using Optiq. He shows how to add rules and operators to Optiq to push down processing to the source system, and how to automatically build materialized data sets in memory for blazing-fast interactive analysis.
Read more.
Data transformation — traditionally the domain of IT specialists — is emerging as a critical, widespread problem in data analytics. In this session we discuss the advantages of using a domain-specific language for data transformation tasks. We illustrate these issues with Wrangle, a DSL designed for interactive data transformation.
Read more.
Transamerica is a financial services company moving to a more customer-centric model using Big Data. Our approach to this effort spans our Insurance, Annuity, and Retirement divisions. We went from a simple proof of concept to establishing Hadoop as a viable element of our enterprise data strategy. We cover core components of our solution and focus on lessons learned from our experience.
Read more.
Up to 90% of your data is coming in new forms, in greater size, and at increasing speed. This multi-structured data requires a new workflow, putting the power of Hadoop and Spark into the hands of business analysts. In this session, we will share how Fortune 500 analysts have transformed their workflow, gaining insights into their business that were never before possible.
Read more.
In this session you will hear from big data experts with real world experience on the architectural patterns and platform integrations used to solve real business problems with data.
Read more.
Shifting workloads from the enterprise data warehouse (EDW) to Hadoop reduces costs, enables you to keep that data longer, and frees up EDW capacity for fast analytics. Check out our live demo and learn a proven framework for offloading workloads from the EDW to Hadoop: Identify & prioritize what to offload; Shift workloads to Hadoop; Optimize & secure your environment; and Visualize new insights.
Read more.
In this debate, two teams of the world's best data scientists will debate the following proposition: "If you can't code, you can't be a data scientist."
Read more.
Birds of a Feather (BoF) discussions are a great way to informally network with people in similar industries or interested in the same topics. NOTE: BoFs are happening during lunch, which is not accessible to Expo Plus and Expo Only pass holders.
Read more.
Data scientists wear many hats -- how do you train a ready-for-prime-time data scientist in twelve weeks? We'll share some of the choices and models we used to create the Metis Data Science Bootcamp and select its first cohort of students.
Read more.
While the inexorable march of technology does threaten historical notions of privacy, privacy IS very much alive – a shifting, vital conversation society has with itself and its machines. This talk explores the principles of transparency, unlinkability, and intervenability to build a foundation for a design ethos for technologists.
Read more.
Geospatial Big Data and types are special "animals" when it comes to storage, discovery, and processing. This session will explore the various non-traditional ways to stream, extract, batch, and visualize geospatial information for deeper geo-insight, such as "Where are the 3 nearest facilities to each of my customers based on current traffic conditions... nationwide?"
Read more.
Over two years of running A/B testing at Pinterest on millions of users each day, Andrea learned about the nuances that can make or break an experimentation platform. Andrea will discuss how her approach to testing has adjusted over time to avoid critical errors at all levels, from organizational to analytical.
Read more.
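The workhorse behind most A/B test analysis of conversion rates is a two-proportion z-test. The sketch below (illustrative only; the numbers and function are hypothetical, not Pinterest's actual platform code) shows the basic calculation an experimentation platform performs for each metric.

```python
# Two-proportion z-test: is variant B's conversion rate really
# different from variant A's, or is the gap just noise?
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Return the z statistic comparing conversion rates of two variants."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    p = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical data: variant B converts 5.5% vs. 5.0% for A,
# with 20,000 users in each bucket.
z = two_proportion_ztest(1000, 20000, 1100, 20000)
print(round(z, 2))  # |z| > 1.96 is significant at the 5% level
```

Running many such tests at once is exactly where the "critical errors at all levels" the talk mentions creep in: without corrections for multiple comparisons or peeking, nominally significant z values appear by chance.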
When people think of big data processing, they think of Apache Hadoop, but that doesn't mean traditional databases don't play a role. In most cases users will still draw from data stored in RDBMS systems. Apache Sqoop can be used to unlock that data and transfer it to Hadoop, enabling users with information stored in existing SQL tables to use new analytic tools.
Read more.
Organizations often showcase the virtues of their data platforms, but rarely share the challenges and decisions faced along the way. Our session describes how we architected our analytics stack around Druid, an open source distributed data store, and how we overcame the challenges around scaling the system, balancing features with cost, and making performance consistent.
Read more.
American Express is transforming for the digital age! Learn how we unleashed Big Data into our ecosystem and built on the strength of our core capabilities to remain relevant in a rapidly changing environment. New commerce opportunities and innovative products are being delivered, and the chance to provide actionable insights, social analysis, and predictive modeling is growing exponentially.
Read more.
Today’s unstructured data is raw and complex, but everyone agrees it can provide context and hidden insights when it is easily accessed during the business intelligence lifecycle...
Read more.
This session will examine the distribution and storage of data in HDFS across multiple datacenters in a single coordinated, Paxos-based file system over a WAN. Efficient use of compute resources in a globally distributed HDFS cluster is also discussed.
Read more.
Big Data and analytics is still a young space, but novel methods are on the way. Prominent among them is graph analytics. Actian will show radical and innovative graph analytic capabilities from its investment in SPARQL City. Founded by database legend Barry Zane, SPARQL City and Actian are committed to delivering the industry’s highest-performing in-memory graph analysis engine.
Read more.
An increasingly common task for data science is the measurement and attribution of experimental impact. Using examples from healthcare.gov, Microsoft advertising, and Bing experimentation, we will explore the strengths, weaknesses, and pitfalls of techniques for dealing with impact and attribution in scenarios/data in which control experiments were not possible or otherwise not performed.
Read more.
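One standard technique for exactly this situation, estimating impact when a randomized control experiment was not run, is difference-in-differences. The sketch below uses made-up numbers (not the talk's actual data) to show the core arithmetic: compare the treated group's before/after change against a comparison group's change over the same period.

```python
# Difference-in-differences: the comparison group's change serves as
# an estimate of what would have happened to the treated group anyway.
treated_before, treated_after = 100.0, 130.0   # metric for treated group
control_before, control_after = 100.0, 110.0   # metric for comparison group

treated_change = treated_after - treated_before   # +30, includes the trend
control_change = control_after - control_before   # +10, trend only
impact = treated_change - control_change          # estimated causal effect

print(impact)  # 20.0
```

The estimate is only as good as the "parallel trends" assumption, that both groups would have moved together absent the intervention, which is one of the pitfalls this kind of talk typically dissects.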
Companies are deploying Hadoop “data lakes” to provide unprecedented access to data for data science and analytics. However, frictionless ingest, flexible schema-on-read, and a lack of data governance turn into increasingly insurmountable challenges to true data self-service, and create a barrier to the enterprise adoption of Hadoop.
Read more.
“Leave the over-structured, complex Data Warehouse behind. Dive into the pure, sparkling waters of the Data Lake!” I suggest you enjoy the Instagram, but beware the hidden depths. The Data Lake is a misleading metaphor; it will become a watery grave for context, governance, and value. In reality, today's intricate information ecosystem demands a careful blend of architectures and technologies.
Read more.
The story of using predictive analytics for homelessness prevention in New York City. SumAll.org is currently piloting this approach with the city’s department of homeless services. Predicting at-risk families in a timely manner and micro-targeting social services is a game-changer. SumAll.org is a data analytics nonprofit, dedicated to leveraging the power of data for social innovation.
Read more.
Aadhaar, India's Unique Identity Project, is the largest biometric identity system in the world with more than 600 million people. Its strength lies in its design simplicity, sound strategy, and technology backbone issuing 1 million identity numbers and doing 600 trillion biometric matches every day! Pramod Varma, who is the Chief Architect of Aadhaar, shares his experience from this project.
Read more.
This discussion touches on the human response to analysis results, especially when they do not support long-held beliefs, and how this affects organizational change. It also focuses on predictive analytics best practices, team skills, and a review of what it takes to build a sustainable predictive analytics program.
Read more.
The past year has seen the advent of various "low latency" solutions for querying big data such as Shark, Impala, and Presto. The Hive team at Yahoo has spent the past several months benchmarking several versions of Hive (and Tez), with several permutations of file-formats, compression, and query engine features, at various data sizes. In this talk, we present our tests, the results, and findings.
Read more.
Leveraging our experience from working on some of the largest-scale, high-growth applications at Facebook and other companies, including building the most popular data analysis tool, Scuba, this talk outlines 10 lessons learned, along with best practices for extracting the most value out of data while avoiding common pitfalls.
Read more.
The accumulation, access, and analysis of customer data (“the original Big Data”) are ingrained at L.L.Bean, which has been doing customer modeling since the 1960s. In line with today’s omnichannel imperative, however, the retailer has embraced a “new Big Data”-driven culture—democratizing data access and tools—in order to sustain its customer-centric philosophy.
Read more.
This session will cover how MemSQL’s hybrid transactional and analytic data processing capabilities and Apache Spark integration enable businesses to build real-time platforms for applications like operational analytics, position monitoring, and anomaly detection.
Read more.
Join TIBCO Software, an industry leader in infrastructure and analytics software, for a thought leadership discussion to learn how your organization can redefine its data strategy. Transition from a company of Big Data to Fast Data and convert your customers into fans while achieving a competitive advantage.
Read more.
In this talk Arun Murthy will share the very latest innovation from the community aimed at accelerating the interactive and realtime capabilities of enterprise Hadoop.
Read more.
A talk about how the largest professional social network in the world is digitally mapping the global economy to connect talent with opportunity at massive scale.
Read more.
There is a wave of challengers in the database world focused on the scaling costs of traditional RDBMSs. These potential giant killers have capitalized on explosive data growth and disruptive technologies like distributed computing (e.g., Hadoop and NoSQL). We’ll discuss the new breed of database buyers, the redefinition of “enterprise,” and apply lessons from past database wars.
Read more.
This session will help data scientists support healthcare leaders in harmonizing health data with open source community data commons approaches. This enhances the value of mandated EMR adoption beyond Meaningful Use requirements by creating evidence-based community health intelligence at the pace and point of change: the everyday lives and activities of community members.
Read more.
In this session, Kleiner Perkins Caufield & Byers General Partner Michael Abbott speaks with Geoff Guerdat of the Gilt Groupe, Will Moss of Airbnb, and Emil Ong of Lookout, to unbox their respective companies and examine the technology, architecture, and innovations they’ve harnessed to deliver superior products and services.
Read more.
We will discuss the basics of scaling, common mistakes and misconceptions, how different technology decisions affect performance, and how to identify and scale around the bottlenecks in a Storm deployment.
Read more.
An introduction to Tachyon, a memory-centric storage system that started at UC Berkeley. It enables different frameworks to share data at memory speed, and it is a major component of the Berkeley Data Analytics Stack (BDAS). The project is open source and is deployed at multiple companies; it has more than 30 contributors from over 10 institutions, including Yahoo, Intel, Red Hat, and Alibaba.
Read more.
CERN, home to the Large Hadron Collider (LHC), is at the forefront of science and technology. Come to this session to learn how projects at CERN are leveraging in-memory data management and Hadoop to derive real-time insights from sensor data, helping to manage the technical infrastructure of the LHC.
Read more.
Learn how you can architect Amazon Kinesis and Amazon Elastic MapReduce together to create a highly scalable real-time analytics solution which can ingest and process terabytes of data per hour from hundreds of thousands of different concurrent sources.
Read more.
SQL is the natural language for querying data, but data lives in many places. We discuss the importance of SQL not only on Hadoop, but on relational databases and NoSQL stores as well. Additionally, we dive deep into the architecture of Big Data SQL, which can access all of these sources in a single query.
Read more.
Brian Granger (Cal Poly San Luis Obispo),
Fernando Perez (UC Berkeley and Lawrence Berkeley National Laboratory)
The IPython Notebook is an open-source, web-based interactive computing environment. The Notebook enables users to author documents that combine live code, descriptive text, mathematical equations, images, videos, and arbitrary HTML. This talk will describe how IPython is evolving to support a wide range of programming languages relevant in data science, including Python, Julia, and R.
Read more.
Recent studies show the vast majority of Hadoop projects are stuck in development, with very few ever reaching production status. And those programs that do convert from pilot to production often view Hadoop as little more than an ETL tool. This session looks at why Hadoop implementations often stall out in the development phase and what companies can do to make Hadoop “production ready.”
Read more.
Open government data on healthcare, finance, education, energy, and other areas has become a major business resource. Joel Gurin, author of Open Data Now and director of the Open Data 500 study, will show how both startups and established companies are putting open data to work. He'll cover Open Data and Big Data, business models for open-data companies, and lessons from a range of case studies.
Read more.
At Etsy, we run dozens of experiments simultaneously and we have terabytes of data generated by the tens of millions of members of our community. We've worked hard to establish a product development process informed by -- and often driven by -- data. In this talk, Nell will discuss the tensions that arise in a data-driven product culture.
Read more.
In this session, Kleiner Perkins Caufield & Byers General Partner Michael Abbott speaks with Michael Stoppelman of Yelp and Siva Subramanian of Box to unbox their respective companies and examine the technology, architecture, and innovations they’ve harnessed to deliver superior products and services.
Read more.
Apache Samza is a framework for processing high-volume real-time event streams. In this session we will walk through our experiences of putting Samza into production at LinkedIn, discuss how it compares to other stream processing tools, and share the lessons we learnt about dealing with real-time data at scale.
Read more.
Apache Spark is a popular new paradigm for computation on Hadoop. It's particularly effective for iterative algorithms relevant to data science like clustering, which can be used to detect anomalies in data. Curious? Get a taste of Spark MLlib, Scala and k-means clustering in this walkthrough of anomaly detection as applied to network intrusion, using the KDD Cup '99 data set.
Read more.
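The idea behind k-means anomaly detection is simple enough to sketch without a cluster. The plain-Python fragment below (a hypothetical sketch; MLlib's Scala/Spark API differs) shows the scoring step: after fitting cluster centers on normal traffic, flag any point that lies far from its nearest center.

```python
# K-means-based anomaly scoring: distance to the nearest cluster
# center serves as the anomaly score.
import math

def nearest_center_distance(point, centers):
    """Anomaly score: distance from a point to its closest center."""
    return min(math.dist(point, c) for c in centers)

# Pretend these centers came from a k-means fit on normal traffic.
centers = [(0.0, 0.0), (10.0, 10.0)]
threshold = 3.0  # chosen from the distance distribution of training data

for p in [(0.5, -0.5), (5.0, 5.0)]:
    score = nearest_center_distance(p, centers)
    print(p, "anomaly" if score > threshold else "ok")
```

In the Spark version, the fit runs over the full KDD Cup data set in parallel and the scoring step is a `map` over the stream; the geometry is the same.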
FICO has been delivering analytic solutions, such as their renowned credit scores, for nearly 60 years. Big data technologies like Hadoop promise FICO analysts the ability to build models much faster, and with greater accuracy than before, but this new generation of tools challenges them to think differently.
Read more.
Deriving value from data depends on how well companies capture and manage that data. Learn how to create a centralized processing pool where data can be captured, cleansed, linked and structured in a consistent way. Use the scalability and flexibility of Hadoop to create a powerful processing and refinement engine to drive usable information across enterprise data bases and data marts.
Read more.
Join us for a panel discussion that includes customers, industry experts and partners who are ready to explore the latest advances in Hadoop, from affordability and appliances, to Apache Spark, simplification and security.
Read more.
In the US alone, we make roughly 40 billion search queries every month. From the time we wake up, search is one of the top activities we do online. This talk will show some examples of how this data can be used, from funny things like determining which city wakes up earlier to more complex scenarios like finding adverse drug interactions.
Read more.
Come join us for an eclectic taste of Hell’s Kitchen cuisine and entertainment. Mix and mingle with fellow attendees at six distinctly different places within a few blocks of each other, including a piano bar, swing dancing, Memphis bbq, cajun creole, southeast Asian, and rock & roll lounge.
Read more.
In this presentation Eli Collins, Cloudera’s Chief Technologist, will discuss how we might both reap the benefits of data while avoiding its perils.
Read more.
This keynote will share insights from the world’s largest repository of consumer emotions and present the challenges and opportunities that this data presents for machine learning as well as data mining and visualization.
Read more.
Software and the rise of cloud services have given rise to revolutionary new economies – creating new markets for everything from self-published books, music and videos to mobile apps. Only a few years ago, it would have been hard to imagine developers authoring a million apps for smartphones. But that’s history.
Read more.
Karen Moon will discuss the characteristics of unstructured data that make identifying and synthesizing fashion trends particularly challenging, and how getting it right can be a competitive advantage.
Read more.
In this presentation, George L. Legendre, principal of IJP Architects and faculty at the Harvard Graduate School of Design, will show how the mathematical equations of pasta define the ultimate taxonomy of the genre.
Read more.
The world is a rapidly changing place, where time flies and technological innovations batter us fast and furiously. Hadoop is just nine years old, and just five years ago it had nowhere near the audience, ecosystem, or impact it has now...
Read more.
In this session you'll see an application that builds on existing, in-place technologies like Hadoop to deliver understandable results. You'll hear a story where analytics at rest was applied to unstructured data using a simple SQL-like development environment, and findings were promoted to the frontier of the business to score monetizable intent in real time, assess reputations, and more.
Read more.
Julia Angwin discusses how much she has spent trying to protect her privacy, and raises the question of whether we want to live in a society where only the rich can buy their way out of ubiquitous surveillance.
Read more.
Nathan Shetterley, Josh Patterson, and their team set out to change the visual identity of the world's largest IT consulting firm. From grassroots public visualization to a global visual literacy curriculum, see how they made Accenture more focused on data visualization. In addition, they will share insights into the business value of data visualization to their firm and clients.
Read more.
Last year, Douglas Merrill, CEO of ZestFinance and former Google CIO, discussed how success in big data analysis requires not just machines and algorithms, but also human analysis, or “data artists.” Building on this notion, Mike Armstrong, CMO of ZestFinance, will discuss how companies can find, identify, and correct data inaccuracies.
Read more.
Industrial systems produce large volumes of real-time data that can be analyzed using Big Data technologies in data center environments. In many cases, such data needs to be analyzed at the edge before leaving the industrial machines or the systems that control them. This is possible if machines have the intelligence to process data and make decisions. GE will share such use cases and experiences.
Read more.
To anticipate who will succeed and invest wisely, investors spend a lot of time trying to understand the longer-term trends within an industry. In this panel discussion, we’ll consider the big trends in Big Data, asking top-tier VCs to look over the horizon and discuss their visions for the next two or more years.
Read more.
This talk examines sources of latency in HBase, detailing steps along the read and write paths. We'll examine the entire request lifecycle, from client to server and back again. We'll also look at the different factors that impact latency, including GC, cache misses, and system failures. Finally, the talk will highlight some of the work done in 0.96+ to improve the reliability of HBase.
Read more.
What is the lambda architecture, and how do you put it to use for your streaming data? Flip Kromer and Q Ethan McCallum will explain how this works, using a live-updating recommendation engine as the supporting example.
Read more.
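As a rough illustration of the pattern this session describes (this is a toy sketch, not the presenters' code, and all names in it are invented), the lambda architecture merges a precomputed batch view with a real-time speed view at query time:

```python
# Toy sketch of the lambda architecture's query-time merge: a batch layer
# serves views precomputed over the full master dataset, while a speed
# layer covers only the events that arrived after the last batch run.
from collections import Counter

def batch_view(master_dataset):
    """Batch layer: recompute a full view (here, play counts) from scratch."""
    return Counter(master_dataset)

def speed_view(recent_events):
    """Speed layer: incrementally count events since the last batch run."""
    return Counter(recent_events)

def query(item, batch, speed):
    """Serving layer: merge the batch and real-time views at query time."""
    return batch.get(item, 0) + speed.get(item, 0)

master = ["song_a", "song_b", "song_a"]   # processed by the nightly batch job
recent = ["song_a", "song_c"]             # arrived after the batch run

batch = batch_view(master)
speed = speed_view(recent)
print(query("song_a", batch, speed))  # 3
```

A live-updating recommendation engine, as in the talk, follows the same shape: the batch layer periodically rebuilds the model over all history, while the speed layer folds in fresh interactions until the next rebuild.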
LinkedIn processes an enormous number of events each day. In this talk, you will learn the background of the data challenges LinkedIn faced, how the teams came together to construct the solution, and the underlying stack powering it, including an interactive analytics infrastructure and a self-serve data visualization frontend that operates at scale.
Read more.
The cloud is an amazing game changer for Data Science. This talk will show with demos and real world customer examples the magic that every data scientist can now perform in the cloud...
Read more.
Studies show that the vast majority of Big Data projects involve two or more data platforms. Moving data is costly and must be carefully considered...
Read more.
Predictive modeling is as much art as it is science. The art is in matching your business questions to available data, and then pairing that data with the appropriate statistical techniques. Next comes model refinement, comparison and interpretation.
We’ll demonstrate how SAS® and Hadoop work together to turn raw data into valuable information – and how you can visualize it for better decisions.
Read more.
What does AI mean in 2014, and where is it headed? Every day brings news of purported breakthroughs, and some of the new applications are certainly impressive, but the field has witnessed boom/bust cycles before. What are the challenges that lie ahead this time? This talk will provide an overview of the state of the field, as well as a critical framework for thinking about the years ahead.
Read more.
An important skill for today's data scientists is data communication. Mapping and other types of data visualization have been sufficient to demonstrate the trends and patterns these professionals find in data. However, there is an important shift happening in the way we consume data that means, as a community, we need to think about our ability to turn data into stories.
Read more.
This session will cover the value that linking algorithms bring to identity risk management, and how to apply linking algorithms, data, and super-compute capability to the challenge of identity risk management and identity fraud. We will also look at patterns of identity fraud, namely those (stolen) identities that have come back from the dead, and how to differentiate those from real, live identities.
Read more.
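To make the idea of a linking algorithm concrete (a minimal illustrative sketch, not the presenters' system; the attributes and weights are invented examples), records can be linked by a weighted similarity score over shared attributes:

```python
# Minimal record-linkage sketch: score pairs of identity records by
# similarity over shared attributes, so the same person appearing under
# slight variations gets linked while distinct identities do not.

def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def link_score(rec1, rec2):
    """Weighted similarity across attributes; weights are made-up examples."""
    return (0.5 * jaccard(rec1["name"], rec2["name"])
            + 0.3 * (rec1["ssn"] == rec2["ssn"])
            + 0.2 * jaccard(rec1["address"], rec2["address"]))

r1 = {"name": "John A Smith", "ssn": "123-45-6789", "address": "12 Oak St"}
r2 = {"name": "John Smith",   "ssn": "123-45-6789", "address": "12 Oak Street"}
r3 = {"name": "Jane Doe",     "ssn": "999-99-9999", "address": "5 Elm Ave"}

print(link_score(r1, r2) > 0.6)   # True: likely the same identity
print(link_score(r1, r3) > 0.6)   # False: likely different identities
```

Production systems scale this idea with blocking, phonetic encodings, and probabilistic weights learned from labeled matches, but the scoring-and-thresholding core is the same.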
The trend towards cloud architectures we've seen over the last few years isn't sustainable. With tens of billions more Internet-connected devices arriving over the next few years—far faster than any predicted increase in bandwidth to the outside world—data is increasingly going to become a local problem, rather than a cloud problem.
Read more.
For a long time Internet retailers have been trying to move items they sell closer to customers. Flash sale site Gilt.com takes it to the extreme: we apply machine learning to predict customers' cravings for fashion products in different geographic regions without purchase history to draw from.
Read more.
Today, there are hundreds of production Apache HBase clusters running either entity-centric or event-based applications. Drawing on known clusters and a survey conducted by Cloudera's development, product, and services teams, based on their experiences with the nearly 20,000 HBase nodes under management, this talk categorizes the gamut of use cases into a compact set of application archetypes.
Read more.
In this talk Michael will describe Spark SQL, the newest component of the Apache Spark stack. A key feature of Spark SQL is the ability to blur the lines between relational tables and RDDs, making it easy for developers to intermix SQL commands that query structured data with complex analytics in imperative or functional languages.
Read more.
In this panel discussion, individuals representing key stakeholders across the healthcare ecosystem will share the ways they're applying Hadoop to solve big data challenges that will ultimately improve the quality of patient care while driving better healthcare affordability.
Read more.
This session will describe the kinds of tools and solutions available in the market to tap into text sources. Two use cases will be discussed, and short demos will illustrate both a tools approach and a solution approach.
Read more.
Developing Big Data applications for real-world business processes can be complex: methods of processing, a variety of systems, and a number of data sources. Large web companies have implemented a generic, scalable, fault-tolerant data processing architecture: Lambda. We’ll explore this evolving architecture, its design principles, layers and components, and use cases and lessons learned from real-world implementations.
Read more.
In this session we talk about how to design, build, and manage large-scale enterprise Big Data deployments, from high-disk-I/O apps to in-memory solutions, for both on-premise and multi-tenant cloud environments, taking a holistic view of all the components, including compute, network, and the software stack.
Read more.
We debunk some popular approaches and attitudes we have encountered over the course of more than 50 real-world Big Data implementations. We will describe each anti-pattern and its appeal, but also why it fails and how to do it right.
Read more.
Birds of a Feather (BoF) discussions are a great way to informally network with people in similar industries or interested in the same topics. NOTE: BoFs are happening during lunch, which is not accessible to Expo Plus and Expo Only pass holders.
Read more.
Shoving 1MM rows of query results into a chart or graph returns illegible results and kills interactivity. Smarter designs, however, will achieve data visibility. Furthermore, running on GPUs turns static designs into interactive tools. We will show how Graphistry does this in production with (a) new client/cloud GPU infrastructure and (b) GPU-accelerated languages like Superconductor.
Read more.
People living in the Information Age are faced with a conundrum. They wish to be connected on a series of global, interconnected networks but they also wish to protect their privacy and to be left alone…sometimes.
Read more.
A lot of stationary, big data begins its life as small data in rapid motion - think logs, sensors, social data. The pressure is on architects, infra devops, and app developers to harness real-time data, and expose it to the right data processing paradigm. Learn how on AWS, services like Amazon Kinesis, Redshift, and Elastic MapReduce can be composed to deliver a smarter big data infrastructure.
Read more.
The Big Data market is busy, with sky-high valuations and a rapid pace of innovation. This panel of data-focused Venture Capitalists will look at how they think about investing in and around the Big Data space—from the kind of deals they’re after, to how they like to work with entrepreneurs and founders.
Read more.
Are you taking advantage of all of Hadoop’s features to operate a stable and effective cluster? Inspired by real-world support cases, this talk discusses best practices and new features to help improve incident response and daily operations. Chances are that you’ll walk away from this talk with some new ideas to implement in your own clusters.
Read more.
We will demonstrate how to combine visual tools with Spark to apply three specific techniques to visually explore big data using a) summarize and visualize, b) sample and visualize, and c) model and visualize. We will use a real big dataset, such as Wikipedia traffic logs, to demonstrate these techniques in a live demo.
Read more.
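The first two techniques can be sketched in a few lines (an illustrative toy, not the presenters' demo code): reduce a big dataset to something small enough to plot, either by aggregating it down to summary statistics or by drawing a uniform sample.

```python
# (a) Summarize and visualize: aggregate a numeric stream to a handful of
#     plottable statistics. (b) Sample and visualize: draw a uniform k-item
#     sample from a stream of unknown length via reservoir sampling.
import random

def summarize(stream):
    """One pass over the data, keeping only count, mean, min, and max."""
    n, total, lo, hi = 0, 0.0, float("inf"), float("-inf")
    for x in stream:
        n += 1
        total += x
        lo, hi = min(lo, x), max(hi, x)
    return {"count": n, "mean": total / n, "min": lo, "max": hi}

def reservoir_sample(stream, k, seed=0):
    """Uniform sample of k items without knowing the stream length upfront."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = x
    return sample

data = range(1_000_000)  # stand-in for something like Wikipedia traffic logs
print(summarize(data)["mean"])            # 499999.5
print(len(reservoir_sample(data, 1000)))  # 1000
```

In the Spark setting described in the session, the same shape applies: aggregations and samples are computed in parallel on the cluster, and only the small result is handed to the visualization tool.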
Medicine is undergoing a renaissance made possible by analyzing and creating insights from a huge and growing number of genomes. This session will showcase how ETL and MapReduce can be applied in a clinical setting.
Read more.
Visualizations can be easy on the eyes, until you need to view data at scale. In this session, James Dixon, CTO and Co-Founder of Pentaho will talk about ways of presenting large scale datasets. Using data from the City of Chicago, James will present practical examples that help distill large amounts of data in ways that are easier for users to comprehend.
Read more.
Join us to learn how to leverage the new Distributed R open source technology from HP Labs and HP Vertica. The Distributed R platform introduces a new, easy-to-use distributed programming model and infrastructure for the R language. Distributed R includes out-of-the-box open source parallel R algorithms that can scale to terabytes of data.
Read more.
Agile data transformation uses Hadoop’s schema-on-read capability to manipulate raw data as needed for business purposes. Transforming data can be a barrier to data access and agility, consuming up to 80% of business analysts' time. Hear directly from LinkedIn, Autodesk, MarketShare, and Orange about how predictive interactions make agile data transformation a reality on Hadoop.
Read more.
Microsoft Translator currently supports 100+ languages. We constantly improve translation quality and add new scenarios, all with a constant team size. This session describes a production-scale ML architecture, using MS Translator as a case study. You will learn a mental model for approaching your ML problem, and concrete do's and don'ts for the various components of an ML system architecture.
Read more.
Nanocubes is an open source project that can be used to visually explore large spatiotemporal datasets at interactive rates using a web browser.
Read more.
IDEO's Hybrid team brings all the design tools from IDEO's product design process to work with clients on data oriented projects. The team will share elements of their process and case studies to show how incorporating human-centered techniques from design can improve data as an input to decision making.
Read more.
As in a game of chess, successful use of machine learning techniques against adaptive adversaries, such as spammers and intruders, requires designing the learning algorithms having anticipated the opponent’s response to those algorithms. In this talk, we present techniques to design robust machine learning algorithms for adversarial environments and provide clarifying attack-defense examples.
Read more.
After babysitting Hadoop clusters for many years and knowing their limitations really well, we had the chance to design and implement the cloud infrastructure for a large connected-home platform from scratch. We’ll show how we built that backend with Crate Data and Twitter Storm, and why this is a perfect match for the workload.
Read more.
PDFs are the bane of data science, a jail from which machine-readable data struggles to escape. We'll explain how at Edmunds.com we freed data from diverse auto manufacturer PDFs, applied NLP and entity recognition, and integrated the results into the expert-driven process of defining vehicle models.
Read more.
This talk will cover resource management using YARN, the new resource management platform introduced in Hadoop 2.0. It will cover how YARN achieves effective cluster utilization and fair sharing of resources, and how it allows different types of applications to utilize the cluster. We will go over the architecture, recent improvements, and things coming down the pipeline.
Read more.
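As a flavor of what sharing a YARN cluster between application types looks like in practice, here is a hypothetical CapacityScheduler configuration fragment (the queue names and percentages are invented examples, not from the talk):

```xml
<!-- Hypothetical capacity-scheduler.xml fragment: split the cluster
     between an "analytics" queue and an "etl" queue, while letting the
     etl queue borrow idle capacity up to a ceiling. -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>analytics,etl</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.analytics.capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.etl.capacity</name>
    <value>40</value>
  </property>
  <property>
    <!-- elasticity: this queue may grow into idle capacity up to 80% -->
    <name>yarn.scheduler.capacity.root.etl.maximum-capacity</name>
    <value>80</value>
  </property>
</configuration>
```

Guaranteed capacities plus elastic maximums are how YARN delivers both fair sharing and high utilization: each queue is promised its slice, but idle resources are not left stranded.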
Automated image processing improves efficiency for a diverse range of applications from defect detection in manufacturing to tumor detection in medical images. We’ll go beyond traditional approaches to image processing, which fail for large image datasets, by leveraging Hadoop for processing a vast number of arbitrarily large images.
Read more.
Leveraging Hadoop data, served to users with advanced visualization in MicroStrategy, Netflix delivers effective, responsive insights quickly. This puts advanced analytics in the hands of business users who make the decisions that help the online entertainment network to outperform their rivals by serving consumers the content they want, how they want it.
Read more.
Data continues to drive innovation, yet it’s how we interpret and use that data that becomes imperative to success. Using a design approach called Natural Analytics, technology can leverage the way our human curiosity searches and processes information.
Read more.
Learn how Western Union uses Hadoop with Informatica to parse and integrate Omniture web log files, XML data, and relational transactions data to meet their current and future data analysis needs.
Read more.
Text adds clarity to visualizations and helps authors communicate. There are many text elements to consider when making a chart: axis titles, category and data labels, gridline labels, legends, citations, and annotations, to name a few. This talk will dive into the specifics of typography and text placement in information design.
Read more.
The Internet is a warzone. Any business with a digital presence needs to protect itself from threats that exist in cyberspace. In this presentation, we’ll show you how to build a real-time anomaly detection system using Sqrrl Enterprise and Apache Spark GraphX to monitor and surface advanced persistent threats and malicious actor attacks.
Read more.
Born as a solution built for RMS (the Australian Government agency managing and regulating the use of roads in New South Wales), this Internet of Things application for smarter transportation services provides a real-time data hub for transportation sensor networks, network information and traveler information, offering actionable insight into network performance, congestion, and incidents.
Read more.
Panel discussion, moderated by Liza Kindred, Founder, Third Wave Fashion. Panelists include David Whittemore, founder of Clotheshorse; Gina Mancusco, founder of Love That Fit; and Rasmus Thofte, head of North America at Virtusize.
Read more.
This presentation will show you how to get your Big Data into Apache HBase as fast as possible. Those 40 minutes will save you hours of debugging and tuning, with the added bonus of having a better understanding of how HBase works. You will learn things like the write path, bulk loading, HFiles, and more.
Read more.
In the last two years we've seen the introduction of several open-source SQL engines for Hadoop. There have been numerous marketing claims around SQL-on-Hadoop performance, but what should you believe? How do these different engines compare on functionality? This talk will compare and contrast Hive, Impala, and Presto, all from a non-vendor, unsponsored, independent point of view.
Read more.
Netflix continues to evolve its big data architecture in the cloud with performance enhancements and updated OSS offerings. We will share our experiences and selections in file formats, interactive query engines, and instance types. Genie emerges with updates to support YARN applications, and we will unveil a new performance visualization tool, Inviso.
Read more.
It's been twenty years since Red Hat first launched Linux. Since then Red Hat has fueled the rapid adoption of open source technologies. As Big Data transitions into enterprise mode, Red Hat is again poised to facilitate the innovation and communities needed to empower multiple data stakeholders across your organization so you can truly open the possibilities of your data.
Read more.
If the allure of Big Data is that you can throw it all in the data lake and process it cheaply and quickly, then the catch is how do you know what's in there and how do you govern it? A Big Data lake needs data governance to create trusted data, ensure consistency, and secure information appropriately. This session will discuss how to start putting a Big Data governance framework in place.
Read more.
Many big data initiatives overlook human-created content, even though this content makes up the majority of the information produced by most organizations. Despite this, it has been under-used, under-valued, and under-analyzed because of legacy technology limitations. We will discuss the business value and the technology that can be used to tap into its power, and showcase real-life use cases.
Read more.
H2O presents the world's fastest distributed parallel GBM. GBM is a machine learning algorithm used to win many recent Kaggle competitions, and is well known for its high-quality results.
Read more.
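The core idea behind GBM (gradient boosted machines) fits in a few lines: repeatedly fit a weak learner to the residuals of the current ensemble and add it with a small learning rate. The toy below (a pure-Python illustration, nothing to do with H2O's implementation) boosts decision stumps under squared loss, where the gradient is simply the residual:

```python
# Toy GBM for squared loss: each round fits a one-split "stump" to the
# residuals (y - current prediction) and adds it with learning rate lr.

def fit_stump(xs, residuals):
    """Pick the threshold split on x that best reduces squared error."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def predict(x, base, stumps, lr=0.1):
    """Ensemble prediction: base value plus the scaled stump corrections."""
    return base + sum(lr * s(x) for s in stumps)

def gbm(xs, ys, rounds=50, lr=0.1):
    """Boost: fit each new stump to the residuals of the ensemble so far."""
    base = sum(ys) / len(ys)
    stumps = []
    for _ in range(rounds):
        preds = [predict(x, base, stumps, lr) for x in xs]
        residuals = [y - p for y, p in zip(ys, preds)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 1, 9, 9, 9]  # a step function; residuals shrink each round
base, stumps = gbm(xs, ys)
print(round(predict(1, base, stumps), 1))  # 1.0
```

Real GBMs replace the stump with depth-limited trees, handle arbitrary differentiable losses via their gradients, and (as in H2O's case) parallelize the split search across a cluster, but the residual-fitting loop is the same.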
Whether you're lining up for an Apple Watch, using the heads-up display of Google Glass, or sporting one of the hundreds of activity and sleep trackers, it's clear that wearable technology is exploding. No longer bulky and cumbersome gadgets, today's wearables are fashionably... data-chic.
Read more.
In this session Guavus’ Chief Technology Officer, Roy Singh, will present a framework using an operational intelligence platform based on Apache Spark, for providing a pipeline for anomaly detection, causality analysis, anomaly prediction, and actionable alerts.
Read more.
Rio Tinto is one of the world’s leading mining companies. Our current technology focus is on using innovation to realize our vision of the Mine of the Future. Join us as we explore how natural resource companies are using Big Data techniques to visualize resource deposits, enable fully autonomous railroad systems, and monitor systems globally.
Read more.
A look at how we use public government data to answer questions about our customers and their behavior; this data is used by marketing, space management, and product managers. Other government data is used to support the company's sales forecasts, decide which items to purchase, predict quantities to be purchased, and determine which items to phase out. All data-driven management, made even easier with Hadoop.
Read more.
Impala provides the ability to easily analyze large, distributed data sets. This talk will cover the impyla package, which aims to make data science easier with Impala by integrating with Python. The impyla package currently supports programmatically interacting with Impala, running distributed machine learning in Impala, and compiling Python UDFs into assembly instructions via LLVM.
Read more.
The widespread adoption of web-based maps provides a familiar set of interactions for exploring large data spaces. Building on these techniques, tile-based visual analytics provides interactive visualization of billions of points of data or more. This session provides an overview of the technical challenges and promise, using applications created with the open source Aperture Tiles framework on GitHub.
Read more.
Kaiser Permanente is dedicated to improving the quality of healthcare, and big data presents numerous opportunities to drive this mission forward.
Read more.
By reducing friction from deploying models and comparing competing models, data scientists can focus on high-value efforts. At Vast we've experimented with tools and strategies for this while shipping a suite of data products for consumers and agents in the midst of some of life’s biggest purchases. I'll share best practices and lessons learned, and help you free up time for the fun stuff.
Read more.