Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Speakers

New speakers are added regularly. Please check back to see the latest updates to the agenda.

Justin Bleich is a senior data scientist at Coatue Management. Previously, Justin was the cofounder and CTO of Zodiac, an artificial intelligence startup that focused on predicting customer behavior to help brands retain their best customers and find more like them, and an adjunct professor at the Wharton School at the University of Pennsylvania, where he taught advanced data mining and predictive modeling. Justin holds a PhD in statistics from the Wharton School, where he focused on Bayesian machine learning and ensemble-of-trees algorithms.

Presentations

Probabilistic programming in finance using Prophet Session

Prophet is a Bayesian nonlinear time series forecasting model recently released by Facebook. Justin Bleich explains how Coatue—a hedge fund that uses data science to drive investment decisions—extends Prophet to include exogenous covariates when generating forecasts and applies it to nowcasting macroeconomic series using higher-frequency data available from sources such as Google Trends.

Ziya Ma is the general manager of the global Big Data Technologies organization in Intel’s Software and Services Group (SSG) in the System Technologies and Optimization (STO) Division. Her organization focuses on optimizing big data on Intel’s platform, leading open source efforts in the Apache community, and linking innovation in industry analytics to bring about the best and most complete big data experiences. She works closely with Intel product teams, open source communities, industry partners, and academia to advise on implementing and optimizing the Intel platform for Hadoop or Spark ecosystems. Previously, Ziya held various management positions in Intel’s Technology Manufacturing Group (TMG), where she was responsible for delivering embedded software for factory equipment, databases for manufacturing execution and process control, UI software, and more, and was product development software director of Intel IT, where she delivered software lifecycle management tools and infrastructure and analytics solutions to Intel software teams worldwide. She also worked at Motorola earlier in her career. Ziya holds a PhD and an MS in computer science and engineering from Arizona State University.

Presentations

Unleashing intelligence and data analytics at scale (sponsored by Intel) Keynote

Advanced data analytics is reshaping the enterprise with new discoveries, better customer experiences, and improved products and services, all enabled by actionable insight. Ziya Ma shares how Intel is driving a holistic approach to powering advanced analytics and artificial intelligence workloads and unleashing intelligent and scalable insights from the edge to the cloud to the enterprise.

Natalia Adler is a global public policy executive at UNICEF who conceptualizes and drives high-profile initiatives that strengthen organizational capability and achieve large-scale mission-critical outcomes. As data, research, and policy manager, Natalia is trying to leverage data science to solve complex problems affecting children. Skilled in building public-private alliances that expand reach, Natalia has launched and scaled initiatives that connect people and ideas to inspire breakthroughs around the world and has worked with UNICEF Latin America, Nicaragua, and Mozambique.

Presentations

Creating public value through data collaboratives Session

The data collaborative is a new form of public-private partnership that seeks to create public value for the world’s most marginalized children through the exchange of data and data science expertise. Natalia Adler offers an overview of the Data Collaboratives initiative, led by UNICEF and the GovLab at New York University's Tandon School of Engineering.

Manish Ahluwalia is a security engineer at NerdWallet. Manish has held software architect roles at Tibco LogLogic and Thales Vormetric and was a security engineer at Cloudera, where he focused on the security of the Hadoop ecosystem. Manish has been working in big data since its infancy at various companies in Silicon Valley. He is most passionate about security.

Presentations

A practitioner’s guide to Hadoop security for the hybrid cloud Tutorial

Mark Donsky, André Araujo, Syed Rafice, and Manish Ahluwalia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Tyler Akidau is a senior staff software engineer at Google Seattle, where he leads technical infrastructure’s internal data processing teams for MillWheel and Flume. Tyler is a founding member of the Apache Beam PMC and has spent the last seven years working on massive-scale data processing systems. Though deeply passionate and vocal about the capabilities and importance of stream processing, he is also a firm believer that batch and streaming are two sides of the same coin and that the real endgame for data processing systems is the seamless merging of the two. He is the author of the 2015 “Dataflow Model” paper and the “Streaming 101” and “Streaming 102” blog posts. His preferred mode of transportation is by cargo bike, with his two young daughters in tow.

Presentations

Foundations of streaming SQL; or, How I learned to love stream and table theory Session

What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general.

With over 15 years in advanced analytical applications and architecture, John Akred is dedicated to helping organizations become more data driven. As CTO of Silicon Valley Data Science, John combines deep expertise in analytics and data science with business acumen and dynamic engineering leadership.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Ask me anything: Running data science in the enterprise and architecting data platforms Ask Me Anything

John Akred, Stephen O'Sullivan, and Heather Nelson field a wide range of detailed questions on topics such as managing data science in the enterprise, architecting a data platform, and creating a modern enterprise data strategy. Even if you don’t have a specific question, join in to hear what others are asking.

Managing data science in the enterprise Tutorial

John Akred and Heather Nelson share methods and observations from three years of effectively deploying data science in enterprise organizations. You'll learn how to build, run, and get the most value from data science teams and how to work with and plan for the needs of the business.

Brendan Aldrich is the chief data officer at Ivy Tech Community College, the largest singly accredited community college in the nation, where he leads data modernization and democratization initiatives. A cross-industry data innovations specialist, Brendan has over 20 years of information technology experience building and leading top-performing teams that transformed the enterprise at companies such as the Walt Disney Company, Demand Media, Travelers Insurance, and the City Colleges of Chicago. His groundbreaking work at City Colleges of Chicago and Ivy Tech has been recognized with a 2014 Innovators Award from Campus Technology magazine and a 2017 Data and Analytics Excellence Award from Gartner. Brendan holds a bachelor’s degree from California State University, Los Angeles.

Presentations

Learning from higher education: How Ivy Tech is using predictive analytics and data democracy to reverse decades of entrenched practices Session

As the largest community college in the US, Ivy Tech ingests over 100M rows of data a day. Brendan Aldrich and Lige Hensley explain how Ivy Tech is applying predictive technologies to establish a true data democracy—a self-service data analytics environment empowering thousands of users each day to improve operations, achieve strategic goals, and support student success.

SriSatish Ambati is the cofounder and CEO of 0xdata (@hexadata), the builders of H2O, which democratizes big data science and makes Hadoop do math for better predictions. Before 0xdata, Sri spent time scaling R over big data with researchers at Purdue and Stanford. Previously, he cofounded Platfora and was the director of engineering at DataStax; before that, he was a partner and performance engineer at Java multicore startup Azul Systems, tinkering with the entire ecosystem of enterprise apps at scale. Earlier, Sri spent a sabbatical pursuing theoretical neuroscience at Berkeley and worked on a NoSQL trie-based index for semistructured data at in-memory index startup RightOrder.

Sri is known for his knack for envisioning killer apps in fast-evolving spaces and assembling stellar teams to productize that vision. A regular speaker on the big data, NoSQL, and Java circuit, Sri leaves a trail at @srisatish.

Presentations

Streamline Data Science Pipeline with GPU Data Frame (sponsored by NVIDIA) Session

Jim McHugh is joined by founders of the GPU Open Analytics Initiative (GOAI): Todd Mostak, CEO of MapD; SriSatish Ambati, CEO and cofounder of H2O; and Stan Seibert, director of community innovation at Anaconda. Together, they provide an update on the latest advancements and customer use cases leveraging GOAI.

Jesse Anderson is a data engineer, creative engineer, and managing director of the Big Data Institute. Jesse trains employees on big data—including cutting-edge technology like Apache Kafka, Apache Hadoop, and Apache Spark. He has taught thousands of students at companies ranging from startups to Fortune 100 companies the skills to become data engineers. He is widely regarded as an expert in the field and recognized for his novel teaching practices. Jesse is published by O’Reilly and Pragmatic Programmers and has been covered in such prestigious media outlets as the Wall Street Journal, CNN, BBC, NPR, Engadget, and Wired. You can learn more about Jesse at Jesse-Anderson.com.

Presentations

Meet the Expert with Jesse Anderson (Big Data Institute) Meet the Experts

Jesse will talk with you about creating productive data engineering teams that build excellent data products.

Real-time systems with Spark Streaming and Kafka 2-Day Training

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.

Real-time Systems with Spark Streaming and Kafka (Day 2) Training Day 2

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.

The five dysfunctions of a data engineering team Session

Early project success is predicated on management making sure a data engineering team is ready and has all of the skills needed. Jesse Anderson outlines five of the most common nontechnology reasons why data engineering teams fail.

Carlo Appugliese is a data science executive for IBM Analytics, Watson, and Cloud. A technologist with a track record of leveraging emerging technologies and trends to drive business transformation, Carlo has held a number of roles, including computer programmer, manager of application development, and director of innovation.

Presentations

The future of data science and machine learning (sponsored by IBM) Session

A changing market landscape and open source innovations are having a dramatic impact on the consumability and ease of use of data science tools. Carlo Appugliese examines the impact these trends and changes will have on the future of data science and how machine learning is making data science available to all.

Assaf Araki is the senior architect for big data analytics at Intel, where his group is responsible for big data analytics pathfinding within the company. Assaf drives Intel’s overall work with academia and industry on big data analytics and merges new technologies into Intel Information Technology. He has over 10 years of experience in data warehousing, decision support solutions, and applied analytics within Intel.

Presentations

Hardcore Data Science welcome HDS

Hosts Ben Lorica and Assaf Araki welcome you to Hardcore Data Science day.

André Araujo is a solutions architect with Cloudera. Previously, he was an Oracle database administrator. An experienced consultant with a deep understanding of the Hadoop stack and its components, André is skilled across the entire Hadoop ecosystem and specializes in building high-performance, secure, robust, and scalable architectures to fit customers’ needs. André is a methodical and keen troubleshooter who loves making things run faster.

Presentations

A practitioner’s guide to Hadoop security for the hybrid cloud Tutorial

Mark Donsky, André Araujo, Syed Rafice, and Manish Ahluwalia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Tasso Argyros is the founder and CEO of ActionIQ, an enterprise software company that aims to bridge the gap between marketing and data for the Global 2000. He is also a venture partner at First Mark Capital and a cofounder of DataElite Ventures, a San Francisco-based seed-stage fund focused on big data companies. Previously, Tasso cofounded big data pioneer Aster Data (acquired by Teradata), after dropping out of the PhD program at Stanford, and served as copresident and general manager of Teradata’s Big Data Division. Tasso has received several awards and recognitions, including Businessweek’s best young tech entrepreneur for 2009, the World Economic Forum’s technology pioneer in 2010, and Forbes’s next-gen innovator in 2013. He holds a master’s degree in computer science from Stanford University and a diploma in computer engineering from the Technical University of Athens.

Presentations

Accelerating the next generation of data companies Session

This panel brings together partners from some of the world’s leading startup accelerators and founders of up-and-coming enterprise data startups to discuss how we can help create the next generation of successful enterprise data companies.

Eduardo Arino de la Rubia is chief data scientist at Domino Data Lab. Eduardo is a lifelong technologist with a passion for data science who thrives on effectively communicating data-driven insights throughout an organization. He is a graduate of the MTSU Computer Science Department, General Assembly’s Data Science Program, and the Johns Hopkins Coursera Data Science Specialization. Eduardo is currently pursuing a master’s degree in negotiation, conflict resolution, and peacebuilding from CSUDH. You can follow him on Twitter at @earino.

Presentations

Leveraging open source automated data science tools Session

The promise of the automated statistician is as old as statistics itself. Eduardo Arino de la Rubia explores the tools created by the open source community to free data scientists from tedium, enabling them to work on the high-value aspects of insight creation. Along the way, Eduardo compares open source tools such as TPOT and auto-sklearn and discusses their place in the DS workflow.

Carme Artigas is the founder and CEO of Synergic Partners, a strategic and technological consulting firm specializing in big data and data science (acquired by Telefónica in 2015). She has more than 20 years of extensive expertise in the telecommunications and IT fields and has held several executive roles in both private companies and governmental institutions. Carme is a member of the Innovation Board of CEOE and the Industry Affiliate Partners at Columbia University’s Data Science Institute. An in-demand speaker on big data, she has given talks at several international forums, including Strata Data Conference, and collaborates as a professor in various master’s programs on new technologies, big data, and innovation. Carme was recently recognized as the only Spanish woman among the 30 most influential women in business by Insight Success. She holds an MS in chemical engineering, an MBA from Ramon Llull University in Barcelona, and an executive degree in venture capital from UC Berkeley’s Haas School of Business.

Presentations

Executive Briefing: Analytics centers of excellence as a way to accelerate big data adoption by business Session

Big data technology is mature, but its adoption by business is slow, due in part to challenges like a lack of resources and the need for a cultural change. Carme Artigas explains why an analytics center of excellence (ACoE), whether internal or outsourced, is an effective way to accelerate adoption and shares an approach to implementing an ACoE.

Sune Askjaer is the acting director for data science for the NordREE region (Northern and Eastern Europe, including Russia) and currently works out of the Think Big Analytics office in Copenhagen, Denmark. Sune holds an MSc in engineering from the Technical University of Denmark and a PhD from the University of Copenhagen and has worked with machine learning and advanced analytics in various R&D organizations for more than 12 years.

Presentations

Fighting financial fraud at Danske Bank with artificial intelligence Session

Fraud in banking is an arms race, and criminals are now using machine learning to improve their attack effectiveness. Sune Askjaer and Nadeem Gulzar explore how Danske Bank uses deep learning for better fraud detection, covering model effectiveness, TensorFlow versus boosted decision trees, operational considerations in training and deploying models, and lessons learned along the way.

Shivnath Babu is the CTO at Unravel Data Systems and an adjunct professor of computer science at Duke University. His research focuses on ease of use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.

Presentations

Using ML to solve failure problems with ML and AI apps in Spark Session

A roadblock in the agility that comes with Spark is that application developers can get stuck with application failures and have a tough time finding and resolving the issue. Adrian Popescu and Shivnath Babu explain how to use the root cause diagnosis algorithm and methodology to solve failure problems with ML and AI apps in Spark.

Josh Baer is a data infrastructure product lead at Spotify, where he is leading the data processing track of Spotify’s migration to Google Cloud Platform. During his time at Spotify, Josh has worked on growing Spotify’s Hadoop footprint from 180 machines to 2,000, enabling everyday real-time processing and providing infrastructure for advanced machine learning tasks.

Presentations

Spotify in the cloud: The next evolution of data at Spotify Session

In early 2016, Spotify decided that it didn’t want to be in the data center business. The future was the cloud. Josh Baer and Alison Gilles explain what it took to move Spotify to the cloud, covering Spotify's technology choices, challenges faced, and the lessons Spotify learned along the way.

Travis Bakeman is a senior manager of systems design and strategy at T-Mobile, with a focus on network performance management and big data analytics. He is responsible for multiple teams that deliver enterprise solutions leveraging off-the-shelf options such as Splunk and Oracle RAC as well as open source technologies like Cloudera Hadoop. During his tenure with T-Mobile, he has worked in operational support, database administration, data mediation, report development, data enrichment, and frontend application design. Previously, Travis worked in military intelligence in the United States Army. He started his career in the telecom industry in data center operations.

Presentations

How T-Mobile built a massive-scale network performance management platform on Hadoop Session

Travis Bakeman shares how T-Mobile ported its large-scale network performance management platform, T-PIM, from a legacy database to a big data platform with Impala as the main reporting interface, covering the migration journey, including the challenges the team faced, how the team evaluated new technologies, lessons learned along the way, and the efficiencies gained as a result.

Michael Balint is a senior manager of applied solutions engineering at NVIDIA. Previously, Michael was a White House Presidential Innovation Fellow, where he brought his technical expertise to projects like Vice President Biden’s Cancer Moonshot program and Code.gov. Michael has had the good fortune of applying software engineering and data science to many interesting problems throughout his career, including tailoring genetic algorithms to optimize air traffic, harnessing NLP to summarize product reviews, and automating the detection of melanoma via machine learning. He is a graduate of Cornell and Johns Hopkins University.

Presentations

GPU-accelerating a deep learning anomaly detection platform Session

How can deep learning be employed to create a system that monitors network traffic, operations data, and system logs to reliably flag risk and unearth potential threats? Satish Dandu, Joshua Patterson, and Michael Balint explain how to bootstrap a deep learning framework to detect risk and threats in operational production systems, using best-of-breed GPU-accelerated open source tools.

Zbigniew Baranowski is a database system specialist and a member of a group that provides central database and Hadoop services at CERN.

Presentations

Scaling database and analytic workloads with Apache Kudu Session

Apache Kudu is a new, innovative distributed storage that combines low-latency data ingestion, scalable analytics, and fast data lookups. But what does it deliver in practice? Zbigniew Baranowski explains how to use Apache Kudu for scale-out database-like systems, such as those used at CERN, covering the advantages and limitations and measuring performance.

Kirit Basu is director of product management at StreamSets.

Presentations

Real-time image classification: Using convolutional neural networks on real-time streaming data Session

Enterprises building data lakes often have to deal with very large volumes of image data that they have collected over the years. Josh Patterson and Kirit Basu explain how some of the most sophisticated big data deployments are using convolutional neural nets to automatically classify images and add rich context about the content of the image, in real time, while ingesting data at scale.

Dominikus Baur works to make data accessible in every situation. As a data visualization and mobile interaction designer and developer, Dominikus creates usable, aesthetic, and responsive visualizations for desktops, tablets, and smartphones. As a freelancer, he has helped create beautiful visualizations for clients including the OECD, Microsoft Research, and Wincor Nixdorf. As a trainer for data visualization development, he holds workshops providing both a scientific and a practical background. Dominikus is a regular speaker at various academic and industry conferences. He holds a PhD in media informatics from the University of Munich (Ludwig-Maximilians-Universität), where his research focused on making our growing personal databases of media, status updates, and messages manageable.

Presentations

Data futures: Exploring the everyday implications of increasing access to our personal data Session

Increasing access to our personal data raises profound moral and ethical questions. Daniel Goddemeyer and Dominikus Baur share the findings from Data Futures, an MFA class in which students observed each other through their own data, and demonstrate the results with a live experiment with the audience that showcases some of the effects when personal data becomes accessible.

Michael Beal is the CEO of Data Capital Management, a systematic hedge fund that uses machine learning, advanced technologies, and novel data sources to generate differentiated and uncorrelated returns for the clients of its DCM A.I. Absolute Return Fund. Previously, Michael cofounded the big data and advanced analytics group at JPMorgan Chase and was an investor with TPG Capital and Morgan Stanley. A frequent keynote speaker and CNBC contributor, he is passionate about investing, the onset of the “data economy,” and the application of disruptive technologies to valuable “problems to solve.” Michael earned a BA from Harvard College with honors in economics and an MBA from Harvard Business School with distinction. He is an active board member of the City University of New York’s Medgar Evers College.

Presentations

Delivering alpha: Artificial intelligence in capital markets investing Findata

Michael Beal, CEO, Data Capital Management

Roy Ben-Alta is a solution architect and principal business development manager at Amazon Web Services, where he focuses on AI and real-time streaming technologies, working with AWS customers to build data-driven products (whether batch or real time) and create solutions powered by ML in the cloud. Roy has worked in the data and analytics industry for over a decade and has helped hundreds of customers bring compelling data-driven products to market. He serves on the advisory board of applied mathematics and data science at Post University in Connecticut. Roy holds a BSc in information systems and an MBA from the University of Georgia.

Presentations

Creating a serverless real-time analytics platform powered by machine learning in the cloud Session

Speed matters. Today, decisions are made based on real-time insights, but in order to support the substantial growth of streaming data, companies are required to innovate. Roy Ben-Alta and Allan MacInnis explore AWS solutions powered by machine learning and artificial intelligence.

Tim Berglund is a teacher, author, and technology leader with Confluent, where he serves as the senior director of developer experience. Tim can frequently be found speaking at conferences in the United States and internationally. He is the copresenter of various O’Reilly training videos on topics ranging from Git to distributed systems and is the author of Gradle Beyond the Basics. He tweets as @tlberglund, blogs very occasionally at Timberglund.com, and is the cohost of the DevRel Radio Podcast. He lives in Littleton, Colorado, with the wife of his youth and their youngest child, the other two having mostly grown up.

Presentations

Ask me anything: Apache Kafka as a streaming platform Ask Me Anything

Tim Berglund answers your burning questions about Kafka architecture, the Streams API, KSQL, and message-based microservices integration. Even if you don't have a question of your own, stop by to hear what other people are asking.

Heraclitus, the metaphysics of change, and Kafka Streams Session

Tim Berglund offers a thorough introduction to the Streams API, an important recent addition to Kafka that lets us build sophisticated stream processing systems that are as scalable and fault tolerant as Kafka itself—and also happen to align quite well with the microservices sensibilities that are so common in contemporary architectural thinking.

Kristina Bergman is the founder and CEO of Integris Software. A reformed venture capitalist turned third-time entrepreneur, Kristina was previously a principal at Ignition Partners, where she led investments in cloud tech, big data, security, and IoT devices and served on the boards of Trifacta, Bromium, Nymi, Apprenda, and Tellwise. She has also held a variety of leadership roles in product, marketing, and partnering at Microsoft, Business Objects, and Crystal Decisions.

Presentations

Data privacy laws, their risks, and real-world solutions DCS

The EU’s General Data Protection Regulation (GDPR) fines companies up to 4% of worldwide revenue for violations of people’s data privacy. Privacy experts expect this to have the same impact on all industries as SOX compliance had on the financial sector. Kristina Bergman shares customer stories about how CIOs, CSOs, and CPOs are solving the challenges presented by emerging data privacy laws.

Sandeep Bhadra is a partner at Vertex Ventures, where he focuses primarily on cloud infrastructure, data-driven business applications, and cybersecurity. Previously, Sandeep was a principal at Menlo Ventures, where his investments included Platform9, Signifyd, Unravel Data, and Clarifai; served on Cisco’s corporate development team, focusing on Cisco’s investments in MapR, Platfora, and Moogsoft and its acquisitions of Metacloud, tail-f, and Memoir Systems; and worked at Texas Instruments’ R&D Center, where he designed protocols for 4G/LTE wireless networks and later led a small team to spec out the first software-defined network switch chip. Sandeep holds a BTech from IIT Madras and a PhD from the University of Texas at Austin, both in electrical engineering, and an MBA from INSEAD.

Presentations

Where the puck is headed: A VC panel discussion Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road.

Rebecca Bilbro is the lead data scientist at ByteCubed, where she and her team use machine learning and Python to build custom data solutions for commercial and government clients. Rebecca is adjunct faculty at Georgetown University, an emeritus board member of Data Community DC, and a coauthor of Applied Text Analysis with Python (O’Reilly). A partner at District Data Labs, a DC-based open source collaborative, in her free time she enjoys collaborating with local developers on inclusive, high-impact open source software projects like Scikit-Yellowbrick. Rebecca earned her doctorate from the University of Illinois, Urbana-Champaign, where her research centered on communication and visualization practices in engineering.

Presentations

Meet the Expert with Rebecca Bilbro (ByteCubed) Meet the Experts

Rebecca will be on hand to share practical advice on text analytics, suggest open source machine learning packages, and chat about real-world use cases as they relate to text data and NLP in everyday applications.

Charles Boicey is the chief innovation officer for Clearsense, a healthcare analytics organization specializing in bringing big data technologies to healthcare. Previously, Charles was the enterprise analytics architect for Stony Brook Medicine, where he developed the analytics infrastructure to serve the clinical, operational, quality, and research needs of the organization. He was a founding member of the team that developed the Health and Human Services award-winning application NowTrending to assist in the early detection of disease outbreaks by utilizing social media feeds. Charles is a former president of the American Nursing Informatics Association.

Presentations

Spark clinical surveillance: Saving lives and improving patient care Session

Charles Boicey explains how Clearsense uses Spark Streaming to provide real-time updates to healthcare providers for critical healthcare needs, helping clinicians make timely decisions by assessing a patient's risk based on streaming physiological monitoring and diagnostic data along with the patient's historical record.

Matt Bolte is a technical expert at Walmart. Matt has 19 years of IT experience, five of them spent working with large, secure enterprise Hadoop clusters.

Presentations

An authenticated journey through big data security at Walmart Session

In today’s world of data breaches and hackers, security is one of the most important components for big data systems, but unfortunately, it's usually the area least planned and architected. Matt Bolte and Toni LeTempt share Walmart's authentication journey, focusing on how decisions made early can have significant impact throughout the maturation of your big data environment.

Tobi Bosede is a machine learning engineer at Johns Hopkins University. Previously, she was a reviewer for Bayesian Methods for Hackers and taught R programming at Johns Hopkins University and Python programming for General Assembly. Tobi’s professional work spans multiple industries, from telecom at Sprint to finance at JPMorgan. She holds a bachelor’s degree in mathematics from the University of Pennsylvania and a master’s in applied mathematics and statistics from Johns Hopkins University.

Presentations

Big data analysis of futures trades Session

Whether an entity seeks to create trading algorithms or mitigate risk, predicting trade volume is an important task. Focusing on futures trading and relying on Apache Spark to process the large amount of data involved, Tobi Bosede considers the use of penalized regression splines for trade volume prediction and the relationship between price volatility and trade volume.

danah boyd is the founder and president of Data & Society, a research institute focused on understanding the role of data-driven technologies in society, a principal researcher at Microsoft Research, and a visiting professor in NYU’s Interactive Telecommunications Program. danah’s research focuses on the intersection of technology, society, and policy. She is currently doing work on questions related to bias in big data and artificial intelligence, how people negotiate privacy and publicity, and the social ramifications of using data in education, criminal justice, labor, and public life. For over a decade, she examined how American youth incorporate social media into their daily practices in light of different fears and anxieties that the public has about young people’s engagement with technologies like MySpace, Facebook, Twitter, YouTube, Instagram, and texting. She has researched a plethora of teen issues, ranging from privacy to bullying, racial inequality, and sexual identity. Her early findings were published in Hanging Out, Messing Around, and Geeking Out: Kids Living and Learning with New Media. Her 2014 monograph, It’s Complicated: The Social Lives of Networked Teens, has received widespread praise from scholars, parents, and journalists and has been translated into seven languages. This work was funded by both the MacArthur Foundation and Microsoft Research. Her most recent collaborative book project, Participatory Culture in a Networked Era, with Mimi Ito and Henry Jenkins, reflects on how digital participation has shaped different parts of society. Her work has been profiled by numerous publications, including the New York Times, Fast Company, the Boston Globe, and Forbes, and published in a wide range of scholarly venues.

In 2010, danah won the CITASA Award for Public Sociology. The Financial Times dubbed her “the high priestess of internet friendship,” Fortune magazine identified her as the smartest academic in tech, and Technology Review named her one of 2010’s young innovators under 35. danah was a 2011 Young Global Leader of the World Economic Forum and is a member of the Council on Foreign Relations. She is a director of both Crisis Text Line and the Social Science Research Council and a trustee of the National Museum of the American Indian. She sits on advisory boards for the Electronic Privacy Information Center, Brown University’s Department of Computer Science, and the School of Information at the University of Michigan. She was a commissioner on the 2008–2009 Knight Commission on Information Needs of Communities in a Democracy. From 2009 to 2013, danah served on the World Economic Forum’s Social Media Global Agenda Council. At the Berkman Center, she codirected the Internet Safety Technical Task Force in 2008 with John Palfrey and Dena Sacco to work with companies and nonprofits to identify potential technical solutions for keeping children safe online. More recently, she codirected the Youth Media and Policy Working Group with John Palfrey and Urs Gasser, funded by the MacArthur Foundation from 2009 to 2011. In 2012, she and John Palfrey also helped the Born This Way Foundation and the MacArthur Foundation develop a research strategy to help empower youth to address meanness and cruelty. She is one of the hosts of the annual Data & Civil Rights Conference. Since 2015, she has also served on the US Commerce Department’s Data Advisory Council. She also created and managed a large online community for V-Day, a nonprofit organization working to end violence against women and girls worldwide. She has advised numerous other companies, sits on corporate, education, conference, and nonprofit advisory boards, and regularly speaks at a wide variety of conferences and events.
danah holds a bachelor’s degree in computer science from Brown University (under Andy van Dam), a master’s degree in sociable media from the MIT Media Lab (under Judith Donath), and a PhD in information from the University of California, Berkeley (under Peter Lyman and Mimi Ito). She has worked as an ethnographer and social media researcher for various corporations, including Intel, Tribe.net, Google, and Yahoo. She blogs at Zephoria.org/thoughts/ and tweets as @zephoria.

Presentations

Ask me anything: Data & Society Ask Me Anything

Data & Society's danah boyd and Madeleine Elish answer your questions and discuss topics such as the manipulation of data-driven and AI technologies, humans in the loop in automated systems, and the future of work.

Your data is being manipulated. Keynote

The more that we rely on data to train our models and inform our systems, the more that this data becomes a target for those seeking to manipulate algorithmic systems and undermine trust in data. danah boyd explores how systems are being gamed, how data is vulnerable, and what we need to do to build technical antibodies.

David Boyle leads the insight team at BBC Worldwide, the commercial and global wing of the BBC, where he helps transform BBC Worldwide's relationship with its audience by building premium, industry-leading insight capabilities into consumers, BBC brands, and the market to determine what connects with audiences emotionally and inspires them. David has spent the last seven years constructing global insight capabilities for the publishing and music industries, widely acknowledged as having helped those industries make quicker, smarter, and bolder decisions for their brands. Previously, he was SVP of consumer insight at HarperCollins Publishers, where he helped the company better understand consumer behavior and attitudes toward books, authors, book discovery, and purchase, and worked at EMI Music, where he delivered insight to all parts of the business in more than 25 countries and helped shift the organization’s decision making at all levels, from artist signing to product and brand development plans for EMI’s biggest artists, including the Beatles and Pink Floyd.

Presentations

From the weeds to the stars: How and why to think about bigger problems Session

Too many brilliant analytical minds are wasted on interesting but ultimately less-impactful problems. They are stuck in the weeds of the data or the challenges of our day to day. Too few ask what it means to reach for the stars—the big, shiny, business-changing issues. David Boyle explains why you must start asking bigger questions and making a bigger difference.

Katherine Boyle is an investor at General Catalyst, an early-stage venture capital firm with $3.7B under management, where she focuses on investments in highly regulated industries, including government, defense, aerospace, and autonomous mobility. Previously, she was a staff reporter at the Washington Post, covering creative industries, consumer retail, government accountability, and weird subcultures—the latter preparing her most for a career in venture capital. Katherine holds an MBA from Stanford’s Graduate School of Business, where she was research assistant to Condoleezza Rice for her course and upcoming book Managing Political Risk. She’s a graduate of Georgetown University and holds a master’s degree in public advocacy from the National University of Ireland, Galway.

Presentations

Where the puck is headed: A VC panel discussion Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road.

Claudiu Branzan is the director of data science at G2 Web Services, where he designs and implements data science solutions to mitigate merchant risk, leveraging his 10+ years of machine learning and distributed systems experience. Previously, Claudiu worked for Atigeo building big data and data science-driven products for various customers.

Presentations

Natural language understanding at scale with spaCy, Spark ML, and TensorFlow Tutorial

Natural language processing is a key component in many data science systems that must understand or reason about text. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP using spaCy for building annotation pipelines, TensorFlow for training custom machine-learned annotators, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

Richard Brath is a partner at Uncharted Software. Richard has been designing and building innovative information visualizations for 20 years, ranging from one of the first interactive 3D financial visualizations on the web in 1995 to visualizations embedded in financial data systems used every day by thousands of market professionals. Richard is pursuing a PhD in new data visualization techniques at LSBU.

Presentations

Text analytics and new visualization techniques Session

Text analytics are advancing rapidly, and new visualization techniques for text are providing new capabilities. Richard Brath and Scott Langevin offer an overview of these new ways to organize massive volumes of text, characterize subjects, score synopses, and skim through lots of documents.

Mikio Braun is delivery lead for recommendation and search at Zalando, one of the biggest European fashion platforms. Mikio holds a PhD in machine learning and worked in research for a number of years before becoming interested in putting research results to good use in industry.

Presentations

Deep learning in practice Session

Deep learning has become the go-to solution for many application areas, such as image classification or speech processing, but does it work for all application areas? Mikio Braun offers background on deep learning and shares his practical experience working with these exciting technologies.

What is deep learning? Data 101

Mikio Braun reviews recent advances in deep learning, highlighting the kinds of problems deep learning can solve and the architectures used in different contexts. Mikio also covers the mechanics underlying the learning process of these systems and offers an overview of the technological advances like GPU computing, which have made the recent progress in this area possible.

Tamara Broderick is the ITT Career Development Assistant Professor in the Department of Electrical Engineering and Computer Science at MIT. Tamara’s recent research is focused on developing and analyzing models for scalable Bayesian machine learning, especially Bayesian nonparametrics. She is a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), the MIT Statistics and Data Science Center, and the Institute for Data, Systems, and Society (IDSS). Tamara has been awarded a Google faculty research award, the ISBA Lifetime Members Junior Researcher Award, the Savage Award (for an outstanding doctoral dissertation in Bayesian theory and methods), the Evelyn Fix Memorial Medal and Citation (for the PhD student on the Berkeley campus showing the greatest promise in statistical research), the Berkeley fellowship, an NSF Graduate Research Fellowship, a Marshall Scholarship, and the Phi Beta Kappa Prize (for the graduating Princeton senior with the highest academic average). She holds a PhD in statistics from the University of California, Berkeley, completed under Michael I. Jordan, an AB in mathematics from Princeton University, a master of advanced study for completion of Part III of the Mathematical Tripos from the University of Cambridge, an MPhil by research in physics from the University of Cambridge, and an MS in computer science from the University of California, Berkeley.

Presentations

Bayesian machine learning: Quantifying uncertainty and robustness at scale HDS

Tamara Broderick demonstrates new advances in computation for Bayesian machine learning that allow reliable quantification of uncertainty and robustness at modern data scales, illustrated with examples in microcredit and online advertising.

Kalah Brown is a senior Hadoop engineer at Big Fish Games, where she is responsible for the technical leadership and development of big data solutions. Previously, Kalah was a consultant in the greater Seattle area and worked with numerous companies, including Disney, Starbucks, the Bill and Melinda Gates Foundation, Microsoft, and Premera Blue Cross. She has 17 years of experience in software development, data warehousing, and business intelligence.

Presentations

Working within the Hadoop ecosystem to build a live-streaming data pipeline Session

Companies are increasingly interested in processing and analyzing live-streaming data. The Hadoop ecosystem includes platforms and software library frameworks to support this work, but these components require correct architecture, performance tuning, and customization. Stephen Devine and Kalah Brown explain how they used Spark, Flume, and Kafka to build a live-streaming data pipeline.

Kurt Brown leads the data platform team at Netflix, which architects and manages the technical infrastructure underpinning the company’s analytics, including various big data technologies like Hadoop, Spark, and Presto; Netflix’s open source applications and services, such as Genie and Lipstick; and traditional BI tools, including Tableau and Redshift.

Presentations

20 Netflix-style principles and practices to get the most out of your data platform Session

Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter.

Joanna J. Bryson is a transdisciplinary researcher on the structure and dynamics of human- and animal-like intelligence. Her research covers topics ranging from artificial intelligence through autonomy and robot ethics to human cooperation and has appeared in venues ranging from Reddit to Science. Joanna is a professor in the Department of Computer Science at the University of Bath, where she founded and for several years led the Intelligent Systems research group. Joanna is also affiliated with Bath’s Institutes for Policy Research and Mathematical Innovation, as well as their Centres for Networks and Collective Behaviour and for Digital Entertainment. She has held visiting academic positions with Princeton’s Center for Information Technology Policy (where she is still affiliated), the Mannheim Centre for Social Science Research, the Department of Anthropology at Oxford, where she worked on Harvey Whitehouse’s Explaining Religion project, the Methods & Data Institute at Nottingham, doing agent-based modeling in political science, and the Konrad Lorenz Institute for Evolution & Cognition Research in Austria, where she researched the biological origins of culture. She has conducted academic research in Edinburgh’s Human Communication Research Centre and Harvard’s Department of Psychology. Outside of academia, Joanna has worked in Chicago’s financial industry, an international organization management consultancy, and industrial AI research. Joanna has served on the senate, council, and court for the University of Bath, representing the academic assembly. She is currently a member of the College of the British Engineering and Physical Sciences Research Council (EPSRC) and serves as a member of the editorial board for several academic journals, including Adaptive Behaviour, AI & Society, Connection Science, and the International Journal of Synthetic Emotions. 
Joanna holds a degree in behavioural science (nonclinical psychology) from Chicago, an MSc in artificial intelligence and an MPhil in psychology from Edinburgh, and a PhD in artificial intelligence from MIT.

Presentations

The real project of AI ethics Keynote

AI has been with us for hundreds of years; there's no "singularity" step change. Joanna Bryson explains that the main threat of AI is not that it will do anything to us but what we are already doing to each other with it—predicting and manipulating our own and others' behavior.

Brandon Bunker is the senior director of artificial intelligence at Vivint, where he and his team have developed the world’s first smart home assistant that truly understands home occupancy, helping Vivint’s customers save money, energy, and time. In the past year, he scaled Vivint’s Smart Assistant from 0 to 700,000+ customers and won an editor’s choice award at CES. Brandon is passionate about using new tools and techniques to create value from data. His specialties include the IoT, big data, data science, online marketing, direct marketing analytics, social analytics, mobile analytics, segmentation, and web analytics.

Presentations

How Vivint Smart Home made home security and automation even smarter with Tableau (sponsored by Tableau) Session

Brandon Bunker explains how Vivint delivers fast analytics from big data on a bootstrap budget by leveraging Tableau as a strategic piece of its modern BI architecture. By interactively analyzing data as it lands in its Cloudera Hadoop data lake, Vivint is able to deliver security across homes and data alike, making smart homes even smarter and saving customers money in the process.

Ellsworth (Ells) Campbell is a health scientist in the Laboratory Branch of the Division of HIV/AIDS Prevention at the CDC. Ells began working at the CDC as a PhD student and Oak Ridge Institute for Science and Education (ORISE) fellow and recently transitioned to a full-time associate service fellowship. Ells holds bachelor’s and master’s degrees in biology from UC San Diego and is currently pursuing a PhD in biology at Penn State University.

Presentations

Tracking the opioid-fueled HIV outbreak with big data (sponsored by Trifacta) Session

Ells Campbell, Connor Carreras, and Ryan Weil explain how the Microbial Transmission Network Team (MTNT) at the Centers for Disease Control and Prevention (CDC) is leveraging new techniques in data collection, preparation, and visualization to advance the understanding of the spread of HIV/AIDS.

Marc Carlson is a lead computational biologist in research informatics at Seattle Children’s Research Institute. Marc divides his time between helping architect new cloud-based infrastructure to serve the scientists at SCRI, working to make sure that new compute resources are brought online and properly configured for immediate utility, and helping users with their data and analysis needs via the Bioinformatics Unit, the goal of which is to make sure that scientists at SCRI can learn the most from their data. Marc’s contributions include creating and running training courses, offering periodic consultations, and helping with the bioinformatics user group. Previously, he held a postdoc in computational biology at UCLA and worked on the Bioconductor core team at the Fred Hutchinson Cancer Research Center, where he served the needs of the R-based computational biology community. Marc holds a BS in genetics and cell biology from Washington State University and a PhD in developmental and cell biology from UC Irvine.

Presentations

Project Rainier: Saving lives one insight at a time Session

Marc Carlson and Sean Taylor offer an overview of Project Rainier, which leverages the power of HDFS and the Hadoop and Spark ecosystem to help scientists at Seattle Children’s Research Institute quickly find new patterns and generate predictions that they can test later, accelerating important pediatric research and increasing scientific collaboration by highlighting where it is needed most.

Connor Carreras is Trifacta’s manager for customer success in the Americas, where she helps customers use cutting-edge data wrangling techniques in support of their big data initiatives. Connor brings her prior experience in the data integration space to help customers understand how to adopt self-service data preparation as part of an analytics process. She is a coauthor of the O’Reilly book Principles of Data Wrangling.

Presentations

Tracking the opioid-fueled HIV outbreak with big data (sponsored by Trifacta) Session

Ells Campbell, Connor Carreras, and Ryan Weil explain how the Microbial Transmission Network Team (MTNT) at the Centers for Disease Control and Prevention (CDC) is leveraging new techniques in data collection, preparation, and visualization to advance the understanding of the spread of HIV/AIDS.

Michelle Casbon is director of data science at Qordoba. Michelle’s development experience spans more than a decade across various industries, including media, investment banking, healthcare, retail, and geospatial services. Previously, she was a senior data science engineer at Idibon, where she built tools for generating predictions on textual datasets. She loves working with open source projects and has contributed to Apache Spark and Apache Flume. Her writing has been featured in the AI section of O’Reilly Radar. Michelle holds a master’s degree from the University of Cambridge, focusing on NLP, speech recognition, speech synthesis, and machine translation.

Presentations

How machine learning with open source tools helps everyone build better products Session

Michelle Casbon explores the machine learning and natural language processing that enables teams to build products that feel native to every user and explains how Qordoba is tackling the underserved domain of localization using open source tools, including Kubernetes, Docker, Scala, Apache Spark, Apache Cassandra, and Apache PredictionIO (incubating).

Tanya Cashorali is the founding partner of TCB Analytics, a Boston-based data consultancy. Previously, she worked as a data scientist at Biogen. Tanya started her career in bioinformatics and has applied her experience to other data-rich verticals such as telecom, finance, and sports. She brings over 10 years of experience using R in data scientist roles as well as managing and training data analysts, and she’s helped grow a handful of Boston startups.

Presentations

How to hire and test for data skills: A one-size-fits-all interview kit Session

Given the recent demand for data analytics and data science skills, adequately testing and qualifying candidates can be a daunting task. Interviewing hundreds of individuals of varying experience and skill levels requires a standardized approach. Tanya Cashorali explores strategies, best practices, and deceptively simple interviewing techniques for data analytics and data science candidates.

Sarah Catanzaro is an investor at Canvas Ventures, where she focuses on analytics, data infrastructure, and machine intelligence. Sarah has several years of experience in developing data acquisition strategies and leading machine and deep learning-enabled product development at organizations of various sizes. Most recently, she led the data team at Mattermark to collect and organize information on over one million private companies. Previously, she implemented analytics solutions for municipal and federal agencies as a consultant at Palantir and as an analyst at Cyveillance. She also led projects on adversary behavioral modeling and Somali pirate network analysis as a program manager at the Center for Advanced Defense Studies. Sarah holds a BA in international security studies from Stanford University.

Presentations

Where the puck is headed: A VC panel discussion Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road.

Simon Chan is a senior director of product management for Salesforce Einstein, where he oversees platform development and delivers products that empower everyone to build smarter apps with Salesforce. Simon is a product innovator and serial entrepreneur with more than 14 years of global technology management experience in London, Hong Kong, Guangzhou, Beijing, and the Bay Area. Previously, Simon was the cofounder and CEO of PredictionIO, a leading open source machine learning server (acquired by Salesforce). Simon holds a BSE in computer science from the University of Michigan, Ann Arbor, and a PhD in machine learning from University College London.

Presentations

The journey to Einstein: Building a multitenancy AI platform that powers hundreds of thousands of businesses Session

Salesforce recently released Einstein, which brings AI into its core platform to power every business. The secret behind Einstein is an underlying platform that accelerates AI development at scale for both internal and external data scientists. Simon Chan shares his experience building this unified platform for a multitenancy, multibusiness cloud enterprise.

Bala has spent over 25 years at the intersection of banking and technology, with a significant background in designing, building, and delivering enterprise-grade platforms and solutions across the globe.

Bala's current focus at Barclays is standing up a set of transformative data platforms that help put data at the heart of the bank's customer-centric strategy. This work spans a range of tools and technologies cutting across big data, NoSQL, real-time capabilities, and cloud adoption.

Presentations

Enabling data science self-service with the Elastic Data Platform (sponsored by Dell EMC) Session

Barclays and Dell EMC have partnered on the deployment of a solution called the Elastic Data Platform. Ankit Tharwani offers an overview of this platform, which gives data scientists the ability to self-serve sandbox environments, cutting down the time to provision environments from months to hours.

Cheng Chang is a software engineer at Alluxio and the fourth-highest contributor to the Alluxio open source project. Cheng is also the main developer of Alluxio Manager. He has presented talks at Strata Beijing, Spark Summit, and other leading industry events. He holds a degree in computer science from Tsinghua University.

Presentations

Best practices for using Alluxio with Spark Session

Alluxio (formerly Tachyon) is a memory-speed virtual distributed storage system that leverages memory for managing data across different storage systems. Many deployments use Alluxio with Spark because Alluxio helps Spark further accelerate applications. Haoyuan Li and Cheng Chang explain how Alluxio makes Spark more effective and share production deployments of Alluxio and Spark working together.

Karim Chine is a London-based software architect and entrepreneur and the author and designer of RosettaHUB. Previously, he held positions within academic research laboratories and industrial R&D departments, including Imperial College London, EBI, IBM, and Schlumberger. Karim’s interests include large-scale distributed software design, cloud computing applications in research and education, open source software ecosystems, and open science. Since 2009, he has collaborated with the European Commission as an independent expert for the Research E-infrastructure Program and for the Future and Emerging Technologies Program. He has also served as an evaluator and a reviewer of many of the EU’s flagship projects related to grids, desktop grids, scientific clouds, and science gateways. Karim holds degrees from Ecole Polytechnique and Telecom ParisTech.

Presentations

rosettaHUB: A global hub for reproducible and collaborative data science Session

Karim Chine offers an overview of rosettaHUB—which aims to establish a global open data science metacloud centered on usability, reproducibility, auditability, and shareability—and shares the results of the rosettaHUB/AWS Educate initiative, which involved 30 higher education institutions and research labs and over 3,000 researchers, educators, and students.

Jike Chong is the chief data scientist at Acorns, the leading microinvestment app in the US, with over two million verified investors, which uses economic psychology to help the up-and-coming save and invest for a better financial future. Previously, Jike was the chief data scientist at Yirendai, an online P2P lending platform with more than $7B in loans originated and the first of its kind from China to go public on the NYSE; established and headed the data science division at Simply Hired, a leading job search engine in Silicon Valley; advised the Obama administration on using AI to reduce unemployment; and led quantitative risk analytics at Silver Lake Kraftwerk, where he was responsible for applying big data techniques to risk analysis of venture investments. Jike is also an adjunct professor and PhD advisor in the Department of Electrical and Computer Engineering at Carnegie Mellon University, where he established the CUDA Research Center and CUDA Teaching Center, which focus on the application of GPUs for machine learning. Recently, he also developed and taught a new graduate-level course on machine learning for internet finance at Tsinghua University in Beijing, China, where he serves as an adjunct professor. He holds bachelor’s and master’s degrees in electrical and computer engineering from Carnegie Mellon University and a PhD from the University of California, Berkeley. He holds 10 patents (six granted, four pending).

Presentations

Deploying AI in mobile-first consumer-facing financial products: A tale of two cycles Findata

AI is moving into the heart of the financial business model. Jike Chong discusses two fundamental business cycles in a financial institution, acquiring customers and sustaining customer relationships, and highlights opportunities in six areas where AI technologies can be readily deployed, along with reference use cases.

Dhruv Choudhary is a research scientist at MZ, where he is researching stream anomaly detection algorithms for time series analysis and computer vision. Previously, Dhruv worked in the connected car space building data products around driver aggression, car behavior, and risk analysis. He holds a master’s degree from Georgia Tech, where he focused on applying control theory techniques to systems problems; his thesis formulated energy efficient thread scheduling for asymmetric architectures as an optimal control problem.

Presentations

Anomaly detection on live data Session

Services such as YouTube, Netflix, and Spotify popularized streaming in different industry segments, but these services do not center around live data—best exemplified by sensor data—which will be increasingly important in the future. Arun Kejariwal, Francois Orsini, and Dhruv Choudhary demonstrate how to leverage Satori to collect, discover, and react to live data feeds at ultralow latencies.

Michael Chui is a San Francisco-based partner in the McKinsey Global Institute, where he directs research on the impact of disruptive technologies, such as big data, social media, and the internet of things, on business and the economy. Previously, as a McKinsey consultant, Michael served clients in the high-tech, media, and telecom industries on multiple topics. Prior to joining McKinsey, he was the first chief information officer of the City of Bloomington, Indiana, and was the founder and executive director of HoosierNet, a regional internet service provider. Michael is a frequent speaker at major global conferences and his research has been cited in leading publications around the world. He holds a BS in symbolic systems from Stanford University and a PhD in computer science and cognitive science and an MS in computer science, both from Indiana University.

Presentations

Executive Briefing: Artificial intelligence—The next digital frontier? Session

After decades of extravagant promises, artificial intelligence is finally starting to deliver real-life benefits to early adopters. However, we're still early in the cycle of adoption. Michael Chui explains where investment is going, patterns of AI adoption and value capture by enterprises, and how the value potential of AI across sectors and business functions is beginning to emerge.

Eric Colson is chief algorithms officer at Stitch Fix as well as an advisor to several big data startups. Previously, Eric was vice president of data science and engineering at Netflix. He holds a BA in economics from SFSU, an MS in information systems from GGU, and an MS in management science and engineering from Stanford.

Presentations

Differentiating by data science Session

While companies often use data science as a supportive function, the emergence of new business models has made it possible for some companies to differentiate via data science. Eric Colson explores what it means to differentiate by data science and explains why companies must now think very differently about the role and placement of data science in the organization.

Meet the Expert with Eric Colson (Stitch Fix) Meet the Experts

Stop by to meet Eric to get ideas on how to manage your data science team to foster innovation and how to transform a data science team that merely serves a supportive role into one that leads with data science.

Retail's panacea: How machine learning is driving product development Session

Karen Moon, Jared Schiffman, Eric Colson, and Catherine Twist explore how the retail industry is embracing data to include consumers in the design and development process, tackling the challenges associated with the wealth of sources and the unstructured nature of the data they handle and process and how the data is turned into insights that are digestible and actionable.

Riccardo Gianpaolo Corbella is a Milan-based consulting big data engineer at Data Reply IT, where he develops effective big data solutions based on open source technologies. Riccardo is interested in data mining and distributed systems. He holds a BSc and an MSc in computer science from the Università degli Studi di Milano.

Presentations

How an Italian company rules the world of insurance: Facing the technological challenges of turning data into value Session

With more than 4.5 million black boxes, Italian car insurance has the most telematics clients in the world. Riccardo Corbella and Beniamino Del Pizzo explore the data management challenges that occur in a streaming context when the amount of data to process is gigantic and share a data management model capable of providing the scalability and performance needed to support massive growth.

George Corugedo is a cofounder and chief technology officer at RedPoint Global, where he is responsible for directing the development of the RedPoint Customer Engagement Hub, RedPoint’s leading enterprise customer engagement solution. A former math professor and seasoned technology executive, George has more than two decades of business and technical experience. He left academia to cofound Accenture’s Customer Insights practice, which specializes in strategic data utilization, analytics, and customer strategy. Previously, he was also director of client delivery at ClarityBlue, a provider of hosted customer intelligence solutions, and COO and CIO of Riscuity, a receivables management company that specializes in using analytics to drive collections.

Presentations

Using real-time machine learning and big data to drive customer engagement and digital transformation (sponsored by RedPoint Global) Session

Driving digital transformation is a vital component of continued organizational success and more personalized customer engagement. The best results will come from operationalizing data to automate decisions with machine learning. George Corugedo explains how RedPoint’s customers use connected enterprise data, machine learning, and analytics to impact their businesses.

Dustin Cote is a customer operations engineer at Confluent. Over his career, Dustin has worked in a variety of roles from Java developer to operations engineer. His most recent focus is distributed systems in the big data ecosystem, with Apache Kafka being his software of choice.

Presentations

Mistakes were made, but not by us: Lessons from a year of supporting Apache Kafka Session

Dustin Cote shares his experience troubleshooting Apache Kafka in production environments and explains how to avoid pitfalls like message loss or performance degradation in your environment.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Findata welcome Tutorial

Alistair Croll and Rob Passarella welcome you to Findata Day.

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Bradford Cross is a founding partner at DCVC, a leading machine learning and big data venture capital fund. Previously, Bradford founded Prismatic, which used machine learning for personalized content ranking and natural language processing for topic classification, and FlightCaster, which used machine learning to predict the real-time state of the global air traffic network using FAA, carrier, and weather data. A hedge fund investor and a venture investor, Bradford started his career working on statistical value and momentum strategies at O’Higgins Asset Management and was a founding partner of Data Collective. He was also a systems engineer and worked on distributed systems at Google. Bradford studied computer engineering and finance at Virginia Tech and mathematics at Berkeley.

Presentations

Accelerating the next generation of data companies Session

This panel brings together partners from some of the world’s leading startup accelerators and founders of up-and-coming enterprise data startups to discuss how we can help create the next generation of successful enterprise data companies.

How machine learning is used in fintech Findata

Bradford Cross offers an overview of machine learning applications in the financial services sector across banking, insurance, investments, real estate, and consumer financial services and contrasts these approaches with traditional quant finance.

Michael Crutcher is the director of product management at Cloudera, where he is responsible for the direction of Cloudera’s storage products, which include HDFS, HBase, and Parquet. He’s also responsible for managing strategic partnerships with storage vendors.

Presentations

The sunset of lambda: New architectures amplify IoT impact Session

A long time ago in a data center far, far away, we deployed complex lambda architectures as the backbone of our IoT solutions. Though hard, they enabled collection of real-time sensor data and slightly delayed analytics. Michael Crutcher and Ryan Lippert explain why Apache Kudu, a relational storage layer for fast analytics on fast data, is the key to unlocking the value in IoT data.

Nick Curcuru is vice president of enterprise information management at Mastercard, where he is responsible for leading a team that works with organizations to generate revenue through smart data, architect next-generation technology platforms, and protect data assets from cyberattacks by leveraging Mastercard's information technology and information security resources and creating peer-to-peer collaboration with their clients. Nick brings over 20 years of global experience successfully delivering large-scale advanced analytics initiatives for such companies as the Walt Disney Company, Capital One, Home Depot, Burlington Northern Railroad, Merrill Lynch, Nordea Bank, and GE. He frequently speaks on big data trends and data security strategy at conferences and symposiums, has published several articles on security, revenue management, and data security, and has contributed to several books on the topic of data and analytics.

Presentations

Architecting security across the enterprise: Instilling confidence and stewardship every step of the way Session

Cybersecurity is now a topic in the boardroom, as organizations are scrambling to increase their security posture. To decrease breach threats, Mastercard brings data security into its system design process. Nick Curcuru shares best practices and lessons learned protecting 160 million transactions per hour over Mastercard's network and securing 16+ petabytes of data at rest.

Paul Curtis is a principal solutions engineer at MapR, where he provides pre- and postsales technical support to MapR’s worldwide systems engineering team. Previously, Paul served as senior operations engineer for Unami, a startup founded to deliver on the promise of interactive TV for consumers, networks, and advertisers; systems manager for Spiral Universe, a company providing school administration software as a service; senior support engineer positions at Sun Microsystems; enterprise account technical management positions for both Netscape and FileNet; and roles in application development at Applix, IBM Service Bureau, and Ticketron. Paul got started in the ancient personal computing days; he began his first full-time programming job on the day the IBM PC was introduced.

Presentations

Why containers and microservices need streaming data Session

A microservices architecture benefits from the agility of containers for convenient, predictable deployment of applications, while persistent, performant message streaming makes both work better. Paul Curtis explores these infrastructure components and discusses the design of highly scalable real-world systems that take advantage of this powerful triad.

Doug Cutting is the chief architect at Cloudera and the founder of numerous successful open source projects, including Lucene, Nutch, Avro, and Hadoop. Doug joined Cloudera from Yahoo, where he was a key member of the team that built and deployed a production Hadoop storage-and-analysis cluster for mission-critical business analytics. Doug holds a bachelor’s degree from Stanford University and sits on the board of the Apache Software Foundation.

Presentations

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Brian Dalessandro is the head of data science at Zocdoc, an online doctor marketplace and booking tool, and an adjunct professor for the NYU Center for Data Science graduate program. Previously, Brian was vice president of data science at online advertising firm Dstillery. A veteran data scientist and leader with over 15 years of experience developing machine learning-driven practices and products, Brian holds several patents and has published dozens of peer-reviewed articles on the subjects of causal inference, large-scale machine learning, and data science ethics. Brian is also the drummer for the critically acclaimed indie rock band Coastgaard.

Presentations

Challenges in using machine learning to direct healthcare services Session

Zocdoc is an online marketplace that allows easy doctor discovery and instant online booking. However, dealing with healthcare involves many constraints and challenges that render standard approaches to common problems infeasible. Brian Dalessandro surveys the various machine learning problems Zocdoc has faced and shares the data, legal, and ethical constraints that shape its solution space.

Atul Dalmia is vice president of global information management at American Express, where he is responsible for leading the company’s data and platform strategy and driving innovation in acquisition, marketing, and servicing across the customer lifecycle and across channels. He is also responsible for accelerating development on American Express’s big data platform to drive innovation and speed to market while driving cost efficiencies for the enterprise. Atul holds a master’s degree from the Massachusetts Institute of Technology and a bachelor’s degree from the Indian Institute of Technology, Chennai.

Presentations

Enterprise digital transformation using big data Session

Big data decisioning is critical to driving real-time business decisions in our digital age. But how do you begin the transformation to big data? The key is enterprise adoption across a variety of end users. Atul Dalmia shares best practices learned from American Express's five-year journey, the biggest challenges you’ll face, and ideas on how to solve them.

Satish Varma Dandu is a data science and engineering manager at NVIDIA, where he leads teams that build massive end-to-end big data and deep learning platforms, handling billions of events per day for real-time analytics, data warehousing, and AI platforms using deep learning to improve the user experience for millions of users. Previously, Satish led data engineering teams at startups and large public companies. His areas of interest are in building large-scale engineering platforms, big data engineering, GPU data acceleration, and deep learning. Satish holds an MS in computer science from the University of Houston and is currently enrolled in the management program at Stanford University.

Presentations

GPU-accelerating a deep learning anomaly detection platform Session

How can deep learning be employed to create a system that monitors network traffic, operations data, and system logs to reliably flag risk and unearth potential threats? Satish Dandu, Joshua Patterson, and Michael Balint explain how to bootstrap a deep learning framework to detect risk and threats in operational production systems, using best-of-breed GPU-accelerated open source tools.

Shirshanka Das is a principal staff software engineer and the architect for LinkedIn’s analytics platforms and applications team. He was among the original authors of a variety of open and closed source projects built at LinkedIn, including Databus, Espresso, and Apache Helix. He is currently working with his team to simplify the big data analytics space at LinkedIn through a multitude of mostly open source projects, including Pinot, a high-performance distributed OLAP engine, Gobblin, a data lifecycle management platform for Hadoop, WhereHows, a data discovery and lineage platform, and Dali, a data virtualization layer for Hadoop.

Presentations

Taming the ever-evolving compliance beast: Lessons learned at LinkedIn Session

Shirshanka Das and Tushar Shanbhag explore the big data ecosystem at LinkedIn and share its journey to preserve member privacy while providing data democracy. Shirshanka and Tushar focus on three foundational building blocks for scalable data management that can meet data compliance regulations: a central metadata system, an integrated data movement platform, and a unified data access layer.

Michael Dauber is a general partner at Amplify Partners. Previously, Mike spent over six years at Battery Ventures, where he led early-stage enterprise investments on the West Coast, including Battery’s investment in a stealth security company that is also in Amplify’s portfolio. Mike has served on the boards of a number of companies, including Continuuity, Duetto, Interana, and Platfora. Mike’s investments include Splunk and RelateIQ, which was recently acquired by Salesforce. Mike began his career as a hardware engineer at a startup and held product, business development, and sales roles at Altera and Xilinx. Mike is a frequent speaker at conferences and is on the advisory board of both the O’Reilly Strata Conference and SXSW. He was named to Forbes magazine’s 2015 Midas Brink List. Mike holds a BS in electrical engineering from the University of Michigan in Ann Arbor and an MBA from the University of Pennsylvania’s Wharton School.

Presentations

Where the puck is headed: A VC panel discussion Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road.

Gerard de Melo is an assistant professor of computer science at Rutgers University, where he heads a team of researchers working on big data analytics, natural language processing, and web mining. Gerard's research projects include UWN/MENTA, one of the largest multilingual knowledge bases, and Lexvo.org, an important hub in the web of data. Previously, he was a faculty member at Tsinghua University, one of China's most prestigious universities, where he headed the Web Mining and Language Technology Group, and a visiting scholar at UC Berkeley, where he worked in the ICSI AI Group. He serves on the editorial boards of Computational Intelligence, the Journal of Web Semantics, the Springer journal Language Resources and Evaluation, and the Language Science Press TMNLP book series. Gerard has published over 80 papers, with best paper or demo awards at WWW 2011, CIKM 2010, ICGL 2008, and the NAACL 2015 Workshop on Vector Space Modeling, as well as an ACL 2014 best paper honorable mention, a best student paper award nomination at ESWC 2015, and a thesis award for his work on graph algorithms for knowledge modeling. He holds a PhD in computer science from the Max Planck Institute for Informatics.

Presentations

Learning meaning from web-scale big data HDS

How can we exploit the massive amounts of data now available on the web to enable more intelligent applications? Gerard de Melo shares results on applying deep learning techniques to web-scale amounts of data to learn neural representations of language and world knowledge. The resulting resources can be used in Spark to work with text in over 300 languages.

Beniamino Del Pizzo is a big data engineer at Data Reply IT, where he works on data ingest with a focus on Apache Kafka and Spark applications. Beniamino is passionate about big data, streaming applications, distributed computation, and data analysis. He holds a master’s degree in computer engineering; his thesis outlined an evolutionary approach to using Apache Spark with TSK-fuzzy systems for big data.

Presentations

How an Italian company rules the world of insurance: Facing the technological challenges of turning data into value Session

With more than 4.5 million black boxes, Italian car insurance has the most telematics clients in the world. Riccardo Corbella and Beniamino Del Pizzo explore the data management challenges that occur in a streaming context when the amount of data to process is gigantic and share a data management model capable of providing the scalability and performance needed to support massive growth.

Cesar Delgado is the Siri platform architect at Apple. He has also worked on iTunes, iCloud, News, and Maps. Previously, Cesar worked at various startups around Silicon Valley. He has been involved in the Apache Hadoop community since 2008.

Presentations

Journey to consolidation Keynote

Twenty years ago, a company implored us to “think different” about personal computers. Today, Apple continues to live and breathe that legacy. It’s evident in the machine learning and analytics architectures that power many of the company's most innovative applications. Cesar Delgado joins Mike Olson to discuss how Apple is using its big data stack and expertise to solve non-data problems.

Noemi Derzsy is a postdoctoral research associate at the Social Cognitive Network Academic Research Center at Rensselaer Polytechnic Institute, where she uses data sets to analyze, understand, and model complex systems using network science and data science techniques. She’s also a NASA datanaut. Noemi holds a PhD in physics.

Presentations

Topic modeling openNASA data Session

Open source data has enabled society to engage in community-based research and has provided government agencies with more visibility and trust from individuals. Noemi Derzsy offers an overview of the openNASA platform and discusses openNASA metadata analysis and tools for applying NLP and topic modeling techniques to understand open government dataset associations.

Stephen Devine is a Seattle-based data engineer at Big Fish Games, where he wrangles events sent from millions of mobile phones through Kafka into Hive. Previously, he did similar things for Xbox One Live Services using proprietary Microsoft technology and worked on several releases of Internet Explorer at Microsoft.

Presentations

Working within the Hadoop ecosystem to build a live-streaming data pipeline Session

Companies are increasingly interested in processing and analyzing live-streaming data. The Hadoop ecosystem includes platforms and software library frameworks to support this work, but these components require correct architecture, performance tuning, and customization. Stephen Devine and Kalah Brown explain how they used Spark, Flume, and Kafka to build a live-streaming data pipeline.

Thomas W. Dinsmore is director of product marketing for Cloudera Data Science. Previously, he served as a knowledge expert on the strategic analytics team at the Boston Consulting Group; director of product management for Revolution Analytics; analytics solution architect at IBM Big Data Solutions; and a consultant at SAS, PricewaterhouseCoopers, and Oliver Wyman. Thomas has led or contributed to analytic solutions for more than five hundred clients across vertical markets and around the world, including AT&T, Banco Santander, Citibank, Dell, J.C. Penney, Monsanto, Morgan Stanley, Office Depot, Sony, Staples, United Health Group, UBS, and Vodafone. His international experience includes work for clients in the United States, Puerto Rico, Canada, Mexico, Venezuela, Brazil, Chile, the United Kingdom, Belgium, Spain, Italy, Turkey, Israel, Malaysia, and Singapore.

Presentations

Data science at team scale: Considerations for sharing, collaborating, and getting to production Session

Data science alone is easy. Data science with others, whether in the enterprise or on shared distributed systems, requires a bit more work. Tristan Zajonc and Thomas Dinsmore discuss common technology considerations and patterns for collaboration in large teams and for moving machine learning into production at scale.

Leo Dirac is a principal engineer on the Amazon AI team at Amazon Web Services. Previously, he led the engineering team that launched the Amazon Machine Learning service. Leo has a background in physics. He started writing software professionally in the 1980s. In 2012, he became fascinated with deep learning and has been building systems with it ever since.

Presentations

Practical deep learning for understanding images Session

Leo Dirac demonstrates how to apply the latest deep learning techniques to semantically understand images. You'll learn what embeddings are, how to extract them from your images using deep convolutional neural networks (CNNs), and how they can be used to cluster and classify large datasets of images.

Mark Donsky leads data management and governance solutions at Cloudera. Previously, Mark held product management roles at companies such as Wily Technology, where he managed the flagship application performance management solution, and Silver Spring Networks, where he managed big data analytics solutions that reduced greenhouse gas emissions. He holds a BS with honors in computer science from the University of Western Ontario.

Presentations

A practitioner’s guide to Hadoop security for the hybrid cloud Tutorial

Mark Donsky, André Araujo, Syed Rafice, and Manish Ahluwalia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

GDPR: Getting your data ready for heavy, new EU privacy regulations Session

In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Steven Ross and Mark Donsky outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.

Mike Driscoll is the founder and CEO of Metamarkets. Previously, Mike spent more than a decade focused on making the most of data to help companies grow and developed data analytics solutions for online retail, life sciences, digital media, insurance, and banking. He also successfully founded and sold two companies: Dataspora, a life science analytics company, and CustomInk, an early pioneer in customized apparel. Mike began his career as a software engineer for the Human Genome Project. He holds an AB in government from Harvard and a PhD in bioinformatics from Boston University.

Presentations

Meet the Expert with Mike Driscoll (Metamarkets) Meet the Experts

If you're interested in data in programmatic marketing, Mike has a wealth of information on how to get it right.

The cognitive design principles of interactive analytics Session

Most analytics tools in use today provide static visuals that don’t reveal the full, real-time picture. Mike Driscoll shows how to take an interactive approach to analytics. From design techniques to discovering new forms of data exploration, he demonstrates how to put the full power of big data into the hands of the people who need it to make key business decisions.

Leigh Drogen is the founder and CEO of Estimize, a crowdsourced financial estimates platform that brings together a community of independent analysts, including financial professionals, to offer a more accurate view of market expectations. Previously, Leigh ran Surfview Capital, a New York-based quantitative investment management firm trading medium-frequency momentum strategies. He was also an early member of the team at StockTwits, where he worked on product and business development. Leigh started his career as an analyst at Geller Capital, a quantitative investment management firm in New York. He holds a BA from Hunter College with a focus in behavioral economics and war theory. When he's not staring at rectangular lightboxes, Leigh can be found on the ice rink playing hockey, behind a grill, or off in search of waves to surf around the world.

Presentations

Crowdsourced alpha: The future of investment research Findata

Findata session with Leigh Drogen

Mathieu Dumoulin is a data scientist in MapR Technologies' Tokyo office, where he combines his passion for machine learning and big data with the Hadoop ecosystem. Mathieu started using Hadoop from the deep end, building a full unstructured data classification prototype for Fujitsu Canada's Innovation Labs, a project that eventually earned him the 2013 Young Innovator award from the Natural Sciences and Engineering Research Council of Canada. Afterward, he moved to Tokyo with his family, where he worked as a search engineer at a startup and a managing data scientist for a large Japanese HR company before coming to MapR.

Presentations

State-of-the-art robot predictive maintenance with real-time sensor data Session

Mateusz Dymczyk and Mathieu Dumoulin showcase a working, practical, predictive maintenance pipeline in action and explain how they built a state-of-the-art anomaly detection system using big data frameworks like Spark, H2O, TensorFlow, and Kafka on the MapR Converged Data Platform.

Ted Dunning has been involved with a number of startups—the latest is MapR Technologies, where he is chief application architect working on advanced Hadoop-related technologies. Ted is also a PMC member for the Apache ZooKeeper and Mahout projects and contributed to the Mahout clustering, classification, and matrix decomposition algorithms. He was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics. Opinionated about software and data mining and passionate about open source, he is an active participant in Hadoop and related communities and loves helping projects get going with new technologies.

Presentations

Meet the Expert with Ted Dunning (MapR Technologies) Meet the Experts

Ted is happy to talk with you about stream-first architecture, machine learning logistics, and recent developments in t-digest.

Tensor abuse in the workplace Session

Ted Dunning offers an overview of tensor computing—covering, in practical terms, the high-level principles behind tensor computing systems—and explains how it can be put to good use in a variety of settings beyond training deep neural networks (the most common use case).

Mateusz Dymczyk is a Tokyo-based software engineer at H2O.ai, where he works as a researcher on machine learning and NLP projects. He works on distributed machine learning projects, including the core H2O platform and Sparkling Water, which integrates H2O and Apache Spark. Previously, he worked at Fujitsu Laboratories. Mateusz loves all things distributed and machine learning and hates buzzwords. In his spare time, he participates in the IT community by organizing, attending, and speaking at conferences and meetups. Mateusz holds an MSc in computer science from AGH UST in Krakow, Poland.

Presentations

State-of-the-art robot predictive maintenance with real-time sensor data Session

Mateusz Dymczyk and Mathieu Dumoulin showcase a working, practical, predictive maintenance pipeline in action and explain how they built a state-of-the-art anomaly detection system using big data frameworks like Spark, H2O, TensorFlow, and Kafka on the MapR Converged Data Platform.

Barbara Eckman is a principal data architect at Comcast, where she leads data governance for an innovative, division-wide initiative comprising near-real-time ingesting, streaming, transforming, storing, and analyzing big data. Barbara is a technical innovator and strategist with internationally recognized expertise in scientific data architecture and integration. Her experience includes technical leadership positions at a Human Genome Project center, Merck, GlaxoSmithKline, and IBM. She served on the IBM Academy of Technology, an internal peer-elected organization akin to the National Academy of Sciences.

Presentations

End-to-end data discovery and lineage in a heterogeneous big data environment with Apache Atlas and Avro Session

Barbara Eckman offers an overview of Comcast’s streaming data platform, which comprises a variety of ingest, transformation, and storage services. The platform uses Apache Avro schemas to support end-to-end data governance, Apache Atlas for data discovery and lineage, and custom asynchronous messaging libraries to notify Atlas of new data and schema entities and lineage links as they are created.

Bob Eilbacher is the vice president of operations at Caserta. An experienced operations and client services professional, Bob has a successful track record of providing technology solutions and services that uncover analytics insights and drive efficiency across the enterprise. He works directly with clients to develop strategies and implement solutions that transform structured and unstructured data into analytics-driven business insights, bringing a strong background in technology and a deep appreciation for finding the right solution. Previously, he held executive roles at Verint and Ness Technologies.

Presentations

Creating a DevOps practice for analytics Session

Building an efficient analytics environment requires a strong infrastructure. Bob Eilbacher explains how to implement a strong DevOps practice for data analysis, starting with the necessary cultural changes that must be made at the executive level and ending with an overview of potential DevOps toolchains.

Amie Elcan is a principal architect in CenturyLink’s Data Network Strategies organization, where her current areas of focus are traffic modeling, application traffic analytics, and data science. Amie has worked in the telecommunications industry for over 20 years delivering traffic-based assessments that drive optimal network architecture and engineering design decisions.

Presentations

Classification of telecom network traffic: Insight gained using statistical learning on a big data platform DCS

Statistical learning techniques applied to network data provide a comprehensive view of traffic behavior that would not be possible using traditional descriptive statistics alone. Amie Elcan shares an application of the random forest classification method using network data queried from a big data platform and demonstrates how to interpret the model output and the value of the data insight.

Madeleine Clare Elish is a researcher at Data & Society in New York. A cultural anthropologist focusing on the social impact of artificial intelligence and automation, her research investigates how new technologies reshape understandings of values, efficacy, and ethical norms and how this may advantage or disadvantage different populations. Madeleine has published ethnographic and historical research aimed at grounding and reframing policy debates around the rise of machine intelligence, including An AI Pattern Language, which presents a taxonomy of current social challenges and responses drawn from interviews with AI industry practitioners. She will receive her PhD in anthropology from Columbia University in Fall 2017 and holds an SM in comparative media studies from MIT. She can be found occasionally on Twitter as @mcette.

Presentations

Ask me anything: Data & Society Ask Me Anything

Data & Society's danah boyd and Madeleine Elish answer your questions and discuss topics such as the manipulation of data-driven and AI technologies, humans in the loop in automated systems, and the future of work.

Justin Erickson is a senior director of product management leading Cloudera’s platform team, which is responsible for the components above the storage layer in Cloudera’s Distribution Including Apache Hadoop (CDH). Previously, he led the high-availability and disaster-recovery areas of Microsoft SQL Server.

Presentations

Optimizing the data warehouse at Visa Session

At Visa, the process of optimizing the enterprise data warehouse and consolidating data marts by migrating these analytic workloads to Hadoop has played a key role in the adoption of the platform and how data has transformed Visa as an organization. Nandu Jayakumar and Justin Erickson share Visa’s journey along with some best practices for organizations migrating workloads to Hadoop.

Javier “Xavi” Esplugas is the vice president of IT planning and architecture at DHL Supply Chain. Xavi has served in a number of roles at DHL Supply Chain. Previously, he drove the standardization and innovation agenda in Europe, which included DHL’s vision picking, robotics, and the internet of things. Xavi holds an MSc in computer engineering from Universitat Politècnica de Catalunya in Barcelona.

Presentations

Implementing a successful real-time project Data 101

DHL's Javier Esplugas and Conduce's Kevin Parent explain how the two companies have implemented an IoT pipeline that gives managers and executives real-time insight into warehouse operations, helping them to identify potential hazards, reduce costs, and increase productivity.

Seeing everything so managers can act on anything: The IoT in DHL Supply Chain operations Session

DHL has created an IoT initiative for its supply chain warehouse operations. Javier Esplugas and Kevin Parent explain how DHL has gained unprecedented insight—from the most comprehensive global view across all locations to a unique data feed from a single sensor—to see, understand, and act on everything that occurs in its warehouses with immersive operational data visualization.

Carson Farmer is lead data scientist at Set, a technology startup focused on building innovative new technologies to help mobile application developers make better use of behavioral data, with a focus on protecting users’ privacy. Carson is also an assistant professor of geocomputation in the Department of Geography at the University of Colorado Boulder, where his research focuses on human mobility and space-time interactions.

Presentations

Learning location: Real-time feature extraction for mobile analytics Session

Location-based data is full of information about our everyday lives, but GPS and WiFi signals create extremely noisy mobile location data, making it hard to extract features, especially when working with real-time data. Andrew Hill and Sander Pick explore new strategies for extracting information from location data while remaining scalable, privacy focused, and contextually aware.

Basil Faruqui is lead solutions manager at BMC, where he leads the development and execution of big data and multicloud strategy for BMC’s Digital Business Automation line of business (Control-M). Basil’s key areas of focus include evangelizing the role automation plays in delivering successful big data projects and advising companies on how to build scalable automation strategies for cloud and big data initiatives. Basil has over 15 years of industry experience in various areas of software research and development, customer support, and knowledge management.

Presentations

Automated data pipelines in hybrid environments: Myth or reality? (sponsored by BMC) Session

Are you building, running, or managing complex data pipelines across hybrid environments spanning multiple applications and data sources? Doing this successfully requires automating dataflows across the entire pipeline, ideally controlled through a single source. Basil Faruqui and Jon Ouimet walk you through a customer journey to automate data pipelines across a hybrid environment.

Jessica Forde is a technical writer at Jupyter.

Presentations

JupyterLab: Building blocks for interactive computing Session

With JupyterLab, users compute with multiple notebooks, editors, and consoles that work together in a tabbed layout. Jason Grout and Jessica Forde offer an overview of JupyterLab, the next generation of the Jupyter Notebook, demonstrate how to use third-party plugins to extend and customize many aspects of JupyterLab, and explain how it fits within the overall vision of Project Jupyter.

Parisa Foster is cofounder and president of mobile prediction game and brand engagement platform Play The Future. Previously, Parisa was the vice president of marketing and business development at Budge, a leading mobile game studio. A pioneer in the mobile space, Parisa began her career in business intelligence at Airborne Mobile, one of Canada’s first mobile startups, before joining the first Mobile and API team at Yellow Pages Group, transforming digital at Just for Laughs, and consulting for a number of international clients.

Presentations

Using data to play (and forecast) the future DCS

Technology startup Play The Future developed a mobile prediction game in which users predict trending events and get rewarded for accuracy. Parisa Foster explains Play The Future’s unique predictive gameplay, discusses the challenges of a groundbreaking project, and reveals emerging insights derived from its data about how people make predictions.

Eugene Fratkin is a director of engineering at Cloudera leading cloud infrastructure efforts. He was one of the founding members of the Apache MADlib project (scalable in-database algorithms for machine learning). Previously, Eugene was a cofounder of a Sequoia Capital-backed company focusing on applications of data analytics to problems of genomics. He holds a PhD in computer science from Stanford University’s AI lab.

Presentations

A deep dive into running data engineering workloads in AWS Tutorial

Jennifer Wu, Paul George, Fahd Siddiqui, and Eugene Fratkin lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud. Along the way, they share AWS infrastructure best practices and explain how data engineering workloads interoperate with data analytic workloads.

Michael J. Freedman is a professor in the Computer Science Department at Princeton University and the cofounder and CTO of TimescaleDB, which provides an open source time series database optimized for fast ingest and complex queries. His research broadly focuses on distributed systems, networking, and security. He developed and operates several self-managing systems, including CoralCDN (a decentralized content distribution network) and DONAR (a server resolution system that powered the FCC’s Consumer Broadband Test), both of which serve millions of users daily. Michael’s other research has included software-defined and service-centric networking, cloud storage and data management, untrusted cloud services, fault-tolerant distributed systems, virtual world systems, peer-to-peer systems, and various privacy-enhancing and anticensorship systems. Michael’s work on IP geolocation and intelligence led him to cofound Illuminics Systems, which was acquired by Quova (now part of Neustar). His work on programmable enterprise networking (Ethane) helped form the basis for the OpenFlow/software-defined networking (SDN) architecture. His honors include the Presidential Early Career Award for Scientists and Engineers (PECASE), a Sloan fellowship, the NSF CAREER Award, the Office of Naval Research Young Investigator Award, DARPA Computer Science Study Group membership, and multiple award publications. Michael holds a PhD in computer science from NYU’s Courant Institute and both an SB and an MEng degree from MIT.

Presentations

Meet the Expert with Michael Freedman (TimescaleDB | Princeton) Meet the Experts

Talk with Michael about handling time series data across an organization. (Hint: it's not just for DevOps.)

When boring is awesome: Making PostgreSQL scale for time series data Session

Michael Freedman offers an overview of TimescaleDB, a new scale-out database designed for time series workloads, open sourced and engineered as a plugin to PostgreSQL. Unlike most time series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries.

Jon Fuller is an application scientist at KNIME, where he works with customers to deploy advanced analytics and help them understand the power of working with cloud resources. Previously, Jon was a postdoctoral researcher at the Heidelberg Institute for Theoretical Studies, where he published several papers on computational biology topics. Jon is a lapsed physicist. He holds a PhD in bioinformatics from the University of Leeds.

Presentations

Deploying deep learning to assist the digital pathologist Session

Jon Fuller and Olivia Klose explain how KNIME, Apache Spark, and Microsoft Azure enable fast and cheap automated classification of malignant lymphoma type in digital pathology images. The trained model is deployed to end users as a web application using the KNIME WebPortal.

Anil Gadre is the executive vice president of product management at MapR. Previously, Anil was the executive vice president of product management at Silver Spring Networks, where he was responsible for product strategy, planning, and marketing of networking and software products focused on the smart grid for the energy industry, and executive vice president of the Application Platform Software organization at Sun Microsystems, where he previously served as chief marketing officer leading global branding, demand creation, and an extensive developer ecosystem program. Anil’s roles at Sun Microsystems covered diverse product lines ranging from networked desktop and enterprise servers systems to market-leading software products, such as the Solaris Operating system, Java, the MySQL database, and various middleware products. He holds a BSEE from Stanford University and an MM from the Kellogg School at Northwestern University.

Presentations

A whole new way to think about your next-gen applications (sponsored by MapR Technologies) Keynote

Businesses struggle to build applications that harness all their data. RDBMS cannot handle modern data-intensive workloads, and NoSQL doesn't provide the capabilities for diverse applications. Anil Gadre explains how customers using a converged data platform are succeeding at creating breakthrough new apps for the enterprise. 

Jerrard Gaertner is co-developer of and instructor for the big data education program at the University of Toronto School of Continuing Studies and president of Managed Analytic Services. Jerrard is a CPA, a security and privacy specialist, a futurist, and an ethicist.

Presentations

What I learned from teaching 1,500 analytics students Session

Engaging, teaching, mentoring, and advising mature, mostly employed, often enthusiastic and ambitious adult learners at the University of Toronto has taught Jerrard Gaertner more about analytics in the real world than he ever imagined. Jerrard shares lessons learned about everything from hyped-up expectations and internal sabotage to organizational streamlining and creating transformative insight.

Kaushal Gandhi is a senior software engineer at Trifacta, where he built Trifacta’s fast interactive transformation engine (Photon) along with various data transformation features that improve the utility and usability of the product. Previously, Kaushal built prediction and estimation software at NVIDIA. He holds an MS in computer science and engineering.

Presentations

Interactive data exploration and analysis at enterprise scale Session

Sean Kandel and Kaushal Gandhi share best practices for building and deploying Hadoop applications to support large-scale data exploration and analysis across an organization.

Eddie Garcia is chief information security officer at Cloudera, a leader in enterprise analytic data management, where he draws on his more than 20 years of information and data security experience to help Cloudera Enterprise customers reduce security and compliance risks associated with sensitive datasets stored and accessed in Apache Hadoop environments. Previously, Eddie was the vice president of infosec and engineering for Gazzang prior to its acquisition by Cloudera, where he architected and implemented secure and compliant big data infrastructures for customers in the financial services, healthcare, and public sector industries to meet PCI, HIPAA, FERPA, FISMA, and EU data security requirements. He was also the chief architect of the Gazzang zNcrypt product and is author of three patents for data security.

Presentations

Machine learning to spot cybersecurity incidents at scale Session

Machine data from firewalls, network switches, DNS servers, and many other devices in your organization may be untapped potential for cybersecurity threat analytics using machine learning. Eddie Garcia explores how companies are using Apache Hadoop-based approaches to protect their organizations and explains how Apache Spot is tackling this challenge head-on.

Manuel García-Herranz is the chief scientist at UNICEF’s Office of Innovation, where he focuses on bridging the gap between data science and the most vulnerable, exploring how to apply big data, complex systems theory, and AI to help the most deprived and invisible, from addressing growing humanitarian problems such as epidemics, natural disasters, and migration to transversal issues such as monitoring development indicators, data representativeness, and the inclusion of inequality concepts in the body of computer science theory. Manuel holds a PhD in computer science from the Universidad Autónoma de Madrid.

Presentations

Data science for the most vulnerable at UNICEF Innovation Keynote

The growing availability of data—along with advances in fields such as data science and artificial intelligence—has profoundly changed businesses. Manuel García-Herranz explains how to leverage these advances for the most vulnerable and integrate them into humanitarian and development systems, while ensuring that the existing data divide does not widen the inequality gap.

Paul George is a software engineer at Cloudera, working on cloud products such as Cloudera Altus. Previously, Paul worked at Palantir Technologies and cofounded a company focused on building data systems for genomics. He holds a PhD in electrical and computer engineering from Cornell University.

Presentations

A deep dive into running data engineering workloads in AWS Tutorial

Jennifer Wu, Paul George, Fahd Siddiqui, and Eugene Fratkin lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud. Along the way, they share AWS infrastructure best practices and explain how data engineering workloads interoperate with data analytic workloads.

Alison Gilles is director of engineering for data infrastructure at Spotify, where she coaches and leads teams in backend services and data infrastructure. Previously, she led engineering teams at nonprofit organizations in education and corporate social responsibility.

Presentations

Spotify in the cloud: The next evolution of data at Spotify Session

In early 2016, Spotify decided that it didn’t want to be in the data center business. The future was the cloud. Josh Baer and Alison Gilles explain what it took to move Spotify to the cloud, covering Spotify's technology choices, challenges faced, and the lessons Spotify learned along the way.

Lucas Glass is the global analytics lead within the Analytics Center of Excellence at QuintilesIMS. His teams build data science and artificial intelligence microservices to make the design, planning, and execution of clinical research more efficient. Previously, Lucas worked on healthcare fraud analytics at the Department of Justice. He holds a master’s degree in biostatistics from Drexel University and is a PhD candidate in statistics at Temple University.

Presentations

Data science at team scale: Considerations for sharing, collaborating, and getting to production Session

Data science alone is easy. Data science with others, whether in the enterprise or on shared distributed systems, requires a bit more work. Tristan Zajonc and Thomas Dinsmore discuss common technology considerations and patterns for collaboration in large teams and for moving machine learning into production at scale.

Daniel Goddemeyer is the founder of OFFC NYC, a New York City-based research and design studio that works with global brands, research institutions, and startups to explore future product applications for today’s emerging technologies. Daniel’s research explores how the increasing proliferation of these technologies in our future lives will transform our everyday interactions. His work has been exhibited internationally at the Westbound Shanghai Architecture Biennial, the Data in the 21st Century exhibition at V2 Rotterdam, Data Traces Riga, and the Big Bang Data exhibition at London’s Somerset House, among others, and he has won or been recognized by the Art Directors Club, the Red Dot Award, the German Design Prize, the Kantar Information Is Beautiful Award, and the Industrial Designers Society of America.

Presentations

Data futures: Exploring the everyday implications of increasing access to our personal data Session

Increasing access to our personal data raises profound moral and ethical questions. Daniel Goddemeyer and Dominikus Baur share the findings from Data Futures, an MFA class in which students observed each other through their own data, and demonstrate the results with a live experiment with the audience that showcases some of the effects when personal data becomes accessible.

Jonathan Gray is the founder and CEO of Cask. Jonathan is an entrepreneur and software engineer with a background in startups, open source, and all things data. Previously, he was a software engineer at Facebook, where he helped drive HBase engineering efforts, including Facebook Messages and several other large-scale projects, from inception to production. An open source evangelist, Jonathan was responsible for helping build the Facebook engineering brand through developer outreach and refocusing the open source strategy of the company. Prior to Facebook, Jonathan founded Streamy.com, where he became an early adopter of Hadoop and HBase. He is now a core contributor and active committer in the community. Jonathan holds a bachelor’s degree in electrical and computer engineering from Carnegie Mellon University.

Presentations

Hybrid data lakes: Unlocking the inevitable (sponsored by Cask) Session

To take advantage of the latest big data technology options in the cloud, more and more enterprises are building hybrid, self-service data lakes. Jonathan Gray discusses the importance of a portability strategy, addresses implementation challenges, and shares customer use cases that will inspire enterprises to embark on a multi-environment data lake journey.

Jason Grout is a Jupyter developer at Bloomberg, working primarily on JupyterLab and the interactive Jupyter widgets library. He has also been a major contributor to the open source Sage mathematical software system and co-organizes the PyDataNYC Meetup. Previously, Jason was an assistant professor of mathematics at Drake University in Des Moines, Iowa. He holds a PhD in mathematics from Brigham Young University.

Presentations

JupyterLab: Building blocks for interactive computing Session

With JupyterLab, users compute with multiple notebooks, editors, and consoles that work together in a tabbed layout. Jason Grout and Jessica Forde offer an overview of JupyterLab, the next generation of the Jupyter Notebook, demonstrate how to use third-party plugins to extend and customize many aspects of JupyterLab, and explain how it fits within the overall vision of Project Jupyter.

Mark Grover is a product manager at Lyft. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He has also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the IoT as examples, Jonathan Seidman, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Hadoop application architectures Ask Me Anything

Mark Grover, Ted Malaska, Gwen Shapira, and Jonathan Seidman, the authors of Hadoop Application Architectures, share considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Nadeem Gulzar is the head of advanced analytics and architecture at Danske Bank Group, a Nordic bank with strong roots in Denmark and a focus on becoming the most trusted financial partner in the Nordics. Nadeem has taken the lead in establishing advanced analytics and big data technologies within Danske. Previously, he worked with credit and market risk, where he headed a program to build up capabilities to calculate risk using Monte Carlo simulation methods. Nadeem holds a BS in computer science, mathematics, and psychology and a master’s degree in computer science, both from the University of Copenhagen.

Presentations

Fighting financial fraud at Danske Bank with artificial intelligence Session

Fraud in banking is an arms race, and criminals are now using machine learning to improve their attack effectiveness. Sune Askjaer and Nadeem Gulzar explore how Danske Bank uses deep learning for better fraud detection, covering model effectiveness, TensorFlow versus boosted decision trees, operational considerations in training and deploying models, and lessons learned along the way.

Alexandra Gunderson is a data scientist at Arundo Analytics. Her background is in mechanical engineering and applied numerical methods.

Presentations

IIoT data fusion: Bridging the gap from data to value Session

One of the main challenges when working with industrial data is linking the large amount of data and extracting value. Alexandra Gunderson shares a comprehensive preprocessing methodology that structures and links data from different sources, converting the IIoT analytics process from an unorganized mammoth to one more likely to generate insight.

Sijie Guo is the cofounder of Streamlio, a company focused on building a next-generation real-time data stack. Previously, he was the tech lead for the messaging group at Twitter, where he cocreated Apache DistributedLog, and worked on push notification infrastructure at Yahoo. He is the PMC chair of Apache BookKeeper.

Presentations

Messaging, storage, or both: The real-time story of Pulsar and Apache DistributedLog Session

Modern enterprises produce data at increasingly high volume and velocity. To process data in real time, new types of storage systems have been designed, implemented, and deployed. Matteo Merli and Sijie Guo offer an overview of Apache DistributedLog and Pulsar, real-time storage systems built using Apache BookKeeper and used heavily in production.

Modern real-time streaming architectures Tutorial

Karthik Ramasamy, Sanjeev Kulkarni, Avrilia Floratau, Ashvin Agrawal, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Yufeng Guo is a developer advocate for the Google Cloud Platform, where he is trying to make machine learning more understandable and usable for all. He enjoys hearing about new and interesting applications of machine learning, so be sure to share your use case with him.

Presentations

Getting started with TensorFlow Tutorial

Yufeng Guo and Amy Unruh walk you through training and deploying a machine learning system using TensorFlow, a popular open source library. Yufeng and Amy take you from a conceptual overview all the way to building complex classifiers and explain how you can apply deep learning to complex problems in science and industry.

Yunsong Guo is a staff engineer at Pinterest developing home feed ranking ML models. Yunsong is a founding member of the home feed ranking team and has led key projects to take Pinterest home feed ranking from time-based to logistic regression-based and later to GBDT-powered ranking systems. These projects and feature improvements resulted in home feed user engagement gains of more than 100%. Previously, he spent several years working in London and Hong Kong on algorithmic trading, high-frequency trading, and statistical arbitrage using machine-learned models. Yunsong holds a PhD in computer science from Cornell University with a focus on machine learning.

Presentations

How Pinterest uses machine learning to achieve ~200M monthly active users HDS

Pinterest has always prioritized user experiences. Yunsong Guo explores how Pinterest uses machine learning—particularly linear, GBDT, and deep NN models—in its most important product, the home feed, to improve user engagement. Along the way, Yunsong shares how Pinterest drastically increased its international user engagement along with lessons on finding the most impactful features.

Sebastian Gutierrez is a data entrepreneur who focuses on data-driven companies. Sebastian founded DashingD3js.com to provide online and corporate training in data visualization and D3.js to a diverse client base, including corporations like the New York Stock Exchange, American Express, Intel, General Dynamics, Salesforce, Thomson Reuters, Oracle, Bloomberg Businessweek, universities, and dozens of startups. More than 1,000 people have attended his live training sessions, and many more have succeeded with his online D3.js training. Sebastian also cofounded DataScienceWeekly.org, which provides news, analysis, and commentary in data science. Its Data Science Weekly newsletter reaches tens of thousands of aspiring and professional data scientists. He is also the author of Data Scientist at Work, a collection of interviews with many of the world’s most influential and interesting data scientists from across the spectrum of public companies, private companies, startups, venture investors, and nonprofits. Sebastian holds a BS in mathematics from MIT and an MA in economics from the University of San Francisco.

Presentations

Improve business decision making with the science of human perception Session

You likely already use business metrics and analytics to achieve success in your data-driven organization. Sebastian Gutierrez demonstrates how to use the science of human perception to drastically improve your data visualizations, reports, and dashboards to drive better decisions and results.

Alex Gutow is senior product marketing manager at Cloudera, where she focuses on the analytic database platform solution and technologies. Previously, she managed technical marketing and PR for Basho Technologies and managed consumer and enterprise marketing for Truaxis, a Mastercard company. Alex holds a BS in marketing and a BA in psychology from Carnegie Mellon University.

Presentations

Adaptive analytics: Transitioning from legacy systems to a modern platform with MicroStrategy and Cloudera (sponsored by MicroStrategy) Session

Alex Gutow discusses the importance of adaptive analytics and shares everything you need to know while transitioning from legacy data warehouses to Hadoop-based platforms. Join in to find out why you need modern platforms to move, host, and analyze your data with MicroStrategy and Cloudera.

Felix GV is a software engineer working on LinkedIn’s data infrastructure. He works on Voldemort and Venice and keeps a close eye on Hadoop, Kafka, Samza, Azkaban, and other systems.

Presentations

Introducing Venice: A derived datastore for batch, streaming, and lambda architectures Session

Companies with batch and stream processing pipelines need to serve the insights they glean back to their users, an often-overlooked problem that can be hard to achieve reliably and at scale. Felix GV and Yan Yan offer an overview of Venice, a new data store capable of ingesting data from Hadoop and Kafka, merging it together, replicating it globally, and serving it online at low latency.

Patrick Hall is a senior director for data science products at H2O.ai, where he focuses mainly on model interpretability. Patrick is also currently an adjunct professor in the Department of Decision Sciences at George Washington University, where he teaches graduate classes in data mining and machine learning.

Previously, Patrick held global customer-facing and R&D research roles at SAS Institute. He holds multiple patents in automated market segmentation using clustering and deep neural networks. Patrick is the eleventh person worldwide to become a Cloudera Certified Data Scientist. He studied computational chemistry at the University of Illinois before graduating from the Institute for Advanced Analytics at North Carolina State University.

Presentations

Interpretable AI: Not just for regulators Session

Interpreting deep learning and machine learning models is not just another regulatory burden to be overcome. People who use these technologies have the right to trust and understand AI. Patrick Hall and Sri Satish share techniques for interpreting deep learning and machine learning models and telling stories from their results.

Eui-Hong (Sam) Han is the director of big data and personalization at the Washington Post. Sam is an experienced practitioner of data mining and machine learning and has an in-depth understanding of analytics technologies. He has successfully applied these technologies to solve real business problems. At the Washington Post, he leads a team building an integrated big data platform to store all aspects of customer profiles and activities from both digital and print circulation, content metadata, and business data. His team is building an infrastructure, tools, and services to provide personalized experience to customers, empower the newsroom with data for better decisions, and provide targeted advertising capability. Previously, he led the Big Data practice at Persistent Systems, started the Machine Learning Group in Sears Holdings’s online business unit, and worked for a data mining startup company. Sam’s expertise includes data mining, machine learning, information retrieval, and high-performance computing. He holds a PhD in computer science from the University of Minnesota.

Presentations

Automatic comments moderation with ModBot at the Washington Post Session

The quality of online comments is critical to the Washington Post. However, the quality management of the comment section currently requires costly manual resources. Eui-Hong Han and Ling Jiang discuss ModBot, a machine learning-based tool developed for automatic comments moderation, and share the challenges they faced in developing and deploying ModBot into production.

Luke (Qing) Han is the cofounder and CEO of Kyligence, which provides a leading intelligent data platform powered by Apache Kylin to simplify big data analytics from on-premises to the cloud. Luke is the cocreator and PMC chair of Apache Kylin, where he contributes his passion to driving the project’s strategy, roadmap, and product design. For the past few years, Luke has been working on growing Apache Kylin’s community, building its ecosystem, and extending its adoption globally. Previously, he was big data product lead at eBay, where he managed Apache Kylin, engaged customers, and coordinated teams across different geographical locations, and chief consultant at Actuate China.

Presentations

Building enterprise OLAP on Hadoop in finance with Apache Kylin (sponsored by Kyligence) Session

Luke Han offers an overview of Apache Kylin and its enterprise version KAP and shares a case study of how a top finance company migrated to Apache Kylin on top of Hadoop from its legacy Cognos and DB2 system.

Tom Hanlon is a senior instructor at Skymind, where he delivers courses on the wonders of the Hadoop ecosystem. Before beginning his relationship with Hadoop and large distributed data, he had a happy and lengthy relationship with MySQL with a focus on web operations. He has been a trainer for MySQL, Sun, and Percona.

Presentations

Securely building deep learning models for digital health data Tutorial

Josh Patterson, Vartika Singh, David Kale, and Tom Hanlon walk you through interactively developing and training deep neural networks to analyze digital health data using the Cloudera Workbench and Deeplearning4j (DL4J). You'll learn how to use the Workbench to rapidly explore real-world clinical data, build data-preparation pipelines, and launch training of neural networks.

David is a product manager at MicroStrategy responsible for big data customer engagements. David was one of the original beta consultants on the Advanced Technology team at MicroStrategy and has led over 20 customer visits around the world, helping MicroStrategy’s largest customers take advantage of recent development work. A graduate of the University of Virginia with a degree in finance and statistics, David also has a wide variety of work experience, from consulting for a filter factory in South Africa to interning on Capitol Hill for a senator.

Presentations

Adaptive analytics: Transitioning from legacy systems to a modern platform with MicroStrategy and Cloudera (sponsored by MicroStrategy) Session

Alex Gutow discusses the importance of adaptive analytics and shares everything you need to know while transitioning from legacy data warehouses to Hadoop-based platforms. Join in to find out why you need modern platforms to move, host, and analyze your data with MicroStrategy and Cloudera.

Behrooz Hashemian is a researcher and chief data officer at MIT’s Senseable City Lab, where he investigates the innovative implementation of big data analytics and artificial intelligence in smart cities, finance, and healthcare. A data scientist with expertise in developing predictive analytics strategies, machine learning solutions, and data-driven platforms for informed decision making, Behrooz endeavors to bridge the gap between academic research and industrial deployment of big data analytics and artificial intelligence. He is also leading an unprecedented project on anonymized data fusion, which provides a multidimensional insight into urban activities and customer behaviors from multiple sources.

Presentations

Anonymized data fusion: Privacy versus utility Session

People are leaving an increasing amount of digital traces in their everyday life. Since these traces are mostly anonymized, the information gained by advanced data analytics is limited to each individual trace. Behrooz Hashemian explains how to fuse various traces and build multidimensional insight by taking advantage of patterns in people's behavior.

Bill Havanki is a software engineer at Cloudera, where he contributes to Hadoop components and systems for deploying Hadoop clusters into public cloud services. Previously, Bill worked for 15 years developing software for government contracts, focusing mostly on analytic frameworks and authentication and authorization systems. He holds a BS in electrical engineering from Rutgers University and an MS in computer engineering from North Carolina State University. A New Jersey native, Bill currently lives near Annapolis, Maryland, with his family.

Presentations

Automating cloud cluster deployment: Beyond the book Session

Speed and reliability in deploying big data clusters is key for effectiveness in the cloud. Drawing on ideas from his book Moving Hadoop to the Cloud, which covers essential practices like baking images and automating cluster configuration, Bill Havanki explains how you can automate the creation of new clusters from scratch and use metrics gathered from the cloud provider to scale up.

Katherine Heller is an assistant professor in Duke University’s Departments of Statistical Science, Computer Science, and Electrical and Computer Engineering and at the Center for Cognitive Neuroscience, where she develops new methods and models to discover latent structure in data, including cluster structure, using Bayesian nonparametrics, hierarchical Bayes, time series techniques, and other Bayesian statistical methods, and applies these methods to problems in the brain and cognitive sciences, human social interactions, and clinical medicine. Previously, she was an NSF postdoctoral fellow in the Computational Cognitive Science Group at MIT and an EPSRC postdoctoral fellow at the University of Cambridge. Katherine has been the recipient of a first-round NSF BRAIN Initiative award, a Google faculty research award, and an NSF CAREER award. She holds a PhD from the Gatsby Unit at University College London.

Presentations

Machine learning for healthcare data HDS

Katherine Heller discusses multiple ways in which healthcare data is acquired and explains how machine learning methods are currently being introduced into clinical settings.

Seth Hendrickson is a top Apache Spark contributor and data scientist at Cloudera. He implemented multinomial logistic regression with elastic net regularization in Spark’s ML library and one-pass elastic net linear regression, contributed several other performance improvements to linear models in Spark, and made extensive contributions to Spark ML decision trees and ensemble algorithms. Previously, he worked on Spark ML as a machine learning engineer at IBM. He holds an MS in electrical engineering from the Georgia Institute of Technology.

Presentations

Boosting Spark MLlib performance with rich optimization algorithms Session

Recent developments in Spark MLlib have given users the power to express a wider class of ML models and decrease model training times via the use of custom parameter optimization algorithms. Seth Hendrickson and DB Tsai explain when and how to use this new API and walk you through creating your own Spark ML optimizer. Along the way, they also share performance benefits and real-world use cases.

Extending Spark ML: Adding your own tools and algorithms Session

Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. Holden Karau and Seth Hendrickson introduce Spark’s ML pipelines and explain how to extend them with your own custom algorithms. Even if you don't have your own algorithm to add, you'll leave with a deeper understanding of Spark's ML pipelines.

Lige Hensley is chief technology officer for Ivy Tech Community College of Indiana, where he leads a highly efficient and agile technical staff to bring a competitive advantage to the organization. A 24-year veteran of the IT industry, with experience ranging from successful startup companies to the Fortune 500, Lige has worked in a wide variety of industries, such as agriculture, military, entertainment, logistics, healthcare, government, education, manufacturing, telematics, and many more. He is an alum of the Rose-Hulman Institute of Technology and brings a solid engineering background and a passion for innovation to every endeavor.

Presentations

Learning from higher education: How Ivy Tech is using predictive analytics and data democracy to reverse decades of entrenched practices Session

As the largest community college in the US, Ivy Tech ingests over 100M rows of data a day. Brendan Aldrich and Lige Hensley explain how Ivy Tech is applying predictive technologies to establish a true data democracy—a self-service data analytics environment empowering thousands of users each day to improve operations, achieve strategic goals, and support student success.

JC Herz is cofounder and COO at Ion Channel, a data and microservices platform that automates situational awareness and enables risk management of the software supply chain. She has 15 years of analytics experience in healthcare and national security. JC was a White House special consultant to the Pentagon’s CIO office and coauthored the DoD’s open technology development roadmap. A published author, she has been contributing to Wired magazine since 1993.

Presentations

Confounding factors galore: Using software ecosystem data to risk-rate code Session

Automating security for DevOps means continuous analysis of open source software dependencies, vulnerabilities, and ecosystem dynamics. But the data is confounding: a flurry of reported vulnerabilities or infrequent commits that could be good or bad, depending on a project's scope and lifecycle. JC Herz illuminates nonintuitive insights from the software supply chain.

Andrew Hill is cofounder and CEO of Set, where he is building technology to predict human behavior in the physical world. Set provides an SDK to access behavioral predictions in mobile applications, allowing developers to personalize in-app experiences and push notifications. Previously, Andrew was chief science officer at CARTO. He holds a PhD from the University of Colorado, Boulder.

Presentations

Learning location: Real-time feature extraction for mobile analytics Session

Location-based data is full of information about our everyday lives, but GPS and WiFi signals create extremely noisy mobile location data, making it hard to extract features, especially when working with real-time data. Andrew Hill and Sander Pick explore new strategies for extracting information from location data while remaining scalable, privacy focused, and contextually aware.

John Hitchingham is director of performance engineering at FINRA, where he is responsible for driving technical innovation and efficiency across a cloud application portfolio that processes over 75 billion market events per day to detect fraud, market manipulation, insider trading, and abuse. Previously, John worked at both large and boutique consulting firms providing technical design and consulting services to startup, media, and telecommunications clients. John holds a BS in electrical engineering from Rutgers University.

Presentations

Cloud data lakes: Analytic data warehouses in the cloud Session

John Hitchingham shares insights into the design and operation of FINRA's data lake in the AWS cloud, where FINRA extracts, transforms, and loads over 75B transactions per day. Users can query across petabytes of data in seconds on AWS S3 using Presto and Spark—all while maintaining security and data lineage.

Vincent-Charles Hodder is the cofounder and CEO of Local Logic, an information company providing location insights on cities to help travelers, home buyers, and investors make better, more informed decisions. Vincent is passionate about cities, tech, and how they can work together to change the way we live. He has a background in finance and urban planning and worked in real estate development before starting Local Logic.

Presentations

Mapping cities through data to model risk in retail and real estate Findata

The location characteristics of a retail or real estate development dictate the types of customers they attract and the customer experience they deliver. Vincent-Charles Hodder explains how to model future demand for specific retail offerings and real estate projects as well as target marketing efforts to the most relevant locations in the city based on specific customer profiles.

Felipe Hoffa is a developer advocate for big data at Google, where he inspires developers around the world to leverage the Google Cloud Platform tools to analyze and understand their data in ways they could never before. You can find him in several videos, blog posts, and conferences around the world.

Presentations

What can we learn from 750 billion GitHub events and 42 TB of code? Session

With Google BigQuery, anyone can easily analyze more than five years of GitHub metadata and 42+ terabytes of open source code. Felipe Hoffa explains how to leverage this data to understand the community and code related to any language or project. Relevant for open source creators, users, and choosers, this is data that you can leverage to make better choices.

Carla Holtze is the cofounder and CEO of digital identification SaaS company Parrable. Previously, Carla worked at the BBC, the Economist’s Intelligence Unit, and Lehman Brothers in both New York and Hong Kong. Carla serves on the advisory board of the San Francisco Symphony’s Sound Box and Symphonix and mentors emerging entrepreneurs and technology companies through Startup Mexico (SUM). She holds an MBA from Columbia Business School, an MS in journalism from Columbia University, and a BS from Northwestern University.

Presentations

Accelerating the next generation of data companies Session

This panel brings together partners from some of the world’s leading startup accelerators and founders of up-and-coming enterprise data startups to discuss how we can help create the next generation of successful enterprise data companies.

John Horcher is the CRO of Virtual Cove. John has extensive financial markets experience in trading, investment banking, and analyst roles. Previously, he held senior-level roles with firms including SunGard, Business Intelligence Advisors, TIM Group, EDS, and Intergraph and served as managing director of Halpern Capital, where he drove the investor base for research sales and investment banking opportunities, which included raising over $300 million in equity and debt.

Presentations

Discovering insights in financial data with immersive reality Session

Immersive reality enables powerful new information design concepts. Most importantly, the new technology enables the telling of powerful stories using more insightful thinking. John Horcher explores how immersive reality deployments in financial markets have enabled quicker time to insight and therefore better decision making.

Shant Hovsepian is a cofounder and CTO of Arcadia Data, where he is responsible for the company’s long-term innovation and technical direction. Previously, Shant was an early member of the engineering team at Teradata, which he joined through the acquisition of Aster Data. Shant interned at Google, where he worked on optimizing the AdWords database, and was a graduate student in computer science at UCLA. He is the coauthor of publications in the areas of modular database design and high-performance storage systems.

Presentations

Streaming visual analytics: What's possible today and what's coming tomorrow Session

Streaming visual analytics is a technique for visualizing and interacting with streaming data in near real time. Shant Hovsepian explains how lambda- and polling-based architectures are being disrupted by reactive visualization systems, as streaming engines embrace the CQRS pattern, and offers analysis of visualizing streams from Apache Kafka, Apache Flink, and Apache Spark.

Fabian Hueske is a committer and PMC member of the Apache Flink project. He was one of the three original authors of the Stratosphere research system, from which Apache Flink was forked in 2014. Fabian is a cofounder of data Artisans, a Berlin-based startup devoted to fostering Flink, where he works as a software engineer and contributes to Apache Flink. He holds a PhD in computer science from TU Berlin and is currently spending a lot of his time writing a book, Stream Processing with Apache Flink.

Presentations

Stream analytics with SQL on Apache Flink Session

Although the most widely used language for data analysis, SQL is only slowly being adopted by open source stream processors. One reason is that SQL's semantics and syntax were not designed with streaming data in mind. Fabian Hueske explores Apache Flink's two relational APIs for streaming analytics—standard SQL and the LINQ-style Table API—discussing their semantics and showcasing their usage.

Kevin Huiskes is the director of marketing in Intel’s Data Center Group. In his 16 years at Intel, Kevin has held a variety of senior business and marketing positions throughout the company, including two years as chief of staff to the executive vice president of Intel’s Data Center Group. His experience includes managing the Intel Data Center Group central marketing organization, managing the Intel Xeon processor E7 product line, business development, and a variety of other product management roles. Prior to Intel, Kevin served as a legislative assistant and committee aide to a member of Congress in the US House of Representatives. He holds an MBA from Georgetown University and a BA in political science from Wheaton College.

Presentations

Accelerating insight with analytics and AI (sponsored by Intel) Session

Kevin Huiskes and Radhika Rangarajan discuss Intel's strategy to lower barriers to advanced analytics and AI, make results faster and more efficient, and enable data scientists and developers to make better use of existing infrastructure, emphasizing solutions based on the latest Intel Xeon Scalable platform and the open source framework BigDL.

Christine Hung leads the data solutions team at Spotify, which collaborates with business groups across the company to build scalable analytics solutions and provide strategic business insights. Previously, Christine ran the data science and engineering team at the New York Times, where her team partnered closely with the newsroom to build audience development tools and predictive algorithms to drive performance; she was also head of sales analytics for iTunes at Apple and a business analyst at McKinsey & Company. Christine grew up in Taiwan and currently lives in Manhattan with her family. She holds an MBA from Stanford Business School.

Presentations

Music, the window into your soul Keynote

Have you ever wondered why Spotify just seems to know what you want? As a data-first company, Spotify is investing heavily in its analytics and machine learning capabilities to understand and predict user needs. Christine Hung shares how Spotify uses data and algorithms to improve user experience and drive business impact.

Alysa Z. Hutnik is a partner at Kelley Drye & Warren LLP in Washington, DC, where she delivers comprehensive expertise in all areas of privacy, data security, and advertising law. Alysa’s experience ranges from counseling to defending clients in FTC and state attorneys general investigations, consumer class actions, and commercial disputes. Much of her practice is focused on the digital and mobile space in particular, including the cloud, mobile payments, calling and texting practices, and big data-related services. Ranked as a leading practitioner in the privacy and data security area by Chambers USA, Chambers Global, and Law360, Alysa has received accolades for the dedicated and responsive service she provides to clients. The US Legal 500 notes that she provides “excellent, fast, efficient advice” regarding data privacy matters. In 2013, she was one of just three attorneys under 40 practicing in the area of privacy and consumer protection law to be recognized as a rising star by Law360.

Presentations

Executive Briefing: Legal best practices for making data work Session

Big data promises enormous benefits for companies. But what about privacy, data protection, and consumer laws? Having a solid understanding of the legal and self-regulatory rules of the road are key to maximizing the value of your data while avoiding data disasters. Alysa Hutnik shares legal best practices and practical tips to avoid becoming a big data “don’t.”

Ihab Ilyas is a professor in the Cheriton School of Computer Science at the University of Waterloo, where his research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. Ihab is also a cofounder of Tamr, a startup focusing on large-scale data integration and cleaning. He is a recipient of the Ontario Early Researcher Award (2009), a Cheriton faculty fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees and an associate editor of ACM Transactions of Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.

Presentations

Solving data cleaning and unification using human-guided machine learning Session

Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas provides insight into various techniques and discusses how machine learning, human expertise, and problem semantics collectively can deliver a scalable, high-accuracy solution.

Pramod Immaneni is a PMC member of Apache Apex and lead architect at DataTorrent, where he works on the Apex platform and specializes in big data applications. Previously, Pramod founded several technology startups. He was CTO of Leaf Networks, a company he cofounded that was later acquired by Netgear, where he built products in the core networking space and earned patents in peer-to-peer VPNs. Before that, he helped start a company where he architected a dynamic content customization engine for mobile devices.

Presentations

Building a scalable streaming ingestion application with exactly once semantics using Apache Apex Session

Apache Apex is an open source stream processing platform that runs on Hadoop, commonly used for big data ingestion, streaming analytics, ETL, fast batch, real-time actions, and threat detection. Pramod Immaneni explains how to build an ingestion application with lightweight ETL that is scalable and fault tolerant and provides exactly once semantics.

Exactly once, more than once: Apache Kafka, Heron, and Apache Apex Session

In a series of three 11-minute presentations, key members of Apache Kafka, Heron, and Apache Apex discuss their respective implementations of exactly once delivery and semantics.

Nandu Jayakumar is a software architect and engineering leader at Visa, where he is currently responsible for the long-term architecture of data systems and leads the data platform development organization. Previously, as a senior leader of Yahoo’s well-regarded data team, Nandu built key pieces of Yahoo’s data processing tools and platforms over several iterations, which were used to improve user engagement on Yahoo websites and mobile apps. He also designed large-scale advertising systems and contributed code to Shark (SQL on Spark) during his time there. Nandu holds a bachelor’s degree in electronics engineering from Bangalore University and a master’s degree in computer science from Stanford University, where he focused on databases and distributed systems.

Presentations

Optimizing the data warehouse at Visa Session

At Visa, the process of optimizing the enterprise data warehouse and consolidating data marts by migrating these analytic workloads to Hadoop has played a key role in the adoption of the platform and how data has transformed Visa as an organization. Nandu Jayakumar and Justin Erickson share Visa’s journey along with some best practices for organizations migrating workloads to Hadoop.

Chad W. Jennings is a product manager for BigQuery at Google Cloud. Chad came to Google from the startup world. He is an avid skier and surfer. When he’s not working on big things or playing in nature, he’s at home with his wife and two young children. Chad holds a PhD in aeronautics and astronautics from Stanford University.

Presentations

Emotional arithmetic: A deep dive into how machine learning and big data help you understand customers in real time (sponsored by Google) Session

Doing “algebra” with emotions can lead to new insights about customer behavior. Chad Jennings presents a serverless big data analytics platform that allows you to capture and analyze raw data and train machine learning models that can process text to discern not just the sentiment but also the underlying emotion driving that sentiment.

Emotional arithmetic: How machine learning helps you understand customers in real time (sponsored by Google) Keynote

Chad W. Jennings walks you through a serverless big data architecture on Google Cloud that helps unravel the mysteries of human emotion.

Ling Jiang is a data scientist at the Washington Post, where she works on data mining and knowledge discovery from large volumes of data and has successfully built several data-powered products using machine learning and NLP techniques. Ling is skilled in using various machine learning and data mining techniques to tackle business problems. She holds a PhD in information science from Drexel University.

Presentations

Automatic comments moderation with ModBot at the Washington Post Session

The quality of online comments is critical to the Washington Post. However, the quality management of the comment section currently requires costly manual resources. Eui-Hong Han and Ling Jiang discuss ModBot, a machine learning-based tool developed for automatic comments moderation, and share the challenges they faced in developing and deploying ModBot into production.

Ivan Jibaja is a FlashBlade engineer at Pure Storage, where he leads the team building a big data analytics pipeline for streaming telemetry data from Pure Storage’s testing infrastructure to classify, prioritize, and understand root causes of bugs in the software development cycle. Ivan holds a PhD in computer science from the University of Texas at Austin with a concentration in compilers and programming languages.

Presentations

Continuous integration at scale: Streaming 50 billion events per day for real-time feedback with Kafka and Spark (sponsored by Pure Storage) Session

Ivan Jibaja offers an overview of Pure Storage's streaming big data analytics pipeline, which uses open source technologies like Spark and Kafka to process over 30 billion events per day and provide real-time feedback in under five seconds.

David Kale is a deep learning engineer at Skymind and a PhD candidate in computer science at the University of Southern California, where he is advised by Greg Ver Steeg of the USC Information Sciences Institute. His research uses machine learning to extract insights from digital data in high-impact domains, such as healthcare, and he collaborates with researchers from Stanford Center for Biomedical Informatics Research and the YerevaNN Research Lab. Recently, David pioneered the application of deep learning to modern electronic health records data. At Skymind, he works with clients and partners to develop and deploy deep learning solutions for real world problems. David co-organizes the Machine Learning for Healthcare Conference (MLHC) and has served as a judge in several XPRIZE competitions, including the upcoming IBM Watson AI XPRIZE. He is the recipient of the Alfred E. Mann Innovation in Engineering Fellowship.

Presentations

Securely building deep learning models for digital health data Tutorial

Josh Patterson, Vartika Singh, David Kale, and Tom Hanlon walk you through interactively developing and training deep neural networks to analyze digital health data using the Cloudera Workbench and Deeplearning4j (DL4J). You'll learn how to use the Workbench to rapidly explore real-world clinical data, build data-preparation pipelines, and launch training of neural networks.

Joseph Kambourakis is a data science instructor at Databricks. Joseph has more than 10 years of experience teaching, over five of them with data science and analytics. Previously, Joseph was an instructor at Cloudera and a technical sales engineer at IBM. He has taught in over a dozen countries around the world and been featured on Japanese television and in Saudi newspapers. He is a rabid Arsenal FC supporter and competitive Magic: The Gathering player. Joseph holds a BS in electrical and computer engineering from Worcester Polytechnic Institute and an MBA with a focus in analytics from Bentley University. He lives with his wife and daughter in Needham, MA.

Presentations

Apache Spark for machine learning and data science 2-Day Training

Joseph Kambourakis walks you through using Apache Spark to perform exploratory data analysis (EDA), developing machine learning pipelines, and using the APIs and algorithms available in the Spark MLlib DataFrames API.

Apache Spark for machine learning and data science (Day 2) Training Day 2

The Data Science with Apache Spark workshop will show how to use Apache Spark to perform exploratory data analysis (EDA), develop machine learning pipelines, and use the APIs and algorithms available in the Spark MLlib DataFrames API. It is designed for software developers, data analysts, data engineers, and data scientists.

Supun Kamburugamuve is a PhD candidate in computer science at Indiana University, where he researches big data applications and frameworks with a focus on data streaming for real-time data analytics. Recently, he has been working on high-performance enhancements to big data systems with HPC interconnect such as Infiniband and Omnipath. Supun is an Apache Software Foundation member and has contributed to many open source projects including Apache Web Services projects. Before joining Indiana University, Supun worked on middleware systems and was a key member of a team developing an open source enterprise service bus that is being widely used for enterprise integrations.

Presentations

Low-latency streaming: Twitter Heron on Infiniband Session

Modern enterprises are data driven and want to move at light speed. To achieve real-time performance, financial applications use streaming infrastructures for low latency and high throughput. Twitter Heron is an open source streaming engine with low latency around 14 ms. Karthik Ramasamy and Supun Kamburugamuve explain how they ported Heron to Infiniband to achieve latencies as low as 7 ms.

Sean Kandel is the founder and chief technical officer at Trifacta. Sean holds a PhD from Stanford University, where his research focused on new interactive tools for data transformation and discovery, such as Data Wrangler. Prior to Stanford, Sean worked as a data analyst at Citadel Investment Group.

Presentations

Interactive data exploration and analysis at enterprise scale Session

Sean Kandel and Kaushal Gandhi share best practices for building and deploying Hadoop applications to support large-scale data exploration and analysis across an organization.

Daniel Kang is a PhD student in the Stanford InfoLab, where he is supervised by Peter Bailis and Matei Zaharia. Daniel’s research interests lie broadly at the intersection of machine learning and systems. Currently, he is working on deep learning applied to video analysis.

Presentations

NoScope: Querying videos 1,000x faster with deep learning HDS

Video is one of the fastest-growing sources of data with rich semantic information, and advances in deep learning have made it possible to query this information with near-human accuracy. However, inference remains prohibitively expensive: the most powerful GPU cannot run the state of the art in real time. Daniel Kang offers an overview of NoScope, which runs queries over video 1,000x faster.

Holden Karau is a transgender Canadian Apache Spark committer, an active open source contributor, and coauthor of Learning Spark and High Performance Spark. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden speaks internationally about Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and machine learning. Prior to IBM, she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She holds a bachelor of mathematics in computer science from the University of Waterloo.

Presentations

Extending Spark ML: Adding your own tools and algorithms Session

Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. Holden Karau and Seth Hendrickson introduce Spark’s ML pipelines and explain how to extend them with your own custom algorithms. Even if you don't have your own algorithm to add, you'll leave with a deeper understanding of Spark's ML pipelines.

Meet the Expert with Holden Karau (IBM) Meet the Experts

Do you have Apache Spark questions (especially about Python, performance, or ML)? Come chat with Holden.

Stefan Karpinski is one of the cocreators and core developers of the Julia language. He is an applied mathematician and data scientist by trade, having worked at Akamai, Citrix Online, and Etsy, but currently is focused on advancing Julia’s design, implementation, documentation, and community.

Presentations

Julia and Spark, better together Session

Spark is a fast and general engine for large-scale data. Julia is a fast and general engine for large-scale compute. Viral Shah and Stefan Karpinski explain how combining Julia's compute and Spark's data processing capabilities makes amazing things possible.

Aneesh Karve is cofounder and CTO at Quilt, a data virtualization platform for data scientists. Previously, Aneesh was a product manager, lead designer, and software engineer at companies including Microsoft, NVIDIA, and Matterport and the general manager for AdJitsu, the first real-time 3D advertising platform for iOS (acquired by Amobee in 2012). Aneesh’s research background spans proteomics, machine learning, and algebraic number theory. He holds degrees in chemistry, mathematics, and computer science.

Presentations

Empowering quants to trade faster: From Excel files to data packages DCS

It doesn't matter how much data organizations collect; all that matters is how much data they can leverage. Aneesh Karve explores how a Fortune 500 bank leverages data packages to minimize data prep and maximize time spent on analysis—using a technique called source code-inspired data management.

Sravan Kasarla is an industry-recognized technology leader with over 20 years of experience in information management, business intelligence, and enterprise architecture. As a technology leader and chief architect, he has delivered results for Fortune 100 insurance, financial services, and retail IT organizations. Sravan holds an MBA from Kakatiya University and a bachelor of technology from the National Institute of Technology at Warangal.

Presentations

Using an AI-driven approach to managing data lakes in the cloud or on-premises (sponsored by Informatica) Session

In the face of regulatory and competitive pressures, why not use artificial intelligence, along with smart best practices, to manage data lakes? Murthy Mathiprakasam shares a comprehensive approach to data lake management that ensures that you can quickly and flexibly ingest, cleanse, master, govern, secure, and deliver all types of data in the cloud or on-premises.

Arun Kejariwal is a statistical learning principal at Machine Zone (MZ), where he leads a team of top-tier researchers and works on research and development of novel techniques for install and click fraud detection and assessing the efficacy of TV campaigns and optimization of marketing campaigns. In addition, his team is building novel methods for bot detection, intrusion detection, and real-time anomaly detection. Previously, Arun worked at Twitter, where he developed and open-sourced techniques for anomaly detection and breakout detection. His research includes the development of practical and statistically rigorous techniques and methodologies to deliver high-performance, availability, and scalability in large-scale distributed clusters. Some of the techniques he helped develop have been presented at international conferences and published in peer-reviewed journals.

Presentations

Anomaly detection on live data Session

Services such as YouTube, Netflix, and Spotify popularized streaming in different industry segments, but these services do not center around live data—best exemplified by sensor data—which will be increasingly important in the future. Arun Kejariwal, Francois Orsini, and Dhruv Choudhary demonstrate how to leverage Satori to collect, discover, and react to live data feeds at ultralow latencies.

Modern real-time streaming architectures Tutorial

Karthik Ramasamy, Sanjeev Kulkarni, Avrilia Floratau, Ashvin Agrawal, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Elsie Kenyon is a senior product manager at AI platform company Nara Logics, where she works with enterprise customers to define product needs and with engineers to build implementations that address them, with a focus on data processing and machine learning. Previously, Elsie was a researcher and casewriter at Harvard Business School. She holds a BA from Yale University.

Presentations

Learning from customers, keeping humans in the loop Session

Enterprises today pursue AI applications to replace logic-based expert systems in order to learn from customer and operational signals. But training data is often limited or nonexistent, and applying or extrapolating the wrong dataset can be costly to a company's business and reputation. Elsie Kenyon explains how to harness institutional human knowledge to augment data in deployed AI solutions.

Juthika Khargharia is a senior solutions architect at SAS, where she focuses on analytics and data science and helps customers solve business problems using SAS advanced analytics tools to tackle a variety of challenges involving data preparation, data exploration, predictive modeling, machine learning, and visualization. Juthika holds a PhD in astrophysics and planetary sciences from the University of Colorado.

Presentations

Real-time recommendation engines using SAS technology (sponsored by SAS) Session

How does your favorite website serve up the perfect content just for you? It's all based on machine learning. By continuously adjusting machine learning models based on real-time data, you can visualize changes and take action on the new information in real time. Juthika Khargharia explains how to build a recommendation engine to surface these recommendations on real-time data.

Sander Kieft is the ICT architect at Sanoma Media, where he is responsible for the common services and performance-based titles within Sanoma. His team designs and builds (web) services for some of the largest websites and most popular mobile applications in the Netherlands, Belgium, and Finland. Sander has been working with large-scale data in media for 15 years and with Hadoop and big data platforms in production for nearly a decade. Previously, he was a developer, architect, and technology manager for some of the largest websites in the Netherlands.

Presentations

The pitfalls of running a self-service big data platform Session

Sanoma has been running big data as a self-service platform for over five years, mainly as a service for business analysts to work directly on the source data. The road to getting business analysts to directly do their analyses on Hadoop was far from smooth. Sander Kieft explores Sanoma's journey and shares some lessons learned along the way.

Kimoon Kim is a software engineer at Pepperdata. Kimoon has hands-on experience with large distributed systems processing massive datasets. Previously, he worked for the Google Search and Yahoo Search teams for many years.

Presentations

HDFS on Kubernetes: Lessons learned Session

There is growing interest in running Spark natively on Kubernetes. Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. Kimoon Kim demonstrates how to run HDFS inside Kubernetes to speed up Spark.

James Kirkland is the advocate for Red Hat’s initiatives and solutions for the internet of things (IoT) and is the architect of Red Hat’s strategy for IoT deployments. This open source architecture combines data acquisition, integration, and rules activation with command and control data flows among devices, gateways, and the cloud to connect customers’ operational technology environments with information technology infrastructure and provide agile IoT integration. James serves as the head subject-matter expert and global team leader of system architects responsible for accelerating IoT implementations for customers worldwide. Through his collaboration with customers, partners, and systems integrators, Red Hat has grown its IoT ecosystem, expanding its presence in industries including transportation, logistics, and retail and accelerating adoption of IoT in large enterprises. James has deep knowledge of Unix and Linux variants spanning the course of his 20-year career at Red Hat, Racemi, and Hewlett-Packard. He is a steering committee member of the IoT working group for Eclipse.org, a member of the IIC, and a frequent public speaker and author on a wide range of technical topics.

Presentations

An open source architecture for the IoT Session

Eclipse IoT is an ecosystem of organizations that are working together to establish an IoT architecture based on open source technologies and standards. Dave Shuman and James Kirkland showcase an end-to-end architecture for the IoT based on open source standards, highlighting Eclipse Kura, an open source stack for gateways and the edge, and Eclipse Kapua, an open source IoT cloud platform.

Terry Kline is senior vice president and chief information officer at Navistar, where he is responsible for all information and technology requirements globally, leading Navistar’s connected vehicle strategy, including OnCommand Connection (OCC) and over-the-air programming, and growing adoption of OCC. Terry holds a bachelor’s degree in computer science and engineering from the University of Toledo and an MBA from Indiana University. He is a member of the Institute of Electrical and Electronics Engineers and the Association of Computing Machinery.

Presentations

How the IoT and machine learning keep America truckin' Keynote

Data is powering the largest trucks on America’s interstates, the buses that take our children to school, and the military vehicles that help protect our country. Terry Kline and Mike Olson look at how machine learning and predictive analytics keep more than 300,000 connected vehicles rolling.

Olivia Klose is a software development engineer in the Technical Evangelism and Development Group at Microsoft, where she focuses on all analytics services on Microsoft Azure, in particular Hadoop (HDInsight), Spark, and Machine Learning. Olivia is a frequent speaker at conferences both in Germany and around the world, including TechEd Europe, PASS Summit, and Technical Summit. She studied computer science and mathematics at the University of Cambridge, the Technical University of Munich, and IIT Bombay, with a focus on machine learning in medical imaging.

Presentations

Deploying deep learning to assist the digital pathologist Session

Jon Fuller and Olivia Klose explain how KNIME, Apache Spark, and Microsoft Azure enable fast and cheap automated classification of malignant lymphoma type in digital pathology images. The trained model is deployed to end users as a web application using the KNIME WebPortal.

Keith Kohl is vice president of product management at Syncsort, where he is responsible for product management strategy, roadmap, and feature definition across Syncsort’s product portfolio. Keith has more than 16 years of data management market experience. Previously, Keith served as vice president of product management at Trillium Software, where he focused on Trillium’s global product strategy for enterprise data quality solutions, encompassing both established and emerging big data solutions, as deployed on-premises and in the cloud.

Presentations

A governance checklist for making your big data into trusted data (sponsored by Syncsort) Session

If users get conflicting analytics results, wild predictions, and crazy reports from the data in your data lake, they will lose trust. From the beginning of your data lake project, you need to build in solid business rules, data quality checking, and enhancement. Keith Kohl shares an actionable checklist that shows everyone in your enterprise that your big data can be trusted.

Priya Koul is the vice president of engineering for digital partnerships and closed-loop capabilities at American Express, where she leads key initiatives that transform its enterprise network information assets into innovative digital products and create unique value for customers in mobile and web applications and on partner platforms. Priya also leads the company’s end-to-end technology strategy, capability development, and technology platforms to launch innovative digital products and partnerships while advancing core platforms powering Amex’s closed-loop. Her team partners closely with several business and technology teams in driving the launch of key digital products all the way from ideation and product definition and design to deployment and ongoing management across all American Express markets. Priya has led the launch of several unique, groundbreaking, and industry-first digital products enabling strategic partnerships with Foursquare, Facebook, Twitter, Xbox, Apple, Samsung, and TripAdvisor. She also led the development and launch of key AXP network platforms such as Card SYNC, Smart Offers, and Tweet to Buy, leading the journey from payments to commerce.

Her team launched the ability for American Express card members to pay with points in NYC taxi cabs, from within the Uber app, on BestBuy online, at Rite Aid, McDonald’s, and Chili’s restaurants, and on Airbnb. She also leads the global American Express SafeKey payer authentication capability that adds an extra layer of security for online shoppers and drives the advancement of strategic on network global platforms that power Amex’s Digital Offer ecosystem and campaigns like Small Business Saturday and Shop Small globally. Priya led the Digital Payment platform Amex Express Checkout across multiple online merchants, and her team was also responsible for building an API payment platform that facilitates B2B payments for large and middle-market payment partners. In addition to advancement and delivery of tech platforms, she also leads the community of technical practice for artificial intelligence across American Express. Priya’s core strengths include partnership building across internal and external partners. She places the highest importance on developing her team, fostering an innovative mindset, and collaboration. She is consistently recognized as a strategic leader with technical expertise and the ability to lead and motivate high-performing teams.

Presentations

AI at scale at American Express: Walking the talk Findata

The AI landscape is rapidly evolving, offering a lot of promise. . .and a lot of hype. Priya Koul explains how American Express is building an AI ecosystem at scale to unlock differentiated customer experiences and open up new business opportunities.

Chi-Yi Kuan is director of business analytics at LinkedIn. He has over 15 years of extensive experience in applying big data analytics, business intelligence, risk and fraud management, data science, and marketing mix modeling across various business domains (social network, ecommerce, SaaS, and consulting) at both Fortune 500 firms and startups. Chi-Yi is dedicated to helping organizations become more data driven and profitable. He combines deep expertise in analytics and data science with business acumen and dynamic technology leadership.

Presentations

The EOI framework for big data analytics to drive business impact at scale Session

Michael Li and Chi-Yi Kuan offer an overview of the EOI (enable-optimize-innovate) framework for big data analytics and explain how to leverage this framework to drive and grow business in key corporate functions, such as product, marketing, and sales.

Sanjeev Kulkarni is the cofounder of Streamlio, a company focused on building a next-generation real-time stack. Previously, he was the technical lead for real-time analytics at Twitter, where he cocreated Twitter Heron; worked at Locomatix handling the company’s engineering stack; and led several initiatives for the AdSense team at Google. Sanjeev holds an MS in computer science from the University of Wisconsin-Madison.

Presentations

Modern real-time streaming architectures Tutorial

Karthik Ramasamy, Sanjeev Kulkarni, Avrilia Floratau, Ashvin Agrawal, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Jared P. Lander is chief data scientist of Lander Analytics, where he oversees the long-term direction of the company and researches the best strategy, models, and algorithms for modern data needs. He specializes in data management, multilevel models, machine learning, generalized linear models, visualization, and statistical computing. In addition to his client-facing consulting and training, Jared is an adjunct professor of statistics at Columbia University and the organizer of the New York Open Statistical Programming Meetup and the New York R Conference. He is the author of R for Everyone, a book about R programming geared toward data scientists and nonstatisticians alike. Very active in the data community, Jared is a frequent speaker at conferences, universities, and meetups around the world and was a member of the 2014 Strata New York selection committee. His writings on statistics can be found at Jaredlander.com. He was recently featured in the Wall Street Journal for his work with the Minnesota Vikings during the 2015 NFL Draft. Jared holds a master’s degree in statistics from Columbia University and a bachelor’s degree in mathematics from Muhlenberg College.

Presentations

Machine learning in R Tutorial

Modern statistics has become almost synonymous with machine learning—a collection of techniques that utilize today's incredible computing power. Jared Lander walks you through the available methods for implementing machine learning algorithms in R and explores underlying theories such as the elastic net, boosted trees, and cross-validation.

Philip Langdale is the engineering lead for cloud at Cloudera. He joined the company as one of the first engineers building Cloudera Manager and served as an engineering lead for that project until moving to working on cloud products. Previously, Philip worked at VMware, developing various desktop virtualization technologies. Philip holds a bachelor’s degree with honors in electrical engineering from the University of Texas at Austin.

Presentations

How to successfully run data pipelines in the cloud Session

With its scalable data store, elastic compute, and pay-as-you-go cost model, cloud infrastructure is well-suited for large-scale data engineering workloads. Jennifer Wu, Philip Langdale, and Kostas Sakellis explore the latest cloud technologies, focusing on data engineering workloads, cost, security, and ease-of-use implications for data engineers.

Scott Langevin is a partner and research scientist at Uncharted Software. Scott has more than 12 years of industry and academic experience. He holds a PhD in computer science with a focus on machine learning. Scott’s research interests include large-scale visual analytics and adaptive user interfaces.

Presentations

Text analytics and new visualization techniques Session

Text analytics are advancing rapidly, and new visualization techniques for text are providing new capabilities. Richard Brath and Scott Langevin offer an overview of these new ways to organize massive volumes of text, characterize subjects, score synopses, and skim through lots of documents.

Sam Lavigne is an editor at the New Inquiry and an instructor at NYU and the New School. An artist, programmer, and teacher, Sam has exhibited his work—which deals with data, cops, surveillance, natural language processing, and automation—at Rhizome, Flux Factory, Lincoln Center, SFMOMA, Pioneer Works, DIS, and the Smithsonian, among others.

Presentations

White Collar Crime Risk Zones Keynote

Sam Lavigne offers an overview of White Collar Crime Risk Zones, a predictive policing application that uses industry-standard predictive policing methodologies to predict financial crime at the city-block level with an accuracy of 90.12%. Unlike typical predictive policing apps, which criminalize poverty, White Collar Crime Risk Zones criminalizes wealth.

Reuven Lax is a senior staff software engineer at Google, the tech lead for cloud-based stream processing (i.e., the streaming engine behind Google Cloud Dataflow), and the former tech lead of MillWheel.

Presentations

Realizing the promise of portability with Apache Beam Session

Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. Reuven Lax offers an overview of Beam basic concepts and demonstrates that portability in action.

Francesca Lazzeri is a data scientist at Microsoft, where she is part of the algorithms and data science team. Francesca is passionate about innovations in big data technologies and the applications of advanced analytics to real-world problems. Her work focuses on the deployment of machine learning algorithms and web service-based solutions to solve real business problems for customers in the energy, retail, and HR analytics sectors. Previously, she was a research fellow in business economics at Harvard Business School. She holds a PhD in innovation management.

Presentations

Putting data to work: How to optimize workforce staffing to improve organization profitability Session

New machine learning technologies allow companies to apply better staffing strategies by taking advantage of historical data. Francesca Lazzeri and Hong Lu share a workforce placement recommendation solution that recommends staff with the best professional profile for new projects.

Julien Le Dem is the coauthor of Apache Parquet and the PMC chair of the project. He is also a committer and PMC Member on Apache Pig. Julien was previously an architect at Dremio and the tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_). Prior to Twitter, Julien was a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

Presentations

The columnar roadmap: Apache Parquet and Apache Arrow Session

Julien Le Dem explains how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future, how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions, and how standard Arrow-based APIs are paving the way to breaking the silos of big data.

David Leach is CEO at Qrious, a New Zealand-based software company specializing in big data and advanced analytics. David is a qualified engineer with nearly 20 years of experience in business strategy, transformation, and people leadership. David has a passion for technology innovation and improving customer experience and is very effective at fostering high-performance teams, initiating change, and driving innovation. Previously, David held numerous roles at Orion Health and sat on the company’s executive leadership team. David also represented Orion Health on the Board for Precision Driven Health, a $38M public-provider research partnership to promote the new era of precision-driven healthcare.

Presentations

Executive panel: Big data use cases around the world Session

Big data and the cloud have spread around the world, and Singapore, New Zealand, Australia, and Canada are already seeing dramatic investments and returns. In a panel moderated by Steve Totman, senior executives from a variety of leading companies, including DBS, CIBC, and Qrious, share use cases, challenges, and how to be successful.

Toni LeTempt is a senior technical expert at Walmart. Toni has 18 years of IT experience, five of them working with large secure enterprise Hadoop clusters.

Presentations

An authenticated journey through big data security at Walmart Session

In today’s world of data breaches and hackers, security is one of the most important components for big data systems, but unfortunately, it's usually the area least planned and architected. Matt Bolte and Toni LeTempt share Walmart's authentication journey, focusing on how decisions made early can have significant impact throughout the maturation of your big data environment.

Bob Levy is CEO of Virtual Cove, Inc., which commercializes new uses of virtual and augmented reality for making sense of data at scale. He brings over two decades of industry and product leadership experience with firms including IBM and MathWorks. Bob was the founding president of the BPMA in 2001, a 6,000+-person industry group based in Boston.

Presentations

The limits of human cognition Findata

Bob Levy looks at the science of understanding and the frontiers of immersive reality that are helping auditors, model designers, fraud analysts, and other knowledge workers better process data.

Evan Levy is the vice president of data management programs at SAS, where he advises clients on strategies to address business challenges using data, technology, and creative approaches that align IT with the business capability and offers practical advice on addressing these challenges in a manner that utilizes a company’s existing skills, coupled with new methods to ensure IT and business success. A speaker, writer, and consultant in the areas of enterprise data strategy and data management, Evan is also a faculty member of TDWI as well as a best practices judge in the areas of business intelligence, data integration, and data management. Evan is the coauthor of the first book on MDM, Customer Data Integration: Reaching a Single Version of the Truth, which describes the business breakthroughs achieved with integrated customer data and explains how to make master data management successful.

Presentations

The five components of a data strategy Session

While it's clear organizations need to have a comprehensive data strategy, few have actually developed a plan to improve the access, sharing, and usage of data. Evan Levy discusses the five essential components that make up a data strategy and explores the individual attributes of each.

Haoyuan Li is founder and CEO of Alluxio (formerly Tachyon Nexus), a memory-speed virtual distributed storage system. Before founding the company, Haoyuan was working on his PhD at UC Berkeley’s AMPLab, where he cocreated Alluxio. He is also a founding committer of Apache Spark. Previously, he worked at Conviva and Google. Haoyuan holds an MS from Cornell University and a BS from Peking University.

Presentations

Best practices for using Alluxio with Spark Session

Alluxio (formerly Tachyon) is a memory-speed virtual distributed storage system that leverages memory for managing data across different storage systems. Many deployments use Alluxio with Spark because Alluxio helps Spark further accelerate applications. Haoyuan Li and Cheng Chang explain how Alluxio makes Spark more effective and share production deployments of Alluxio and Spark working together.

Junxia Li is a senior data scientist at Think Big Analytics, a Teradata company, where she focuses on solving recommendation problems using the latest techniques like deep and wide learning across clients from several verticals. Junxia has successfully implemented advanced machine learning models for clients from a number of different industries, such as automotive, telecommunications, and retail. Junxia is enthusiastic about emerging advancements in machine learning, especially deep learning and AI, and she enjoys reading cutting-edge research papers and experimenting with new ideas. She holds a master’s degree in business and IT and a dual bachelor’s degree in economics and information systems.

Presentations

Deep learning for recommender systems Tutorial

Junxia Li and Mo Patel demonstrate how to apply deep learning to improve consumer recommendations by training neural nets to learn categories of interest for recommendations using embeddings. You'll also learn how to achieve wide and deep learning with WALS matrix factorization—now used in production for the Google Play store.

Lisha Li is an investor at Amplify Partners, where she focuses on companies that leverage machine learning and data to solve problems. She is excited to be investing at a time when algorithmic and data-driven methods have such incredible potential for impact. Lisha holds a PhD from UC Berkeley, where her research, under David Aldous and Joan Bruna, focused on deep learning and probability applied to the problem of clustering in graphs. While at Berkeley, she also did statistical consulting, advised on methods and analysis for experimentation and interpretation, and interned as a data scientist at Pinterest and Stitch Fix. She was also a lecturer in discrete mathematics and a graduate instructor in probability and statistics and intro CS theory. Lisha holds an MS with highest distinction in mathematics from the University of Toronto, where she was advised by Balazs Szegedy and worked in the area of graph limits. Lisha is the recipient of the prestigious NSERC CGS fellowship.

Presentations

Where the puck is headed: A VC panel discussion Session

In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road.

Tianhui Michael Li is the founder and CEO of the Data Incubator. Michael has worked as a data scientist lead at Foursquare, a quant at D.E. Shaw and JPMorgan, and a rocket scientist at NASA. At Foursquare, Michael discovered that his favorite part of the job was teaching and mentoring smart people about data science. He decided to build a startup that lets him focus on what he really loves. He did his PhD at Princeton as a Hertz fellow and read Part III Maths at Cambridge as a Marshall scholar.

Presentations

Meet the Expert with Michael Li (The Data Incubator) Meet the Experts

Michael will answer questions about building a data-driven culture.

Michael Li is head of analytics at LinkedIn, where he helps define what big data means for LinkedIn’s business and how it can drive business value through the EOI analytics framework. Michael is passionate about solving complicated business problems with a combination of superb analytical skills and sharp business instincts. His specialties include building and leading high-performance teams to quickly meet the needs of fast-paced, growing companies. Michael has a number of years’ experience in big data innovation, business analytics, business intelligence, predictive analytics, fraud detection, operations, and statistical modeling across the financial, ecommerce, and social networking industries.

Presentations

The EOI framework for big data analytics to drive business impact at scale Session

Michael Li and Chi-Yi Kuan offer an overview of the EOI (enable-optimize-innovate) framework for big data analytics and explain how to leverage this framework to drive and grow business in key corporate functions, such as product, marketing, and sales.

Zhichao Li is a senior software engineer at Intel focused on distributed machine learning, especially large-scale analytical applications and infrastructure on Spark. He’s also an active contributor to Spark. Previously, Zhichao worked in Morgan Stanley’s FX Department.

Presentations

Building advanced analytics and deep learning on Apache Spark with BigDL Session

Yuhao Yang and Zhichao Li discuss building end-to-end analytics and deep learning applications, such as speech recognition and object detection, on top of BigDL and Spark and explore recent developments in BigDL, including Python APIs, notebook and TensorBoard support, TensorFlow model R/W support, better recurrent and recursive net support, and 3D image convolutions.

Julia Lintern is a senior data scientist at Metis, where she coteaches the data science bootcamp, develops curricula, and focuses on various other special projects. Previously, Julia worked as a data scientist at JetBlue, where she used quantitative analysis and machine learning methods to provide continuous assessment of the aircraft fleet. Julia began her career as a structures engineer designing repairs for damaged aircraft. In her free time, she collaborates on various projects, such as the development of a trap music generator; she has also worked on creative side projects such as Lia Lintern, her own fashion label. Julia holds an MA in applied math from Hunter College, where she focused on visualizations of various numerical methods including collocation and finite element methods and discovered a deep appreciation for the combination of mathematics and visualizations, leading her to data science as a natural extension of these ideas.

Presentations

A deep dive into deep learning with Keras Tutorial

Julia Lintern offers a deep dive into deep learning with Keras, beginning with basic neural nets before exploring convolutional neural nets and recurrent neural nets. Along the way, Julia explains both the design theory behind and the Keras implementations of today's most widely used deep learning algorithms.

Todd Lipcon is an engineer at Cloudera, where he primarily contributes to open source distributed systems in the Apache Hadoop ecosystem. Previously, he focused on Apache HBase, HDFS, and MapReduce, where he designed and implemented redundant metadata storage for the NameNode (QuorumJournalManager), ZooKeeper-based automatic failover, and numerous performance, durability, and stability improvements. In 2012, Todd founded the Apache Kudu project and has spent the last three years leading this team. Todd is a committer and PMC member on Apache HBase, Hadoop, Thrift, and Kudu, as well as a member of the Apache Software Foundation. Prior to Cloudera, Todd worked on web infrastructure at several startups and researched novel machine learning methods for collaborative filtering. Todd holds a bachelor’s degree with honors from Brown University.

Presentations

A brave new world in mutable big data: Relational storage Session

To date, mutable big data storage has primarily been the domain of nonrelational (NoSQL) systems such as Apache HBase. However, demand for real-time analytic architectures has led big data back to a familiar friend: relationally structured data storage systems. Todd Lipcon explores the advantages of relational storage and reviews new developments, including Google Cloud Spanner and Apache Kudu.

Ryan Lippert is a senior product marketing manager at Cloudera, where he is responsible for the company’s Operational Database offering and for marketing its storage products. Previously, Ryan served in a variety of roles at Cisco Systems. He holds an economics degree from the University of Guelph and an MBA from Stanford.

Presentations

The sunset of lambda: New architectures amplify IoT impact Session

A long time ago in a data center far, far away, we deployed complex lambda architectures as the backbone of our IoT solutions. Though hard, they enabled collection of real-time sensor data and slightly delayed analytics. Michael Crutcher and Ryan Lippert explain why Apache Kudu, a relational storage layer for fast analytics on fast data, is the key to unlocking the value in IoT data.

Julie Lockner is cofounder of 17 Minds Corporation, a startup focused on improving care and education plans for children with special needs. She has held executive roles at InterSystems, Informatica, and EMC and was an analyst at ESG. She was also founder and CEO of CentricInfo, a data management consulting firm. Julie holds an MBA from MIT and a BSEE from WPI.

Presentations

Predicting tantrums with wearable data and real-time analytics Session

How can we empower individuals with special needs to reach their full potential? Julie Lockner offers an overview of a project to develop collaboration applications that use wearable device data to improve the ability to develop the best possible care and education plans. Join in to learn how real-time IoT data analytics are making this possible.

Ben Lorica is the chief data scientist at O’Reilly Media. Ben has applied business intelligence, data mining, machine learning, and statistical analysis in a variety of settings, including direct marketing, consumer and market research, targeted advertising, text mining, and financial engineering. His background includes stints with an investment management company, internet startups, and financial services.

Presentations

Hardcore Data Science welcome HDS

Hosts Ben Lorica and Assaf Araki welcome you to Hardcore Data Science day.

The age of machine learning Keynote

Ben Lorica explores the age of machine learning.

Thursday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

Piet Loubser is vice president of product and solutions marketing at Hortonworks, where he is responsible for the holistic positioning of the Hortonworks product and solution portfolio. Piet has more than 25 years of experience in the IT industry driving strategic marketing, product marketing, sales, and software development and has worked with organizations across the globe to use data to drive strategic transformations. Previously, he headed up platform product marketing at Informatica; held executive marketing roles at HP and SAP; and led portfolio market strategies at Business Objects (acquired by SAP), where he also held numerous positions in regional sales management and business development in offices within the United States, Europe, and South Africa. Piet holds a bachelor’s degree in computer science from the University of Stellenbosch in South Africa. 

Presentations

Powering business outcomes with data science in a connected world (sponsored by Hortonworks) Session

Data has become the new fuel for business success. As a result, business intelligence and analytics are among the top priorities for CIOs today. Piet Loubser outlines the tectonic shift currently taking place in the market and explains why next-gen connected architectures are crucial to meet the demands of an intelligent, connected world.

Hong Lu is a data scientist at Microsoft. Hong is passionate about innovations in big data technologies and application of advanced analytics to real-world problems. During her time at Microsoft, Hong has built end-to-end data science solutions for customers in energy, retail, and education sectors. Previously, she worked on optimizing advertising platforms in the video advertising industry. Hong holds a PhD in biomedical engineering from Case Western Reserve University, where her research focused on machine learning-based medical image analysis.

Presentations

Putting data to work: How to optimize workforce staffing to improve organization profitability Session

New machine learning technologies allow companies to apply better staffing strategies by taking advantage of historical data. Francesca Lazzeri and Hong Lu share a workforce placement recommendation solution that recommends staff with the best professional profile for new projects.

Neng Lu is a software engineer at Twitter, where he is a core committer to the Heron project and the lead engineer for Heron development. He has also worked on Twitter’s monitoring and key-value storage systems. Before joining Twitter, he earned a master’s degree from UCLA and a bachelor’s degree from Zhejiang University.

Presentations

Modern real-time streaming architectures Tutorial

Karthik Ramasamy, Sanjeev Kulkarni, Avrilia Floratau, Ashvin Agrawal, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Zhenxiao Luo is a software engineer at Uber working on Presto and Parquet. Previously, he led the development and operations of Presto at Netflix and worked on big data and Hadoop-related projects at Facebook, Cloudera, and Vertica. He holds a master’s degree from the University of Wisconsin-Madison and a bachelor’s degree from Fudan University.

Presentations

Geospatial big data analysis at Uber Session

Uber's geospatial data is increasing exponentially as the company grows. As a result, its big data systems must also grow in scalability, reliability, and performance to support business decisions, user recommendations, and experiments for geospatial data. Zhenxiao Luo and Wei Yan explain how Uber runs geospatial analysis efficiently in its big data systems, including Hadoop, Hive, and Presto.

Thiruvalluvan M G is vice president of engineering at Aqfer. Previously, Thiru was a distinguished architect at Yahoo, principal hacker at Altiscale, and an architect at Stata Labs, where he built the desktop search engine Bloomba. He also held a number of technical and managerial engineering roles at Accel and Hewlett-Packard. He is a committer and PMC member of the Apache Avro project. Thiru holds a BE in electronics and communications engineering from Anna University.

Presentations

SETL: An efficient and predictable way to do Spark ETL Session

Common ETL jobs used for importing log data into Hadoop clusters require a considerable amount of resources, which varies based on the input size. Thiruvalluvan M G shares a set of techniques—involving an innovative use of Spark processing and exploiting features of Hadoop file formats—that not only make these jobs much more efficient but also work well with fixed amounts of resources.

Allan MacInnis is a solutions architect at Amazon Web Services, where he works on streaming data and analytics and helps AWS customers build solutions that enable them to gain immediate insight into their business and operations. Allan has held a number of roles at Amazon, including software development manager, where he helped to build innovative new products such as the Amazon Kindle and Amazon Flex. Previously, he spent several years as a software developer and architect at Dell. Allan holds a degree in electrical engineering from Dalhousie University.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application on the cloud? Ryan Nienhuis, Radhika Ravirala, Allan MacInnis, and Ben Snively walk you through building a big data application using a combination of open source technologies and AWS managed services.

Creating a serverless real-time analytics platform powered by machine learning in the cloud Session

Speed matters. Today, decisions are made based on real-time insights, but in order to support the substantial growth of streaming data, companies are required to innovate. Roy Ben-Alta and Allan MacInnis explore AWS solutions powered by machine learning and artificial intelligence.

Santhosh Mahendiran is global head of technology at Standard Chartered Bank. An industry-recognized leader in the BFSI sector, Santhosh has 15+ years of experience across a wide range of systems and technologies, spanning the conceptualization, building, and management of core banking systems, state-of-the-art frontend systems (both customer and staff facing) coupled with APIs, data analytics, and business intelligence platforms. Over his career, he has led initiatives to improve and streamline business and IT processes that directly or indirectly generate revenue and reduce running costs. Santhosh is passionate about technology innovation, process improvement, and bringing out the best in the people who work with him. He is a regular speaker at tech forums and events across the globe and an active squash player in the Singapore league. Santhosh holds dual master’s degrees in computer applications and software engineering from the National University of Singapore.

Presentations

A winning combination: The power of big data and the democracy of information (sponsored by Paxata) Session

Santhosh Mahendiran explains how financial services company Standard Chartered Bank is using self-service data prep and machine learning technologies to democratize its data lake, offering trusted information to analysts, subject-matter experts, and line-of-business executives across 70 countries to help monitor fraud, track money-laundering activities, and perform regulatory compliance reporting.

Deepak Majeti is a systems software engineer at Vertica. He is also a committer and an active contributor to Hadoop’s two most popular file formats: ORC and Parquet. His interests lie in getting the best from HPC and big data and building scalable, high-performance, and energy-efficient data analytics tools for modern computer architectures. Deepak holds a PhD in the high-performance computing (HPC) domain from Rice University.

Presentations

How the separation of compute and storage impacts your big data analytics way of life (sponsored by Micro Focus Security and Big Data Analytics) Session

Deepak Majeti explains why the separation of compute and storage has become critical to maximizing the benefits of cloud economics.

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and HearthStone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache Yarn, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Presentations

Ask me anything: Hadoop application architectures Ask Me Anything

Mark Grover, Ted Malaska, Gwen Shapira, and Jonathan Seidman, the authors of Hadoop Application Architectures, share considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Executive Briefing: Managing successful data projects—Technology selection and team building Session

Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams to effectively leverage these technologies and achieve ROI with your data initiatives.

Sarah manages the Analytics Engineering team at Etsy. Her team’s analytics pipeline, datasets, and tooling make it feasible to rapidly A/B test and monitor product feature changes on live traffic across all platforms and to continually quantify the impact of any change within a scientific methodology.

Presentations

Increasing velocity, accuracy and learning at scale Data 101

Bruce Martin is a senior instructor at Cloudera, where he teaches courses on data science, Apache Spark, Apache Hadoop, and data analysis. Previously, Bruce was principal architect and director of advanced concepts at SunGard Higher Education, where he developed the software architecture for SunGard’s Course Signals Early Intervention System, which uses machine learning algorithms to predict the success of students enrolled in university courses. Bruce’s other roles have included senior staff engineer at Sun Microsystems and researcher at Hewlett-Packard Laboratories. Bruce has written many papers on data management and distributed system technologies and frequently presents his work at academic and industrial conferences. Bruce has authored patents on distributed object technologies. Bruce holds a PhD and master’s degree in computer science from the University of California, San Diego, and a bachelor’s degree in computer science from the University of California, Berkeley.

Presentations

Cloudera big data architecture workshop 2-Day Training

Bruce Martin leads you through designing and architecting solutions to a challenging business problem. You'll explore big data application architecture concepts in general and then apply them to the design of a challenging system.

Cloudera Big Data Architecture Workshop (Day 2) Training Day 2

The Cloudera Big Data Architecture Workshop (BDAW) is a 2-day learning event that addresses advanced big data architecture topics. BDAW brings together technical contributors in a group setting to design and architect solutions to a challenging business problem. The workshop addresses big data architecture problems in general and then applies them to the design of a challenging system.

Hilary Mason is founder and CEO of Fast Forward Labs, a machine intelligence research company, and data scientist in residence at Accel Partners. Previously Hilary was chief scientist at Bitly. She cohosts DataGotham, a conference for New York’s homegrown data community, and cofounded HackNY, a nonprofit that helps engineering students find opportunities in New York’s creative technical economy. Hilary served on Mayor Bloomberg’s Technology Advisory Board and is a member of Brooklyn hacker collective NYC Resistor.

Presentations

Executive Briefing: Talking to machines—Natural language today Session

Progress in machine learning has led us to believe we might soon be able to build machines that talk to us using the same interfaces that we use to talk to each other: natural language. But how close are we? Hilary Mason explores the current state of natural language technologies and some applications where this technology is thriving today and imagines what we might build in the next few years.

Dana Mastropole is a data scientist in residence at the Data Incubator and contributes to curriculum development and instruction. Previously, Dana taught elementary school science after completing MIT’s Kaufman teaching certificate program. She studied physics as an undergraduate student at Georgetown University and holds a master’s in physical oceanography from MIT.

Presentations

Machine learning with TensorFlow 2-Day Training

Dana Mastropole and Michael Li demonstrate TensorFlow's capabilities through its Python interface and explore TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine learning models on real-world data.

Machine learning with TensorFlow (Day 2) Training Day 2

Dana Mastropole and Michael Li demonstrate TensorFlow's capabilities through its Python interface and explore TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine learning models on real-world data.

Murthy Mathiprakasam is a director of product marketing for Informatica’s big data products, where he is responsible for outbound marketing activities. Murthy has a decade and a half of experience working with emerging high-growth software technologies, including roles at Mercury Interactive/HP, Google, eBay, VMware, and Oracle. Murthy holds an MS in management science from Stanford University and BS degrees in management science and computer science from the Massachusetts Institute of Technology.

Presentations

Using an AI-driven approach to managing data lakes in the cloud or on-premises (sponsored by Informatica) Session

In the face of regulatory and competitive pressures, why not use artificial intelligence, along with smart best practices, to manage data lakes? Murthy Mathiprakasam shares a comprehensive approach to data lake management that ensures that you can quickly and flexibly ingest, cleanse, master, govern, secure, and deliver all types of data in the cloud or on-premises.

Carlos Matos is CTO of big data, data innovation, and advanced technologies at AIG, where he is responsible for the enterprise introduction and management of application and technology solutions for business intelligence, analytics, and core data systems, including data governance, data quality, and data modeling. He also supports a wide array of application development, data strategy, and emerging technology consulting services. Previously, Carlos was a core leader at Kaiser Permanente for the design and operational management of its primary electronic medical record technology environment, which supported over 100,000 clinical, financial, business, and analytics users and 10 million members across the United States, and was a pioneer in the design and introduction of the organization’s IaaS private cloud solution. Carlos holds a BS in business administration with a concentration in finance from California State University.

Presentations

Architect and operationalize your enterprise data lake (sponsored by Zaloni) Session

Envision the next phase of your company’s data future: providing centralized data services for streamlined yet controlled access to data for end users across lines of business. Carlos Matos and Ben Sharma share strategies for developing an enterprise-wide data lake service to drive shared data insights across the organization. Are you ready?

Andy Mauro is the cofounder and CEO of Automat, a conversational marketing platform that uses AI to allow companies to have personalized one-on-one messaging conversations with their customers to better understand and serve them. Automat’s cofounders and team collectively have over 50 years’ experience and 17 patents in the fields of speech recognition, natural language understanding, virtual assistants, and AI.

Presentations

Executive Briefing: Conversational marketing for brands—Why it's better to talk to your customers than monitor them Session

Andy Mauro explains why the last 15 years of digital marketing was really about monitoring customers and how recent advancements in artificial intelligence and the dominance of messaging as the primary consumer channel provide an opportunity to achieve every marketer's dream of simply talking to customers—providing a personalized experience that drives engagement, brand loyalty, and conversions.

Tony McAllister is the director of enterprise architecture at Be the Match, part of the National Marrow Donor Program, where he and his team design and build technology solutions that deliver cellular therapies to patients in need of a transplant. The team is currently building a real-time, distributed computing search engine on the Hadoop platform to find the best donor match from a global registry of over 30 million donors.

Presentations

Implementing Hadoop to save lives Session

The National Marrow Donor Program (Be the Match) recently moved its core transplant matching platform onto Cloudera Hadoop. Tony McAllister explains why the program chose Cloudera Hadoop and shares its big data goals: to increase the number of donors and matches, make the process more efficient, and make transplants more effective.

Michael McCune is a software developer in Red Hat’s Emerging Technology Group, where he develops and deploys applications for cloud platforms. He is an active contributor to several radanalytics.io projects and a core reviewer for the OpenStack API working group. Previously, Michael developed Linux-based software for embedded global positioning systems.

Presentations

From notebooks to cloud native: A modern path for data-driven applications Session

Notebook interfaces like Apache Zeppelin and Project Jupyter are excellent starting points for sketching out ideas and exploring data-driven algorithms, but where does the process lead after the notebook work has been completed? Michael McCune offers some answers as they relate to cloud-native platforms.

Jim McHugh is vice president and general manager at NVIDIA. He currently leads DGX-1, the world’s first AI supercomputer in a box. Jim focuses on building a vision of organizational success and executing strategies to deliver computing solutions that benefit from GPUs in the data center. With over 25 years of experience as a marketing and business executive with startup, mid-sized, and high-profile companies, Jim has a deep knowledge and understanding of business drivers, market/customer dynamics, technology-centered products, and accelerated solutions. Previously, Jim held leadership positions with Cisco Systems, Sun Microsystems, and Apple, among others.

Presentations

Harness the Power of AI and Deep Learning for Business (sponsored by NVIDIA) Keynote

AI is transforming industry and society. Accelerated computing, deep learning platforms, and intelligent machines supercharge digital transformation to harness the power of AI. This session features examples of AI-accelerated businesses and dives into specific approaches enterprises are taking to adopt AI and accelerated analytics.

Streamline Data Science Pipeline with GPU Data Frame (sponsored by NVIDIA) Session

Jim McHugh is joined by the founders of GOAI: Todd Mostak, CEO of MapD; SriSatish Ambati, CEO and cofounder of H2O; and Stan Seibert, director of community innovation at Anaconda. In this session, the speakers provide an update on the latest advancements and customer use cases leveraging GOAI.

Jason McIntyre is Accenture’s Digital Ecosystem Alliance management lead.

Presentations

Executive Briefing: Data ecosystem strategy Session

Whether you are a technology or a services provider, understanding your value in the ecosystem and focusing on the right partners to reach your market goals is critical. Jason McIntyre and Mark Milazzo share examples of teaming models and leading practices for accelerating value from your ecosystem strategy.

Tim McKenzie is general manager of big data solutions at Pitney Bowes, where he leads a global team dedicated to helping clients unlock the value that is hidden in the massive amounts of data collected about customers, infrastructure, and products. With over 17 years of experience engaging with customers about technology, Tim has a proven track record of delivering value in every engagement.

Presentations

Big data, location analytics, and geoenrichment to drive better business outcomes (sponsored by Pitney Bowes) Session

Organizations need to have a data strategy that includes the tools to derive location intelligence, enhance existing data with geographic enrichment (geoenrichment), and perform location analytics to reveal strategic and operational insights. Tim McKenzie shares new data quality and location intelligence approaches that operate natively within Hadoop and Spark environments.

Fiona McNeill is global text analytics product manager at SAS, where she focuses on new, emerging analytical technology. Fiona has been described as a pioneer in the field of analytics and has helped organizations in virtually every industry, including some of the largest global organizations, derive tangible benefit from the strategic use of technology applied to real-world business scenarios. Fiona is a well-known speaker, author, and innovator in the field of analytics. She coauthored Heuristics in Analytics and is a member of the Cognitive Computing Consortium working group.

Presentations

Meeting the challenges of the analytics economy (sponsored by SAS) Session

Much is being written about the economy of everything, but where does the analytics economy fit in? Fiona McNeill shares SAS's vision and roadmap for meeting the unique challenges of the analytics economy, including thoughts on intersections with related technologies like machine learning, deep learning, cognitive computing, and more.

David Mellor is the vice president and chief architect at Curriculum Associates and an adjunct professor at Boston University’s Graduate School of Computer Science. Previously, David was the chief architect of the Next Generation Application Platform at Oracle. He is the author of multiple books, including The Common Warehouse Metamodel Developer’s Guide. David also holds two patents in the area of extension mechanisms for web content.

Presentations

Building a real-time feedback loop for education (sponsored by MemSQL) Session

Curriculum Associates has a mission to make classrooms better places for teachers and students. To achieve this, the company introduces innovative and exciting new products that give every student the chance to succeed. David Mellor explains how Curriculum Associates developed a real-time data pipeline with MemSQL, which empowered teachers to provide immediate and accurate student feedback.

Ramesh Menon is head of products at Infoworks. Ramesh has over 20 years of experience building enterprise analytics and data management products. Previously, he led the team at YarcData that built the world’s largest shared-memory appliance for real-time data discovery and one of the industry’s first Spark-optimized platforms and worked at Informatica, where he was responsible for the go-to-market strategy for Informatica’s MDM and Identity Resolution products.

Presentations

Deploying an automated data platform, from data ingestion to consumption: A real-world enterprise example (sponsored by Infoworks) Session

Enterprises want to implement analytics use cases at the speed of business yet spend more time on complicated data management than on creating business value. The solution is automation. Ramesh Menon explains how a large enterprise automated data ingestion, data synchronization, and the building of data models and cubes to create a big data warehouse for the rapid deployment of analytics.

Michelle Mensing is a software engineer on the SAP HANA native development team at SAP, where she works on cross-platform ETL modeling tools and designs and builds modeling solutions for data transformation and management across multiple data lakes. Michelle holds a BS in IT systems engineering from the Hasso-Plattner-Institute in Potsdam, Germany, and is an alum of the HPI School of Design Thinking. During her studies, she focused on the integration of business processes with sensor data streams and monitoring solutions for user-driven decision management.

Presentations

Meet the Expert with Michelle Mensing (SAP) Meet the Experts

Michelle will be on hand to discuss SAP Data Hub solutions, challenges for building comprehensive data pipelines, and data orchestration and execution.

Orchestrating your complex data pipeline across your enterprise (sponsored by SAP) Session

Evolving big data architectures are creating an increasingly complex landscape. Michelle Mensing explains how to simplify data orchestration across various big data and enterprise sources, demonstrating how to create a complex pipeline and execute it in Kubernetes clusters, covering data acquisition, transformation, cleaning, and running the algorithms.

William Merchan is the chief strategy officer at DataScience.com, where he leads business and corporate development, partner initiatives, and strategy. Previously, he served as senior vice president of strategic alliances and general manager of dynamic pricing at MarketShare, where he oversaw global business development and partner relationships and successfully led the company to a $450 million acquisition by Neustar.

Presentations

Data science platforms: Your key to actionable analytics (sponsored by DataScience.com) Session

The number of inefficiencies in the data science workflow is staggering. Data science platforms have emerged to combat these inefficiencies. William Merchan outlines the key components of a data science platform and demonstrates how these platforms are enabling organizations to realize the potential of their data science teams.

Matteo Merli is a software engineer at Streamlio, where he works on messaging and storage technologies. Previously, he spent several years building database replication systems and multitenant messaging platforms at Yahoo. Matteo was the architect and lead developer for Pulsar and is a PMC member of Apache BookKeeper.

Presentations

Messaging, storage, or both: The real-time story of Pulsar and Apache DistributedLog Session

Modern enterprises produce data at increasingly high volume and velocity. To process data in real time, new types of storage systems have been designed, implemented, and deployed. Matteo Merli and Sijie Guo offer an overview of Apache DistributedLog and Pulsar, real-time storage systems built using Apache BookKeeper and used heavily in production.

A 20-year veteran of the technology market, with the last 15 focused on developing and managing alliance and vendor management programs, Mark Milazzo is the global lead at Accenture responsible for developing and managing the Accenture Insights Platform Partner Program.

Presentations

Executive Briefing: Data ecosystem strategy Session

Whether you are a technology or a services provider, understanding your value in the ecosystem and focusing on the right partners to reach your market goals is critical. Jason McIntyre and Mark Milazzo share examples of teaming models and leading practices for accelerating value from your ecosystem strategy.

Chris Mills is big data lead at the Meet Group. Chris has been coding since grade school. Unable to choose between science and engineering, he has spent his career working on projects incorporating both fields, in genetics, natural language processing, distance learning, content syndication, automated categorization, and recommender systems. Chris loves games and puzzles of all sorts and thinks that the intersection of big data and human behavior offers some of the very best puzzles available.

Presentations

Lessons from an AWS migration Session

if(we)'s batch event processing pipeline is different from yours, but the process of migrating it from running in a data center to running in AWS is likely pretty similar. Chris Mills explains what was easier than expected, what was harder, and what the company wished it had known before starting the migration.

Presentations

Retail's panacea: How machine learning is driving product development Session

Karen Moon, Jared Schiffman, Eric Colson, and Catherine Twist explore how the retail industry is embracing data to include consumers in the design and development process, tackling the challenges associated with the wealth of sources and the unstructured nature of the data they handle and process and how the data is turned into insights that are digestible and actionable.

Harjinder Mistry is a member of the developer tools team at Red Hat, where he is incorporating data science into next-generation developer tools powered by Spark. Previously, he was a member of IBM’s analytics team, where he developed Spark ML Pipelines components for the IBM Analytics platform, and spent several years on the DB2 SQL Query Optimizer team building and fixing the mathematical model that decides the query execution plan. Harjinder holds an MTech from IIIT, Bangalore, India.

Presentations

AI-driven next-generation developer tools Session

Bargava Subramanian and Harjinder Mistry explain how machine learning and deep learning techniques are helping Red Hat build smart developer tools that make software developers more efficient.

Karen Moon is cofounder and CEO of Trendalytics, a style-centric visual data platform that measures consumer engagement with merchandise trends. Karen has more than 12 years of experience in retail and technology working with companies across the supply chain, including department stores, luxury retailers, and independent designers. Previously, she executed Goode Partners’s investment in SkullCandy and worked on the turnaround of a luxury specialty retailer; worked in Gap’s Corporate Strategy group, where she assessed acquisition and new retail concept opportunities such as Piperlime.com; and held positions at Goldman Sachs, where she executed over $1 billion in technology and media transactions. She’s been featured in the Wall Street Journal, Forbes, and other publications. Karen holds an MBA from Harvard Business School and a BA (summa cum laude) from UCLA. Her research at Harvard included studies in multichannel retailing, luxury diffusion brands, and supply chain innovation for emerging designers. Karen initially pursued a BA in fashion design at Otis College of Art & Design.

Presentations

Retail's panacea: How machine learning is driving product development Session

Karen Moon, Jared Schiffman, Eric Colson, and Catherine Twist explore how the retail industry is embracing data to include consumers in the design and development process, tackling the challenges associated with the wealth of sources and the unstructured nature of the data they handle and process and how the data is turned into insights that are digestible and actionable.

John Morrell is senior director of product marketing at Datameer, where he leads the go-to-market efforts for the Datameer product family and focuses on how customers use Datameer to solve their business problems. John has a 25-year history in enterprise software, bringing to market numerous enterprise software products and working extensively to help solve difficult business problems in data management, BI, and analytics for such companies as Aleri, Coral8, Active Software, webMethods, Oracle, Informix, and Fair Isaac. John holds an MBA from Bentley College and a BS in computer engineering from Syracuse University.

Presentations

Finally, an interactive experience for your data lake (sponsored by Datameer) Session

While companies have flooded data lakes with billions of records, the technical limitations of Hadoop have kept analysts from interactively exploring this data and delivering real value—until now. John Morrell explores a solution helping analysts interactively and rapidly explore billions of records in Hadoop, offering a truly interactive experience and ushering in the era of Data Lake 2.0.

Jason Morton is an advisor at Ascendant and a visiting professor in computer science at Harvard University.

Presentations

Detecting a spoofing overlay Tutorial

Regulators increasingly require market participants to self-monitor to prevent manipulative practices such as spoofing and layering. Jason Morton shares methods for detecting a spoofing overlay on top of a legitimate strategy from a flow of time-stamped order and cancellation messages.

Todd Mostak is the founder of MapD. He is a graduate of Harvard’s Kennedy School of Government.

Presentations

Accelerate your analytics with a GPU Data Frame (sponsored by MapD) Session

For all of the innovation occurring across the GPU software ecosystem, the platforms themselves still remain isolated from each other—until now. Todd Mostak debuts the GPU Open Analytics Initiative’s first project, the GPU Data Frame (GDF), and explains how GDF enables efficient intra-GPU communication between different processes running on the GPUs.

Streamline Data Science Pipeline with GPU Data Frame (sponsored by NVIDIA) Session

Joining Jim McHugh are GOAI founders Todd Mostak (CEO, MapD), SriSatish Ambati (CEO and cofounder, H2O.ai), and Stan Seibert (director of community innovation, Anaconda). The speakers provide an update on the latest advancements and customer use cases leveraging GOAI.

Karthikeyan Nagalingam is a senior technical marketing engineer at NetApp, where he is responsible for defining and developing big data analytics data protection technologies, producing best practices documentation, and helping customers implement Hadoop and NoSQL solutions. Karthikeyan has extensive experience architecting Hadoop solutions in the cloud, hybrid cloud, and on-premises and deploying and developing in Linux environments. He has developed numerous proofs of concept, worked with customers on deploying Hadoop solutions, and spoken at many industry, customer, and partner events. He holds a patent for distributed data storage and processing techniques. Karthikeyan holds an MS in software systems from Birla Institute of Technology and Science and a bachelor of engineering from SriRam Engineering College.

Presentations

Key big data architectural considerations for deploying in the cloud and on-premises (sponsored by NetApp) Session

When analytics applications become business critical, balancing cost with SLAs for performance, backup, dev, test, and recovery is difficult. Karthikeyan Nagalingam discusses big data architectural challenges and how to address them and explains how to create a cost-optimized solution for the rapid deployment of business-critical applications that meet corporate SLAs today and into the future.

Milind Nagnur is the managing director and head of CTO data services at Citi, where he and his team deliver strategic solutions to clients focused on revenue discovery, regulatory compliance, and business performance transformation leveraging digital and data analytics. Milind has more than 18 years of business and IT experience across data, IT strategy, architecture, infrastructure, and application development, as well as experience managing large, complex global transformation initiatives, such as Citi’s next-generation Enterprise Analytics Platform (EAP 2.0). Previously, Milind was the principal architect and apps development group manager in trade and treasury services at JPMorgan Chase and the systems integration consultant for financial services clients at Pricewaterhouse Coopers. He is project sponsor, mentor, and senior advocate for Citi’s Women’s Leadership Council Developing Talent Program (DTP) and Emerging Talent Program (ETP) and serves on the advisory board for Citi Ventures, where he focuses on the firm’s data and analytic investments. Milind holds a BTech in mechanical engineering from the Indian Institute of Technology, Mumbai and an MBA in finance and computer information systems from the Indian Institute of Management, Calcutta.

Presentations

Next-generation data management Session

Milind Nagnur explores the requirements for a next-generation platform for data management, covering everything from controlled exploratory sandboxes to hosting transactional applications, and explains how modern, industry-leading data management tools and self-service analytics can address these needs.

Raghunath Nambiar is the chief technology officer of Cisco’s Unified Computing System (UCS) business, where he helps define strategies for next-generation architectures, systems, and datacenter solutions and leads a team of engineers and product leaders focused on emerging technologies such as big data, analytics, the internet of things, and artificial intelligence. He has played an instrumental role in accelerating the growth of Cisco UCS into a top datacenter compute platform. Previously, Raghu was a Cisco distinguished engineer and chief architect of big data and analytics solution engineering, responsible for incubating that practice and growing it into a mainstream portfolio. He brings years of technical accomplishments and significant expertise in systems architecture, performance engineering, and creating disruptive technology solutions. Raghu has served in leadership positions on industry standards committees for performance evaluation and at leading academic conferences: he chaired the industry’s first standards committee for benchmarking big data systems and its first standards committee for benchmarking the internet of things and was the founding chair of the TPC’s International Conference Series on Performance Evaluation and Benchmarking. He has published more than 50 peer-reviewed papers and book chapters and 10 books in the Lecture Notes in Computer Science (LNCS) series and holds five patents, with several pending. Prior to Cisco, Raghu was an architect at Hewlett-Packard, where he was responsible for several industry-first and disruptive technology solutions and a decade of performance benchmark leadership. He holds master’s degrees from the University of Massachusetts and Goa University and completed an advanced management program at Stanford University.

Raghu’s recent book, Transforming Industry Through Data Analytics, examines the role of analytics in enabling digital transformation, how the explosion in internet connections affects key industries, and how applied analytics will impact our future.

Presentations

Analytics everywhere, from things to cities (sponsored by Cisco) Keynote

There are endless possibilities when we connect the unconnected. Raghunath Nambiar discusses the magnitude of the new challenges and opportunities across industry segments.

Paco Nathan leads the Learning Group at O’Reilly Media. Known as a “player/coach” data scientist, Paco led innovative data teams building ML apps at scale for several years and more recently was evangelist for Apache Spark, Apache Mesos, and Cascading. Paco has expertise in machine learning, distributed systems, functional programming, and cloud computing with 30+ years of tech-industry experience, ranging from Bell Labs to early-stage startups. Paco is an advisor for Amplify Partners and was cited in 2015 as one of the top 30 people in big data and analytics by Innovation Enterprise. He is the author of Just Enough Math, Intro to Apache Spark, and Enterprise Data Workflows with Cascading.

Presentations

PyTextRank: Graph algorithms for enhanced natural language processing Session

Paco Nathan demonstrates how to use PyTextRank—an open source Python implementation of TextRank that builds atop spaCy, datasketch, NetworkX, and other popular libraries to prepare raw text for AI applications in media and learning—to move beyond outdated techniques such as stemming, n-grams, or bag-of-words while performing advanced NLP on single-server solutions.
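The core idea behind TextRank, which the session builds on, can be sketched without any NLP library: build a word co-occurrence graph over a sliding window, then rank nodes with a PageRank-style iteration. The following pure-Python sketch is illustrative only; the function name and token input are hypothetical, and it is not the PyTextRank or spaCy API.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iters=50, top_k=3):
    """Rank words by a PageRank-style score over a co-occurrence graph."""
    # Build an undirected co-occurrence graph over a sliding window.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window + 1]:
            if word != other:
                neighbors[word].add(other)
                neighbors[other].add(word)
    # Iterate the PageRank-style update a fixed number of times.
    rank = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        rank = {
            w: (1 - damping) + damping * sum(
                rank[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    return [w for w, _ in sorted(rank.items(), key=lambda kv: -kv[1])[:top_k]]
```

In practice, PyTextRank operates on spaCy parse results and ranks lemmatized noun phrases rather than raw tokens; this sketch only shows the underlying graph-ranking step.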

Heather Nelson is a senior solution architect at Silicon Valley Data Science, where she draws from her diverse background in business and technology consulting to find the best solutions for her clients’ toughest data problems. A problem solver by nature, Heather is passionate about helping organizations leverage data to drive competitive advantage.

Presentations

Ask me anything: Running data science in the enterprise and architecting data platforms Ask Me Anything

John Akred, Stephen O'Sullivan, and Heather Nelson field a wide range of detailed questions on topics such as managing data science in the enterprise, architecting a data platform, and creating a modern enterprise data strategy. Even if you don’t have a specific question, join in to hear what others are asking.

Managing data science in the enterprise Tutorial

John Akred and Heather Nelson share methods and observations from three years of effectively deploying data science in enterprise organizations. You'll learn how to build, run, and get the most value from data science teams and how to work with and plan for the needs of the business.

Chris Neumann is a venture partner at 500 Startups focused on big data, machine learning, and AI. Previously, Chris was the founder and CEO of DataHero (acquired by Cloudability), which brought to market the first self-service cloud BI platform, and the first employee at Aster Data (acquired by Teradata), where he helped create the big data space.

Presentations

Accelerating the next generation of data companies Session

This panel brings together partners from some of the world’s leading startup accelerators and founders of up-and-coming enterprise data startups to discuss how we can help create the next generation of successful enterprise data companies.

Alan Nichol is cofounder and CTO of leading open source conversational AI company Rasa, where he helps create the software that enables developers to build conversational software that really works. Rasa is trusted by thousands of developers in enterprises worldwide, including UBS, ERGO, and Helvetia. Alan has years of experience building AI-powered products in industry. He holds a PhD in machine learning from the University of Cambridge.

Presentations

Deep learning for understanding language and holding conversations HDS

There's a large body of research on machine learning-based dialogue, but most voice and chat systems in production are still implemented using a state machine and a set of rules. Alan Nichol offers an overview of Rasa's applied AI research in language understanding and dialogue and explains how open source implementations bring the state of the art to thousands of developers.

Ryan Nienhuis is a senior technical product manager on the Amazon Kinesis team, where he defines products and features that make it easier for customers to work with real-time, streaming data in the cloud. Previously, Ryan worked at Deloitte Consulting, helping customers in banking and insurance solve their data architecture and real-time processing problems. Ryan holds a BE from Virginia Tech.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application on the cloud? Ryan Nienhuis, Radhika Ravirala, Allan MacInnis, and Ben Snively walk you through building a big data application using a combination of open source technologies and AWS managed services.

Jack Norris is the senior vice president of data and applications at MapR Technologies. Jack has a wide range of demonstrated successes, from defining new markets for small companies to increasing sales of new products for large public companies, in his 20 years spent in enterprise software marketing. Jack’s broad experience includes launching and establishing analytics, virtualization, and storage companies and leading marketing and business development for an early-stage cloud storage software provider. Jack has also held senior executive roles with EMC, Rainfinity, Brio Technology, SQRIBE, and Bain & Company. Jack earned an MBA from UCLA’s Anderson School of Management and a BA in economics with honors and distinction from Stanford University.

Presentations

The essentials for digital growth (sponsored by MapR) Session

Jack Norris shares lessons learned by leading companies leveraging data to transform customer experiences, operational results, and overall growth and details the infrastructure, development, and data management principles used by successful leaders to drive agility regardless of application volume or scale.

Brandon O’Brien is a streaming big data systems engineer, innovator, and principal engineer at Expedia, Inc.

https://www.linkedin.com/in/brandonjobrien

Presentations

Business operations in Expedia through real time metric trends, predictions, correlations and anomaly detection DCS

Brandon O’Brien discusses a tech stack that collects millions of raw events per second and trends, predicts, correlates, and detects anomalies among them to alert the business in real time, covering solutions to challenges such as seasonal data, sparse data, and scale.
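As a toy illustration of the real-time anomaly-detection step described above, here is a rolling z-score detector in pure Python. This is a deliberately simplified sketch with hypothetical names, not the stack described in the session, and it ignores the seasonality handling the session covers.

```python
from collections import deque
import math

class RollingZScoreDetector:
    """Flag values that deviate sharply from a rolling window of recent data."""

    def __init__(self, window=30, threshold=3.0):
        self.window = deque(maxlen=window)  # recent observations
        self.threshold = threshold          # z-score cutoff for "anomalous"

    def observe(self, value):
        """Return True if `value` is anomalous relative to the recent window."""
        anomalous = False
        if len(self.window) >= 5:  # need a few points before judging
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = math.sqrt(var)
            if std > 0 and abs(value - mean) / std > self.threshold:
                anomalous = True
        self.window.append(value)
        return anomalous
```

The window size and threshold trade sensitivity against false alarms; production systems at this scale typically add per-metric baselines and seasonal decomposition on top of a detector like this.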

Cathy O’Neil is a data scientist for the startup media company Intent Media. Cathy began her career as a postdoc in MIT’s Math Department. She has been a professor at Barnard College, where she published a number of research papers in arithmetic algebraic geometry, and worked as a quant for the hedge fund D.E. Shaw in the middle of the credit crisis and for RiskMetrics, a risk software company that assesses risk for the holdings of hedge funds and banks. Cathy holds a PhD in math from Harvard.

Presentations

Meet the Expert with Cathy O'Neil (Weapons of Math Destruction) Meet the Experts

Mathematical models shape our future—scoring teachers and students, sorting résumés, granting (or denying) loans, evaluating workers, targeting voters, setting parole, and monitoring our health. Join Cathy for a spirited discussion about what this means for individuals and our society.

Weapons of math destruction Keynote

Cathy O'Neil exposes the mathematical models that are shaping our future, both as individuals and as a society. These “weapons of math destruction” score teachers and students, sort résumés, grant (or deny) loans, evaluate workers, target voters, set parole, and monitor our health.

Brian O’Neill is the founder and consulting product designer at Designing for Analytics, where he focuses on helping companies design indispensable data products that customers love. Brian’s clients and past employers include Dell EMC, NetApp, TripAdvisor, Fidelity, DataXu, Apptopia, Accenture, MITRE, Kyruus, Dispatch.me, JPMorgan Chase, the Future of Music Coalition, and E*TRADE, among others; over his career, he has worked on award-winning storage industry software for Akorri and Infinio. Brian has been designing useful, usable, and beautiful products for the web since 1996. He cofounded the adventure travel company TravelDragon.com and has invested in several Boston-area startups. When he is not manning his Big Green Egg at a BBQ or mixing a classic tiki cocktail, Brian can be found on stage as a professional percussionist and drummer. He leads the acclaimed dual-ensemble, Mr. Ho’s Orchestrotica, which the Washington Post called “anything but straightforward,” and has performed at Carnegie Hall, the Kennedy Center, and the Montreal Jazz Festival.

Say Hello: Look for the (only?) person walking around with an orange leather laptop bag.

Presentations

Design for nondesigners: Increasing revenue, usability, and utility within data analytics products Session

Do you spend a lot of time explaining your data analytics product to your customers? Is your UI/UX or navigation overly complex? Are sales suffering due to complexity, or worse, are customers not using your product? Your design may be the problem. Brian O'Neill shares a secret: you don't have to be a trained designer to recognize design and UX problems and start correcting them today.

Tim O’Reilly has a history of convening conversations that reshape the computer industry. In 1998, he organized the meeting where the term “open source software” was agreed on and helped the business world understand its importance. In 2004, with the Web 2.0 Summit, he defined how “Web 2.0” represented not only the resurgence of the web after the dot-com bust but a new model for the computer industry, based on big data, collective intelligence, and the internet as a platform. In 2009, with his Gov 2.0 Summit, Tim framed the conversation about the modernization of government technology that has shaped policy and spawned initiatives at the federal, state, and local levels and around the world. He has now turned his attention to implications of the on-demand economy, AI, robotics, and other technologies that are transforming the nature of work and the future shape of the economy. He shares his thoughts about these topics in his new book, WTF? What’s the Future and Why It’s Up to Us (Harper Business, October 2017). Tim is the founder and CEO of O’Reilly Media and a partner at O’Reilly AlphaTech Ventures (OATV). He sits on the boards of Maker Media (which was spun out from O’Reilly Media in 2012), Code for America, PeerJ, Civis Analytics, and POPVOX.

Presentations

WTF? What's the future and why it's up to us Keynote

Robots are going to take our jobs, they say. Tim O'Reilly says, "Only if that's what we ask them to do!" Tim has had his fill of technological determinism. He explains why technology is the solution to human problems and why we won't run out of work till we run out of problems.

A leading expert on big data architecture and Hadoop, Stephen O’Sullivan has 20 years of experience creating scalable, high-availability data and applications solutions. A veteran of @WalmartLabs, Sun, and Yahoo, Stephen leads data architecture and infrastructure at Silicon Valley Data Science.

Presentations

Architecting a data platform Tutorial

What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

Ask me anything: Running data science in the enterprise and architecting data platforms Ask Me Anything

John Akred, Stephen O'Sullivan, and Heather Nelson field a wide range of detailed questions on topics such as managing data science in the enterprise, architecting a data platform, and creating a modern enterprise data strategy. Even if you don’t have a specific question, join in to hear what others are asking.

Rick Okin is vice president of data engineering for JW Player, the world’s largest network-independent video platform, where he is responsible for building innovative data products to expand JW Player’s extensive footprint. A data-driven technology expert with more than 30 years of experience in the information technology industry, Rick previously served as CTO for actionable advertising intelligence provider Integral Ad Science, where he was responsible for managing all aspects of technology and system operations.

Presentations

How JW Player is powering the online video revolution with data analytics (sponsored by Snowflake Computing) Session

Rick Okin explains how JW Player strategically leverages video data analytics to power industry- and customer-level insights for the evolving online video space.

Mike Olson cofounded Cloudera in 2008 and served as its CEO until 2013, when he took on his current role of chief strategy officer. As CSO, Mike is responsible for Cloudera’s product strategy, open source leadership, engineering alignment, and direct engagement with customers. Previously, Mike was CEO of Sleepycat Software, makers of Berkeley DB, the open source embedded database engine, and he spent two years at Oracle Corporation as vice president for embedded technologies after Oracle’s acquisition of Sleepycat. Prior to joining Sleepycat, Mike held technical and business positions at database vendors Britton Lee, Illustra Information Technologies, and Informix Software. Mike holds a bachelor’s and a master’s degree in computer science from the University of California, Berkeley.

Presentations

Executive Briefing: Machine learning—Why you need it, why it's hard, and what to do about it Session

Mike Olson shares examples of real-world machine learning applications, explores a variety of challenges in putting these capabilities into production—the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and skills and agility in getting data and answers into the right hands—and outlines proven ways to meet them.

How the IoT and machine learning keep America truckin' Keynote

Data is powering the largest trucks on America’s interstates, the buses that take our children to school, and the military vehicles that help protect our country. Terry Kline and Mike Olson look at how machine learning and predictive analytics keep more than 300,000 connected vehicles rolling.

Journey to consolidation Keynote

Twenty years ago, a company implored us to “think different” about personal computers. Today, Apple continues to live and breathe that legacy. It’s evident in the machine learning and analytics architectures that power many of the company's most innovative applications. Cesar Delgado joins Mike Olson to discuss how Apple is using its big data stack and expertise to solve non-data problems.

Francois Orsini is the chief technology officer for MZ’s Satori business unit. Previously, he served as vice president of platform engineering and chief architect, bringing his expertise in building server-side architecture and implementation for a next-gen social and server platform; was a database architect and evangelist at Sun Microsystems; and worked in OLTP database systems, middleware, and real-time infrastructure development at companies like Oracle, Sybase, and Cloudscape. Francois has extensive experience working with database and infrastructure development, honing his expertise in distributed data management systems, scalability, security, resource management, HA cluster solutions, soft real-time, and connectivity services. He also collaborated with Visa International and Visa USA to implement the first Visa Cash Virtual ATM for the internet and founded a VC-backed startup called Unikala in 1999. Francois holds a bachelor’s degree in civil engineering and computer sciences from the Paris Institute of Technology.

Presentations

Anomaly detection on live data Session

Services such as YouTube, Netflix, and Spotify popularized streaming in different industry segments, but these services do not center around live data—best exemplified by sensor data—which will be increasingly important in the future. Arun Kejariwal, Francois Orsini, and Dhruv Choudhary demonstrate how to leverage Satori to collect, discover, and react to live data feeds at ultralow latencies.
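The abstract doesn't describe Satori's internals, but the core idea of reacting to a live feed can be sketched with a rolling-window detector that flags readings far from the recent baseline. The following is a minimal, self-contained Python illustration, not Satori's actual implementation; the class name and thresholds are hypothetical choices for this example.

```python
from collections import deque
import math

class StreamingAnomalyDetector:
    """Flag points that deviate sharply from a rolling-window baseline."""
    def __init__(self, window=50, threshold=3.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        is_anomaly = False
        if len(self.window) >= 10:  # wait for a minimal baseline first
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = math.sqrt(var) or 1e-9  # guard against a zero-variance window
            is_anomaly = abs(value - mean) / std > self.threshold
        self.window.append(value)
        return is_anomaly

detector = StreamingAnomalyDetector()
# 30 ordinary readings with mild jitter, then a spike, then a normal reading
stream = [10 + ((i % 7) - 3) * 0.05 for i in range(30)] + [50.0, 10.05]
flags = [detector.observe(v) for v in stream]
```

In a real deployment the `observe` calls would be driven by the live feed rather than a list, and the anomaly threshold tuned per channel.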

Joel Östlund is a senior data engineer in research and development at Spotify. Previously, Joel was a data and backend engineer at a national security company, a researcher at Ericsson, and a data engineering consultant in Gurgaon, India, and Italy. He holds an MS in industrial engineering and management with a specialization in computer science from Linköping University, Sweden, and National Chiao Tung University, Taiwan.

Presentations

Managing core data entities for internal customers at Spotify Session

Spotify makes data-driven product decisions. As the company grows, the magnitude and complexity of the data it cares for the most is rapidly increasing. Sneha Rao and Joel Östlund walk you through how Spotify stores and exposes audience data created by multiple internal producers within Spotify.

Andrew Otto is a systems engineer at the Wikimedia Foundation, where he supports the analytics team by architecting and maintaining small and big data analytics infrastructure. Previously, Andrew was the lead systems administrator at CouchSurfing.org. He is based in Brooklyn, NY, and spends too much time playing hardcourt bike polo.

Presentations

Analytics at Wikipedia Session

The Wikimedia Foundation (WMF) is a nonprofit charitable organization. As the parent company of Wikipedia, one of the most visited websites in the world, WMF faces many unique challenges around its ecosystem of editors, readers, and content. Andrew Otto and Fangjin Yang explain how the WMF does analytics and offer an overview of the technology it uses to do so.

Jon Ouimet is a senior solution engineer on the Control-M Innovation IT team at BMC Software, where he supports efforts to integrate Control-M into the DevOps, big data, and cloud markets. Jon has years of experience working with Control-M—administering it and scheduling and operating various environments. His specialties include Control-M, ETL processes, system administration (Linux and Windows), and programming in C++, Java, Python, and bash, to name a few.

Presentations

Automated data pipelines in hybrid environments: Myth or reality? (sponsored by BMC) Session

Are you building, running, or managing complex data pipelines across hybrid environments spanning multiple applications and data sources? Doing this successfully requires automating dataflows across the entire pipeline, ideally controlled through a single source. Basil Faruqui and Jon Ouimet walk you through a customer journey to automate data pipelines across a hybrid environment.

Shoumik Palkar is a second-year PhD student in the Infolab at Stanford University, where he works with Matei Zaharia on high-performance data analytics. He holds a degree in electrical engineering and computer science from UC Berkeley.

Presentations

Weld: Accelerating data science by 100x Session

Modern data applications combine functions from many optimized libraries (e.g., pandas and TensorFlow) and yet do not achieve peak hardware performance due to data movement across functions. Shoumik Palkar and Matei Zaharia offer an overview of Weld, a new interface to implement functions in these libraries while enabling optimizations across them.
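The data-movement problem Weld targets can be shown conceptually: when every library call materializes its result, data makes a full pass through memory per function. A deferred-execution design instead records the operations and fuses them into one loop. The sketch below is a conceptual illustration in plain Python, not Weld's actual API; `LazyVector` is a hypothetical name.

```python
class LazyVector:
    """Record elementwise operations instead of executing them eagerly,
    then fuse them into a single pass at evaluation time."""
    def __init__(self, data):
        self.data = data
        self.ops = []  # deferred operations, in application order

    def map(self, fn):
        self.ops.append(fn)
        return self

    def evaluate(self):
        # One pass over the data: each element flows through every op
        # without materializing an intermediate vector between "libraries".
        out = []
        for x in self.data:
            for op in self.ops:
                x = op(x)
            out.append(x)
        return out

v = LazyVector([1, 2, 3, 4])
result = v.map(lambda x: x + 1).map(lambda x: x * 10).evaluate()
# result == [20, 30, 40, 50]
```

Weld goes much further (a typed IR, loop fusion, vectorization, and multithreading), but the payoff comes from the same principle: optimizing across function boundaries rather than within them.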

Lloyd Palum is the CTO of Vnomics, where he directs the company’s technology development associated with optimizing fuel economy in commercial trucking. Lloyd has more than 25 years of experience in both commercial and government electronics, has published a number of technical articles, and speaks frequently at industry conferences. He holds five patents in the field of software and wireless communications. Lloyd earned his MSEE from Boston University and BSEE from the University of Rochester.

Presentations

How to build a digital twin Session

A digital twin models a real-world physical asset using mobile data, cloud computing, and machine learning to track chosen characteristics. Lloyd Palum walks you through building a tractor trailer digital twin using Python and TensorFlow. You can then use the example model to track and optimize performance.
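As a rough illustration of the idea (not Vnomics's actual model, and omitting the TensorFlow learning component), a digital twin can start as an object that mirrors an asset's state from telemetry and derives the tracked characteristic, here fuel economy. The class and field names below are hypothetical.

```python
class TruckDigitalTwin:
    """Minimal digital twin: mirror a truck's state from telemetry
    and track one chosen characteristic (fuel economy in mpg)."""
    def __init__(self, truck_id):
        self.truck_id = truck_id
        self.total_miles = 0.0
        self.total_gallons = 0.0

    def ingest(self, reading):
        # 'reading' is one telemetry sample from the physical asset
        self.total_miles += reading["miles"]
        self.total_gallons += reading["gallons"]

    @property
    def mpg(self):
        return self.total_miles / self.total_gallons if self.total_gallons else 0.0

twin = TruckDigitalTwin("truck-42")
for sample in [{"miles": 55.0, "gallons": 8.0}, {"miles": 60.0, "gallons": 9.2}]:
    twin.ingest(sample)
print(round(twin.mpg, 2))  # ≈ 6.69
```

A production twin would add a learned model of expected fuel burn (the TensorFlow piece) so observed economy can be compared against the physically achievable optimum.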

Kevin Parent is the CEO of Conduce, a company that helps leaders and teams see and interact with all their data instantly using a single, intuitive human interface. An innovator, Kevin has focused his entire career on connecting the dots between advances in technology and human experiences. Previously, he cofounded Oblong Industries, where he invented new-to-the-world interfaces that allow users to interact with software using displays, gestures, wands, tablets, and smartphones, and spent 10 years engineering theme park attractions. (He was a project engineer for the Twilight Zone Tower of Terror at Walt Disney Imagineering.) Kevin is the author of six patents. He holds a degree in physics from the Massachusetts Institute of Technology, where his undergraduate thesis work was conducted in MIT’s Media Lab.

Presentations

Implementing a successful real-time project Data 101

DHL's Javier Esplugas and Conduce's Kevin Parent explain how the two companies have implemented an IoT pipeline that gives managers and executives real-time insight into warehouse operations, helping them to identify potential hazards, reduce costs, and increase productivity.

Seeing everything so managers can act on anything: The IoT in DHL Supply Chain operations Session

DHL has created an IoT initiative for its supply chain warehouse operations. Javier Esplugas and Kevin Parent explain how DHL has gained unprecedented insight—from the most comprehensive global view across all locations to a unique data feed from a single sensor—to see, understand, and act on everything that occurs in its warehouses with immersive operational data visualization.

Robert Passarella evaluates AI and machine-learning investment managers for Protégé Partners. Rob has spent over 20 years on Wall Street in the gray zone between business and technology, focusing on leveraging technology and innovative information sources to empower novel ideas in research and the investment process. A veteran of Morgan Stanley, JPMorgan, Bear Stearns, Dow Jones, and Bloomberg, he has seen the transformational challenges firsthand, up close and personal. Always intrigued by the consumption and use of information for investment analysis, Rob is passionate about leveraging alternative and unstructured data for use with machine learning techniques. Rob holds an MBA from the Columbia Business School.

Presentations

Findata welcome Tutorial

Alistair Croll and Rob Passarella welcome you to Findata Day.

Rumpelstiltskin and the financial markets Findata

Can you really use unstructured data as part of your investment process? Why are leading financial services firms building practices to handle all sorts of data from satellite to browsing? Is the IoT a significant data point for financial analysis? Rob Passarella explores current applications and use cases for data, which financial services firms are eagerly gobbling up with alpha in mind.

Mo Patel is a practice director for AI and deep learning at Teradata, where he mentors and advises Teradata clients and provides guidance on ongoing deep learning projects. Mo has successfully managed and executed data science projects with clients across several industries, including cable, auto manufacturing, medical device manufacturing, technology, and car insurance. Previously, Mo was a management consultant and a software engineer. A continuous learner, Mo conducts research on applications of deep learning, reinforcement learning, and graph analytics toward solving existing and novel business problems and brings a diversity of educational and hands-on expertise connecting business and technology. He holds an MBA, a master’s degree in computer science, and a bachelor’s degree in mathematics.

Presentations

Deep learning for recommender systems Tutorial

Junxia Li and Mo Patel demonstrate how to apply deep learning to improve consumer recommendations by training neural nets to learn categories of interest for recommendations using embeddings. You'll also learn how to achieve wide and deep learning with WALS matrix factorization—now used in production for the Google Play store.
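The factorization idea behind the abstract can be sketched in miniature: learn low-dimensional user and item embeddings whose dot products approximate observed ratings. The pure-Python example below substitutes stochastic gradient descent for the alternating least-squares solves used in production WALS; the dataset, names, and hyperparameters are all illustrative.

```python
import random

def factorize(ratings, k=2, steps=1500, lr=0.05, reg=0.01):
    """Tiny matrix factorization by SGD (a stand-in for WALS-style
    alternating least squares): learn k-dimensional user and item
    embeddings whose dot products approximate observed ratings."""
    random.seed(0)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    U = {u: [random.uniform(0.1, 0.5) for _ in range(k)] for u in users}
    V = {i: [random.uniform(0.1, 0.5) for _ in range(k)] for i in items}
    for _ in range(steps):
        for u, i, r in ratings:
            pred = sum(a * b for a, b in zip(U[u], V[i]))
            err = r - pred
            for f in range(k):  # gradient step with L2 regularization
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V

ratings = [("alice", "song1", 5), ("alice", "song2", 1),
           ("bob", "song1", 4), ("bob", "song3", 5)]
U, V = factorize(ratings)
pred = sum(a * b for a, b in zip(U["alice"], V["song1"]))
```

Unobserved user-item pairs can then be scored the same way, which is what makes the learned embeddings useful for recommendation.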

Bob Patterson is a certified master IT architect and chief strategist at HPE, where he focuses on enterprise data, analytics, and the internet of things. As a member of HPE’s strategic solutions architecture (SSA) team, Bob works with customers, industries, account teams, partners, and HPE product and services teams to drive opportunities, activities, and initiatives around data analytics and business intelligence. Previously, Bob spent 20 years as a systems engineer, consultant, IT specialist, and certified senior IT architect at IBM, where he was responsible for the design, development, and implementation of global solutions for IBM customers. He was also a member of the IBM Global Technology Services Architecture Board responsible for designing reference architectures for server consolidation and virtualization, infrastructure interoperability, and cloud computing. Bob currently guest lectures for the School of Engineering at Robert Morris University and volunteers for several nonprofit organizations. He has two patents in cloud implementation. Bob holds a BS in mechanical engineering from Carnegie Mellon University and MS degrees in computer information systems and telecommunications from the University of Denver; he also holds professional certifications in ITIL and HPE’s ExpertOne Planning and Design of Business Critical Systems.

Presentations

A comprehensive, enterprise-grade, open Hadoop solution from Hewlett Packard Enterprise (sponsored by Hewlett Packard Enterprise) Session

Bob Patterson offers an overview of Hewlett Packard Enterprise's enterprise-grade Hadoop solution, which has everything you need to accelerate your big data journey: innovative hardware architectures for diverse workloads certified for all leading distros, infrastructure software, services from HPE and partners, and add-ons like object storage.

Josh Patterson is the director of field engineering for Skymind. Previously, Josh ran a big data consultancy, worked as a principal solutions architect at Cloudera, and was an engineer at the Tennessee Valley Authority, where he was responsible for bringing Hadoop into the smart grid during his involvement in the openPDC project. Josh is a cofounder of the DL4J open source deep learning project and is a coauthor of Deep Learning: A Practitioner’s Approach. Josh has over 15 years’ experience in software development and continues to contribute to projects such as DL4J, Canova, Apache Mahout, Metronome, IterativeReduce, openPDC, and JMotif. Josh holds a master’s degree in computer science from the University of Tennessee at Chattanooga, where he did research in mesh networks and social insect swarm algorithms.

Presentations

Real-time image classification: Using convolutional neural networks on real-time streaming data Session

Enterprises building data lakes often have to deal with very large volumes of image data that they have collected over the years. Josh Patterson and Kirit Basu explain how some of the most sophisticated big data deployments are using convolutional neural nets to automatically classify images and add rich context about the content of the image, in real time, while ingesting data at scale.

Securely building deep learning models for digital health data Tutorial

Josh Patterson, Vartika Singh, David Kale, and Tom Hanlon walk you through interactively developing and training deep neural networks to analyze digital health data using the Cloudera Workbench and Deeplearning4j (DL4J). You'll learn how to use the Workbench to rapidly explore real-world clinical data, build data-preparation pipelines, and launch training of neural networks.

Joshua Patterson is the director of applied solutions engineering at NVIDIA. Previously, Josh worked with leading experts across the public and private sectors and academia to build a next-generation cyberdefense platform. He was also a White House Presidential Innovation Fellow. His current passions are graph analytics, machine learning, and GPU data acceleration. Josh also loves storytelling with data and creating interactive data visualizations. He holds a BA in economics from the University of North Carolina at Chapel Hill and an MA in economics from the University of South Carolina’s Moore School of Business.

Presentations

GPU-accelerating a deep learning anomaly detection platform Session

How can deep learning be employed to create a system that monitors network traffic, operations data, and system logs to reliably flag risk and unearth potential threats? Satish Dandu, Joshua Patterson, and Michael Balint explain how to bootstrap a deep learning framework to detect risk and threats in operational production systems, using best-of-breed GPU-accelerated open source tools.

Nick Pentreath is a principal engineer at IBM working primarily on machine learning on Apache Spark. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations. He has also worked at Goldman Sachs, Cognitive Match, and Mxit. He is a member of the Apache Spark PMC and author of Machine Learning with Spark. Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.

Presentations

Deep learning for recommender systems Session

In the last few years, deep learning has achieved significant success in a wide range of domains, including computer vision, artificial intelligence, speech, NLP, and reinforcement learning. However, deep learning in recommender systems has, until recently, received relatively little attention. Nick Pentreath explores recent advances in this area in both research and practice.

Sander Pick is CTO at Set, an on-device machine learning platform that aims to embed user intelligence into every mobile application. Previously, Sander worked at Apple and Mission Motors. A Montanan, Sander likes focus, climbing, and open spaces.

Presentations

Learning location: Real-time feature extraction for mobile analytics Session

Location-based data is full of information about our everyday lives, but GPS and WiFi signals create extremely noisy mobile location data, making it hard to extract features, especially when working with real-time data. Andrew Hill and Sander Pick explore new strategies for extracting information from location data while remaining scalable, privacy focused, and contextually aware.

Mike Pittaro is a distinguished engineer at Dell EMC, where he works on big data cluster architectures. Mike has a background in high-performance computing, data warehousing, and distributed systems and has held engineering and service positions at Alliant Computer, Kendall Square Research, Informatica, and SnapLogic.

Presentations

Considerations for hardware-accelerated machine learning platforms Session

The advances we see in machine learning would be impossible without hardware improvements, but building a high-performance hardware platform is tricky. It involves hardware choices, an understanding of software frameworks and algorithms, and how they interact. Mike Pittaro shares the secrets of matching the right hardware and tools to the right algorithms for optimal performance.

Adrian Popescu is a data engineer at Unravel Data Systems working on performance profiling and optimization of Spark applications. He has more than eight years of experience building and profiling data management applications. He holds a PhD in computer science from EPFL, where his thesis focused on modeling the runtime performance of analytical workloads, including iterative tasks executing on in-memory graph processing engines (Giraph BSP) and SQL queries executing at scale on Hive; a master of applied science from the University of Toronto; and a bachelor of science from University Politehnica of Bucharest.

Presentations

Using ML to solve failure problems with ML and AI apps in Spark Session

A roadblock in the agility that comes with Spark is that application developers can get stuck with application failures and have a tough time finding and resolving the issue. Adrian Popescu and Shivnath Babu explain how to use the root cause diagnosis algorithm and methodology to solve failure problems with ML and AI apps in Spark.

Sean Power is the founder of Repable.

Presentations

Data science and e-sports DCS

Sean Power explores available datasets, how to access them, and the data challenges involved in working with data in the e-sports space.

Mate Radalj is vice president and principal software engineer at Kinetica. Previously, he was associate vice president and principal technologist at Infosys and a vice president and principal software engineer at SAP. Earlier in his career, Mate worked at Netezza, IBM, SignalDemand, Celquest, Callixa, Softport, Informix, and American International Credit and was a US Army military intelligence officer. Mate holds a BS in computer science from Fordham University.

Presentations

Smarter business apps with a modern GPU database (sponsored by Kinetica) Session

Infusing business apps with AI isn’t easy. Mate Radalj explains why you need to master the entire AI process from data to models to operationalization so you can build, train, and deploy predictive models that unleash smart business apps and enable data-driven decisions.  

Syed Rafice is a senior system engineer at Cloudera, where he specializes in big data on Hadoop technologies and is responsible for designing, building, developing, and assuring a number of enterprise-level big data platforms using the Cloudera distribution. Syed also focuses on both platform and cybersecurity. He has worked across multiple sectors, including government, telecoms, media, utilities, financial services, and transport.

Presentations

A practitioner’s guide to Hadoop security for the hybrid cloud Tutorial

Mark Donsky, André Araujo, Syed Rafice, and Manish Ahluwalia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

Greg Rahn is a director of product management at Cloudera, where he is responsible for driving SQL product strategy as part of Cloudera’s analytic database product, including working directly with Impala. Over his 20-year career, Greg has worked with relational database systems in a variety of roles, including software engineering, database administration, database performance engineering, and most recently, product management, to provide a holistic view and expertise on the database market. Previously, Greg was part of the esteemed Real-World Performance Group at Oracle and was the first member of the product management team at Snowflake Computing.

Presentations

Rethinking data marts in the cloud: Common architectural patterns for analytics Session

Cloud environments will likely play a key role in your business’s future. Henry Robinson and Greg Rahn explore the workload considerations when evaluating the cloud for analytics and discuss common architectural patterns to optimize price and performance.

Meena Ram works in enterprise data strategy and governance in the Chief Data Office at CIBC.

Presentations

Executive panel: Big data use cases around the world Session

Big data and the cloud have spread around the world, and Singapore, New Zealand, Australia, and Canada are already seeing dramatic investments and returns. In a panel moderated by Steve Totman, senior executives from a variety of leading companies, including DBS, CIBC, and Qrious, share use cases, challenges, and how to be successful.

Karthik Ramasamy is the cofounder of Streamlio, a company building next-generation real-time processing engines. Karthik has more than two decades of experience working in parallel databases, big data infrastructure, and networking. Previously, he was engineering manager and technical lead for real-time analytics at Twitter, where he was the cocreator of Heron; cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL (acquired by Twitter); briefly worked on parallel query scheduling at Greenplum (acquired by EMC for more than $300M); and designed and delivered platforms, protocols, databases, and high-availability solutions for network routers at Juniper Networks. He is the author of several patents, publications, and one best-selling book, Network Routing: Algorithms, Protocols, and Architectures. Karthik holds a PhD in computer science from the University of Wisconsin-Madison with a focus on databases, where he worked extensively in parallel database systems, query processing, scale-out technologies, storage engines, and online analytical systems. Several of these research projects were spun out as a company later acquired by Teradata.

Presentations

Exactly once, more than once: Apache Kafka, Heron, and Apache Apex Session

In a series of three 11-minute presentations, key members of Apache Kafka, Heron, and Apache Apex discuss their respective implementations of exactly once delivery and semantics.

Low-latency streaming: Twitter Heron on Infiniband Session

Modern enterprises are data driven and want to move at light speed. To achieve real-time performance, financial applications use streaming infrastructures for low latency and high throughput. Twitter Heron is an open source streaming engine with low latency around 14 ms. Karthik Ramasamy and Supun Kamburugamuvee explain how they ported Heron to Infiniband to achieve latencies as low as 7 ms.

Modern real-time streaming architectures Tutorial

Karthik Ramasamy, Sanjeev Kulkarni, Avrilia Floratau, Ashvin Agrawal, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

Twitter Heron goes exactly once Session

Twitter processes billions of events per day at the instant the data is generated. To achieve real-time performance, Twitter employs Heron, an open source streaming engine tailored for large-scale environments. Karthik Ramasamy presents the techniques Heron uses to implement exactly once semantics and shares Twitter's experience operating it at scale.

Radhika Rangarajan is an engineering director for big data technologies within Intel’s Software and Services Group, where she manages several open source projects and partner engagements, specifically on Apache Spark and machine learning. Radhika is one of the cofounders and the director of the West Coast chapter of Women in Big Data, a grassroots community focused on strengthening the diversity in big data and analytics. Radhika holds both a bachelor’s and a master’s degree in computer science and engineering.

Presentations

Accelerating insight with analytics and AI (sponsored by Intel) Session

Kevin Huiskes and Radhika Rangarajan discuss Intel's strategy to lower barriers to advanced analytics and AI, make results faster and more efficient, and enable data scientists and developers to make better use of existing infrastructure, emphasizing solutions based on the latest Intel Xeon Scalable platform and the open source framework BigDL.

Jun Rao is the cofounder of Confluent, a company that provides a streaming data platform on top of Apache Kafka. Previously, Jun was a senior staff engineer at LinkedIn, where he led the development of Kafka, and a researcher at IBM’s Almaden research data center, where he conducted research on database and distributed systems. Jun is the PMC chair of Apache Kafka and a committer of Apache Cassandra.

Presentations

A deep dive into Apache Kafka core internals Session

Over the last few years, streaming platform Apache Kafka has been used extensively for real-time data collecting, delivering, and processing—particularly in the enterprise. Jun Rao leads a deep dive into some of the key internals that help make Kafka popular and provide strong reliability guarantees.

Exactly once, more than once: Apache Kafka, Heron, and Apache Apex Session

In a series of three 11-minute presentations, key members of Apache Kafka, Heron, and Apache Apex discuss their respective implementations of exactly once delivery and semantics.

Introducing exactly once semantics in Apache Kafka Session

Apache Kafka’s rise in popularity as a streaming platform has demanded a revisit of its traditional at-least-once message delivery semantics. Jun Rao presents the recent additions to Apache Kafka that achieve exactly once semantics, discusses the newly introduced transactional APIs, and uses the Kafka Streams API as an example to show how these APIs are leveraged for stream processing tasks.
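The mechanism at the heart of Kafka's idempotent producer, deduplicating retried sends with per-producer sequence numbers, can be illustrated in a few lines of plain Python. This is a conceptual sketch, not Kafka's actual implementation or API.

```python
class IdempotentBroker:
    """Deduplicate retried sends using per-producer sequence numbers,
    the core idea behind Kafka's idempotent producer."""
    def __init__(self):
        self.log = []       # the committed message log
        self.last_seq = {}  # producer_id -> highest sequence accepted

    def append(self, producer_id, seq, message):
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate retry: acknowledge but don't re-append
        self.log.append(message)
        self.last_seq[producer_id] = seq
        return True

broker = IdempotentBroker()
broker.append("p1", 0, "order-created")
broker.append("p1", 1, "order-paid")
broker.append("p1", 1, "order-paid")  # network retry of the same send
# broker.log == ["order-created", "order-paid"]
```

Kafka layers transactions on top of this so that a batch of writes across partitions commits or aborts atomically, which is what end-to-end exactly once processing requires.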

Sneha Rao is an experienced product owner at Spotify, where she works with big data at scale. Previously, Sneha worked at the New York Times, Comcast/NBCUniversal, and NASA’s data center. She is skilled in database management, big data, analytics, and Python and is currently pursuing an MBA focused on innovation, design, and entrepreneurial studies at New York University’s Leonard N. Stern School of Business.

Presentations

Managing core data entities for internal customers at Spotify Session

Spotify makes data-driven product decisions. As the company grows, the magnitude and complexity of the data it cares for the most is rapidly increasing. Sneha Rao and Joel Östlund walk you through how Spotify stores and exposes audience data created by multiple internal producers within Spotify.

Faraz Rasheed is a senior manager at TD Bank, Canada, where he leads the Enterprise Big Data Analytics team, helping different lines of business build data science solutions on the bank’s big data analytics platform. Faraz holds a PhD in computer science with a focus on machine learning from the University of Calgary. Previously, he was a senior data scientist at BlackBerry. Faraz has also taught data science at Ryerson University and WeCloud Data.

Presentations

Griffin: Fast-tracking model development in Hadoop Session

Steven Totman and Faraz Rasheed offer an overview of Griffin, a high-level, easy-to-use framework built on top of Spark, which encapsulates the complexities of common model development tasks within four phases: data understanding, feature extraction, model development, and serving modeling results.

Pranav Rastogi is a program manager on Microsoft’s Azure HDInsight team. Pranav spends most of his time making it easier for customers to leverage the big data ecosystem to build big data solutions faster.

Presentations

Building big data applications on Azure Tutorial

As big data solutions are rapidly moving to the cloud, it's becoming increasingly important to know how to use Apache Hadoop, Spark, R Server, and other open source technologies in the cloud. Pranav Rastogi walks you through building big data applications on Azure HDInsight and other Azure services.

Extend on-premises Hadoop and Spark deployments across data centers and the cloud, including Microsoft Azure (sponsored by Microsoft and WANdisco) Session

Jagane Sundar and Pranav Rastogi explain how to meet your enterprise SLAs while making full use of resources with patented active data replication technology—something computer science still says is impossible.

Alex Ratner is a third-year PhD student at the Stanford InfoLab working under Chris Re. Alex works on new machine learning paradigms for settings where limited or no hand-labeled training data is available, motivated in particular by information extraction problems in domains like genomics, clinical diagnostics, and political science. He coleads the development of the Snorkel framework for lightweight information extraction.

Presentations

Data programming: Creating large training sets quickly HDS

As data-hungry algorithms become the norm in machine learning, the bottleneck is now acquiring labeled training data. Alex Ratner explores data programming, a paradigm for the programmatic creation of training sets in which users express weak supervision strategies or domain heuristics as simple scripts called labeling functions, which are then automatically denoised.
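A labeling function is just a small script that votes on an example or abstains. The sketch below uses a simple majority vote in place of Snorkel's generative denoising model, and the spam-detection task and function names are hypothetical.

```python
SPAM, HAM, ABSTAIN = 1, 0, -1

# Labeling functions: domain heuristics that vote or abstain
def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_has_greeting(text):
    return HAM if text.lower().startswith("hi") else ABSTAIN

def lf_all_caps_words(text):
    return SPAM if any(w.isupper() and len(w) > 3 for w in text.split()) else ABSTAIN

def label(text, lfs):
    votes = [lf(text) for lf in lfs if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    # Majority vote stands in for the learned denoising model,
    # which would instead weight each function by estimated accuracy.
    return max(set(votes), key=votes.count)

lfs = [lf_mentions_prize, lf_has_greeting, lf_all_caps_words]
examples = ["Hi team, meeting at noon", "CLAIM your prize NOW"]
labels = [label(t, lfs) for t in examples]
```

The resulting (noisy) labels can then train any discriminative model, so the training set scales with heuristics written rather than examples hand-labeled.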

Radhika Ravirala is a solutions architect at Amazon Web Services, where she helps customers craft distributed, robust cloud applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley. Radhika enjoys spending time with her family, walking her dog, doing Warrior X-Fit, and playing an occasional hand at Smash Bros.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application on the cloud? Ryan Nienhuis, Radhika Ravirala, Allan MacInnis, and Ben Snively walk you through building a big data application using a combination of open source technologies and AWS managed services.

José Ribau is the chief data officer at CIBC, where he leads the bank’s data management strategy and advanced analytics functions. José’s team is responsible for governing the use of strategic data and driving transformation of the business through delivery of client segmentation, predictive modeling, and analytics projects—all with a focus on producing insights that help capture growth through product consolidation and increased share of wallet. Previously, José worked in client analytics and product development at CIBC, where he contributed to Canada’s first Visa Debit launch and other client-focused solutions that paved the way for additional revenue streams. Early in his career, José spent several years as a researcher in McMaster University’s Medical Sciences Division while he completed his MS. José also holds an MBA from Queen’s University and a BSc from Wilfrid Laurier University. He enjoys spending quality time with his family and is an avid cyclist and a big Star Wars fan.

Presentations

Fintech, data innovation, and the real world Findata

José Ribau discusses the pragmatic side of data-driven finance—the realities of modern banking—comparing the demands of governance and compliance to the aspirations of fintech startups.

Salema Rice is the chief data officer at Allegis Group, where she is responsible for enterprise-wide data and analytics, including data management technology, big data, enterprise data operations, global master data management, enterprise data governance, business intelligence, enterprise information management, insights, and analytics.

Presentations

Differentiating ourselves with data and analytics DCS

Salema Rice shares how Allegis Group, the largest privately held talent management company in the world, is transforming into a digitally enhanced company, using big data and data science to differentiate itself in the marketplace.

Henry Robinson is a software engineer at Cloudera. For the past few years, he has worked on Apache Impala, an SQL query engine for data stored in Apache Hadoop, and leads the scalability effort to bring Impala to clusters of thousands of nodes. Henry’s main interest is in distributed systems. He is a PMC member for the Apache ZooKeeper, Apache Flume, and Apache Impala open source projects.

Presentations

Rethinking data marts in the cloud: Common architectural patterns for analytics Session

Cloud environments will likely play a key role in your business’s future. Henry Robinson and Greg Rahn explore the workload considerations when evaluating the cloud for analytics and discuss common architectural patterns to optimize price and performance.

Matthew Roche is a senior program manager in Microsoft’s Cloud and Enterprise Group, where he focuses on enterprise information management, crowdsourced metadata, and data source discovery. Matthew currently delivers capabilities in Azure Data Catalog; previously, he worked on Power BI, SQL Server Integration Services, Master Data Services, and Data Quality Services. When not enabling the world to get more value from its data, he enjoys reading, baking, and competitive longsword combat.

Presentations

Building a Rosetta Stone for business data Session

The data-driven business must bridge the language gap between data scientists and business users. Matthew Roche and Jennifer Stevens walk you through building a business glossary that codifies your semantic layer and enables greater conversational fluency between business users and data scientists.

Matthew Rocklin is an open source software developer at Anaconda focusing on efficient computation and parallel computing, primarily within the Python ecosystem. He has contributed to many of the PyData libraries and today works on Dask, a framework for parallel computing. Matthew holds a PhD in computer science from the University of Chicago, where he focused on numerical linear algebra, task scheduling, and computer algebra.

Presentations

Dask: Flexible parallelism in Python for advanced analytics Session

Dask parallelizes Python libraries like NumPy, pandas, and scikit-learn, bringing a popular data science stack to the world of distributed computing. Matthew Rocklin discusses the architecture and current applications of Dask used in the wild and explores computational task scheduling and parallel computing within Python generally.

Meet the Expert with Matthew Rocklin (Anaconda) Meet the Experts

Want to know how to parallelize Python code with Dask? Talk to Matthew.

Scaling Python data analysis Tutorial

The Python data science stack, which includes NumPy, pandas, and scikit-learn, is efficient and intuitive but limited to in-memory data and a single core. Matthew Rocklin and Ben Zaitlen demonstrate how to parallelize and scale your Python workloads to multicore machines and multimachine clusters.
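The kind of parallelism this tutorial covers can be sketched with Dask's `delayed` interface, which turns ordinary Python calls into a lazy task graph and executes it across available cores (a minimal illustration, assuming `dask` is installed; the `clean` and `aggregate` functions are hypothetical stand-ins for real workload steps):

```python
from dask import delayed

@delayed
def clean(x):
    # Stand-in for a per-chunk transformation.
    return x * 2

@delayed
def aggregate(parts):
    # Stand-in for a reduction over the transformed chunks.
    return sum(parts)

# Build the task graph lazily; nothing runs until .compute().
tasks = [clean(i) for i in range(8)]
result = aggregate(tasks).compute()
print(result)  # 56
```

The same graph can later be submitted unchanged to a distributed scheduler, which is what makes the single-machine-to-cluster transition the tutorial describes relatively painless.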

Julie Rodriguez is vice president of product management and user experience at Eagle Investment Systems. An experience designer focusing on user research, analysis, and design for complex systems, Julie has patented her work in data visualizations for MATLAB and publishes industry articles on user experience and data analysis and visualization. She is the coauthor of Visualizing Financial Data, a book about visualization techniques and design principles that includes over 250 visuals depicting quantitative data.

Presentations

Expanding data literacy with data visualizations Session

While the value of data and its role in informing decisions and communications is well known, data can be misinterpreted without visualizations that provide context and accurately represent the underlying numbers. Julie Rodriguez shares new approaches and visual design methods that provide greater perspective on the data.

Dan Roesch is managing director of Roesch & Associates LLC, a business advisory and strategy consulting firm. Dan has over 25 years of experience in the areas of strategy development, technology management, partnership strategy, financial analysis, and business due diligence. He has a proven ability to work with senior leaders to develop practical solutions to complex challenges along with experience developing strategies for large companies and startups. Dan is the founder of news2alpha, a startup developing financial news analytics applications, and serves as an advisory board member for an automotive consumer web startup. Previously, he was director of strategic initiatives and director of advanced technology business development at GM. Dan holds an MBA and a bachelor’s degree in engineering, both from the University of Michigan.

Presentations

Data 101 welcome Data 101

Dan Roesch welcomes you to the Data 101 tutorial.

Data 101 welcome Tutorial

Dan Roesch, managing director of Roesch & Associates LLC, welcomes you to the Data 101 tutorial.

Steve Ross is the director of product management at Cloudera, where he focuses on security across the big data ecosystem, balancing the interests of citizens, data scientists, and IT teams working to get the most out of their data while preserving privacy and complying with the demands of information security and regulations. Previously, at RSA Security and Voltage Security, Steve managed product portfolios now in use by the largest global companies and hundreds of millions of users.

Presentations

GDPR: Getting your data ready for heavy, new EU privacy regulations Session

In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Steve Ross and Mark Donsky outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.

Edgar Ruiz is a solutions engineer at RStudio with a background in deploying enterprise reporting and business intelligence solutions. He is the author of multiple articles and blog posts sharing analytics insights and server infrastructure for data science. Recently, Edgar authored the “Data Science on Spark using sparklyr” cheat sheet.

Presentations

Using R and Spark to analyze data on Amazon S3 Session

With R and sparklyr, a Spark standalone cluster can be used to analyze large datasets found in S3 buckets. Edgar Ruiz walks you through setting up a Spark standalone cluster using EC2 and offers an overview of S3 bucket folder and file setup, connecting R to Spark, the settings needed to read S3 data into Spark, and a data import and wrangle approach.

Philip Russom is the research director for data management at TDWI, where, as an industry analyst, he oversees many of the company’s research-oriented publications, services, and events. A well-known figure in data warehousing, business intelligence, data management, and analytics, Philip has published 550+ research reports, magazine articles, opinion columns, speeches, and webinars. Previously, he was an industry analyst covering BI at Forrester Research and Giga Information Group; ran his own business as an independent industry analyst and BI consultant; was contributing editor to leading IT magazines; and held technical and marketing positions for various database vendors.

Presentations

The data lake: Improving the role of Hadoop in data-driven business management Session

Philip Russom explains how a data lake can improve the role of Hadoop in data-driven business management. With the right end-user tools, a data lake can enable self-service data practices that wring business value from big data and modernize and extend programs for data warehousing, analytics, data integration, and other data-driven solutions.

Derek Ruths is cofounder and chief architect of CAI, a charity focused on bringing the power of data science to social good initiatives. Derek is also an associate professor of computer science at McGill University, the head of R&D at Data Sciences, and the director of the McGill Centre for Social and Cultural Data Science. In these capacities, he works closely with major tech companies, advises governments on technical innovation, teaches executive education programs, and partners with international humanitarian organizations. In his work and research, Derek has been a longtime advocate for the essential role of data science in fostering more equitable, more prosperous, and healthier organizations and societies.

Presentations

Data science for good: Benefit the world and your business at the same time Session

Derek Ruths explains how volunteer efforts, when done the right way, can actually improve a data science team’s culture and productivity—motivating data scientists, sharpening their skills, providing exposure to new challenges, reducing turnover, and creating valuable recruiting opportunities.

Data science is for everyone: Making data science work in low-tech environments DCS

Derek Ruths illustrates how organizations can adopt data-informed practices even in low-tech environments. Drawing on his experience working in and with developing nations, Derek shares several key strategies to significantly improve the odds of success when bringing data-driven practices to seemingly impractical situations.

Kostas Sakellis is the lead and engineering manager of the Apache Spark team at Cloudera. Kostas holds a bachelor’s degree in computer science from the University of Waterloo, Canada.

Presentations

How to successfully run data pipelines in the cloud Session

With its scalable data store, elastic compute, and pay-as-you-go cost model, cloud infrastructure is well-suited for large-scale data engineering workloads. Jennifer Wu, Philip Langdale, and Kostas Sakellis explore the latest cloud technologies, focusing on data engineering workloads, cost, security, and ease-of-use implications for data engineers.

Neelesh Srinivas Salian is a software engineer on the data platform team at Stitch Fix, where he works on the compute infrastructure used by the company’s data scientists. This includes the Spark environment that is used to help make data-driven decisions.

Presentations

Apache Spark in the hands of data scientists Session

Neelesh Srinivas Salian offers an overview of the data platform used by data scientists at Stitch Fix, based on the Spark ecosystem. Neelesh explains the development process and shares some lessons learned along the way.

Majken Sander is a data nerd, business analyst, and solution architect at TimeXtender. Majken has worked with IT, management information, analytics, BI, and DW for 20+ years. Armed with strong analytical expertise, she is keen on “data driven” as a business principle, data science, the IoT, and all other things data.

Presentations

Show me my data, and I’ll tell you who I am. Session

Personal data is increasingly spread across various services globally. But what do companies know about us? And how do we collect that knowledge, get ahold of our own data, and maybe even correct faulty perceptions by putting the right answers out there as a service? Majken Sander explains why we desperately need a personal Discovery Hub: a go-to place for knowledge about ourselves.

Kenneth Sanford is an analytics architect at Dataiku. Ken holds a PhD and can scope work, write code, and explain it on a napkin.

Presentations

(Big) data team productivity: A balancing act (sponsored by Dataiku) Session

Fragmented data science and analytics teams result in duplicate work, poor collaboration, a lack of governance, insufficient adoption at scale, and significant key-man risk. Kenneth Sanford explains how to overcome these challenges and build a centralized analytics practice that empowers data-driven decision making.

Sri Satish is cofounder and CEO of H2O.ai, the builders of H2O. H2O democratizes big data science and makes Hadoop do math for better predictions. Previously, Sri spent time scaling R over big data with researchers at Purdue and Stanford; cofounded Platfora; was the director of engineering at DataStax; served as a partner and performance engineer at the Java multicore startup Azul Systems, where he tinkered with the entire ecosystem of enterprise apps at scale; and worked on a NoSQL trie-based index for semistructured data at in-memory index startup RightOrder. Sri is known for his knack for envisioning killer apps in quickly evolving spaces and assembling stellar teams to productize that vision. He is a regular speaker on the big data, NoSQL, and Java circuit and leaves a trail at @srisatish.

Presentations

Interpretable AI: Not just for regulators Session

Interpreting deep learning and machine learning models is not just another regulatory burden to be overcome. People who use these technologies have the right to trust and understand AI. Patrick Hall and Sri Satish share techniques for interpreting deep learning and machine learning models and telling stories from their results.

Jared Schiffman is the founder of Perch Interactive, a startup intent on revolutionizing the retail environment. Jared has worked at the intersection of design, computer science, and education for over two decades. His work fuses the physical world with the digital world and plays with the relationship between the two; his projects are steeped in metaphor and gesture and emphasize the power of direct experience. Jared is the cofounder of Potion, an interactive design and technology firm located in New York City that Fast Company named one of the top 10 most innovative design companies in 2010, and has taught courses at Parsons School of Design, New York University, and the Gates-funded High Tech High in San Diego. Jared holds a master’s degree in media arts and science from the MIT Media Lab, where he studied with John Maeda in the Aesthetics and Computation Group, and an SB in computer science and engineering from MIT.

Presentations

Retail's panacea: How machine learning is driving product development Session

Karen Moon, Jared Schiffman, Eric Colson, and Catherine Twist explore how the retail industry is embracing data to include consumers in the design and development process, tackling the challenges associated with the wealth of sources and the unstructured nature of the data they handle and process, and explaining how that data is turned into insights that are digestible and actionable.

William Schmarzo is the CTO of Dell EMC’s Big Data practice, where he is responsible for setting the strategy and defining the service line offerings and capabilities for the EMC Consulting Enterprise Information Management and Analytics service line. Bill has more than two decades of experience in data warehousing, BI, and analytics applications. He authored the Business Benefits Analysis methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements and has served on the Data Warehouse Institute’s faculty as the head of the analytic applications curriculum. Previously, Bill was the vice president of analytics at Yahoo, where he was responsible for the development of Yahoo’s Advertiser and Website analytics products, including the delivery of actionable insights through a holistic user experience. Before that, Bill oversaw the Analytic Applications business unit at Business Objects, including the development, marketing, and sales of their industry-defining analytic applications. Bill is the author of Big Data: Understanding How Data Powers Big Business, has written several whitepapers, and coauthored a series of articles on analytic applications with Ralph Kimball. He is a frequent speaker on the use of big data and advanced analytics to power an organization’s key business initiatives. Bill holds a master’s degree in business administration from the University of Iowa and a bachelor of science in mathematics, computer science, and business administration from Coe College.

Presentations

Executive Briefing: Determining the economic value of your data (EvD) Session

Organizations need a process and supporting frameworks to become more effective at leveraging data and analytics to transform their business models. Using the Big Data Business Model Maturity Index as a guide, William Schmarzo demonstrates how to assess business value and implementation feasibility with respect to the monetization potential of an organization’s business use cases.

Schmidt spends the majority of his time working with existing cloud customers as well as with on-premises developers who are moving their MapReduce and related data processing workloads to the cloud. Beyond his Google Cloud focus, he has a deep passion for user interaction modeling, data modeling, and analytical processing of user behaviors, along with development experience in .NET, C, JavaScript, Python, and Java.

Presentations

Emotional arithmetic: A deep dive into how machine learning and big data help you understand customers in real time (sponsored by Google) Session

Doing “algebra” with emotions can lead to new insights about customer behavior. Chad Jennings presents a serverless big data analytics platform that allows you to capture and analyze raw data and train machine learning models that can process text to discern not just the sentiment but also the underlying emotion driving that sentiment.

Jacob Schreiber is a third-year CSE PhD student and IGERT big data fellow at the University of Washington. Jacob is a core developer of the popular Python machine learning package scikit-learn and the author of pomegranate, a probabilistic modeling package for Python.

Presentations

Pomegranate: Flexible probabilistic modeling for Python HDS

Jacob Schreiber offers an overview of pomegranate, a flexible probabilistic modeling package implemented in Cython for speed. Jacob explores the models it supports, such as Bayesian networks and hidden Markov models, and how to easily implement them and explains how the underlying modular implementation unlocks several benefits for the modern data scientist.

Jim Scott is the director of enterprise strategy and architecture at MapR Technologies. Across his career, Jim has held positions running operations, engineering, architecture, and QA teams in the consumer packaged goods, digital advertising, digital mapping, chemical, and pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day, and his work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop. Jim is also the cofounder of the Chicago Hadoop Users Group (CHUG).

Presentations

How to leverage the cloud for business solutions Data 101

The cloud is becoming pervasive, but it isn’t always full of rainbows. Defining a strategy that works for your company or for your use cases is critical to ensuring success. Jim Scott shares use cases that may be best run in the cloud versus on-premises, points out opportunities to optimize cost and operational benefits, and explains how to move your data between locations.

Presentations

Streamlining the data science pipeline with the GPU Data Frame (sponsored by NVIDIA) Session

Jim McHugh is joined by founders of the GPU Open Analytics Initiative (GOAI): Todd Mostak, CEO of MapD; SriSatish Ambati, CEO and cofounder of H2O.ai; and Stan Seibert, director of community innovation at Anaconda. The speakers provide an update on the latest advancements and customer use cases leveraging GOAI.

Jonathan Seidman is a software engineer on the partner engineering team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the IoT as examples, Jonathan Seidman, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Hadoop application architectures Ask Me Anything

Mark Grover, Ted Malaska, Gwen Shapira, and Jonathan Seidman, the authors of Hadoop Application Architectures, share considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

Executive Briefing: Managing successful data projects—Technology selection and team building Session

Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams to effectively leverage these technologies and achieve ROI with your data initiatives.

Nick Selby is a Texas police detective focused on investigating computer fraud and child exploitation and a cybersecurity incident responder. A frequent contributor to newspapers including the Washington Post and New York Times, Nick is also the coauthor of Cyber Survival Manual: From Identity Theft to The Digital Apocalypse and Everything in Between, In Context: Understanding Police Killings of Unarmed Civilians, and Blackhatonomics: Understanding the Economics of Cybercrime and the technical editor of Investigating Internet Crimes: An Introduction to Solving Crimes in Cyberspace.

Presentations

The context of contacts: Seeking root causes of racial disparity in Texas traffic-summons fines DCS

Nick Selby offers an overview of his study on traffic-stop data in Texas, which found evidence that the state targeted low-income residents (a disproportional number of whom are black and Latino) for heightened scrutiny and penalties. The problem is not necessarily an issue of racist cops—which means fixing the criminal justice system isn’t just an issue of addressing racism in uniform.

Phil Sewell is technical director of solutions architecture for Micro Focus Data Security. Phil has over 28 years of experience with global software vendors and today focuses on enterprise data protection solutions. He helps Micro Focus’s strategic customers address the market-changing business factors influencing information technology and security, including cloud, big data, mobility, privacy, corporate and regulatory compliance, HIPAA, and PCI and payments security. Previously, Phil worked in a variety of roles across a wide range of software solutions at Sun Microsystems, Entrust, Oracle, and Forté Software. Phil holds a BMath in applied mathematics and computer science from the University of Waterloo.

Presentations

Protect IoT data and monetize it with analytics (sponsored by Micro Focus Security and Big Data Analytics) Session

Phil Sewell discusses standards, options, and use cases for extracting value and delivering business outcomes from data protected at the data level.

Viral Shah is the cofounder and CEO of Julia Computing and a cocreator of the Julia language, as well as other open source software. Previously, he drove the rearchitecting of the government’s social security systems in India as part of the national ID project, Aadhaar. Viral is the coauthor of Rebooting India.

Presentations

Julia and Spark, better together Session

Spark is a fast and general engine for large-scale data. Julia is a fast and general engine for large-scale compute. Viral Shah and Stefan Karpinski explain how combining Julia's compute and Spark's data processing capabilities makes amazing things possible.

Nikita Shamgunov is CTO at MemSQL. Previously, Nikita was a senior database engineer for Microsoft’s SQL Server. He has been awarded several patents and was a world medalist in ACM programming contests. Nikita holds a BS, MS, and PhD in computer science.

Presentations

Teaching databases to learn in the world of AI (sponsored by MemSQL) Keynote

Nikita Shamgunov discusses the future of databases for fast-learning adaptable applications.

Shaked Shammah is a graduate student at the Hebrew University, where he works under Shai Shalev-Shwartz, and a researcher at Mobileye Research. Shaked’s work focuses on general machine learning and optimization, specifically the theory and practice of deep learning and reinforcement learning.

Presentations

Failures of gradient-based deep learning HDS

Deep learning is amazing, but it sometimes fails miserably, even for very simple, practical problems. Shaked Shammah discusses four types of common problems in which deep learning fails. Some can be solved by using specific approaches to network architecture and loss functions. For others, deep learning is simply not the right way to go.

Tushar Shanbhag is head of data strategy and data products at LinkedIn. Tushar is a seasoned executive with a track record of building high-growth businesses at market-defining companies such as LinkedIn, Cloudera, VMware, and Microsoft. Previously, Tushar was vice president of products and design at Arimo, an Andreessen Horowitz-backed company building data intelligence products using analytics and AI.

Presentations

Taming the ever-evolving compliance beast: Lessons learned at LinkedIn Session

Shirshanka Das and Tushar Shanbhag explore the big data ecosystem at LinkedIn and share its journey to preserve member privacy while providing data democracy. Shirshanka and Tushar focus on three foundational building blocks for scalable data management that can meet data compliance regulations: a central metadata system, an integrated data movement platform, and a unified data access layer.

Gwen Shapira is a system architect at Confluent, where she specializes in building and helping customers implement real-time reliable data-processing pipelines using Apache Kafka. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen is an Oracle Ace Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn’t coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.

Presentations

Architecting a next-generation data platform Tutorial

Using Customer 360 and the IoT as examples, Jonathan Seidman, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

Ask me anything: Hadoop application architectures Ask Me Anything

Mark Grover, Ted Malaska, Gwen Shapira, and Jonathan Seidman, the authors of Hadoop Application Architectures, share considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation.

One cluster does not fit all: Architecture patterns for multicluster Apache Kafka deployments Session

There are many good reasons to run more than one Kafka cluster…and a few bad reasons too. Great architectures are driven by use cases, and multicluster deployments are no exception. Gwen Shapira offers an overview of several use cases, including real-time analytics and payment processing, that may require multicluster solutions, so you can better choose the right architecture for your needs.

The three realities of modern programming: The cloud, microservices, and the explosion of data Session

Gwen Shapira explains how the three realities of modern programming—the explosion of data and data systems, building business processes as microservices instead of monolithic applications, and the rise of the public cloud—affect how developers and companies operate today and why companies across all industries are turning to streaming data and Apache Kafka for mission-critical applications.

Ben Sharma is CEO and cofounder of Zaloni. Ben is a passionate technologist with experience in solutions architecture and service delivery of big data, analytics, and enterprise infrastructure solutions and expertise ranging from development to production deployment in a wide array of technologies, including Hadoop, HBase, databases, virtualization, and storage. He has held technology leadership positions for NetApp, Fujitsu, and others. Ben is the coauthor of Java in Telecommunications and Architecting Data Lakes. He holds two patents.

Presentations

Architect and operationalize your enterprise data lake (sponsored by Zaloni) Session

Envision the next phase of your company’s data future: providing centralized data services for streamlined yet controlled access to data for end users across lines of business. Carlos Matos and Ben Sharma share strategies for developing an enterprise-wide data lake service to drive shared data insights across the organization. Are you ready?

Jeff Shmain is a principal solutions architect at Cloudera. He has 16+ years of financial industry experience with a strong understanding of securities trading, risk, and regulations. Over the last few years, Jeff has worked on various use-case implementations at 8 of the world’s 10 largest investment banks.

Presentations

Unraveling data with Spark using deep learning and other algorithms from machine learning Tutorial

Vartika Singh and Jeff Shmain walk you through various approaches using the machine learning algorithms available in Spark ML to understand and decipher meaningful patterns in real-world data. Vartika and Jeff also demonstrate how to leverage open source deep learning frameworks with Spark to run classification problems on image and text datasets.

Dave Shuman is a subject-matter expert at Cloudera. Dave has an extensive background in business intelligence applications, database architecture, logical and physical database design, and data warehousing. Previously, Dave held a number of roles at Vision Chain, a leading demand signal repository provider enabling retailer and manufacturer collaboration, including chief operations officer; vice president of field operations, responsible for customer success and user adoption; vice president of product, responsible for product strategy and messaging; and director of services. He also served top consumer goods companies such as Kraft Foods, PepsiCo, and General Mills, where he was responsible for implementations; was vice president of operations for enews, an ecommerce company acquired by Barnes and Noble; was executive vice president of management information systems, where he managed software development, operations, and retail analytics; and developed ecommerce applications and business processes used by Barnesandnoble.com, Yahoo, and Excite, pioneering an innovative process for affiliate commerce. He holds an MBA with a concentration in information systems from Temple University and a BA from Earlham College.

Presentations

An open source architecture for the IoT Session

Eclipse IoT is an ecosystem of organizations that are working together to establish an IoT architecture based on open source technologies and standards. Dave Shuman and James Kirkland showcase an end-to-end architecture for the IoT based on open source standards, highlighting Eclipse Kura, an open source stack for gateways and the edge, and Eclipse Kapua, an open source IoT cloud platform.

Fahd Siddiqui is a software engineer at Cloudera, where he’s working on cloud products, such as Cloudera Altus and Cloudera Director. Previously, Fahd worked at Bazaarvoice developing EmoDB, an open source data store built on top of Cassandra. His interests include highly scalable and distributed systems. He holds a master’s degree in computer engineering from the University of Texas at Austin.

Presentations

A deep dive into running data engineering workloads in AWS Tutorial

Jennifer Wu, Paul George, Fahd Siddiqui, and Eugene Fratkin lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud. Along the way, they share AWS infrastructure best practices and explain how data engineering workloads interoperate with data analytic workloads.

Tanvi Singh is the chief analytics officer, CCRO, at Credit Suisse. She leads a global team of 60+ data scientists, data analysts, subject-matter experts, and investigators in Zurich, New York, London, and Singapore. The team delivers multimillion-dollar big data projects with leading Silicon Valley vendors in the RegTech space. Tanvi has 18 years of experience in data science, business intelligence, digital analytics, data platforms, and change and transformation, with a focus on statistics, machine learning, text mining, and visualization. Tanvi holds a master’s degree in software systems from the University of Zurich.

Presentations

A tale of two cafeterias: Focus on the line of business Keynote

Tanvi Singh explores whether long-standing non-internet-based companies possess the evidence-driven culture and platforms required to derive benefit from big data tools and impact their line of business.

From segmentation to personalization for compliance risk monitoring: A segment of one Findata

Tanvi Singh details Credit Suisse's journey to create a singular platform to manage client, employee, and bank risk holistically in a "low code, no code" environment.

Vartika Singh is a solutions architect at Cloudera with over 10 years of experience applying machine learning techniques to big data problems.

Presentations

Securely building deep learning models for digital health data Tutorial

Josh Patterson, Vartika Singh, David Kale, and Tom Hanlon walk you through interactively developing and training deep neural networks to analyze digital health data using the Cloudera Workbench and Deeplearning4j (DL4J). You'll learn how to use the Workbench to rapidly explore real-world clinical data, build data-preparation pipelines, and launch training of neural networks.

Unraveling data with Spark using deep learning and other algorithms from machine learning Tutorial

Vartika Singh and Jeffrey Shmain walk you through various approaches using the machine learning algorithms available in Spark ML to understand and decipher meaningful patterns in real-world data. Vartika and Jeff also demonstrate how to leverage open source deep learning frameworks to run classification problems on image and text datasets leveraging Spark.

Joseph Sirosh is the corporate vice president of the Cloud AI Platform at Microsoft, where he leads the company’s enterprise AI strategy and products such as Azure Machine Learning, Azure Cognitive Services, Azure Search, and Bot Framework. Prior to this role, he was the corporate vice president for Microsoft’s Data Platform. Joseph joined Microsoft from Amazon, where he was most recently the vice president for the Global Inventory Platform, responsible for the science and software behind Amazon’s supply chain and order fulfillment systems, as well as the central Machine Learning Group, which he built and led. Before joining Amazon, Joseph was vice president of research and development at Fair Isaac Corp., where he led R&D projects for DARPA, homeland security, and several government organizations. He is passionate about machine learning and its applications and has been active in the field since 1990. Joseph holds a PhD in computer science from the University of Texas at Austin and a BTech in computer science and engineering from the Indian Institute of Technology Chennai.

Presentations

Will AI help save the snow leopard? (sponsored by Microsoft) Keynote

Join Microsoft’s Joseph Sirosh for a surprising conversation about a volunteer’s dilemma, an engineer’s ingenuity, and how AI, the cloud, data, and devices came together to help save snow leopards.

Ben Snively is a specialist solutions architect on the Amazon Web Services public sector team, where he specializes in big data, analytics, and search. Previously, Ben was an engineer and architect on DoD contracts, where he worked with Hadoop and big data solutions. He has over 11 years of experience creating analytical systems. Ben holds both a bachelor’s and a master’s degree in computer science from the Georgia Institute of Technology and a master’s in computer engineering from the University of Central Florida.

Presentations

Building your first big data application on AWS Tutorial

Want to learn how to use Amazon's big data web services to launch your first big data application on the cloud? Ryan Nienhuis, Radhika Ravirala, Allan MacInnis, and Ben Snively walk you through building a big data application using a combination of open source technologies and AWS managed services.

Serverless big data architectures: Design patterns and best practices (sponsored by AWS) Session

How do you incorporate serverless concepts and technologies into your big data architectures? Ben Snively shares use cases, best practices, and a reference architecture to help you streamline data processing and improve analytics through a combination of cloud and open source serverless technologies.

Siew Choo Soh is managing director and head of core systems technology at DBS Bank, where she is responsible for driving strategy for enterprise-wide technology solutions for core banking, data analytics, finance, risk, compliance, and audit across the 16 countries in which DBS operates, and for driving the technology transformation agenda, which includes leveraging the cloud, big data, machine learning, and Agile methods. She is also a member of the DBS Singapore Management Committee. Siew Choo has led the insourcing of all technology application teams and the transformation of the legacy application stack to modern technology. In 2017, Siew Choo is focusing on building the enterprise data platform and integrating DBS’s acquisition of ANZ’s retail and wealth businesses in five countries in Asia. Previously, Siew Choo spent 19 years at JPMorgan as head of technology for Asia banking, where she headed the Asia equities and Asia transaction banking technology teams based in Japan and Hong Kong and led multiyear technology build-outs to support JPM’s aggressive business expansion in equities and transaction banking across Asia. Siew Choo was an ASEAN pre-university scholar (awarded by the Public Service Commission of Singapore) and attended Victoria Junior College in Singapore. She was the top honors student at the National University of Singapore, where she earned a bachelor of computer science with first-class honors and an MBA. Siew Choo speaks English, Malay, and conversational Chinese.

Presentations

Executive panel: Big data use cases around the world Session

Big data and the cloud have spread around the world, and Singapore, New Zealand, Australia, and Canada are already seeing dramatic investments and returns. In a panel moderated by Steve Totman, senior executives from a variety of leading companies, including DBS, CIBC, and Qrious, share use cases, challenges, and how to be successful.

Audrey Spencer-Alvarado is a business analyst for the Portland Trail Blazers. Audrey and the other members of the business analytics team provide data insights to decision makers across the Trail Blazers and its affiliates. She also leads the team’s Tableau reporting and statistical modeling projects.

Presentations

How the Portland Trail Blazers increase conversion rates with Azure Machine Learning DCS

Professional sports teams generally have very large fan bases, but only a small percentage of fans attend multiple games or purchase season tickets each year. Audrey Spencer-Alvarado explains how better identification of customers enables the Portland Trail Blazers to conduct more targeted campaigns, leading to a higher conversion rate, increased revenue, and an improved customer experience.

Kevin Stallings has more than 25 years of experience providing strategic guidance on database solutions, data warehousing, and ETL design and development to technical peers and business partners. He is an associate director of data architecture at AIG, where he advocates for and leads the execution of a new life, health, and disability data architecture and strategy, including the implementation of a data integration hub and data warehouse that will empower AIG’s business to modernize customer-broker interactions via new portals and mobile apps. Kevin also evangelizes the use of new tools, technologies, design patterns, and databases in and around the Hadoop ecosystem and leads the coordination of all stages of application development, including requirements definition, design, architecture, testing, implementation, support, and enhancement.

Presentations

AIG: Creating a data-driven customer service organization (sponsored by Talend) Session

Kevin Stallings provides an inside look at how AIG executed a technological and cultural transformation that had a powerful impact on business outcomes and bottom-line results and explains how to use these lessons to put enterprise-wide big data preparation and self-service analysis to great use within your organization and dramatically increase customer satisfaction and engagement.

Jennifer Marie Stevens is a principal program manager with Microsoft Azure, where she oversees Microsoft’s approach to metadata management. A constant learner, Jennifer has spent her career taking on new disciplines, including product management, product marketing, engineering, and even a stint as a speechwriter for Microsoft’s top executives.

Presentations

Building a Rosetta Stone for business data Session

The data-driven business must bridge the language gap between data scientists and business users. Matthew Roche and Jennifer Stevens walk you through building a business glossary that codifies your semantic layer and enables greater conversational fluency between business users and data scientists.

Bargava Subramanian is a machine learning engineer based in Bangalore, India. Bargava has 14 years’ experience delivering business analytics solutions to investment banks, entertainment studios, and high-tech companies. He has given talks and conducted numerous workshops on data science, machine learning, deep learning, and optimization in Python and R around the world. He mentors early-stage startups in their data science journey. Bargava holds a master’s degree in statistics from the University of Maryland at College Park. He is an ardent NBA fan.

Presentations

AI-driven next-generation developer tools Session

Bargava Subramanian and Harjinder Mistry explain how machine learning and deep learning techniques are helping Red Hat build smart developer tools that make software developers more efficient.

Jagane Sundar is the CTO at WANdisco. Jagane has extensive big data, cloud, virtualization, and networking experience. He joined WANdisco through its acquisition of AltoStor, a Hadoop-as-a-service platform company. Previously, Jagane was founder and CEO of AltoScale, a Hadoop- and HBase-as-a-platform company acquired by VertiCloud. His experience with Hadoop began as director of Hadoop performance and operability at Yahoo. Jagane’s accomplishments include creating Livebackup, an open source project for KVM VM backup, developing a user mode TCP stack for Precision I/O, developing the NFS and PPP clients and parts of the TCP stack for JavaOS for Sun Microsystems, and creating and selling a 32-bit VxD-based TCP stack for Windows 3.1 to NCD Corporation for inclusion in PC-Xware. Jagane is currently a member of the technical advisory board of VertiCloud. He holds a BE in electronics and communications engineering from Anna University.

Presentations

Extend on-premises Hadoop and Spark deployments across data centers and the cloud, including Microsoft Azure (sponsored by Microsoft and WANdisco) Session

Jagane Sundar and Pranav Rastogi explain how to meet your enterprise SLAs while making full use of resources with patented active data replication technology—something computer science still says is impossible.

Sahaana Suri is a second-year PhD student in the Stanford InfoLab, working with Peter Bailis. Sahaana’s research focuses on building easily accessible data analytics and machine learning systems that scale. She holds a bachelor’s degree in electrical engineering and computer science from the University of California, Berkeley.

Presentations

MacroBase: A search engine for fast data streams Session

Sahaana Suri offers an overview of MacroBase, a new analytics engine from Stanford designed to prioritize the scarcest resource in large-scale, fast-moving data streams: human attention. MacroBase allows reconfigurable, real-time root-cause analyses that have already diagnosed issues in production streams in mobile, data center, and industrial applications.

Ben Szekely is a member of the founding team and vice president of solution engineering at Cambridge Semantics, where he leads an organization of technical and subject-matter experts to architect and deliver Anzo Smart Data Lake solutions and help customers and partners imagine and implement strategies for their enterprise information fabrics. In addition to attending Strata, Ben likes this time of year because the conditions for kiteboarding are ideal and ski season is just around the corner.

Presentations

Launching a breakthrough data lake platform for the enterprise information fabric (sponsored by Cambridge Semantics) Session

Only with a rich and interactive semantic layer can the data and analytics stack deliver true on-demand access to data, answers, and insights, weaving data together from across the enterprise into an information fabric. Ben Szekely shares the capabilities of the newly launched Anzo Smart Data Lake 4.0, the only end-to-end platform for semantic layers based on open standards.

Inbal Tadeski is a data scientist at Anodot, a provider of real-time machine learning anomaly detection and analytics solutions for detection of business incidents. Previously, Inbal was a research engineer at HP Labs, where she specialized in machine learning and data mining. She holds an MSc in computer science with a focus on machine learning from Hebrew University in Jerusalem and a BSc in computer science from Ben Gurion University.

Presentations

A spike in sales is not always good news: On the importance of learning the relationships between time series metrics at scale HDS

Inbal Tadeski demonstrates the importance of identifying relationships between time series metrics so that they can be used for predictions, root cause diagnosis, and more. Inbal shares accurate methods that work at large scale, such as behavioral pattern similarity clustering algorithms, along with strategies for reducing false positives, false negatives, and computational resources and for distinguishing correlation from causation.

Nisha Talagala is CTO and vice president of engineering at Parallel Machines, where she focuses on production machine learning and deep learning solutions from the edge to the cloud. Nisha has more than 15 years of expertise in software development, distributed systems, I/O solutions, persistent memory, and flash. Previously, Nisha was a fellow at SanDisk; a fellow and lead architect at Fusion-io, where she drove innovation in nonvolatile memory, including the industry’s first persistent memory solution; technology lead for server flash at Intel, where she led server platform nonvolatile memory technology development, storage-memory convergence, and technical partner engagements; and CTO of Gear6, where she designed and built clustered computing caches for high-performance I/O environments. Nisha holds 48 patents in distributed systems, networking, storage, performance, and nonvolatile memory. She has authored many technical and research publications and serves on multiple academic and industry conference program committees. Nisha holds a PhD from UC Berkeley, where her research focused on software clustering and distributed storage.

Presentations

The unspoken truths of deploying and scaling ML in production (sponsored by ParallelM) Session

Deploying ML in production is challenging. Nisha Talagala shares solutions and techniques for effectively managing machine learning and deep learning in production with popular analytic engines such as Apache Spark, TensorFlow, and Apache Flink.

David Talby has extensive experience in building and operating web-scale data science and business platforms, as well as building world-class, Agile, distributed teams. Previously, he was with Microsoft’s Bing group, where he led business operations for Bing Shopping in the US and Europe, and at Amazon, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science and master’s degrees in both computer science and business administration.

Presentations

Meet the Expert with David Talby (Pacific AI) Meet the Experts

Got questions on applying machine learning, natural language processing, and deep learning, especially in the domains of healthcare and life sciences? Stop by and meet David.

Natural language understanding at scale with spaCy, Spark ML, and TensorFlow Tutorial

Natural language processing is a key component in many data science systems that must understand or reason about text. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP using spaCy for building annotation pipelines, TensorFlow for training custom machine-learned annotators, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

When models go rogue: Hard-earned lessons about using machine learning in production Session

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Sean Taylor is the manager for the bioinformatics and high-throughput analytics team at Seattle Children’s Research Institute (SCRI), where he manages the support delivery effort for bioinformatics and computational biology solutions for the eight research centers and almost 1,000 researchers at SCRI. Sean led design and development efforts for SCRI’s integrated precision medicine repository and is now expanding the open source approaches and big data technologies to additional centers and cores. Previously, Sean led the initiative to develop and implement a state-of-the-art bioinformatics core resource at SCRI; was a computational biologist at Amgen, customizing and driving usability in a range of end user interfaces and visualization tools while applying analytic code from multiple projects for areas such as immunotherapy and inflammation; and held a postdoc at the Fred Hutchinson Cancer Research Center, where he developed a new ultrasensitive assay to detect rare mitochondrial DNA mutations in cancer and aging. Sean holds a PhD from Yale University and a BS from Brigham Young University.

Presentations

Project Rainier: Saving lives one insight at a time Session

Marc Carlson and Sean Taylor offer an overview of Project Rainier, which leverages the power of HDFS and the Hadoop and Spark ecosystem to help scientists at Seattle Children’s Research Institute quickly find new patterns and generate predictions that they can test later, accelerating important pediatric research and increasing scientific collaboration by highlighting where it is needed most.

Abraham Thomas is the cofounder and chief data officer of Quandl, a company he and cofounder Tammer Kamel created with the goal of making it easy for anyone to find and use high-quality data effectively in their professional decision making. Previously, Abraham was a portfolio manager and head of US bond trading at Simplex Asset Management, a multi-billion-dollar hedge fund group with offices in Tokyo, Hong Kong, and Princeton. He holds a degree from IIT Bombay.

Presentations

Oh buoy! How data science improves shipping intelligence for hedge funds Findata

Abraham Thomas demonstrates how maritime data can be used to predict physical commodity flows, in a case study that covers every stage of the data lifecycle, from raw data acquisition, data cleansing and structuring, and machine learning and probabilistic modeling to conversion to tractable format, packaging for final audience, and commercialization and distribution.

Alex Thomas is a data scientist at Indeed. He has used natural language processing (NLP) and machine learning with clinical data, identity data, and now employer and jobseeker data. He has worked with Apache Spark since version 0.9 and with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Natural language understanding at scale with spaCy, Spark ML, and TensorFlow Tutorial

Natural language processing is a key component in many data science systems that must understand or reason about text. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP using spaCy for building annotation pipelines, TensorFlow for training custom machine-learned annotators, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.

New York-based artist and educator Jer Thorp is an adjunct professor in New York University’s ITP program and a member of the World Economic Forum’s Network on AI, IoT, and the Future of Trust. Previously, Jer ran the Office for Creative Research, a multidisciplinary research group exploring new modes of engagement with data, and served as the data artist in residence at the New York Times. Jer is a former geneticist, and his digital art practice explores the many-folded boundaries between science and art. Jer’s award-winning software-based work has been exhibited in Europe, Asia, North America, and South America, including in the Museum of Modern Art in Manhattan, and has recently been featured by the Guardian, Scientific American, the New Yorker, and Popular Science. Jer is a National Geographic fellow and in 2015 was named one of Canada’s greatest explorers by Canadian Geographic.

Robin Thottungal is the EPA’s first chief data scientist, focused on creating and implementing an agency-wide vision for analytics that supports effective data-driven decision making. Previously, at Deloitte Consulting, Robin advised clients on different aspects of creating a culture of using data within their organizations. Robin has served as a selection panelist for the American Academy of Sciences Hellman Fellowship in Science and Technology Policy for the past two years and as vice chair of the IEEE’s Washington, DC, section.

Presentations

The US EPA: Digital transformation through data science Keynote

Data science is key to addressing national challenges with greater agility. At the EPA, the prime challenge is to provide the best value to American citizens in an ever-changing world. Robin Thottungal explains how the EPA addresses this challenge through digital and analytical services.

Richard Tibbetts is CEO of Empirical Systems, an MIT spinout building an AI-based data platform that provides decision support to organizations that use structured data. Previously, he was founder and CTO at StreamBase, a CEP company that merged with TIBCO in 2013, as well as a visiting scientist at the Probabilistic Computing Project at MIT.

Presentations

AI for business analytics Session

Businesses have spent decades trying to make better decisions by collecting and analyzing structured data. New AI technologies are beginning to transform this process. Richard Tibbetts explores AI that guides business analysts to ask statistically sensible questions and lets junior data scientists answer questions in minutes that previously took trained statisticians hours.

Steven Totman is Cloudera’s big data subject-matter expert, helping companies monetize their big data assets using Cloudera’s Enterprise Data Hub. Steve works with over 180 customers worldwide across verticals, advising on architectures, data management tools, data models, and ethical data usage. Previously, Steve ran strategy for a mainframe-to-Hadoop company and drove product strategy at IBM for DataStage and Information Server after joining through the Ascential acquisition. He architected IBM’s Infosphere product suite and led the design and creation of governance and metadata products such as Business Glossary and Metadata Workbench. Steve holds several patents in data integration and governance- and metadata-related designs. Although he is based in NYC, Steve is happiest onsite with customers wherever they may be in the world.

Presentations

Executive panel: Big data use cases around the world Session

Big data and the cloud have spread around the world, and Singapore, New Zealand, Australia, and Canada are already seeing dramatic investments and returns. In a panel moderated by Steve Totman, senior executives from a variety of leading companies, including DBS, CIBC, and Qrious, share use cases, challenges, and how to be successful.

Griffin: Fast-tracking model development in Hadoop Session

Steven Totman and Faraz Rasheed offer an overview of Griffin, a high-level, easy-to-use framework built on top of Spark, which encapsulates the complexities of common model development tasks within four phases: data understanding, feature extraction, model development, and serving modeling results.

Michelle Tower is the associate director of business intelligence and analytics at Procter & Gamble (P&G), where she leads a global team and a range of BI service verticals covering e-business analytics, corporate performance insights, and the big data and advanced analytics platform as a service. Previously, Michelle was the associate director of customer business intelligence at P&G, where she fully leveraged retail customer data streams and internal data streams for advanced intelligence and business insights. She also held management positions at P&G in Latin America supply chain systems and in IT technologies services. Michelle began her career at P&G as a supply chain leader. She holds a bachelor’s degree in business administration and management information systems from Bowling Green State University.

Presentations

How visual analytics drove data asset success at Procter & Gamble (sponsored by Arcadia Data) Session

The early stages of delivering on your data strategies are daunting. With many claims of failed data lakes or “data swamps,” the journey seems risky, which is why you need help from industry experts to get going. Michelle Tower explains how P&G is using big data, Apache Hadoop, and visual analytics to quickly discover new insights and optimize data models for analytics and data visualization.

DB Tsai is a senior research engineer working on personalized recommendation algorithms at Netflix. He’s also a member of and committer for the Apache Spark Project Management Committee (PMC). DB has implemented several algorithms, including linear regression and binary/multinomial logistic regression with elastic net (L1/L2) regularization using LBFGS/OWL-QN optimizers in Apache Spark. Previously, he was a lead machine learning engineer at Alpine Data Labs, where he led a team to develop innovative large-scale distributed learning algorithms and contributed back to the open source Apache Spark project. DB was a PhD candidate in applied physics at Stanford University. He holds a master’s degree in electrical engineering from Stanford University.

Presentations

Boosting Spark MLlib performance with rich optimization algorithms Session

Recent developments in Spark MLlib have given users the power to express a wider class of ML models and decrease model training times via the use of custom parameter optimization algorithms. Seth Hendrickson and DB Tsai explain when and how to use this new API and walk you through creating your own Spark ML optimizer. Along the way, they also share performance benefits and real-world use cases.

Catherine Twist is the chief digital officer of Xcel Brands (which owns and manages the Isaac Mizrahi, Halston, Judith Ripka, and C. Wonder trademarks), where she is building an integrated data science, trend analytics, and technology platform to support Xcel’s fashion businesses across design, merchandising, marketing, and product development. Previously, Kate was chief marketing officer at Xcel, where she led marketing and PR for all brands; worked at Estee Lauder Companies, where she developed global marketing strategy for the Clinique brand; held operating roles at the fashion brands Bill Blass and Marc Jacobs International; and, earlier in her career, worked in investment banking and served as strategic advisor to top-tier luxury branded products, retail, and licensing companies. She holds a BA from the University of Pennsylvania and an MBA from Harvard Business School.

Presentations

Retail's panacea: How machine learning is driving product development Session

Karen Moon, Jared Schiffman, Eric Colson, and Catherine Twist explore how the retail industry is embracing data to include consumers in the design and development process, tackling the challenges associated with the wealth of sources and the unstructured nature of the data they handle and process and how the data is turned into insights that are digestible and actionable.

Madeleine Udell is an assistant professor of operations research and information engineering and the Richard and Sybil Smith Sesquicentennial Fellow at Cornell University, where she studies optimization and machine learning for large-scale data analysis and control, with applications in marketing, demographic modeling, medical informatics, and engineering system design. Her recent work on generalized low-rank models (GLRMs) extends principal components analysis (PCA) to embed tabular datasets with heterogeneous (numerical, Boolean, categorical, and ordinal) types into a low-dimensional space, providing a coherent framework for compressing, denoising, and imputing missing entries. Madeleine has developed a number of open source libraries for modeling and solving optimization problems, including Convex.jl, one of the top 10 tools in the new Julia language for technical computing, and is a member of the JuliaOpt organization, which curates high-quality optimization software. Previously, she was a postdoctoral fellow at Caltech’s Center for the Mathematics of Information, hosted by Joel Tropp. Madeleine holds a PhD in computational and mathematical engineering from Stanford University, where she studied under the supervision of Stephen Boyd, was awarded an NSF graduate fellowship, a Gabilan graduate fellowship, and a Gerald J. Lieberman fellowship, and was selected as the doctoral student member of Stanford’s School of Engineering Future Committee, which developed a road map for the future of engineering at Stanford over the next 10–20 years. She also holds a BS in mathematics and physics, summa cum laude with honors, from Yale University.

Presentations

Filling in missing data with generalized low-rank models HDS

Madeleine Udell explains how to fill in missing data with generalized low-rank models.

Michelle Ufford leads a team focused on data engineering innovation and centralized solutions at Netflix. Previously, she led the data management team at GoDaddy, where she built data engineering solutions for personalization and helped pioneer Hadoop data warehousing techniques. Michelle is a published author, patented developer, award-winning open source contributor, and Most Valuable Professional (MVP) for Microsoft Data Platform. You can find her on Twitter at @MichelleUfford.

Presentations

Working smarter, not harder: Driving data engineering efficiency at Netflix Session

What if we used the wealth of data and experience at our disposal to drive improvements in data engineering? Michelle Ufford explains how Netflix is using data to find common patterns among the chaos that enable the company to automate repetitive and time-consuming tasks and discover ways to improve data quality, reduce costs, and quickly identify and respond to issues.

Amy Unruh is a developer programs engineer for the Google Cloud Platform, with a focus on machine learning and data analytics as well as other Cloud Platform technologies. Amy has an academic background in CS/AI and has also worked at several startups, done industrial R&D, and published a book on App Engine.

Presentations

Getting started with TensorFlow Tutorial

Yufeng Guo and Amy Unruh walk you through training and deploying a machine learning system using TensorFlow, a popular open source library. Yufeng and Amy take you from a conceptual overview all the way to building complex classifiers and explain how you can apply deep learning to complex problems in science and industry.

Manuela M. Veloso is the Herbert A. Simon University Professor in the School of Computer Science at Carnegie Mellon University, where she is the head of the Machine Learning Department. Manuela’s research, undertaken with her students, focuses on artificial intelligence, particularly for a variety of autonomous robots, including mobile service robots and soccer robots. She is a fellow of the ACM, IEEE, AAAS, and AAAI and the author of numerous publications.

Presentations

Human-AI interaction: Autonomous service robots Keynote

Manuela Veloso explores human-AI collaboration, particularly robots learning from human sources and robots generating explanations in response to language-based requests about their autonomous experience. Manuela concludes with a further discussion of general human-AI interaction and the opportunities for building transparency and trust in AI systems.

Ashish Verma is a managing director at Deloitte, where he leads the Big Data and IoT Analytics practice, building offerings and accelerators to enhance business processes and effectiveness. Ashish has more than 18 years of management consulting experience helping Fortune 100 companies build solutions that focus on addressing complex business problems related to realizing the value of information assets within an enterprise.

Presentations

Executive Briefing: From data insights to action—Developing a data-driven company culture Session

Ashish Verma explores the challenges organizations face after investing in hardware and software to power their analytics projects and the missteps that lead to inadequate data practices. Ashish explains how to course-correct and implement an insight-driven organization (IDO) framework that enables you to derive tangible value from your data faster.

Dean Wampler is the vice president of fast data engineering at Lightbend, where he leads the creation of the Lightbend Fast Data Platform, a streaming data platform built on the Lightbend Reactive Platform, Kafka, Spark, Flink, and Mesosphere DC/OS. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He is a contributor to several open source projects. He’s also the co-organizer of several conferences around the world and several user groups in Chicago.

Presentations

Exactly once, more than once: Apache Kafka, Heron, and Apache Apex Session

In a series of three 11-minute presentations, key contributors to Apache Kafka, Heron, and Apache Apex discuss their projects' respective implementations of exactly-once delivery and semantics.

Meet the Expert with Dean Wampler (Lightbend) Meet the Experts

Join Dean to discuss all things streaming, especially with Kafka, Spark, Flink, Akka Streams, and Kafka Streams—from the future of machine learning in a streaming context to integrating stream processing with microservices.

Stream all the things! Session

While stream processing is now popular, streaming architectures must be more reliable and scalable than ever before—more like microservice architectures, in fact. Dean Wampler defines "stream" based on the characteristics such systems require, using specific tools like Kafka, Spark, Flink, and Akka as examples, and argues that big data and microservices architectures are converging.

Peter Wang is the cofounder and CTO of Anaconda, where he leads the product engineering team for the Anaconda platform and open source projects including Bokeh and Blaze. Peter has been developing commercial scientific computing and visualization software for over 15 years and has software design and development experience across a broad variety of areas, including 3D graphics, geophysics, financial risk modeling, large data simulation and visualization, and medical imaging. As a creator of the PyData conference, he also devotes time and energy to growing the Python data community by advocating, teaching, and speaking about Python at conferences worldwide. Peter holds a BA in physics from Cornell University.

Presentations

Data science beyond the sandbox (sponsored by Anaconda) Session

Peter Wang explores the typical problems data science teams experience when working with other teams and explains how these issues can be overcome through cohesive collaborative efforts among data scientists, business analysts, IT teams, and more.

Melanie Warrick is a senior developer advocate at Google with a passion for machine learning problems at scale. Melanie’s previous experience includes work as a founding engineer on Deeplearning4j and as a data scientist and engineer at Change.org.

Presentations

Artificial intelligence Data 101

Melanie Warrick explores the definition of artificial intelligence and seeks to clarify what AI will mean for our world. Melanie summarizes AI’s most important effects to date and demystifies the changes we’ll see in the immediate future, separating myth from realistic expectation.

Ryan Weil is chief scientist in the Health Products and Solutions Group at Leidos. Ryan has nearly 20 years of experience in analytics and bioinformatics. Previously, he served as the program manager supporting the bioinformatics and data analytics effort of the CDC's Office of Infectious Diseases. Ryan holds a BS in microbiology from Texas A&M University in College Station and a PhD in molecular biophysics from UT Southwestern Medical Center in Dallas.

Presentations

Tracking the opioid-fueled HIV outbreak with big data (sponsored by Trifacta) Session

Ells Campbell, Connor Carreras, and Ryan Weil explain how the Microbial Transmission Network Team (MTNT) at the Centers for Disease Control and Prevention (CDC) is leveraging new techniques in data collection, preparation, and visualization to advance the understanding of the spread of HIV/AIDS.

Brooke Wenig is a consultant for Databricks and a teaching associate at UCLA, where she has taught graduate machine learning, senior software engineering, and introductory programming courses. Previously, Brooke worked at Splunk and Under Armour as a KPCB fellow. She holds an MS in computer science with highest honors from UCLA with a focus on distributed machine learning. Brooke speaks Mandarin Chinese fluently and enjoys cycling.

Presentations

Spark camp: Apache Spark 2.0 for analytics and text mining with Spark ML Tutorial

Brooke Wenig introduces you to Apache Spark 2.0 core concepts with a focus on Spark's machine learning library, using text mining on real-world data as the primary end-to-end use case.

Kyle Wild is the CEO at Keen IO, which he cofounded as a part of the first class of Techstars Cloud, and sits on the company’s board of directors. Previously, he held positions in product management, software engineering, game design, and distributed systems scalability, working in the areas of API design, brand marketing, developer community evangelism, finance, organizational design, and recruiting. Kyle is a small-time angel investor and startup advisor and has spearheaded several rounds of financing from angel investors, seed funds, and large venture capital funds. He holds a BS in general engineering from the University of Illinois at Urbana-Champaign and knows quite a bit about analytics.

Presentations

Accelerating the next generation of data companies Session

This panel brings together partners from some of the world’s leading startup accelerators and founders of up-and-coming enterprise data startups to discuss how we can help create the next generation of successful enterprise data companies.

Edd Wilder-James is a strategist at Google, where he is helping build a strong and vital open source community around TensorFlow. A technology analyst, writer, and entrepreneur based in California, Edd previously helped transform businesses with data as vice president of strategy for Silicon Valley Data Science. Formerly Edd Dumbill, Edd was the founding program chair for the O’Reilly Strata conferences and chaired the Open Source Convention for six years. He was also the founding editor of the peer-reviewed journal Big Data. A startup veteran, Edd was the founder and creator of the Expectnation conference-management system and a cofounder of the Pharmalicensing.com online intellectual-property exchange. An advocate and contributor to open source software, Edd has contributed to various projects such as Debian and GNOME and created the DOAP vocabulary for describing software projects. Edd has written four books, including O’Reilly’s Learning Rails.

Presentations

Executive Briefing: Preparing your infrastructure for AI Session

Edd Wilder-James outlines a road map for executives who are beginning to consider their strategies for implementing artificial intelligence in their critical processes.

The business case for AI, Spark, and friends Data 101

AI is white-hot at the moment, but where can it really be used? Developers are usually the first to understand why some technologies cause more excitement than others. Edd Wilder-James relates this insider knowledge, providing a tour through the hottest emerging data technologies of 2017 to explain why they’re exciting in terms of both new capabilities and the new economies they bring.

Matt Winkler is a principal group program manager in the Data Group at Microsoft, where he leads a program management team building services and tools for developers to build intelligent apps using cognitive APIs, the Bot Framework, and the Cortana Intelligence Suite. Matt has worked at Microsoft for the last 10 years as an evangelist and a program manager, working on the .NET Framework, Visual Studio, and Azure Web Sites. As part of the Microsoft big data team, Matt led a PM team building HDInsight, Microsoft’s managed Hadoop and Spark service, and Azure Data Lake Analytics. Matt holds a BS in mathematics and computer science from Denison University and an MBA from Washington University in St. Louis. In his free time, Matt enjoys skiing, hiking, and woodworking.

Presentations

Deploying to the edge, bringing AI everywhere (sponsored by Microsoft) Session

Matt Winkler shares real-world case studies on how healthcare, agriculture, and manufacturing companies are creating, training, deploying, and managing AI models faster with Microsoft Azure and deploying them to the cloud, on-premises, and to the edge.

Rose Winterton is a product director at Pitney Bowes, where she leads the product direction for the company’s location intelligence products and solutions, with a recent focus on the use of spatial processing in big data environments. Rose has 15 years’ experience in location intelligence and wide-ranging firsthand customer experience in EMEA and the US, covering the telecommunications, insurance, public sector, geosciences, and retail vertical markets. Previously, Rose worked on developing customer solutions as a senior consultant before moving into management. Rose studied GIS and remote sensing at University College London and geology at Oxford University.

Presentations

Benefits of big data geoenrichment for better business outcomes DCS

Geoenrichment uses a location-based key to manage data and provide a single view of a location. Rose Winterton explains how Pitney Bowes's Spectrum Technology Platform for big data allows fast processing of location-based data for address validation, geoenrichment, analysis, and integration with operational processes for more accurate decision making and better business outcomes.

Ian Wrigley is the technology evangelist at StreamSets, the company behind the industry’s first data operations platform. Over his 25-year career, Ian has taught tens of thousands of students subjects ranging from C programming to Hadoop development and administration.

Presentations

Building real-time data pipelines with Apache Kafka Tutorial

Ian Wrigley demonstrates how Kafka Connect and Kafka Streams can be used together to build real-world, real-time streaming data pipelines. Using Kafka Connect, you'll ingest data from a relational database into Kafka topics as the data is being generated and then process and enrich the data in real time using Kafka Streams before writing it out for further analysis.
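
The ingest side of such a pipeline is configured declaratively. Below is a minimal sketch of a Kafka Connect source configuration, assuming Confluent's open source JDBC source connector; the connection URL, table, column, and topic prefix are made-up examples.

```json
{
  "name": "orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db.example.com:5432/shop",
    "table.whitelist": "orders",
    "mode": "incrementing",
    "incrementing.column.name": "order_id",
    "topic.prefix": "db-"
  }
}
```

In `incrementing` mode, the connector polls the table and publishes each new row (tracked by `order_id`) to the `db-orders` topic, where a Kafka Streams application can pick the records up for enrichment before writing them out for analysis.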

Bichen Wu is a PhD student at UC Berkeley. His research focuses on deep learning, computer vision, and autonomous driving.

Presentations

Efficient neural networks for perception for autonomous vehicles HDS

Bichen Wu explores perception tasks for autonomous driving and explains how to design efficient neural networks to address critical issues such as latency, energy efficiency, and model size.

Jennifer Wu is director of product management for cloud at Cloudera, where she focuses on cloud strategy and solutions. Previously, Jennifer worked as a product line manager at VMware, working on the vSphere and Photon system management platforms.

Presentations

A deep dive into running data engineering workloads in AWS Tutorial

Jennifer Wu, Paul George, Fahd Siddiqui, and Eugene Fratkin lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud. Along the way, they share AWS infrastructure best practices and explain how data engineering workloads interoperate with data analytic workloads.

How to successfully run data pipelines in the cloud Session

With its scalable data store, elastic compute, and pay-as-you-go cost model, cloud infrastructure is well-suited for large-scale data engineering workloads. Jennifer Wu, Philip Langdale, and Kostas Sakellis explore the latest cloud technologies, focusing on data engineering workloads, cost, security, and ease-of-use implications for data engineers.

Stephen Wu is a senior program manager for big data at Microsoft.

Presentations

Performance tuning your Hadoop/Spark clusters to use cloud storage Session

Remote storage in the cloud provides an infinitely scalable, cost-effective, and performant solution for big data customers. Adoption is rapid, driven by the flexibility and cost savings that come from separating compute and storage and gaining unlimited storage capacity. Stephen Wu demonstrates how to correctly performance-tune your workloads when your data lives in remote cloud storage.

Wei Yan is a senior engineer at Uber, where he builds data processing and querying systems that scale along with Uber’s hypergrowth.

Presentations

Geospatial big data analysis at Uber Session

Uber's geospatial data is increasing exponentially as the company grows. As a result, its big data systems must also grow in scalability, reliability, and performance to support business decisions, user recommendations, and experiments for geospatial data. Zhenxiao Luo and Wei Yan explain how Uber runs geospatial analysis efficiently in its big data systems, including Hadoop, Hive, and Presto.

Yan Yan is an engineer at LinkedIn, where he works on the Voldemort and Venice team within the company’s data infrastructure organization. He has extensive experience working on cluster management, Zookeeper, Helix, and distributed systems in general.

Presentations

Introducing Venice: A derived datastore for batch, streaming, and lambda architectures Session

Companies with batch and stream processing pipelines need to serve the insights they glean back to their users, an often-overlooked problem that can be hard to achieve reliably and at scale. Felix GV and Yan Yan offer an overview of Venice, a new data store capable of ingesting data from Hadoop and Kafka, merging it together, replicating it globally, and serving it online at low latency.

Fangjin Yang is a coauthor of the open source Druid project and a cofounder of Imply, a data analytics startup based in San Francisco. Previously, Fangjin held senior engineering positions at Metamarkets and Cisco Systems. Fangjin has a BASc in electrical engineering and an MASc in computer engineering from the University of Waterloo, Canada.

Presentations

Analytics at Wikipedia Session

The Wikimedia Foundation (WMF) is a nonprofit charitable organization. As the parent organization of Wikipedia, one of the most visited websites in the world, WMF faces many unique challenges around its ecosystem of editors, readers, and content. Andrew Otto and Fangjin Yang explain how the WMF does analytics and offer an overview of the technology it uses to do so.

Han Yang is a technical marketing manager and product manager at Cisco, where he helps drive Cisco’s hybrid cloud strategy and its UCS big data and analytics solutions. Previously, as a product manager, Han led Cisco’s largest switching beta, with over 20,000 customers, for the Nexus 1000V software virtual switch. He holds a PhD in electrical engineering from Stanford University.

Presentations

Building the IoT data lifecycle (sponsored by Cisco) Session

For many enterprises, the internet of things represents an opportunity to transform the business by examining its data from a holistic lifecycle perspective and generating, analyzing, and archiving the data to reengineer the enterprise. Han Yang explores the latest trends and the role of infrastructure in enabling such a transformation.

Yuhao Yang is a software engineer at Intel, where he provides implementation, consulting, and tuning advice on the Hadoop ecosystem to industry partners. Yuhao’s area of focus is distributed machine learning, especially large-scale analytical applications and infrastructure on Spark. He’s also an active contributor to Spark MLlib (50+ patches), having delivered the implementations of online LDA, QR decomposition, and several of Spark’s feature engineering transformers as well as improvements to a number of important algorithms.

Presentations

Building advanced analytics and deep learning on Apache Spark with BigDL Session

Yuhao Yang and Zhichao Li discuss building end-to-end analytics and deep learning applications, such as speech recognition and object detection, on top of BigDL and Spark and explore recent developments in BigDL, including Python APIs, notebook and TensorBoard support, TensorFlow model R/W support, better recurrent and recursive net support, and 3D image convolutions.

Chuck Yarbrough is the senior director of solutions marketing and management at Pentaho, a leading big data analytics company that helps organizations engineer big data connections, blend data, and report on and visualize all of their data. Chuck is responsible for creating and driving Pentaho solutions that leverage the Pentaho platform, enabling customers to implement big data solutions more quickly and achieve greater ROI. Chuck has more than 20 years of experience helping organizations use technology to their advantage, ensuring they can run, manage, and transform their business through better use of data. A lifelong participant in the data game, Chuck has held leadership roles at Deloitte Consulting, SAP Business Objects, Hyperion, and National Semiconductor.

Presentations

The converging world of big data and the IoT (sponsored by Pentaho) Session

The IoT can deliver real outcomes that can transform organizations—and societies—for the better. But the IoT is not transformative without the power of big data. Chuck Yarbrough shares examples of where the IoT and big data have combined to solve significant business challenges and take advantage of business opportunities.

Lucy Yu is an engineer at MemSQL. Lucy holds a degree in computer science and a master of engineering from MIT, where, under Matei Zaharia, she worked on implementing an experimental framework for work sharing in Spark.

Presentations

Exploring real-time capabilities with Spark SQL Session

Lucy Yu demonstrates how to extend the Spark SQL abstraction to support more complex pushdown, such as group by, subqueries, and joins.
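
For context on why pushdown matters, here is a toy sketch of the idea, with SQLite standing in for an external data source (the table, columns, and data are made up): without pushdown the engine pulls every row and aggregates itself; with pushdown the source runs the GROUP BY and returns only one row per group.

```python
import sqlite3

# Toy illustration of aggregate pushdown; SQLite stands in for an
# external data source behind a Spark SQL-style engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("east", 10.0), ("east", 5.0), ("west", 7.0)])

# Without pushdown: every row crosses the wire; the engine aggregates.
naive = {}
for region, amount in conn.execute("SELECT region, amount FROM orders"):
    naive[region] = naive.get(region, 0.0) + amount

# With pushdown: the source computes the GROUP BY; only one row per
# group crosses the wire.
pushed = dict(conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"))

# Both strategies agree on the result; only the data movement differs.
assert naive == pushed
```

The same reasoning extends to the subqueries and joins covered in the session: the more of the query plan that can be executed at the source, the less data the engine has to move and process.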

Matei Zaharia is an assistant professor in the Computer Science Department at Stanford, where he works on computer systems and big data.

Presentations

Weld: Accelerating data science by 100x Session

Modern data applications combine functions from many optimized libraries (e.g., pandas and TensorFlow) and yet do not achieve peak hardware performance due to data movement across functions. Shoumik Palkar and Matei Zaharia offer an overview of Weld, a new interface to implement functions in these libraries while enabling optimizations across them.

Ben Zaitlen is a data scientist and developer at Anaconda. He has several years of experience with Python and is passionate about any and all forms of data. Currently, he spends his time thinking about the usability of large data systems and about infrastructure problems as they relate to data management and analysis.

Presentations

Scaling Python data analysis Tutorial

The Python data science stack, which includes NumPy, pandas, and scikit-learn, is efficient and intuitive but only for in-memory data and a single core. Matthew Rocklin and Ben Zaitlen demonstrate how to parallelize and scale your Python workloads to multicore machines and multimachine clusters.
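
The tutorial itself uses Dask; as a stdlib-only sketch of the underlying pattern, the snippet below partitions a computation into chunks and hands them to a worker pool. Threads are used here only to keep the example self-contained; Dask schedules such task graphs across processes and whole clusters, which is what actually escapes the single-core, in-memory limit.

```python
from concurrent.futures import ThreadPoolExecutor
import math

# Chunk-and-pool sketch of parallelizing an in-memory computation.
# Dask builds and schedules task graphs like this one automatically.

def chunk_sum(chunk):
    """Partial result for one chunk of the data."""
    return sum(math.sqrt(x) for x in chunk)

def parallel_sqrt_sum(data, n_workers=4):
    """Split data into chunks, compute partials in a pool, combine."""
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(chunk_sum, chunks))

data = list(range(10_000))
total = parallel_sqrt_sum(data)
```

The same split/apply/combine structure carries over directly to dataframe and array workloads once the scheduler runs the chunks on separate cores or machines.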

Tristan Zajonc is a senior engineering manager at Cloudera. Previously, he was cofounder and CEO of Sense, a visiting fellow at Harvard’s Institute for Quantitative Social Science, and a consultant at the World Bank. Tristan holds a PhD in public policy and an MPA in international development from Harvard and a BA in economics from Pomona College.

Presentations

Data science at team scale: Considerations for sharing, collaborating, and getting to production Session

Data science alone is easy. Data science with others, whether in the enterprise or on shared distributed systems, requires a bit more work. Tristan Zajonc and Thomas Dinsmore discuss common technology considerations and patterns for collaboration in large teams and for moving machine learning into production at scale.