20–23 April 2020

Speakers

Hear from innovative programmers, talented managers, and senior executives who are doing amazing things with data and AI. More speakers will be announced; please check back for updates.


Nutsa Abazadze is a senior data scientist at TBC Bank. Her main responsibilities include analyzing and modeling customer data from different perspectives and delivering high-quality reports to decision makers. Before joining TBC Bank, Nutsa worked in the market research industry, applying machine learning models to survey data. Nutsa holds a master’s degree in survey statistics from the University of Bamberg, Germany. As part of her master’s thesis, she developed an internal R package for questionnaire modularization for one of the largest market research companies.

Presentations

How a Failed Machine Learning Exercise Increased Deposit Profitability by 20% Session

We will tell you how our failed attempt to build an ML model led us to discover institutional problems and kicked off improvements to existing business processes so that we would collect quality data for future modeling, and how we still managed to increase deposit profitability by 20% in the process.

Tatiana Al-Chueyr Martins is a senior data engineer on the Datalab team at the BBC, where she contributes to the development of recommendation systems and connects data across the organization. She’s passionate about open source, data-driven problem solving, and Python. Tatiana has over 15 years of experience developing software applications, including three-dimensional image processing and an educational platform.

Presentations

Taming recommendation systems workflows with Apache Airflow Session

During the last year, BBC's Datalab team adopted Apache Airflow to improve its recommendation model lifecycle and data processing pipeline. Tatiana Al-Chueyr Martins shares insights and practical examples, achievements, and challenges. You'll leave empowered to decide when to use Airflow.

Alasdair Allan is a director at Babilim Light Industries and a scientist, author, hacker, maker, and journalist. An expert on the internet of things and sensor systems, he’s famous for hacking hotel radios, deploying mesh networked sensors through the Moscone Center during Google I/O, and for being behind one of the first big mobile privacy scandals when, back in 2011, he revealed that Apple’s iPhone was tracking user location constantly. He’s written eight books and writes regularly for Hackster.io, Hackaday, and other outlets. A former astronomer, he also built a peer-to-peer autonomous telescope network that detected what was, at the time, the most distant object ever discovered.

Presentations

Benchmarking machine learning at the edge Session

The future of machine learning is on the edge and on small, embedded devices. Over the last year, custom silicon, intended to speed up machine learning inferencing on the edge, has started to appear. No cloud needed. Alasdair Allan evaluates the new silicon, looking not just at inferencing speed but also at heating, cooling, and the overall power envelope needed to run it.

Magaly Alonzo is a data scientist at Elter. He works on a range of use cases, for example, with railroad companies and the French army. With a background in neuroscience, he focuses on data science and, more specifically, the development of AI based on either time series data or images. These developments often involve real-time management and inference, relying on new-generation hardware such as NVIDIA Jetson boards or Intel’s OpenVINO toolkit. His neuroscience experience often helps him find ideas for network architectures at the edge of current research. He’s a coauthor of the book Apprendre demain (in French only), which depicts AI and the neurosciences as disciplines that must work together and interact more.

Presentations

Dealing with time series data Session

Time series is a type of data defined by one distinctive property: time. Because of this property, time series calls for a specific kind of neural network, one with memory. Magaly Alonzo offers an overview of what time series data is and its properties. You'll then dive into recurrent neural networks, an architecture designed for this purpose.
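
As an illustration of why memory matters, here's a minimal single-unit recurrent cell in plain Python (toy weights and names of my choosing, not material from the session): the hidden state carries information from earlier steps forward, so the same inputs in a different order produce a different result.

```python
import math

def rnn_forward(xs, w_x=0.5, w_h=0.8, h0=0.0):
    """Minimal single-unit recurrent cell: the hidden state h is the
    network's memory, updated at every time step."""
    h = h0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)  # new state mixes input with memory
    return h

# The same values in a different order yield a different final state;
# an order-blind model (e.g., a plain sum of inputs) could not tell them apart.
early_spike = rnn_forward([1.0, 0.0, 0.0])
late_spike = rnn_forward([0.0, 0.0, 1.0])
```

In practice you'd reach for an LSTM or GRU layer in a deep learning framework, which adds gating so this memory can be trained over long sequences.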

Shradha Ambekar is a staff software engineer in the Small Business Data Group at Intuit, where she’s the technical lead for the lineage framework (SuperGLUE) and real-time analytics. She has made several key contributions to building solutions around the data platform and has contributed to the spark-cassandra-connector. She has experience with the Hadoop Distributed File System (HDFS), Hive, MapReduce, Hadoop, Spark, Kafka, Cassandra, and Vertica. Previously, she was a software engineer at Rearden Commerce. Shradha spoke at the O’Reilly Open Source Conference in 2019. She holds a bachelor’s degree in electronics and communication engineering from NIT Raipur, India.

Presentations

Always accurate business metrics with lineage-based anomaly tracking Session

Imagine a business metric showing a sudden spike. Debugging data pipelines is nontrivial and finding the root cause can take hours to days. Shradha Ambekar and Sunil Goplani outline how Intuit built a self-serve tool that automatically discovers data pipeline lineage and applies anomaly detection to detect and debug issues in minutes.
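
The anomaly detection half of the idea can be sketched with a simple rolling z-score. This is an illustrative toy (hypothetical function name and threshold), not Intuit's actual tool:

```python
import statistics

def detect_anomalies(series, window=5, threshold=3.0):
    """Flag indices whose value deviates more than `threshold` standard
    deviations from the mean of the preceding `window` points."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.mean(history)
        spread = statistics.stdev(history) or 1e-9  # guard against zero spread
        if abs(series[i] - mean) / spread > threshold:
            flagged.append(i)
    return flagged

# A daily business metric with one sudden spike:
daily_metric = [100, 102, 99, 101, 100, 98, 500, 101, 99]
```

A lineage-aware system would then walk the upstream pipeline of a flagged metric to localize the root cause, rather than leaving an engineer to trace it by hand.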

Optimizing analytical queries on Cassandra by 100x Session

Data analysis at scale with fast query response is critical for businesses. Cassandra, a popular datastore for streaming applications, supports analytical workloads through its Spark integration, but queries can be slow. Shradha Ambekar unpacks the challenges faced at Intuit and the solutions her team implemented to improve performance by 100x.

Janisha Anand is a senior business development manager for data lakes at AWS, where she focuses on designing, implementing, and architecting large-scale solutions in the areas of data management, data processing, data architecture, and data analytics.

Presentations

Build a serverless data lake for analytics 1-Day training

Janisha Anand and Nikki Rouda teach you how to build a serverless data lake on AWS. You'll ingest Instacart's public dataset to the data lake and draw valuable insights on consumer grocery shopping trends. You’ll build data pipelines, leverage data lake storage infrastructure, configure security and governance policies, create a persistent catalog of data, perform ETL, and run an ad hoc analysis.

Eitan is the director of data science at Bill.com and has many years of experience as a scientist and researcher. His recent focus is on machine learning, deep learning, applied statistics, and engineering. Previously, Eitan was a postdoctoral scholar at Lawrence Berkeley National Lab; he received his PhD in physics from Boston University and his BS in astrophysics from the University of California, Santa Cruz. Eitan has 2 patents and 11 publications to date and has spoken about data at conferences around the world.

Presentations

Beyond OCR: Using deep learning to understand documents Session

Although the field of optical character recognition (OCR) has been around for half a century, document parsing and field extraction from images remains an open research topic. We utilize an end-to-end deep learning architecture that leverages document understanding to extract fields of interest.

Antje Barth is a senior developer advocate for AI and machine learning at AWS. Besides AI and ML, Antje is passionate about helping developers leverage big data, container, and Kubernetes platforms in the context of AI and machine learning. Previously, Antje held technical evangelist and solutions engineering roles at MapR and Cisco. She frequently speaks at AI and machine learning conferences and meetups around the world. Antje is a cofounder of the Düsseldorf chapter of Women in Big Data.

Presentations

Closing the loop: Continuous machine learning using Kubeflow Session

Many machine learning systems focus primarily on training models but leave users with the task of deploying and retraining their models. Antje Barth discusses the importance of continuous machine learning for improving model performance and details practical approaches to building continuous model training pipelines using Kubeflow.

Jason Bell specializes in high-volume streaming systems for large retail customers and has used Kafka in a commercial context for the last five years. Jason was section editor for Java Developer’s Journal, has contributed to IBM developerWorks on autonomic computing, and is the author of Machine Learning: Hands-On for Developers and Technical Professionals.

Presentations

From Apache Kafka to Apache Pulsar: The plan and the reality Session

Apache Pulsar gives you the same robust real-time messaging capabilities as Kafka. Jason Bell examines the challenges of migrating from an existing Kafka cluster to Apache Pulsar and what considerations you need to make with brokers, topics, retention, consumers, and producers.

David Benham is a data scientist at Chesapeake Energy.

Presentations

Using Spark for anomaly detection at scale: A case study Session

Cloudera and Chesapeake Energy present a real-world use case for anomaly detection at scale to reduce time to action in response to pipeline blockage. You'll explore the use case end to end, including the business context, the problem, the machine learning approach taken, the technical architecture employed, and the lessons learned.

Giacomo Bernardi is a distinguished engineer at Extreme Networks, where he works on multiple science-heavy projects in traffic engineering and network traffic visibility analytics. He leads a global team of data scientists and machine learning engineers. Giacomo is a self-proclaimed networking nerd and was CTO of a large internet service provider, where he built a custom software-defined platform. He holds a PhD in wireless networking from the University of Edinburgh (UK), an MSc from Trinity College Dublin (Ireland), and a BSc from the University of Milan (Italy).

Presentations

What do machines say when nobody’s looking? Tracking IoT security with NLP Session

Machines talk among themselves! What can we learn about their behaviour by analysing their "language"? Giacomo Bernardi presents a lightweight approach to securing large IoT deployments by leveraging modern natural language processing techniques. Rather than attempting cumbersome firewall rules, he argues that IoT deployments can be efficiently secured by online behavioural modelling.
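
To make the "machine language" idea concrete, here's a toy sketch with hypothetical event names (my illustration, not the speaker's system): treat each device's sequence of protocol events as words, learn the bigrams seen during normal operation, and score new sequences by how many unseen bigrams they contain.

```python
from collections import Counter

def train_bigrams(sequences):
    """Count adjacent event pairs (bigrams) observed in normal traffic."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts

def novelty_score(counts, seq):
    """Fraction of a sequence's bigrams never seen in training; a high
    score suggests the device is 'speaking' unusually."""
    bigrams = list(zip(seq, seq[1:]))
    if not bigrams:
        return 0.0
    unseen = sum(1 for b in bigrams if counts[b] == 0)
    return unseen / len(bigrams)

# Hypothetical normal chatter for a sensor, repeated over many sessions:
normal = [["dns", "tls_hello", "data", "ack"]] * 20
model = train_bigrams(normal)
```

A compromised device that suddenly starts port-scanning emits bigrams the model has never seen, so its novelty score jumps, without any firewall rule naming the attack in advance.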

Rajesh Shreedhar Bhat is a data scientist at Walmart Labs, Bangalore. His work is primarily focused on building reusable machine and deep learning solutions that can be used across various business domains at Walmart. He completed his bachelor’s degree at PESIT, Bangalore, and is currently pursuing an MS in computer science with an ML specialization at Arizona State University.
He has several research publications in NLP and vision, published at top-tier conferences such as CoNLL and ASONAM, and has filed six US patents in the retail space leveraging AI and ML. He is a Kaggle Expert (world rank 966 of 122,431) with three silver and two bronze medals and has spoken at recognized conferences and meetups such as the Data Hack Summit, India’s largest applied AI and machine learning conference, and Kaggle Days meetups (senior track).
Rajesh has also mentored the Udacity Deep Learning and Data Scientist Nanodegree programs for the past three years and has conducted ML and DL workshops at GE Healthcare, IIIT Kancheepuram, and many other places.


Presentations

Attention Networks all the way to production using Kubeflow 1-Day training

With the latest developments in deep learning and artificial intelligence, many demanding natural language processing tasks have become easier to implement and execute. Text summarization is one task that can be done using attention networks.

Satadal Bhattacharjee is principal product manager at AWS AI. He leads the machine learning engine PM team on projects such as SageMaker, optimizing and enhancing machine learning frameworks, and AWS deep learning containers and AMIs. For fun outside work, Satadal loves to hike, coach robotics teams, and spend time with his family and friends.

Presentations

Using Amazon SageMaker to build, train, and deploy ML models 1-Day training

Build, train, and deploy a deep learning model on Amazon SageMaker with Nathalie Rauschmayr, Satadal Bhattacharjee, and Aparna Elangovan, and learn how to use some of the latest SageMaker features such as SageMaker Debugger and SageMaker Model Monitor.

Wojciech Biela is a cofounder of Starburst, where he’s responsible for product development. He has over 15 years of experience building products and running engineering teams. Previously, Wojciech was the engineering manager at the Teradata Center for Hadoop, running Presto engineering operations in Warsaw, Poland; built and ran the Polish engineering team for a subsidiary of Hadapt, a pioneer in the SQL-on-Hadoop space (acquired by Teradata in 2014); and built and led teams on multiyear projects, from custom big ecommerce and SCM platforms to POS systems. Wojciech holds an MS in computer science from the Wroclaw University of Technology.

Presentations

Presto on Kubernetes: Query anything, anywhere Session

Wojciech Biela and Karol Sobczak explore Presto, an open source SQL engine offering high-concurrency, low-latency queries across multiple data sources within one query. With Kubernetes, you can easily deploy and manage Presto clusters across hybrid and multicloud environments with built-in high availability, autoscaling, and monitoring.

Marcel Blattner is a chief data scientist at Tamedia, Switzerland. He’s responsible for developing an analytical stack within the Tamedia end-to-end architecture to facilitate new insights from data benefiting all stakeholders. He earned his PhD in physics.

Presentations

The black box problem Session

We still lack a clear understanding of how deep learning neural networks learn. Theoretical physics can provide some tools to gain more insight about generalization and model robustness. Marcel Blattner offers an overview of ongoing research and the first promising and applicable results.

Adam Blum is the CEO of Auger.AI. He’s a serial CEO, CTO, vice president of engineering, and cofounder, and a continually active open source contributor for Python, Go, and Ruby. He initiated and participated in five technology standards, including founding LTI Resource Search. He wrote what may be the first book on applied neural networks, Neural Networks in C++ (Wiley, 1992). Previously, he was a professor at UC Berkeley and Carnegie Mellon.

Presentations

Automating AutoML: How automated building of machine learning models transforms software Session

First-generation AutoML targeted business analysts and "citizen data scientists": upload data to the service, watch the leaderboard, pick a winning model. Second-generation AutoML targets developers and covers the full AutoML lifecycle. Join Adam Blum to learn how these tools transform applications by replacing logic with predictions.

Hugo Bowne-Anderson is a data scientist at DataCamp with extensive experience teaching basic to advanced data science topics at institutions such as Yale University and Cold Spring Harbor Laboratory, at conferences such as SciPy and PyCon, and with organizations such as Data Carpentry. He has developed over 25 courses on the DataCamp platform, impacting over 300,000 learners worldwide. He has also hosted DataFramed, the DataCamp podcast, loves teaching Bayesian data analysis, and aspires to reduce as much computational anxiety in the world as he can through pedagogy. His main interests are promoting data and AI literacy and fluency and helping to spread data skills throughout organizations.

Presentations

Essential math and statistics for data science 2-Day Training

Hugo Bowne-Anderson walks you through the basics of the math and stats you need to know to do data science and interpret your results correctly (the calculus, linear algebra, statistical intuition, and probabilistic thinking, among others) through hands-on examples from machine learning, online experiments and hypothesis testing, natural language processing, data ethics, and more.

Essential math and statistics for data science (Day 2) Training Day 2

Hugo Bowne-Anderson walks you through the basics of the math and stats you need to know to do data science and interpret your results correctly (the calculus, linear algebra, statistical intuition, and probabilistic thinking, among others) through hands-on examples from machine learning, online experiments and hypothesis testing, natural language processing, data ethics, and more.

Yaakov Bressler is a data scientist and theater producer at Dramatic Solutions. His works include Magic the Play and Jung and Crazy. He has extensive consulting experience in data science and uses advanced mathematics and sophisticated algorithms to tackle complicated problems. In his theater work, he’s drawn to tackling societal issues. Straddling both worlds and both roles has allowed him to improve the communication and accessibility of advanced analytical practices to business leaders within entertainment and the arts.

Presentations

Dynamic pricing for Broadway and the West End Session

Dynamic pricing implemented properly by Broadway, the West End, and smaller theaters shows the promise of increasing revenue while selling more tickets and lowering prices. Kelly Carmody and Yaakov Bressler dig into their work proving the statistics behind dynamic pricing using probability distributions and a variety of modeling techniques in Python.
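
The core calculation behind dynamic pricing can be sketched in a few lines: model the probability that a customer buys at a given price, then pick the price that maximizes expected revenue. The demand curve below is a hypothetical logistic toy of my own, not the presenters' model:

```python
import math

def purchase_prob(price, ref_price=100.0, sensitivity=0.05):
    """Toy logistic demand curve: probability a customer buys at `price`.
    At ref_price, demand is 50%; higher prices depress it smoothly."""
    return 1.0 / (1.0 + math.exp(sensitivity * (price - ref_price)))

def best_price(candidates):
    """Choose the candidate maximizing expected revenue = price * P(buy)."""
    return max(candidates, key=lambda p: p * purchase_prob(p))

# Candidate ticket prices to evaluate:
prices = [60, 80, 100, 120, 140, 160]
```

In a real system the demand curve would itself be estimated from ticket sales data, with uncertainty handled via the probability distributions the session covers.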

Patrick Buehler is a principal data scientist at Microsoft’s Cloud AI Group. He has over 15 years of working experience in academic settings and with various internal/external customers spanning a wide range of computer vision problems. He obtained his PhD from Oxford in Computer Vision with Prof. Andrew Zisserman.

Presentations

Solving real-world computer vision problems with open source Session

Training and deployment of deep neural networks for computer vision (CV) in realistic business scenarios remains a challenge for both data scientists and engineers. Angus Taylor and Patrick Buehler dig into state-of-the-art in the CV domain and provide resources and code examples for various CV tasks by leveraging the Microsoft CV best-practices repository.

Alberto Calleja is a software engineer interested in building products people love in agile environments, with a focus on high-quality tests and clean code. He’s part of the Spring engineering team at Pivotal, working from Seville, Spain, on a fully remote team building Spring Cloud-related products and frameworks that help people adopt a microservices architecture and improve the experience of Spring on Cloud Foundry and Kubernetes. He focuses on reliability, continuous delivery, and testing, and mainly commits to Java open source projects.

Presentations

Kubernetes Distilled: An in-depth guide for the busy data engineer 1-Day training

Today's data engineer needs a deep understanding of the key tools and concepts within the vast, rapidly evolving Kubernetes ecosystem. This training gives developers a thorough grounding in Kubernetes concepts, suggests best practices, and gets hands-on with some of the essential tooling.

Kelly Carmody is a data scientist at Dramatic Solutions in New York City, where she gives workshops and talks for the theater community on how to increase profitability, accessibility, and quality of theater by implementing innovative tech and analytics solutions. She has an interdisciplinary background in neuroscience, sociology, and epidemiology. She has a range of research experience at institutions ranging from the University of the Virgin Islands to Columbia University and received her master’s degree in infectious disease control from the London School of Hygiene and Tropical Medicine.

Presentations

Dynamic pricing for Broadway and the West End Session

Dynamic pricing implemented properly by Broadway, the West End, and smaller theaters shows the promise of increasing revenue while selling more tickets and lowering prices. Kelly Carmody and Yaakov Bressler dig into their work proving the statistics behind dynamic pricing using probability distributions and a variety of modeling techniques in Python.

Jeff Carpenter leads the Developer Advocate team at DataStax, using his background in system architecture, microservices and Apache Cassandra to help empower developers and operations engineers to build distributed systems that are scalable, reliable, and secure. Jeff has worked on large-scale systems in the defense and hospitality industries and is the author of Cassandra: The Definitive Guide, 2nd Edition (3rd Edition on the way!).

Presentations

Building Data Pipelines with Kafka and Cassandra Interactive session

In this hands-on training, you’ll learn how to incorporate Apache Cassandra and Apache Kafka into your data pipelines, using the Kafka Connect framework and the DataStax Kafka source and sink Connectors.

Wei-Chiu Chuang is a software engineer at Cloudera, where he’s responsible for the development of Cloudera’s storage systems, mostly the Hadoop Distributed File System (HDFS). He’s an Apache Hadoop committer and Project Management Committee member for his contributions to the open source project. He’s also a cofounder of the Taiwan Data Engineering Association, a nonprofit organization promoting better data engineering technologies and applications in Taiwan. Wei-Chiu earned his PhD in computer science from Purdue University for his research in distributed systems and programming models.

Presentations

Distributed tracing in Apache Hadoop Session

Distributed tracing is a well-known technique for identifying where failures occur and the reason behind poor performance, especially for complex systems like Hadoop, which involves many different components. Siyao Meng and Wei-Chiu Chuang demo the work on integrating OpenTracing in the Hadoop ecosystem and outline Cloudera's future integration plan.

Alexandre (Alex) Combessie is a data scientist at Dataiku who designs and deploys data projects with machine learning from prototype to production. Previously, he helped build the data science team at Capgemini in France. Having begun his career in economic analysis, he continues to work on interpretable models in complement to deep learning. Alex loves travel and enjoys learning new things and making useful products.

Presentations

Generative adversarial networks for finance Session

The Gaussian assumption in the Black-Scholes formula for option pricing has proven its limits. Today, GANs are the new gold standard for simulation. They've worked wonders in image generation, but it remains to be seen whether they can be applied to option pricing. Alexandre Combessie tells the story of how two data scientists deployed a GAN for option pricing in real time in 10 days.

Maurice Coyle is a chief data scientist at Trūata. He has more than 15 years of experience building innovative technology solutions that deliver improved experiences while respecting user privacy. Maurice’s deep technical and academic expertise, gained during his PhD and postdoctoral studies, is complemented by a wealth of commercial experience from cofounding and leading a tech startup as CEO.

Presentations

Data Privacy Mythbusting Session

Is customer trust dead? Maurice Coyle unpacks this question and explores some of the myths around the use of personal data and consumer privacy. He debunks some of the most common data privacy myths and shares valuable insights into the effective use of data for insights-driven organizations.

Alistair Croll is an entrepreneur with a background in web performance, analytics, cloud computing, and business strategy. In 2001, he cofounded Coradiant (acquired by BMC in 2011) and has since helped launch Rednod, CloudOps, Bitcurrent, Year One Labs, and several other early-stage companies. He works with startups on business acceleration and advises a number of larger companies on innovation and technology. A sought-after public speaker on data-driven innovation and the impact of technology on society, Alistair has founded and run a variety of conferences, including Cloud Connect, Bitnorth, and the International Startup Festival, and is the chair of O’Reilly’s Strata Data Conference. He has written several books on technology and business, including the best-selling Lean Analytics. Alistair tries to mitigate his chronic ADD by writing about far too many things at Solve For Interesting.

Presentations

Thursday keynotes Keynote

Strata Data & AI Conference program chairs Rachel Roumeliotis and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Strata Data & AI Conference program chairs Rachel Roumeliotis and Alistair Croll welcome you to the first day of keynotes.

Tom is the head of decision science at Monzo, where he’s developing the next generation of decisioning tools to help deliver on Monzo’s mission of making money work for everyone.

Tom is a strong advocate of diversity in data science and actively supports mental health and neurodiversity initiatives.

Outside of work, Tom is a keen photographer and traveller, and owns far too many board games.

Presentations

AI safety: How do we bridge the gap between technology and the law? Session

Firms and governments have become more aware of the risk of "black-box" algorithms that "work," but in an opaque way. Existing laws and regulations merely stipulate what ought to be the case, not how to achieve it technically. Richard Sargeant is joined by leading figures from law, technology, and business to interrogate this subject.

Robert Crowe is a data scientist and TensorFlow Developer Advocate at Google with a passion for helping developers quickly learn what they need to be productive. He’s used TensorFlow since the very early days and is excited about how it’s evolving quickly to become even better than it already is. Previously, Robert deployed production ML applications and led software engineering teams for large and small companies, always focusing on clean, elegant solutions to well-defined needs. In his spare time, Robert sails, surfs occasionally, and raises a family.

Presentations

From research to production: Lessons that Google has learned Session

Production ML must address issues of modern software methodology as well as issues unique to ML. Different types of ML have different requirements, often driven by different data lifecycles and ground truth. And implementations often suffer from limitations in modularity, scalability, and extensibility. Robert Crowe examines production ML applications and reviews TensorFlow Extended (TFX).

Michael Cullan is a data scientist in residence at Pragmatic Institute, where he teaches hands-on courses in data science and business-oriented topics in managing data science initiatives at the organizational level. He also leads internal data science projects in support of marketing and operations teams. He earned a master’s degree in statistics and a bachelor’s degree in mathematics. His academic research areas ranged from computational paleobiology, where he developed software for measuring evidence for disparate evolutionary models based on fossil data, to music and AI, where he assisted in modeling musical data for a jazz improvisation robot. In his free time, he applies his math and programming skills toward creating code-based visual art and design projects.

Presentations

Machine learning from scratch in TensorFlow 2-Day Training

The TensorFlow library provides data flow graphs for numerical computations, with automatic parallelization across several CPUs or GPUs. This architecture makes it ideal for implementing neural networks and other machine learning algorithms.

Machine learning from scratch in TensorFlow (Day 2) Training Day 2

The TensorFlow library provides data flow graphs for numerical computations with automatic parallelization across several CPUs or GPUs. Michael Cullan teaches you why this architecture makes it ideal for implementing neural networks and other machine learning algorithms.

Walid Daboubi is a cyber data scientist at Richemont, where he develops threat-hunting solutions by applying machine learning and advanced data analytics to cyber-resilience, such as a malware-detection project using a deep autoencoder neural network. Previously, he worked on cloud security development at Dassault Systèmes. He earned a master’s in computer science from Université de Technologie de Compiègne. Walid has presented at a number of conferences on machine learning use cases.

Presentations

Hunting with AI: A guide to proactive incident response Session

Traditional cybersecurity processes are by definition reactive in that they're based on a set of rules. Walid Daboubi offers you a glimpse into how Richemont made its cybersecurity approach more proactive by applying machine learning on a set of concrete use cases.

Tal Doron is the director of technology innovation at GigaSpaces, where he bridges the gap between business and technology, architecting and strategizing digital transformations from idea to success with strong business impact. He manages presales activities, engaging with decision makers at all levels, from architects to strategic dialogue with C-level executives. Tal brings over a decade of technical experience in enterprise architecture, specializing in mission-critical applications with a focus on real-time analytics, distributed systems, identity management, fusion middleware, and innovation. Previously, Tal held positions at Dooblo, Enix, Experis BI, and Oracle.

Presentations

Visualize your operational data, analytics, and machine learning insights in real time Session

More enterprises are using big data for better business decision making, but existing infrastructure lacks the performance and scale needed to support the growing requirements for real-time analysis and visualization of operational data. Tal Doron outlines how you can achieve BI visualization on fresh data for real-time dashboards and low-latency response time when generating reports.

Robert Drysdale is a senior manager at Accenture’s Global Innovation Centre, The Dock. He manages the AI and data engineering teams, who build AI systems for Accenture business units and clients.

Presentations

Cloud native machine learning Session

Robert Drysdale takes you through building, training, and deploying machine learning and deep learning models on the main cloud platforms (AWS, Azure, GCP) and in a cloud-agnostic way.

Ted Dunning is the chief technology officer at MapR, an HPE company. He’s also a board member for the Apache Software Foundation and a PMC member and committer on a number of projects. Ted has years of experience with machine learning and other big data solutions across a range of sectors. He’s contributed to clustering, classification, and matrix decomposition algorithms in Mahout and to the new Mahout Math library, and he designed the t-digest algorithm used in several open source projects and by a variety of companies. Previously, Ted was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud-detection systems for ID Analytics (LifeLock). Ted has coauthored a number of books on big data topics, including several published by O’Reilly related to machine learning, and has 24 issued patents to date plus a dozen pending. He holds a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.

Presentations

Building real-world data pipelines Session

Data pipelines are fast becoming a standard fixture in modern systems, but how to build and maintain them isn't nearly as widely known as, say, building a data warehouse. Ted Dunning demystifies the core building blocks of such pipelines and how to use tools such as TensorFlow (extended), scikit-learn, Apache Flink, and Apache Beam to build, maintain, and monitor them.

Aparna Elangovan is an AI ML prototyping engineer at AWS. She designs deep learning solutions in computer vision and natural language processing on AWS.

Presentations

Using Amazon SageMaker to build, train, and deploy ML models 1-Day training

Build, train, and deploy a deep learning model on Amazon SageMaker with Nathalie Rauschmayr, Satadal Bhattacharjee, and Aparna Elangovan, and learn how to use some of the latest SageMaker features such as SageMaker Debugger and SageMaker Model Monitor.

Jeff Evans is a staff software engineer at StreamSets, where he helped build its state-of-the-art data ops platform. Besides the obvious—developing new features and fixing bugs—he also engages actively in the StreamSets community channels, fleshes out technical designs for numerous projects, and spends many long hours debugging thorny customer issues. Most recently, he’s been deeply involved in Transformer, StreamSets’s next generation engine for executing data pipelines on Apache Spark. Previously, he worked for a decade in the financial industry under a variety of technology roles and spent a few years building student safety solutions in the education technology sector.

Presentations

Implementing slowly changing dimensions on Spark Session

Spark is a powerful tool for data processing, but can it do slowly changing dimensions? The answer is yes, with some thoughtful use of its capabilities. And thanks to Spark’s built-in features, you aren’t limited to databases when it comes to handling deltas and persisting historical changes in records. Jeff Evans includes live demos so you can see these concepts in action.

Brandy Freitas is a principal data scientist at Pitney Bowes, where she works with clients in a wide variety of industries to develop analytical solutions for their business needs. Brandy is a research-physicist-turned-data-scientist based in Boston, Massachusetts. Her academic research focused primarily on protein structure determination, applying machine learning techniques to single-particle cryoelectron microscopy data. Brandy is a National Science Foundation Graduate Research Fellow and a James Mills Pierce Fellow. She holds an undergraduate degree in physics and chemistry from the Rochester Institute of Technology and did her graduate work in biophysics at Harvard University.

Presentations

Enhance machine learning with graph-native algorithms Session

Brandy Freitas demystifies the mathematical principles behind graph databases, offers a primer on graph-native algorithms, and outlines the current use of graph technology in industry.

Laura Froelich is a data scientist at DHI Water & Environment, where she is dedicated to utilizing data to discover patterns and underlying structure to enable optimization of businesses and processes, particularly through deep learning methods. Before that, she worked on a large variety of projects covering industries spanning life sciences to the energy industry at Teradata. Previously, Laura was part of a research group investigating nonspecific effects of vaccines using survival analysis methods. Laura holds a PhD from the Technical University of Denmark. For her dissertation, Decomposition and Classification of Electroencephalography Data, Laura used unsupervised decomposition and supervised classification methods to research brain activity and developed rigorous, interpretable approaches to classifying tensor data.

Presentations

Radar-based flow prediction in water networks for better real-time decision making Session

We combine traditional predictive models with deep learning methods to improve the operation of wastewater treatment plants. This data-driven approach relies on weather radar data that replaces local and often sparsely located rain gauge sensor stations. Our approach allows for fast, probabilistic forecasts that robustly improve real-time operation of the urban drainage system.

Barbara Fusinska is a machine learning strategic cloud engineering manager at Google with a strong software development background. Previously, she worked at a variety of companies, including ABB, Base, Trainline, and Microsoft, where she gained experience building diverse software systems, ultimately focusing on the data science and machine learning field. Barbara believes in the importance of data and metrics when growing a successful business. In her free time, Barbara enjoys programming activities and collaborating around data architecture. She can be found on Twitter as @BasiaFusinska and blogs at http://barbarafusinska.com.

Presentations

Natural language processing with deep learning and TensorFlow Session

Natural language processing (NLP) offers techniques to gain insight from and generate text data. Barbara Fusinska introduces you to NLP concepts and deep learning architectures using document context. You'll see a series of demos with TensorFlow from classification task to text generation.

Debasish Ghosh is a principal software engineer at Lightbend. He’s passionate about technology and open source, loves functional programming, and has been trying to learn math and machine learning. Debasish is an occasional speaker in technology conferences worldwide, including the likes of QCon, Philly ETE, Code Mesh, Scala World, Functional Conf, and GOTO. He’s the author of DSLs In Action and Functional & Reactive Domain Modeling. Debasish is a senior member of ACM. He’s also a father, husband, avid reader, and Seinfeld fan, who loves spending time with his beautiful family.

Presentations

Online ML in streaming apps: Adapt to change with limited resources Session

Debasish Ghosh and Stavros Kontopoulos explore online machine learning algorithm choices for streaming applications, including resource-constrained use cases like IoT and personalization, complete with code samples. You'll learn about drift detection algorithms and Hoeffding Adaptive Trees, performance metrics for online models, and practical concerns with deployment in production.

Oliver Gindele is head of machine learning at Datatonic. Oliver is passionate about using computer models to solve real-world problems. Working with clients in retail, finance, and telecommunications, he applies deep learning techniques to tackle some of the most challenging use cases in these industries. He studied materials science at ETH Zurich and holds a PhD in computational physics from UCL.

Presentations

ML in production: Serverless and painless Session

Productionizing machine learning (ML) pipelines can be a daunting and difficult task for data scientists. Oliver Gindele highlights some of the newest technologies that address that issue and explains how a global cosmetics brand used them to productionize a serverless ML pipeline in an exciting case study.

Sarah Gold is CEO and founder at Projects by IF. She is a leading expert in emerging issues and trends in privacy, security and technology. Since Sarah founded IF in 2015, she has grown a world-renowned, multidisciplinary team who work with some of the most influential global organisations. IF’s recent partners and clients include a range of companies from different sectors, from big tech to healthcare, including Homes England, Google AI, Oxfam, Barnardo’s and Citizens Advice.

Presentations

Data design patterns that people trust Session

People care about how data about them is used. Building trust with consumers will require a change in how services treat data. Since 2016, IF has curated a data patterns catalogue used by product teams around the world. We'll show how these patterns help teams build digital services that give people agency over data, build trust, and start addressing systemic imbalances of power.

Victor Gonzalez is an innovation manager at ConCrédito, where he has participated in the company's digital transformation. Previously, he worked on technology projects as a software engineer in different industries and as a business intelligence consultant supporting decision making; in recent years he has managed innovation projects while also indulging his taste for teaching at several universities. He's a lover of art, unknown places, long talks, and technology that gives us time for all that.

Presentations

Brick-and-mortar to digital-first: Data-driven digital transformation Session

Victor Gonzalez explores how the fintech ecosystem is changing the rules of the financial services industry in Mexico. ConCrédito's data-driven digital transformation project is the basis for the growth and scope of its business objectives. The business model needed to migrate from traditional processes to digital ones, allowing ConCrédito to be in the hands of its customers.

Martin Goodson is a chief scientist and CEO of Evolution AI and a specialist in machine reading technologies. He’s also the chair of the Royal Statistical Society Data Science Section, the professional body for data science in the UK, and runs the largest machine learning community in Europe. Martin’s work has been covered in the Economist, Quartz, Business Insider and TechCrunch.

Presentations

Lessons learned from running one of the UK’s largest AI research consortiums Session

Combining the exacting requirements of a leading data provider with a university’s expertise led to breakthrough technology that reads balance sheets more accurately than humans. But the journey wasn’t smooth. Martin Goodson shares the project’s structure, outcomes, and mistakes made along the way.

Sunil Goplani is a group development manager at Intuit, leading the big data platform. Sunil has played key architecture and leadership roles in building solutions around data platforms, big data, BI, data warehousing, and MDM for startups and enterprises. Previously, Sunil served in key engineering positions at Netflix, Chegg, Brand.net, and a few other startups. Sunil holds a master's degree in computer science.

Presentations

Always accurate business metrics with lineage-based anomaly tracking Session

Imagine a business metric showing a sudden spike. Debugging data pipelines is nontrivial and finding the root cause can take hours to days. Shradha Ambekar and Sunil Goplani outline how Intuit built a self-serve tool that automatically discovers data pipeline lineage and applies anomaly detection to detect and debug issues in minutes.

Trevor Grant is an Apache Software Foundation member involved in multiple projects, including Mahout, Streams, and SDAP-incubating, just to name a few. He holds an MS in applied math and an MBA from Illinois State University. He speaks about computer stuff internationally. He has taken numerous classes in stand-up and improv comedy to make his talks more pleasant for you, the listener.

Presentations

Ship it! A practitioner's guide to model management and deployment with Kubeflow. Session

We'll show you a way to get and keep your models in production with Kubeflow.

Morgan Gregory leads strategy and programs for Google Cloud’s Office of the CTO, the mission of which is to foster collaborative innovation between Google and its most strategic customers around the world. Her technical focus area is AI, and she has a passion for responsible AI, as well as AI for science and AI for good. By keeping her finger on the pulse of the Office’s engagements and leveraging the deep technical and industry expertise across the team, she identifies themes that are relevant and important for today’s technical leaders. Previously, Morgan was in management consulting at the Boston Consulting Group (BCG), where she advised F100 companies in the technology, financial services, and pharmaceutical industries. She started her career as a software engineer and product manager building partnering products and solutions for tech companies at Partnerpedia (later acquired by BMC), a startup in Vancouver, Canada. Morgan earned her BSc in computer science from the University of British Columbia and an MBA from the MIT Sloan School of Management. Find Morgan on LinkedIn or on Twitter as @morganjgregory.

Presentations

Responsible AI: The importance of getting it right and the harm of getting it wrong Session

The adoption of AI is accelerating. We're reaping many benefits from the advancement of AI, but we're also seeing hints of the unintended harm that occurs when responsibility isn’t front and center. Morgan Gregory explains why it’s critical to understand how and why this happens so we can build our future responsibly, with AI that's fair, safe, trustworthy, and green.

Anna Gressel is a litigation associate at Debevoise & Plimpton LLP and a member of the firm’s Commercial Litigation Group and Technology, Media & Telecommunications practice. Her practice focuses on complex civil litigation in federal and state courts, and she advises on legal and regulatory issues around artificial intelligence and other emerging technologies. She’s the coauthor of publications including “German Report May Be Road Map for Future AI Regulation,” “Storm Clouds or Silver Linings? Assessing the Impact of the U.S. CLOUD Act on Cross-Border Criminal Investigations,” and “Do the Apps Have Ears? Cross-Device Tracking.” She sits on the board of directors of Ms. JD, a nonprofit organization dedicated to the success of women in law school and the legal profession. She’s a member of the Law Committee of the IEEE Global Initiative on the Ethics of Autonomous and Intelligent Systems.

Presentations

AI impact assessments: Tools for evaluating and mitigating corporate AI risks Session

The Canadian Government made waves when it passed a law requiring AI impact assessments for automated decision systems. Similar proposals are pending in the US and EU. Anna Gressel, Meeri Haataja, and Jim Pastore unpack what an AI impact assessment looks like in practice and how companies can get started from a technical and legal perspective, and they provide tips on assessing AI risk.

AI regulation and ethics in fintech: Insights from the US and UK Session

Anna Gressel, Jim Pastore, and Florian Ostmann lead a crash course on the emerging ethical and regulatory issues surrounding fintech AI. You'll hear insights from statements by US and UK regulators in banking and financial services and examine their priorities in 2020. You'll get practical guidance on how you can mitigate ethical and legal risks and position your AI products for success.

Rob studied Engineering, Economics and Management at Oxford University, graduating in 2007 with first-class honours. Prior to founding Mindful Chef in 2015, he worked as an interest rate options trader at Morgan Stanley, where he ran the exotic derivatives trading desk in New York. In 2018, he was named one of the top 30 UK entrepreneurs under the age of 35 by Startups.co.uk.

Presentations

A recipe for innovation: recommending recipes based on adventurousness Session

Mindful Chef is a health-focused company that delivers weekly recipe boxes. To create a more personalised experience for its customers, it teamed up with Pivigo to develop an innovative recommender system. In this talk, we'll describe the project and the development of a novel approach to understanding user taste that had an unexpectedly large impact on recommendation accuracy.

Sarah is a senior data scientist at InVision, where she studies user collaboration through data. She's an accomplished conference speaker and O'Reilly Media author and enjoys making data science as accessible as possible to a broad audience. Sarah attended graduate school at the University of Michigan's School of Information.

Presentations

Preparing and standardizing data for machine learning Interactive session

Getting your data ready for modeling is the essential first step in the machine learning process. Sarah Guido outlines the basics of preparing and standardizing data for use in machine learning models.

Sijie Guo is the founder and CEO of StreamNative, a data infrastructure startup offering a cloud native event streaming platform based on Apache Pulsar for enterprises. Previously, he was the tech lead for the messaging group at Twitter and worked on push notification infrastructure at Yahoo. He's also the VP of Apache BookKeeper and a PMC member of Apache Pulsar.

Presentations

The secrets behind Apache Pulsar for processing tens of billions of transactions per day Session

Apache Pulsar, a cloud native event streaming platform, is gaining more and more adoption in mission-critical services due to its strong consistency and durability guarantees. Sijie Guo dives into the technical details driving the Pulsar adoption trend and showcases a real-world example of using Apache Pulsar to process billions of transactions every day.

Meeri is the CEO and cofounder of Saidot, a startup with a mission of enabling responsible AI ecosystems. Saidot develops technology for end-user AI explainability, transparency, and independent validation. Meeri was the chair of the ethics working group in Finland's national AI program, which submitted its final report in March 2019. In this role she initiated a national AI ethics challenge and engaged more than 70 organizations to commit to the ethical use of AI and define ethics principles. Meeri is also the chair of IEEE's initiative for the creation of AI ethics certificates in the ECPAIS (Ethics Certification Program for Autonomous and Intelligent Systems) program.

Meeri is an affiliate at the Berkman Klein Center for Internet & Society at Harvard University for the 2019–2020 academic year, focusing on projects related to building citizen trust through AI transparency and open informing.

Prior to starting her own company, Meeri led AI strategy and GDPR implementation at OP Financial Group. She has a long background in analytics and AI consulting with Accenture Analytics, where she drove data and analytics strategies and large AI implementation programs in the media, telecommunications, high-tech, and retail industries. Meeri started her career as a data scientist in telecommunications after completing her MSc (Econ.) at the Helsinki School of Economics.

Meeri is an active advocate of responsible and human-centric AI. She's an experienced public speaker, regularly speaking at international conferences and seminars on AI opportunities, AI ethics, and governance.

Presentations

AI impact assessments: Tools for evaluating and mitigating corporate AI risks Session

The Canadian Government made waves when it passed a law requiring AI impact assessments for automated decision systems. Similar proposals are pending in the US and EU. Anna Gressel, Meeri Haataja, and Jim Pastore unpack what an AI impact assessment looks like in practice and how companies can get started from a technical and legal perspective, and they provide tips on assessing AI risk.

Hatem Hajri is a senior research scientist at IRT SystemX, where he mainly works on robustness and adversarial attacks of artificial intelligence-based systems. Previously, he held three teaching and research positions, at University Paris 10, Luxembourg University, and the University of Bordeaux, where he worked on various problems in stochastic analysis and graphical models, and at the VeDeCoM Institute in Versailles, France, where he conducted research on autonomous driving. He earned the French agrégation in mathematics and his MS and PhD degrees in applied mathematics at Paris Sud University, France.

Presentations

A probabilistic approach to adversarial machine learning Session

Adversarial machine learning studies vulnerabilities of machine learning algorithms in adversarial settings and develops techniques to make learning more robust to adversarial examples. Hatem Hajri outlines adversarial machine learning and illustrates a new approach to the problem of adversarial examples based on probabilistic techniques.

Jonny Hancox is a senior data scientist on the healthcare team at NVIDIA, specializing in the application of artificial intelligence within the fields of radiology, histopathology, and genomics. His work is all about accelerating the uptake of AI and making it easy for people to get started and get the most out of their hardware. Previously, he was a solution architect on the health and life sciences team at Intel. Although originally trained as a product designer, Jonny has spent the majority of his career in software development, with roles ranging from engineer to technical director; the theme of most of this work has been the automation of image-related tasks, usually within the NHS and public sector.

Presentations

Federated learning for healthcare Session

Federated learning (FL) is a relatively new technique pioneered to allow you to use much larger datasets to train machine learning models without needing to share sensitive data. Jonny Hancox describes why this technique is ideal for the healthcare sector, in which patient data is highly sensitive but there's a need to increase the amount of training data to get models to clinically viable levels.

Adam Hill is a lead data scientist at HAL24K working with client projects to deliver smart-city and smart-infrastructure solutions. He has worked within the traffic and water sectors to deliver machine learning models that enable decision support and insight. Adam is also currently a Royal Society Entrepreneur in Residence encouraging innovation and entrepreneurial activities within academia around data science topics and tools. He is also a long-term, core volunteer within the DataKind UK community supporting data science projects delivering social good for NGOs. Adam holds a PhD in Astrophysics from the University of Southampton.

Presentations

Beyond smart infrastructure: leveraging satellite data to detect wildfires Session

Wildfires are a major environmental and health risk, and their frequency has increased dramatically in the past decade. Early detection is critical; however, wildfires are most often discovered through eyewitness accounts. In this talk we'll describe a data science partnership between HAL24K and Pivigo aimed at building an automated wildfire detection system using NOAA satellite data.

Rainer Hoffmann is a senior manager for data and AI at EnBW, where he works at the interface of data science and internal customers and identifies AI use cases across the whole company. He has led numerous AI projects from ideation to production. Previously, Rainer was responsible for algorithmic power trading and started his career as a data scientist.

Presentations

AI at scale driving the German Energiewende Session

Almost two years ago EnBW developed its core beliefs for the role of AI at EnBW and derived concrete actions that need to be taken to scale its AI activities. Rainer Hoffmann and Frank Säuberlich describe the actions and the challenges EnBW has faced on its journey so far and its approach to mastering these challenges.

Rick Houlihan is a principal technologist and leads the NoSQL blackbelt team at AWS and has designed hundreds of NoSQL database schemas for some of the largest and most highly scaled applications in the world. Many of Rick’s designs are deployed at the foundation of core Amazon and AWS services such as CloudTrail, IAM, CloudWatch, EC2, Alexa, and a variety of retail internet and fulfillment-center services. Rick brings over 25 years of technology expertise and has authored nine patents across a diverse set of technologies including complex event processing, neural network analysis, microprocessor design, cloud virtualization, and NoSQL technologies. As an innovator in the NoSQL space, Rick has developed a repeatable process for building real-world applications that deliver highly efficient denormalized data models for workloads of any scale, and he regularly delivers highly rated sessions at re:Invent and other AWS conferences on this specific topic.

Presentations

Where's my Lookup Table? Session

When Amazon decided to migrate thousands of application services to NoSQL, many of those services required complex relational models that could not be reduced to simple key-value access patterns. The most commonly documented use cases for NoSQL are simplistic. This session shows how to model complex relational data efficiently in denormalized structures.

Oliver Hughes is an engineer on the Spring Cloud Services team.

Presentations

Kubernetes Distilled: An in-depth guide for the busy data engineer 1-Day training

Today's data engineer needs a deep understanding of the key tools and concepts within the vast, rapidly evolving Kubernetes ecosystem. This training provides developers with a thorough grounding in Kubernetes concepts, suggests best practices, and gets hands-on with some of the essential tooling.

Max Humber is a distinguished faculty member at General Assembly. Previously, he pushed data at Wealthsimple and Borrowell.

Presentations

Lean ML: Take your machine learning model from idea to URL Interactive session

Max Humber helps you get your model in front of users as quickly as possible. You'll discover a step-by-step lean ML playbook showing you how to convert your idea into a fully deployed application.

Daniel (Dan) Huss is the founder and CEO of gravityAI, a marketplace for algorithms. Previously, Dan was a product manager at State Street Verus, leading a team of over 30 designers, engineers, data scientists, and SMEs, and a project manager on similar AI platforms at BCG Digital Ventures. Dan spoke about his experience developing Verus, a first-of-its-kind mobile application that uses NLP, machine learning, and a knowledge graph to make connections between an investor’s portfolio and news, at the O’Reilly Strata Data Conference in New York. He spoke recently as part of a panel at Strata New York with Cloudera, at CDX NYC, and at TEDx. When Dan isn’t running his startup, you can find him teaching entrepreneurship at CUNY or running enterprise trainings in product management with General Assembly.

Presentations

Algorithm commoditization: Build versus buy decisions from a product manager perspective Session

Many types of algorithms have become commoditized, yet companies continue to spend tight resources trying to build them in-house. Considering that, according to Gartner, 87% of internal data science projects fail to make it into production, it's crazy to concentrate resources on anything but the most proprietary of projects. Daniel Huss is here to help you decide where to focus.

Viacheslav Inozemtsev is a data engineer at Zalando, building an internal data lake platform on top of Apache Spark, Delta Lake, Apache Presto, and serverless cloud technologies, and enabling machine learning and AI for all teams and departments of the company. He has eight years of data and software engineering experience. He earned a degree in applied mathematics and then an MSc in computer science with a focus on data processing and analysis.

Presentations

Lambda architecture with Apache Spark Structured Streaming and Delta Lake tables Session

Lambda architecture is a general-purpose architecture for data platforms. It's been known for a while but was always hard to implement. Viacheslav Inozemtsev explains how, with the release of Delta Lake tables after Spark Structured Streaming became mature, Lambda architecture can now be implemented much more easily than ever before for analytical and machine learning use cases.

Charu Jaiswal is a machine learning scientist at integrate.ai, one of the fastest growing companies in Canadian history. She builds predictive models to help large enterprises like insurance companies and banks become more customer-centric. Previously, she applied machine learning to the venture capital and energy storage industries. Charu earned her master’s degree in machine learning and industrial engineering from the University of Toronto.

Presentations

Machine learning models after deployment: Testing, monitoring, and retraining Session

You train ML models and deploy them into the wild. And then the performance of your models decreases over time as business operations and customer behaviors change. You may only notice months later, incurring costly results. Charu Jaiswal explains how to fight back against performance loss by monitoring, testing, and retraining ML models actively in production.

Asif Jan is a group director in personalized healthcare (PHC) data science at Roche, Switzerland, where he leads a multidisciplinary team of scientists in computer science, neuroscience, and statistics. The team implements a variety of statistical and machine learning methods on real-world datasets (e.g., electronic medical records, health insurance claims, disease registries) to fulfill the evidence and data analysis needs of the neuroscience disease area at Roche. Previously, he was head of data science at Roche Diagnostics, leading a team of quantitative scientists supporting in-vitro diagnostics (IVD) and clinical decision support (CDS) product development, and defined data strategy enabling use of real-world data in Roche Diagnostics, and he’s held a number of roles overseeing technology strategy development, enterprise and solution architecture, and program management at Roche and in other research organizations. Asif has vast experience in building and leading data science teams in pharma, diagnostics, and industrial research institutes, tackling complex scientific and business problems.

Presentations

Data science for enabling personalized healthcare Session

Advances in AI and ML are critical to advancing the understanding of diseases and bringing better, more efficacious treatments to patients, realizing the dream of personalized healthcare. Asif Jan shares insights from building data science teams in pharma and outlines a road map for the success of AI and ML in the pharma industry.

Grishma Jena is a data scientist on the UX research and design team at IBM Data & AI in San Francisco. She works across portfolios in conjunction with the user research and design teams and uses data to understand users’ struggles. Previously, she was a mentor for the nonprofit AI4ALL’s AI Project Fellowship, where she guided a group of high school students on using AI for prioritizing 911 EMS calls. Grishma also teaches Python at the San Francisco Public Library. She enjoys delivering talks and is passionate about encouraging women and youngsters in technology. She holds a master’s degree in computer science from the University of Pennsylvania. Her research interests include machine learning and natural language processing.

Presentations

Data wrangling with Python 2-Day Training

Data science is rapidly changing every industry. This has resulted in a shift away from traditional software development and toward data-driven decision making. Grishma Jena uses Python to extract, wrangle, explore, and understand data so you can leverage it in the real world.

Data wrangling with Python (Day 2) Training Day 2

Data science is rapidly changing every industry. This has resulted in a shift away from traditional software development and toward data-driven decision making. Grishma Jena uses Python to extract, wrangle, explore, and understand data so you can leverage it in the real world.

Pravin Jha is a senior data scientist at Ameren, an American power company. He contributes to the customer analytics domain with his expertise in machine learning and natural language processing. He has more than five years of academic research experience in engineering and data analytics. He also holds a professional engineering license and has more than five years of professional experience in the construction industry. He earned his PhD in engineering science from Southern Illinois University Carbondale. He’s an avid NBA fan and enjoys watching basketball in his free time.

Presentations

Writer-independent offline signature verification in banks using few-shot learning Session

Offline signature verification is one of the most critical tasks in traditional banking and financial industries. The unique challenge is to detect subtle but crucial differences between genuine and forged signatures. This verification task is even more challenging in writer-independent scenarios. Tuhin Sharma and Pravin Jha detail a few-shot image classification approach to the problem.

Ken Johnston is the principal data science manager for the Microsoft 360 Business Intelligence Group (M360 BIG). In his time at Microsoft, Ken has shipped many products, including Commerce Server, Office 365, Bing Local and Segments, and Windows, and for two and a half years, he was the director of test excellence. A frequent keynote presenter, trainer, blogger, and author, Ken is a coauthor of How We Test Software at Microsoft and contributing author to Experiences of Test Automation: Case Studies of Software Test Automation. He holds an MBA from the University of Washington. Check out his blog posts on data science management on LinkedIn.

Presentations

Infinite segmentation: Scalable mutual information ranking on real-world graphs Session

Today, normal growth isn't enough; you need hockey-stick levels of growth. Sales and marketing orgs are looking to AI to "growth hack" their way into new markets and segments. Ken Johnston and Ankit Srivastava explain how to use mutual information at scale across massive data sources to help filter out noise and share critical insights with new cohorts of users, businesses, and networks.
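
As a rough illustration of the core idea (a toy sketch, not the speakers' production pipeline), mutual information measures how much knowing one discrete attribute tells you about another, which makes it a natural ranking score for candidate segments:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate mutual information (in nats) between two discrete variables."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# An attribute that perfectly predicts the cohort is highly informative...
print(mutual_information([1, 1, 0, 0], [1, 1, 0, 0]))  # ≈ 0.693 (ln 2)
# ...while an independent one carries no information.
print(mutual_information([1, 0, 1, 0], [1, 1, 0, 0]))  # 0.0
```

At scale you would compute these scores over counts aggregated in a distributed system rather than raw lists, but the ranking logic is the same.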

Kim Falk is a senior data scientist at IKEA, where he’s part of a small, dedicated team focusing on real-time promotions. Previously, Kim worked on recommender systems in scenarios like retargeting ads and video-on-demand sites. He’s also worked on classifying Danish legal documents using NLP. He’s the author of Practical Recommender Systems.

Presentations

Deep reinforcement learning for personalized promotions at IKEA Session

Around the world, IKEA has an ever-growing number of loyalty club (Family) members. An important part of IKEA’s ongoing digital transformation is to improve communication with these customers and to inspire them with offers that are most relevant for improving their everyday life. Kim Falk shares IKEA's work on personalizing promotional emails.

Robin has actively contributed to building products and platforms that accelerated the digital transformation of industries using the power of data. He’s the chief data and analytics officer at wefox, the largest insurtech in Europe, named one of the 10 hottest fintech companies in the world by Business Insider. Previously, he held senior leadership roles at both Fortune 100 companies like Cisco and agile fintech startups.

Robin will be the CTO and managing director of the firm’s new credit risk assessment startup.

Presentations

How Explainable AI can Solve AI Adoption Hurdles Session

A key challenge to AI adoption is the lack of transparency of black-box models. This talk shows how a Berlin-based startup democratized credit risk assessment with explainable AI. The black-box nature of AI raises concerns about adoption, regulation, and ethical use. We present the hope that explainable AI can not only solve this problem but, in doing so, make the world a better place.

Anthony Joseph is a technology cofounder of a property tech startup and an Australian software engineer and mathematician. He earned his degree from MBT and enjoys teaching and learning coding with the Australian startup scene.

Presentations

Applying machine learning to wearable technologies for exercise technique management Session

IoT devices are increasing in power and capability, now allowing developers to use machine learning models on the device. Anthony Joseph analyzes a boxing training session with motion sensors onboard IoT devices using the TensorFlow framework and provides user feedback on technique and speed.

Russell Jurney is principal consultant at Data Syndrome, a product analytics consultancy dedicated to advancing the adoption of the development methodology Agile data science, as outlined in the book Agile Data Science 2.0 (O’Reilly, 2017). Previously, he worked as a data scientist building data products for over a decade, starting in interactive web visualization and then moving toward full stack data products, machine learning, and artificial intelligence at companies such as NING, LinkedIn, Hortonworks and Relato. He’s a self-taught visualization software engineer, data engineer, data scientist, writer, and most recently teacher. In addition to helping companies build analytics products, Data Syndrome offers live and video training courses.

Presentations

Unsupervised learning: A Python HOWTO of techniques with some theory 1-Day Training

Russell Jurney surveys machine learning techniques from across the field of unsupervised learning, explaining the theory behind each technique along with working examples in Python using open source software.

Davin Kaing is a data scientist on the Client Advocacy team at IBM, where he applies statistics, causal inference, and machine learning to uncover the driving factors of client experience and generate insights that improve it. Previously, Davin was a data scientist and consultant for startups in a variety of industries, including healthcare, finance, and cyber insurance. He holds a master’s degree in statistics from Columbia University, a master’s degree in data science from the George Washington University, and a bachelor’s degree in bioengineering from the University of the Pacific.

Presentations

Causal Inference Using Observational Data Session

What’s driving revenue? How can we improve our client experience? These are causal questions that many organizations face. Answering them with data can be challenging, especially since, in most cases, only observational data are available. We’ll give an overview of both traditional and modern causal inference techniques and address their limitations and applications.
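
To make one of the traditional techniques concrete, here’s a toy illustration (with made-up numbers, not from the talk) of backdoor adjustment by stratification, where a naive comparison and a confounder-adjusted estimate disagree:

```python
# Each record: (confounder z, treatment t, outcome y). Made-up counts chosen
# so that z influences both who gets treated and the outcome.
data = [
    # z=0 stratum: treatment rare
    *[(0, 0, 1)] * 40, *[(0, 0, 0)] * 40,   # untreated: 50% success
    *[(0, 1, 1)] * 12, *[(0, 1, 0)] * 8,    # treated:   60% success
    # z=1 stratum: treatment common
    *[(1, 0, 1)] * 2,  *[(1, 0, 0)] * 8,    # untreated: 20% success
    *[(1, 1, 1)] * 24, *[(1, 1, 0)] * 56,   # treated:   30% success
]

def mean_outcome(rows):
    return sum(y for _, _, y in rows) / len(rows)

# A naive comparison ignores the confounder z...
naive = (mean_outcome([r for r in data if r[1] == 1])
         - mean_outcome([r for r in data if r[1] == 0]))

# ...while adjustment averages the per-stratum effects, weighted by P(z).
adjusted = 0.0
for z in (0, 1):
    stratum = [r for r in data if r[0] == z]
    effect = (mean_outcome([r for r in stratum if r[1] == 1])
              - mean_outcome([r for r in stratum if r[1] == 0]))
    adjusted += effect * len(stratum) / len(data)

print(f"naive: {naive:+.3f}, adjusted: {adjusted:+.3f}")
```

Here the naive estimate makes the treatment look harmful while the adjusted estimate shows a positive effect in every stratum: the classic Simpson's paradox that motivates careful causal analysis of observational data.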

Swasti Kakker is a software development engineer on the data team at LinkedIn. She’s passionate about increasing and improving developer productivity by designing and implementing scalable platforms. In her two-year tenure at LinkedIn, she’s worked on the design and implementation of hosted notebooks, which focuses on providing a hosted solution for Jupyter notebooks, and she’s worked closely with stakeholders to understand the expectations and requirements of a platform that would improve developer productivity. Previously, she worked with the Spark team on making the Spark History Server more scalable so it could handle traffic from Dr. Elephant. She also contributed Spark heuristics to Dr. Elephant after understanding the needs of its stakeholders (mainly Spark developers), which gave her good knowledge of Spark infrastructure, Spark parameters, and how to tune them efficiently.

Presentations

Darwin: Evolving hosted notebooks at LinkedIn Session

Learn about the challenges we overcame to make Darwin (Data Analytics and Relevance Workbench at LinkedIn) a reality, and how data scientists, developers, and analysts at LinkedIn can share their notebooks with their peers, author work in multiple languages, use custom execution environments, execute long-running jobs, and do much more on a single hosted notebooks platform.

Amit Kapoor is a data storyteller at narrativeViz, where he uses storytelling and data visualization as tools for improving communication, persuasion, and leadership through workshops and trainings conducted for corporations, nonprofits, colleges, and individuals. Interested in learning and teaching the craft of telling visual stories with data, Amit also teaches storytelling with data for executive courses as a guest faculty member at IIM Bangalore and IIM Ahmedabad. Amit’s background is in strategy consulting, using data-driven stories to drive change across organizations and businesses. Previously, he gained more than 12 years of management consulting experience with A.T. Kearney in India, Booz & Company in Europe, and startups in Bangalore. Amit holds a BTech in mechanical engineering from IIT, Delhi, and a PGDM (MBA) from IIM, Ahmedabad.

Presentations

Democratize and build better deep learning models using TensorFlow.js Session

Bargava Subramanian and Amit Kapoor use two real-world examples to show how you can quickly build visual data products using TensorFlow.js to address the challenges inherent in understanding the strengths, weaknesses, and biases of your models as well as involving business users to design and develop a more effective model.

Holden Karau is a transgender Canadian software engineer working in the Bay Area. Previously, she worked at IBM, Alpine, Databricks, Google (twice), Foursquare, and Amazon. Holden is the coauthor of Learning Spark, High Performance Spark, and another Spark book that’s a bit more out of date. She’s a committer on the Apache Spark, SystemML, and Mahout projects. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work, she enjoys playing with fire, riding scooters, and dancing.

Presentations

Ship it! A practitioner's guide to model management and deployment with Kubeflow. Session

We'll show you a way to get and keep your models in production with Kubeflow.

Meher Kasam is an iOS software engineer at Square and is a seasoned software developer with apps used by tens of millions of users every day. He’s shipped features for a range of apps from Square’s point of sale to the Bing app. Previously, he worked at Microsoft, where he was the mobile development lead for the Seeing AI app, which has received widespread recognition and awards from Mobile World Congress, CES, FCC, and the American Council of the Blind, to name a few. A hacker at heart with a flair for fast prototyping, he’s won close to two dozen hackathons and converted them to features shipped in widely used products. He also serves as a judge of international competitions including the Global Mobile Awards and the Edison Awards.

Presentations

30 golden rules to speed up TensorFlow performance Session

Meher Kasam, Anirudh Koul, and Siddha Ganju highlight the must-have checklist for everyday AI practitioners to speed up your deep learning training and inference with TensorFlow code examples.

Phil Kendall (he/him) is the chief innovation officer at Intercept IP, a small UK company that produces a low-power black box for the motor insurance market, where he leads the R&D efforts, ensuring its products remain cutting edge. Having made a start on the ZX Spectrum, his experience ranges across industries from telematics to enterprise virtualization software.

Presentations

Implementing device-specific learning for IoT devices Session

Philip Kendall offers a look at the challenges involved in training and deploying a unique model to each of tens of thousands of Arduino-class IoT devices to minimize power use and maximize lifetime. The solution involves a high-level simulation of the system on the backend to perform the training and a custom virtual machine on the device to implement the learned model.

Scott Kidder has been building video encoding and delivery platforms for over 12 years (MobiTV, Brightcove/Zencoder, and now Mux). He’s currently a staff software engineer working on the Mux Data service, which provides real-time and historical analytics for internet video playback. Scott has built high-volume stream-processing applications for Mux Data and Mux Video (Mux’s full-service video encoding and distribution service) that have served some of the most widely watched video streams on the internet (World Cup, NFL Super Bowl). His interests include Kafka, Flink, Kubernetes, and Go.

Presentations

Stateful Stream Processing with Kafka and Go Session

Learn how the Mux Data service has leveraged Kafka and Go to build stateful stream-processing applications that operate on extremely high volumes of video-view beacons to drive real-time monitoring dashboards and historical metrics representing a viewer’s quality of experience. We’ll also cover fault tolerance, monitoring, and Kubernetes container deployments.

Kevin Kim is the head of the Data Group at Socar, the largest car sharing company in South Korea. Previously, he cofounded Between, an app for couples that was downloaded 20 million times and was a developer and data scientist until Socar acquired his company. He’s also an open source enthusiast and a committer and PMC member of Apache Zeppelin project.

Presentations

Redefining the car-sharing industry with data science Session

Socar has been seriously focused on data operations. Kevin Kim describes how Socar is redefining the car-sharing industry with data science with an experiment-based pricing strategy, machine learning–based demand prediction, optimized car management, accident risk profiling, and much more.

Matt Kirk runs Your Chief Scientist, a firm devoted to training small cohorts of highly motivated engineers to become data science practitioners. He draws on his experience writing Thoughtful Machine Learning with Python as well as on clients like Clickfunnels, Garver, SheerID, SupaDupa, and Madrona Ventures. To find out more, check out https://yourchiefscientist.com/.

Presentations

Introduction to Reinforcement Learning 1-Day Training

Join us as we dig into the theory, the practice, and the implementation of this highly promising field of machine learning.

Stavros Kontopoulos is a senior engineer on the data systems team at Lightbend. He implemented Lightbend’s fast data strategy. Previously, he built software solutions that scale in different verticals like telecoms and marketing. His interests include distributed system design, streaming technologies, and NoSQL databases.

Presentations

Online ML in streaming apps: Adapt to change with limited resources Session

Debasish Ghosh and Stavros Kontopoulos explore online machine learning algorithm choices for streaming applications, including resource-constrained use cases like IoT and personalization, complete with code samples. You'll learn about drift detection algorithms and Hoeffding Adaptive Trees, performance metrics for online models, and practical concerns with deployment in production.
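
To give a flavor of the drift detection ideas the session covers, here’s a deliberately simplified detector in plain Python. It uses a Hoeffding-style bound to compare two fixed windows; it’s a toy stand-in for algorithms like ADWIN, not the speakers’ implementation:

```python
import math
from collections import deque

class HoeffdingDriftDetector:
    """Flags drift when the means of two adjacent windows of a [0, 1] stream
    differ by more than a Hoeffding-style bound. (A toy stand-in for ADWIN;
    a real adaptive-window algorithm sizes the windows automatically.)"""

    def __init__(self, window=50, delta=0.002):
        self.old = deque(maxlen=window)   # reference window
        self.new = deque(maxlen=window)   # most recent values
        self.delta = delta                # tolerated false-alarm probability

    def update(self, x):
        if len(self.old) < self.old.maxlen:
            self.old.append(x)            # still filling the reference window
            return False
        self.new.append(x)
        if len(self.new) < self.new.maxlen:
            return False
        n = len(self.new)
        eps = math.sqrt(math.log(1 / self.delta) / (2 * n))  # Hoeffding bound
        drift = abs(sum(self.old) / len(self.old) - sum(self.new) / n) > eps
        if drift:                         # on drift, restart from recent data
            self.old = deque(self.new, maxlen=self.old.maxlen)
            self.new.clear()
        return drift

detector = HoeffdingDriftDetector()
stream = [0.1] * 120 + [0.9] * 120        # abrupt concept change halfway through
drift_points = [i for i, x in enumerate(stream) if detector.update(x)]
print(drift_points)
```

The memory footprint is two fixed-size windows, which is why this family of techniques suits the resource-constrained IoT and personalization use cases the session discusses.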

Gabor Kotalik is a big data project lead at Deutsche Telekom, where he’s responsible for continuous improvement of customer analytics and machine learning solutions for the commercial roaming business. He has more than 10 years of experience in business intelligence and advanced analytics focusing on using insights and enabling data-driven business decisions.

Presentations

Machine learning processes at Deutsche Telekom Global Carrier Session

Deutsche Telekom is the fourth-biggest telecommunications company in the world, and every day millions of its customers use its mobile services while roaming. Gabor Kotalik and Václav Surovec explain how the company designed and built its machine learning processes on top of the Cloudera Hadoop cluster to support its commercial roaming business.

Melanie wants a world where robots do all the boring, repetitive stuff for her so she can spend her time doing not boring, repetitive stuff. She’s a senior lead technologist at Booz Allen Hamilton. As a mathematician-turned-programmer with more than 10 years of experience, she has worked at universities, Booz Allen, PwC, and Capgemini analyzing data, producing cool demonstrations of artificial intelligence, discovering (inventing?) mathematics, project-managing large technical implementations, and trying to keep people from freaking out. After work, she spends most evenings working on homework and studying as she’s trying to complete an MS in computer science from Georgia Tech. When she’s not coding, which is almost all the time, you can find her figuring out how to cook vegan keto recipes, taking care of her paraplegic and geriatric cats and dogs, and trying to raise a decent human being to take over the world.

Presentations

If I only had a brain: Putting the intelligence in intelligent automation Session

Traditional automation is typically limited to clear-cut business rules that can be easily programmed. Melanie Laffin expands what automation can do by adding eyes (computer vision), a brain (general AI models), and speech (natural language processing) to automations to enhance their ability.

Jonathan Leslie is head of data science at Pivigo, where he works with business partners to develop data science solutions that make the most of their data, including in-depth analysis of existing data and predictive analytics for future business needs. He also programs, mentors, and manages teams of data scientists on projects in a wide variety of business domains.

Presentations

Bringing innovation to online retail: automating customer service using NLP Session

MADE.com is a furniture and homewares retailer with a unique online-only business model. Given this format, it’s crucial that customer service agents are able to respond to queries quickly and accurately. However, it can often be difficult to match the demand of incoming requests. We’ll describe a project aimed at developing a framework for automated responses to customer queries.

Marko Letic is a front-end engineer, lecturer, and data visualization scientist. He currently leads the front-end team at AVA, a Berlin-based company, where he works on a platform that combines big data, pattern recognition, and artificial intelligence to take the safety of individuals, organizations, cities, and countries to a whole new level. His main role is to create a contextual analysis of the processed data through a web-based client application. Marko is also a Tech Speaker at Mozilla, promoting the values of the open web, and one of the organizers of Armada JS, the first JS conference in Serbia. He holds an MSc in computer science and is pursuing his PhD in data visualization. He sometimes writes fiction novels that will probably never get published, as he spends too much time coding.

Presentations

Saving the world with JavaScript: A Data Visualization story Session

Did you know that the beginnings of data visualization are strongly tied to solving some of the biggest problems humanity has ever faced? Wouldn’t it be more interesting to say that you’re not a doctor, but you do save lives than to say you’re just a developer? If you want to know more, join me on this trip through time and beyond.

Tianhui Michael Li is the founder and president of the Data Incubator, a data science training and placement firm. Michael bootstrapped the company and navigated it to a successful sale to the Pragmatic Institute. Previously, he headed monetization data science at Foursquare and has worked at Google, Andreessen Horowitz, J.P. Morgan, and D.E. Shaw. He’s a regular contributor to the Wall Street Journal, TechCrunch, Wired, Fast Company, Harvard Business Review, MIT Sloan Management Review, Entrepreneur, VentureBeat, TechTarget, and O’Reilly. Michael was a postdoc at Cornell Tech, earned a PhD at Princeton, and was a Marshall Scholar at Cambridge.

Presentations

Getting Started with AI and Data Science Session

Drawing on experiences gleaned from hundreds of clients, Michael Li provides successful case studies from companies in a variety of industries that have successfully incorporated data science into their products and services. He presents the Pragmatic Data Framework, which successful clients have embraced to jumpstart their data science efforts and prioritize high-impact data science projects.

Simon Lidberg is a solution architect in Microsoft’s Data Insights Center of Excellence. He’s worked with database and data warehousing solutions for almost 20 years in a variety of industries, with a more recent focus on analysis, BI, and big data. Simon is the author of Getting Started with SQL Server 2012 Cube Development.

Presentations

(Partially) demystifying DevOps for AI Session

DevOps, DevSecOps, AIOps, ML Ops, Data Ops, No Ops....Ditch your confusion and join Simon Lidberg and Benjamin Wright-Jones to understand what DevOps means for AI and your organization.

Alexandre Lomadze is a senior data scientist at TBC Bank. He has broad experience using and developing machine learning algorithms for business projects in areas such as telecom, HR, and banking. He also teaches machine learning at the Free University of Tbilisi. Alexandre has a technical background in math and computer science but gets most excited about approaching data problems from a business perspective and driving toward optimal decisions. He holds a bachelor’s degree from the Moscow Institute of Physics and Technology and a master’s degree in computer science from Tbilisi State University, and he has twice medaled at the International Mathematical Olympiad.

Presentations

How a Failed Machine Learning Exercise Increased Deposit Profitability by 20% Session

We will tell you how our failed attempt to build an ML model brought us to discovering institutional problems and kicked off improvement of existing business processes so that we would collect quality data for future modeling; and how we still managed to increase deposit profitability by 20% in the process.

Markus Ludwig is a senior data scientist at Scout24, where he builds and deploys machine learning systems that power search and discovery. Previously, Markus worked as an academic researcher, lecturer, and consultant. He earned a PhD in computational finance from the University of Zurich, Switzerland.

Presentations

Transformers in the wild Session

Markus Ludwig shares insights from training and deploying a Transformer model that translates natural language to structured search queries. You'll cover the entire journey from idea to product, from teaching the model new tricks to helping it forget bad habits, while iteratively refining the user experience.

Mike Lutz is an infrastructure lead at Samtec. Traditionally living in the data communications world, he stumbled into data (and big data) as a way to manage the floods of information that were being generated in his many telemetry and Internet of Things adventures.

Presentations

Big data for the small fry: Bootstrapping from onsite to the cloud Session

Netflix proposed a novel best practice in using Jupyter notebooks as glue for working in the big data and AI-processing domain. You can follow a manufacturing company's adventure as it tries to implement Netflix's ideas on a dramatically smaller scale. Mike Lutz explains how Netflix's idea can be useful even for the small fry.

Miguel Martínez is a senior deep learning data scientist at NVIDIA, where he concentrates on RAPIDS. Previously, he mentored students at Udacity’s artificial intelligence nanodegree. He has a strong background in financial services, mainly focused on payments and channels. As a constant and steadfast learner, he’s always up for new challenges.

Presentations

Accelerating machine learning and graph analytics by several orders of magnitude with GPUs Session

GPU acceleration has been at the heart of scientific computing and artificial intelligence for many years now. Since the launch of RAPIDS last year, this vast computational resource has become available for data science workloads too. Miguel Martínez details the RAPIDS framework, a GPU-accelerated drop-in replacement for utilities such as pandas, scikit-learn, NetworkX, and XGBoost.

Hamlet Jesse Medina Ruiz is a senior data scientist at Criteo. Previously, he was a control system engineer for Petróleos de Venezuela. Hamlet has finished at the top of multiple data science competitions, including fourth place in predicting return volatility on the New York Stock Exchange, hosted by Collège de France and CFM in 2018, and 25th place in predicting stock returns, hosted by G-Research in 2018. Hamlet holds two master’s degrees, in mathematics and machine learning, from Pierre and Marie Curie University and a PhD in applied mathematics from Paris-Sud University in France, where he focused on statistical signal processing and machine learning.

Presentations

Predicting Criteo’s internet traffic load using Bayesian structural time series models Session

Criteo's infrastructure provides capacity and connectivity to host its platform and applications; the evolution of its infrastructure is driven by the ability to forecast traffic demand. Hamlet Jesse Medina Ruiz explains how Criteo uses Bayesian dynamic time series models to accurately forecast its traffic load and optimize hardware resources across data centers.
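
As a taste of the underlying machinery (a minimal sketch with hypothetical variances, not Criteo’s model), the simplest structural time series building block is a local-level model, which can be filtered recursively to produce one-step-ahead forecasts with uncertainty:

```python
# Kalman filter for a local-level model: y_t = mu_t + noise, mu_t = mu_{t-1} + drift.
# Variances here are illustrative; a full BSTS model adds trend, seasonality,
# regression components, and priors over these parameters.
def local_level_filter(ys, obs_var=1.0, level_var=0.1):
    mu, var = ys[0], 1.0                      # initial level estimate and its variance
    forecasts = []
    for y in ys[1:]:
        var += level_var                      # predict: the level may have drifted
        forecasts.append((mu, var + obs_var)) # one-step-ahead mean and variance
        k = var / (var + obs_var)             # update: Kalman gain weighs the new point
        mu += k * (y - mu)
        var *= (1 - k)
    return forecasts

series = [10.0, 10.2, 9.9, 10.1, 14.8, 15.2, 15.0]  # a level shift halfway through
for mean, variance in local_level_filter(series):
    print(f"forecast {mean:5.2f} ± {1.96 * variance ** 0.5:.2f}")
```

Because each forecast carries a variance, capacity planners can provision hardware against an upper quantile of the predicted load rather than a point estimate.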

Siyao Meng is a software engineer at Cloudera, where he is an Apache Hadoop (HDFS) and Apache Ozone contributor.

Presentations

Distributed tracing in Apache Hadoop Session

Distributed tracing is a well-known technique for identifying where failures occur and the reason behind poor performance, especially for complex systems like Hadoop, which involves many different components. Siyao Meng and Wei-Chiu Chuang demo the work on integrating OpenTracing in the Hadoop ecosystem and outline Cloudera's future integration plan.

Laurence Moroney is a developer advocate on the Google Brain team at Google, working on TensorFlow and machine learning. He’s the author of dozens of programming books, including several best sellers, and a regular speaker on the Google circuit. When not Googling, he’s also a published novelist, comic book writer, and screenwriter.

Presentations

Zero to hero with TensorFlow 2.0 Session

Laurence Moroney explores how to go from wondering what machine learning (ML) is to building a convolutional neural network to recognize and categorize images. With this, you'll gain the foundation to understand how to use ML and AI in apps all the way from the enterprise cloud down to tiny microcontrollers using the same code.

Francesco Mucio is a data consultant. The first time Francesco met the word data, it was just the plural of datum; now he’s building a small consulting firm to help organizations avoid or solve some of the problems he’s seen in the past. He likes to draw data models and optimize queries. He spends his free time with his daughter, who, for some reason, speaks four languages.

Presentations

Data engineering: The worst practices Session

Sit down and play data engineering worst practices bingo. From cloud infrastructure to stream processing, from data lakes to analytics, you'll see what can go wrong and the reasons behind these decisions. Francesco Mucio has been collecting stories for almost 20 years, and it's finally time to give back. If you recognize your organization in some of them, well, Francesco told you to sit down.

Jacques Nadeau is the cofounder and CTO of Dremio. Previously, he ran MapR’s distributed systems team; was CTO and cofounder of YapMap, an enterprise search startup; and held engineering leadership roles at Quigo, Offermatica, and aQuantive. Jacques is cocreator and PMC chair of Apache Arrow, a PMC member of Apache Calcite, a mentor for Apache Heron, and the founding PMC chair of the open source Apache Drill project.

Presentations

Real-world cloud data lakes: Examples and guide Session

Join in for a review of how to build a successful cloud data lake. Jacques Nadeau leads a deep dive into key topics such as landing, ETL, security, cost and performance trade-offs, and access patterns, as well as technologies such as Apache Arrow, Iceberg, and Spark in the context of real-world customer deployments.

Shuna is a machine learning specialist with a background in engineering, which means she works with solutions that fit the context.

The focus of Shuna’s work lies at the intersection of data analysis, optimisation, and machine learning, with the aim of achieving higher efficiency.

Shuna is an expert in finding creative, targeted solutions but also in applying existing ones.

At the moment, she is working especially with water and wastewater treatment plants.

Presentations

Radar-based flow prediction in water networks for better real-time decision making Session

We combine traditional predictive models with deep learning methods to improve the operation of wastewater treatment plants. This data-driven approach relies on weather radar data that replaces local and often sparsely located rain gauge sensor stations. Our approach allows for fast and probabilistic forecasts that robustly improve real-time operation of the urban drainage system.

Joseph Nelson is a cofounder and machine learning engineer at Roboflow, a tool for accelerating computer vision model development. Previously, he cofounded Represently, “the Zendesk for Congress,” reducing the time the US House of Representatives takes to respond to constituent messages with natural language processing (NLP). He’s taught over 2,000 hours of data science instruction in Python with General Assembly and the George Washington University. Joseph is dedicated to making machine learning more accessible. He’s from Iowa.

Presentations

Using TensorFlow Lite for Computer Vision Interactive session

In this session, Joseph walks you through the end-to-end flow required to train a model for mobile deployment, including image collection, preprocessing and augmentation considerations, model training, and saving the TFLite model in an appropriate format for deployment. Participants should have an awareness of machine learning, familiarity with Python, and a laptop.

Elias Nema is a senior data engineer at OLX. He specializes in big data, machine learning, and analytical platforms. Elias has had the opportunity to work within different companies, teams, and industries. He enjoys applying data management to solve real business problems, building an analytical data culture, and overall using data to make better decisions.

Presentations

Scaling a real-time recommender system for 350M users in a dynamic marketplace Session

OLX includes 20+ brands, more than 350M monthly active users, and millions of new items added to its platforms daily, so recommender systems naturally play a crucial part. Elias Nema highlights the data flows and core components used for building, serving, and continuously iterating on recommenders in such a dynamic marketplace.

Thomas Nield is the founder of Nield Consulting Group and a professional author, conference speaker, and trainer at O’Reilly. He wrote two books, Getting Started with SQL (O’Reilly) and Learning RxJava (Packt). He regularly teaches classes on analytics, machine learning, and mathematical optimization and has written several popular articles, including “How It Feels to Learn Data Science in 2019” and “Is Deep Learning Already Hitting Its Limitations?” Valuing problem solving over problem finding, Thomas believes in using practical solutions, which are often unique to each industry.

Presentations

Large-scale machine learning with Spark and scikit-learn 2-Day Training

There's been an explosion of tools for machine learning, but two have emerged as practical go-to solutions: scikit-learn and Apache Spark. Using Python, Thomas Nield leads a deep dive into examples in parallel (no pun intended) for both of these tools, showing how to tackle machine learning at small, medium, and large scales.

Large-scale machine learning with Spark and scikit-learn (Day 2) Training Day 2

There's been an explosion of tools for machine learning, but two have emerged as practical go-to solutions: scikit-learn and Apache Spark. Using Python, Thomas Nield leads a deep dive into examples in parallel (no pun intended) for both of these tools, showing how to tackle machine learning at small, medium, and large scales.

Machine Learning from Scratch Interactive session

Linear regression, logistic regression, and Naïve Bayes are workhorse machine learning algorithms that achieve practical results with little overhead. As a matter of fact, building these algorithms from scratch (without libraries) is more accessible than you may think!

Aileen Nielsen works at an early-stage NYC startup that has something to do with time series data and neural networks. She's the author of Practical Time Series Analysis, published in 2019, and an upcoming book, Practical Fairness, to be published in summer 2020. Previously, Aileen worked at corporate law firms, physics research labs, a variety of NYC tech startups, and most recently, the mobile health platform One Drop, as well as on Hillary Clinton’s presidential campaign. Aileen currently serves as the chair of the NYC Bar’s Science and Law Committee and as a fellow in law and tech at ETH Zurich. She's a frequent speaker at machine learning conferences on both technical and legal subjects.

Presentations

Is deep learning the future of prediction? Interactive session

This talk poses the question of whether deep learning will ever come to dominate time series forecasting as it has come to dominate approaches to language and imagery. We'll both ask the question and provide a partial answer.

Sami Niemi has been working on Bayesian inference and machine learning for over 10 years and has published peer-reviewed papers in astrophysics and statistics. He has delivered machine learning models for industries including telecommunications and financial services. Sami has built supervised learning models to predict customer and company defaults, first- and third-party fraud, and customer complaints, and he's used natural language processing for probabilistic parsing and matching. He's also used unsupervised learning in a risk-based anti-money laundering application. Currently, Sami works at Barclays, where he leads a team of data scientists building fraud detection models and manages the UK fraud models.

Presentations

Implementing Machine Learning Models for Real-Time Transaction Fraud Detection Session

Predicting transaction payment fraud in real time is an important challenge, one that state-of-the-art supervised machine learning models can help to solve. In the last two years, Barclays has developed and tested different models and implementation solutions. You'll learn how state-of-the-art machine learning models can be implemented while meeting strict real-time latency requirements.

Kim Nilsson is the CEO of Pivigo, a London-based data science marketplace and training provider responsible for S2DS, Europe’s largest data science training program, which has by now trained more than 650 fellows working on over 200 commercial projects with 120+ partner companies, including Barclays, KPMG, Royal Mail, News UK, and Marks & Spencer. An ex-astronomer turned entrepreneur with a PhD in astrophysics and an MBA, Kim is passionate about people, data, and connecting the two.

Presentations

A recipe for innovation: recommending recipes based on adventurousness Session

Mindful Chef is a health-focused company that delivers weekly recipe boxes. To create a more personalised experience for its customers, it teamed up with Pivigo to develop an innovative recommender system. Kim Nilsson discusses the project and the development of a novel approach to understanding user taste that had an unexpectedly large impact on recommendation accuracy.

Alon Nir is a senior data scientist and data science lead at Deliveroo Plus. Alon earned an MA in economics.

Presentations

Getting an edge with network analysis with Python Session

Alon Nir offers you a glimpse into what a powerful and impactful tool network analysis is. With a plethora of real-world examples and friendly Python syntax, you'll be equipped—and hopefully inspired—to start your journey with network analysis.

Tristan O’Gorman is a product architect with IBM Watson IoT, specializing in asset management solutions with a focus on applications of artificial intelligence. Previously, he worked in a variety of software product development roles. Tristan has advanced degrees, including applied data science, from National University of Ireland, Galway; University of Limerick; and Technical University Dublin. In his spare time, he’s busy with his two boys and enjoys tennis and photography.

Presentations

Data-driven predictive maintenance: Has the promise of the IIoT been realized? Session

The advance of the industrial internet of things (IIoT) promised much, particularly in the area of predictive maintenance. Tristan O'Gorman digs into whether or not those promises have been realized. You'll learn about the particular technical and strategic challenges that organizations seeking to adopt predictive maintenance have to overcome.

Florian Ostmann is the policy theme lead within the Public Policy Programme at the Alan Turing Institute, the UK’s national institute for data science and artificial intelligence. His research interests are centered on applications of data science and AI in the public sector, ethical and regulatory questions in relation to emerging technologies across different sectors of the economy, and the future of work and social welfare systems. He acts as a principal investigator for projects on the use of AI in financial services and criminal justice. Florian is a member of the Law Committee for the IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems. Previously, he was a research associate at the Shorenstein Center on Media, Politics and Public Policy, where he conducted research on questions of fairness and transparency in the context of algorithmic decision making; worked at the Pan American Health Organization; and served as a consultant on responsible investing and human rights due diligence in the private sector (with a focus on modern slavery risks), autonomous vehicle policy, and social impact measurement. Florian holds a master’s in public policy from the Harvard Kennedy School and a PhD in political philosophy from UCL.

Presentations

AI regulation and ethics in fintech: Insights from the US and UK Session

Anna Gressel, Jim Pastore, and Florian Ostmann lead a crash course on the emerging ethical and regulatory issues surrounding fintech AI. You'll hear insights from statements by US and UK regulators in banking and financial services and examine their priorities in 2020. You'll get practical guidance on how you can mitigate ethical and legal risks and position your AI products for success.

Lukumon Oyedele is the founding director of the Big Data, Enterprise and Artificial Intelligence Lab (Big-DEAL) at the University of the West of England (UWE) Bristol and he’s the assistant vice-chancellor, digital innovation and enterprise at UWE Bristol. His research focus is the transformation of the UK construction industry for improved productivity and performance using emerging digital technologies, including AI, big data, machine learning and deep learning, IoT, natural language processing, and augmented reality and virtual reality. His cross-disciplinary research has culminated in strategic partnerships with businesses to stimulate improved productivity and value delivery within the architectural, engineering, and construction (AEC) industries. Lukumon has a substantial research track record in managing and delivering large-scale, applied, collaborative, and multi-year research projects to the tune of £18 million. The impact of the projects evidences Lukumon’s knack for employing emerging innovative technologies to address diverse challenges confronting large businesses and SMEs within the AEC industries. He leads a cross-disciplinary team of world-class researchers, including computer scientists, data scientists, building information modeling (BIM) modelers, civil engineers, architects, planners, electrical engineers, sociologists, psychologists, and financial analysts, among others.

Presentations

Conversational AI and augmented reality for supporting frontline construction workers Session

The time spent by frontline construction workers can be reduced by 50% through a hands-free assembly support building information modeling (BIM) system. Lukumon Oyedele explains how conversational AI and augmented reality (AR) make it possible for onsite construction workers to seek support from BIM through verbal queries and augmented displays.

Maziyar Panahi is a data scientist at John Snow Labs and an active contributor to the Spark NLP open source project. He’s also a lead big data engineer and project manager at the Institut des Systèmes Complexes in Paris, overseeing a platform with over 110 billion documents on 120+ servers and 120+ TB of HDFS storage. Maziyar has 15 years of experience as a software engineer, system administrator, project manager, and research officer.

Presentations

Advanced natural language processing with Spark NLP 1-Day training

Alex Thomas and Maziyar Panahi detail the application of the latest advances in deep learning for common natural language processing (NLP) tasks such as named entity recognition, document classification, sentiment analysis, spell checking, and OCR. You'll learn to build complete text analysis pipelines using the highly performant, scalable, open source Spark NLP library in Python.

Jim Pastore is a litigation partner at Debevoise & Plimpton LLP and a member of the firm’s Cybersecurity and Data Privacy Practice and Intellectual Property Litigation Group. His practice focuses on privacy and cybersecurity issues. He’s recognized by Chambers USA and the Legal 500 US (2015–2019) for his cybersecurity work and was included in Benchmark Litigation’s Under 40 Hot List, which recognizes attorneys under 40 with outstanding career accomplishments. Named as a cybersecurity trailblazer by the National Law Journal, he’s twice been named to Cybersecurity Docket’s “Incident Response 30,” a list of the best and brightest data breach–response attorneys. Previously Jim served for five years as an Assistant United States Attorney in the Southern District of New York, where he spent most of his time as a prosecutor with the Complex Frauds Unit and Computer Hacking and Intellectual Property Section.

Presentations

AI impact assessments: Tools for evaluating and mitigating corporate AI risks Session

The Canadian Government made waves when it passed a law requiring AI impact assessments for automated decision systems. Similar proposals are pending in the US and EU. Anna Gressel, Meeri Haataja, and Jim Pastore unpack what an AI impact assessment looks like in practice and how companies can get started from a technical and legal perspective, and they provide tips on assessing AI risk.

AI regulation and ethics in fintech: Insights from the US and UK Session

Anna Gressel, Jim Pastore, and Florian Ostmann lead a crash course on the emerging ethical and regulatory issues surrounding fintech AI. You'll hear insights from statements by US and UK regulators in banking and financial services and examine their priorities in 2020. You'll get practical guidance on how you can mitigate ethical and legal risks and position your AI products for success.

Lomit Patel is the vice president of growth at IMVU, responsible for user acquisition, retention, and monetization. Previously, Lomit managed growth at early stage startups including Roku (IPO), TrustedID (acquired by Equifax), Texture (acquired by Apple), and EarthLink. Lomit is a public speaker, author, advisor, and recognized as a Mobile Hero by Liftoff.

Presentations

Lean AI: How innovative startups use artificial intelligence to grow Session

The future of customer acquisition rests on leveraging intelligent machines to orchestrate complex campaigns across key marketing platforms: dynamically allocating budgets, pruning creatives, surfacing insights, and taking actions powered by AI. Lomit Patel shows you how to use AI and machine learning (ML) as an operational layer to deliver meaningful results.

Pirabu Pathmasenan is a senior manager of enterprise business applications and Hadoop platform at TD Bank. His team is responsible for the day-to-day business operations, analysis, and support of TD’s critical business applications and Hadoop platforms to deliver better business outcomes. Previously, Pirabu spent 12 years at IBM in various roles within the IBM analytics organization: as a strategic account manager, responsible for managing both public and private cloud projects for customers in the US and Canada; as a technical consultant focusing on DB2 LUW and information integration and governance products; and as a technical leader in the DB2 QA organization, responsible for leading, developing, and inventing new tools and processes to improve the quality of the DB2 database product. Pirabu has published several papers on IP.com and has spoken at various IBM conferences across the globe.

Presentations

A deep dive into TD Bank's data-driven transformation Session

Melissa Singh and Pirabu Pathmasenan walk you through TD Bank's data-driven transformation. You'll learn how it started, where it is today, and where it's going with big data and AI. You'll uncover shifts in the company's cultural paradigm, along with the technical tools and practices used to transition traditional analytics teams into the world of big data and AI.

Andy Petrella is the CEO of Kensu, an analytics and AI governance company that created the Kensu Data Activity Manager (DAM), a first-of-its-kind governance, compliance, and performance (GCP) solution. He’s a mathematician turned distributed computing entrepreneur. Besides being a Scala and Spark trainer, Andy has participated in many projects built using Spark, Cassandra, and other distributed technologies in various fields including geospatial analysis, the IoT, and automotive and smart cities projects. Andy is the creator of the Spark Notebook, the only reactive and fully Scala notebook for Apache Spark. In 2015, Andy cofounded Data Fellas with Xavier Tordoir around their product the Agile Data Science Toolkit, which facilitates the productization of data science projects and guarantees their maintainability and sustainability over time. Andy is also member of the program committee for the O’Reilly Strata Data Conference, Scala eXchange, Data Science eXchange, and Devoxx events.

Presentations

Data quality and lineage monitoring: Why you need them in production Session

Recent papers from Google and the European Commission emphasized the need for solutions to monitor data quality and lineage. Andy Petrella highlights three advantages for monitoring in production: boosting efficiency of data processes, increasing confidence in models in real time, and ensuring accountability to fulfill policies.

Rupert Prescot is a senior product manager for the Elsevier data platform. He’s worked in digital product management roles for 10 years, the past 3 at RELX for the LexisNexis and Elsevier businesses. His work has focused on helping business customers find and understand relationships in content and data in the legal and scientific industries, leveraging a range of graph database technologies. His key focus is understanding how technology solutions deliver customer value and successfully delivering search, recommendation, and data platform products.

Presentations

Data cleaning at scale Session

The ultimate purpose of data is to drive decisions, but things in the real world commonly aren’t as reliable or accurate as we'd like them to be. The main reason data gets dirty and often unreliable is simple: human intervention. Rupert Prescot and Jonathan Warner are here to help you maintain the reliability of data that's constantly exposed to and updated by your users.

Phillip Radley is chief data architect on the core enterprise architecture team at BT, where he’s responsible for data architecture across the company. Based at BT’s Adastral Park campus in the UK, Phill leads BT’s MDM and big data initiatives, driving associated strategic architecture and investment road maps for the business. He’s worked in IT and communications for 30 years. Previously, Phill was chief architect for infrastructure performance-management solutions, from UK consumer broadband to outsourced Fortune 500 networks and high-performance trading networks. He has broad global experience, including with BT’s Concert global venture in the US and five years as an Asia Pacific BSS/OSS architect based in Sydney. Phill is a physics graduate with an MBA.

Presentations

Pivoting from BI on Hadoop to ML and streaming in the cloud at BT Session

Enterprise IT has been delivering BI on Hadoop for a few years, but frustrated business analysts and data scientists want self-service data and ML in the cloud, so they can go much faster. Phillip Radley explores the challenges when enterprise IT teams have to quickly pivot from caring for an elephant on-premises to farming herds of clusters, pipelines, and models in clouds.

Rajkumar Iyer is a senior staff engineer at Qubole (Bangalore), working on the challenges of running Spark as a service in the cloud. Previously, he worked on hyperscale real-time distributed key-value stores at Aerospike and on the shared-disk distributed database at Sybase. His interests include autoscaling, task scheduling, and transactions in big data systems.

Presentations

ACID for big data lakes on Apache Hive, Apache Spark, and Presto Session

Abhishek Somani, Shubham Tagra, and Rajkumar Iyer detail an open source framework for Apache Hive, Apache Spark, and Presto that provides cross-engine ACID transactions and enables performant and cost-effective updates and deletes on big data lakes on the cloud.

Manu Ram Pandit is a senior software engineer on the data analytics and infrastructure team at LinkedIn. He has extensive experience in building complex and scalable applications. During his tenure at LinkedIn, he’s influenced design and implementation of hosted notebooks, providing a seamless experience to end users. He works closely with customers, engineers, and product to understand and define the requirements and design of the system. Previously, he was with Paytm, Amadeus, and Samsung, where he built scalable applications for various domains.

Presentations

Darwin: Evolving hosted notebooks at LinkedIn Session

Come and learn the challenges LinkedIn overcame to make Darwin (Data Analytics and Relevance Workbench at LinkedIn) a reality. You'll learn how data scientists, developers, and analysts at LinkedIn can share their notebooks with their peers, author work in multiple languages, use custom execution environments, execute long-running jobs, and do much more on a single hosted notebooks platform.

Nathalie Rauschmayr is a machine learning scientist at AWS, where she helps customers develop deep learning applications. She has a research background in high-performance computing, having conducted research in several international research organizations including the German Aerospace Center, the European Organization for Nuclear Research (CERN), and Lawrence Livermore National Laboratory (LLNL).

Presentations

Using Amazon SageMaker to build, train, and deploy ML models 1-Day training

Build, train, and deploy a deep learning model on Amazon SageMaker with Nathalie Rauschmayr, Satadal Bhattacharjee, and Aparna Elangovan, and learn how to use some of the latest SageMaker features such as SageMaker Debugger and SageMaker Model Monitor.

Bhargavi Reddy is a senior data engineer on the platform and security data engineering team at Netflix, where she builds large-scale analytical data products to enable efficient use of Netflix’s cloud resources and strengthen its security and privacy. Bhargavi is passionate about advancing the cause of women in technology. She actively volunteers for various I&D and Women in Tech groups internally and externally to encourage and inspire women to pursue and excel in technology careers. Bhargavi earned her master’s in information systems management from Carnegie Mellon University. When she’s not working, you’re most likely to find Bhargavi at a Bollywood dance academy, exploring the world, or enjoying Indian food.

Presentations

Drive Netflix cloud efficiency and security with AWS S3 access logs Session

Bhargavi Reddy outlines the driving forces for effective data lifecycle management (DLM) at Netflix and the current state of Netflix’s S3 data warehouse, offers an overview of the S3 access logs collection process using SQS and Apache Iceberg, and details how the S3 logs are used for improving the efficiency and security posture of Netflix's cloud infrastructure at scale in the DLM realm.

Nikki Rouda is a principal product marketing manager at Amazon Web Services (AWS). Nikki has decades of experience leading enterprise big data, analytics, and data center infrastructure initiatives. Previously, he held senior positions at Cloudera, Enterprise Strategy Group (ESG), Riverbed, NetApp, Veritas, and UK-based Alertme.com (an early consumer IoT startup). Nikki holds an MBA from Cambridge’s Judge Business School and an ScB in geophysics from Brown University.

Presentations

Build a serverless data lake for analytics 1-Day training

Janisha Anand and Nikki Rouda teach you how to build a serverless data lake on AWS. You'll ingest Instacart's public dataset to the data lake and draw valuable insights on consumer grocery shopping trends. You’ll build data pipelines, leverage data lake storage infrastructure, configure security and governance policies, create a persistent catalog of data, perform ETL, and run an ad hoc analysis.

Building a secure, scalable, and transactional data lake on AWS 2-Day Training

Nikki Rouda walks you through the steps of building a data lake on Amazon S3 using different ingestion mechanisms, performing incremental data processing on the data lake to support transactions on S3, and securing the data lake with fine-grained access control policies.

Building a secure, scalable, and transactional data lake on AWS (Day 2) Training Day 2

Nikki Rouda walks you through the steps of building a data lake on Amazon S3 using different ingestion mechanisms, performing incremental data processing on the data lake to support transactions on S3, and securing the data lake with fine-grained access control policies.

Rachel Roumeliotis is a strategic content director at O’Reilly, where she leads an editorial team that covers a wide variety of programming topics ranging from full stack to open source in the enterprise to emerging programming languages. Rachel is a programming chair of OSCON and O’Reilly’s Software Architecture Conference. She has been working in technical publishing for 10 years, acquiring content in many areas including mobile programming, UX, computer security, and AI.

Presentations

Thursday keynotes Keynote

Strata Data & AI Conference program chairs Rachel Roumeliotis and Alistair Croll welcome you to the second day of keynotes.

Wednesday keynotes Keynote

Strata Data & AI Conference program chairs Rachel Roumeliotis and Alistair Croll welcome you to the first day of keynotes.

Nipun Sadvilkar is a senior data scientist at US healthcare company Episource, where he helps design and build the clinical natural language processing (NLP) engine to revamp medical coding workflows, enhance coder efficiency, and accelerate the revenue cycle. He has 3+ years of experience building NLP solutions and web-based data science architectures in the areas of healthcare, finance, media, and psychology. His interests lie at the intersection of machine learning and software engineering, with a fair understanding of the business domain. Nipun is a member of the PyCon India, PyDelhi, PyData Mumbai, and SciPy India communities and blogs regularly about Python and AI on his website.

Presentations

Clinical NLP: Building a named-entity recognition model for clinical text Session

Episource is building a clinical natural language processing (NLP) engine to extract information from medical charts and automate coding in claims submissions, using a medical coder's expertise to review highlighted entities and autosuggested ICD-10 codes. Nipun Sadvilkar details building a key component of Episource's clinical NLP engine, clinical NER, from data annotation to models and techniques.

Kumar Sambhav is a data and analytics architect at Barclays, where he leads the data and analytics platform design and strategy for people analytics and wider HR. He has around 15 years of industry experience working at the enterprise level across the data landscape, including data migration, data warehousing, BI, AI, ML, and data strategy. Previously, Kumar helped a wide variety of organizations with their data journeys, including Barclays, Bank of England, BAE Systems, Network Rail, Sky, Standard Life, Orange, and Honeywell.

Presentations

Architecting platform for people analytics Session

People analytics has become key to unlocking human resource insights to understand and measure policy effectiveness and implement improvements by embedding intelligent decision making. Kumar Sambhav draws on people analytics use cases from Barclays to discuss the pipeline it developed and the corresponding controls and governance model that was implemented.

Majken Sander is a data nerd, business analyst, and solution architect. Majken has worked with IT, management information, analytics, BI, and DW for 20+ years. Armed with strong analytical expertise, she’s keen on “data driven” as a business principle, data science, the IoT, and all other things data. Read more at majkensander.com.

Presentations

Let’s talk data literacy—and ethics Session

Join Majken Sander to learn about the importance of data literacy and ethics. Schools and society in general need to educate citizens to raise their digital awareness. Companies need to build their employees' data literacy competencies. And a company's digital economy strategy should include data ethics, perhaps even embracing it as a competitive edge gained via branding value.

Flávio Roberto Santos is a data infrastructure engineer at Spotify in Stockholm, Sweden. He works on the event delivery team, whose main responsibility is to build and maintain an internal data platform used to collect and store events from Spotify clients and backend services. Before joining Spotify, Flávio managed Hadoop, Cassandra, Kafka, and Elasticsearch clusters in Brazil.

Presentations

A journey through Spotify's event delivery system Session

Data has been a first-class citizen at Spotify since the beginning. It's an important component of the ecosystem that allows data scientists and analysts to improve features and develop new products. Events collected from instrumented clients and backends go through a complex system before they're available to internal teams. This talk goes deep into how event delivery is built at Spotify.

Richard Sargeant is the chief commercial officer at Faculty. Richard supports senior leaders across a variety of sectors to transform their businesses to use AI effectively. Previously, he was director of transformation at the Home Office, where he oversaw the creation of the second most advanced in-house machine learning capability in government; he was one of the founding directors of the UK’s Government Digital Service; and he was at Google. He has also worked at the Prime Minister’s Strategy Unit and HM Treasury. He's a nonexecutive board member of Exeter University and of the government's Centre for Data Ethics and Innovation. He has a degree in political philosophy, economics, and social psychology from Cambridge University.

Presentations

AI safety: How do we bridge the gap between technology and the law? Session

Firms and governments have become more aware of the risk of "black-box" algorithms that "work" but in an opaque way. Existing laws and regulations merely stipulate what ought to be the case, not how to achieve it technically. Richard Sargeant is joined by leading figures from law, technology, and business to interrogate this subject.

Frank Säuberlich is the chief data officer at EnBW.

Presentations

AI at scale driving the German Energiewende Session

Almost two years ago EnBW developed its core beliefs for the role of AI at EnBW and derived concrete actions that need to be taken to scale its AI activities. Rainer Hoffmann and Frank Säuberlich describe the actions and the challenges EnBW has faced on its journey so far and its approach to mastering these challenges.

Alejandro Saucedo is chairman at the Institute for Ethical AI & Machine Learning. In his more than 10 years of software development experience, Alejandro has held technical leadership positions across hypergrowth scale-ups and tech giants including Eigen Technologies, Bloomberg LP, and Hack Partners. Alejandro has a strong track record of building multiple departments of machine learning engineers from scratch and leading the delivery of numerous large-scale machine learning systems across the financial, insurance, legal, transport, manufacturing, and construction sectors in Europe, the US, and Latin America.

Presentations

A practical ML Ops framework for machine learning at massive scale Session

Managing production machine learning systems at scale has uncovered new challenges that require approaches fundamentally different from those of traditional software engineering or data science. Alejandro Saucedo explores ML Ops, a concept that often encompasses the methodologies to continuously integrate, deploy, and monitor machine learning in production at massive scale.

Conor Sayles is a group advanced analytics lead at Bank of Ireland; he reports to the chief data officer, leads an analytics team with an annual €15M data value realization target, and coordinates collaboration with analytics teams embedded in business functions across the company. Conor has 18 years’ experience in the retail banking industry, including risk model development, regulatory reporting, data visualization, and automated decisioning.

Presentations

Building an ecosystem for analytics success at Bank of Ireland Session

Conor Sayles details how Bank of Ireland led a data value realization strategy, yielding a return of over €70M and incorporating infrastructure investment, agile management, and design thinking. An analytic system including Tableau, Teradata, SAS, and Cloudera provides a cornerstone for decision making across multiple functions. Underlying the success is a growing data community.

Fredrik Schlyter is a machine learning engineer at Violet Ventures.

Presentations

Finding payment information on invoices using machine learning Session

Finvoice started out as a small project consisting of one machine learning engineer and 50 invoices; today it's used by companies that scan over 80 million invoices per year. Fredrik Schlyter describes how machine learning can capture payment information on invoices and how it expanded from a cloud-based API solution to doing the inference directly on customers' mobile phones.

Tuhin Sharma is a cofounder and CTO of Binaize, an AI-based firm. Previously, he was a data scientist at IBM Watson and Red Hat, where he mainly worked on social media analytics, demand forecasting, retail analytics, and customer analytics, and he worked at multiple startups, where he built personalized recommendation systems to maximize customer engagement with the help of ML and DL techniques across multiple domains like fintech, ed tech, media, and ecommerce. He’s filed five patents and published four research papers in the field of natural language processing and machine learning. He holds a postgraduate degree in computer science and engineering, specializing in data mining, from the Indian Institute of Technology Roorkee. He loves to play table tennis and guitar in his leisure time. His favorite quote is, “Life is beautiful.”

Presentations

Writer-independent offline signature verification in banks using few-shot learning Session

Offline signature verification is one of the most critical tasks in traditional banking and financial industries. The unique challenge is to detect subtle but crucial differences between genuine and forged signatures. This verification task is even more challenging in writer-independent scenarios. Tuhin Sharma and Pravin Jha detail few-shot image classification.
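Few-shot classifiers often reduce to comparing a query embedding against class "prototypes" averaged from a handful of support examples. A minimal, framework-free sketch of that idea (the embedding model itself, which a real system would provide, is assumed):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def prototype(embeddings):
    """Average a handful of support embeddings into a class prototype."""
    n = len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(len(embeddings[0]))]

def classify(query, protos):
    """Assign the query embedding to the nearest prototype by cosine similarity."""
    return max(protos, key=lambda label: cosine(query, protos[label]))
```

With only a few genuine signatures per writer, the prototype approach avoids retraining a classifier per customer — which is what makes it attractive in the writer-independent setting.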

Thunder Shiviah is a senior solutions architect at Databricks. Previously, Thunder was a machine learning engineer at McKinsey & Company focused on productionizing machine learning at scale.

Presentations

Managing the full deployment lifecycle of machine learning models with MLflow Session

Thunder Shiviah and Cyrielle Simeone dive into MLflow, an open source platform from Databricks for managing the complete ML lifecycle, including experiment tracking, model management, and deployment. With over 140 contributors and 800,000 monthly downloads on PyPI, MLflow has gained tremendous community adoption, demonstrating the need for an open source platform for the ML lifecycle.

Cyrielle Simeone is the product marketing manager for data science and machine learning at Databricks.

Presentations

Managing the full deployment lifecycle of machine learning models with MLflow Session

Thunder Shiviah and Cyrielle Simeone dive into MLflow, an open source platform from Databricks for managing the complete ML lifecycle, including experiment tracking, model management, and deployment. With over 140 contributors and 800,000 monthly downloads on PyPI, MLflow has gained tremendous community adoption, demonstrating the need for an open source platform for the ML lifecycle.

Julien Simon is a technical evangelist at AWS. Previously, Julien spent 10 years as a CTO and vice president of engineering at a number of top-tier web startups. He’s particularly interested in all things architecture, deployment, performance, scalability, and data. Julien frequently speaks at conferences and technical workshops, where he helps developers and enterprises bring their ideas to life thanks to the Amazon Web Services infrastructure.

Presentations

A pragmatic introduction to graph neural networks Session

Julien Simon offers an overview of graph neural networks (GNNs), one of the most exciting developments in machine learning today. You'll explore real-life use cases for which GNNs are a great fit and get started with GNNs using the Deep Graph Library, an open source library built on top of Apache MXNet and PyTorch.

Melissa Singh is a senior manager of an enterprise Hadoop enablement team at TD Bank, where she leads a Center of Excellence team for Hadoop enablement with the primary objective to help transform analytics teams and enable them to make the move to big data and AI. Previously, Melissa started in Java development within the ecommerce space at IBM, where she was introduced to machine learning and big data; she pursued this interest area at TD Bank, where she led a big data engineering and data science team to help build out big data environments and data science tooling. She earned a bachelor of engineering degree and has over eight years of software engineering experience. Her areas of interest include reading, competitive dance, and working in the community to help promote women in STEM.

Presentations

A deep dive into TD Bank's data-driven transformation Session

Melissa Singh and Pirabu Pathmasenan walk you through TD Bank's data-driven transformation. You'll learn how it started, where it is today, and where it's going with big data and AI. You'll uncover shifts in the company's cultural paradigm, along with the technical tools and practices used to transition traditional analytics teams into the world of big data and AI.

Pramod Singh is a machine learning expert at Walmart Labs. He has extensive hands-on experience in machine learning, deep learning, AI, data engineering, algorithm design, and application development, and has spent more than 10 years working on data projects at different organizations. He’s the author of three books: Machine Learning with PySpark, Learn PySpark, and Learn TensorFlow 2.0. He’s also a regular speaker at major conferences such as O’Reilly’s Strata and AI conferences. Pramod holds a BTech in electrical engineering from B.A.T.U and an MBA from Symbiosis University, and has completed a data science certification from IIM Calcutta. He lives in Bangalore with his wife and three-year-old son. In his spare time, he enjoys playing guitar, coding, reading, and watching football.

Presentations

Attention Networks all the way to production using Kubeflow 1-Day training

With the latest developments and improvements in the field of deep learning and artificial intelligence, many demanding natural language processing tasks have become easier to implement and execute. Text summarization is one task that can be done using attention networks.
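The core computation behind attention networks can be sketched in a few lines. This is a toy, framework-free version of scaled dot-product attention for a single query — not the Kubeflow training setup covered in the course:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector:
    weight each value by how well its key matches the query."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
```

In summarization models, this weighting is what lets the decoder focus on the most relevant source tokens at each step.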

Axel Sirota holds a master’s degree in mathematics and has a deep interest in deep learning and the development lifecycle. After researching probability, statistics, and machine learning optimization, he’s currently a lead machine learning engineer at ASAPP, leveraging customer experience conversations to make accurate predictions with neural networks. Axel also has more than five years of experience across the development lifecycle and the whole CI/CD pipeline, from setting up and administering Jenkins all the way to production.

Presentations

Getting Started With Tensorflow Lite Interactive session

In this hands-on session, you'll learn about TensorFlow Lite and how to leverage it to create a machine learning application that can run on your cell phone.

Karol Sobczak is a cofounder of and software engineer on the Starburst team. He contributes to the Presto code base and is also active in the community. Karol has been involved in the design and development of significant features in Presto like the Kubernetes integration, cost-based optimizer, correlated subqueries, distributed ordering, and a plethora of smaller planner and performance enhancements. Previously, he worked at Teradata Labs, Hadapt, and IBM Research. He graduated from Warsaw University and the Vrije University of Amsterdam.

Presentations

Presto on Kubernetes: Query anything, anywhere Session

Wojciech Biela and Karol Sobczak explore Presto, an open source SQL engine offering high-concurrency, low-latency queries across multiple data sources within one query. With Kubernetes, you can easily deploy and manage Presto clusters across hybrid and multicloud environments with built-in high availability, autoscaling, and monitoring.

Abhishek Somani is a senior staff engineer on the Hive team at Qubole. Previously, Abhishek worked at Citrix and Cisco. He holds a degree from NIT Allahabad.

Presentations

ACID for big data lakes on Apache Hive, Apache Spark, and Presto Session

Abhishek Somani, Shubham Tagra, and V Rajkuma detail an open source framework for Apache Hive, Apache Spark, and Presto that provides cross-engine ACID transactions and enables performant and cost-effective updates and deletes on big data lakes on the cloud.

Ankit Srivastava is a senior data scientist on the core data science team for the Azure Cloud + AI Platform Division at Microsoft, where he focuses on commercial and education segment data science projects within the company. Previously, he was a developer on the data integration and insights team. He has built several production-scale ML enrichments that are leveraged for sales compensation and senior leadership team metrics.

Presentations

Infinite segmentation: Scalable mutual information ranking on real-world graphs Session

Today, normal growth isn't enough—you need hockey-stick levels of growth. Sales and marketing orgs are looking to AI to "growth hack" their way to new markets and segments. Ken Johnston and Ankit Srivastava explain how to use mutual information at scale across massive data sources to help filter out noise and share critical insights with new cohorts of users, businesses, and networks.
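Mutual information is the statistic at the heart of this kind of ranking: it measures how much knowing one attribute tells you about another. A small, self-contained estimator from raw observations (illustrative only, not the speakers' production code):

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Mutual information (in bits) between two discrete variables,
    estimated from a list of (x, y) observations."""
    n = len(pairs)
    pxy = Counter(pairs)                 # joint counts
    px = Counter(x for x, _ in pairs)    # marginal counts of x
    py = Counter(y for _, y in pairs)    # marginal counts of y
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), in count form
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi
```

Ranking candidate attributes by mutual information against a target cohort is what filters signal from noise at scale; the production challenge is computing these counts over massive data sources.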

Gabriel Straub is the head of data science and architecture at the BBC, where his role is to help make the organization more data informed and make it easier for product teams to build data- and machine learning–powered products. He guest lectures on responsible machine learning at London Business School and is an honorary senior research associate at UCL. He also advises startups and VCs on data and machine learning strategies. Previously, he was the data director at Notonthehighstreet.com and head of data science at Tesco. His teams have worked on a diverse range of problems from search engines, recommendation engines, and pricing optimization to vehicle routing problems and store space optimization. Gabriel earned an MA in mathematics from Cambridge and an MBA from London Business School.

Presentations

AI safety: How do we bridge the gap between technology and the law? Session

Firms and government have become more aware of the risk of "black-box" algorithms that "work," but in an opaque way. Existing laws and regulations merely stipulate what ought to be the case, not how to achieve it technically. Richard Sargeant is joined by leading figures from law, technology, and business to interrogate this subject.

Build a recommendations framework for an ever-changing landscape Session

Gabriel Straub walks you through the BBC's experience building a framework for public service recommendations, deploying in multiple clouds, following its machine learning principles, and reflecting editorial values to inform, educate, and entertain.

Bargava Subramanian is a cofounder and deep learning engineer at Binaize in Bangalore, India. He has 15 years’ experience delivering business analytics and machine learning solutions to B2B companies. He mentors organizations in their data science journey. He holds a master’s degree from the University of Maryland, College Park. He’s an ardent NBA fan.

Presentations

Democratize and build better deep learning models using TensorFlow.js Session

Bargava Subramanian and Amit Kapoor use two real-world examples to show how you can quickly build visual data products using TensorFlow.js to address the challenges inherent in understanding the strengths, weaknesses, and biases of your models as well as involving business users to design and develop a more effective model.

Perumal Sudalai Kumaresa is a data science consultant at Data Reply with five years of industry experience across four companies in sentiment analysis, NLP, big data processing, text analytics, and implementing machine learning models. Perumal’s focus is on AI using reinforcement learning.

Presentations

Deep reinforcement learning for NLP Session

Natural language processing (NLP) tasks using supervised ML perform poorly where conversational context is involved. Perumal Sudalai Kumaresa details how deep reinforcement learning (DRL) handles problems like Q&A, dialogue generation, and article summarization better, by simulating two agents that take turns exploring the state-action space and learning a policy.
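At the core of DRL sits a value update rule. As a hedged illustration of the underlying mechanics — tabular Q-learning, far simpler than the deep variants discussed in the session:

```python
def q_update(q, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward
    reward + gamma * max_a' Q(s', a')."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q
```

In dialogue generation, the "state" would be the conversation so far and the "reward" a signal like coherence or informativeness; deep networks replace the table when the state space is language.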

Dan Sullivan is a principal engineer and architect at New Relic, and a software architect, author, and instructor with over 25 years of experience in the tech industry. He has extensive experience in multiple fields, including machine learning, data science, streaming analytics, and cloud architecture. Dan’s latest books include NoSQL for Mere Mortals, Google Cloud Certified Associate Cloud Engineer Study Guide, and Google Cloud Professional Architect Study Guide (forthcoming). His courses cover a range of topics, including scalable machine learning, data science, Scala, Cassandra, and advanced SQL, and have accumulated almost one million views across Lynda and LinkedIn Learning. He earned his PhD in genetics and computational biology.

Presentations

Don’t be that developer who puts a biased model into production Session

ML models may perform as expected from a reliability and scalability perspective, but make poor decisions that cost sales and trust. In worst-case scenarios, decisions may violate policies and government regulations. Dan Sullivan showcases techniques for identifying bias, leveraging explainability methods to measure compliance, and incorporating these techniques into DevOps practices.

Presentations

AI safety: How do we bridge the gap between technology and the law? Session

Firms and government have become more aware of the risk of "black-box" algorithms that "work," but in an opaque way. Existing laws and regulations merely stipulate what ought to be the case, not how to achieve it technically. Richard Sargeant is joined by leading figures from law, technology, and business to interrogate this subject.

Václav Surovec is a senior big data engineer and comanages the Big Data Department at Deutsche Telekom IT. The department’s more than 45 engineers deliver big data projects to Germany, the Netherlands, and the Czech Republic. Recently, he led the commercial roaming project. Previously, he worked at T-Mobile Czech Republic while he was still a student of Czech Technical University in Prague.

Presentations

Machine learning processes at Deutsche Telekom Global Carrier Session

Deutsche Telekom is the fourth-biggest telecommunications company in the world, and every day millions of its customers use its mobile services in roaming. Gabor Kotalik and Václav Surovec explain how the company designed and built its machine learning processes on top of the Cloudera Hadoop cluster to support its commercial roaming business.

Ben Sykes is a senior software engineer at Netflix, where he builds the video player delivery platform that’s used to manage the real-time deployment and monitoring of daily updates to video player devices around the world. He lives by the tagline, “I make things and break things,” believing that the best learning opportunities come when things don’t work.

Presentations

Real-time insights and a high-quality streaming experience at Netflix Session

Ensuring a consistently great Netflix experience while pushing innovative technology updates is no easy feat. Ben Sykes takes a look at how Netflix turns log streams into real-time metrics to provide visibility into how devices are performing in the field. You'll discover some of the lessons Netflix learned while optimizing Druid to handle its load.
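The general shape of turning log streams into real-time metrics is windowed aggregation. A toy sketch of the pattern (not Netflix's pipeline; the event shape and 60-second window are assumptions):

```python
from collections import defaultdict

def error_rate(log_entries, window=60):
    """Turn a stream of (timestamp, http_status) log entries into
    per-window error rates, keyed by window start time."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ts, status in log_entries:
        w = ts - ts % window          # bucket into fixed windows
        totals[w] += 1
        if status >= 500:
            errors[w] += 1
    return {w: errors[w] / totals[w] for w in totals}
```

At Netflix's scale, a system like Druid performs this aggregation continuously over the ingest stream rather than in batch, which is where the optimization lessons come in.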

Andras Szabo is a data scientist at Pivigo (London, UK) where, besides working on internal data science projects, he helps facilitate projects carried out by freelancers and aspiring data scientists from academia. A physicist by training, he has a decade of experience in biological academic research, working on experimental analysis and hypothesis testing through computational simulations. After leaving academia, he worked as a freelance data scientist in a variety of fields, including healthcare and finance, before joining Pivigo in 2019.

Presentations

Beyond smart infrastructure: leveraging satellite data to detect wildfires Session

Wildfires are a major environmental and health risk, and their frequency has increased dramatically in the past decade. Early detection is critical, yet most wildfires are still discovered through eyewitness accounts. Andras Szabo describes a data science partnership between HAL24K and Pivigo aimed at building an automated wildfire detection system using NOAA satellite data.
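Contextual fire-detection algorithms flag pixels that are hot both absolutely and relative to their surroundings. A toy, pure-Python version of that idea (the thresholds are illustrative, not those used by HAL24K or NOAA products):

```python
def hot_spots(grid, abs_threshold=320.0, delta=10.0):
    """Flag grid cells (brightness temperatures in kelvin) that are both
    absolutely hot and notably hotter than the scene average --
    a toy version of contextual fire-detection algorithms."""
    flat = [t for row in grid for t in row]
    mean = sum(flat) / len(flat)
    return [(r, c)
            for r, row in enumerate(grid)
            for c, t in enumerate(row)
            if t > abs_threshold and t - mean > delta]
```

A learned system improves on fixed thresholds by adapting to cloud cover, time of day, and terrain — the hard parts the partnership's ML approach targets.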

Shubham Tagra is a senior staff engineer at Qubole, working on Presto and Hive development and making these solutions cloud ready. Previously, Shubham worked on the storage area network at NetApp. Shubham holds a bachelor’s degree in computer engineering from the National Institute of Technology, Karnataka, India.

Presentations

ACID for big data lakes on Apache Hive, Apache Spark, and Presto Session

Abhishek Somani, Shubham Tagra, and V Rajkuma detail an open source framework for Apache Hive, Apache Spark, and Presto that provides cross-engine ACID transactions and enables performant and cost-effective updates and deletes on big data lakes on the cloud.

Angus Taylor is a data scientist at Microsoft, where he builds AI solutions for customers. He holds an MSc in artificial intelligence and has previous experience in the retail, energy, and government sectors.

Presentations

Solving real-world computer vision problems with open source Session

Training and deployment of deep neural networks for computer vision (CV) in realistic business scenarios remains a challenge for both data scientists and engineers. Angus Taylor and Patrick Buehler dig into the state of the art in the CV domain and provide resources and code examples for various CV tasks, leveraging the Microsoft CV best-practices repository.

Alex Thomas is a data scientist at John Snow Labs. He’s used natural language processing (NLP) and machine learning with clinical data, identity data, and job data. He’s worked with Apache Spark since version 0.9 as well as with NLP libraries and frameworks including UIMA and OpenNLP.

Presentations

Advanced natural language processing with Spark NLP 1-Day training

Alex Thomas and Maziyar Panahi detail the application of the latest advances in deep learning for common natural language processing (NLP) tasks such as named entity recognition, document classification, sentiment analysis, spell checking, and OCR. You'll learn to build complete text analysis pipelines using the highly performant, scalable, open source Spark NLP library in Python.

Ward Van Laer is a lead machine learning engineer at IxorThink, the AI practice of the Belgian software company Ixor. Fascinated by the mystery and power of the human mind, he’s inventive in explaining how AI models work and how to interpret their results. At IxorThink, Ward has delivered significant value with operational AI-driven solutions in the content marketing industry and the healthcare sector.

Presentations

Understanding AI: Interpretability and UX Session

A machine learning solution is only as good as the end user deems it to be. More often than not, we don't think through how results are communicated or measured. Join Ward Van Laer to understand why, if you want end users to trust and correctly interpret AI models, you might need to make your models transparent and understandable.

Navneet Kumar Verma is a software engineer on the data analytics and infrastructure team at LinkedIn, where he contributes to Darwin by designing and developing backend services and infrastructure. He works with Darwin customers and stakeholders to add new features and extend Darwin into a data app platform. Previously, he built big data applications at WalmartLabs and Manhattan Associates.

Presentations

Darwin: Evolving hosted notebooks at LinkedIn Session

Navneet Kumar Verma shares the challenges the team overcame to make Darwin (Data Analytics and Relevance Workbench at LinkedIn) a reality. You'll learn how data scientists, developers, and analysts at LinkedIn share their notebooks with peers, author work in multiple languages, use custom execution environments, execute long-running jobs, and do much more on a single hosted notebooks platform.

Naghman Waheed is the data platforms lead at Bayer Crop Science, where he’s responsible for defining and establishing enterprise architecture and direction for data platforms. Naghman is an experienced IT professional with over 25 years of work devoted to the delivery of data solutions spanning numerous business functions, including supply chain, manufacturing, order to cash, finance, and procurement. Throughout his 20+ year career at Bayer, Naghman has held a variety of positions in the data space, ranging from designing several scale data warehouses to defining a data strategy for the company and leading various data teams. His broad range of experience includes managing global IT data projects, establishing enterprise information architecture functions, defining enterprise architecture for SAP systems, and creating numerous information delivery solutions. Naghman holds a BA in computer science from Knox College, a BS in electrical engineering from Washington University, an MS in electrical engineering and computer science from the University of Illinois, and an MBA and a master’s degree in information management, both from Washington University.

Presentations

Enabling data streaming from SAP ERP system using Kafka Session

IT information systems are a key enabler for Bayer's business in a very competitive environment. As the complexity of its business grows, so does the need to provide data for real-time business analytics and BI. Naghman Waheed walks you through the unique architecture that streams data out of its SAP ERP using SAP SLT and Kafka, enabling business decisions based on real-time events.

Mary Wahl is a data scientist on the AI for Earth team at Microsoft, which helps NGOs apply deep learning to problems in conservation biology and environmental science. Mary has also worked on computer vision and genomics projects as a member of Microsoft’s algorithms and data science solutions team in Boston. Previously, Mary studied recent human migration, disease risk estimation, and forensic reidentification using crowdsourced genomic and genealogical data as a Harvard College Fellow.

Presentations

AI applications in aerial imagery Session

With the increasing availability of massive high-resolution aerial imagery, the geospatial information system community and the computer vision (CV) community joined forces in the new field of "geo AI." Mary Wahl and Ye Xing introduce you to this new field with live demos and sample code for common AI applications to aerial imagery from both commercial and government use cases.

Dean Wampler is an expert in streaming data systems, focusing on applications of machine learning and artificial intelligence (ML/AI). He is Head of Developer Relations at Anyscale, which is developing Ray for distributed Python, primarily for ML/AI. Previously, he was an engineering VP at Lightbend, where he led the development of Lightbend CloudFlow, an integrated system for building and running streaming data applications with Akka Streams, Apache Spark, Apache Flink, and Apache Kafka. Dean is the author of Fast Data Architectures for Streaming Applications, Programming Scala, and Functional Programming for Java Developers, and he is the coauthor of Programming Hive, all from O’Reilly. He’s a contributor to several open source projects. A frequent conference speaker and tutorial teacher, he’s also the co-organizer of several conferences around the world and several user groups in Chicago. He has a Ph.D. in Physics from the University of Washington.

Presentations

Understanding data governance for machine learning models Session

Production deployment of machine learning (ML) models requires data governance, because models are data. Dean Wampler justifies that claim and explores its implications and techniques for satisfying the requirements. Using motivating examples, you'll explore reproducibility, security, traceability, and auditing, plus some unique characteristics of models in production settings.
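One concrete technique the "models are data" view suggests is content-addressing: fingerprinting a model's parameters together with the metadata needed to reproduce it, so any deployed artifact can be traced and audited. A minimal sketch (not from the session; the field names are assumptions):

```python
import hashlib
import json

def model_fingerprint(params, training_metadata):
    """Content-address a trained model: hash its parameters together with
    reproducibility metadata (e.g., data version, code commit, seed).
    Two identical models always produce the same fingerprint."""
    payload = json.dumps({"params": params, "meta": training_metadata},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

Storing the fingerprint alongside every deployment record gives auditors a tamper-evident link from a production prediction back to the exact model and data that produced it.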

Using Ray to scale Python, data processing, and machine learning 1-Day training

Surprisingly, there's no simple way to scale up Python applications from your laptop to the cloud. Ray is an open source framework for parallel and distributed computing that makes it easy to program and analyze data at any scale by providing general-purpose high-performance primitives. Dean Wampler teaches you how to use Ray to scale up Python applications, data processing, and machine learning.

Jiao (Jennie) Wang is a software engineer on the big data technology team at Intel, where she works in the area of big data analytics. She’s engaged in developing and optimizing distributed deep learning framework on Apache Spark.

Presentations

Real-time recommendation using attention network with Analytics Zoo on Apache Spark Session

Luyang Wang and Jennie Wang explain how to build a real-time menu recommendation system leveraging an attention network, using MXNet, Ray, Apache Spark, and Analytics Zoo in the cloud. You'll learn how to deploy the model and serve real-time recommendations using both cloud and on-device infrastructure in Burger King’s production environment.

Luyang Wang is a senior manager of guest intelligence and data science at Restaurant Brands International, where he works on machine learning and big data analytics. He develops distributed machine learning applications and real-time recommendation services for the Burger King brand. Previously, he was at Office Depot and the Philips Big Data and AI Lab.

Presentations

Real-time recommendation using attention network with Analytics Zoo on Apache Spark Session

Luyang Wang and Jennie Wang explain how to build a real-time menu recommendation system leveraging an attention network, using MXNet, Ray, Apache Spark, and Analytics Zoo in the cloud. You'll learn how to deploy the model and serve real-time recommendations using both cloud and on-device infrastructure in Burger King’s production environment.

Jonathan Warner is a data strategist at Elsevier with a history in ad serving, affiliate marketing, and securities lending. He primarily works on rapid prototyping and investigative analysis in Spark, Python, and Redshift. Jonathan used to have hobbies; now he has a young daughter. He’s powered by coffee, yoga, and music.

Presentations

Data cleaning at scale Session

The ultimate purpose of data is to drive decisions, but things in the real world commonly aren’t as reliable or accurate as we'd like them to be. The main reason data gets dirty and often unreliable is simple: human intervention. Rupert Prescot and Jonathan Warner are here to help you maintain the reliability of data that's constantly exposed to and updated by your users.
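A large share of cleaning user-entered data comes down to normalization followed by deduplication. A minimal sketch of that pattern (the record shape and rules here are assumptions, not the speakers' pipeline):

```python
import re

def normalize(record):
    """Canonicalize a user-entered record so near-duplicates collide."""
    name = re.sub(r"\s+", " ", record["name"]).strip().lower()
    email = record["email"].strip().lower()
    return {"name": name, "email": email}

def dedupe(records):
    """Keep the first occurrence of each record, after normalization."""
    seen, out = set(), []
    for rec in map(normalize, records):
        key = rec["email"]
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out
```

The hard part at scale isn't the rules themselves but keeping them maintainable as users keep inventing new ways to enter the same thing — the human-intervention problem the session addresses.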

Presentations

AI safety: How do we bridge the gap between technology and the law? Session

Firms and government have become more aware of the risk of "black-box" algorithms that "work," but in an opaque way. Existing laws and regulations merely stipulate what ought to be the case, not how to achieve it technically. Richard Sargeant is joined by leading figures from law, technology, and business to interrogate this subject.

Benjamin Wright-Jones is a solution architect at the Microsoft WW Services CTO Office for Data and AI, where his team helps enterprise customers solve their analytical challenges. Over his career, Ben has worked on some of the largest and most complex data-centric projects around the globe.

Presentations

(Partially) demystifying DevOps for AI Session

DevOps, DevSecOps, AIOps, ML Ops, Data Ops, No Ops... Ditch your confusion and join Simon Lidberg and Benjamin Wright-Jones to understand what DevOps means for AI and your organization.

Ye Xing is a senior data scientist at Microsoft with rich experience in providing end-to-end big data analytics solutions to large enterprise customers. Her main focus areas include customer segmentation, personalized recommendation, churn prediction, and predictive maintenance. Beyond classic machine learning techniques, she’s also familiar with cutting-edge approaches such as deep learning for computer vision and medical image analysis and advanced online-learning recommendation algorithms.

Presentations

AI applications in aerial imagery Session

With the increasing availability of massive high-resolution aerial imagery, the geospatial information system community and the computer vision (CV) community joined forces in the new field of "geo AI." Mary Wahl and Ye Xing introduce you to this new field with live demos and sample code for common AI applications to aerial imagery from both commercial and government use cases.

Itai Yaffe is a big data tech lead at Nielsen Identity Engine, where he deals with big data challenges using tools like Spark, Druid, Kafka, and others. He’s also a part of the Israeli chapter’s core team of Women in Big Data. Itai is keen on sharing his knowledge and has presented his real-life experience in various forums in the past.

Presentations

Casting the spell: Druid advanced techniques Session

Nielsen Marketing Cloud leverages Apache Druid to provide its customers (marketers and publishers) real-time analytics tools for various use cases, including in-flight analytics, reporting, and building target audiences. Itai Yaffe digs into advanced Druid techniques, such as efficient ingestion of billions of events per day, query optimization, and data retention and deletion.
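Rollup — pre-aggregating raw events into far fewer rows at ingestion — is central to how Druid handles billions of events per day. A toy, pure-Python illustration of the idea (the event shape and hourly granularity are assumptions):

```python
from collections import defaultdict

def rollup(events, granularity=3600):
    """Pre-aggregate raw (timestamp, dimensions, value) events into
    per-(window, dimensions) rows, the way Druid's rollup collapses
    many raw events into far fewer stored rows at ingestion."""
    rows = defaultdict(lambda: [0, 0.0])   # [event_count, value_sum]
    for ts, dims, value in events:
        key = (ts - ts % granularity, dims)
        rows[key][0] += 1
        rows[key][1] += value
    return {k: tuple(v) for k, v in rows.items()}
```

The trade-off is losing per-event detail in exchange for dramatically smaller segments and faster queries — which is why choosing the right granularity and dimensions is one of the advanced techniques the session covers.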

Jennifer Yang is the head of data management and risk control at Wells Fargo Enterprise Data Technology. Previously, Jennifer served in various senior leadership roles in risk management and capital management at major financial institutions. Jennifer’s unique experience allows her to understand data and technology from both the end user’s and data management’s perspectives. Jennifer is passionate about leveraging the power of new technologies to gain insights from data and to develop cost-effective and scalable business solutions. Jennifer holds an undergraduate degree in applied chemistry from Beijing University, a master’s degree in computer science from the State University of New York at Stony Brook, and an MBA specializing in finance and accounting from New York University’s Stern School of Business.

Presentations

Applying machine learning techniques in data quality management Session

Traditional rule-based data quality management methodology is costly and scales poorly, since it requires subject matter experts in the business, data, and technology domains. Jennifer Yang discusses a use case demonstrating how machine learning techniques can be used for data quality management on a big data platform in the financial industry.
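To sketch the contrast with hand-written rules: even a simple statistical check can flag suspect values without a subject matter expert enumerating thresholds. (Illustrative only; production systems would use richer learned models.)

```python
import math

def anomalies(values, z_threshold=3.0):
    """Flag values whose z-score exceeds the threshold -- a minimal
    statistical alternative to hand-written data quality rules."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return []          # no spread, nothing stands out
    return [v for v in values if abs(v - mean) / std > z_threshold]
```

The appeal is maintenance cost: the check adapts as the data distribution drifts, whereas a fixed rule must be revisited by an expert every time the business changes.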

Contact us

confreg@oreilly.com

For conference registration information and customer service

partners@oreilly.com

For more information on community discounts and trade opportunities with O’Reilly conferences

Become a sponsor

For information on exhibiting or sponsoring a conference

pr@oreilly.com

For media/analyst press inquiries