Siddha Ganju explains how CERN uses machine-learning models to predict which datasets will become popular over time. These predictions guide the replication of the most heavily accessed datasets, which improves the efficiency of physics analysis in the CMS (Compact Muon Solenoid) experiment. Analyzing this data yields useful information about the underlying physical processes.
Reproducibility requires that any process can be simulated again at a later time. Some processes are more popular than others, so their data needs to be easily accessible. Users access this data through replicas stored in specified locations, but creating numerous replicas of every dataset is not feasible, so predicting which datasets are likely to become popular is necessary. Siddha explains how CERN treated this as a binary classification problem, in which each dataset is labeled popular (1 / TRUE) or unpopular (0 / FALSE), and walks through an example with toy data. (Actual data cannot be disclosed.)
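As a rough illustration of the binary target described above, here is a minimal Python sketch that labels toy datasets as popular (1) or unpopular (0). The dataset names, access counts, and the threshold of 100 accesses are all made-up assumptions for illustration. They are not CERN's data or CERN's definition of popularity.

```python
# Toy access counts: dataset name -> number of accesses in some time window.
# Every name and number here is invented for illustration.
toy_access_counts = {
    "dataset_a": 1200,
    "dataset_b": 15,
    "dataset_c": 430,
    "dataset_d": 3,
}

# Hypothetical cutoff; the actual popularity criterion is not given in the talk abstract.
POPULARITY_THRESHOLD = 100


def label_popularity(access_counts, threshold):
    """Map each dataset to 1 (popular) or 0 (unpopular)."""
    return {
        name: int(count >= threshold)
        for name, count in access_counts.items()
    }


labels = label_popularity(toy_access_counts, POPULARITY_THRESHOLD)
for name, label in labels.items():
    print(name, label)
```

A classifier would then be trained to predict these labels ahead of time from features of each dataset, so that replicas can be created before demand rises.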
With the problem framed as popular versus unpopular, CERN still had to decide which machine-learning algorithm suits the task best. Three algorithms were employed: naive Bayes, stochastic gradient descent, and random forest. These models were combined into an ensemble and compared on their true positive, true negative, false positive, and false negative values to see which performs best.
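The comparison described above can be sketched in plain Python. The example below counts true positives, true negatives, false positives, and false negatives for three sets of predictions and for a simple majority-vote ensemble. The labels and predictions are hypothetical toy values, and majority voting is only one possible way to combine models. The abstract does not say how CERN's ensemble was built.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count tp, tn, fp, fn for binary labels (1 = popular, 0 = unpopular)."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return {key: counts[key] for key in ("tp", "tn", "fp", "fn")}


def majority_vote(*prediction_lists):
    """Combine binary predictions: output 1 when more than half the models say 1."""
    return [int(sum(votes) * 2 > len(votes)) for votes in zip(*prediction_lists)]


# Hypothetical ground truth and per-model predictions (not real results).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "naive_bayes":   [1, 0, 0, 1, 0, 1, 1, 0],
    "sgd":           [1, 1, 1, 0, 0, 0, 1, 0],
    "random_forest": [1, 0, 1, 1, 0, 0, 0, 0],
}
predictions["ensemble"] = majority_vote(*predictions.values())

results = {name: confusion_counts(y_true, pred) for name, pred in predictions.items()}
for name, counts in results.items():
    print(name, counts)
```

Which of the four counts matters most depends on the cost of each mistake: a false negative leaves a popular dataset under-replicated, while a false positive spends storage on a dataset that few users access.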
Siddha details how this process offers better data analysis, leading to parallel, real-time processing of the distributed data that is abundantly available in CMS.
Siddha Ganju is a data scientist at Deep Vision, where she works on building deep learning models and software for embedded devices. Siddha is interested in problems that connect natural languages and computer vision using deep learning. Her work ranges from visual question answering to generative adversarial networks to gathering insights from CERN’s petabyte-scale data and has been published at top-tier conferences like CVPR. She is a frequent speaker at conferences and advises the Data Lab at NASA. Siddha holds a master’s degree in computational data science from Carnegie Mellon University, where she worked on multimodal deep learning-based question answering. When she’s not working, you might catch her hiking.
©2016, O'Reilly Media, Inc.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.