Most security data science talks describe a specific algorithm used to solve a security problem and leave the audience wondering: how was the system tested when there is no labeled attack data, what metrics are monitored, and what do these systems actually look like in production? Academia and industry both focus heavily on security detection, but the emphasis is almost always on the algorithmic machinery powering the systems. Prior work on productizing these solutions is sparse: the problem has been studied from a machine-learning angle or from a security angle, but the two have not been explored jointly. The intersection of operationalizing security and machine-learning solutions is important not only because security data science solutions inherit complexities from both fields but also because each field has unique challenges that spill over into the other. For instance, compliance restrictions that dictate data cannot be exported from specific geographic locations (a security constraint) have a downstream effect on model design, deployment, evaluation, and management strategies (a data science constraint).
Ram Shankar Siva Kumar and Andrew Wicker explain how to operationalize security analytics for production in the cloud, covering a framework for assessing the impact of compliance on model design, six strategies and their trade-offs to generate labeled attack data for model evaluation, key metrics for measuring security analytics efficacy, and tips to scale anomaly detection systems in the cloud. Ram and Andrew explore lessons learned in taking a prototype security analytics system and productizing it with help from teams across Microsoft in a variety of roles, from security analysts in Azure Cloud Security and researchers in Microsoft Research to applied ML engineers in Azure Security Data Science and service engineers on the Service and Reliability team.
Ram and Andrew begin with the impact of compliance on model design, discussing the balkanization of the cloud (i.e., how certain countries have strict laws against moving data across borders, and the effect of those laws on model design). The first problem data scientists encounter is that fractured data makes it very difficult to identify macro trends. Ram and Andrew propose tiered model building, wherein local models are built in the respective national clouds alongside a global model that is informed only of the output of the local models, respecting compliance and privacy requirements.
Ram and Andrew then explain how to evaluate a security data science system when there is no attack data. You’ll learn techniques for generating attack data, such as using common attacker tools, red teaming, threat intelligence feeds, and cross-product pollination, to verify that the system works, along with the inherent trade-offs between the different strategies. Ram and Andrew also cover the relevance of generative adversarial networks, a new technique in deep learning that can potentially provide higher-quality samples than sampling techniques like SMOTE.
Ram and Andrew conclude with a discussion of model management, focusing on autoscaling the system, illustrated with a case study in detecting anomalous user behavior in SharePoint.
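The tiered model building idea described above can be illustrated with a minimal Python sketch. It is not the speakers' implementation: the class names (`LocalModel`, `GlobalModel`, `LocalSummary`), the z-score scoring rule, and the threshold of 3.0 are all assumptions chosen for illustration. The only point it tries to show is that raw observations stay inside each regional model and the global tier sees nothing but the local outputs.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List


@dataclass
class LocalSummary:
    """The only object that leaves a regional cloud: scores, not raw events."""
    region: str
    anomaly_scores: List[float]


class LocalModel:
    """Hypothetical per-region model: z-score of a value against local history."""

    def __init__(self, region: str, history: List[float]):
        self.region = region
        self.mu = mean(history)
        # Fall back to 1.0 when the history has no variance, to avoid dividing by zero.
        self.sigma = pstdev(history) or 1.0

    def score(self, observations: List[float]) -> LocalSummary:
        # Raw observations are consumed here and never returned.
        scores = [abs(x - self.mu) / self.sigma for x in observations]
        return LocalSummary(self.region, scores)


class GlobalModel:
    """Hypothetical global tier: it receives LocalSummary objects only."""

    def __init__(self, threshold: float = 3.0):
        self.threshold = threshold  # assumed cutoff, not taken from the talk

    def flag_regions(self, summaries: List[LocalSummary]) -> Dict[str, bool]:
        # A region is flagged when its highest local score reaches the threshold.
        return {
            s.region: max(s.anomaly_scores, default=0.0) >= self.threshold
            for s in summaries
        }


def demo() -> Dict[str, bool]:
    # Made-up daily counts for two made-up regions.
    local_de = LocalModel("de", history=[10, 12, 11, 9, 10, 11])
    local_us = LocalModel("us", history=[100, 98, 103, 101, 99, 102])
    summaries = [local_de.score([11, 30]), local_us.score([100, 101])]
    return GlobalModel().flag_regions(summaries)


if __name__ == "__main__":
    print(demo())
```

Because each region normalizes against its own history, the global tier can compare regions of very different sizes without ever seeing their underlying data. A real system would likely export richer outputs than a list of scores, which is a design decision this sketch leaves open.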
Ram Shankar Siva Kumar is a security data wrangler in Azure Security Data Science, where he works at the intersection of ML and security. Ram’s work at Microsoft includes a slew of patents in the large-scale intrusion detection space (called “fundamental and groundbreaking” by evaluators). In addition, he has given talks at internal conferences and received Microsoft’s Engineering Excellence award. Ram has previously spoken at data-analytics-focused conferences like Strata San Jose and the Practice of Machine Learning as well as at security-focused conferences like BlueHat, DerbyCon, FireEye Security Summit (MIRCon), and Infiltrate. Ram graduated from Carnegie Mellon University with master’s degrees in both ECE and innovation management.
Andrew Wicker is a machine learning engineer in the Security division at Microsoft, where his current work focuses on researching and developing machine-learning solutions to protect identities in the cloud. Andrew’s previous work includes developing machine-learning models to detect safety events in an immense amount of FAA radar data and working on the development of a distributed graph analytics system. His expertise encompasses the areas of artificial intelligence, graph analysis, and large-scale machine learning. Andrew holds a BS, an MS, and a PhD in computer science from North Carolina State University.
©2017, O'Reilly Media, Inc.