Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Confounding factors galore: Using software ecosystem data to risk-rate code

J. C. Herz (Ion Channel)
2:05pm–2:45pm Wednesday, September 27, 2017
Security
Location: 1E 14
Level: Beginner

Who is this presentation for?

  • CISOs, heads of application security, security engineers, and risk managers

Prerequisite knowledge

  • Familiarity with modern software development practices (continuous integration and delivery, GitFlow) and how open source is used by enterprises for everyday and mission-critical applications

What you'll learn

  • Understand why combining traditional domain knowledge with analytics is more than the sum of the parts and why machine learning doesn’t replace old-school conceptual foundations and contextual grounding
  • Explore tools for thinking about big data analysis of complex and complicated systems

Description

Automating security for DevOps means continuous analysis of open source software dependencies, vulnerabilities, and ecosystem dynamics. But the data is confounding: a flurry of reported vulnerabilities or a stretch of infrequent commits could be good or bad, depending on a project’s scope and lifecycle. JC Herz shares nonintuitive insights from the software supply chain, as well as tools and areas for further investigation.

Ion Channel analyzes software ecosystem data to risk-rate code for continuous integration and delivery. But even well-defined data becomes slippery and ambivalent in this analytical domain. Each software ecosystem (Java, Python, NPM, Ruby, Go, etc.) is a little bit different, and each presents a unique challenge to the development of a unified model of risk from transitive dependencies and technical debt.

Vulnerability data seems straightforward, but diagnosis doesn’t always correlate with disease. A high count of reported vulnerabilities can actually be a sign of health: it could mean that the project has been subject to a thorough review and is therefore low risk. To analogize to healthcare, people who get regular checkups have a lot more identified risk factors than people who don’t know they’re sick or people so healthy they never see a doctor. What security customers want to know is: Where are unreported and uncorrected vulnerabilities lurking in my infrastructure? This is a wicked, high-dimensional problem because the data is both ambiguous and ambivalent.
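
The checkup analogy can be made concrete with a small sketch. This is illustrative only and is not Ion Channel's method: the `Project` fields, the `review_events` proxy for scrutiny, the `min_reviews` threshold, and both example projects are hypothetical, with made-up numbers. It shows how a naive ranking by reported-vulnerability count treats a never-reviewed project as the safest one, while a view that accounts for scrutiny labels it as unknown.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    reported_vulns: int  # vulnerabilities publicly reported
    fixed_vulns: int     # of those, how many have been fixed
    review_events: int   # hypothetical proxy for scrutiny (audits, fuzzing runs, etc.)


def naive_risk(p: Project) -> int:
    """Naive score: more reported vulnerabilities means more risk."""
    return p.reported_vulns


def adjusted_view(p: Project, min_reviews: int = 5) -> str:
    """Interpret the count only when the project has had enough scrutiny.

    min_reviews is an arbitrary illustrative threshold, not a calibrated value.
    """
    if p.review_events < min_reviews:
        return "unknown: too little scrutiny to interpret the count"
    open_vulns = p.reported_vulns - p.fixed_vulns
    if open_vulns == 0:
        return "reviewed, no open findings"
    return f"reviewed, {open_vulns} open finding(s)"


# Made-up example projects, for illustration only.
audited = Project("widely-audited-lib", reported_vulns=40, fixed_vulns=40, review_events=120)
obscure = Project("obscure-lib", reported_vulns=0, fixed_vulns=0, review_events=0)

# The naive ranking puts the never-reviewed project first (lowest score).
naive_ranking = sorted([audited, obscure], key=naive_risk)

for p in (audited, obscure):
    print(p.name, "|", naive_risk(p), "|", adjusted_view(p))
```

Even this toy version leaves the customer's real question open: the "unknown" label marks where unreported vulnerabilities could be hiding, not whether they are there.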

Software supply chain analysis is a perfect case study in why conventional expertise and data science are best combined. Machine learning alone isn’t great at identifying risk when context varies and context matters. Old-school analog domain knowledge is massively useful—global logistics experts have a lot to teach us—but doesn’t account for the volatility and labor-market dynamics of open source communities and the ephemeral nature of the product.

It’s incredibly difficult, counterintuitive, confounding, and interesting.

J. C. Herz

Ion Channel

JC Herz is cofounder and COO at Ion Channel, a data and microservices platform that automates situational awareness and enables risk management of the software supply chain. She has 15 years of analytics experience in healthcare and national security. JC was a White House special consultant to the Pentagon’s CIO office and coauthored the DoD’s open technology development roadmap. A published author, she has been contributing to Wired magazine since 1993.