Training: 8–9 November 2016
Tutorials & Conference: 9–11 November 2016
Amsterdam, NL

Foundations of security data science

Jay Jacobs (BitSight Technologies), Charles Givre (Booz Allen Hamilton)
9:00–17:00 Tuesday & Wednesday, 8–9 November
Location: D203

Participants should plan to attend both days of this 2-day training. Training passes do not include access to tutorials on Wednesday.


Prerequisite knowledge

  • A working knowledge of a scripting or programming language (ideally Python or R)
  • Familiarity with security data sources, including vulnerability scanner data, DNS data, and threat intelligence data

Materials or downloads needed in advance

We will be providing a virtual machine with all the course materials. If you choose to use the virtual machine you will need:

  • A laptop (Windows/Linux/OS X) with 8 GB of RAM and 30 GB of free space. If you are using a PC you will need to have virtualization enabled in the BIOS.
  • The latest version of VirtualBox, available at Virtualbox.org.

While we strongly encourage you to use the virtual machine, if you choose not to, you will need:

You will also need a GitHub account.

What you'll learn

  • Understand how to organize and execute a data analysis project, from exploration to insight
  • Gain experience working with different data formats
  • Learn how the science of data visualization can transform how you communicate your story
  • Explore the applications of models and machine-learning techniques

Description

Join Jay Jacobs, Charles Givre, and Bob Rudis, the authors of Data-Driven Security, for a hands-on, in-depth exploration of the foundations of security data science. You’ll learn how to explore and analyze data you probably already have, and you’ll gain experience with tools and techniques to prepare, analyze, and visualize the knowledge hiding in your data. Jay, Charles, and Bob guide you through three hands-on, practical applications with real data, introducing each in a language-agnostic way before providing language-specific guidance for the hands-on work. A GitHub repository with the examples will be available so that you can revisit them and continue learning after the training.

If you are a security analyst and need to leverage more data in your analyses, are working in operations and know you can pull out more from the data you have, or already identify vulnerabilities and weaknesses in systems and networks but need to better communicate your team’s findings during engagements, this is the training for you.

Outline:

Day 1

Introductions and core concepts (90 minutes)

  • The flow of data-driven research
  • Tools of the trade and how to get these tools approved by desktop admins
  • GitLab
  • Using notebooks
  • Workflow management
  • Reproducible research
  • Statistical principles, descriptive stats, sampling, and confidence intervals
  • How data analysis differs from a typical development lifecycle
  • How to structure and plan for a data analysis project
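
As a small illustration of the descriptive statistics and confidence intervals item above, here is a minimal Python sketch using only the standard library. The data (daily counts of failed logins) and the function name mean_confidence_interval are hypothetical and are not taken from the course materials. The interval uses a normal approximation, which is rough for a sample this small.

```python
import math
import statistics


def mean_confidence_interval(values, z=1.96):
    """Return (mean, low, high) for an approximate 95% CI of the mean.

    Uses the normal approximation: mean +/- z * (sample stdev / sqrt(n)).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.mean(values)
    stderr = statistics.stdev(values) / math.sqrt(n)
    return mean, mean - z * stderr, mean + z * stderr


# Hypothetical data: failed logins per day over two weeks.
failed_logins = [12, 15, 9, 22, 17, 14, 11, 30, 13, 16, 10, 18, 15, 12]

mean, low, high = mean_confidence_interval(failed_logins)
print(f"median={statistics.median(failed_logins)} mean={mean:.2f} "
      f"95% CI=({low:.2f}, {high:.2f})")
```

The median is printed alongside the mean because a single unusual day (the 30 here) pulls the mean upward, which is the kind of effect descriptive statistics are meant to surface.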

Core data visualization (90 minutes)

  • The visual building blocks of a data visualization
  • How people process and consume visualizations (cognitive science)
  • Targeting the audience and other techniques for making a good visualization
  • Common pitfalls and mistakes
  • Visual makeovers
  • Introduction to data visualization tool suites
  • Mental building blocks for assessing and creating good visualizations

Lunch break

Tooling up—hands-on lab (30 minutes)

  • Instructors will work with participants to ensure their environments are operational and everyone has access to the course resources.
  • Participants will create an operational environment.

Core exploratory data analysis (60 minutes)

  • Columnar thinking
  • Exploratory data analysis (EDA) and building your data intuition
  • Visualization techniques to describe and demystify the data
  • EDA with moving pictures
  • Applying the lessons to the tools of your choice (examples provided in Python/R)

Exploratory data analysis and visualization challenge—hands-on lab (90 minutes)

  • Instructors will provide a real-world dataset.
  • Participants will prepare and explore the data and produce their own visualizations and stories from the data. You can also submit your work to the training GitLab instance.

Day 1 wrap-up (30 minutes)


Day 2

Project showcase from Day 1 (30 minutes)

  • Instructors kick off the day by reviewing some concepts from the previous day and showcasing some of the participants' work.
  • Participants will see and discuss the strengths and opportunities of the analysis done by other participants.

Core clustering and unsupervised learning (60 minutes)

  • Supervised versus unsupervised learning
  • Unsupervised learning: what it is, how it works, when to use it, and some typical use cases for applying it
  • Specific unsupervised techniques and how they work (language-specific implementations provided as examples)
  • The importance of and techniques for feature generation and the role of domain expertise
  • Introduction to the dataset and the question we need to answer
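
The outline does not name the specific unsupervised techniques covered. As one common example, the sketch below implements a bare-bones k-means clustering in plain Python. The host data (open ports, vulnerability count), the function name kmeans, and the naive choice of the first k points as starting centroids are all assumptions made for illustration, not course material.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (centroids, assignments).

    Naive initialization: the first k points are the starting centroids.
    """
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Hypothetical hosts described by (open ports, vulnerability count).
hosts = [(2, 1), (40, 35), (3, 2), (38, 30), (1, 1), (42, 33)]

centroids, assignments = kmeans(hosts, k=2)
print(assignments)
print(centroids)
```

In practice the starting centroids are usually chosen at random and features are scaled first; both steps are left out here to keep the sketch short and deterministic.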

Vulnerability data challenge—hands-on lab (90 minutes)

  • Instructors will provide a real-world dataset and the challenge.
  • Participants will prepare and explore the data and develop a research question to answer. (Key questions will be provided that can be answered in the time allotted, but participants can identify additional ones if they have existing knowledge about vulnerability management.) You can submit your work to the training GitLab instance.

Lunch break

Morning wrap-up (30 minutes)

  • Instructors review concepts from the morning and showcase some work from the participants.
  • Participants will see and discuss the strengths and opportunities of the analysis done by other participants.

Core classification and supervised learning (60 minutes)

  • Supervised learning: what it is, how it works, when to use it, and some typical use cases for applying it
  • Random forests and how they work (language-specific implementations provided as examples)
  • Discussion of the importance of and techniques for feature generation and the role of domain expertise
  • Introduction to the dataset and the question we need to answer

Domain-generating algorithms—hands-on lab (90 minutes)

  • Instructors will provide a real-world dataset and the challenge.
  • Participants will prepare and explore the data, generate features, and do some supervised learning on the data. You can submit your work to the training GitLab instance.
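
To give a sense of what generating features might look like for domain names, here is a minimal Python sketch that computes a few simple lexical features. The feature set (length, character entropy, digit ratio, vowel ratio), the function names, and the example domains are assumptions chosen for illustration; the lab's actual dataset and features may differ.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the string, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def domain_features(domain):
    """Compute simple lexical features from the leftmost label of a domain.

    Simplification: only the text before the first dot is examined.
    """
    label = domain.lower().split(".")[0]
    length = len(label)
    digits = sum(ch.isdigit() for ch in label)
    vowels = sum(ch in "aeiou" for ch in label)
    return {
        "length": length,
        "entropy": shannon_entropy(label),
        "digit_ratio": digits / length if length else 0.0,
        "vowel_ratio": vowels / length if length else 0.0,
    }


# Made-up examples: one dictionary-like name and one random-looking name.
for name in ["example.com", "x7f3k9q2zb1m.net"]:
    print(name, domain_features(name))
```

Feature dictionaries like these could then be fed to a classifier such as a random forest, with each domain labeled as legitimate or algorithmically generated.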

Course wrap-up (30 minutes)

  • Instructors will review the material covered in the afternoon session and then conclude by discussing the big picture and how the techniques you learned will help you with the work ahead, with heavy emphasis placed on continued learning.

Jay Jacobs

BitSight Technologies

Jay Jacobs is the senior data scientist at BitSight Technologies. Prior to joining BitSight, Jay spent four years as the lead data analyst for the Verizon Data Breach Investigations Report. Jay is the coauthor of Data-Driven Security, which covers data analysis and visualizations for information security, and hosts the Data-Driven Security and R World News podcast. He is a cofounder of the Society of Information Risk Analysts and currently serves on its board of directors. Jay is also active in the R community: he coordinates his local R user group for the greater Minneapolis area and contributes to local events and functions supporting data analysis.


Charles Givre

Booz Allen Hamilton

Charles Givre is an unapologetic data geek who is passionate about helping others learn about data science and become passionate about it themselves. For the last five years, Charles has worked as a data scientist at Booz Allen Hamilton for various government clients and has done some really neat data science work along the way, hopefully saving US taxpayers some money. Most of his work has been in developing meaningful metrics to assess how well the workforce is performing. For the last two years, Charles has been part of the management team for one of Booz Allen Hamilton’s largest analytic contracts, where he was tasked with increasing the amount of data science on the contract, both in terms of tasks and people.

Even more than the data science work, Charles loves learning about and teaching new technologies and techniques. He has been instrumental in bringing Python scripting to both his government clients and the analytic workforce and has developed a 40-hour Introduction to Analytic Scripting class for that purpose. Additionally, Charles has developed a 60-hour Fundamentals of Data Science class, which he has taught to Booz Allen staff, government civilians, and US military personnel around the world. Charles has a master’s degree from Brandeis University, two bachelor’s degrees from the University of Arizona, and various IT security certifications. In his nonexistent spare time, he plays trombone, spends time with his family, and works on restoring British sports cars.