Today’s enterprise architectures are often composed of a myriad of heterogeneous devices. Bring-your-own-device policies, vendor diversification, and the transition to the cloud all contribute to a sprawling infrastructure, the complexity and scale of which can only be addressed by using modern distributed data processing systems.
Kevin Mao outlines the system that Capital One has built to collect, clean, and analyze the security-related events occurring within its digital infrastructure. Raw data from each component is collected and preprocessed using Apache NiFi flows. This raw data is then written into an Apache Kafka cluster, which serves as the primary communications backbone of the platform. The raw data is parsed, cleaned, and enriched in real time via Apache Metron and Apache Storm and ingested into ElasticSearch, allowing operations teams to detect and monitor events as they occur. The refined data is also transformed into the Apache ORC data format and stored in Amazon S3, allowing data scientists to perform long-term, batch-based analysis.
Kevin discusses the challenges involved with architecting and implementing this system, such as data quality, performance tuning, and the impact of additional financial regulations relating to data governance, and shares the results of these efforts and the value that the data platform brings to Capital One.
Kevin Mao is a senior data engineer at Capital One Financial Services currently working on the Cybersecurity Data Lake team within Capital One’s Enterprise Data Services organization. Kevin’s current work involves designing and developing tools to ingest and transform cybersecurity-related data streams from across the organization into datasets that are used by security analysts for detecting and forecasting cyberthreats. Kevin holds a BS in computer science from the University of Maryland, Baltimore County and an MS in computer science from George Mason University. In his free time, he enjoys hiking, running, climbing, and snowboarding.
Comments on this page are now closed.
©2017, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.