Stream processing is skyrocketing in popularity, but most organizations still face a serious challenge: getting their data into (and out of) their stream processing framework of choice. Even a small organization may have dozens of data sources at course granularity (one or more databases, app events, log files, metrics, and more), and many thousands if you consider each table, application log type, or metric individually. Loading all this data into your stream processing framework often means tracking down dozens of third party connectors that vary greatly in quality and robustness.
On the other hand, Apache Kafka has quickly grown into a de facto standard for data storage with stream processing frameworks. All the major frameworks, including Samza, Spark Streaming, and Storm, include robust and actively-maintained connectors. It has achieved this status because it is a natural fit for streaming data: high throughput, low latency, scales to very large workloads, and uses retention policies to explicitly limit the stored data, something that needs to be done manually with most data stores. And it’s not only good for input and output; it’s also ideal for intermediate storage of derived data streams that may feed back into additional downstream processing jobs.
What if we leveraged Kafka to drastically simplify this ecosystem and improve the overall quality of connectors? This presentation introduces Copycat, a new framework for loading structured data into and out of Kafka, and therefore into and out of all the major stream processing frameworks. First we will describe the types of impedance mismatch that arise between data sources and sinks, that make connectors difficult to write well and lead to high variation in quality; then explain how introducing Kafka solves these problems. Next, we will dive into the details of Copycat and how it compares to some systems with similar goals, discussing key design decisions made that trade off between ease of use for writers of connectors, operational complexity, and reuse of existing connectors. Finally, we’ll discuss how standardizing on Copycat and Kafka for ingestion of data into your stream processing framework can ultimately lead to simplifying your entire data pipeline, and make ETL into your data warehouse as simple as adding another Copycat job.
Neha Narkhede is the cofounder and CTO at Confluent, a company backing the popular Apache Kafka messaging system. Previously, Neha led streams infrastructure at LinkedIn, where she was responsible for LinkedIn’s petabyte-scale streaming infrastructure built on top of Apache Kafka and Apache Samza. Neha specializes in building and scaling large distributed systems and is one of the initial authors of Apache Kafka. A distributed systems engineer by training, Neha works with data scientists, analysts, and business professionals to move the needle on results.
Comments on this page are now closed.
©2015, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.