Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

Office Hour with Neha Narkhede (Confluent)

Neha Narkhede (Confluent)
2:05pm–2:45pm Wednesday, 09/28/2016
Location: O'Reilly Booth (Table A)

Join Neha for a deeper dive into Apache Kafka use cases: using Kafka for stream processing, comparing Kafka Streams' lightweight library approach with heavier, framework-based tools such as Spark Streaming or Storm, and combining Kafka Connect with other tools, such as stream processing frameworks, to create a complete streaming data integration solution.

Neha Narkhede


Neha Narkhede is the cofounder and CTO at Confluent, a company backing the popular Apache Kafka messaging system. Prior to founding Confluent, Neha led streams infrastructure at LinkedIn, where she was responsible for LinkedIn’s petabyte-scale streaming infrastructure built on top of Apache Kafka and Apache Samza. Neha specializes in building and scaling large distributed systems and is one of the initial authors of Apache Kafka. A distributed systems engineer by training, Neha works with data scientists, analysts, and business professionals to move the needle on results.

Comments on this page are now closed.


Serge Davidov
09/21/2016 10:30am EDT

1) Could you please share and discuss the particular challenges/requirements of stream processing use cases (built with Kafka, Samza, Spark, or Storm) when implementing solutions for the financial industry/investment banking sector?

Problem space: regulatory compliance in general, or specifically:
- compliance with risk limits,
- not compromising regulatory capital adequacy,
- maintaining (via monitoring and real-time feedback) intraday liquidity targets,
- counterparty credit risk exposure limits — by industry, by product, by asset category, by location.
All of this means real-time monitoring to create a feedback loop and "self-correct" throughout the day (to remain compliant without guesswork while maximally utilizing all available risk-taking capacity).
[This is just one representation of the problem space; it's not limited to this.]

2) How can traceability and auditability of the data produced via stream-based solutions be implemented?
(Generally, this means being able to [retroactively] prove/explain the validity of the resultant data calculated/generated by the system, based on the data that has been fed into it, its algorithm, and any user input (or other control mechanisms).
This is a requirement of various audits,
BUT even before anything else, it is a capability that should be available to quality assurance.)
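One common pattern for this kind of traceability (a hypothetical sketch for illustration, not anything from the session itself) is to carry provenance metadata on every derived record, so that any output value can be retroactively explained by the exact inputs and algorithm version that produced it:

```python
from dataclasses import dataclass

# Hypothetical sketch: attach lineage to every derived result so an
# auditor (or QA) can reconstruct how each output was computed.

@dataclass(frozen=True)
class Event:
    event_id: str
    payload: float

@dataclass(frozen=True)
class Result:
    value: float
    algorithm: str     # which calculation version produced this result
    source_ids: tuple  # IDs of every input event that contributed

def aggregate_exposure(events):
    """Toy aggregation: sum exposures while recording full lineage."""
    return Result(
        value=sum(e.payload for e in events),
        algorithm="sum_exposure_v1",
        source_ids=tuple(e.event_id for e in events),
    )

events = [Event("e1", 100.0), Event("e2", 250.0)]
result = aggregate_exposure(events)
# result.value can now be traced back to events e1 and e2 and to the
# exact algorithm version ("sum_exposure_v1") that combined them.
```

In a Kafka-based pipeline, the same idea typically translates into propagating source offsets/event IDs and a processing version through the stream topology alongside the computed values.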

(if time permits)
3) A question on implementing financial "reconciliations" for stream-based solutions.
Stated more generally: assurance of the integrity of the data processed in the system, relative to the authoritative source of data.
The twist with streams is that processing happens up front as events occur, and not 100% of those events will be the actual trades in the book when it is officially closed for the day. I.e., there is a difference between the "real-time events" and the "official version" of the events, which would be considered the "golden source". What lands in Hadoop via streams would then not reconcile to that golden source. So how can the results from the stream system be trusted?
And I cannot change the data in my stream system because 1) you don't update data in Hadoop, and 2) even if you loaded the golden-source data, it would not be the data that explains the results the system generated in real time throughout the day.
So what explanation can one provide to regulators?
…that the data upon which your critical risk-taking decisions were made was generated from different data than that which actually represents the risk…
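One way to frame this reconciliation (a minimal illustrative sketch, with hypothetical names, rather than a recommended design) is to keep the stream-derived results immutable and instead emit a break report against the end-of-day golden source, so the intraday view is preserved for audit while discrepancies are explicitly surfaced:

```python
# Hypothetical sketch: reconcile immutable stream-derived positions
# against the official end-of-day "golden source" by producing a break
# report, rather than rewriting the stream data.

def reconcile(stream_positions, golden_positions):
    """Return per-key breaks as (stream_value, golden_value, difference)."""
    breaks = {}
    for key in set(stream_positions) | set(golden_positions):
        s = stream_positions.get(key, 0.0)
        g = golden_positions.get(key, 0.0)
        if s != g:
            breaks[key] = (s, g, s - g)
    return breaks

# Intraday view built from real-time events (kept immutable for audit).
intraday = {"trade_book_A": 1000.0, "trade_book_B": 500.0}
# Official end-of-day golden source (e.g. one trade was cancelled).
golden = {"trade_book_A": 1000.0, "trade_book_B": 450.0}

report = reconcile(intraday, golden)
```

Under this framing, the answer to the regulator is the break report itself: the intraday results are explained by the events as they were known at decision time, and every divergence from the golden source is enumerated rather than silently corrected.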
(Any networking introductions to your clients in investment banking who might be able to share their experiences in this area would be highly appreciated.)

Serge Davidov
Deutsche Bank