How China Telecom combats financial fraud across 50M transactions a day using Apache Pulsar
Who is this presentation for? Data Scientist, Software Engineer, CTO
As a fintech company of China Telecom, with half a billion registered users and 41 million monthly active users, we operate dozens of online financial products. We face financial fraud threats every day: identity theft, money laundering, affiliate fraud, merchant fraud, and more. Risk control is vital, and our risk management system runs thousands of decisions against each transaction to counter these threats.
In risk management, the core is decision making. A decision comprises a series of rules and models. Developing rules and models is clearly vital, but equally important is producing the indicators (features) that the decisions require. Examples from our risk management system include the intimacy between users; the monthly average consumption frequency and amount; the login frequency over the last minute, month, or year; and the time interval between a user's last two transfer transactions. Some of these indicators require large volumes of historical data held in a data store such as Hive and are normally computed in batch mode (with Presto, in our case). Others depend on data in the current transaction and are needed by decisions on that same transaction; the real-time transaction data is stored in a message queue such as Kafka and is typically processed with stream computation (e.g. Spark Streaming). This is a typical Lambda architecture, and it has been running in our company for many years.
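To make two of the real-time indicators above concrete, here is a minimal Python sketch of how a per-user login frequency over the last minute and the interval between the last two transfers might be maintained. The class `UserIndicators`, its method names, and the 60-second window are illustrative assumptions, not part of the system described in this talk, and the sketch assumes events arrive in timestamp order.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class UserIndicators:
    """Hypothetical rolling state for one user (illustration only)."""
    window_seconds: float = 60.0
    login_times: Deque[float] = field(default_factory=deque)
    transfer_times: List[float] = field(default_factory=list)

    def record_login(self, ts: float) -> None:
        # Assumes timestamps are non-decreasing.
        self.login_times.append(ts)

    def record_transfer(self, ts: float) -> None:
        # Only the two most recent transfers are needed for the interval.
        self.transfer_times = (self.transfer_times + [ts])[-2:]

    def logins_in_last_window(self, now: float) -> int:
        # Evict logins that fall outside the sliding window, then count the rest.
        cutoff = now - self.window_seconds
        while self.login_times and self.login_times[0] <= cutoff:
            self.login_times.popleft()
        return len(self.login_times)

    def last_transfer_interval(self) -> Optional[float]:
        # Undefined until the user has made at least two transfers.
        if len(self.transfer_times) < 2:
            return None
        return self.transfer_times[1] - self.transfer_times[0]


ind = UserIndicators()
for ts in (0.0, 30.0, 50.0, 70.0):
    ind.record_login(ts)
ind.record_transfer(40.0)
ind.record_transfer(65.0)

login_count = ind.logins_in_last_window(now=75.0)  # logins at 30, 50, 70 remain
transfer_gap = ind.last_transfer_interval()        # 65.0 - 40.0
print(login_count, transfer_gap)
```

Indicators of this kind need only a small amount of recent state, which is why they suit stream computation, whereas a monthly average would need the historical store.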
The biggest drawback of the Lambda architecture is the need to maintain two distinct (and possibly complex) systems for the batch and speed layers. The Kappa architecture attempts to simplify this by keeping a single code base rather than one for each layer. The complication of Kappa mostly revolves around having to process all data as a stream: handling duplicate events, cross-referencing events, and maintaining order are operations that are generally easier in batch processing.
Still, we have been seeking a solution that unifies the data store, computing engine, and programming language for decision development in our risk control system.
Apache Pulsar is an open-source distributed event streaming system originally created at Yahoo and now part of the Apache Software Foundation. Pulsar addresses these messy operational problems by storing data in segmented streams. Data is appended to topics (aka streams) as it arrives, then segmented and stored in Apache BookKeeper, a scalable log store. Because only one copy of the data is stored (the source of truth), this addresses the inconsistency problem of the Lambda architecture. The data can also be accessed both as streams, via unified pub/sub messaging, and as segments, for elastic parallel batch processing. This makes Apache Pulsar well suited as a unified messaging/storage solution. Together with a unified computing engine such as Spark, it can boost the efficiency of our risk control decision deployment.
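The idea of one stored copy serving both access patterns can be sketched with a toy in-memory model. This is not the Pulsar or BookKeeper API: the class `SegmentedLog`, its methods, and the fixed segment size are assumptions made purely for illustration, and records are plain integers standing in for transaction amounts.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List


class SegmentedLog:
    """Toy append-only topic split into fixed-size segments (illustration only)."""

    def __init__(self, segment_size: int = 3) -> None:
        self.segment_size = segment_size
        self.segments: List[List[int]] = [[]]

    def append(self, amount: int) -> None:
        # Roll over to a new segment once the current one is full.
        if len(self.segments[-1]) >= self.segment_size:
            self.segments.append([])
        self.segments[-1].append(amount)

    def stream(self, start: int = 0) -> Iterator[int]:
        # Stream-style access: records in append order from a given offset.
        offset = 0
        for segment in self.segments:
            for record in segment:
                if offset >= start:
                    yield record
                offset += 1

    def batch(self, fn: Callable[[List[int]], int]) -> List[int]:
        # Batch-style access: apply fn to each segment in parallel.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(fn, self.segments))


log = SegmentedLog(segment_size=3)
for amount in [10, 20, 30, 40, 50, 60, 70]:
    log.append(amount)

stream_total = sum(log.stream())   # sequential read over the whole topic
batch_total = sum(log.batch(sum))  # per-segment partial sums, then combined
print(stream_total, batch_total)
```

Both totals come from the same stored records, so a streaming job and a batch job cannot disagree because of diverging copies of the data, which is the inconsistency the single source of truth is meant to remove.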
In this session, we will share how we leverage Apache Pulsar to boost the efficiency of our risk control decision development.
Prerequisite knowledge
- Basic big data knowledge
- Data processing
- Pub/sub messaging
What you'll learn
China Telecom BestPay Co., Ltd
Vincent Xie (谢巍盛) is the Chief Scientist and Director of China Telecom BestPay Co., Ltd. He builds the company’s Artificial Intelligence Group and leads the team to carry out research related to big data and A.I. Previously, he worked for Intel leading an engineering team working on machine learning- and big data-related open source technologies.
Sijie Guo is the PMC Chair of Apache BookKeeper and a PMC member of Apache Pulsar. Previously, he led the messaging team at Twitter. Prior to Twitter, he worked on Yahoo!'s push notification infrastructure.
View a complete list of Strata Data Conference contacts