Sep 23–26, 2019

How China Telecom combats financial fraud over 50M transactions a day using Apache Pulsar

Weisheng Xie (China Telecom BestPay Co., Ltd), Sijie Guo (ASF)
2:55pm–3:35pm Wednesday, September 25, 2019
Location: 1A 15/16
Secondary topics:  Data, Analytics, and AI Architecture, Financial Services, Streaming and IoT, Telecom

Who is this presentation for?

Data Scientist, Software Engineer, CTO




As the fintech arm of China Telecom, with half a billion registered users and 41 million monthly active users, we operate dozens of online financial products. We face financial fraud threats every day: identity theft, money laundering, affiliate fraud, merchant fraud, and more. Risk control is vital, and our risk management system runs thousands of decisions against each transaction to counter these threats.

In risk management, the core task is decision making. Decisions consist of a series of rules and models. Developing rules and models is vital, but equally important is producing the indicators (features) that the decisions require. Examples from our risk management system include the intimacy between users; monthly average consumption frequency and amount; login frequency over the last minute, month, or year; and the time interval between the last two transfer transactions. Some of these indicators require large volumes of historical data held in a data store such as Hive and are normally computed in batch mode (with Presto, in our case). Others depend on data in the current transaction and are needed by the decisions for that transaction; this real-time transaction data is stored in a message queue such as Kafka and processed with streaming computation (e.g., Spark Streaming). This is a typical Lambda architecture, and it has been running at our company for many years.
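To make two of the real-time indicators above concrete, here is a minimal Python sketch of "login frequency in the last minute" and "time interval between the last two transfers", followed by a toy rule that consumes them. The class name, function names, and thresholds are hypothetical and chosen only for illustration; they are not taken from the speakers' system, and the sketch assumes timestamps arrive in order.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UserIndicators:
    """Per-user state for two real-time indicators (hypothetical design)."""
    login_times: deque = field(default_factory=deque)  # login timestamps, in seconds
    transfer_times: deque = field(default_factory=lambda: deque(maxlen=2))

    def record_login(self, ts: float) -> None:
        self.login_times.append(ts)

    def record_transfer(self, ts: float) -> None:
        # maxlen=2 keeps only the two most recent transfers.
        self.transfer_times.append(ts)

    def logins_in_last_minute(self, now: float) -> int:
        # Evict logins that have fallen out of the 60-second window.
        while self.login_times and self.login_times[0] <= now - 60:
            self.login_times.popleft()
        return len(self.login_times)

    def last_transfer_interval(self) -> Optional[float]:
        if len(self.transfer_times) < 2:
            return None
        return self.transfer_times[1] - self.transfer_times[0]


def decide(ind: UserIndicators, now: float) -> str:
    """A toy rule; both thresholds are made up for illustration."""
    if ind.logins_in_last_minute(now) > 5:
        return "review"
    interval = ind.last_transfer_interval()
    if interval is not None and interval < 2:
        return "review"
    return "allow"
```

Indicators that span a month or a year would not fit this in-memory pattern, which is why the abstract assigns them to batch computation over historical data.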

The biggest drawback of this architecture is the need to maintain two distinct (and possibly complex) systems for the batch and speed layers. The Kappa architecture attempts to simplify this by keeping a single code base rather than one for each of the batch and speed layers in the Lambda architecture. The complication of Kappa mostly revolves around having to process all data as a stream, such as handling duplicate events, cross-referencing events, or maintaining order: operations that are generally easier to do in batch processing.
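As a rough illustration of why duplicates and ordering are harder in a stream, the Python sketch below drops repeated event IDs and re-emits events in event-time order using a bounded delay. The function name, the `(event_time, event_id)` tuple layout, and the `max_delay` parameter are assumptions made for this example, not part of any system described in the session.

```python
import heapq
from typing import Iterable, Iterator, Tuple

Event = Tuple[float, str]  # (event_time, event_id); layout assumed for this sketch


def dedupe_and_reorder(events: Iterable[Event], max_delay: float) -> Iterator[Event]:
    """Drop duplicate event IDs and emit events in event-time order.

    An event is held back until an event at least `max_delay` newer has been
    seen, so events that arrive up to `max_delay` late are still ordered.
    """
    seen = set()   # grows without bound here; a real job would expire old IDs
    pending = []   # min-heap keyed on event_time
    watermark = float("-inf")
    for event in events:
        event_time, event_id = event
        if event_id in seen:
            continue  # duplicate delivery
        seen.add(event_id)
        heapq.heappush(pending, event)
        watermark = max(watermark, event_time - max_delay)
        while pending and pending[0][0] <= watermark:
            yield heapq.heappop(pending)
    # End of input: flush whatever is still buffered, in order.
    while pending:
        yield heapq.heappop(pending)
```

In a batch job the same result is a sort followed by a distinct. In a stream it requires state, a lateness bound, and a policy for events that arrive later than that bound, which this sketch simply emits out of order.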

Still, we have been seeking a solution that unifies the data store, computing engine, and programming language for decision development in our risk control system.

Apache Pulsar is an open source distributed event streaming system originally created at Yahoo and now part of the Apache Software Foundation. Pulsar addresses these messy operational problems by storing data in segmented streams. Data is appended to topics (aka streams) as it arrives, then segmented and stored in Apache BookKeeper, a scalable log store. Because only one copy of the data is stored (the source of truth), this addresses the inconsistency problem in the Lambda architecture. The data can also be accessed as streams via unified pub/sub messaging and as segments for elastic parallel batch processing. This makes Apache Pulsar a natural fit as a unified messaging and storage solution. Together with a unified computing engine such as Spark, it can boost the efficiency of our risk control decision deployment.
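The idea of one stored copy with two access paths can be pictured with a toy in-memory model. The Python sketch below is not the Pulsar or BookKeeper API; `SegmentedTopic`, its methods, and the fixed segment size are hypothetical names and simplifications used only to show stream-style reads and segment-parallel batch reads over the same data.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List


class SegmentedTopic:
    """Toy append-only topic whose entries are split into fixed-size segments."""

    def __init__(self, segment_size: int = 3) -> None:
        self.segment_size = segment_size
        self.segments: List[List[float]] = [[]]

    def append(self, amount: float) -> None:
        # Roll over to a new segment once the current one is full.
        if len(self.segments[-1]) >= self.segment_size:
            self.segments.append([])
        self.segments[-1].append(amount)

    def stream(self, start: int = 0) -> Iterator[float]:
        """Stream-style access: entries in append order from a given offset."""
        offset = 0
        for segment in self.segments:
            for amount in segment:
                if offset >= start:
                    yield amount
                offset += 1

    def batch(self, fn: Callable[[List[float]], float]) -> List[float]:
        """Batch-style access: apply fn to every segment in parallel."""
        with ThreadPoolExecutor() as pool:
            return list(pool.map(fn, self.segments))


topic = SegmentedTopic(segment_size=3)
for amount in [10, 20, 30, 40, 50, 60, 70]:
    topic.append(amount)

recent = list(topic.stream(start=5))   # streaming consumer resuming at offset 5
total = sum(topic.batch(sum))          # batch job summing each segment in parallel
```

Both reads see the same entries, so a streaming indicator and a batch indicator computed this way cannot diverge because of differing source data. A real deployment would presumably use Pulsar clients and a batch engine such as Spark in place of these methods.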

In this session, we will share how we leverage Apache Pulsar to boost the efficiency of our risk control decision development.

Prerequisite knowledge

- Basic big data knowledge
- Data processing
- Pub/sub messaging

What you'll learn

- What the Lambda architecture is
- How we do risk control decision deployment in Lambda
- What Apache Pulsar is
- How we boost efficiency by leveraging Pulsar

Weisheng Xie

China Telecom BestPay Co., Ltd

Vincent Xie (谢巍盛) is the Chief Scientist and Director of China Telecom BestPay Co., Ltd. He builds the company’s Artificial Intelligence Group and leads the team’s research on big data and AI. Previously, he worked at Intel, where he led an engineering team working on machine learning- and big data-related open source technologies.


Sijie Guo


Sijie Guo is the PMC Chair of Apache BookKeeper and a PMC member of Apache Pulsar. He previously worked at Twitter, where he led the messaging team. Before Twitter, he worked on push notification infrastructure at Yahoo!.

View a complete list of Strata Data Conference contacts