Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Building the data infrastructure for the internet of things at zettabyte scale

Jian Chang (Alibaba Group), Sanjian Chen (Alibaba Group)
14:05–14:45 Wednesday, 1 May 2019
Data Engineering and Architecture
Location: Capital Suite 8/9
Average rating: 3.33 out of 5 (3 ratings)

Who is this presentation for?

  • Executives, developers, and those on the business side



What you'll learn

  • Explore the architecture design and innovations of Alibaba TSDB, a state-of-the-art database for IoT data management


IDC forecasts that by 2025 the global datasphere will grow to 163 zettabytes (1 ZB = 1 trillion GB), with the majority contributed by IoT devices. This massive amount of data continuously generated by everything around us has created enormous challenges for databases.

Alibaba TSDB, a state-of-the-art database for IoT data management, came into being to meet the demands for high-concurrency storage and low-latency query while providing efficient and economical services to end users. So far, Alibaba has been able to scale the service to thousands of physical nodes and deliver peak performance at 80 million operations per second.

Inside Alibaba Group, TSDB is the backbone service for hosting tens of billions of monitoring metrics, covering all tiers of the company's operations, from the world’s largest ecommerce site, Taobao, to its nationwide logistics network, Cainiao. Externally, many valuable businesses and critical applications also depend on TSDB, from the state grid to Shanghai City Brain.

With a deep understanding of the IoT domain, TSDB focuses on time series and spatiotemporal trajectory data. TSDB manages high-dimensional data through specially designed indexes, which greatly improve real-time query performance. To minimize I/O overhead, TSDB implements an in-memory cache with a state-of-the-art compression algorithm.
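The abstract does not describe how TSDB's indexes are built. As a rough illustration of how a time series store can resolve a multi-dimensional query quickly, here is a minimal sketch of an inverted tag index. The class `TagIndex`, its methods, and the tag names are hypothetical and are not taken from Alibaba TSDB.

```python
from collections import defaultdict


class TagIndex:
    """Toy inverted index: (tag key, tag value) -> set of series IDs.

    Hypothetical illustration only; not Alibaba TSDB's actual index.
    """

    def __init__(self):
        self._postings = defaultdict(set)

    def add_series(self, series_id, tags):
        # Register the series under every tag it carries.
        for key, value in tags.items():
            self._postings[(key, value)].add(series_id)

    def lookup(self, **filters):
        """Return the IDs of series that match every key=value filter."""
        if not filters:
            return set()
        sets = [self._postings.get((k, v), set()) for k, v in filters.items()]
        # Start from the smallest posting set to keep intersections cheap.
        sets.sort(key=len)
        result = set(sets[0])
        for postings in sets[1:]:
            result &= postings
        return result


index = TagIndex()
index.add_series(1, {"metric": "cpu", "host": "a", "region": "eu"})
index.add_series(2, {"metric": "cpu", "host": "b", "region": "eu"})
index.add_series(3, {"metric": "mem", "host": "a", "region": "us"})

eu_cpu = index.lookup(metric="cpu", region="eu")
```

A query over several dimensions becomes a set intersection over posting sets, so its cost depends on the number of matching series rather than on the total number of series stored.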

Besides storing and querying billions of metrics in real time, TSDB provides data manipulation, intelligent analysis, and visualization functionality for IoT and related domains. TSDB has a rich set of built-in time series processing functions, such as sampling, interpolation, and aggregation. Its streaming engine performs these operations while ingesting data, so it retains only low-level intermediate results and greatly reduces memory consumption. TSDB also provides an SQL-like query interface so that business analysts can take full advantage of its storage and query engine. TSDB can help companies understand data trends, discover anomalies, reduce production risks, and increase productivity and efficiency.
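The idea of aggregating during ingestion can be sketched as follows. This is a minimal illustration, assuming integer timestamps that arrive in order and a fixed bucket width; `StreamingDownsampler` and `Bucket` are hypothetical names, not part of TSDB's API. Only a running count, sum, minimum, and maximum are kept per open bucket, so raw points never need to be buffered.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    """Running aggregates for one time bucket (no raw points are stored)."""
    start: int
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")


class StreamingDownsampler:
    """Hypothetical sketch: aggregate points into fixed-width buckets on ingest.

    Assumes timestamps are integers and arrive in non-decreasing order.
    """

    def __init__(self, interval):
        self.interval = interval
        self.current = None
        self.closed = []  # finished buckets as (start, avg, min, max)

    def ingest(self, timestamp, value):
        start = timestamp - timestamp % self.interval
        if self.current is None:
            self.current = Bucket(start)
        elif start != self.current.start:
            # The point belongs to a new bucket: emit the finished one.
            self.closed.append(self._summarize(self.current))
            self.current = Bucket(start)
        bucket = self.current
        bucket.count += 1
        bucket.total += value
        bucket.minimum = min(bucket.minimum, value)
        bucket.maximum = max(bucket.maximum, value)

    def flush(self):
        """Close the open bucket, if any, and return all summaries."""
        if self.current is not None:
            self.closed.append(self._summarize(self.current))
            self.current = None
        return self.closed

    @staticmethod
    def _summarize(bucket):
        return (bucket.start, bucket.total / bucket.count,
                bucket.minimum, bucket.maximum)


sampler = StreamingDownsampler(interval=60)
for ts, value in [(0, 1.0), (20, 3.0), (65, 10.0), (119, 20.0), (120, 5.0)]:
    sampler.ingest(ts, value)
results = sampler.flush()
# results == [(0, 2.0, 1.0, 3.0), (60, 15.0, 10.0, 20.0), (120, 5.0, 5.0, 5.0)]
```

Memory use is constant per series regardless of how many points fall into a bucket, which is the property the paragraph above attributes to the streaming engine.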

Jian Chang and Sanjian Chen share the architecture design and many detailed technology innovations of Alibaba TSDB and discuss lessons learned from years of development and continuous improvement. Jian and Sanjian particularly highlight the innovative design of the HIMO compression algorithm, which uses neural networks and reinforcement learning to perform model selection for compression. HIMO enables a 50% compression ratio improvement and 5x performance acceleration compared to other well-known compression formats. With GPU and FPGA support, it can even provide a 50x performance gain.
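The abstract gives no details of HIMO itself. To make "model selection for compression" concrete, the sketch below picks, for each block of integer values, whichever of three simple encodings yields the smallest estimated size. It is not HIMO: the selection here is a brute-force trial of every candidate rather than a learned policy, and the candidate encodings (raw, delta, delta-of-delta), the varint size estimate, and all function names are assumptions chosen for illustration.

```python
def zigzag(n):
    """Map signed ints to non-negative ints so small magnitudes stay small."""
    return 2 * n if n >= 0 else -2 * n - 1


def varint_size(n):
    """Bytes needed to store a non-negative int as a base-128 varint."""
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size


def raw(values):
    return list(values)


def delta(values):
    """First value, then differences between consecutive values."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_of_delta(values):
    """First value, first delta, then differences between consecutive deltas."""
    d = delta(values)
    return d[:2] + [b - a for a, b in zip(d[1:], d[2:])]


# Candidate "models"; a learned selector would choose among these
# without trying each one.
ENCODERS = {"raw": raw, "delta": delta, "delta_of_delta": delta_of_delta}


def encoded_size(values):
    """Estimated size in bytes when each value is stored as a zigzag varint."""
    return sum(varint_size(zigzag(v)) for v in values)


def choose_encoder(block):
    """Return (name, size) of the candidate with the smallest estimated size."""
    sizes = {name: encoded_size(fn(block)) for name, fn in ENCODERS.items()}
    best = min(sizes, key=sizes.get)
    return best, sizes[best]


regular = [1000, 1100, 1200, 1300, 1400]   # evenly spaced, e.g. timestamps
oscillating = [500, 540, 500, 540, 500]    # small alternating steps
noisy = [-60, 60, -60, 60]                 # swings larger than the values

choices = [choose_encoder(b) for b in (regular, oscillating, noisy)]
# choices == [("delta_of_delta", 7), ("delta", 6), ("raw", 4)]
```

Different data shapes favor different encodings, which is why choosing the model per block can improve the compression ratio. Trying every candidate costs extra CPU time on each block, and a learned selector is one way to avoid that cost.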



Jian Chang

Alibaba Group

Jian Chang is a senior algorithm expert at the Alibaba Group, where he works on cutting-edge applications of AI at the intersection of high-performance databases and the IoT, focusing on unleashing the value of spatiotemporal data. A data science expert and software system architect with expertise in machine learning and big data systems and deep domain knowledge of various vertical use cases (finance, telco, healthcare, etc.), Jian has led innovation projects and R&D activities to promote data science best practices within large organizations. He’s a frequent speaker at technology conferences, such as the O’Reilly Strata and AI Conferences, NVIDIA’s GPU Technology Conference, Hadoop Summit, DataWorks Summit, Amazon re:Invent, Global Big Data Conference, Global AI Conference, World IoT Expo, and Intel Partner Summit, and has published and presented research papers and posters at many top-tier conferences and in journals, including ACM Computing Surveys, ACSAC, CEAS, EuroSec, FGCS, HiCoNS, HSCC, IEEE Systems Journal, MASHUPS, PST, SSS, TRUST, and WiVeC. He’s also served as a reviewer for many highly reputable international journals and conferences. Jian holds a PhD from the Department of Computer and Information Science (CIS) at the University of Pennsylvania, where he was advised by Insup Lee.

Sanjian Chen

Alibaba Group

Sanjian Chen is a senior algorithm expert at the Alibaba Group. He has deep knowledge of large-scale machine learning algorithms. Over his career, he’s developed cutting-edge data-driven modeling techniques and autonomous systems in both academic and industry settings and designed data-analytics solutions that drove numerous high-impact business decisions for multiple Fortune 500 companies across several industries, including retail, banking, automotive, and telecommunications. He’s currently working on building cutting-edge cloud-based AI engines for high-performance distributed database systems that support scalable data analytics in multiple business areas. Sanjian is a frequent invited speaker at top international conferences, including the Strata Data Conference (San Francisco, London), the IEEE Cyber-Physical Systems Week (Chicago), the IFAC Conference on Analysis and Design of Hybrid Systems (Atlanta), and the IEEE International Conference on Healthcare Informatics (Philadelphia, Dallas). He’s received two IEEE Best Paper Awards and published over 25 papers in top journals and conferences, including two published in the Proceedings of the IEEE. He’s also served as an invited reviewer for numerous top international journals and conferences, including IEEE Design & Test, IEEE Transactions on Computers, ACM Transactions on Cyber-Physical Systems, IEEE Transactions on Industrial Electronics, the IEEE RTSS conferences, and the ACM HSCC conference. He holds a PhD in computer and information science from the University of Pennsylvania.