Architecting a data analytics service in both the public cloud and an on-premises private cloud: ETL, BI, and machine learning (sponsored by SK Holdings)
In January 2019, SK Holdings announced AccuInsight+, a cloud data analytics platform comprising eight analytics services, on CloudZ, one of the largest cloud service providers in Korea. The platform has been rapidly adopted across business areas such as banking, manufacturing, and ecommerce. Jungwook Seo walks you through the architecture of this data analytics platform for digital transformation and the differences between building it in the public cloud and in an on-premises private cloud.
The analytics platform’s architecture consists of three layers: a service layer, an API layer as middleware, and an infrastructure layer for provisioning Docker containers. Additionally, the architecture addresses a number of technical challenges, such as multitenancy, microservice architecture, asynchronous communication, and provisioning Hadoop and Spark clusters.
The key point of the architecture is the API layer, implemented with Vert.x, one of the most popular toolkits for building reactive applications. It covers many API categories, such as data collection, real-time streaming, Hadoop batch, ML training, Hadoop provisioning, visualization, and query. Each API category is implemented following the microservice architecture (MSA), so the categories run independently of each other in the cloud. They can still communicate with one another as a platform because each registers itself with the service discovery service and shares cluster information through a shared cache for multitenancy. The Vert.x framework makes the APIs scalable and keeps the platform highly available.
The underlying service is dynamic Hadoop provisioning (DHP), which provides Hadoop and Spark clusters inside Docker containers to process large-scale data. DHP is based on Kubernetes for Docker container orchestration, and it caches the endpoints of Hadoop ecosystem components such as the NameNode, ResourceManager, Hive, and Oozie so that the other API categories can use those endpoints for multitenancy. Once a cluster is deployed, you can use the batch pipeline (BP) service to ETL data from a variety of sources into the newly created cluster. You simply drag and drop components on the BP canvas to draw a data pipeline and deploy it to one of the selected clusters. Each ETL workflow runs as Spark jobs on the Spark cluster, so the ETL process is fast thanks to in-memory processing.
After the ETL jobs, data scientists can use the ML Modeler service, which provides a GUI over Spark ML so that you can easily draw Spark ML pipelines without writing Spark code. The best ML models, saved in the same cluster, can be applied to the production environment simply by reading the saved models from a workflow in either the BP or the real-time pipeline (RP), which is based on Spark Streaming. The four services (DHP, BP, RP, and ML Modeler) are connected to each other, executing jobs in YARN containers in the same Hadoop cluster. These services are implemented with Spark technologies such as Spark Core, Spark Streaming, Spark ML, and Spark SQL.
The predicted data generated by ML models can be saved to the data lake or any database through BP and RP workflows. Data in the data lake can be queried via the BigQL service, which uses the Presto query engine. Another service is Cloud Search (CS), which provisions an ELK stack; CS provides built-in templates that make it easy to collect and index log datasets for search. The visualization service, Data Insight (DI), lets you visualize the data you want to explore. DI provides a number of data adapters to collect data from various sources such as S3, relational databases, Hive, and Redis.
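Because BigQL is backed by Presto, data in the lake can be queried with ordinary ANSI SQL. The catalog, schema, and table names below are hypothetical, shown only to illustrate the kind of query such a service accepts:

```sql
-- Hypothetical names; Presto federates Hive and other catalogs.
SELECT country, SUM(amount) AS revenue
FROM hive.sales.orders
WHERE order_date >= DATE '2019-01-01'
GROUP BY country
ORDER BY revenue DESC
LIMIT 10;
```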
The most important service is the deep learning (DL) Modeler, which lets you manage the lifecycle of your own DL models. Thanks to Horovod, each DL training job can easily be distributed across multiple GPUs on multiple nodes. You can also run a number of training jobs at the same time so that the best model can be found quickly. The best model can then be deployed as a REST API with a single click to serve predictions.
AccuInsight+ is rapidly being applied at client sites in different business areas in Korea: manufacturing sites such as SK hynix and Hyundai Electric, banking sites such as KEB Hana Bank and Kookmin Bank, and ecommerce sites such as LOTTE Department Store.
Jungwook addresses the differences between providing a data analytics platform as a public cloud service and as an on-premises private cloud service, drawing on the company's experience. For example, in the public service, users can deploy containers as they require, but in the on-premises service, users must get approval from administrators before their containers are deployed. On-premises users also want to monitor all of their items, such as data, models, clusters, and batch jobs, on a dashboard page with a view tailored to each role.
What you'll learn
- Discover AccuInsight+
Jungwook Seo is the data platform development team leader at SK Holdings, where he spent three years developing AccuInsight+. He has over 20 years of experience as a researcher in a variety of areas, such as distributed systems, cloud computing, and big data, including three major projects in the UK. Previously, he was a researcher in cloud computing and big data at SK Holdings. His PhD focused on a grid computing project at the University of Manchester, and he successfully executed two more research projects as a postdoctoral researcher at the Universities of Leeds and Cardiff.