Scaling data engineers

Evgeny Vinogradov (Yandex.Money)

11:20am–12:00pm Wednesday, September 25, 2019

Location: 1A 21/22

Data Engineering and Architecture

Secondary topics: Culture and Organization, Financial Services, Model Development, Governance, Operations

Average rating: 3.50 (2 ratings)

Who is this presentation for?

Data science team managers, data scientists, and data engineers

Level

Intermediate

Description

Yandex.Money has dozens of product teams and over a hundred microservices, and the number is still growing. On the one hand, this gives the company agility (each product team is responsible for one or a few products) and keeps up the pace. On the other hand, Yandex.Money needs to consolidate and aggregate all of the data and still manage to work with it. Evgeny Vinogradov details his experience in managing and scaling data engineering to support 20+ product teams, including the issues the company ran into and how it solves them on a management and technical level.

These are some of the problems Yandex.Money faces. Few people know how all of the microservices work together. Yandex.Money needs some of its data engineers to know how the data from different sources fits together, but in practice there is so much information that no data engineer knows the whole architecture. There are considerably more backend developers than data engineers, and they ship a considerable number of changes to production. There are strict service-level agreement (SLA) requirements for response time and update time, and users generally don’t care that a data source somewhere in the service isn’t very reliable: from their point of view, it is the data warehouse that isn’t reliable. Data scientists have a considerably different toolset from data engineers, and product engineers don’t always understand what data engineering is, which leads them to underestimate both the required development time and the importance of the quality of data produced by the backend.

First, Yandex.Money has to solve these issues on a management level. The company starts with training: an initial onboarding program gives an overview of all microservices, followed by several types of training for each data engineering team, tailored to its product teams. A data engineer is expected to have an architect’s knowledge. The company also requires strong typing from data sources and backend developers (“Oh, we got this number as a string. Why should we check it? And what do we do if it doesn’t pass schema validation?”). Data engineers are divided into teams, much like product teams. Each data engineering team works closely with one or several product teams and is responsible for integrating and verifying data from those product teams, as well as checking that new data aggregates correctly with existing data. Other data engineering teams can then rely on the loaded data. A core data engineering team is responsible for processing sensitive data, and every data engineering team member has to attend the meetings of their product team.
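
The strong-typing requirement above can be illustrated with a small validation step at the point where backend data enters the warehouse. This is only a minimal sketch in Python, not the tooling described in the talk: the `PaymentEvent` shape, its field names, and the `load_batch` helper are all assumptions made for illustration. The point it shows is that a number delivered as a string is rejected and reported instead of being silently coerced.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


class SchemaError(ValueError):
    """Raised when an incoming record does not match the agreed schema."""


@dataclass(frozen=True)
class PaymentEvent:
    # Hypothetical event shape; the field names are assumptions.
    event_id: str
    amount: Decimal
    created_at: datetime


def parse_payment_event(raw: dict) -> PaymentEvent:
    """Validate a raw record strictly: no silent string-to-number coercion."""
    event_id = raw.get("event_id")
    if not isinstance(event_id, str) or not event_id:
        raise SchemaError("event_id must be a non-empty string")

    amount = raw.get("amount")
    # bool is a subclass of int, so it is excluded explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, Decimal)):
        raise SchemaError(
            f"amount must be an integer or Decimal, got {type(amount).__name__}"
        )

    created_at = raw.get("created_at")
    if not isinstance(created_at, str):
        raise SchemaError("created_at must be an ISO 8601 string")
    try:
        parsed_ts = datetime.fromisoformat(created_at)
    except ValueError as exc:
        raise SchemaError(f"created_at is not valid ISO 8601: {created_at!r}") from exc

    return PaymentEvent(event_id, Decimal(amount), parsed_ts)


def load_batch(records):
    """Split a batch into accepted events and rejected (record, reason) pairs."""
    accepted, rejected = [], []
    for raw in records:
        try:
            accepted.append(parse_payment_event(raw))
        except SchemaError as exc:
            rejected.append((raw, str(exc)))
    return accepted, rejected
```

Keeping the rejected records together with the reason gives the data engineering team something concrete to take back to the product team that owns the source, which is one possible answer to "what do we do if it doesn’t pass schema validation?"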

To solve the issues on a technical level, every table with events has a timestamp/rowversion column, and the most important part of the code conventions is where and how to store unique identifiers. Yandex.Money has a lot of staging tables, which it clusters, so each user gets narrow tables and needs little knowledge of the underlying microservices’ data structures. This means a lot of MERGE statements, but with rowversion and unique identifiers, it works well. Yandex.Money unifies tooling for data engineering teams and data science teams. For data engineers, there’s SQL Server with Integration Services, Kafka, and APIs; for data scientists, it’s Python and SQL Server, plus Power BI, Reporting Services, and Python for visuals. The company would rather find a suitable tool for a subject matter expert working as a data scientist than find a data scientist who knows the company’s existing tooling.
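
Why rowversion plus unique identifiers make frequent merges safe can be sketched in a few lines. The example below is an in-memory Python stand-in for a SQL MERGE from a staging table, not the company's actual code: the function name, the `id` and `rowversion` keys, and the dict-based "tables" are assumptions chosen for illustration. A staged row is inserted when its identifier is new, applied when its rowversion is higher than the stored one, and skipped otherwise, so replaying the same staging data changes nothing.

```python
def merge_by_rowversion(target, staged_rows):
    """Upsert staged rows into target, keyed by unique identifier.

    target: dict mapping id -> row dict; every row carries "id" and "rowversion".
    Returns counts of inserted, updated, and skipped rows.
    """
    stats = {"inserted": 0, "updated": 0, "skipped": 0}
    for row in staged_rows:
        current = target.get(row["id"])
        if current is None:
            # Unknown identifier: behaves like WHEN NOT MATCHED THEN INSERT.
            target[row["id"]] = dict(row)
            stats["inserted"] += 1
        elif row["rowversion"] > current["rowversion"]:
            # Newer version of a known row: behaves like WHEN MATCHED THEN UPDATE.
            target[row["id"]] = dict(row)
            stats["updated"] += 1
        else:
            # Same or older version: a replay or late arrival, so leave it alone.
            stats["skipped"] += 1
    return stats


warehouse = {}
first_load = merge_by_rowversion(warehouse, [
    {"id": "p1", "rowversion": 1, "status": "new"},
    {"id": "p2", "rowversion": 2, "status": "new"},
])
second_load = merge_by_rowversion(warehouse, [
    {"id": "p1", "rowversion": 3, "status": "paid"},  # changed since first load
    {"id": "p2", "rowversion": 2, "status": "new"},   # unchanged replay
])
```

After the second load, `p1` holds the newer row and `p2` is untouched, which is the property that lets many staging loads run without coordinating with each other.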

Prerequisite knowledge

Experience with data engineering (useful but not required)

What you'll learn

Understand that data engineers can't be scaled the same way as regular backend teams, that data engineers should be trained as architects, and that the schema of APIs and data is very important
Learn that if you use streaming (like Kafka), you have to create a special pipeline to deliver schema changes (which is a little easier with an API), and that if you want to train models, data engineers have to provide narrow datasets to the data scientists
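
One possible shape for a pipeline that delivers schema changes to stream consumers is to tag every event with a schema version and upgrade old events step by step before loading them. The sketch below is written under stated assumptions and is not taken from the talk: the `schema_version` field, the two upgrade steps, the renamed `sum` field, and the `RUB` default are all hypothetical.

```python
# Hypothetical upgrade steps: each maps an event at version N to version N + 1.
def _v1_to_v2(event):
    # Assumed change: version 2 renames "sum" to "amount".
    upgraded = {key: value for key, value in event.items() if key != "sum"}
    upgraded["amount"] = event["sum"]
    return upgraded


def _v2_to_v3(event):
    # Assumed change: version 3 adds "currency", defaulting to "RUB".
    return {**event, "currency": event.get("currency", "RUB")}


UPGRADES = {1: _v1_to_v2, 2: _v2_to_v3}
LATEST_VERSION = 3


def upgrade_event(event):
    """Bring an event to LATEST_VERSION by applying upgrade steps in order."""
    version = event.get("schema_version")
    if (
        isinstance(version, bool)
        or not isinstance(version, int)
        or not 1 <= version <= LATEST_VERSION
    ):
        raise ValueError(f"unsupported schema_version: {version!r}")
    current = dict(event)  # never mutate the consumed message
    while version < LATEST_VERSION:
        current = UPGRADES[version](current)
        version += 1
        current["schema_version"] = version
    return current
```

With this arrangement a producer can start emitting a new version at any time, and the warehouse side only needs one new upgrade step to keep loading both old and new events into the same narrow tables.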

Evgeny Vinogradov

Yandex.Money

Evgeny Vinogradov is the head of data warehouse development at Yandex.Money, where he and his team are responsible for data engineering, antifraud systems development, and business intelligence. Previously, he spent 20 years in IT development in areas ranging from outsourced CAD systems to fintech. He earned his PhD from the Applied Mathematics Department of Saint-Petersburg State University.