Sep 23–26, 2019

Scaling: data engineers

Evgeny Vinogradov (Yandex.Money)
4:35pm–5:15pm Thursday, September 26, 2019
Location: 1E 07/08
Secondary topics: Culture and Organization, Financial Services, Model Development, Governance, Operations

Who is this presentation for?

Data Science team management, Data Scientists, Data Engineers

Level

Intermediate

Prerequisite knowledge

There are no strong requirements; however, Data Engineering experience will be helpful.

What you'll learn

- Data Engineers can't be scaled the same way as regular backend teams.
- A Data Engineer should be trained as an Architect.
- The schema of your APIs and data is very important.
- If you use streaming (like Kafka), you have to create a special pipeline to deliver schema changes (this is a little easier with an API).
- If you want to train models, the Data Engineer has to provide a narrow dataset to the Data Scientist.

Description

We have dozens of product teams and over a hundred microservices, and the number is still growing.

On the one hand, this gives us agility: each Product Team is responsible for one or a few products and keeps up its own pace.

On the other hand, we need to consolidate and aggregate all of the data and manage to work with it.

This is where a number of issues arise.

  • Few people know all of the microservices together. Ideally, every Data Engineer would know how data from all of the different sources fits together, but in real life there is so much information that no Data Engineer knows the whole architecture.
  • Backend developers considerably outnumber Data Engineers, and they push a considerable number of changes to production.
  • There are strict SLA requirements for response time and update time, and users generally do not care that a data source somewhere in a service is unreliable: to our users, it is the Data Warehouse that is unreliable.
  • Data Scientists use a considerably different toolset than Data Engineers.
  • Product Teams do not always understand what data engineering is, which leads them to underestimate both the development time required and the importance of the quality of data produced by the backend.

How do we solve all of these issues on a management level?

  1. Training. We have an initial onboarding program for newcomers that includes an overview of all microservices. Then we have several types of training for each Data Engineering Team, according to the Product Team it is currently attached to. In effect, we require Architect-level knowledge from a Data Engineer.
  2. We require strong typing from data sources and backend developers (“Oh, we got this number as a string. Why should we check it? And what do we do if it doesn’t pass schema validation?”).
  3. We divide Data Engineers into teams, like product teams. Each Data Engineering Team works closely with one or several Product Teams and is responsible for integrating and verifying data from those Product Teams, as well as checking that new data aggregates with existing data. Other Data Engineering Teams can simply rely on the data already loaded.
  4. We have a Core Data Engineering team that is responsible for processing sensitive data (a small part of everything we get).
  5. Scouting: each Data Engineering Team member has to attend the meetings of their Product Team.
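
The strong-typing requirement in item 2 can be illustrated with a minimal Python sketch. This is not the speaker's implementation: the schema, the field names, and the validate and route helpers are all hypothetical, and quarantining failed records is only one assumed answer to the question of what to do when a record fails validation.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical event schema (an assumption): field name -> required Python type.
PAYMENT_SCHEMA = {
    "payment_id": str,
    "amount": Decimal,
    "created_at": datetime,
}


def validate(record, schema):
    """Return a list of violations; an empty list means the record passes."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"{field}: missing")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    return errors


def route(records, schema):
    """Split records into accepted rows and quarantined rows with reasons."""
    accepted, quarantined = [], []
    for record in records:
        errors = validate(record, schema)
        if errors:
            # Assumed policy: keep the bad record and the reasons for review
            # instead of silently coercing "10.50" into a number.
            quarantined.append((record, errors))
        else:
            accepted.append(record)
    return accepted, quarantined
```

With this sketch, a record whose amount arrives as the string "10.50" ends up in the quarantine list together with the reason, while well-typed records go on to loading.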

How do we solve these issues on a technical level?

  1. Every table with events has a timestamp/rowversion.
  2. Code conventions (the most important part is where and how to store unique identifiers).
  3. We have a lot of staging tables. We cluster them so that each user gets narrow tables and needs little knowledge of the underlying microservices' data structures. This means a lot of MERGE operations, but with rowversion and unique identifiers it works well.
  4. We unify tooling, first for Data Engineering teams and second for Data Scientist teams. For Data Engineers, it is SQL Server/Integration Services + Kafka/API. For Data Scientists, it is Python + SQL Server, plus Power BI/Reporting Services/Python for visuals.
  5. We would rather find a suitable tool for a Subject Matter Expert working as a Data Scientist than find a Data Scientist who knows our existing tooling.
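
Why rowversion and unique identifiers make frequent MERGE operations safe (items 1 to 3) can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the SQL Server MERGE statements the team runs: the row layout, the id and rowversion field names, and both helper functions are hypothetical.

```python
def merge_staging(target, staging_rows):
    """Upsert staging rows into target, keyed by unique identifier.

    A row is applied only if its rowversion is newer than the stored one,
    so replayed or out-of-order loads cannot overwrite fresher data.
    Returns the number of rows inserted or updated.
    """
    applied = 0
    for row in staging_rows:
        current = target.get(row["id"])
        if current is None or row["rowversion"] > current["rowversion"]:
            target[row["id"]] = row
            applied += 1
    return applied


def high_watermark(target):
    """Largest rowversion loaded so far; the next load can start after it."""
    return max((row["rowversion"] for row in target.values()), default=0)


# Usage: the second batch contains one newer row and one stale replay.
target = {}
merge_staging(target, [
    {"id": "u1", "rowversion": 1, "status": "new"},
    {"id": "u2", "rowversion": 2, "status": "new"},
])
merge_staging(target, [
    {"id": "u1", "rowversion": 3, "status": "paid"},   # newer: applied
    {"id": "u2", "rowversion": 1, "status": "stale"},  # older: ignored
])
```

Because each row carries a stable identifier and a monotonically increasing version, running the same merge twice leaves the target unchanged, which is what makes repeated staging loads tolerable.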

Evgeny Vinogradov

Yandex.Money

Graduate of the Applied Mathematics Department of Saint-Petersburg State University, PhD. He has spent the last 20 years in IT development in different areas, from CAD systems in outsourcing to fintech. In 2003 he joined the Yandex.Money team, where he and his team are responsible for Data Engineering, Anti-Fraud systems development, and Business Intelligence.

