Mar 15–18, 2020

Realistic synthetic data at scale: Influenced by, but not, production data

Mehul Sheth (Druva)
11:00am–11:40am Wednesday, March 18, 2020
Location: Expo Hall

Who is this presentation for?

Data engineers, data architects, developers




To have high confidence in a product, you must test it against a dataset that resembles production data. The challenge is generating test data that represents production. Production data isn’t predictable, doesn’t follow a simple formula, and is characterized by many variables. Broadly, test data falls into two categories: arbitrary, which is random and unstructured, and realistic, which follows patterns and is predictable and controlled. To generate realistic test data, you need to capture the right patterns by analyzing existing production data. Access to production data is often regulated and isn’t easy to obtain. You therefore need code that reads the relevant characteristics from production without exposing the actual data, and you need to update the models used to generate test data whenever required, so that the generated data represents production in the dimensions that matter to the product you’re testing.
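As a rough illustration of reading characteristics from production without exposing the data itself, the sketch below walks a directory tree and keeps only aggregate counts. It is not Druva’s implementation; the function name `profile_tree`, the power-of-two size buckets, and the choice of statistics are assumptions made for this example. File contents are never opened, and file and folder names are reduced to an extension before being counted.

```python
import os
from collections import Counter, defaultdict


def profile_tree(root):
    """Summarize a directory tree using aggregate statistics only.

    Hypothetical helper: no file content is read, and no file or
    folder name is stored in the returned profile.
    """
    size_buckets = Counter()             # size.bit_length() -> file count
    extensions = Counter()               # lowercased extension -> file count
    files_per_depth = defaultdict(list)  # depth -> file count of each folder
    dirs_per_depth = defaultdict(list)   # depth -> subfolder count of each folder

    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        files_per_depth[depth].append(len(filenames))
        dirs_per_depth[depth].append(len(dirnames))

        for name in filenames:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # skip unreadable entries such as broken symlinks
            # Bucket b > 0 covers sizes in [2**(b-1), 2**b); bucket 0 is empty files.
            size_buckets[size.bit_length()] += 1
            extensions[os.path.splitext(name)[1].lower()] += 1

    return {
        "size_buckets": dict(size_buckets),
        "extensions": dict(extensions),
        "files_per_depth": dict(files_per_depth),
        "dirs_per_depth": dict(dirs_per_depth),
    }
```

Only the returned dictionary would need to leave the production environment. Whether extension counts or size histograms are themselves sensitive depends on the dataset, so a real deployment would need its own review.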

Follow Mehul Sheth as he brings you along Druva’s path to generate test data at scale, which is highly influenced by production data and has “genes” of production data but not a single byte taken as-is from production. Druva’s journey and decisions may not be directly applicable in all scenarios, but Mehul highlights the company’s thought process, algorithms, and decisions. You’ll learn how to assess the model and tweak it so that it includes edge conditions, remains realistic, stays applicable at all times, and is versatile, repeatable, and easily controllable.

Specifically, Mehul describes a process for modeling a directory tree of files and folders using variables that may be important for the application under test, such as file size, the number of files and folders in each folder at each depth, patterns in file and folder names, and the ratio of different file types. He also explains how to apply this model to generate file sets of different sizes from completely random data while maintaining the relationships between the modeled variables. These datasets are random in raw format; however, they maintain the characteristics of the model and can be used for performance and stress-testing antivirus software, legal discovery software, or backup software. Extended further, the concept can be used to model any data and metadata, such as mailboxes or transactional databases.
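The generation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm presented in the session: the `MODEL` dictionary, its numbers, the `generate_tree` function, and the `scale` parameter are all hypothetical. The file payload is random bytes, while the counts, sizes, and type ratios are drawn from the model, and a fixed seed makes the output repeatable.

```python
import os
import random

# Hypothetical model; every value here is an illustrative assumption.
MODEL = {
    "max_depth": 3,
    "files_per_folder": {0: (2, 5), 1: (1, 4), 2: (0, 3)},  # depth -> (min, max)
    "dirs_per_folder": {0: (2, 3), 1: (1, 2), 2: (0, 0)},   # depth -> (min, max)
    "extension_weights": {".txt": 0.5, ".jpg": 0.3, ".pdf": 0.2},
    "size_range_bytes": {
        ".txt": (100, 4_000),
        ".jpg": (10_000, 50_000),
        ".pdf": (5_000, 20_000),
    },
}


def generate_tree(root, model, seed, scale=1.0):
    """Create a synthetic file tree under root and return [(path, size), ...]."""
    rng = random.Random(seed)  # seeded so the same inputs rebuild the same tree
    exts = list(model["extension_weights"])
    weights = list(model["extension_weights"].values())
    created = []

    def fill(folder, depth):
        os.makedirs(folder, exist_ok=True)
        lo, hi = model["files_per_folder"][depth]
        n_files = round(rng.randint(lo, hi) * scale)  # scale grows or shrinks the set
        for i in range(n_files):
            ext = rng.choices(exts, weights)[0]
            size = rng.randint(*model["size_range_bytes"][ext])
            # Random payload: nothing is copied from any real file.
            payload = rng.getrandbits(size * 8).to_bytes(size, "little") if size else b""
            path = os.path.join(folder, f"file_{i:04d}{ext}")
            with open(path, "wb") as f:
                f.write(payload)
            created.append((path, size))
        if depth + 1 < model["max_depth"]:
            lo, hi = model["dirs_per_folder"][depth]
            for j in range(rng.randint(lo, hi)):
                fill(os.path.join(folder, f"dir_{j:03d}"), depth + 1)

    fill(root, 0)
    return created
```

Calling `generate_tree("/tmp/synthetic", MODEL, seed=42)` twice into empty directories should produce the same layout and bytes both times, which is what makes a failing test run reproducible. Edge conditions, such as very deep folders or empty files, could be added by editing the model rather than the generator.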

What you'll learn

  • Learn how Druva models production data without exposing it and uses it to generate synthetic data for testing
  • Discover how to apply similar techniques in your applications

Mehul Sheth


Mehul Sheth is a senior performance engineer in the performance labs at Druva, where he’s responsible for the performance of the CloudApps product of Druva InSync. He has more than 13 years of experience in development and performance engineering, where he’s ensured production performance of thousands of applications. Mehul loves to tackle unsolved problems and strives to bring a simple solution to the table, rather than trying complex things.

