Test Data Generation for Databases

 

Populate Realistic, Referentially Correct Schema

Challenges


DBAs and data architects trying to populate test databases often rely on unsafe, and inadequate protection data. There are a number of problems with this approach, including:

Data Insufficiency

 Testing with production databases is not a good strategy for several reasons:

  1. Access to the database itself may be restricted because of the PII or other information within
  2. The database may not exist yet
  3. Current values in the database do not reflect the future scope of data that must be tested
  4. Even masked, the wrong data masking functions can render the test data unrealistic or still non-compliant

Data Realism

DBAs and application developers require test data that has the characteristic content, format, and relationships of future production data. This is the only way to verify that each application step and query will still work.

Concurrency & Security

Developers require data that are secure (i.e., do not violate privacy rules) and do not depend on other developers or phases of development to produce the data they need for their phase. Development may also need to be outsourced, and to occur in different locations concurrently.

Beyond test data realism and anonymity is the issue of availability. That is, another key challenge for database testing solutions is how the test data is provisioned for use. It must be accessible in the formats and silos (or DevOps pipelines) required, and remain consistent or restorable in the event of change.

Solutions

There are several database test data generation tools now on the market. The most flexible, rapid, and robust test data solutions are built with IRI RowGen software.

RowGen allows you to approach database test data management from whatever angle best first you use case. Its primary test data generation method is data synthesis; i.e., RowGen creates synthetic, but highly realistic database test data. Alternatively, IRI RowGen, FieldShield, or Voracity platform users can use their license for database subsetting (with or without data masking). Regardless of your strategy, the Eclipse UI, job metadata, and engine behind all this is the same.

The software works across all databases using the same easily defined, and reuseable test data generation rules. Your test data generation (or masking) rules typically apply consistently acros data classes (like columns), and leverage either random value generation or random value selection from other datasets. Once you approve the rules, you can automate their application in a fit-for-purpose wizard that builds your database test data job.

Before you run the wizard, think about some of the larger issues to consider as you scope out your test data requirements. See the series of articles on database test data management and test data provisioning in the IRI blog, starting here. Then, learn why the RowGen test data product -- or the IRI Voracity data management platform that includes it -- has been the best test data generation tool for achieving these goals since 2004.

No Production Data Needed

IRI RowGen software produces accurate, safe test data that reflects production database table formats, sizes, value ranges, and constraints - without production data. RowGen uses your DDL information to automatically build and load huge, structurally and referentially correct test tables fast. For details, see:

Blog > Test Data > RowGen Automates Database Test Data Generation

Blog > Test Data > How to Generate DB Test Data

Video > Populating a Database with Realistic Test Data from Scratch

Test Data Realism

Valid test tables must contain the full range of values, data types, row layouts, and primary-foreign key relationships that DB application rely on. For information on how RowGen preserves data realism, see:

Blog > Test Data > Making Test Data Realistic

Concurrent Development

By generating test data for each phase of development, you can create phases simultaneously, and independently from other phases. For example:

  • Step 1 - Read a personnel table and join health insurance claim data to make a status file
  • Step 2 - Read the status file and generate a list of doctors
  • Step 3 - Make a web-ready billing summary sorted by patient

RowGen can generate the input table for step 1, the status file for step 2, and the web report in step 3 -- all without needing real data or data from the other steps.

Secure Development

RowGen gives you the ability to synthesize any type, size, layout, and amount of safe test data so you can create and provide for a more secure development environment.

Alternatively, IRI FieldShield allows you to use real data that is protected with field-level functions (like encryption) on a need-to-see basis. For details, see Solutions > Data Masking.

Both tools co-exist in the IRI Workbench GUI which also features automatic DB subsetting with masking. Either way, with safe data, there is no need to certify or bond your outsourced application developers.

Seamless Transition to Production

RowGen shows you what the real data can look like, and what the transformation and reporting application can look like. That's because both RowGen and the SortCL program in IRI CoSort and Voracity use the same metadata to define data manipulations and table layouts.

This means that RowGen can also perform the same data transformations and reporting functions you normally would in the same I/O pass and job script doing the test data generation. It also means that the same layouts (and even transformations) created for data synthesis jobs is immediately ready for data integration, masking, migration, reporting, and other data processing jobs if and when real data becomes available.

If you only use RowGen, you can easily upgrade to CoSort or Voracity to transform and report on the real data (when it becomes available). Using the same job script RowGen used to define and transform test data, you can transform real data in the same format.

Virtualization and Provisioning Options

IRI test data builders can furnish testers with test data in multiple ways, including:

  1. Ad hoc tables (or files in multiple formats) in persistent or federated (view) schema
  2. Golden copies of test data in existing lower environments, plus the ability to create new tables
  3. During the cloning of databases in tools like Windocks, Actifio (Google) and Commvault
  4. From within DevOps pipelines like Amazon CodePipeline, Azure DevOps, Jenkins, GitLab, etc.
  5. Directly through test data management (TDM) hubs from Cigniti or ValueLabs

See the DevOps and Virtual Test Data pages in this section to learn more and link to more information on these options. Compare these methods and costs to more expensive and proprietary solutions that require special drivers or standing up ephemeral databases.