Scrub / Cleanse


Clean and Transform Big Data in the Same Pass


Data cleansing can be complicated, time-consuming, and expensive. The functions you write in 3GL, shell scripts, or SQL procedures may be complex and hard to maintain. They may not satisfy all of your business rules or do the whole job.


Custom functions may also run in separate batch steps, or in a special "script transform component" that you must connect to your tool's data flow and run in smaller chunks. That is a problem when data volumes grow.


Dedicated data quality tools, on the other hand, can perform much of this work. Unfortunately, they are not especially efficient at volume, and can be difficult to configure or modify. They may also be functional overkill at significant cost. Sometimes the best solution is not the biggest.


The SortCL program in IRI CoSort or IRI Voracity can find and scrub data in more than 125 table and file sources. SortCL uses a simple 4GL and Eclipse GUI to define data, manipulations, and targets down to the field level.

Native data quality functions built into SortCL -- which you can run on their own, or combine with its data transformation, migration, protection, and reporting activities -- include:

  • de-duplication
  • character validation
  • data homogenization
  • value find (scan) and replace
  • horizontal and conditional vertical selection
  • data structure (format) definition and evaluation
  • detection and flagging of data changes and logic problems
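
A rough Python sketch (not SortCL syntax, which is declarative 4GL) of what several of these functions amount to when combined in a single pass -- de-duplication, character validation, and value find-and-replace -- might look like this; the field names, rows, and rules are hypothetical:

```python
import re

# Hypothetical rows: (id, name, phone) records from a source table or file
rows = [
    ("1001", "Acme Corp", "555-0100"),
    ("1001", "Acme Corp", "555-0100"),   # exact duplicate to drop
    ("1002", "Bçta Ltd",  "555-0199"),   # non-ASCII character to flag/fix
    ("1003", "Gamma Inc", "N/A"),        # placeholder value to replace
]

seen = set()
clean = []
for row in rows:
    if row in seen:                       # de-duplication
        continue
    seen.add(row)
    rec_id, name, phone = row
    if not name.isascii():                # character validation
        name = name.encode("ascii", "replace").decode()
    phone = re.sub(r"^N/A$", "", phone)   # value find (scan) and replace
    clean.append((rec_id, name, phone))
```

The point of the sketch is the single loop: each record is read once and all the cleansing rules are applied in that one pass, which is the same efficiency argument SortCL makes at volume.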

SortCL also supports the definition of custom data formats through template definitions. This allows for format scanning and verification.

For advanced data cleansing (based on complex business rules) at the field level, plug in your own functions or those in data quality vendor libraries. The CoSort documentation refers to examples from Trillium and the Melissa Data address standardization library. Declare a cleansing function for any field in either the pre-action layout or target phases of a job (i.e., up to two DQ routines per field, per job).
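
As a loose analogy in Python (the actual mechanism is a SortCL field-level function declaration; these routine names are invented for illustration), applying one routine to a field on input and another on output looks like:

```python
# Hypothetical field-level DQ routines: one applied in the input
# (pre-action) phase, one in the output (target) phase of a job.
def trim_whitespace(value: str) -> str:      # input-phase routine
    return value.strip()

def standardize_case(value: str) -> str:     # output-phase routine
    return value.title()

def cleanse_field(value, input_routine, output_routine):
    # Up to two DQ routines per field, per job: input first, then output
    return output_routine(input_routine(value))

name = cleanse_field("  jane q. public ", trim_whitespace, standardize_case)
```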

The bottom line? With CoSort SortCL -- plus any specialized data quality libraries you add -- you can cleanse your data in the same I/O pass in which you filter, transform, secure, report on, or hand off that data.

If you need to find and scrub PII like SSNs in your datasets, SortCL will do this, too, as will the standalone IRI FieldShield data masking tool. If you need high-quality test data, check out IRI RowGen. RowGen uses SortCL metadata to build intelligent test data that conforms to your business rules, so you can test with realistic but safe data: good, bad, and null values.
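
For illustration only (SortCL and FieldShield have their own built-in masking functions; this is a generic regex sketch in Python), finding SSN-shaped values in text and redacting all but the last four digits could look like:

```python
import re

# Matches the common NNN-NN-NNNN SSN layout; real scanners use stricter rules
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scrub_ssns(text: str) -> str:
    # Keep only the last four digits of each SSN-shaped match
    return SSN_PATTERN.sub(lambda m: "XXX-XX-" + m.group()[-4:], text)

masked = scrub_ssns("Applicant 123-45-6789 approved; spouse 987-65-4321.")
```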