Data Transformation = Addressing Value-Level Issues

Addressing Value-Level Issues

Format Masking, Surrogate Keys, Constants, etc.

Challenges

There are many lower-level, data-value-related problems that ETL architects face, such as:

*Issue*	*Examples or Details*
Applying format masks to represent or re-interpret values	ID'ing gender, 'F' or '2' or masking dates: mm/dd/yy to dd-mm-yyyy
Homogenizing inconsistent production keys	assigning unique numeric surrogates for recognition and faster indexing
Pivoting and unpivoting	transposing columns to rows and vice versa without duplicates
Inserting constants or literal strings	replacing or adding substitute values to protect or pre/re-label data
Setting or labeling values	evaluating a condition to determine a default label like 'NULL'
Validating values	determining if the inbound field is printable, or of a certain data type
Re-casting values to conform to reference standards	unifying different data representations (like country names)
Slowly changing dimensions in source tables	propagating new values (like a price change) in warehouse tables

Solutions

The IRI Voracity data management platform and the IRI CoSort package both address all the above data handling issues, plus:

using a single 4GL and program called SortCL, which is also supported in their Eclipse GUI called IRI Workbench.

With SortCL, you can:

Mask dates and other data formats with new composite values, and create new, re-usable data types
Re-assign source key names to adhere to new ontologies
Pivot (denormalize) for efficient queries, and unpivot (normalize) before loading data to the warehouse
Use in-line 'data' statements to add strings to any location, and with custom repetition counts
Use conditional data or field logic to label qualifying values, like inserts vs. deletes vs. updates, etc.
Use supplied data validation functions or third-party libraries to assess data characteristics or quality
Use integrated Perl Compatible Regular Expression (PCRE) logic to match and re-assert data patterns
Handle and report on multiple types of slowly changing dimensions and other changed data

IRI software like CoSort and RowGen (test data generation) help you generate unique initial or surrogate primary keys with features like:

SEQUENCER, a specially named index field with custom starting and increment values
random data generation, to create random numbers or strings, which can also be sorted and de-duplicated
ROWID, a field with any name but specified as this attribute with features beyond SEQUENCER
UUID (and GUID), a field-level transformation/generation function that can be used in any part of a job