Addressing Value-Level Issues

 

Format Masking, Surrogate Keys, Constants, etc.

Challenges
 
There are many lower-level, data-value-related problems that ETL architects face, such as:
 
 

Issue

Examples or Details

Applying format masks to represent or re-interpret values

ID'ing gender, 'F' or '2' or masking dates: mm/dd/yy to
dd-mm-yyyy

Homogenizing inconsistent production keys

assigning unique numeric surrogates for recognition and faster indexing

Pivoting and unpivoting

transposing columns to rows and vice versa without duplicates

Inserting constants or literal strings

replacing or adding substitute values to protect or pre/re-label data

Setting or labeling values

evaluating a condition to determine a default label like 'NULL'

Validating values

determining if the inbound field is printable, or of a certain data type

Re-casting values to conform to reference standards

unifying different data representations (like country names)

Slowly changing dimensions in source tables

propagating new values (like a price change) in warehouse tables

Solutions

 

The IRI Voracity data management platform and the IRI CoSort package both address all the above data handling issues, plus:

  • other data transformations
  • data type and format conversions
  • basic and advanced report generation
  • field-level data protection

using a single 4GL and program called SortCL, which is also supported in their Eclipse GUI called IRI Workbench.

 

With SortCL, you can:

  • Mask dates and other data formats with new composite values, and create new, re-usable data types
  • Re-assign source key names to adhere to new ontologies
  • Pivot (denormalize) for efficient queries, and unpivot (normalize) before loading data to the warehouse
  • Use in-line 'data' statements to add strings to any location, and with custom repetition counts
  • Use conditional data or field logic to label qualifying values, like inserts vs. deletes vs. updates, etc.
  • Use supplied data validation functions or third-party libraries to assess data characteristics or quality
  • Use integrated Perl Compatible Regular Expression (PCRE) logic to match and re-assert data patterns
  • Handle and report on multiple types of slowly changing dimensions and other changed data


IRI software like CoSort and RowGen (test data generation) help you generate unique initial or surrogate primary keys with features like:

  1. SEQUENCER, a specially named index field with custom starting and increment values
  2. random data generation, to create random numbers or strings, which can also be sorted and de-duplicated
  3. ROWID, a field with any name but specified as this attribute with features beyond SEQUENCER
  4. UUID (and GUID), a field-level transformation/generation function that can be used in any part of a job