Transform Huge Files Faster


Smart Data Architects Keep it Simple

Flat, or sequential, files are a convenient, common format for data exchange. Because they carry little to no storage or access overhead, they are also one of the fastest source formats for data access, manipulation, loading, and reporting. That is one reason why the IRI FACT (Fast Extract) product exists - to produce flat files from big database tables.

 

If you store or use data in fixed or delimited file formats, you may already know that IRI CoSort continues been the fastest data transformation tool for flat files in Unix, Linux and Windows file systems. In 1992, IRI built a popular 4GL for CoSort called SortCL to perform multiple data transformations and generate reports, all in the same job script and I/O pass through large files.

 

SortCL is still available to CoSort package users, and is also the default data manipulation engine for flat files (and other data sources) in the full IRI Voracity ETL package.

Voracity is a total data management platform built on Eclipse -- and powered by CoSort or Hadoop -- that consolidates data discovery, integration, migration, governance and analytics. Flat-file data movement, manipulation, and management activities are best served in this visual design, and deployment environment.

More on Flat-File ETL in CoSort or Voracity

Start with the flat-file profiler in the free IRI Workbench GUI for ETL, built on Eclipse™. The flat-file profiling and metadata definition wizards produce statistics, search for values matching patterns or strings, and create metadata for use in your transformation, reporting, masking, and other jobs.

As you run SortCL-based transformations in CoSort wizards -- or full ETL workflows in Voracity wizards or diagrams -- you can re-use auto-created data definition file (.DDF) metadata for your sources. These files contain field names and attributes which serve as symbolic references in the underlying SortCL scripts that specify the source-target mappings and custom report layouts.

The SortCL job scripts or XML workflows (containing those scripts) that perform the ETL can run on the command line, from batch (shell) scripts, or from within the IRI Workbench ... either ad hoc or on defined schedules. Voracity users can also preview mapping results before executing the whole workflow.

Whether you define ETL jobs with CoSort (.scl) or Voracity (.flow) jobs, you'll benefit from task consolidation, and proven file system I/O, memory, and multi-core optimization techniques. Remove the overhead of high-volume transformation from DB and BI layers, and preclude the need for more hardware, in-memory DBs, and even Hadoop.

In addition to outstanding runtime performance, simple 4GL metadata speeds job creation and modification, and is much easier to learn and program than 3GL, PL/SQL, shell scripts, and ETL tools.

What You Can Do

 In one job script (and I/O pass), you can:
  • Input one or more large sequential sources
  • Run multiple transforms (filter, sort, join, etc.)
  • Compare files to capture changes and BI
  • Re-map, re-format, and pivot fields
  • Create segmented, customized reports
  • Convert data types and file formats
  • Mask or unmask PII at the field level
  • Generate test data in one or more file formats
  • Output to multiple targets simultaneously, including custom-formatted reports

This diagram illustrates SortCL's capabilities in the CoSort product for flat-file transformation, conversion, protection, and reporting -- all in the same pass.


This diagram illustrates SortCL's capabilities in the Voracity platform. The same single-pass capabilities are available here, but in a total data management (curation) environment.


In Voracity you can mash-up flat-file data with data in relational databases, dark (document) data, legacy, and "modern" big data (Hadoop, NoSQL, cloud and SaaS) platforms. And, you can use it to mask sensitive file data at the field level, while you manipulate and munge it for BI and analytic tools.