Tuesday, May 04, 2010

Part 2: Comparing Numerics in Pentaho Data Integration

As a followup to my previous post about comparing numeric values, I've since discovered a little more about the problem. To repeat my original problem: certain numeric field values that should be equal are being detected as different in the Filter rows step. I think it's important to be able to perform accurate comparisons since it is a frequent task in data quality analysis.

Originally, I assumed this had something to do with jdbc. However, since I can re-produce the issue without any SQL, I'm sure this has nothing to do with the version of the MySQL Connector/J jdbc driver. I tried the 5.0.8 version of the driver and I observed the same behavior. I couldn't even get my transform to work correctly with the 5.1.12 version of the connector -- it does not recognize column aliases in my SQL query.

Now for the rest of the story:

My comparison of numeric data was between 2 fields from two data streams, initiated by two separate SQL table inputs.
  • The first data stream is from a "raw" table. From the input step it is passed through a "sort rows" and then a "group by" step to aggregate the numeric values.
  • The second data stream is from a "rollup" table where the raw data is summarized.

The two streams are then merged ( by a unique id) and compared in order to validate the data in the rollup.

At this point, the problem seems more related to the metadata of the fields. I found two resolutions to choose from:

Use "group by" in SQL. The data types of the output numerics are magically set to BigNumber.


Place a "Select/Rename Values" step after the sort and before the "Group By" step to coerce the metadata of the fields to be of BigNumber type.

Personally, I prefer the second option because I like to extract the data as quickly as possible. Pentaho can handle sorting and grouping of somewhat large datasets just fine.

I have a new example transform here: numeric_compare_filter_values_try2.ktr

Here is a picture of the example transform:

No comments: