Wednesday, May 12, 2010

Insert...Select...On Duplicate key with timestamps

In the past, I've joked with colleagues about the trap doors hidden in the TIMESTAMP column, and today I fell through one of them.

I think there are some advantages to using timestamps, especially in large tables as part of a key. If you are not too worried about its limited range, the meager 4 bytes it uses can be an advantage.

But there are some issues with this datatype, since it also comes with some default behaviors to watch out for.

Today I was adding in some data from a stage table to an aggregation table using INSERT...SELECT...ON DUPLICATE KEY UPDATE. In my case, both tables had identical structure, including column names.

I've worked up an example of todays puzzle. Here is the prerequisite table definitions and pre-population:

create table jd_stage (
customer_id int not null ,
order_date timestamp not null ,
order_amount decimal(10,4) not null ,
primary key ( customer_id , order_date )
) ;

create table jd_agg (

customer_id int not null ,

order_date timestamp not null ,
order_amount decimal(10,4) not null ,

primary key ( customer_id , order_date )

) ;


insert into jd_agg
values
( 1 , "2010-05-01 09:00:00" , 49.25 ) ,

( 1 , "2010-05-01 10:00:00" , 75.50 ) ,

( 2 , "2010-05-01 09:00:00" , 95.40 )
;

insert into jd_stage values

( 1 , "2010-05-01 09:00:00" , 50.75 ) ,

( 1 , "2010-05-01 10:00:00" , 24.50 ) ,
( 2 , "2010-05-01 09:00:00" , 4.60 );


In this example, I want to aggregate the data in jd_stage to jd_agg, and I've set the data up so that the final result will be 100.00 for the order_amounts.

My first attempt failed because of how I composed the query, which seems innocent enough:

insert into jd_agg ( customer_id, order_date , order_amount )
select customer_id , order_date , order_amount
from jd_stage
on duplicate key update
order_amount = order_amount + values(order_amount) ;

My punishment for this was :

ERROR 1052 (23000) at line 39: Column 'order_amount' in field list is ambiguous

So, I resolved this by trying....

insert into jd_agg ( customer_id, order_date , order_amount )
select customer_id , order_date , order_amount
from jd_stage
on duplicate key update
jd_agg.order_amount = jd_agg.order_amount + values(order_amount) ;


And my reward was:

ERROR 1062 (23000) at line 49: Duplicate entry '1-2010-05-12 20:50:15' for key 1

What? I don't have any data with the date/time of "2010-05-12 20:50:15". Plus, that just so happened to be the time that I executed the query.

Once again, fooled by timestamps.

According to MySQL Scripture, which I've read a hundred times, but somehow seem to keep forgetting: If you don't want automatic initialization or updating of a timestamp column, then set the default to 0. So I should have done this:

create table jd_agg (
customer_id int not null ,

order_date timestamp not null
DEFAULT 0,
order_amount decimal(10,4) not null ,
primary key ( customer_id , order_date )
) ;

Then the second form of the insert, shown above, is error free:

select * from jd_agg ;
+-------------+---------------------+--------------+
| customer_id | order_date | order_amount |
+-------------+---------------------+--------------+
| 1 | 2010-05-01 09:00:00 | 100.0000 |
| 1 | 2010-05-01 10:00:00 | 100.0000 |
| 2 | 2010-05-01 09:00:00 | 100.0000 |
+-------------+---------------------+--------------+

(sorry for formatting...blogger is not cooperating).

So, what do you do if the evil DBA won't let you add that DEFAULT? I found that adding the order_date in the "update" section also works:

insert into jd_agg ( customer_id, order_date , order_amount )
select customer_id , order_date , order_amount
from jd_stage

on duplicate key update

jd_agg.order_date = jd_agg.order_date
,

jd_agg.order_amount = jd_agg.order_amount + values(order_amount)
;

Tuesday, May 04, 2010

Part 2: Comparing Numerics in Pentaho Data Integration

As a followup to my previous post about comparing numeric values, I've since discovered a little more about the problem. To repeat my original problem: certain numeric field values that should be equal are being detected as different in the Filter rows step. I think it's important to be able to perform accurate comparisons since it is a frequent task in data quality analysis.

Originally, I assumed this had something to do with jdbc. However, since I can re-produce the issue without any SQL, I'm sure this has nothing to do with the version of the MySQL Connector/J jdbc driver. I tried the 5.0.8 version of the driver and I observed the same behavior. I couldn't even get my transform to work correctly with the 5.1.12 version of the connector -- it does not recognize column aliases in my SQL query.

Now for the rest of the story:

My comparison of numeric data was between 2 fields from two data streams, initiated by two separate SQL table inputs.
  • The first data stream is from a "raw" table. From the input step it is passed through a "sort rows" and then a "group by" step to aggregate the numeric values.
  • The second data stream is from a "rollup" table where the raw data is summarized.

The two streams are then merged ( by a unique id) and compared in order to validate the data in the rollup.

At this point, the problem seems more related to the metadata of the fields. I found two resolutions to choose from:

Use "group by" in SQL. The data types of the output numerics are magically set to BigNumber.

or

Place a "Select/Rename Values" step after the sort and before the "Group By" step to coerce the metadata of the fields to be of BigNumber type.

Personally, I prefer the second option because I like to extract the data as quickly as possible. Pentaho can handle sorting and grouping of somewhat large datasets just fine.

I have a new example transform here: numeric_compare_filter_values_try2.ktr

Here is a picture of the example transform: