Remove Duplicate Rows Using Kettle PDI


Quick tip showing how to use the Unique Rows Kettle step to remove duplicate rows from a CSV text file.

1) Sort the rows on the key field using the Sort Rows step.

2) Use the Unique Rows step to remove the duplicates. (A code sketch of this two-step logic follows the sample data below.)

Sample Input Data:

CUSTOMER_ID,CUSTOMER_NAME,CUSTOMER_CITY

100,UMA,CYPRESS
100,UMA,CYPRESS
101,POOJI,CYPRESS
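
Before walking through the steps in Spoon, here is a minimal Python sketch of what the transformation does with this data: sort on the key field (CUSTOMER_ID), then keep only the first row of each run of identical keys. The file names are placeholders for this example, and we assume, as in the configuration below, that the file holds data rows only.

import csv

# Read the comma-separated input (data rows only, no header row).
with open("customers.csv", newline="") as f:
    rows = list(csv.reader(f))

# Sort Rows equivalent: order the stream on the key field (column 0).
rows.sort(key=lambda r: r[0])

# Unique Rows equivalent: keep a row only when its key differs from the
# previous row's key, which works because the input is already sorted.
unique = [row for i, row in enumerate(rows)
          if i == 0 or rows[i - 1][0] != row[0]]

with open("customers_unique.csv", "w", newline="") as f:
    csv.writer(f).writerows(unique)

Running this on the sample above yields the same two rows shown in the output at the end of this post.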

Click on the Text File Input step and fill in the fields as shown in the screen capture.

We are reading a comma-separated file without a header row. Check the highlighted options and set them according to your input.

If you want to trim the incoming string fields, make sure you do not specify a length for those fields; if a length is specified, the trim function will not work.

Next, we need to configure the Sort Rows step.

You can define a temp directory in case the sort step needs scratch space, and, depending on your system memory, you can specify the number of rows to sort in memory. If the row count exceeds that limit, or memory is not available, the step uses the specified scratch space. The sketch below illustrates this spill-to-disk idea.
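
To make the scratch-space setting concrete, here is a rough Python sketch of external sorting: keep at most a fixed number of rows in memory, spill each sorted chunk to a temp file, and merge the chunks at the end. This is only a conceptual illustration of the technique, not PDI's actual implementation, and all the names are made up for the example.

import csv, heapq, tempfile

def external_sort(reader, key, max_rows_in_memory, temp_dir=None):
    # Sort rows while holding at most max_rows_in_memory in memory,
    # spilling sorted chunks to temp files in temp_dir (the "scratch space").
    chunks = []
    buffer = []
    for row in reader:
        buffer.append(row)
        if len(buffer) >= max_rows_in_memory:
            buffer.sort(key=key)
            tmp = tempfile.TemporaryFile("w+", dir=temp_dir, newline="")
            csv.writer(tmp).writerows(buffer)
            tmp.seek(0)
            chunks.append(csv.reader(tmp))
            buffer = []
    buffer.sort(key=key)          # last (possibly partial) chunk stays in memory
    chunks.append(iter(buffer))
    # Merge all sorted chunks into one sorted stream.
    return heapq.merge(*chunks, key=key)

# Example: sort a large CSV on its first column with 50,000 rows in memory.
# sorted_rows = external_sort(csv.reader(open("big.csv", newline="")),
#                             key=lambda r: r[0], max_rows_in_memory=50000)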

Depending on your requirements, you can capture the duplicate rows in an error file by checking the Redirect duplicate row option. Also note the warning message: the Unique Rows step requires sorted input; otherwise you will not get the desired results.
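
Conceptually, Unique Rows walks the stream and compares each row with the previous one; with duplicate redirection enabled, the repeats are sent to a second stream instead of being dropped. A minimal sketch of that behavior (the function name and stream labels are illustrative):

def unique_rows(sorted_rows, key):
    # Yield ("unique", row) for the first row of each key and
    # ("duplicate", row) for the repeats. Input must be sorted on the
    # key, because only consecutive rows are compared.
    prev_key = object()  # sentinel that matches no real key
    for row in sorted_rows:
        k = key(row)
        if k == prev_key:
            yield "duplicate", row   # would go to the error file
        else:
            yield "unique", row      # main output
            prev_key = k

Because only consecutive rows are compared, unsorted input lets non-adjacent duplicates slip through, which is exactly what the warning message is about.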

We use the Text File Output step to write the results to a text file.
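
You can run the transformation from Spoon, or from the command line with PDI's Pan tool, for example pan.sh -file=remove_duplicates.ktr (the .ktr file name here is just a placeholder for wherever you saved the transformation).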

After executing the transformation, here is the output:

100,UMA,CYPRESS
101,POOJI,CYPRESS

As you can see, only the unique rows are written to the output file.
