Remove Duplicate Rows Using Kettle PDI


Quick tip showing how to use the Unique Rows Kettle step to remove duplicate rows from a CSV text file.

1) Sort the rows on the key field using the Sort Rows step.

2) Use the Unique Rows step to remove the duplicates.

Sample Input Data:

CUSTOMER_ID,CUSTOMER_NAME,CUSTOMER_CITY

100,UMA,CYPRESS
100,UMA,CYPRESS
101,POOJI,CYPRESS

Click on the input file step and fill in the settings as shown in the screen capture.

We are reading a comma-separated file without a header row. Check the highlighted options and set them according to your input.

If you want to trim the incoming string fields, make sure you do not specify a length for the string field. If a length is specified, the trim function will not work.

Next, we need to configure the Sort Rows step.

You can define a temp directory in case the sort step needs scratch space and, depending on the system memory, specify the number of rows to keep in memory. If the row count exceeds that number, or memory is not available, the step uses the specified scratch space.
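To make the memory and scratch-space trade-off concrete, here is a minimal Python sketch of the general idea (an external merge sort). It is not how Kettle implements the Sort Rows step; the function name `sort_rows` and the parameters `max_rows_in_memory` and `temp_dir` are hypothetical names chosen for illustration.

```python
import csv
import heapq
import tempfile
from itertools import chain, islice

_END = object()  # sentinel marking an exhausted input


def sort_rows(rows, key, max_rows_in_memory=1000, temp_dir=None):
    """Sort rows, spilling sorted chunks to temp files when they do not fit.

    Hypothetical sketch: rows are lists of strings, as csv.reader yields them.
    """
    rows = iter(rows)
    chunk = sorted(islice(rows, max_rows_in_memory), key=key)
    lookahead = next(rows, _END)
    if lookahead is _END:
        return chunk  # everything fit in memory, no scratch space needed

    # More rows than the limit: write each sorted chunk to scratch space.
    rows = chain([lookahead], rows)
    spill_files = []
    try:
        while chunk:
            spill = tempfile.TemporaryFile("w+", newline="", dir=temp_dir)
            csv.writer(spill).writerows(chunk)
            spill.seek(0)
            spill_files.append(spill)
            chunk = sorted(islice(rows, max_rows_in_memory), key=key)
        # Merge the sorted chunks back into one sorted sequence.
        # A real step would stream these rows onward instead of building a list.
        readers = [csv.reader(f) for f in spill_files]
        return list(heapq.merge(*readers, key=key))
    finally:
        for f in spill_files:
            f.close()


sample = [
    ["101", "POOJI", "CYPRESS"],
    ["100", "UMA", "CYPRESS"],
    ["100", "UMA", "CYPRESS"],
]
# With a limit of 2 rows, the third row forces a spill to temp files.
sorted_sample = sort_rows(sample, key=lambda r: r[0], max_rows_in_memory=2)
```

A small in-memory limit keeps memory use low at the cost of more disk I/O, which is the same trade-off you make when tuning the sort step.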

Depending on your requirements, you can capture duplicate rows in an error file by checking the Redirect Duplicate Row option. Also note the warning that appears: the Unique Rows step requires sorted input, otherwise you will not get the desired results.
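The reason sorted input matters is easier to see in code. The sketch below assumes the step only compares each row with the one immediately before it, which is why duplicates must be adjacent. It is a simplified illustration in Python, not Kettle's actual code, and `unique_rows` is a hypothetical name; the second return value stands in for the redirected duplicate rows.

```python
def unique_rows(rows, key=None):
    """Split rows into (unique, duplicates) by comparing adjacent rows only."""
    unique, duplicates = [], []
    previous = object()  # sentinel that never equals a real key
    for row in rows:
        current = key(row) if key else row
        if current == previous:
            duplicates.append(row)  # what a redirect option would capture
        else:
            unique.append(row)
            previous = current
    return unique, duplicates


unsorted_rows = [
    ("100", "UMA", "CYPRESS"),
    ("101", "POOJI", "CYPRESS"),
    ("100", "UMA", "CYPRESS"),
]

# Unsorted input: the two identical rows are not adjacent, so both survive.
kept_unsorted, dups_unsorted = unique_rows(unsorted_rows)

# Sorted input: the identical rows sit next to each other and one is removed.
kept_sorted, dups_sorted = unique_rows(sorted(unsorted_rows))
```

With the unsorted input all three rows are kept. After sorting, only the rows for customers 100 and 101 remain and the extra row for customer 100 lands in the duplicates list.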

We use a text file output step to write the output.

After executing the transformation, here is the output.

100,UMA,CYPRESS
101,POOJI,CYPRESS

As you can see, only the unique rows are written to the output file.


Pentaho Customer Success Update


(Post courtesy of Rebecca Pentaho)

We are lucky to have such awesome customers at Pentaho from all verticals such as retail, healthcare, travel, finance and education. Here is a list of a few of our newest customer stories.

BeachMint – Innovative venture capital-backed online retail company built from the ground up with Pentaho. This is an interesting story about how company-wide access to analytics can drive operational excellence, improve customer experience and maximize revenues. They are a big data example, using Kettle with HP Vertica.

Oklahoma School of Community Medicine – This is an innovative story about medical students using Pentaho to help change the way they treat patients. What makes this story unique is that the driver for analytics is the head of the medical program (very non-technical).

Lufthansa – Largest European airline by number of passengers and revenue, relies on Pentaho to monitor one of its most important core processes, the handover of passenger data between different airlines. Read the full case study in English and German.

The Lewis Group – Leading UK debt collection agency rolled out Pentaho to deliver performance management dashboards to help improve operational efficiency. They are now analyzing data daily through reports at the client, department and individual level.

Pharos Resources – SaaS provider of retention intelligence for higher education colleges and universities, embedded Pentaho Business Analytics into its Pharos Insight offering and went to market in just four weeks. With enhanced analytics using Pentaho, Pharos has leapfrogged competitors, expanded into new markets, seen 25% revenue growth and a great increase in customer ROI. Watch the customer webcast: Pentaho to Go to Market with Higher Education Student Retention Product in Four Weeks and read the case study.

You can view all Pentaho customer success stories here: http://www.pentaho.com/customers/success-stories/