#Pentaho Data Integration new feature: Data sets #Kettle

Data sets will soon be included in Pentaho Data Integration. Check these videos to see how they work.

A first stab at using data sets to facilitate development of re-usable transformations, mappers, reducers, combiners, …

Introducing golden data unit tests


Book Review: Pentaho Data Integration Cookbook – Second Edition

Pentaho Data Integration Cookbook, Second Edition picks up where the first edition left off, by updating the recipes to the latest edition of PDI and diving into new topics such as working with Big Data and cloud sources, and more.


Book review by: David Fombella Pombal (twitter: @pentaho_fan)

Book Title: Pentaho Data Integration Cookbook – Second Edition

Authors: Alex Meadows, Adrián Sergio Pulvirenti, María Carina Roldán

Paperback: 462 pages

I would like to suggest this useful book since it shows us how to take advantage of all the aspects of Kettle through a set of practical recipes organized to find quick solutions to our everyday needs. Although this book covers advanced topics, all recipes are explained step by step in order to help all types of readers.

Target Audience
If you are a software developer, data scientist, or anyone else looking for a tool that will help extract, transform, and load data as well as provide the tools to perform analytics and data cleansing, then this book is for you.

Rating: 9 out of 10

Chapter 1, Working with Databases – 15 recipes

This chapter shows us how to work with relational databases in Kettle. The recipes show us how to create and share database connections, perform typical database functions (select, insert, update, and delete), as well as more advanced tricks such as building and executing queries at ETL runtime. Remember that in Kettle you can connect to MySQL, Oracle, SQL Server, PostgreSQL, DB2, and nearly all the other database engines available.

Chapter 1: Inserting new records when PK has to be generated based on previous values (transformation)

Chapter 2, Reading and Writing Files – 15 recipes

This chapter not only shows us how to read and write files (CSV, TXT, Excel …), but also how to work with semi-structured files and read data from Amazon Web Services S3 instances.

Chapter 2: Loading data into an AWS S3 Instance (transformation)

Chapter 3, Working with Big Data and Cloud Sources – 8 recipes

This third chapter covers how to load and read data from some of the many different NoSQL data sources (MongoDB, HBase, Hadoop …) as well as from Salesforce.com. I would like to highlight this part of the book, given how important Big Data techniques are nowadays.

Chapter 3: Loading data into HBase (transformation)

Chapter 4, Manipulating XML Structures – 10 recipes

This chapter shows us how to read, write, and validate XML files. Simple and complex XML structures are shown, as well as more specialized formats such as RSS feeds. Even an HTML page is generated using XML and XSL transformations. Read this chapter carefully if you regularly load, read, update, or validate XML files.

Chapter 4: Generating an HTML page using XML and XSL sources (transformation)

Chapter 5, File Management – 9 recipes

This chapter demonstrates how to copy, move, transfer, and encrypt files and directories. Here you will learn how to get data from remote FTP servers, zip files, and encrypt files using the OpenPGP standard.

Chapter 5: Encrypting and decrypting files (transformation)

Chapter 6, Looking for Data – 8 recipes

This chapter shows you how to search for information through various methods via databases, web services, files, and more. It also shows you how to validate data with Kettle’s built-in validation steps. In the last recipe you will learn how to validate data at runtime.

Chapter 6: Validating data at runtime (transformation)

Chapter 7, Understanding and Optimizing Data Flows – 12 recipes

This chapter details how Kettle moves data through jobs and transformations and how to optimize data flows (processing jobs in parallel, splitting a stream into two or more, comparing streams …).

Chapter 7: Run transformations in parallel (job)

Chapter 8, Executing and Re-using Jobs and Transformations – 9 recipes

This chapter shows us how to launch jobs and transformations in various ways through static or dynamic arguments and parameterization. Object-oriented transformations through subtransformations are also explained.

Chapter 8: Moving the reusable part of a transformation to a sub-transformation (Mapping)

Chapter 9, Integrating Kettle and the Pentaho Suite – 6 recipes

This chapter works with some of the other tools in the Pentaho suite (BI Server, Report Designer) to show how combining tools provides even more capabilities and functionality for reporting, dashboards, and more. In this part of the book you will create Pentaho reports from PDI, execute PDI transformations from the BI Server, and populate a dashboard with PDI.

Chapter 9: Creating a Pentaho report directly from PDI (transformation)

Chapter 10, Getting the Most Out of Kettle – 9 recipes

This part works with some of the commonly needed features (e-mail and logging) as well as building sample data sets, and using Kettle to read meta information on jobs and transformations via files or Kettle’s database repository.

Chapter 10: Programming custom functionality using Java code (transformation)

Chapter 11, Utilizing Visualization Tools in Kettle – 4 recipes

This chapter explains how to work with plugins and focuses on DataCleaner, AgileBI, and Instaview, an Enterprise feature that allows for fast analysis of data sources.

Chapter 11: PDI Marketplace (here you can install all available plugins)

Chapter 12, Data Analytics – 3 recipes

This chapter shows us how to work with the various analytical tools built into Kettle, focusing on statistics-gathering steps and building datasets for Weka (the Pentaho data mining tool). You will also read data from a SAS data file.

Chapter 12: Reading data from a SAS file (transformation)

Appendix A, Data Structures, shows the different data structures used throughout the book.

Appendix A: Steelwheels database model structure

Appendix B, References, provides a list of books and other resources that will help you
connect with the rest of the Pentaho community and learn more about Kettle and the other
tools that are part of the Pentaho suite.

Book link:


Book Review: Pentaho Data Integration Beginner’s Guide – Second Edition

Hello friends, today I am going to review Pentaho Data Integration Beginner’s Guide – Second Edition.

First of all, I would like to congratulate María Carina, a great contributor to the Pentaho community, whom I met in person at the last Pentaho Community Meeting (#PCM13) in Sintra.

Below you can check the link to purchase the book:


Book review by: David Fombella Pombal (twitter: @pentaho_fan)

Book Title: Pentaho Data Integration Beginner’s Guide – Second Edition

Authors: María Carina Roldán

Paperback: 502 pages

I would like to recommend this book because, if you are new to Pentaho Data Integration, you will learn a lot about this cool tool. And if you are already an advanced PDI user, you can use it as a reference guide.

Target Audience
This book is an excellent starting point for database administrators, data warehouse developers, or anyone who is responsible for ETL and data warehouse projects and needs to load data into them.

Rating: 9 out of 10

Although this book targets PDI 4.4.0 CE, some new features of PDI 5.0.1 CE are listed in an appendix.

Kettle version

Chapter List

Chapter 1 – Getting Started with Pentaho Data Integration
In this chapter you learn what Pentaho Data Integration is and install the software required to start using the PDI graphical designer. As an additional task, a MySQL DBMS server is installed.

Chapter 1: Hello world (transformation)

Chapter 2 – Getting started with Transformations
This chapter introduces the basic terminology of PDI and gives an introduction to handling runtime errors. We will also learn the simplest ways of transforming data.

Chapter 2: Calculating project duration (transformation)

Chapter 3 – Manipulating Real-World Data
Here we will learn how to get data from different sorts of files (CSV, TXT, XML …) using PDI. We will also send data from Kettle to plain files.

Chapter 3: Creation of a CSV file with random values (transformation)

Chapter 4 – Filtering, Searching, and Performing Other Useful Operations with Data
This chapter explains how to sort and filter data, group data by different criteria, and look up data outside the main stream of data. Some data cleansing tasks are also performed in this chapter.

Chapter 4: Filtering data (transformation)

Chapter 5 – Controlling the Flow of Data
In this chapter, which is very important for ETL developers, we will learn how to control the flow of data. In particular, we will cover the following topics: copying and distributing rows, splitting streams based on conditions, and merging streams of data.

Chapter 5: Copying rows (transformation)

Chapter 6 – Transforming Your Data by Coding
This chapter explains how to insert code in your transformations. Specifically, you will learn how to insert and test JavaScript and Java code in your transformations, and how to distinguish situations where coding is the best option from those where there are better alternatives. PDI uses the Rhino JavaScript engine from Mozilla https://developer.mozilla.org/en-US/docs/Rhino/Overview . To allow Java programming inside PDI, the tool uses the Janino project libraries. Janino is a super-small and fast embedded compiler that compiles Java code at runtime http://docs.codehaus.org/display/JANINO/Home . In summary, always remember that code in the JavaScript step is interpreted, whereas the code in the User Defined Java Class (UDJC) step is compiled. This means that a transformation that uses the UDJC step will have much better performance.

Chapter 6: Transformation with Java code

Chapter 7 – Transforming the Rowset
This chapter is dedicated to learning how to convert rows to columns (denormalizing) and columns to rows (normalizing). Furthermore, you will be introduced to a very important topic in data warehousing: time dimensions.

Chapter 7: Denormalizing rows (transformation)

Chapter 8 – Working with databases
This is the first of two chapters fully dedicated to working with databases. We will learn how to connect to a database, preview and get data from a database, and insert, update, and delete data in a database.

Chapter 8: List of some of the many types of databases available to connect to in PDI

Chapter 9 – Performing Advanced Operations with Databases
This chapter explains different advanced operations with databases, such as doing simple and complex lookups in a database. An introduction to dimensional modeling and loading dimensions is also included.

Chapter 9: Database lookup in a transformation

Chapter 10 – Creating Basic Task Flows
So far, we have been working with data (running transformations). A PDI transformation does not run in isolation; it is usually embedded in a bigger process. Such processes, like generating a daily report and transferring it to a shared repository, or updating a data warehouse and sending a notification by email, can be implemented with PDI jobs. In this chapter we will be introduced to jobs, executing tasks upon conditions, and working with arguments and named parameters.

Chapter 10: Creating a folder (transformation)

Chapter 11 – Creating Advanced Transformations and Jobs
This chapter is about learning techniques for creating complex transformations and jobs (creating subtransformations, implementing process flows, nesting jobs, iterating the execution of jobs and transformations …).

Chapter 11: Execute transformation included in a job for every input row

Chapter 12 – Developing and Implementing a Simple Datamart
This chapter will cover the following: Introduction to a sales datamart based on a provided database, loading the dimensions and fact table of the sales datamart and automating what has been done.

Appendix A – Working with Repositories
PDI lets us store our transformations and jobs under two different configurations: file-based and database repository. Throughout this book we have used the file-based option; however, the database repository is convenient in some situations.

Appendix B – Pan and Kitchen – Launching Transformations and Jobs from the Command Line

Although we have used Spoon as the tool for running jobs and transformations, you may also run them from a terminal window. Pan is a command-line program which lets you launch the transformations designed in Spoon, both from .ktr files and from a repository. The counterpart to Pan is Kitchen, which allows you to run jobs from .kjb files and from a repository.
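
As a rough illustration of what launching from the command line looks like, here is a minimal Python sketch that only builds the Pan and Kitchen command lines. The `-file`, `-level` and `-param:NAME=VALUE` options are the ones I know from the Linux launchers; the file paths and the `RUN_DATE` parameter are hypothetical, so adapt them to your own installation.

```python
import shlex


def launcher_command(script, file_path, level="Basic", params=None):
    """Build the argument list for pan.sh (.ktr files) or kitchen.sh (.kjb files)."""
    args = ["sh", script, f"-file={file_path}", f"-level={level}"]
    # Named parameters are passed as -param:NAME=VALUE, one option per parameter.
    for name, value in sorted((params or {}).items()):
        args.append(f"-param:{name}={value}")
    return args


def pan_command(ktr_path, **kwargs):
    """Pan runs transformations."""
    return launcher_command("pan.sh", ktr_path, **kwargs)


def kitchen_command(kjb_path, **kwargs):
    """Kitchen runs jobs."""
    return launcher_command("kitchen.sh", kjb_path, **kwargs)


# Hypothetical paths and parameter name, for illustration only.
print(shlex.join(pan_command("/home/etl/hello_world.ktr")))
print(shlex.join(kitchen_command("/home/etl/daily_load.kjb",
                                 params={"RUN_DATE": "2014-01-31"})))
```

The printed lines can be pasted into a terminal opened in the data-integration directory, or the argument lists can be handed to `subprocess.run` from a scheduler script.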

Appendix C – Quick Reference – Steps and Job Entries

This appendix summarizes the purpose of the steps and job entries used in the labs throughout the book.

Appendix D – Spoon Shortcuts

This very useful appendix includes tables summarizing the main Spoon shortcuts.

Appendix E – Introducing PDI 5 features

New PDI 5 features (PDI 5 is now available).

Book link:


Pentaho Business Analytics 4.8 and PDI 4.4 are ready

Yesterday, apart from being election day, Pentaho announced the release of the new version of its Business Analytics suite. The new version, 4.8, includes great additions like Mobile and Instaview, along with many other feature enhancements.

Pentaho 48 Overview





Download Pentaho Data Integration 4.4

A new release of Kettle is also out. Hurry up, it is time to get it and try it: download it from www.pentaho.com/download (Enterprise Edition). It is coming soon to SourceForge (PDI Community Edition).


New PDI features

Pentaho Instaview

Pentaho Instaview is the fastest way to start using Pentaho Data Integration to analyze and visualize data. Instaview uses templates to manage the complexities of data access and preparation. You can focus on selecting and filtering the data you want to explore, rather than spending time creating source connections and identifying measure and dimension fields. Once the data has been selected, Instaview automatically generates transformation and metadata models, executes them, and launches Pentaho Analyzer. This allows you to explore your data in the Analyzer desktop user interface.
As your data requirements become more advanced, you have the ability to create your own templates and use the full power of Pentaho Data Integration (PDI).
Watch this video and see the Getting Started with Pentaho Data Integration Instaview Guide to understand and learn more about Pentaho Instaview.

PDI Operations Mart

The PDI Operations Mart enables administrators to collect and query PDI log data into one centralized data mart for easy reporting and analysis. The operations mart has predefined samples for Pentaho Analyzer, Interactive Reporting, and Dashboards. You can create individualized reports to meet your specific needs.

Sample inquiries include:

  • How many jobs or transformations have been successful compared to how many failed in a given period?
  • How many jobs or transformations are currently running?
  • What are the longest running jobs or transformations in a given period?
  • What is the highest failure rate of jobs or transformations in a given period?
  • How many rows have been processed in a particular time period? This enables you to see a trend of rows or time in time series for selected transformations.

The operations mart provides setup procedures for MySQL, Oracle, and PostgreSQL databases. Install instructions for the PDI Operations Mart are available in the Pentaho InfoCenter.

Concat Fields Step

The Concat Fields step is used to join multiple fields into one target field. The fields can be delimited by a separator and the enclosure logic is completely compatible with the Text File Output step.

This step is very useful for joining fields as key/value pairs for the Hadoop MapReduce Output step.
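
To make the idea concrete, here is a small Python sketch of what joining fields into one target field looks like for key/value output. It is only an illustration, not the step's implementation: enclosure handling is left out on purpose, and the field names and sample rows are made up.

```python
def concat_fields(row, field_names, separator=";"):
    """Join the selected fields of one row into a single string.

    Null values become empty strings. Enclosure logic is deliberately omitted.
    """
    return separator.join(
        "" if row[name] is None else str(row[name]) for name in field_names
    )


# Hypothetical rows; customer_id plays the role of the MapReduce key.
rows = [
    {"customer_id": 42, "country": "ES", "amount": 19.5},
    {"customer_id": 43, "country": None, "amount": 7},
]

for row in rows:
    key = str(row["customer_id"])
    value = concat_fields(row, ["country", "amount"])
    # One key/value pair per row, ready for a key/value style output.
    print(f"{key}\t{value}")
```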

EDI to XML Step

The EDI to XML step converts EDI message text, which conforms to the ISO 9735 standard, to generic XML. The XML text is more accessible and enables selective data extraction using XPath and the Get Data From XML step.

SAS Input Step

The SAS Input step reads files in sas7bdat format created by SAS software. This step allows PDI developers to import files in sas7bdat format.


Pentaho Data Integration: Remote execution with Carte


  • Software: PDI/Kettle 4.3.0,  installed on your PC and on a server
  • Knowledge: Intermediate (To follow this tutorial you should have good knowledge of the software and hence not every single step will be described)

Carte is an often overlooked small web server that comes with Pentaho Data Integration/Kettle. It allows remote execution of transformations and jobs. It even allows you to create static and dynamic clusters, so that you can easily run your power-hungry transformations or jobs on multiple servers. In this session you will get a brief introduction on how to work with Carte.

Now let’s get started: SSH to the server where Kettle is running (this assumes you have already installed Kettle there).

Encrypt password

Carte requires a user name and password. It’s good practise to encrypt this password. Thankfully Kettle already comes with an encryption utility.
In the PDI/data-integration/ directory run:
sh encr.sh -carte yourpassword

Open pwd/kettle.pwd and copy the encrypted password after “cluster: “:

vi ./pwd/kettle.pwd
# Please note that the default password (cluster) is obfuscated using the Encr script provided in this release
# Passwords can also be entered in plain text as before
cluster: OBF:1mpsdfsg323fssmmww3352gsdf7

Please note that “cluster” is the default user name.

Start carte.sh
Make sure first that the port you will use is available and open.

In the simplest form you start carte with just one slave that resides on the same instance:

nohup sh carte.sh localhost 8181 > carte.err.log &
After this, press CTRL+C .
To see if it started:
tail -f carte.err.log
Although outside the scope of this session, I will give you a brief idea of how to set up a cluster: if you want to run a cluster, you have to create a configuration XML file. Examples can be found in the pwd directory. Open one of these XML files and amend it to your needs. Then issue the following command:

sh carte.sh ./pwd/carte-config-8181.xml >> ./pwd/err.log

Check if the server is running

Issue the following commands:

[root@ip-11-111-11-111 data-integration]# ifconfig
eth0      Link encap:Ethernet  HWaddr …
          inet addr:  Bcast:
[… details omitted …]
[root@ip-11-111-11-111 data-integration]# wget http://cluster:yourpassword@
--2011-01-31 13:53:02--  http://cluster:*password*@
Connecting to... connected.
HTTP request sent, awaiting response... 401 Unauthorized
Reusing existing connection to
HTTP request sent, awaiting response... 200 OK
Length: 158 [text/html]
Saving to: `index.html'
100%[======================================>] 158         --.-K/s   in 0s
2011-01-31 13:53:02 (9.57 MB/s) - `index.html' saved [158/158]

If you get a message like the one above, a web server call is possible, hence the web server is running.

With the wget command you have to pass, in the form http://user:password@ip-address:port, the
  • user name
  • password
  • IP address
  • port number
Or you can install lynx:
[root@ip-11-111-11-111 data-integration]# yum install lynx
[root@ip-11-111-11-111 data-integration]# lynx http://cluster:yourpassword@
It will ask you for user name and password and then you should see a simple text representation of the website: Not more than a nearly empty Status page will be shown.
                                                            Kettle slave server
Slave server menu
   Show status
Commands: Use arrow keys to move, ‘?’ for help, ‘q’ to quit, ‘<-‘ to go back.
  Arrow keys: Up and Down to move.  Right to follow a link; Left to go back.
 H)elp O)ptions P)rint G)o M)ain screen Q)uit /=search [delete]=history list

You can also just type the URL in your local web browser:

You will be asked for user name and password and then you should see an extremely basic page.
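
If you prefer to script this check, here is a minimal Python sketch using only the standard library. It sends the user name and password with HTTP Basic authentication, the same thing wget did above. The `/kettle/status/` path is the Carte status page as I remember it, so treat it as an assumption and verify it against your version; the host, port and password in the usage comment are placeholders.

```python
import base64
import urllib.request


def carte_status_request(host, port, user, password):
    """Build a GET request for the Carte status page with HTTP Basic auth."""
    # Assumption: the status page lives under /kettle/status/ on the Carte server.
    url = f"http://{host}:{port}/kettle/status/"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def carte_is_up(host, port, user, password, timeout=5):
    """Return True if Carte answers the status request with HTTP 200."""
    request = carte_status_request(host, port, user, password)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, timeouts and HTTP errors such as 401.
        return False


# Usage (placeholders, replace with your own server details):
# print(carte_is_up("your-server-ip", 8181, "cluster", "yourpassword"))
```

Returning False on any error keeps the function easy to use in a monitoring script; log the exception instead if you need to tell a wrong password from a server that is down.
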
Define slave server in Kettle

  1. Open Kettle, open a transformation or job
  2. Click on the View panel
  3. Right click on Slave server and select New.
Specify all the details and click OK. In the tree view, right click on the slave server you just set up and choose Monitor. Kettle will now display the running transformations and jobs in a new tab:
Your transformations can only use the slave server if you specify it in the Execute a transformation dialog.
For jobs you have to specify the remote slave server in each job entry dialog.
If you want to set up a cluster schema, define the slaves first, then right click on Kettle cluster schemas. Define a Schema Name and the other details, then click on Select slave servers. Specify the servers that you want to work with and define one as the master. A full description of this process is outside the scope of this article. For further info, the “Pentaho Kettle Solutions” book will give you a detailed overview.
For me a convenient way to debug a remote execution is to open a terminal window, ssh to the remote server and tail -f carte.err.log. You can follow the error log in Spoon as well, but you’ll have to refresh it manually all the time.

Remove Duplicate rows using Kettle PDI

Quick tip showing how to use the Unique Rows Kettle step to remove duplicate rows from a CSV text file.

1) Sort the rows using the Sort Rows step, based on the key field.

2) Use the Unique Rows step to remove the duplicates.

Sample Input Data:



Click on the input file step and fill in the fields as shown in the screen capture.

We are reading a comma-separated file without a header. Please check the highlighted options and select them according to your input.

If you want to trim the incoming string fields, make sure you don’t specify the length of the string field; if you specify the length, the trim function will not work.

Next we need to configure the Sort Rows step.

You can define a temp directory in case the sort needs scratch space and, depending on the system memory, specify the number of rows to keep in memory. If the number of rows exceeds that limit, or memory is not available, the step uses the specified scratch space.

Depending on your requirements, you can capture duplicate rows in an error file by checking the Redirect duplicate row option. Note also the warning message that appears: the Unique Rows step requires sorted input, otherwise you will not get the desired results.

We are using a text file output step to write the output.

After executing the transformation, here is the output.


As you can see, only the unique rows are written to the output file.
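
Why does the sort have to come first? Here is a small Python sketch of the idea, under the assumption that the duplicate removal only compares each row with the previous one (which is what the sorted-input warning suggests). The sample data is made up and this is not Kettle's code, just the same technique in a few lines.

```python
import csv
import io
from itertools import groupby


def unique_adjacent(rows, key):
    """Keep the first row of each run of consecutive rows that share the same key."""
    return [next(group) for _, group in groupby(rows, key=key)]


# Made-up input: comma separated, no header, first column is the key field.
raw = "3,carol\n1,alice\n3,carol\n2,bob\n1,alice\n"
rows = list(csv.reader(io.StringIO(raw)))


def key_field(row):
    return row[0]


# Without sorting, the duplicates are not next to each other, so they survive.
unsorted_result = unique_adjacent(rows, key_field)

# Sorting on the key field first puts duplicates side by side, so they are removed.
sorted_result = unique_adjacent(sorted(rows, key=key_field), key_field)

print(len(unsorted_result), "rows without sorting")
print(len(sorted_result), "rows after sorting first")
```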

PDI clusters – Part 1 : How to build a simple PDI cluster

I would like to start a collection of posts dedicated to PDI / Kettle clustering.
After surfing the web, I noticed a lot of people are asking how to build PDI clusters and how to test and deploy them in a production environment. There are also a lot of questions about Carte usage. So, I will try to write some tutorials about this fantastic feature offered by PDI.
At this point, I want to recommend a book: “Pentaho Solutions – Business Intelligence and Datawarehousing with Pentaho and MySQL”, written by Roland Bouman and Jos Van Dongen. This book is a fantastic source of knowledge about Pentaho and will help you understand the Pentaho ecosystem and tools. My complete review of this book is here.


      • How to build a simple PDI cluster (1 master, 2 slaves). This post.
      • How to build a simple PDI server on Amazon Cloud Computing (EC2).
      • How to build a PDI cluster on Amazon Cloud Computing (EC2).
      • How to build a dynamic PDI cluster on Amazon Cloud Computing (EC2).

This first post is about building a simple PDI cluster, composed of 1 master and 2 slaves, in a virtualized environment (VMware).
After this article, you will be able to build your PDI cluster and play with it on a simple laptop or desktop (3 GB of RAM is a must).

Why PDI clustering ?

Imagine you have to run some very complex transformations and finally load a huge amount of data into your target warehouse.
You have two solutions to handle this task :

  • SCALE UP : Build a strong unique PDI server with a lot of RAM and CPU. This unique server (let’s call it an ETL hub) will handle all the work by itself.
  • SCALE OUT : Create an array of smaller servers. Each of them will handle a small part of the work.

Clustering is scaling out. You divide the global workload and distribute it across many nodes; these smaller tasks are then processed in parallel (or near parallel). The overall run time is set by the slowest node of your cluster.
If we consider PDI, a cluster is composed of :

  • ONE MASTER : this node is acting like a conductor, assigning the sub-tasks to the slaves and merging the results coming back from the slaves when the sub tasks are done.
  • SLAVES : from 1 to many. The slaves are the nodes that will really do the job, process the tasks and then send back the results to the master for reconciliation.
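
The master/slave split can be sketched in a few lines of Python. This is only a toy model of the scatter/gather idea, not how PDI works internally: threads stand in for the slave servers, summing numbers stands in for the real transformation, and all names are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, n_slaves):
    """Master side: deal the rows round-robin into one chunk per slave."""
    return [rows[i::n_slaves] for i in range(n_slaves)]


def slave_task(chunk):
    """Slave side: process one chunk and return a partial result."""
    return sum(chunk)


def run_cluster(rows, n_slaves):
    """Master side: distribute the chunks, then merge the partial results."""
    chunks = split_rows(rows, n_slaves)
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        partials = list(pool.map(slave_task, chunks))
    return sum(partials)


def cluster_elapsed(slave_seconds):
    """When slaves run in parallel, the run lasts as long as the slowest slave."""
    return max(slave_seconds)


print(run_cluster(list(range(1, 101)), n_slaves=2))
print(cluster_elapsed([12.0, 30.0]))
```

The second function is the reason the workload should be split evenly: one overloaded slave holds back the whole cluster.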

Let’s have a look at this diagram. You can see the typical architecture around a PDI cluster: data sources, the master, the registered slaves and the target warehouse. The more PDI slaves you add, the more parallelism (and usually performance) you get.

The virtual cluster

Let’s build our first virtual cluster now. First, you will need VMware or VirtualBox (or Virtual PC from Microsoft). I use VMware, so from now on I will speak about VMware only, but you can easily transpose the steps. I decided to use Suse Enterprise Linux 11 for these virtual machines. It is a personal choice, but you can do the same with Fedora, Ubuntu, etc.

Let’s build 3 virtual machines :

  • The Master : Suse Enterprise Linux 11 – this machine will host the PDI programs and the PDI repository, plus a MySQL database with phpMyAdmin (optional).
  • The Slave 1 : Suse Enterprise Linux 11 – this machine will host the PDI programs and will run Carte.
  • The Slave 2 : Suse Enterprise Linux 11 – this machine will host the PDI programs and will run Carte.

As you can see below, the three virtual machines are located on the same subnet, using fixed IP addresses ranging from (Master) to (Slave 2). On the VMware side, I used a “host only” network connection. You have to be able to ping your master from the two slaves, ping the two slaves from the master and also ping the three virtual machines from your host. The easiest way is to disable the firewall on each Suse machine because we don’t need security for this exercise.

The Master configuration

As I said, the Master virtual machine hosts PDI, a MySQL database and the PDI repository. But let’s have a closer look at the internal configuration, especially the Carte config files.
From the Pentaho wiki, Carte is “a simple web server that allows you to execute transformations and jobs remotely”. Carte is a major component when building clusters because it acts as a kind of middleware between the Master and the Slave servers : the slaves register themselves with the Master, notifying it that they are ready to receive tasks to process. On top of that, you can reach the Carte web service to remotely monitor, start and stop transformations / jobs. You can learn more about Carte from the Pentaho wiki.

The picture below explains the registration process between slaves and a master.

Master Slave registration

On the Master, two files are very important. The files are configuration files, written in XML. They are self explanatory, easy to read :

  • Repositories.xml : your master must have a valid repositories.xml file, updated with all the information about your repository connection (hosted on the Master for this example). See below for my config file.
  • Carte xml configuration file : located in /pwd/, this file contains only one section for defining the cluster master (ip, port, credentials). In the /pwd/ directory, you will find some example configuration files. Pick one, for instance the one labelled “8080” and apply the changes described below. I will keep the port 8080 for communication between the Master and the two Slaves. See below for my config file.

Repositories.xml on Master

Carte xml configuration file on Master

The Slave configuration

As I said, the two Slave virtual machines host PDI. Now let’s have a look at how to configure some very important files, the same files we changed for the Master.

  • Repositories.xml : your slave must have a valid repositories.xml file, updated with all the information about your repository (hosted on the Master for this example). See below for my config file.
  • Carte xml configuration file : located in /pwd/, this file contains two sections : the master section and the slave section. In the /pwd/ directory, you will find some example configuration files. Pick one, for instance the “8080” one and apply the changes described below. Note that the default user and password for Carte is cluster / cluster. Here again the file is self explanatory, see below for my config file.

Repositories.xml on Slave1 and Slave2 :
Same as for the Master, see above.

Carte xml configuration file on Slave1 (note address is, don’t write “localhost” for Slave1)

Carte xml configuration file on Slave2 (note : address is, don’t write “localhost” for Slave2)
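
To give an idea of the two sections of a slave's Carte file, here is a Python sketch that generates one. The element names follow the sample files shipped in the pwd directory as I remember them, so please compare with your own samples before relying on it. The host names are placeholders (use the real addresses of your machines, not “localhost”), and cluster / cluster are the default credentials mentioned above.

```python
import xml.etree.ElementTree as ET


def slave_server(name, hostname, port, is_master,
                 username="cluster", password="cluster"):
    """Build one <slaveserver> element describing a Carte instance."""
    node = ET.Element("slaveserver")
    fields = (
        ("name", name),
        ("hostname", hostname),
        ("port", str(port)),
        ("username", username),
        ("password", password),
        ("master", "Y" if is_master else "N"),
    )
    for tag, text in fields:
        ET.SubElement(node, tag).text = text
    return node


def slave_config(master, slave):
    """Slave file layout: a master section, then the slave's own section."""
    root = ET.Element("slave_config")
    ET.SubElement(root, "masters").append(master)
    # Ask the slave to register itself with the master when it starts.
    ET.SubElement(root, "report_to_masters").text = "Y"
    root.append(slave)
    return root


# Placeholder host names, for illustration only.
config = slave_config(
    master=slave_server("master1", "master.example", 8080, is_master=True),
    slave=slave_server("slave1", "slave1.example", 8080, is_master=False),
)
print(ET.tostring(config, encoding="unicode"))
```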

Starting everything

Now it is time to fire up the programs. I assume you have already started MySQL and that your PDI repository is active and reachable by PDI. It is recommended that you work with a repository hosted on a relational database. Let’s fire up Carte on the Master first. The command is quite simple : ./carte.sh [xml config file].

This output means that your Master is running and a listener is active on the Master address (ip address) on port 8080. Now let’s start the two slaves. Here again, the command is simple : ./carte.sh [xml config file]. Look at the output for Slave1 below: you can see that Carte has now registered Slave1 ( to the master server . Everything is working fine so far.

Finally the output for Slave2. Look at the output for Slave2 below: you can see that Carte has now registered Slave2 ( to the master server . Everything is fine so far here again.

At this point, we have a working Master and two registered slaves (Slave1 and Slave2) waiting to receive tasks from the Master. It is now time to create the cluster array and a PDI transformation (and a job to run it). Let's go for it.


PDI configuration

First we have to declare the slaves previously created and started. That's pretty easy. Let's select the Explorer mode in the left pane. Left-click on the "Slave server" folder; this will pop up a new window in which you will declare Slave1 like below.


Repeat the same operation for Slave1 and Slave2 in order to have 3 registered servers like the picture above. Don’t forget to type the right port (we have been working with 8080 since the beginning of this exercise).

Now we have to declare the cluster. Right click on the cluster folder (next folder) and choose New. This will pop up a new window in which you will fill the cluster parameters : Just type a new name for your cluster and then click on the “select servers” button. Now choose your three servers and click ok. You will then notice your cluster is created (Master and Slave) like below.



Creating a job for testing the cluster

For this exercise, I won't create a job but will use an existing one created by Matt Casters. This transformation is very interesting: it simply reads data from a flat file and computes statistics (rows/sec, throughput ...) into a target flat file for each slave. You can download this transformation here, the job here and the flat file here (21 MB zipped).

I assume you know how to link a transformation into a job. Don't forget to change the flat file location on the source (/your_path/lineitem.tbl) and on the destination (/your_path/out_read_lineitems). Then, for each of the first four steps, right click and assign the cluster (the one you named previously, see above) to the step. You will see the caption “Cx2” at the top right of each icon. There is nothing else to change. Here is a snapshot of the contextual menu when assigning the cluster to the transformation steps (my PDI release is in French, so you have to look for “Clustering” instead of “Partitionnement”).

Clustering steps


Have a look at the transformation below. The caption “Cx2” at the top right of the first four icons means you have assigned your cluster to run these steps. By contrast, the JavaScript step “calc elapsed time” won’t run on the cluster but on the Master only.
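To make the split concrete, here is a small conceptual sketch in Python of which server ends up running which step. It is not how PDI is implemented, only a way to picture the result. The four clustered step names and “calc elapsed time” come from this tutorial; the name of the final output step and the server names are assumptions for illustration.

```python
# (step name, assigned to the cluster?)
STEPS = [
    ("lineitem.tbl", True),         # reads the flat file
    ("current_time", True),
    ("min/max time", True),
    ("slave_name", True),
    ("calc elapsed time", False),   # JavaScript step, Master only
    ("out_read_lineitems", False),  # assumed name for the output step
]


def plan(steps, slaves, master="Master"):
    """Map each server to the list of steps it would run."""
    clustered = [name for name, on_cluster in steps if on_cluster]
    local = [name for name, on_cluster in steps if not on_cluster]
    # Every slave runs its own copy of the clustered steps ...
    assignment = {slave: list(clustered) for slave in slaves}
    # ... while the remaining steps run once, on the Master.
    assignment[master] = local
    return assignment


if __name__ == "__main__":
    for server, names in plan(STEPS, ["Slave1", "Slave2"]).items():
        print(f"{server}: {', '.join(names)}")
```

With two slaves, each slave gets the four “Cx2” steps and the Master keeps the rest, which matches what the icons show.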

Also have a look at the job (calling the transformation above). This is a typical job, involving a start step and the “execute transformation” step. We will start this job with Kitchen later.

Main Job


Running everything

Now it is time to run the job/transformation we made. First we will see how to run the transformation within Spoon, the PDI GUI. Then we will see how to run the job (containing the transformation) with Kitchen in the Linux console and how to interpret the console output.

First, how to start the transformation within Spoon. Simply click on the green play symbol. The following window will appear on your screen. Once again, my screen is in French, sorry for that. All you have to do is click on the top right button to select clustered execution (“Exécution en grappe” in French). I suppose you are already quite familiar with that screen so I won’t continue explaining it.



Then you can run the transformation. Let’s have a look at the Spoon trace (don’t forget to display your output window in PDI, and select the Trace tab).

This trace is fairly simple. First we can see that the Master (ip .128) found its two slaves (ip .129 and ip .130) and the connection is working well. The Master and the two Slaves communicate throughout the process. As soon as the two Slaves have finished their work, we receive a notification (“All transformations in the cluster have finished”), then we can read a small summary (number of rows).

Let’s have a look at the Master command line (remember we started Carte from the Linux command line). For the Master, we have a very short output. The red lines are familiar to you now: they correspond to the Carte startup we did a few minutes ago. Have a look at the green lines below: these lines were printed out by Carte while the cluster was processing the job.

Let’s have a look at the Slave 1 output. Here again, the red lines come from the Carte startup. The green lines are interesting: you can see Slave 1 receiving its portion of the job to run … and how it did so by reading rows (packets of 50000). You can also see the names of the steps that Slave 1 processed in cluster mode: lineitem.tbl (reading the flat file), current_time (catching the current time), min/max time and slave_name. If you remember, these steps were flagged with a “Cx2” at the top right corner of their icon (see below) when you assigned your cluster to the transformation steps.

Slave icons


The output for Slave 2, displayed below, is very similar to Slave 1.


That’s a lot of fun to do! Once you have started Carte and created your cluster, you are ready to execute the job. You will then see your Linux console printing information while the job is being executed by your slaves. This post is about understanding and creating the whole PDI cluster mechanism; I won’t talk about optimization for the moment.


Hey, what’s the purpose of my transformation?

As I said before, this transformation only reads records from a flat file (lineitem.tbl) and computes performance statistics for every slave, like rows/sec, throughput … The last step of the transformation creates a flat file containing these stats. Have a look at it.
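The rows/sec figure is just the number of rows divided by the elapsed time between a slave's start and end timestamps. As a minimal sketch of that arithmetic (the function name and the sample values are made up for illustration and are not taken from the actual stat file):

```python
from datetime import datetime


def rows_per_second(rows, start_time, end_time):
    """Throughput of one slave: rows read divided by elapsed seconds."""
    elapsed = (end_time - start_time).total_seconds()
    if elapsed <= 0:
        raise ValueError("end_time must be later than start_time")
    return rows / elapsed


if __name__ == "__main__":
    # Illustrative values only: 3,000,000 rows read in exactly one minute.
    start = datetime(2010, 1, 1, 12, 0, 0)
    end = datetime(2010, 1, 1, 12, 1, 0)
    print(f"{rows_per_second(3_000_000, start, end):.0f} rows/sec")
```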


Once formatted with a spreadsheet tool, the stats will look like this.

Stat file

Don’t pay too much attention to the start_time and end_time timestamps: the clocks were not set up on my three virtual machines, hence they are not in sync. You will also notice that, in the example above, the performance of the two slaves is not homogeneous. That’s normal: don’t forget I’m currently working in a virtualized environment built on a workstation, and this tutorial is limited to demonstrating how to create and configure a PDI cluster. No optimization was taken into account at this stage. On a fully optimized cluster, you will have (almost) homogeneous performance.

Running with the Linux console

If you want to execute your job from the Linux command line, no problem. Kitchen is here for you. Here is the syntax for a job execution. Note: VMWARE-SLES10-32_Repo is my PDI repository running on the Master. I’m sure you are already familiar with the other parameters.
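As a rough sketch of what such a Kitchen call could look like, the Python snippet below assembles the command line from its parts. Only the repository name comes from this post. The user, password, directory and job name are placeholders, and the path to kitchen.sh is an assumption about where PDI is installed.

```python
import shlex


def kitchen_command(repo, user, password, directory, job,
                    level="Basic", kitchen="./kitchen.sh"):
    """Return the argument list for running a repository job with Kitchen."""
    return [
        kitchen,
        f"-rep={repo}",       # repository name
        f"-user={user}",      # repository user
        f"-pass={password}",  # repository password
        f"-dir={directory}",  # repository directory holding the job
        f"-job={job}",        # job name
        f"-level={level}",    # logging level
    ]


if __name__ == "__main__":
    command = kitchen_command(
        repo="VMWARE-SLES10-32_Repo",
        user="admin",            # placeholder
        password="admin",        # placeholder
        directory="/",           # placeholder
        job="my_cluster_job",    # placeholder
    )
    # Print the line you would type in the Linux console.
    print(shlex.join(command))
```

Building the command as a list keeps each option separate, so a repository or job name containing spaces is quoted correctly when printed.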


To execute your transformation, use Pan. Here is the typical command.
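A Pan call follows the same pattern, with -trans in place of -job. The sketch below builds the command and, if you want, runs it and hands back the exit code. The transformation name, user, password and directory are placeholders, and the path to pan.sh is an assumption about your installation.

```python
import subprocess


def pan_command(repo, user, password, directory, trans,
                level="Basic", pan="./pan.sh"):
    """Return the argument list for running a repository transformation with Pan."""
    return [
        pan,
        f"-rep={repo}",
        f"-user={user}",
        f"-pass={password}",
        f"-dir={directory}",
        f"-trans={trans}",   # transformation name
        f"-level={level}",
    ]


def run(command):
    """Run the command and return its exit code (0 means success)."""
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    command = pan_command(
        repo="VMWARE-SLES10-32_Repo",
        user="admin",                # placeholder
        password="admin",            # placeholder
        directory="/",               # placeholder
        trans="my_cluster_trans",    # placeholder
    )
    print(" ".join(command))
    # Uncomment on a machine where PDI is installed:
    # print("exit code:", run(command))
```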

Conclusion and … what’s next?

Well, I hope you found here some explanations and solutions for creating a basic PDI cluster. You can create more than 2 slaves if you want; the process is the same. Don’t forget to add these new slaves to the cluster definition in Spoon. As I said, no particular attention was given to optimization. This will be the topic of a future post. Feel free to contact me if you need further explanations about this post or if you want to add some useful comments; I will answer with pleasure.

The next post will be about creating the same architecture, with … let’s say 3 or 4 slaves, in the Amazon cloud computing infrastructure. It will be a good time to talk about cloud computing in general (pros, cons, architecture …).