Pentaho Data Integration scheduling with Jenkins


“As a System Administrator I need to find a scheduling solution for our Pentaho Data Integration Jobs”
Reblogged from http://opendevelopmentnotes.blogspot.com/2014/09/pentaho-data-integration-scheduling.html
Scheduling is a crucial task in all ETL and Data Integration processes. The scheduling options available in the community edition of Pentaho Data Integration (Kettle) basically rely on the operating system's capabilities (Cron on Linux, Task Scheduler on Windows), but there is at least one other free, open source and solid alternative for job scheduling: Jenkins.
Jenkins is a Continuous Integration tool, the de facto standard adopted in Java projects, and it's so extensible and easy to use that it does a perfect job of scheduling Jobs and Transformations developed in Kettle.
So let's start building a (probably) production-ready scheduling solution.

System configuration

OS: Oracle Linux 6
PDI: 5.1.0.0
Java: 1.7
Jenkins: 1.5

Install Jenkins

Installing Jenkins on Linux is trivial: run a few commands and in a few minutes you will have the system up and running.

#sudo wget -O /etc/yum.repos.d/jenkins.repo http://pkg.jenkins-ci.org/redhat/jenkins.repo
#sudo rpm --import https://jenkins-ci.org/redhat/jenkins-ci.org.key
#sudo yum install jenkins

At the end of the installation process you will have your Jenkins system ready to run.

Before starting Jenkins, verify that Java is installed by running:

#java -version

and if it’s not found on your system just install it with:

#sudo yum install java
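
The check-then-install step can also be scripted. What follows is only a minimal sketch: the helper names have_cmd and java_status are made up for illustration, and the package name is the one used in the command above.

```shell
#!/bin/sh
# have_cmd is a hypothetical helper: it succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# java_status (also hypothetical) reports whether Java still has to be installed.
java_status() {
    if have_cmd java; then
        echo "java found"
    else
        echo "java missing: run 'sudo yum install java'"
    fi
}

java_status
```

The script only reports the situation and leaves the actual install to you, so it can be run safely without root.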

Now it’s time to start Jenkins:

#sudo service jenkins start

Open your browser and go to the Jenkins console page (port 8080 by default).

Resolve port conflict

If you are not able to navigate to the web page check the log file:

#sudo cat /var/log/jenkins/jenkins.log

Probably there is a port conflict (in my case I was running another web application on the same machine).
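
To confirm the conflict before touching the configuration, you can look at which process is already listening on the port. The snippet below is a rough sketch: listeners_on is a hypothetical helper name, and 8080 is assumed to be the port Jenkins is trying to bind (its default).

```shell
#!/bin/sh
# listeners_on is a hypothetical helper: it reads `netstat -tln` style output
# on stdin and prints only the sockets listening on the given port.
listeners_on() {
    awk -v port="$1" '$6 == "LISTEN" && $4 ~ (":" port "$")'
}

# Example (-p needs root to show the owning process):
#   sudo netstat -tlnp | listeners_on 8080
```

If the example prints a line whose last column is not Jenkins, another application owns the port and Jenkins has to move.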

Look at your config file:

#sudo nano /etc/sysconfig/jenkins

and change the default ports:

JENKINS_PORT="8082"

JENKINS_AJP_PORT="8011"
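
If you prefer not to edit the file by hand, the same change can be sketched with sed. The helper name set_port is an assumption made for this example; the file path, the keys and the port numbers are the ones used above.

```shell
#!/bin/sh
# set_port is a hypothetical helper: it rewrites a KEY="value" line
# in a sysconfig-style file.
# Usage: set_port FILE KEY PORT
set_port() {
    file=$1 key=$2 port=$3
    # Replace the whole line starting with KEY= and keep a .bak copy of the original.
    sed -i.bak "s/^${key}=.*/${key}=\"${port}\"/" "$file"
}

# Example against the real file (needs root):
#   set_port /etc/sysconfig/jenkins JENKINS_PORT 8082
#   set_port /etc/sysconfig/jenkins JENKINS_AJP_PORT 8011
#   service jenkins restart
```

Whichever way you edit the file, Jenkins has to be restarted before the new ports take effect.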

Job example

Now that Jenkins is up and running, it's time to test a simple Job.

The transformation and job are self-explanatory:

Scheduling

Go to the Jenkins web console and click on New Item.
Give it a name and select the Freestyle project option.
Set the schedule (every minute, just to test the job).
Now fill in the Build section with the Kitchen command and save the project.
Wait one minute and look at the left side of the page: you will find your Job running.
Click the build item and select Console Output. You will see the main output of Kitchen.
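
As a rough illustration of the two values entered in the steps above: a schedule of `* * * * *` (cron syntax) in the "Build periodically" trigger runs the project every minute, and the Build section can hold a Kitchen call like the one sketched below. The install directory, the job file path and the helper name kitchen_cmd are all assumptions for this example, so adjust them to your own setup.

```shell
#!/bin/sh
# Hypothetical locations: change them to where PDI and your job file really live.
PDI_HOME="${PDI_HOME:-/opt/data-integration}"
JOB_FILE="${JOB_FILE:-/opt/etl/sample_job.kjb}"

# kitchen_cmd (hypothetical helper) prints the Kitchen command line
# for a job file and an optional log level (Basic if omitted).
kitchen_cmd() {
    printf '%s/kitchen.sh -file=%s -level=%s\n' "$PDI_HOME" "$1" "${2:-Basic}"
}

# In the Jenkins "Execute shell" build step you would run the command itself:
#   "$PDI_HOME/kitchen.sh" -file="$JOB_FILE" -level=Basic
# Kitchen exits with a non-zero code when the job fails,
# so Jenkins marks that build as failed.
kitchen_cmd "$JOB_FILE"
```

Once the test works, replace the every-minute schedule with the one the job really needs.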

Conclusion

Jenkins is a powerful tool and, even though scheduling is not its primary purpose, you can use it as your Enterprise Scheduler, taking advantage of all its options for executing, monitoring and managing your Kettle Jobs.
Explore all the features that Jenkins provides and build your own free, solid and open source scheduling solution.
Take advantage of the large Jenkins community to handle even the most complex scheduling scenarios and, from time to time, if you find anything interesting, remember to give it back to the community.