Pentaho Data Integration scheduling with Jenkins


“As a System Administrator I need to find a scheduling solution for our Pentaho Data Integration Jobs.”
Reblog from http://opendevelopmentnotes.blogspot.com/2014/09/pentaho-data-integration-scheduling.html
Scheduling is a crucial task in all ETL and Data Integration processes. The scheduling options available in the community edition of Pentaho Data Integration (Kettle) basically rely on the operating system's capabilities (cron on Linux, Task Scheduler on Windows), but there is at least one other free, open source and solid alternative for job scheduling: Jenkins.
Jenkins is a Continuous Integration tool, the de facto standard in Java projects, and it is so extensible and easy to use that it does a perfect job of scheduling Jobs and Transformations developed in Kettle.
So let’s start building a production-ready (probably) scheduling solution.

System configuration

OS: Oracle Linux 6
PDI: 5.1.0.0
Java: 1.7
Jenkins: 1.5

Install Jenkins

Installing Jenkins on Linux is trivial: just run a few commands and in a few minutes you will have the system up and running.

#sudo wget -O /etc/yum.repos.d/jenkins.repo http://pkg.jenkins-ci.org/redhat/jenkins.repo
#sudo rpm --import https://jenkins-ci.org/redhat/jenkins-ci.org.key
#sudo yum install jenkins
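
Optionally, you can also enable Jenkins to start automatically at boot (this relies on the SysV init script installed by the Jenkins RPM):

#sudo chkconfig jenkins on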

At the end of the installation process you will have your Jenkins system ready to run.

Before starting Jenkins, verify that Java is installed by running:

#java -version

and if it’s not found on your system just install it with:

#sudo yum install java

Now it’s time to start Jenkins:

#sudo service jenkins start

Open your browser and go to the console page (Jenkins listens on port 8080 by default, so http://your-server:8080).

Resolve port conflict

If you are not able to navigate to the web page, check the log file:

#sudo cat /var/log/jenkins/jenkins.log

Probably there is a port conflict (in my case I was running another web application on the same machine).

Look at your config file:

#sudo nano /etc/sysconfig/jenkins

and change the default ports:

JENKINS_PORT="8082"

JENKINS_AJP_PORT="8011"
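
After saving the new ports, restart Jenkins so the change takes effect:

#sudo service jenkins restart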

Job example

Now that Jenkins is up and running, it’s time to test a simple Job.

The transformation and job used for this test are simple and self-explanatory.

Scheduling

Go to the Jenkins web console and click on New Item.
Give it a name and select Freestyle project.
Set the schedule (every minute, only to test the job).
Now fill in the Build section with the Kitchen command and save the project (a sample command is sketched after these steps).
Just wait one minute and look at the left side of the page; you will find your Job running.
Click the build item and select Console Output. You will be able to see the main output of Kitchen.
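
For reference, here is a minimal sketch of what the scheduling and Build settings could look like (an Execute shell build step; the PDI install path and job file below are illustrative, so adjust them to your environment).

Build Triggers > Build periodically (every minute, for testing only):

* * * * *

Build > Execute shell:

/opt/pentaho/data-integration/kitchen.sh -file=/opt/etl/jobs/sample_job.kjb -level=Basic

Kitchen returns a non-zero exit code when the job fails, so Jenkins will mark the build as failed accordingly.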

CONCLUSION

Jenkins is a powerful tool and, even if it’s not its primary purpose, you can use it as your Enterprise Scheduler, taking advantage of all the options for executing, monitoring and managing your Kettle Jobs.
Explore all the features that Jenkins provides and build your own free, solid and open source scheduling solution.
Take advantage of the big Jenkins community to meet the most complex scheduling scenarios and, from time to time, if you build something interesting, remember to give it back to the community.

Creating a connection to SAP HANA using Pentaho PDI


 

Reblog from http://scn.sap.com/community/developer-center/hana/blog/2014/09/04/creating-a-connection-to-sap-hana-using-pentaho-pdi

In this blog post we are going to learn how to create a HANA Database Connection within Pentaho PDI.

1) Go to the SAP HANA CLIENT installation path and copy “ngdbc.jar”

*You can get SAP HANA CLIENT & SAP HANA STUDIO from: https://hanadeveditionsapicl.hana.ondemand.com/hanadevedition/

 


2) Copy and paste the jar file to: <YourPentahoRootFolder>/data-integration/lib
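
On Linux, for example, the copy could look like this (the HANA client path shown is just an example, and <YourPentahoRootFolder> depends on where you unpacked PDI):

cp /usr/sap/hdbclient/ngdbc.jar <YourPentahoRootFolder>/data-integration/lib/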


3) Start Pentaho PDI and create a new Connection

* Make sure your JAVA_HOME environment variable is set correctly.
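
For example, on Linux you could set JAVA_HOME and start the PDI GUI (Spoon) like this (the JDK path is illustrative):

export JAVA_HOME=/usr/lib/jvm/java-1.7.0
cd <YourPentahoRootFolder>/data-integration
./spoon.sh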


4) Create a transformation, then right-click on Database connections to create a new database connection


 

5) Select “Generic Database” connection type and Access as “Native(JDBC)”

 


6) Fill in the following parameters under Settings:

Connection Name: NAMEYOURCONNECTION

Custom Connection URL: jdbc:sap://YOUR_IP_ADDRESS:30015 (30015 is the SQL port for instance 00; in general the port is 3<instance number>15)

Custom Driver Class Name: com.sap.db.jdbc.Driver

User Name: YOURHANAUSER

Password: YOURHANAPASSWORD


 

7) Test your connection.


How to create custom reports using data from MongoDB



Originally posted on Tech Ramblings:

The demonstration below shows how to visually create a report directly against data stored in MongoDB (with no coding required). The following topics are shown:

  1. The Pentaho Data Integration tool is used to create a transformation that does the following:
    1. Connects to and queries MongoDB.
    2. Sorts the query results.
    3. Groups the sorted results.
  2. Pentaho Report Designer is used to visually create a report using the data from the PDI transformation.


Oracle: convert a string field with a list of elements into a set of rows


I will show a tricky way of creating a subquery that builds a set of rows from a string field containing a list of values separated by commas.

Take the example of a string field with the following content: ‘A,B,C,D’. Using REGEXP_SUBSTR you can extract one of the 4 matches (A, B, C, D) at a time: the regex [^,]+ matches any character sequence in the string which does not contain a comma.

If you run:

SELECT REGEXP_SUBSTR ('A,B,C,D','[^,]+') as set_of_rows
FROM   DUAL

you’ll get A.

and if you try running:

SELECT REGEXP_SUBSTR ('A,B,C,D','[^,]+',1,1) as set_of_rows
FROM   DUAL

you’ll also get A, only now we also passed two additional parameters: start looking at position 1 (which is the default), and return the 1st occurrence.

Now let’s run:

SELECT REGEXP_SUBSTR ('A,B,C,D','[^,]+',1,2) as set_of_rows
FROM   DUAL

this time we’ll get B (the 2nd occurrence), and using 3 as the last parameter will return C, and so on.

Using a recursive CONNECT BY along with LEVEL makes sure you’ll receive all the relevant results (not necessarily in the original order, though!):

SELECT DISTINCT REGEXP_SUBSTR ('A,B,C,D','[^,]+',1,LEVEL) as set_of_rows
FROM   DUAL
CONNECT BY REGEXP_SUBSTR ('A,B,C,D','[^,]+',1,LEVEL) IS NOT NULL
order by 1

will return:

set_of_rows
A
B
C
D

which not only contains all 4 results, but also breaks them into separate rows in the result set, which is useful when you want to plug the list into an IN() SQL clause (see the example below).
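
For example, here is a sketch of how the subquery could feed an IN() clause (the table and column names are just placeholders):

SELECT *
FROM   my_table
WHERE  my_column IN (
         SELECT REGEXP_SUBSTR ('A,B,C,D','[^,]+',1,LEVEL)
         FROM   DUAL
         CONNECT BY REGEXP_SUBSTR ('A,B,C,D','[^,]+',1,LEVEL) IS NOT NULL
       )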

This query “abuses” the CONNECT BY functionality to generate rows in a query on DUAL. As long as the expression passed to CONNECT BY is true, it will generate a new row and increase the value of the pseudo-column LEVEL. LEVEL is then passed to REGEXP_SUBSTR to get the nth value when applying the regular expression.

World Cup Dashboard 2014 – in 15 minutes


anonymousbi:

Awesome Dashboard by the way

Originally posted on Pentaho Business Analytics Blog:


Are you caught in the World Cup craze? Two of my passions are English football and analytics (hence I’m an SE at Pentaho based in London). So when it came time for this year’s World Cup, naturally I combined my passions to analyse who is going to win and what makes a winning team.

It turns out that a Big Data Analytics team in Germany tried to predict the winners based on massive data sets. Thus far three of their top five predicted teams have faltered. So what went wrong? Is Big Data not accurate? Are analytics not the answer?

Fret not. Exploring their methodology, their analysis was based on only one source of data. At Pentaho, we believe that the strongest insights come from blended data. We don’t just connect to large data sets; we make connecting to all data easy, regardless of format or location.

So why is my little…


What’s new at Pentaho – Q2 2014 review


Originally posted on Pentaho Business Analytics Blog:

We are a very busy bunch at Pentaho. We are makers and doers, firmly focused on the future of analytics.

Sometimes it’s good to stop and reflect back on customer success stories, product developments and recent achievements. It is an exciting time for Pentaho with the first PentahoWorld worldwide users conference taking place in October, the recent release of Pentaho 5.1 and our users being recognized in national awards for the work they have done using Pentaho analytics.

Below is a summary of what’s new at Pentaho from Q2 2014: news, content, press mentions and blogs.

Press releases

  1. Pentaho Equips Companies to Easily Scale Big Data Operations, Regardless of IT Resources
  2. Introducing the Pentaho Excellence Awards
  3. Pentaho Data Science Pack Operationalizes Use of R and Weka
  4. Pentaho to Host First Worldwide Users’ Conference
  5. Bywaters’ Customers Save Money and CO2 with Pentaho Embedded Analytics
  6. Pentaho Business Analytics Certified on Cloudera 5 for…
