The Pentaho Journey – A Big Data Booster


Originally posted on Pentaho Business Analytics Blog:

Pentaho Journey infographic (updated)

Ten years ago we set out to commoditize BI, disrupt the existing old school proprietary vendors and give customers a better choice. It’s been an exciting journey building up a scalable analytic platform, building an open source community, surviving a deep recession, beating up the competition, building a great team, providing great products and services to our customers, and being a major player in the big data market.

Some of the key points along the way:

2008 – the recession hit, and as painful as that was, it actually helped Pentaho: we were the best value proposition in BI and people were looking to reduce expenditures. It also drastically reduced the cost of commercial office space, and we opened our San Francisco office in Q2 2009.

2009 – at a happy hour in Orlando in November we stumbled upon our ability to natively integrate Pentaho Data Integration…


Pentaho 5.3 – Taming messy and extreme data


anonymousbi:

#Future has arrived #Pentaho53

Originally posted on Pentaho Business Analytics Blog:

There has definitely been an evolution in how the industry talks about data. About five years ago the term ‘Big Data’ emerged, mostly to describe sheer data volume. Soon after, the definition expanded into a better one that explains what it really is: not just big, but data that moves extremely fast, often lacks structure, varies greatly from existing data, doesn’t fit well with more traditional database technologies, and, frankly, is best described as “messy”.

Fast-forward to 2015 and Pentaho’s announcement of version 5.3 this week, delivering on-demand big data analytics at scale on Amazon Web Services and Cloudera Impala. This release is driven by what we see in more and more of our customers: (a new data term for you) EXTREME data problems! Our customer NASDAQ is a very interesting example of where traditional relational data systems have maxed out and have been replaced by cloud architectures that include Hadoop, Pentaho…


Packt $5 eBook Bonanza campaign #packt5dollar


Packt $5 eBook Bonanza – Every title, every topic

Hi friends, the $5 eBook Bonanza is here!

Treat yourself to the eBook or Video of your choice for just $5 and get as many as you like until January 6th 2015. To get you started, we’ve put together the Top 20 Titles of 2014 for you to pick up here. But don’t forget, you can get ANY eBook or Video for $5 in this offer.

For specific tech, you can browse all our categories at the bottom of this page.

Pivotal Greenplum useful DBA SQL queries


1. Viewing table data distribution across segment servers

To view the data distribution of a table’s rows (the number of rows on each segment), you can run a query such as:

SELECT gp_segment_id, count(*)
FROM abc
GROUP BY gp_segment_id;

To see the data distribution of a table’s rows on segment servers and the directory location of where the data is located, you can run a query such as:

-- Replace table_name with the table you want to inspect
SELECT
    table_name.gp_segment_id,
    hostname,
    fselocation AS datadir,
    count(*)
FROM table_name,
     pg_filespace_entry pgfse,
     gp_segment_configuration gsc
WHERE gsc.dbid = pgfse.fsedbid
  AND table_name.gp_segment_id = pgfse.fsedbid
GROUP BY table_name.gp_segment_id, hostname, datadir
ORDER BY gp_segment_id;

Balanced Distribution: A table is considered to have a balanced distribution if all of the segments have roughly the same number of rows.
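One way to spot skew at a glance is to summarise those per-segment counts and compare the smallest and largest. This is a minimal sketch using the same example table abc as above:

select min(cnt) as min_rows_per_segment,
       max(cnt) as max_rows_per_segment,
       max(cnt) - min(cnt) as row_difference
from (
    select gp_segment_id, count(*) as cnt
    from abc
    group by gp_segment_id
) as per_segment;

A large row_difference relative to the table's total row count suggests the distribution key should be revisited.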

2. Check the size of a table

Table Level:

select pg_size_pretty(pg_relation_size('schema.tablename'));

Replace schema.tablename with your table.

Table and Index:

select
pg_size_pretty(pg_total_relation_size('schema.tablename'));

Replace schema.tablename with your table.

3. Check the schema size

select schemaname,
       round(sum(pg_total_relation_size(schemaname||'.'||tablename))/1024/1024) AS "Size_MB"
from pg_tables
where schemaname='SCHEMANAME'
group by 1;

Replace SCHEMANAME with your schema name.

4. Check the database size

To see the size of a specific database:

select pg_size_pretty(pg_database_size('DATABASE_NAME'));

Example:
gpdb=# select pg_size_pretty(pg_database_size('gpdb'));

pg_size_pretty
----------------
24 MB
(1 row)

To see all database sizes:

select datname, pg_size_pretty(pg_database_size(datname))
from pg_database;

Or, using the gp_toolkit schema (converting bytes to GB):

select sodddatname, (sodddatsize/1073741824) AS sizeinGB
from gp_toolkit.gp_size_of_database;

6. Check partitioned table size, including indexes and partitions

The following SQL gives you the size of the employee_daily table, including all of its partitions:

select
schemaname,
tablename,
round(
sum(pg_total_relation_size(schemaname || '.' || partitiontablename))
/1024/1024) "MB"
from pg_partitions
where tablename='employee_daily'
group by 1,2;

 schemaname |   tablename    | MB
------------+----------------+-----
 public     | employee_daily | 254

7. Generate scripts to grant DML privileges to another user or role

david=>
select
'grant select,insert,update, delete on '||schemaname||'.'||tablename||' to gpadmin'
from pg_tables
where tableowner='david';

?column?
------------------------------------------------------------------------------
grant select,insert,update, delete on david.countries to gpadmin
grant select,insert,update, delete on david.departments to gpadmin
grant select,insert,update, delete on david.employees to gpadmin
grant select,insert,update, delete on david.job_history to gpadmin
grant select,insert,update, delete on david.jobs to gpadmin
grant select,insert,update, delete on david.locations to gpadmin
grant select,insert,update, delete on david.regions to gpadmin
grant select,insert,update, delete on david.ext_countries_error to gpadmin
grant select,insert,update, delete on david.ext_countries to gpadmin
grant select,insert,update, delete on david.ext_departments_error to gpadmin
grant select,insert,update, delete on david.ext_departments to gpadmin
grant select,insert,update, delete on david.ext_employees_error to gpadmin
grant select,insert,update, delete on david.ext_employees to gpadmin
grant select,insert,update, delete on david.ext_job_history_error to gpadmin
grant select,insert,update, delete on david.ext_job_history to gpadmin
grant select,insert,update, delete on david.ext_jobs_error to gpadmin
grant select,insert,update, delete on david.ext_jobs to gpadmin
grant select,insert,update, delete on david.ext_locations_error to gpadmin
grant select,insert,update, delete on david.ext_locations to gpadmin
grant select,insert,update, delete on david.ext_regions_error to gpadmin
grant select,insert,update, delete on david.ext_regions to gpadmin
(21 rows)
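If you want to apply the generated statements directly, one approach (a minimal sketch; the file path is a placeholder, and a trailing ';' is appended to each line so the file can be replayed) is to switch psql to tuples-only output, spool the result to a file, and then run that file:

\t
\o /tmp/grants.sql
select 'grant select,insert,update, delete on '||schemaname||'.'||tablename||' to gpadmin;'
from pg_tables
where tableowner='david';
\o
\t
\i /tmp/grants.sql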
8. select count(*) from pg_tables where schemaname='david';   -- count tables in a schema

9. select rolname from pg_roles where rolcanlogin is TRUE;    -- list users (login roles)

10. select rolname from pg_roles where rolcanlogin is FALSE;  -- list roles (no login)

11. select * from pg_views ;
12. select * from information_schema.table_privileges;
13. select * from pg_user;
14. select * from pg_views where viewname like '%priv%';
15. select * from pg_views where viewname like '%role%';
16. select * from pg_views where viewname like '%gran%';
17. select * from pg_views where viewname like '%part%';
18. select * from pg_views where viewname like '%schema%';
19. select * from pg_views where viewname like '%stat%';
20. select * from information_schema.role_table_grants;
21. select * from information_schema.role_routine_grants;
22. select * from information_schema.applicable_roles;
23. select * from information_schema.enabled_roles;
24. select * from gp_toolkit.gp_roles_assigned;
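Alongside these catalog views, the standard PostgreSQL function has_table_privilege(), which Greenplum also provides, is a quick way to confirm a single grant. A minimal sketch with placeholder role and table names:

select has_table_privilege('gpadmin', 'david.countries', 'SELECT');

It simply returns true or false.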

25. Find dependent views created on a table

The following query will give you the dependent views for that table:

select a.*, b.definition
from information_schema.view_table_usage a, pg_views b
where a.table_name='test'
  and a.view_name=b.viewname;

When you alter an existing table (for example dropping a column), you may want to determine if there are any views that depend on the columns you are changing.

Example:

\dt test
        List of relations
 Schema | Name | Type  |  Owner  | Storage
--------+------+-------+---------+---------
 public | test | table | gpadmin | heap

create view test_view as (select * from test);

Run the following query, replacing 'test' in a.table_name='test' with the name of your table:

select a.*, b.definition
from information_schema.view_table_usage a,
     pg_views b
where a.table_name='test'
  and a.view_name=b.viewname;

 view_catalog | view_schema | view_name | table_catalog | table_schema | table_name | definition
--------------+-------------+-----------+---------------+--------------+------------+--------------------------
 ddata        | public      | test_view | ddata         | public       | test       | SELECT test.a FROM test;

26. Connections per database and activity:

SELECT usename,
 application_name, -- only valid for 9.x
 client_addr,
 backend_start,
 query_start,
 current_query
FROM pg_stat_activity
order by usename,client_addr

26b. Running queries and age:

SELECT age(query_start, backend_start) AS queryage,*
 FROM pg_stat_activity 
WHERE current_query NOT LIKE '%IDLE%'
 ORDER BY queryage DESC

27. List of users:

select * from pg_user

28. Connections per database:

SELECT datname, numbackends 
FROM pg_stat_database
WHERE numbackends > 0
ORDER BY numbackends DESC, datname

29. Connections per database (v2):

SELECT datname, COUNT(*) AS numbackends
FROM pg_stat_activity 
GROUP BY datname HAVING COUNT(*) > 0

30. List procedures by user:

select * from pg_proc, pg_user
 where pg_user.usesysid = pg_proc.proowner
 order by proname

31. Top 20 tables by size:

SELECT nspname || '.' || relname AS table,
 pg_size_pretty(pg_total_relation_size(c.oid)) AS size
FROM pg_class c
LEFT JOIN pg_namespace n ON (n.oid = c.relnamespace)
WHERE n.nspname NOT IN ('pg_catalog', 'information_schema') AND
 c.relkind <> 'i' AND
 n.nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 20;

MDX Solve Order, SCOPE_ISOLATION and the Aggregate() function


Reblog from http://cwebbbi.wordpress.com/2014/10/14/mdx-solve-order-scope_isolation-and-the-aggregate-function/

Solve order in MDX is a mess. Back in the good old days of Analysis Services 2000 it was a difficult concept but at least comprehensible; unfortunately when Analysis Services 2005 was released a well-intentioned attempt at making it easier to work with in fact ended up making things much, much worse. In this post I’m going to summarise everything I know about solve order in MDX to try to make this complicated topic a little bit easier to understand.

If you’re an experienced MDXer, at this point you’ll probably lose interest because you think you know everything there is to know about solve order already. Up until two weeks ago that’s what I thought too, so even if you know everything I say in the first half of this post keep reading – there’s some new stuff at the end I’ve only just found out about.

Let’s start with a super-simple cube built from a single table, with two measures (Sales Amount and Cost Amount) and a Product dimension containing a single attribute hierarchy with two members (Apples and Oranges). Everything is built from the following table:

image

Solve Order and calculated members in the WITH clause

To understand what solve order is and how it can be manipulated, let’s start off looking at an example that uses only calculated members in the WITH clause of a query. Consider the following:

WITH
 MEMBER [Measures].[Cost %] AS
DIVIDE([Measures].[Cost Amount],[Measures].[Sales Amount]),
FORMAT_STRING='0.0%'
MEMBER [Product].[Product].[Total Fruit] AS
SUM({[Product].[Product].&[Apples],
[Product].[Product].&[Oranges]})
SELECT
{[Measures].[Sales Amount],[Measures].[Cost Amount],
MEASURES.[Cost %]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],
[Product].[Product].[Total Fruit]}
ON ROWS
FROM SALES

There are two calculated members here:

  • Cost % divides Cost Amount by Sales Amount to find the percentage that costs make up of the sales amount
  • Total Fruit sums up the values for Apples and Oranges

The output of the query is as follows:

image

Solve order controls the order that MDX calculations are evaluated when two or more of them overlap in the same cell. In this case Cost % and Total Fruit are both evaluated in the bottom right-hand cell; Total Fruit is calculated first, giving the values of 30 for Sales Amount and 21 for Cost Amount, and Cost % is calculated after that. The bottom right-hand cell is the only cell where these two calculations overlap and the only cell where solve order is relevant in this query.

In this case, 70% is the value you would expect to get. However, you can control the solve order for calculations in the WITH clause by setting the SOLVE_ORDER property for each calculated member, like so:

WITH
 MEMBER [Measures].[Cost %] AS
DIVIDE([Measures].[Cost Amount],[Measures].[Sales Amount]),
FORMAT_STRING='0.0%',
SOLVE_ORDER=1
 MEMBER [Product].[Product].[Total Fruit] AS
SUM({[Product].[Product].&[Apples],
[Product].[Product].&[Oranges]}),
SOLVE_ORDER=2
 SELECT
{[Measures].[Sales Amount],[Measures].[Cost Amount],
MEASURES.[Cost %]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],
[Product].[Product].[Total Fruit]}
ON ROWS
FROM SALES

image

Now the value in the bottom right-hand corner is 135% instead of 70%: Cost % is calculated first and Total Fruit second, so 60% + 75% = 135%. The SOLVE_ORDER property of a calculated member is an integer value, and the lower the SOLVE_ORDER value the earlier the calculation is evaluated, so with Cost % having a solve order of 1 and Total Fruit a solve order of 2, Cost % is now forced to be calculated first, even though in this case it gives what is clearly an ‘incorrect’ result.

Solve Order and calculated members defined on the cube

Things now get a bit more complicated. There’s a different way of controlling solve order if your calculations are defined on the cube itself: in this case, solve order is determined by the order that the calculations appear on the Calculations tab. So if the calculations tab of the Cube Editor contains the calculations in this order:

CREATE MEMBER CURRENTCUBE.[Measures].[Cost %] AS
DIVIDE([Measures].[Cost Amount],[Measures].[Sales Amount]),
FORMAT_STRING='0.0%';
CREATE MEMBER CURRENTCUBE.[Product].[Product].[Total Fruit] AS
SUM({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]});

image

…and you run the following query:

SELECT
{[Measures].[Sales Amount],[Measures].[Cost Amount],
MEASURES.[Cost %]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],[Product].[Product].[Total Fruit]}
ON ROWS
FROM SALES

You get the incorrect result again:

image

…but if you change the order of the calculations so that Total Fruit comes first:

image

…and rerun the same query, you get the correct results:

image

The SOLVE_ORDER property can also be used with calculations defined on the cube to override the effect of the order of calculations. So defining the following calculations on the cube:

CREATE MEMBER CURRENTCUBE.MEASURES.[Cost %] AS
DIVIDE([Measures].[Cost Amount], [Measures].[Sales Amount]),
FORMAT_STRING='PERCENT', SOLVE_ORDER=2;
 
CREATE MEMBER CURRENTCUBE.[Product].[Product].[Total Fruit] AS
SUM({[Product].[Product].&[Apples],
[Product].[Product].&[Oranges]}), SOLVE_ORDER=1;

image

…means that, even though Total Fruit comes after Cost % on the Calculations tab, because it has a lower solve order set using the SOLVE_ORDER property it is evaluated before Cost % and the query still returns the correct value:

image

Solve order and calculations defined in the WITH clause and on the cube

What happens if some calculations are defined on the cube, and some are defined in the WITH clause of a query? By default, calculations defined on the cube always have a lower solve order than calculations defined in the WITH clause of a query; the SOLVE_ORDER property has no effect here. So if Total Fruit is defined in the WITH clause and Cost % on the cube, you get the incorrect result:

image

WITH MEMBER [Product].[Product].[Total Fruit] AS
SUM({[Product].[Product].&[Apples],
[Product].[Product].&[Oranges]})
 SELECT
{[Measures].[Sales Amount],[Measures].[Cost Amount],
MEASURES.[Cost %]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],[Product].[Product].[Total Fruit]}
ON ROWS
FROM SALES

image

Of course, if Total Fruit is defined on the cube and Cost % is defined in the WITH clause you will get the correct answer. However, usually measures like Cost % are defined on the cube and it’s calculations like Total Fruit, which define custom groupings, that are defined on an ad hoc basis in the WITH clause. This is a problem.

The SCOPE_ISOLATION property

This default behaviour of calculations defined on the cube always having a lower solve order than calculations in the WITH clause can be overridden using the SCOPE_ISOLATION property. Setting SCOPE_ISOLATION=CUBE for a calculated member defined in the WITH clause will give that calculated member a lower solve order than any calculations defined on the cube. So, with Cost % still defined on the cube the following query now gives the correct results:

WITH MEMBER [Product].[Product].[Total Fruit] AS
SUM({[Product].[Product].&[Apples],
[Product].[Product].&[Oranges]}),
SCOPE_ISOLATION=CUBE
 SELECT
{[Measures].[Sales Amount],[Measures].[Cost Amount],MEASURES.[Cost %]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],[Product].[Product].[Total Fruit]}
ON ROWS
FROM SALES

image

The Aggregate() function

Using the MDX Aggregate() function (and in fact also the VisualTotals() function – but you probably won’t ever want to use it) inside a calculation has a similar effect to the SCOPE_ISOLATION property in that it forces a calculation to be evaluated at a lower solve order than anything else. Therefore, in the previous example, instead of using the SCOPE_ISOLATION property you can change the calculation to use the Aggregate() function instead of Sum() and get the correct results:

WITH
 MEMBER [Product].[Product].[Total Fruit] AS
AGGREGATE({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]})
 SELECT
{[Measures].[Sales Amount],[Measures].[Cost Amount],MEASURES.[Cost %]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],[Product].[Product].[Total Fruit]}
ON ROWS
FROM SALES

image

The general rule is, therefore, whenever you are creating custom-grouping type calculated members like Total Fruit in the WITH clause of a query, to use the Aggregate() function rather than Sum(). The fact that Aggregate() takes into account the AggregateFunction property of each measure on the cube (so that distinct count, min and max measures are dealt with correctly) is another good reason to use it.

Using the Aggregate() function in calculations defined on the cube has the same effect. Even when the Total Fruit calculated member is defined after Cost % on the Calculations tab, as here:

image

…so long as Total Fruit uses the Aggregate() function, running the test query gives the correct result:

SELECT
{[Measures].[Sales Amount],[Measures].[Cost Amount],MEASURES.[Cost %]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],[Product].[Product].[Total Fruit]}
ON ROWS
FROM SALES

image

There are some very interesting details about the way Aggregate() changes solve order though.

First of all, using the Aggregate() function in a calculated member doesn’t change the solve order of the whole calculation, just the part of the calculation that uses the Aggregate() function. With the following calculations defined on the cube:

CREATE MEMBER CURRENTCUBE.[Measures].[Cost %] AS
DIVIDE([Measures].[Cost Amount],[Measures].[Sales Amount]),
FORMAT_STRING='0.0%';

CREATE MEMBER CURRENTCUBE.[Product].[Product].[One Aggregate] 
AS AGGREGATE({[Product].[Product].&[Apples],
[Product].[Product].&[Oranges]});
 CREATE MEMBER CURRENTCUBE.[Product].[Product].[One Sum] AS
SUM({[Product].[Product].&[Apples],
[Product].[Product].&[Oranges]});

CREATE MEMBER CURRENTCUBE.[Product].[Product].[Two Aggregates] AS
AGGREGATE({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]})
+
AGGREGATE({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]});

CREATE MEMBER CURRENTCUBE.[Product].[Product].[Two Sums] AS
SUM({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]})
+
SUM({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]});

CREATE MEMBER CURRENTCUBE.[Product].[Product].[One Aggregate One Sum] AS
AGGREGATE({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]})
+
SUM({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]});

…running the following query:

SELECT
{[Measures].[Sales Amount],[Measures].[Cost Amount],MEASURES.[Cost %]}
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],[Product].[Product].[One Aggregate],[Product].[Product].[One Sum],[Product].[Product].[Two Aggregates],[Product].[Product].[Two Sums],[Product].[Product].[One Aggregate One Sum]}
ON ROWS
FROM SALES

…gives these results:

image

The value returned for the calculation [One Aggregate One Sum], which contains an Aggregate() and a Sum(), shows that the value returned by the Aggregate() is evaluated at a different solve order than the value returned by Sum(), even if they are inside the same calculated member.

Furthermore, in some very obscure cases the contents of the set passed to the Aggregate() function determine whether its special solve order behaviour happens or not. I don’t know for sure what all those cases are but I have seen this happen with time utility (aka date tool aka shell) dimensions. Here’s an example.

The demo cube I’ve been using in this post has been changed to add a new dimension, called Data Type, which has just one hierarchy with one member on it called Actuals; Data Type is a fairly standard time utility dimension. The Cost % calculation has also been changed so that it’s now a calculated member on the Data Type dimension, although it is still defined on the cube. Here’s its new definition:

CREATE MEMBER CURRENTCUBE.[Data Type].[Data Type].[Cost %] AS
DIVIDE(
([Measures].[Cost Amount],[Data Type].[Data Type].&[Actuals]),
([Measures].[Sales Amount],[Data Type].[Data Type].&[Actuals])),
FORMAT_STRING='0.0%';

Now if I run the following query:

WITH MEMBER [Product].[Product].[Simple Set] AS
AGGREGATE({[Product].[Product].&[Apples],[Product].[Product].&[Oranges]})
 MEMBER [Product].[Product].[Nextmember Function Used] AS
AGGREGATE({[Product].[Product].&[Apples],[Product].[Product].&[Apples].NEXTMEMBER})
 MEMBER [Product].[Product].[Descendants Function Used] AS
AGGREGATE(DESCENDANTS({[Product].[Product].&[Apples],
[Product].[Product].&[Oranges]}))
 MEMBER [Product].[Product].[Descendants Function Used Twice] AS
AGGREGATE({
DESCENDANTS([Product].[Product].&[Apples]),
DESCENDANTS([Product].[Product].&[Oranges])
})
 MEMBER [Product].[Product].[Descendants Function Used Twice With Union] AS
AGGREGATE(
UNION(
DESCENDANTS([Product].[Product].&[Apples]),
DESCENDANTS([Product].[Product].&[Oranges])
))
 SELECT
{[Measures].[Sales Amount]} * [Data Type].[Data Type].ALLMEMBERS
ON COLUMNS,
{[Product].[Product].&[Apples],[Product].[Product].&[Oranges],
[Product].[Product].[Simple Set],[Product].[Product].[Nextmember Function Used],
[Product].[Product].[Descendants Function Used], [Product].[Product].[Descendants Function Used Twice],
[Product].[Product].[Descendants Function Used Twice With Union]}
ON ROWS
FROM [Sales With Data Type]

I get these results:

image

Note that for some of the calculations, the Aggregate() function results in a lower solve order in the way we’ve already seen, but not for all of them. Using the NextMember() function, or having two Descendants() functions without wrapping them in a Union() function, seems to stop SSAS assigning the calculation a lower solve order. Ugh. Luckily, though, I have only been able to replicate this with calculated members from two non-measures dimensions; if Cost % is a calculated measure Aggregate() always gives the lower solve order. Apparently this is something that SSAS does on purpose to try to recognise ‘visual total’-like calculated members and make them work the way you want automatically. This is definitely something to beware of if you are using time utility dimensions and calculations on other dimensions though, as it may result in incorrect values being displayed or performance problems if you’re not careful.

[Thanks to Gabi Münster for showing me how Aggregate() works with different sets and Marius Dumitru for confirming that this is intended behaviour]

Pentaho Data Integration scheduling with Jenkins


“As a System Administrator I need to find a scheduling solution for our Pentaho Data Integration Jobs.”
Reblog from http://opendevelopmentnotes.blogspot.com/2014/09/pentaho-data-integration-scheduling.html
Scheduling is a crucial task in all ETL and Data Integration processes. The scheduling options available in the community edition of Pentaho Data Integration (Kettle) basically rely on the operating system's capabilities (cron on Linux, Task Scheduler on Windows), but there is at least one other free, open source and solid alternative for job scheduling: Jenkins.
Jenkins is a Continuous Integration tool, the de facto standard in Java projects, and it is so extensible and easy to use that it does a perfect job of scheduling Jobs and Transformations developed in Kettle.
So let's start building a (probably) production-ready scheduling solution.

System configuration

OS: Oracle Linux 6
PDI: 5.1.0.0
Java: 1.7
Jenkins: 1.5

Install Jenkins

Installing Jenkins on Linux is trivial: run a few commands and in a few minutes you will have the system up and running.

#sudo wget -O /etc/yum.repos.d/jenkins.repo http://pkg.jenkins-ci.org/redhat/jenkins.repo
#sudo rpm --import https://jenkins-ci.org/redhat/jenkins-ci.org.key
#sudo yum install jenkins

At the end of the installation process you will have your Jenkins system ready to run.

Before starting Jenkins, verify that Java is installed by running:

#java -version

and if it’s not found on your system just install it with:

#sudo yum install java

Now it’s time to start Jenkins:

#sudo service jenkins start

Open your browser and go to the Jenkins console page (port 8080 by default).

Resolve port conflict

If you are not able to navigate to the web page, check the log file:

#sudo cat /var/log/jenkins/jenkins.log

Probably there is a port conflict (in my case I was running another web application on the same machine).

Look at your config file:

#sudo nano /etc/sysconfig/jenkins

and change the default ports:

JENKINS_PORT="8082"

JENKINS_AJP_PORT="8011"

Job example

Now that Jenkins is up and running, it’s time to test a simple job.

The transformation and job are self-explanatory.

Scheduling

Go to the Jenkins web console and click on New Item.
Give it a name and choose the Freestyle project type.
Set the schedule (every minute, just to test the job).
Fill the Build section with the Kitchen command (an example is sketched after these steps) and save the project.
Wait a minute and look at the left side of the page: you will find your job running.
Click the build item and select Console Output to see the main output of Kitchen.
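To make the last two steps concrete, here is a minimal sketch. The schedule field uses standard cron syntax, and the Build section can be an "Execute shell" step that calls Kitchen; the installation path and job file below are assumptions, so adapt them to your environment.

# "Build periodically" schedule, every minute, for testing only:
* * * * *

# "Execute shell" build step: run a Kettle job with Kitchen
/opt/pentaho/data-integration/kitchen.sh -file=/opt/pentaho/etl/sample_job.kjb -level=Basic

Kitchen's exit code is also picked up by Jenkins, so a failed job should mark the build as failed.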

CONCLUSION

Jenkins is a powerful tool and, even if this is not its primary purpose, you can use it as your enterprise scheduler, taking advantage of all the options for executing, monitoring and managing your Kettle Jobs.
Explore all the features that Jenkins provides and build your own free, solid and open source scheduling solution.
Take advantage of the large Jenkins community to handle the most complex scheduling scenarios, and from time to time, if you find anything interesting, remember to give it back to the community.