6. Talend

6.1. Getting started with Talend

Talend Integration is an open source database tool which, as its name suggests, is used for integrating databases. Its concept is to let users design all the tasks graphically in an Eclipse-based tool, Talend Studio, which then converts the design into Java code so that it can be run.

The idea is to create a job in the Studio, which contains a series of steps to be executed. In each job, we define the tasks using a rich set of ready-to-run components.

6.2. Talend Jobs & MetaData

Talend Studio is based on Eclipse, so its workspace looks much like Eclipse's. In the Job Designs panel there is a folder called data preparation, whose jobs prepare our database.

6.2.1. Data Preparation Jobs

Warning

This applies only to the very first versions of the project. It won't be needed once the project matures.

The warning is there because we have to start from scratch: there is no data at the beginning. As explained in the database section, the tables fall into 2 types:

  • fixed: tables whose data rarely or never changes, mostly reference data. For instance, the country (or market) table or the server (channel) table.
  • movable: tables that are updated frequently, such as the transaction tables.

The main task of these Talend jobs is to fill (or update) the transaction tables. Because each transaction is linked to the fixed tables, those tables must already contain data; otherwise inserting new transactions fails with an error.

For this reason, preparatory tasks are needed to extract the fixed data from the log itself and then fill the fixed tables progressively. This method has an obvious weakness: even with a 'large enough' data sample, some values that should be present may be missing from the selected data.

Concretely, the implementation is a job made up of several mappings from the CSV files produced by the Python scripts to the corresponding database tables.
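
To illustrate the idea of deriving the fixed data from the log itself, here is a minimal Java sketch. It is not the Talend job: the class name, the `;` separator and the column positions are assumptions made for the example. It collects the distinct values of one column, which is roughly what a preparatory mapping has to produce before it can fill a fixed table.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class ReferenceDataSketch {

    /** Collects the distinct non-empty values found at columnIndex, in first-seen order. */
    static Set<String> distinctValues(List<String> csvLines, int columnIndex, String separator) {
        Set<String> values = new LinkedHashSet<>();
        for (String line : csvLines) {
            // Limit -1 keeps trailing empty fields so indexes stay stable.
            String[] fields = line.split(Pattern.quote(separator), -1);
            if (columnIndex < fields.length) {
                String value = fields[columnIndex].trim();
                if (!value.isEmpty()) {
                    values.add(value);
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Assumed layout: date;market;server;status
        List<String> sample = Arrays.asList(
                "240229;FR;srv1;OK",
                "240229;DE;srv1;KO",
                "240229;FR;srv2;OK");
        System.out.println(distinctValues(sample, 1, ";")); // markets seen in the sample
        System.out.println(distinctValues(sample, 2, ";")); // servers seen in the sample
    }
}
```

The sketch also shows the weakness mentioned above: a market or server that never appears in the sample is simply absent from the result.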

6.2.2. Data Filling

This job is composed of several consecutive jobs.

6.2.2.1. Pre-Jobs

The first ones, called “pre-jobs”, set up the context variables, such as the current time and the list of files whose names match our naming convention.

pre_job
// "context" is the Talend context object shared with the following jobs.
java.util.Calendar cal = java.util.Calendar.getInstance();
// Date format used in table and file names (fully qualified, so no import is needed)
java.text.SimpleDateFormat formatter = new java.text.SimpleDateFormat("yyMMdd");

// Yesterday: the most recent log available
cal.add(java.util.Calendar.DATE, -1);
String yesterday = formatter.format(cal.getTime());
context.yesterday = yesterday;
context.atc_yesterday = String.format("atc_trans_%s", yesterday);

// The day before yesterday (cal already points at yesterday)
cal.add(java.util.Calendar.DATE, -1);
String the_day_bf_yesterday = formatter.format(cal.getTime());
context.the_day_bf_yesterday = the_day_bf_yesterday;
context.atc_the_day_bf_yesterday = String.format("atc_trans_%s", the_day_bf_yesterday);

// Location of the CSV file read by the main job
context.atc_prod_path = String.format("/data2/gctmp/pppdelde/PPP_STAT/ATC/PDT/rawData/ATC_%s.csv", yesterday);

As seen in the code snippet, the expected location of the ATC logs is “/data2/gctmp/pppdelde/PPP_STAT/ATC/PDT/rawData/” (this is actually the location of the production logs). We set the context variables to yesterday (the latest log available, which explains the one-day delay) and to the day before that, and pass them to the next 2 jobs, whose tasks are to drop and create the daily database table.

The location of the CSV file needed by the main jobs is also set here.
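
The same date logic can be tried outside Talend. The sketch below is a standalone approximation that uses `java.time` instead of `Calendar`; the class and method names are hypothetical, and the raw data directory is passed in as a parameter rather than hard-coded. It only shows how the table names and the CSV path follow from a given day.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical standalone version of the pre-job naming logic.
final class PreJobContextSketch {

    // Same pattern as in the pre-job snippet: two-digit year, month, day.
    private static final DateTimeFormatter YYMMDD = DateTimeFormatter.ofPattern("yyMMdd");

    /** Name of the daily transaction table for the given day. */
    static String tableNameFor(LocalDate day) {
        return "atc_trans_" + day.format(YYMMDD);
    }

    /** Path of the daily CSV file inside rawDataDir (given without a trailing slash). */
    static String csvPathFor(String rawDataDir, LocalDate day) {
        return rawDataDir + "/ATC_" + day.format(YYMMDD) + ".csv";
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.now();
        LocalDate yesterday = today.minusDays(1);          // most recent log
        LocalDate dayBeforeYesterday = today.minusDays(2); // previous daily table

        System.out.println(tableNameFor(yesterday));
        System.out.println(tableNameFor(dayBeforeYesterday));
        // "/tmp/rawData" is a placeholder directory for the example.
        System.out.println(csvPathFor("/tmp/rawData", yesterday));
    }
}
```

Computing both days from the same `today` value avoids relying on the order of successive `add` calls, which is what the `Calendar` version does.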

6.2.2.2. Main-Jobs

Then comes the main job, which reads the CSV files of a given product, produced by the Python script, and fills the transaction table of that product (naming convention: product_transaction_currentDate). It uses a tMap to map the attributes in the CSV file onto the database fields. Mostly, it helps clean up the data and put it in the right order.
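
As a rough picture of what such a mapping does, the following Java sketch turns one CSV line into a cleaned, reordered record. It is only an illustration: the CSV layout (`server;market;duration_ms`), the field names and the cleaning rules are all assumptions, not the project's actual schema or tMap configuration.

```java
import java.util.Locale;

// Hypothetical mapping from a CSV line to a database row, for illustration only.
final class TransactionMappingSketch {

    /** Target row, with fields in the (assumed) order of the database table. */
    static final class Transaction {
        final String market;
        final String server;
        final int durationMs;

        Transaction(String market, String server, int durationMs) {
            this.market = market;
            this.server = server;
            this.durationMs = durationMs;
        }
    }

    /** Maps a line in the assumed layout server;market;duration_ms onto a Transaction. */
    static Transaction map(String csvLine) {
        String[] fields = csvLine.split(";", -1);
        if (fields.length < 3) {
            throw new IllegalArgumentException("Expected 3 fields but got: " + csvLine);
        }
        // Clean up: trim whitespace and normalise the market code to upper case.
        String server = fields[0].trim();
        String market = fields[1].trim().toUpperCase(Locale.ROOT);
        int durationMs = Integer.parseInt(fields[2].trim());
        // Reorder: the target row starts with the market, unlike the CSV.
        return new Transaction(market, server, durationMs);
    }

    public static void main(String[] args) {
        Transaction t = map(" srv1 ; fr ; 120 ");
        System.out.println(t.market + " " + t.server + " " + t.durationMs);
    }
}
```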

6.2.2.3. Post-Jobs

The post-job is a script that fills the aggregation table of that transaction table (naming convention: product_agg_trans). It takes the transactions of the current date and pushes them into this table, grouped by certain attributes that depend on the product.
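
The grouping step can be pictured with the short Java sketch below, which counts transactions per value of one attribute. The class name, the row layout and the choice of a simple count are assumptions for the example; the real script may group on several attributes and compute other measures.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical aggregation over one day's transactions, for illustration only.
final class AggregationSketch {

    /** Counts the rows per distinct value of the column at keyIndex, sorted by key. */
    static Map<String, Long> countBy(List<String[]> transactions, int keyIndex) {
        return transactions.stream()
                .collect(Collectors.groupingBy(
                        row -> row[keyIndex],
                        TreeMap::new,
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        // Assumed row layout: market, server
        List<String[]> rows = Arrays.asList(
                new String[] {"FR", "srv1"},
                new String[] {"DE", "srv1"},
                new String[] {"FR", "srv2"});
        System.out.println(countBy(rows, 0)); // prints {DE=1, FR=2}
    }
}
```

Each entry of the resulting map corresponds to one row of the aggregation table for that day.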
