Friday 14 December 2012

Using Talend Big Data to put files on Hadoop HDFS

As I struggled too much to perform this simple task (couldn't locate any help), I thought I'd write up my step-by-step instructions. 

Target audience: You have a working Hadoop cluster with HDFS set up, and you have Talend Big Data (or another Talend flavour) installed (and know how to use it).  You should also already be able to make a correct connection between Talend and HDFS (in my example, Hadoop runs on a different server).

[For ultra-short version, see bottom]

In the picture you can see the Job, just 3 components (I could have done with 1, but I don't think the others overcomplicate).

I've made a separate component to connect to Hadoop.  Not necessary, but just handy :)

The settings:

  • Distribution: make sure that's correctly configured (also make sure that the necessary .jar files are consistent between the version Talend uses and the version on your Hadoop server; you can find a lot of help on that topic)
  • Namenode URI: Don't forget the quotes (always have to do that in Talend), and then it's "hdfs://<hadoopserver>:<filesystemPortnr>"
    • instead of the server name (e.g. localhost) you can use the IP address as well
    • The port number is often 9000 (in my case it is 54310); if you're unsure, check the value of fs.default.name in the <HADOOP_HOME>/conf/core-site.xml file
  • User name: a user that has proper rights on the HDFS system (I installed Hadoop on an Ubuntu server, and with the install the hduser was created and configured as superuser for HDFS automatically)
  • That's it, no password needed (why? I don't know, HDFS doesn't seem to ask for one by default)
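
If you'd rather not read the XML by eye, the port lookup described above can be sketched in a few lines of Python. This is only an illustration, not part of the Talend job: the sample XML, the host name `hadoopserver` and the function name `namenode_uri` are made up for the example; only the property name fs.default.name comes from the setup above.

```python
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml content; on a real system, read the text of
# <HADOOP_HOME>/conf/core-site.xml instead.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoopserver:54310</value>
  </property>
</configuration>
"""


def namenode_uri(core_site_xml, key="fs.default.name"):
    """Return the value of `key` from core-site.xml text, or None if absent."""
    root = ET.fromstring(core_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name", default="").strip() == key:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


if __name__ == "__main__":
    uri = namenode_uri(SAMPLE_CORE_SITE)
    # Talend fields are Java expressions, so the value is entered with quotes.
    print('"%s"' % uri)
```

Whatever this returns for your own file is what goes into the Namenode URI field (with the quotes around it).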

Next component:

I suppose this is self-explanatory: I'm collecting json files from a directory

Next component:

And the configuration (and here is where I struggled for quite some time):

  • "Use an existing connection": as I made the first component, I just refer to that here.
  • Local Directory: directory where the source files are located
    • In my case I use the variable passed by the Filelist component
    • In a simple case (delete the FileList component) you can hardcode: "C:/path/to/files" (forward slashes, because Talend fields are Java strings and a backslash would have to be doubled)
  • HDFS directory: the target directory (so what you have configured on your hadoop server as HDFS filesystem)
    • if you're not sure, on Ubuntu you can check like this:
    • or, via the web interface (from your Talend system):
      • by default, address is 'hostname:50070' (so e.g. localhost:50070)
      • also note that the port number mentioned in the title of that page (in my case 54310) is the one you need to use in the HDFSConnection component (see above)
      • Just click "browse the filesystem" to see what exists, and what the permissions are
  • Overwrite file: self-explanatory, I suppose
  • Files
    • Filemask: This is the source filemask.  In the simpler scenario (no FileList component), you can use "*.*" (i.e. take all files from the source directory)
    • New name: this is the filename that should be given on the target system.  I used the variable passed from FileList (the current file name).  In the simpler scenario you can use "" (2 double quotes with nothing in between) to keep the original filenames
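
For readers who think in code more easily than in Talend components, here is roughly what the file list plus HDFS put combination does, as a Python sketch. It is not what Talend generates; it only builds the `hadoop fs -put` command that would upload each matching file under its original name. The function names, the "*.json" mask and the /user/hduser/json target directory are placeholders I made up for the example.

```python
import fnmatch
import os


def matching_files(local_dir, filemask):
    """Return the sorted names of regular files in local_dir matching filemask."""
    names = [
        name for name in os.listdir(local_dir)
        if os.path.isfile(os.path.join(local_dir, name))
    ]
    return sorted(fnmatch.filter(names, filemask))


def put_commands(local_dir, filemask, hdfs_dir):
    """Build one `hadoop fs -put <local> <target>` command per matching file.

    The original file name is kept on the target side, which corresponds to
    leaving "New name" empty in the simple scenario.
    """
    commands = []
    for name in matching_files(local_dir, filemask):
        local_path = os.path.join(local_dir, name)
        target = hdfs_dir.rstrip("/") + "/" + name
        commands.append(["hadoop", "fs", "-put", local_path, target])
    return commands


if __name__ == "__main__":
    # Placeholder paths: adjust to your own source and HDFS directories.
    for cmd in put_commands(".", "*.json", "/user/hduser/json"):
        print(" ".join(cmd))
```

The sketch only prints the commands, so it is safe to run on a machine without Hadoop installed.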

That's it, after running it, I can see:

My 2 input files are on Hadoop HDFS.

And now, to demonstrate that you understand what's going on in this overcomplex example: why did I use the variable for filemask and new name?  If I had e.g. 1000 source files, what would happen if I put "*.*" and "" respectively?


OK, so when I first published this (via another channel), I got some requests for the simple scenario, so if you want the hardcoded one-step version (source files straight to HDFS), here it is:

I hope this saves some people some time, let me know if basic stuff like this is useful, then I'll definitely post more.

Saturday 1 December 2012

Easy Fix for Frozen Linux Mint 14.1 installation on Virtualbox

Help with Virtualbox Installation of Linux Mint 14.1 MATE 32-bit

Every now and again I try a new Linux distro, just to keep a bit in touch with things.  I've got a couple of Linux servers running, but as a means to an end (Hadoop clusters, LAMPs, NAS, ...), not to experiment with Linux.  For those I used to use Red Hat (before it became Fedora), then Fedora, and the last 5 years or so I've been using Ubuntu.

I've recently set up my Hadoop clusters on Ubuntu 12.04, and read about a new Linux Mint release: 14(.1).  For an Eclipse-based IDE (with app servers) I usually go for something more desktop-oriented, so I wanted to give it a go.

My usual work method is to test it out on VirtualBox.  While trying to install I ran into a problem (one I've seen before) that I couldn't quickly solve by checking the InterDaWeb, but that turns out to be quite easy to fix.  So, just a quick heads-up for the people that may experience the same thing.

When setting up a new VM and starting the install, the system hangs on this part:

--> Yep, the first screen :)  After a short while there's no disk nor CD activity anymore.

I tried rebooting, and pressing F2 to get the menu:
Besides the normal start, I also tried the integrity check, but the system was unresponsive as well.

Just as a guess, I tried the compatibility mode as well, and that quickly revealed the problem:

So, that sure rang a bell :)  I've had that problem before.  The quick and easy fix:
- Shut down the VM (otherwise you can't change settings)
- Go into Settings - System - Processor, and check the box for "Enable PAE/NX".

- Start up the VM again
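
If you prefer the command line over clicking through the settings, the same switch can be flipped with VirtualBox's `VBoxManage modifyvm` command. Below is a small Python sketch that builds that call; the function names and the VM name "Mint14" are placeholders of mine, and by default it only prints the command instead of running it.

```python
import subprocess


def enable_pae_command(vm_name):
    """Build the VBoxManage call that switches PAE/NX on for vm_name."""
    return ["VBoxManage", "modifyvm", vm_name, "--pae", "on"]


def enable_pae(vm_name, dry_run=True):
    """Print the command, and run it only when dry_run is False.

    The VM must be powered off, just as with the GUI route.
    """
    cmd = enable_pae_command(vm_name)
    print(" ".join(cmd))
    if not dry_run:
        # Raises CalledProcessError if VBoxManage reports a failure.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # "Mint14" is a placeholder; use the name shown in your VirtualBox list.
    enable_pae("Mint14")
```

Call `enable_pae("your VM name", dry_run=False)` to actually apply the change.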

That's all. It solved the problem for me, and hopefully this will save some people from having to look too hard for this tip!