Friday, December 14, 2012

Using Talend Big Data to put files on Hadoop HDFS

As I struggled quite a bit with this simple task (and couldn't locate any help), I thought I'd write up my step-by-step instructions.

Target audience: you have a working Hadoop cluster with HDFS set up, and you have Talend Big Data (or another Talend flavour) installed (and know how to use it). You should also already be able to make a correct connection between Talend and HDFS (in my example, Hadoop runs on a different server).

[For ultra-short version, see bottom]



In the picture you can see the Job: just 3 components (I could have done it with 1, but I don't think the other two overcomplicate things).

I've made a separate component (tHDFSConnection) to connect to Hadoop. Not necessary, but handy :)

The settings:


  • Distribution: make sure this matches your cluster (also make sure that the necessary .jar files in Talend are consistent with your Hadoop version; you can find a lot of help on that topic)
  • NameNode URI: don't forget the quotes (you always have to do that in Talend), and then it's "hdfs://<hadoopserver>:<filesystemPortnr>"
    • instead of the server name (e.g. localhost) you can use the IP address as well (e.g. 127.0.0.1)
    • by default the port number is 9000; if you're unsure, you can check it in the <HADOOP_HOME>/conf/core-site.xml file, in the value of fs.default.name (see the snippet after this list)
  • User name: a user that has proper rights on the HDFS system (I installed Hadoop on an Ubuntu server, and the install created and configured the hduser account as HDFS superuser automatically)
  • That's it, no password needed (by default HDFS uses simple authentication: it trusts the user name the client sends, and doesn't ask for a password)
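If you're unsure about the URI or port, you can check from the Hadoop server itself. A minimal sketch, assuming a shell on that server with HADOOP_HOME set and the hadoop command on the PATH:

    # show the configured namenode URI (fs.default.name) and the line after it
    grep -A 1 "fs.default.name" $HADOOP_HOME/conf/core-site.xml

    # test the exact URI you'd type in Talend (run as the HDFS user, e.g. hduser)
    hadoop fs -ls hdfs://<hadoopserver>:9000/

Replace <hadoopserver> and the port number with your own values, of course.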

Next component:


I suppose this is self-explanatory: I'm collecting the .json files from a directory with a tFileList component.
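For reference, the tFileList configuration boils down to two fields; a sketch, with example values of my own (your directory will obviously differ):

    Directory : "C:/path/to/json/files"
    Filemask  : "*.json"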


Next component:

And the configuration (and here is where I struggled for quite some time):


  • "Use an existing connection": as I made the first component, I just refer to that here.
  • Local Directory: directory where the source files are located
    • In my case I use the variable passed by the Filelist component
    • In a simple case (delete the FileList component) you can harcode : "C:\path\to\files"
  • HDFS directory: the target directory (i.e. a path inside the HDFS filesystem you configured on your Hadoop server)
    • if you're not sure, on Ubuntu you can check it with hadoop fs -ls / (run as the HDFS user)
    • or via the web interface (reachable from your Talend system):
      • by default, the address is 'hostname:50070' (so e.g. localhost:50070)
      • also note that the port number mentioned in the page title (in my case 54310) is the one you need to use in the tHDFSConnection component (see above)
      • just click "Browse the filesystem" to see what exists, and what the permissions are
  • Overwrite file: self-explanatory, I suppose
  • Files (see the sketch after this list)
    • Filemask: this is the source filemask. In the simpler scenario (no tFileList component), you can use "*.*" (i.e. take all files from the source directory)
    • New name: the filename that should be given on the target system. I used the variable passed from tFileList (the current file name). In the simpler scenario you can use "" (2 double quotes with nothing in between) to keep the original filenames
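Put together, the tHDFSPut fields in my tFileList-driven job look roughly like this; a sketch, where the HDFS path "/user/hduser/json" is a hypothetical example and the globalMap variables assume the tFileList component is named tFileList_1:

    Local directory : ((String)globalMap.get("tFileList_1_CURRENT_FILEDIRECTORY"))
    HDFS directory  : "/user/hduser/json"
    Filemask        : ((String)globalMap.get("tFileList_1_CURRENT_FILE"))
    New name        : ((String)globalMap.get("tFileList_1_CURRENT_FILE"))

Because tFileList iterates, tHDFSPut runs once per file, each time with a filemask that matches exactly one file.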

That's it, after running it, I can see:

My 2 input files are on Hadoop HDFS.
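If you'd rather verify from the cluster than from the web interface, the same check can be done on the command line; a sketch, assuming the hypothetical target directory from the sketch above:

    hadoop fs -ls /user/hduser/json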

And now, to demonstrate that you understand what's going on in this overcomplicated example: why did I use the variable for the filemask and the new name? If I had e.g. 1000 source files, what would happen if I put "*.*" and "" respectively?

...

OK, so when I first published this (via another channel), I got some requests for the simple scenario. So if you want the hardcoded, 1-step source-files-to-HDFS job, here it is:
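Written out, that single hardcoded tHDFSPut looks roughly like this; a sketch with example values (the connection settings move into the component itself, since there is no tHDFSConnection anymore):

    NameNode URI    : "hdfs://<hadoopserver>:<filesystemPortnr>"
    User name       : "hduser"
    Local directory : "C:/path/to/files"
    HDFS directory  : "/user/hduser/json"
    Overwrite file  : checked
    Filemask        : "*.*"
    New name        : ""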



I hope this saves some people some time. Let me know if basic stuff like this is useful; if so, I'll definitely post more.






13 comments:

  1. Hi! Very interesting post! I'm trying to do the same but for Hadoop version 2.2.0 and I'm having some connection problems. Do you know why that could be?

    1. Thanks Jon. Can you describe your problem in a bit more detail? Is Hadoop on a different machine? What have you tried, and what errors/problems are you getting?

  2. How can I change the default superuser to a normal user?

    1. Do you mean the HDFS user? You can link your user management to HDFS in various ways. Whatever is assigned in the HDFS directory structure (use the hadoop fs -ls command to verify) can be used from Talend. More info on HDFS user management: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html

  3. Thanks for always posting reliable information and well-researched topics on the Hadoop subject, which otherwise can only be learned at regular or Hadoop online training centers.

  4. Hi Carlo, this is great. I have done the same thing but I am getting an error: "org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4".
    I am using Hadoop 2.7.
    Please can you help me...

    1. Most likely you just have to replace the hadoop .jar files in the Talend libraries folder (do a search if you don't know where they are) with the versions corresponding to your Hadoop cluster. Restart TOS-DI (or whatever Talend product you're using), and you should be good.

  5. Hi Carlo, I am facing an issue with a Hive connection. When I connect through the repository I can see that the connection is successful. But when I write a query to fetch the schema through Hive, I get "db connection failed". Please can you help me.

  6. Moreover, I am using Hortonworks 2.3.4, but in the cluster configurations I can only see Hortonworks 2.3.0. Will there be any compatibility issues? If so, how do I add the Hadoop properties for 2.3.4 and make it run successfully?
    Please help me, Carlo

    1. Do you use Ambari to configure your HDP? (Is it a prebuilt image, or did you set up your own cluster?)
      Also, how do you 'see that the connection is successful'? Please give more details of what you're doing (is it from Talend DI, or (locally or remotely) on a Hadoop cluster (via Ambari, shell, ...)?)

  7. I am using Talend Open Studio. I set up the cluster configurations manually. I am checking the connection in the cluster component itself. I am trying to move data from one cluster to another.

  8. Nice and informative post. Thanks for sharing this information; it is really useful and helpful too.
    Big Data Training in hyderabad
