Friday, 14 December 2012

Using Talend Big Data to put files on Hadoop HDFS

As I struggled too much to perform this simple task (couldn't locate any help), I thought I'd write up my step-by-step instructions. 

Target audience: You have a working Hadoop cluster with HDFS set up, and you have Talend Big Data (or another Talend flavour) installed (and know how to use it).  You should also already be able to make a correct connection between Talend and HDFS (in my example, Hadoop runs on a different server).

[For ultra-short version, see bottom]

In the picture you can see the Job, just 3 components (I could have done it with 1, but I don't think the extra ones overcomplicate things).

I've made a separate component to connect to hadoop.  Not necessary, but just handy :)

The settings:

  • Distribution: make sure this is correctly configured (also make sure that the necessary .jar files are consistent between your Hadoop version and the ones Talend ships; you can find a lot of help on that topic)
  • Namenode URI: Don't forget the quotes (you always have to in Talend), and then it's "hdfs://<hadoopserver>:<filesystemPortnr>"
    • instead of the server name (e.g. localhost) you can use the IP address as well
    • By default the port number is 9000; if you're unsure, you can check it in the <HADOOP_HOME>/conf/core-site.xml file, in the value of fs.default.name
  • User name: a user that has proper rights on the HDFS system (I installed Hadoop on an Ubuntu server, and the install created the hduser account and configured it as superuser for HDFS automatically)
  • That's it, no password needed (why? I don't know; HDFS doesn't seem to ask for one by default)
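If you'd rather check the namenode URI programmatically than open the file by hand, here is a small sketch (plain Python, nothing Talend-specific; the sample XML below is an illustration, adjust to your own core-site.xml):

```python
# Sketch: extract the namenode URI (fs.default.name) from a core-site.xml
# document. The sample content below is an assumption for illustration.
import xml.etree.ElementTree as ET

def namenode_uri(core_site_xml: str) -> str:
    """Return the value of the fs.default.name property."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "fs.default.name":
            return prop.findtext("value")
    raise KeyError("fs.default.name not found")

sample = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>"""

# This (quoted) is the string that goes into the Namenode URI field.
print(namenode_uri(sample))
```

In a real setup you would read the file from `<HADOOP_HOME>/conf/core-site.xml` on the Hadoop server instead of a string.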

Next component:

I suppose this is self-explanatory: I'm collecting JSON files from a directory.

Next component:

And the configuration (and here is where I struggled for quite some time):

  • "Use an existing connection": as I made the first component, I just refer to that here.
  • Local Directory: directory where the source files are located
    • In my case I use the variable passed by the Filelist component
    • In a simple case (delete the FileList component) you can hardcode it: "C:\path\to\files"
  • HDFS directory: the target directory (so what you have configured on your hadoop server as HDFS filesystem)
    • if you're not sure: on Ubuntu you can check it from the shell (e.g. with hadoop fs -ls /)
    • or via the web interface (from your Talend system):
      • by default, address is 'hostname:50070' (so e.g. localhost:50070)
      • also note that the port number mentioned in the title (in my case 54310) is the one you need to use in the tHDFSConnection component (see above)
      • Just click "browse the filesystem" to see what exists, and what the permissions are
  • Overwrite file: self-explanatory, I suppose
  • Files
    • Filemask: This is the source filemask.  In the simpler scenario (no FileList component), you can use "*.*" (i.e. take all files from the source directory)
    • New name: this is the filename that should be given on the target system.  I used the variable passed from the FileList (the current file name).  In the simpler scenario you can use "" (two double quotes with nothing in between) to keep the original filenames
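To get a feel for what a filemask selects, here is a toy model (plain Python with fnmatch, not Talend's actual matcher, which may differ in edge cases):

```python
# Toy model of the Filemask setting: which files in the local
# directory would be picked up by a given pattern?
import fnmatch

files = ["orders.json", "customers.json", "readme.txt", "notes"]

json_only = fnmatch.filter(files, "*.json")  # only the JSON files
anything  = fnmatch.filter(files, "*.*")     # names containing a dot

print(json_only)
print(anything)
```

Note that with this matcher "*.*" only catches names containing a dot ("notes" is skipped); whether Talend treats extension-less files the same way is something to verify on your own setup.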

That's it, after running it, I can see:

My 2 input files are on hadoop HDFS.

And now, to demonstrate you understand what's going on in this overcomplex example: why did I use the variable for the filemask and new name? If I had, say, 1000 source files, what would happen if I put "*.*" and "" respectively?
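If you want to check your answer, here is a toy model (plain Python, nothing Talend-specific) of the iteration logic, under the assumption that the FileList component fires one iteration per matched file:

```python
# Toy model: the FileList component iterates once per matched file,
# and the put component runs once per iteration.
files = [f"file_{i}.json" for i in range(1000)]

# Variable filemask/new name: each iteration uploads exactly the current file.
uploads_with_variable = sum(1 for _ in files)

# Hardcoded "*.*" inside the loop: every iteration re-uploads ALL files
# (with new name "" they overwrite each other, so no data is lost,
# but you do the transfer work N times over).
uploads_with_wildcard = sum(len(files) for _ in files)

print(uploads_with_variable)   # 1000 transfers
print(uploads_with_wildcard)   # 1000000 transfers
```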


OK, so when I first published this (via another channel), I got some requests for the simple scenario, so if you want the hardcoded one-step version (source files straight to HDFS), here it is:
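For reference, outside Talend this one-step copy is roughly what a single hadoop CLI call does. The sketch below only composes the command (running it needs the hadoop CLI and a reachable cluster); the paths are assumptions, substitute your own:

```python
# Sketch: the shell equivalent of the one-step job, composed but not
# executed here. Both paths below are placeholders for your own setup.
local_files = "C:/path/to/files/*.json"      # assumption: source directory
hdfs_target = "/user/hduser/app/hadoop/tmp"  # assumption: HDFS target dir

cmd = ["hadoop", "fs", "-put", local_files, hdfs_target]
print(" ".join(cmd))
```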

I hope this saves some people some time. Let me know if basic stuff like this is useful; then I'll definitely post more.

18 comments:

  1. Hi! Very interesting post! I'm trying to do the same but for Hadoop version 2.2.0 and I'm having some connection problems. Do you know why that could be?

    1. Thanks Jon. Can you describe your problem in a bit more detail? Is Hadoop on a different machine? What have you tried, and what errors/problems are you getting?

  2. How do I change the default superuser to a normal user?

    1. Do you mean the HDFS user? You can link your user management to HDFS in various ways. Whatever is assigned in the HDFS directory structure (use the hadoop fs -ls command to verify) can be used from Talend. More info on HDFS user management: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html


  4. Hi Carlo, this is great. I have done the same thing but I am getting an error: "org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4"
    I am using Hadoop 2.7.
    Please can you help me....

    1. Most likely you just have to replace the hadoop.jar in the Talend libraries folder (do a search if you don't know where) with the version corresponding to your Hadoop cluster. Restart TOS-DI (or whatever Talend product you're using), and you should be good.

  5. Hi Carlo, I am facing an issue with the Hive connection. When I connect through the repository I can see that the connection is successful. But when I write a query to fetch the schema through Hive I get "db connection failed". Please can you help me.

  6. Moreover, I am using Hortonworks 2.3.4 but in the cluster configuration I can see Hortonworks 2.3.0. Will there be any compatibility issues? If so, how do I add the Hadoop properties for 2.3.4 and make it run successfully?
    Please help me Carlo

    1. Do you use Ambari to configure your HDP? (Is it a prebuilt image, or did you set up your own cluster?)
      Also, how do you 'see that the connection is successful'? Please give more details of what you're doing (is it from Talend DI, or (locally or remote) on a Hadoop cluster (via Ambari, shell, ...)).

  7. I am using Talend Open Studio. I set up the cluster configuration manually. I am checking the connection in the cluster component itself. I am trying to move data from one cluster to another.


  9. Hi,
    I am using the tHDFSPut component to move files from a local directory to an HDFS path. The component moves the files when there are files; it throws errors when there are no files in the local directory, but the job still terminates with return code 0 despite the errors in the component.
    I want my job to fail if there are no files in the source directory. I'd appreciate any suggestions on this.



