Target audience: you have a working Hadoop cluster with HDFS set up, and you have Talend Big Data (or another Talend flavour) installed and know how to use it. You should also already be able to make a working connection between Talend and HDFS (in my example, Hadoop runs on a different server).
[For ultra-short version, see bottom]
In the picture you can see the Job: just 3 components (I could have done it with 1, but I don't think the other two overcomplicate things).
I've made a separate component to connect to Hadoop. Not necessary, but just handy :)
- Distribution: make sure this is configured correctly (and make sure the necessary .jar files match between your Talend and Hadoop versions; there is plenty of help available on that topic)
- NameNode URI: don't forget the quotes (you always have to in Talend); the format is "hdfs://<hadoopserver>:<filesystemPortnr>"
- instead of the server name (e.g. localhost) you can also use the IP address (e.g. 127.0.0.1)
- by default the port number is 9000; if you're unsure you can check it in the <HADOOP_HOME>/conf/core-site.xml file, in the value of fs.default.name
- User name: a user that has the proper rights on HDFS (I installed Hadoop on an Ubuntu server, and the install created the hduser account and configured it as HDFS superuser automatically)
- that's it, no password needed (HDFS doesn't enforce authentication by default: unless you enable Kerberos security, it simply trusts whatever user name you give it)
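For reference, the relevant property in core-site.xml looks something like this (the host name and port below are just examples; use whatever your cluster is configured with):

```xml
<!-- <HADOOP_HOME>/conf/core-site.xml (example values) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoopserver:54310</value>
  </property>
</configuration>
```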
I suppose this is self-explanatory: I'm collecting the JSON files from a directory
And the configuration (and here is where I struggled for quite some time):
- "Use an existing connection": as I made the first component, I just refer to that here.
- Local Directory: the directory where the source files are located
- In my case I use the variable passed in by the FileList component
- In a simple case (delete the FileList component) you can hardcode it: "C:/path/to/files" (or with escaped backslashes, "C:\\path\\to\\files", since Talend fields are Java strings)
- HDFS directory: the target directory (a path inside the HDFS filesystem you configured on your Hadoop server)
- if you're not sure, on Ubuntu you can check it like this: hadoop fs -ls /
- or via the web interface (from your Talend system):
- by default the address is 'hostname:50070' (so e.g. localhost:50070)
- also note that the port number mentioned in the title (in my case 54310) is the one you need in the HDFSConnection component (see above)
- just click "Browse the filesystem" to see what exists, and what the permissions are
- Overwrite file: self-explanatory, I suppose
- Filemask: this is the source file mask. In the simpler scenario (no FileList component) you can use "*.*" (i.e. take all files from the source directory)
- New name: this is the file name that should be given on the target system. I used the variable passed from the FileList component (the current file name). In the simpler scenario you can use "" (two double quotes with nothing in between) to keep the original file names
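If you want to double-check what the job will do, the same logic can be sketched as plain HDFS shell commands. This is a dry run: it only echoes the hadoop fs -put command it would issue for each file (the target directory /user/hduser/json is a made-up example, and -f, which matches the Overwrite option, needs a reasonably recent Hadoop):

```shell
# Dry-run sketch of what the job does: one upload per source file,
# keeping the original file name (the value the FileList variable carries).
SRC=$(mktemp -d)                    # stand-in for the "Local Directory"
touch "$SRC/a.json" "$SRC/b.json"   # two pretend source files
DST=/user/hduser/json               # hypothetical "HDFS directory"

for f in "$SRC"/*.json; do
  name=$(basename "$f")             # same idea as the FileList variable
  # drop the 'echo' to actually upload (needs a running cluster)
  echo hadoop fs -put -f "$f" "$DST/$name"
done
```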
That's it, after running it, I can see:
My 2 input files are on hadoop HDFS.
And now, to demonstrate that you understand what's going on in this overcomplicated example: why did I use the variables for the filemask and the new name? If I had, say, 1000 source files, what would happen if I used "*.*" and "" respectively?
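If you want to check your answer: the FileList component triggers one iteration per matching file, and a filemask of "*.*" makes the put component copy every file on every iteration. So, as far as I understand the components, 1000 source files would still end up on HDFS with the right names (thanks to ""), but at the cost of 1000 x 1000 put operations, each file overwritten 1000 times. A tiny sketch of that blow-up:

```shell
# Count the puts: one FileList iteration per source file, and with
# filemask "*.*" every iteration copies ALL n files again.
n=1000
total=0
i=0
while [ "$i" -lt "$n" ]; do   # one iteration per file found by FileList
  total=$((total + n))        # "*.*" matches all n files each time
  i=$((i + 1))
done
echo "$total"                 # prints 1000000 (instead of just 1000 puts)
```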
Ok, so when I first published this (via another channel) I got some requests for the simple scenario, so if you want the hardcoded one-step version (source files straight to HDFS), here it is:
I hope this saves some people some time. Let me know if basic stuff like this is useful, and I'll definitely post more.