Friday, 22 May 2015

Basic Twitter input via Talend and MongoDB

I'm not sure I'll have time to write up the full project I've been working on, but I wanted to share a simple first step.  It uses some fun technologies and a use case that everyone can relate to (or has access to themselves).
This post describes 1) how to create the proper authentication items (OAuth) to connect to Twitter, 2) how to use Talend to process the tweets and 3) how to store them in their native (lossless) form using MongoDB.


I'll assume here that you're able to do the proper installs and configuration (there's plenty of help on the web).
For this example, you'll also have to add the Twitter connection plugin to Talend.
1) Starting MongoDB server

Create a shortcut (cf. image) and, in the Target field, extend the executable with command line parameters (--dbpath is where you want the DB to be):
D:\apps\pro\MongoDB\Server\3.0\bin\mongod.exe --dbpath "I:\data\MongoDB"
Then just double-click it, and a command prompt should open, starting up the MongoDB server.  Just in case it's not clear: leave this window open for as long as you need the MongoDB server to be accessible (if it should always be on, it's better to configure it as a service).

2) Checking Talend

When you click 'new job' after firing up Talend, the Palette should show on the right-hand side.  After following the instructions from the Install Twitter Component link above, you should see the components there:

3) Create twitter token

Find the 'app manager' (e.g. at the bottom of the page)
Click 'create new app':

If you've done that correctly, you are taken to the details page of your new 'app'. You need to create a set of tokens for the Talend Twitter components to be able to access tweets:
  • go to 'Keys and Access Tokens'
  • at the bottom, click 'Create my access token'
Now there should be an additional section "Your access token", where you find a number of fields. The application settings together with the access token section provide the needed data for your twitter component.

4) Create Jobs in Talend

4.1) Dump tweets to file (JSON format)

You'll need 4 components, 3 of which come from the twitter part of the palette (the newly installed components):
The flow is simply as follows:
  • create a connection to twitter
  • if that's successful
    • get the (raw json) input
    • dump each row to a file
  • close the connection to twitter


Copy the corresponding items from your twitter app (application settings and access token).  Also make sure that the connection type is the Twitter API.  It's also possible to open up a stream (Twitter stream), but we want a process that runs periodically (I'm not planning to deploy this on a computer that's always on).  I just have to take into account that I might have overlaps in tweets when I run this job, so my export to MongoDB should cater for that (cf later).


First of all you have to make sure to select the connection (tTwitterOAuth_1 in my case), but it should be selected by default.  As you can see I've made a complete column mapping, to show what it can look like, but I'm actually NOT USING it in this example.  That is because this component can send the mapped rows, but also has an additional option to send the 'raw json' to the next component.  I'm not quite sure whether the component maps all the possible fields, or whether the mapping is done correctly.  Furthermore, I've specifically selected a schemaless document store that is good at storing JSON docs, so I don't have to deal with mapping at an early stage.

If you're working with e.g. a relational DB, then you probably want to use the column mapping (note that the 'operation' possibilities are the hard-wired mappings to the json fields, so you cannot modify these, only use what's been provided).
(raw JSON doesn't use the column mapping, Structured does)

I don't want to have just any tweet, but the tweets where my company's twitter account (AE, @AE_NV) appears.

As I'm using the '@' symbol, and want to extract some more useful information by utilizing the reserved prefix characters (# & @), I'm checking the appropriate box in the 'Advanced Settings':


Not much to change here.  Just give it a filename.  As I'm using a static name to keep this example simple, I've checked 'append' so that the data isn't overwritten every time I run the job.  Of course this introduces a lot of redundancy, but optimizing that flow is not the topic of this post.
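
Because the file is opened in append mode, repeated runs write the same tweets more than once. Here is a minimal sketch of how the file could be cleaned up afterwards, outside Talend. It assumes one raw tweet per line as a JSON object and that each tweet carries the 'id_str' field from the Twitter REST API; the function name is made up for illustration.

```python
import json


def dedupe_tweet_lines(lines):
    """Return the input lines with repeated tweets removed (first copy wins).

    Assumes each non-empty line holds one tweet as a JSON object and that
    the tweet ID is stored in the 'id_str' field.
    """
    seen = set()
    unique = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet_id = json.loads(line)["id_str"]
        if tweet_id not in seen:
            seen.add(tweet_id)
            unique.append(line)
    return unique
```

You could feed it an open file object and write the returned lines to a fresh file before loading them into the database.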

Once you've linked the Twitter input with the file output (using JSON raw !!), click 'sync columns' on the tFileOutputDelimited_1 component.


Save the job, and then run it.  If all goes well, you'll see something like this (I limited my search further so I wouldn't get too many results):
The file that was written should contain valid JSON:
In case you're not familiar with JSON, it's important here that each separate tweet is enclosed in an opening and closing curly bracket (if the whole file starts and ends with square brackets, that's not what we want).
Another thing to check (I've highlighted an example of a "key":value pair) is that the key is always in quotes; depending on the type of data, the value can be quoted but doesn't have to be (e.g. false without quotes).
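
If you'd rather not check this by eye, the same test can be sketched in a few lines of Python. This assumes the output file has one tweet per line and no header row; the function name is hypothetical.

```python
import json


def check_tweet_file(lines):
    """Check that every non-empty line is a standalone JSON object.

    Returns the number of valid tweet lines; raises ValueError on the first
    line that is not a JSON object (for example a top-level array).
    """
    count = 0
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are ignored
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number}: not valid JSON ({exc.msg})") from exc
        if not isinstance(value, dict):
            raise ValueError(f"line {number}: expected an object in curly brackets")
        count += 1
    return count
```

Passing it the open output file should return the number of tweets that the job reported.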

I'm not going into details, but the hardest part was figuring out how the raw jsonString could be used to dump correct JSON.  Only this combination worked for me: selecting 'raw JSON' as the row connector and using the delimited output file type (so not tFileOutputJSON or tFileOutputRaw).  This combination works well for the next step.

4.2) Load JSON file into MongoDB

Don't worry, this is a piece of cake :)
Create a new job, but just have a single component in it:
The configuration:

  • MongoDB Directory is the path to the files for your mongoDB (what was specified in point 1) after the --dbpath parameter)
  • Server, Port, DB and Collection are self-explanatory, I assume
  • I've checked 'drop collection if exist', but that's just for testing purposes, obviously
  • Data File : what we specified in the previous job as file to save to
  • file type: JSON
  • Action on data: Insert 
    • Yes, you're right :)  It should be the other option, 'upsert', but then I would have to map the tweet's unique ID to the document key in the store, and define it here as the identifier used to decide between update and insert.
    • In this example (with a drop collection), I'm rebuilding every time.  If you don't drop the collection and there are duplicate tweets, you will get an error (if you've mapped the tweet ID to the MongoDB _id)
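
For comparison, this is roughly what the 'upsert' variant amounts to, sketched outside Talend in Python. It assumes the tweet ID ('id_str') is copied into MongoDB's _id, and that 'collection' behaves like a PyMongo collection (replace_one with upsert=True is PyMongo's call); the function name is made up for illustration.

```python
def upsert_tweets(collection, tweets):
    """Insert new tweets and overwrite ones already stored, keyed on the tweet ID.

    'collection' is assumed to expose PyMongo's
    replace_one(filter, replacement, upsert=...).
    """
    for tweet in tweets:
        doc = dict(tweet)             # shallow copy so the caller's dict is untouched
        doc["_id"] = tweet["id_str"]  # assumption: tweet ID doubles as the document key
        collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```

With PyMongo installed and the server from step 1 running, the collection could come from something like pymongo.MongoClient()["twitter"]["tweetRaw"], where the database name 'twitter' is just a placeholder.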
That's it.

Run the job, and then open MongoVue (or the command line client) to check that the correct number of items has been inserted (in my case the job output stated that 10 tweets had been processed, so that's the number to expect).

To check, clicking on the [+] should not show just a single field (with e.g. 'jsonString' as the fieldname), but the rich tree structure that is a tweet :)

It's not the purpose here to go into the use of MongoDB, nor the richness of the information that can be extracted from the full embedded tree structure, but I'll give a simple example.  In the screenshot above you can see that a tweet has a field called 'retweeted_status'.  That in itself is an object which has a number of fields.  One of those fields is 'favorite_count' (how many times a retweet has been marked as favorite).  Let's say I want to see which of our tweets were picked up by others and then favorited (so not directly favorited, which is represented by the 'favorite_count' field in the root).  Using MongoVue, you get something like this:
Using the command line client, you would give the following command:
db.tweetRaw.find({ "retweeted_status.favorite_count" : { "$gt" : 0 } }, { "retweeted_status.favorite_count" : true });
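
To make clear what the dotted path in that query selects, here is a small Python sketch that applies the same filter to tweets already loaded as dictionaries. Both function names are hypothetical, and the sketch only follows nested objects (MongoDB's dotted paths can also step into arrays, which is ignored here).

```python
def get_path(doc, dotted_path):
    """Follow a dotted path such as 'retweeted_status.favorite_count'.

    Returns None when any step along the path is missing.
    """
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def favorited_retweets(tweets):
    """Keep tweets whose embedded retweeted_status has favorite_count > 0."""
    result = []
    for tweet in tweets:
        count = get_path(tweet, "retweeted_status.favorite_count")
        if isinstance(count, int) and count > 0:
            result.append(tweet)
    return result
```

A tweet that only has 'favorite_count' in the root is skipped, which is exactly the distinction made above.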

This is just a basic first step.  I'll try to write some follow-ups on how this can turn into a more complex system (using Hadoop, Neo4j, ...) to really make it a data scientist exploration story.
