Friday, May 22, 2015

Basic Twitter input via Talend and MongoDB

I'm not sure if I'll have time to write up the full project I've been doing, but I just wanted to share a simple first step.  It takes some fun technologies and a use case that everyone can relate to (or has access to themselves).
This blog describes how to 1) create the proper authentication items (OAuth) to connect to Twitter, 2) use Talend to process the tweets and 3) store them in a native (lossless) form using MongoDB.


I'll assume here that you're able to do the proper installs and configuration (there's plenty of help on the interweb).
For this example, you'll have to add the Twitter connection plugin:
1) Starting MongoDB server

Create a shortcut (cf image), and in the Target field, amend the executable with command line parameters (where you want the DB to be):
D:\apps\pro\MongoDB\Server\3.0\bin\mongod.exe --dbpath "I:\data\MongoDB"
Then just double-click, and a command prompt should open, starting up the MongoDB server.  Just in case it's not clear: you want to leave this window open for as long as you need the MongoDB server to be accessible (if it should be always on, it's better to configure it as a service).

2) Checking Talend

When you click 'new job' after firing up Talend, it should show the Palette on the right-hand side.  After following the instructions from the Install Twitter Component link above, you should see the components there:

3) Create twitter token

Find the 'app manager' (e.g. at the bottom of the page).
Click 'Create New App':

If you've done that correctly, you are taken to the details page of your new 'app'. You need to create a set of tokens for the Talend Twitter components to be able to access tweets:
  • go to 'Keys and Access Tokens'
  • at the bottom, click 'Create my access token'
Now there should be an additional section "Your access token", where you find a number of fields. The application settings together with the access token section provide the needed data for your twitter component.

4) Create Jobs in Talend

4.1) Dump tweets to file (JSON format)

You'll need 4 components, 3 of which come from the twitter part of the palette (the newly installed components):
The flow is simply as follows:
  • create a connection to twitter
  • if that's successful
    • get the (raw json) input
    • dump each row to a file
  • close the connection to twitter


Copy the corresponding items from your twitter app (application settings and access token).  Also make sure that the connection type is the Twitter API.  It's also possible to open up a stream (Twitter stream), but we want a process that runs periodically (I'm not planning to deploy this on a computer that's always on).  I just have to take into account that I might have overlaps in tweets when I run this job, so my export to MongoDB should cater for that (cf later).


First of all you have to make sure to select the connection (tTwitterOAuth_1 in my case), but it should be selected by default.  As you can see, I've made a complete column mapping to show what it can look like, but I'm actually NOT USING it in this example.  That is because this component can send the mapped rows, but also has an additional option to send the 'raw JSON' to the next component.  I'm not quite sure whether the component maps all the possible fields, or whether it does so correctly.  Furthermore, I've specifically selected a schemaless document store that is good at storing JSON docs, so I don't have to deal with mapping at an early stage.

If you're working with e.g. a relational DB, then you probably want to use the column mapping (note that the 'operation' possibilities are the hard-wired mappings to the json fields, so you cannot modify these, only use what's been provided).
(raw JSON doesn't use the column mapping, Structured does)

I don't want to have just any tweet, but the tweets where my company's twitter account (AE, @AE_NV) appears.

As I'm using the '@' symbol, and want to extract some more useful information by utilizing the reserved prefix characters (# & @), I'm checking the appropriate box in the 'Advanced Settings':


Not much to change.  Just give it a filename (and as I'm using a static name to make this example simpler, I've checked 'append' so that it doesn't overwrite data every time I run the job.  Of course this introduces a lot of redundancy, but optimizing that flow is not the topic of this blog).

Once you've linked the Twitter input with the file output (using raw JSON!), click 'sync columns' on the tFileOutputDelimited_1 component.


Save the job, and then run it.  If all goes well, you'll see something like this (I limited my search further so I wouldn't get too many results):
The file that has been written should contain valid JSON:
In case you're not familiar with JSON: what matters here is that each separate tweet is encapsulated in its own opening and closing curly brackets (if the whole file starts and ends with square brackets, that's not what we wanted).
Another thing to check (I've highlighted an example of a "key": value pair) is that the key is always in quotes, while the value may or may not be, depending on the type of data (e.g. false without quotes).
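As a quick sanity check beyond eyeballing, a few lines of Python can confirm that every line of the dump is its own JSON object (the file name here is just an example; this script is not part of the Talend job):

```python
import json

def check_line_delimited_json(path):
    """Verify that every non-empty line of the file parses as its own
    JSON object (one tweet per line, no enclosing square brackets)."""
    with open(path, encoding="utf-8") as f:
        for nr, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            doc = json.loads(line)          # raises ValueError on invalid JSON
            if not isinstance(doc, dict):   # a list means the square-bracket case
                raise ValueError("line %d is not a JSON object" % nr)
    return True
```

Run it against the output file of the job (e.g. `check_line_delimited_json("tweets.txt")`); it raises on the first bad line, so a silent return means the dump is in the shape we want.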

I'm not going into details, but the hardest part was figuring out how the raw jsonString could be used to dump correct JSON (only this combination worked for me: selecting 'raw JSON' as the row connector, and using the delimited output file type, i.e. not tOutputFileJSON or tOutputFileRAW).  This combination works well for the next step.

4.2) load JSON file into MongoDB

Don't worry, this is a piece of cake :)
Create a new job, but just have a single component in it:
The configuration:

  • MongoDB Directory is the path to the files for your mongoDB (what was specified in point 1) after the --dbpath parameter)
  • Server, Port, DB and Collection are self-explanatory, I assume
  • I've checked 'drop collection if exist', but that's just for testing purposes, obviously
  • Data File : what we specified in the previous job as file to save to
  • file type: JSON
  • Action on data: Insert 
    • Yes, you're right :)  It should be the other option, 'upsert', but then I would have to map the tweet's unique key to a field in the document store, and define it here as the unique identifier used to decide between update and insert.
    • In this example (with a drop collection), I'm rebuilding every time.  If you're not dropping the collection and have duplicate tweets, you will get an error (if you've mapped the tweet ID to be the MongoDB _id)
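To make the insert-vs-upsert distinction concrete, here's a tiny Python sketch of upsert semantics (a plain dict stands in for the collection; using 'id_str' as the tweet's unique key is my assumption, not something the Talend component configures for you):

```python
def upsert(collection, doc, key="id_str"):
    """Insert doc, or replace the existing doc that has the same key value
    (mimicking MongoDB upsert behaviour; collection is a plain dict here)."""
    collection[doc[key]] = doc

# Two overlapping "job runs": the duplicate tweet is absorbed, not duplicated.
col = {}
for tweet in [{"id_str": "1", "text": "a"}, {"id_str": "2", "text": "b"}]:
    upsert(col, tweet)
for tweet in [{"id_str": "2", "text": "b"}, {"id_str": "3", "text": "c"}]:
    upsert(col, tweet)
```

With plain insert, the second run would try to add tweet "2" again and fail on the duplicate key; with upsert, re-running a job over overlapping tweet sets is harmless.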
That's it.

Run the job, and then open your MongoVue (or command line client) to check whether the correct count of items has been inserted (in my case the job output stated that 10 tweets had been processed, so that's the correct number).

To check, clicking on the [+] should not show just a single field (with e.g. 'jsonString' as the fieldname), but the rich tree structure that is a tweet :)

It's not the purpose here to go into the use of MongoDB, nor the richness of the information that can be extracted from the fully embedded tree structure, but I'll give a simple example.  In the screenshot above you can see that a tweet has a field called 'retweeted_status'.  That in itself is an object with a number of fields.  One of those fields is 'favorite_count' (how many times a retweet has been marked as favorite).  Let's say I want to see which of our tweets were picked up by others and then favorited (so not directly favorited, which is represented by the 'favorite_count' field in the root).  Using MongoVue, you get something like this:
Using the command line client, you would give following command:
db.tweetRaw.find({ "retweeted_status.favorite_count" : { "$gt" : 0 } }, { "retweeted_status.favorite_count" : true });
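For comparison, the same filter expressed in plain Python over a list of tweet dicts (a hypothetical helper, just to show why the dotted 'retweeted_status.favorite_count' path works: it's simply a nested object):

```python
def retweet_favorites(tweets):
    """Keep tweets whose nested retweeted_status.favorite_count is > 0,
    mirroring the mongo find() filter, and return (id, count) pairs."""
    out = []
    for t in tweets:
        rs = t.get("retweeted_status") or {}   # absent for non-retweeted tweets
        if rs.get("favorite_count", 0) > 0:
            out.append((t.get("id_str"), rs["favorite_count"]))
    return out
```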

This is just a basic first step (I'll try to write some follow-ups on how this can grow into a more complex system, using Hadoop, Neo4J, ... to really make it a data scientist exploration story).

Saturday, January 24, 2015

Arduino project: Drummer's metronome

My little brother and I are both drummers, and a couple of weeks ago he told me about this problem he had.  He just needed a simple metronome that had 3 or 4 programmable BPM settings.  He wanted it to just have a blinking LED, and the whole thing should come in the shape of a foot pedal (think of a guitar effect pedal box with 2 switches: "on/off" and "next setting").

My brother makes these cases himself, but for the inside he asked me if it would be difficult to do in an Arduino project.  Hell no, I replied :)

In fact, it was really easy to do, and only took me a couple of hours (design, drawing it, putting it together, programming, testing/debugging, improving, ...).  As I didn't really find any references that took the same approach as I did (with a short loop to get better responsiveness on the button press), I thought I'd share this with the world.

The Design:

Not much special to say about that.  RGB Led, two push buttons and a couple of resistors, that's it.

The Code:

So, what's so different?  I only found approaches that set the interval the light needs to be on, then off, and put that entire time interval in one loop() call.  I found this unworkable (especially in a live environment like a concert), as it means the press of the button is only checked right before the next cycle.  So, if you have 60 BPM, there's a beat every second, and thus only 60 moments per minute at which it gets checked whether you pressed one of the buttons.  A bit simplified: the Uno runs at 16 MHz (megahertz = 1,000,000 'ticks' per second), so it has 16,000,000 ticks per second to work with.  Taking into account the normal length of a button press (especially one done with the foot on a pedal during a concert), this means that a significant portion of the time the button press is missed, because it fell completely within a single call to loop().
In short: responsive buttons means trying to get the loop() execution time as low as possible, which in turn means not using the delay() function.
Another small problem (in my tests the error wasn't really noticeable against an actual metronome) is that the code itself takes a small amount of time to execute, and the delay() calls don't take that into account.  So two delays (one for LED on, the other for LED off) of 250 ms each should give 2x250ms = 500ms = 0.5 sec per beat ==> 2 beats per second ==> 120 BPM.  However, say the execution of the rest of the code takes 10 ms; the time it takes to get to the next loop is then not 500 ms but 510 ms.  This introduces a small offset to your BPM setting (with no good way to correct it, as it's hard to determine exactly what the overhead is).

With these two arguments in mind, I've used a more robust manner, which starts off by taking the internal timer (ticking in microseconds) as a baseline, and then simply deciding, based on the difference between the current timestamp and the baseline, whether the light should be on or off (taking the BPM into account).  This is far more accurate, and more robust in terms of responsiveness and scalability: if you tweak the code to do more things - like storing or sending some values around, playing audio markers, reading potentiometers, ... - the metronome will keep ticking away as close to the real BPM as possible, even over very long periods of time (not the case in the previous approach).
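The core of that timing logic can be sketched in a few lines (Python here for readability; on the Arduino the same computation runs inside loop() with micros(), and the 50 ms flash length is my own assumption, not my brother's spec):

```python
def led_should_be_on(elapsed_us, bpm, flash_us=50_000):
    """Decide the LED state from elapsed time alone: the LED is lit during
    the first flash_us microseconds of every beat period.  No delay() calls,
    so the surrounding loop stays fast and no timing drift accumulates."""
    period_us = 60_000_000 // bpm   # microseconds per beat
    return (elapsed_us % period_us) < flash_us
```

In loop() you would call this every iteration with micros() minus the baseline and write the result to the LED pin; the buttons then get polled thousands of times per second instead of once per beat.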


I've put up a small video of this first prototype in action.

I'm well aware that it would be pretty easy to expand this to make it so it can handle any BPM, and you don't have to hardcode them, but that's not what my little brother wanted.

Planned improvements:

  1. Instead of using an Arduino Uno, I would like to go for something a lot smaller (and cheaper); I'll blog when that's done.
  2. For myself I would probably prefer a version with a potentiometer to set the BPM rate, and so also a small display to show what BPM we are at. So, one push button (the 'Next' one) replaced by a potentiometer and interaction with a screen.

Friday, December 20, 2013

3 easy steps to parametrized deploy environments using SSIS and Visual Studio

Just a little something I thought I'd share with you.  Recently someone asked me to have a look at a problem with parametrized connection manager variables that just wouldn't work.  I'll spare you the details.  Suffice it to say that they overcomplicated things and, through redundancy, couldn't get things to work.

So, if you're looking for an easy, yet pretty solid way to go about deploying your SSIS packages in a multi-environment setup, give this one a spin.
More info:
  • Business Intelligence
  • ETL flow
  • Microsoft Stack (Visual Studio, MS SQL Server, SQL Server Integration Services)
  • multi-environment

In the example I'll be using, the source repository happens to be a Sharepoint Server (Microsoft Office Sharepoint Server - MOSS --> that's why you'll see that acronym a lot)

1. Set up your connection managers in Visual Studio

I've created 3 connection managers; one for my source system (that's the one I'll be using in the example), one for my DWH staging area, and one for the actual DWH.  I've got a lot of SSIS packages (as can be seen below), but for the example I'll just pick one (E_MOSS_10).  Above you see the detail for that package's data flow task.  Simple source to target.

I suppose you're with me so far :)

In visual studio I don't do anything else too fancy (NO PARAMETERS at this stage), just fill in the credentials for the connection manager:
I then build and deploy my package onto my SQL Server, and then switch tooling: exit VS, enter MSSQL Server Management Studio.

Woohoo, successful deployment !
As you know (and if you don't, you really shouldn't be reading this blog, but preferably some more introductory topics), your Mgmt Studio lists the deployed SSIS packages separately.  So at the completion of step 1, we can drill down to the right package (10), right-click it, and execute.
If you have a quick look at the connection Managers tab, you'll see initial values (sometimes null), indicating that the values from Visual Studio (the non-parametrized settings for the connection managers) are used.

2. Simple configuration after deployment

If you want to switch environment (e.g. to a production environment after you've deployed), you can use the Configure option for the deployed ssis package:
Above, the end result; below, how you enter the value.  If you do this in configuration (not execute), these settings will remain the same until you do a new deployment.

After that, same execute as before, and voila, you've now successfully switched environment.  However, this is not good enough, because for each deployment you have to go through a manual process of entering all the correct values.  And let's say that I want to swap between several environments after deployment (on the fly).  What a hassle ... but wait, there's step 3 ...

3. Setting up Environments

Independently from your SSIS packages you can configure Environments in SQL Mgmt Studio.  This is really great, because in a formal production environment (I can't stress this enough) the production user and passwords should not be shared with the whole development team!  Here an environment can be prepared (preferably multiple ones: DEV, TEST, UAT, PROD) and the developer never needs to worry about connection strings and credentials.  But back to the example.

Below the SSIS projects (per deployment folder, in my case ssis01) there is an Environments folder.
Right-click and select "Create Environment ..."
Something like that, click ok and it should show up:
(hit refresh if it doesn't)
Double-click and in the Variables section, create the variables you need (in my example I've just done user and password, in a real situation you'll need the connection string as well)
Notice the 'Sensitive' checkbox?  Your development team will never see that production password (go ahead, try to uncheck it :) )

Last step is to connect the environment variables to the connection parameters.  But wait ... we never created any parameters, did we?  Well, yes and no.
For the connection managers, all the properties are converted to parameters by default (that's why we saw them in the connection managers tab when we did the configuration).  Note that this only applies to connection managers; for everything else you need to create and use parameters in Visual Studio before you can link them!

First we need to tell the SSIS project (bundle of packages) that we want to be able to use the environment:

In the configuration dialog go to the references section and click add (and then add your newly created environment).  
And then, finally, the mapping itself.  In the parameters section, click the [...] button to edit the value for a connection manager property (as we did before), but this time, instead of entering the value, select the 'use environment variable' option, and the created variables should show up in the dropdown.
That's it.  Mine looks like this:
If you create multiple environments, just make sure that the variable names you use are the same (so in my case 'MossUser' should exist with a different value in each environment).

When you execute a package, all you need to do now is select the right environment:

There you go.  Simple stuff, but great as a best practice in real BI product factories.

Let me know what you think ...

Friday, December 14, 2012

Using Talend Big Data to put files on Hadoop HDFS

As I struggled too much to perform this simple task (couldn't locate any help), I thought I'd write up my step-by-step instructions. 

Target audience: you have a working Hadoop cluster with HDFS set up, and you have Talend Big Data (or another Talend flavour) installed (and know how to use it).  You should also already be able to make a correct connection between Talend and HDFS (in my example, hadoop runs on a different server).

[For ultra-short version, see bottom]

In the picture you can see the Job: just 3 components (I could have done it with 1, but I don't think the others overcomplicate things).

I've made a separate component to connect to hadoop.  Not necessary, but just handy :)

The settings:

  • Distribution: make sure that's correctly configured (also make sure that the necessary .jar files are consistent between the two versions  - you can find a lot of help on that topic)
  • Namenode URI: Don't forget the quotes (always have to do that in Talend), and then it's "hdfs://<hadoopserver>:<filesystemPortnr>"
    • instead of the server name (e.g. localhost) you can use the IP address as well
    • By default the port nr is 9000, if you're unsure you can check it in the <HADOOP_HOME>/conf/core-site.xml file, value of fs.default.name
  • User name: a user that has proper rights on the HDFS system (I installed hadoop on an ubuntu server, and with the install the hduser was created and configured as superuser for HDFS automatically)
  • That's it, no password needed (why? I don't know, HDFS doesn't seem to ask for one by default)
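The port check from the core-site.xml bullet above can also be scripted; here's a small sketch (assuming the standard Hadoop configuration XML layout) that pulls out the fs.default.name value:

```python
import xml.etree.ElementTree as ET

def default_fs(core_site_xml):
    """Return the value of the fs.default.name property from the contents
    of a Hadoop core-site.xml file, or None if the property is absent."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "fs.default.name":
            return prop.findtext("value")
    return None
```

Feed it the contents of <HADOOP_HOME>/conf/core-site.xml and the returned URI (e.g. hdfs://host:9000) is exactly what the Namenode URI field expects.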

Next component:

I suppose this is self-explanatory: I'm collecting json files from a directory

Next component:

And the configuration (and here is where I struggled for quite some time):

  • "Use an existing connection": as I made the first component, I just refer to that here.
  • Local Directory: directory where the source files are located
    • In my case I use the variable passed by the Filelist component
    • In a simple case (delete the FileList component) you can hardcode: "C:\path\to\files"
  • HDFS directory: the target directory (so what you have configured on your hadoop server as HDFS filesystem)
    • If you're not sure, on Ubuntu you can check like this:
    • or, via the webinterface (from your Talend system):
      • by default, address is 'hostname:50070' (so e.g. localhost:50070)
      • also note that the port number mentioned in the title (in my case 54310) is the one you need to use in HDFSConnection component (see above)
      • Just click "browse the filesystem" to see what exists, and what the permissions are
  • Overwrite file: self-explanatory, I suppose
  • Files
    • Filemask: This is the source filemask.  In the simpler scenario (no filelist component), you can use "*.*" (i.e. take all files from source directory)
    • New name: this is the filename that should be given on the target system.  I used the variable passed from the filelist (the current file name).  In the simpler scenario you can use "" (2 double quotes with nothing in between) to use the original filenames

That's it, after running it, I can see:

My 2 input files are on hadoop HDFS.

And now, to demonstrate you understand what's going on in this overcomplex example: why did I use the variables for filemask and new name?  If I had e.g. 1000 source files, what would happen if I put "*.*" and "" respectively?


Ok, so when I first published this (via another channel), I got some requests for the simple scenario, so if you want the hardcoded one-step source-files-to-HDFS, here it is:

I hope this saves some people some time, let me know if basic stuff like this is useful, then I'll definitely post more.