Twitter Elections experiment implementation
Some challenges encountered while setting up the processing of the tweets:
- Dutch sentiment analyzers are not easy to come by, so we needed to translate the tweets prior to sentiment analysis.
- Limiting tweets to Belgium, funnily enough, isn’t as easy as we first expected. We defined a location and a range, but this often includes parts of the Netherlands and France (Belgium is not a disk, unfortunately). Our neighbors also tend to use the same tags for their parties (e.g. PS), so some tags are falsely detected as referring to Belgian parties.
- Tags are not tied uniquely to a party, so sometimes a tag is used for something completely apolitical.
- The free versions of the sentiment and translation services have low daily usage limits 🙂
- The documentation on how to implement MapReduce in Java specifically for Amazon is horribly lacking (or difficult to find; I didn’t see any good tutorials).
The implementation itself consisted of the following pieces:
- We wrote a simple Java application to read tweets (using the Twitter4j library).
- Microsoft Translator was used to translate from Dutch and French to English (using the Microsoft Translator Java API, which made accessing the service a real breeze).
- The Alchemy API did the sentiment analysis (its Java API makes the service super easy to use).
- Amazon’s AWS SDK for Java, together with the Eclipse plugin, allowed us to implement the MapReduce job.
- Testing the Mappers and Reducers was done using JUnit and Mockito. The former is a well-known long-time friend, the latter a recent one (thanks to the Hadoop book). I have worked with a lot of mocking libraries in the past, but I must say that Mockito has really charmed me with its ease of use and simplicity.
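The location problem mentioned among the challenges can be illustrated with a small sketch. It is not the code we ran: the class name, the choice of Brussels as center, the approximate city coordinates and the radius values are all assumptions made for illustration. It uses the haversine formula to show that any disk around Brussels large enough to cover the far corner of Belgium (Arlon) also swallows cities across the border, such as Lille.

```java
// Illustrative sketch only: names, coordinates and radii are assumptions.
class GeoFilterSketch {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Approximate coordinates of Brussels, used as the (assumed) search center.
    static final double CENTER_LAT = 50.8503;
    static final double CENTER_LON = 4.3517;

    // Great-circle distance between two points, using the haversine formula.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // True if the point lies inside the disk around the assumed center.
    static boolean withinRadius(double lat, double lon, double radiusKm) {
        return distanceKm(CENTER_LAT, CENTER_LON, lat, lon) <= radiusKm;
    }

    public static void main(String[] args) {
        double radiusKm = 170.0; // hypothetical radius, just large enough to reach Arlon
        System.out.println("Arlon (BE) inside: " + withinRadius(49.6833, 5.8167, radiusKm));
        System.out.println("Lille (FR) inside: " + withinRadius(50.6292, 3.0573, radiusKm));
        System.out.println("Maastricht (NL) inside: " + withinRadius(50.8514, 5.6910, radiusKm));
    }
}
```

Shrinking the radius to keep Lille out would also drop large parts of Belgium, which is why a location-plus-range filter alone cannot cleanly separate Belgian tweets from those of the neighbors.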
Some other random observations
- AWS does a tremendous job of simplifying the execution of MapReduce programs, but there’s still a lot Amazon could do to make it really pleasant for developers to create them.
- It is very difficult to find good documentation on how to test and deploy the full application locally. I often had to deploy the app to AWS just to test it, which wastes a lot of ‘AWS Instance Hours’. This is something I’ll need to investigate further.
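One way to reduce the number of AWS deployments is to keep the map and reduce logic in plain methods and exercise them in memory first. The sketch below is only that: a hand-rolled, single-process imitation of the map, group and reduce phases, not Hadoop's own local mode and not the code from this experiment. The input format (party tag, a tab, a sentiment score) and the choice of averaging the scores are assumptions made for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: the record format and the averaging are assumptions.
class LocalMapReduceSketch {

    // Map step: turn one hypothetical input line "tag<TAB>score" into a (tag, score) pair.
    static Map.Entry<String, Double> map(String line) {
        String[] parts = line.split("\t");
        return new AbstractMap.SimpleEntry<>(parts[0], Double.parseDouble(parts[1]));
    }

    // Reduce step: average all scores collected for one tag.
    static double reduce(List<Double> scores) {
        double sum = 0.0;
        for (double score : scores) {
            sum += score;
        }
        return sum / scores.size();
    }

    // Runs map, groups the pairs by key (the "shuffle"), then runs reduce per key.
    static Map<String, Double> run(List<String> lines) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Double> pair = map(line);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<Double>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample records, not data from the experiment.
        List<String> sample = Arrays.asList("PS\t0.5", "PS\t-0.5", "N-VA\t0.2");
        System.out.println(run(sample));
    }
}
```

A harness like this only checks the logic of the two steps. It says nothing about job configuration, input splitting or anything else that happens on the cluster, so a final run on AWS is still needed.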
In the following post I’ll be showing some data that I got from this experiment.