Your Big Data Might Be Small

Recently we’ve been working on a prototype of a textual analysis product. The requirements are fairly simple: take a corpus of user comments and find distinctive terms for each user. The purpose of the exercise was to validate the utility of this analysis to the client, as well as to explore what architectural choices would be appropriate in a full version of the product. We tried a couple of different approaches, and in this post I’m going to look at the pros and cons of each and at how the aims of the prototype sometimes conflicted.

The code for this post is available on GitHub so you can see in detail the differences between the two approaches. The data I’ve used for this example is the first two years of reddit comments (2007-2008) from the reddit comments corpus. I’ve had to remove certain proprietary elements from the code in order to open-source it - the biggest difference is that the code on GitHub does a simple count of terms for each user instead of using a more interesting algorithm to define what makes a term distinctive for a particular user.

First Try: Big Data

The obvious solution for this sort of problem is a Hadoop-based architecture such as Spark, which would allow us to scale. Unfortunately, due to the overheads of such an architecture and the size of the data we were prototyping with (tens of GB), this first solution proved problematic.

The data from the client came in two parts: a plain text file mapping user IDs to user names, and a MongoDB dump of the comments themselves.

As we were using Spark, the natural choice was to write the program in Scala, which turned out to be a lot of fun! Due to the functional nature of the language we had to learn some new techniques for transforming data, but it was still less painful than Java (sorry!).

The basic steps we took in the program were as follows (see the source for more detail):

  1. Create a map of user IDs to user names from the plain text file
  2. Create an RDD of comments and associated user IDs from MongoDB (using the MongoDB Hadoop library)
  3. Tokenise the comments so that for each user we have a number of tuples in the form (user_id, token)
  4. Use reduceByKey to get the per-user frequency of each token
  5. Select the ten most frequent terms for each user and write them to ElasticSearch (using the ElasticSearch Spark library)

Once this was all working, running the process against a dataset was fairly simple, and we encapsulated the steps in a fabfile for convenience. However, we discovered that running locally on a sample dataset of ~2GB was very slow: around 20 minutes on my laptop, which made for a painful development cycle - each time we tweaked the algorithm there was a long wait before we could inspect the results. We couldn’t reduce the size of the sample data too much, as this would have made validating the results impossible.

In an attempt to fix our development woes we tried to run the process on a cluster of machines - as Spark is designed to do. Unfortunately, this presented us with a whole new set of issues.

After struggling with this approach for a while we decided to take a step back and think about the problem again…

Next Step: Back to Basics

In my spare time I’d been experimenting with some of the cool new features of the C++11 standard. Things like initializer lists and easier iteration over STL containers have made modern C++ a much more viable candidate for quick prototyping than it was in the past.
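As a small illustration of the two features mentioned above, here’s a sketch of a word-count helper (the function name and data are invented for this example, not taken from the real code):

```cpp
#include <map>
#include <string>
#include <vector>

// Range-based for loops (C++11) let us iterate STL containers without
// spelling out iterator types.
std::map<std::string, int> countWords(const std::vector<std::string>& words) {
    std::map<std::string, int> counts;
    for (const auto& word : words) {
        ++counts[word];
    }
    return counts;
}
```

Initializer lists make calling it pleasantly terse: `countWords({"spark", "scala", "spark"})` builds the vector inline, with no temporary push_back boilerplate.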

Thinking about how much easier C++ had become, along with the fact that the data we were prototyping with fitted comfortably within the RAM of a typical machine these days, led me to try a simpler approach. The (slightly naive) C++ implementation I came up with gave us a five-fold increase in speed, which was a big advantage in productivity.

The process used in this version was as follows (again, see the source for more detail):

  1. Create a map of user IDs to user names from the plain text file
  2. Read in comments directly from the BSON data dump using libbson
  3. Tokenise the comments, hash each token and store a map of users and their respective token counts
  4. Use a priority_queue to store each user’s ten most frequent terms
  5. Output JSON in ElasticSearch’s bulk format for quick importing
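The heart of steps 3 and 4 can be sketched roughly as follows. This is a simplified stand-in for the real code: the whitespace tokenisation, the use of std::hash, and all the type and function names here are assumptions for the example.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using TokenCounts = std::unordered_map<std::size_t, int>;

// Step 3: split a comment on whitespace and bump the count of each hashed
// token for the given user. Hashing means we store a fixed-size integer per
// token instead of the token string itself.
void countTokens(int userId, const std::string& comment,
                 std::map<int, TokenCounts>& countsByUser) {
    std::istringstream stream(comment);
    std::hash<std::string> hasher;
    std::string token;
    while (stream >> token) {
        ++countsByUser[userId][hasher(token)];
    }
}

// Step 4: a size-bounded min-heap keeps only the n most frequent tokens.
// The least frequent of the current top n sits at the top, ready to be evicted.
std::vector<std::pair<int, std::size_t>> topTokens(const TokenCounts& counts,
                                                   std::size_t n) {
    using Entry = std::pair<int, std::size_t>;  // (count, token hash)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (const auto& kv : counts) {
        heap.push(Entry(kv.second, kv.first));
        if (heap.size() > n) heap.pop();  // drop the least frequent
    }
    std::vector<Entry> result;
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    return result;  // ascending by count
}
```

A real tokeniser would also lower-case and strip punctuation, and hash collisions mean two distinct tokens can occasionally share a count - a trade-off that seems acceptable for a rough prototype.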

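Step 5’s output can be sketched like this - two newline-delimited JSON lines per document, which is the shape ElasticSearch’s bulk endpoint expects. The index and type names here are invented for the example, and a real version would need proper JSON escaping:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Build one bulk-format entry for a user: an action line naming the target
// index/type, then the document itself. NOTE: no JSON escaping is done here -
// this assumes the terms contain no quotes or backslashes.
std::string bulkEntry(const std::string& userName,
                      const std::vector<std::string>& topTerms) {
    std::string doc = "{\"name\":\"" + userName + "\",\"top_terms\":[";
    for (std::size_t i = 0; i < topTerms.size(); ++i) {
        if (i > 0) doc += ",";
        doc += "\"" + topTerms[i] + "\"";
    }
    doc += "]}";
    return "{\"index\":{\"_index\":\"users\",\"_type\":\"user\"}}\n" + doc + "\n";
}
```

Concatenating these entries into a single file gives you something that can be POSTed straight to the bulk endpoint for a fast import.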
You’ll probably notice that there are some crucial differences in this version compared to the previous one. Most importantly, we read the comments directly from the BSON dump rather than querying MongoDB, and we store hashed tokens rather than the token strings themselves.

It’s likely that a lot of the speed-up we experienced can be attributed to these two changes. Both are possible in Spark - using BSONFileInputFormat and HashingTF respectively - and it would be interesting to implement them and see what, if any, speed-up remains in the C++ version.


In the end, the problems we faced were caused by the conflicting requirements of the project. Firstly, we needed to validate the algorithms used, which required us to iterate quickly whilst still running over a decently-sized sample of the data. Secondly, we needed to build out a scalable architecture, which will necessarily have greater startup and runtime overheads.

Although the project was certainly a success, we could do better in future by identifying the different requirements and approaching them separately. I would also look at using a different prototyping language from C++. We use Python in most of our projects, and it would be interesting to see how close we could get to the performance of the C++ version. Projects like NumPy allow Python developers to use more space-efficient data structures, like those available in C++, whilst still retaining the ease of typical Python programming. Another advantage of Python is that it’s supported by Spark, although this has downsides compared to Scala, such as less documentation and fewer code examples to draw from.