Long ago, letsmakeit.com was a Flash P2P app with social media features, so I’ve long had an interest in the distributed machine learning techniques used by the biggest players out there.
The Start: Learning from the Pros
I’m basing most of my work here on the fantastic Apache Mahout library, which implements a “collaborative filtering” recommendation algorithm. It treats user data as lists of things; you and I may “co-occur” in our viewed-movie lists if we’ve both seen the same movie. The sweet thing about Mahout’s version is that it can take in any and all sets of user lists to make good recommendations on any and all other sets of user lists. You could suggest a movie to me based on my IP address region AND my viewed categories, and so on; the more input dimensions, the better the results. Most importantly, there’s an element of “surprise” when looking at co-occurrences in lists you share with others (Mahout scores this with a log-likelihood ratio test). It means that you and I are somewhat similar if we’ve both watched Spiderman, but much more similar if we’ve both watched an obscure movie like Sharknado.
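That “surprise” score is worth making concrete. Mahout uses Ted Dunning’s log-likelihood ratio (LLR) test to decide which co-occurrences matter; here’s a minimal Python sketch of that score with toy numbers (the variable names and the demo data are mine, not Mahout’s API):

```python
import math

# Contingency counts for two movies A and B across all users:
# k11 = watched both, k12 = A only, k21 = B only, k22 = neither.

def _xlogx(x: float) -> float:
    return x * math.log(x) if x > 0 else 0.0

def _entropy(*ks: float) -> float:
    # Unnormalized Shannon entropy, as in Dunning's formulation.
    return _xlogx(sum(ks)) - sum(_xlogx(k) for k in ks)

def llr(k11: int, k12: int, k21: int, k22: int) -> float:
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return max(0.0, 2 * (row + col - mat))

# Two blockbusters each seen by half of 1000 users, overlapping exactly
# as chance predicts: no surprise at all.
blockbuster_score = llr(250, 250, 250, 250)

# Two obscure movies seen by the same 5 people out of 1000: big surprise.
obscure_score = llr(5, 0, 0, 995)
```

The popular pair scores essentially zero while the obscure overlap scores high, which is exactly the Spiderman-vs-Sharknado intuition above.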
The Middle: The Limitations of Spark
After some frustration trying to get this running fast in Mahout’s native Spark implementation, I quickly realized that while I typically love functional programming methods, they just stink for this problem. Mahout is essentially taking these HUGE matrices of every user list crossed with every other list (an NxN matrix) that are almost exclusively zeros, and doing a humongous (although distributed) O(N^3) matrix multiplication to crank out recommendations. They say they achieve “realtime” by throwing out most updates that aren’t likely to matter much and rerunning the matrix crunch as often as possible. Bah-humbug, I say! Throwing out the vast majority of your hot inputs is a pretty bogus way of being realtime, in my opinion. Out of the way, Apache Spark, and hello, Apache Ignite! Ignite is similar to Spark in many ways but allows “mutable” state. In Spark, a huge dataset can never change; you can only apply a function to it and get another huge, unchangeable result. In Ignite, I can change the tiniest datum within a huge cluster of servers and build my own dang computation graph however I please.
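To see why mutability matters here, consider a toy contrast (my own sketch in Python; this is not real Spark or Ignite API, just the shape of the two models):

```python
# A big keyed dataset of per-movie counters.
immutable_counts = {f"movie{i}": 0 for i in range(100_000)}

# Spark-style: the dataset is immutable, so bumping one counter means
# transforming the ENTIRE collection into a new one -- O(N) work per event.
new_counts = {k: (v + 1 if k == "movie42" else v)
              for k, v in immutable_counts.items()}

# Ignite-style: the store is mutable, so one event touches one entry -- O(1).
mutable_counts = dict(immutable_counts)
mutable_counts["movie42"] += 1
```

When your events are tiny and constant, paying a full-dataset transformation per event is exactly the cost that forces the “throw most updates away” compromise.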
One recurring lesson in scalable architecture is that you’re only as scalable as your data store, so achieving realtime got a whole lot easier with this move. Taking full advantage of the many zeros in the Mahout matrices, I converted their matrix state into distributed Apache Ignite hashtables, such that a missing key in the hashtable is interpreted as a zero in the matrix. Although the savings depend on the use case, this cuts the necessary memory (and size of cluster!) by about 90% in my case. As a finishing touch, I decomposed the O(N^3) matrix multiplication into a bunch of hashtable lookups that cost no more than N^2 per update, and are typically even faster than N thanks to all those zeros. An update happens whenever there’s a user event like “User watched Spiderman”. If every user had watched every movie available, that update would cost N^2, but more likely it’s someone’s 20th movie, so the vast majority of those theoretical N^2 operations never happen!
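Here’s the core of that idea as a standalone Python sketch (plain dicts standing in for distributed Ignite caches; all names are mine): the co-occurrence “matrix” is a hashtable of hashtables where a missing key means zero, and each user event only touches the cells it actually changes.

```python
from collections import defaultdict

# cooc[a][b] = number of times movies a and b co-occur in someone's history.
# Missing keys are implicit zeros, so only nonzero cells ever use memory.
cooc = defaultdict(lambda: defaultdict(int))

# history[user] = set of items that user has already watched.
history = defaultdict(set)

def on_event(user: str, item: str) -> None:
    """Handle 'user watched item': update only the affected cells.

    Cost is O(k) for a user with k prior items -- the theoretical N^2
    only happens if someone has watched the entire catalog."""
    for prior in history[user]:
        cooc[prior][item] += 1
        cooc[item][prior] += 1
    history[user].add(item)

on_event("alice", "Spiderman")
on_event("alice", "Sharknado")  # 1 prior item -> just 2 cell updates
```

The same event-at-a-time update is what replaces the big batch multiply: instead of recomputing the whole product, each event increments a handful of hashtable entries.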
The End: Realtime recommendation query language in Scala?!
By the end of the project, I was able to twist the core Mahout algorithm into a full-blown (and even type-safe) realtime query language that lets you dynamically mix different dimensions of user data together to get all sorts of suggestions. Here’s an example that’s actually completely valid Scala code, boiling your whole biz layer down to one file of SQL-like declarative queries like this:
// Get all similar accounts to me based on friend and contact lists
val similarAccounts = Select( ACCOUNT_OBJ,
  From( myContacts ) Join myFriends
  Where( Not( InputEqualsOutput() ) )
)

// Get the publishers that people similar to me subscribe to,
// who I haven't already subscribed to.
val recommendedPublishers = Select( PUBLISHER_OBJ,
  From( similarAccounts ) --> (publishers)
  Where( Not( InputRelatedBy(publishers) ) And Not( InputEqualsOutput() ) )
)
The first variable, “similarAccounts”, is a function that, given a user ID as input, produces a list of similar account IDs by joining your device contact list with your friend list: people similar to you by personal relations. The second query uses a nice functional composition that takes your similar accounts and traverses (-->) their “subscribed to” relationships to publishers. This closes the loop and shows how we take two independent dimensions (friends + contacts) to recommend something unrelated, such as who best to subscribe to. The data backing the query functions is constantly changing as events happen, and it updates as fast as it can in an async, eventually-consistent way. Ignite also makes distributed event-driven programming easy, so whenever these query results change for any user, we can immediately ship the new suggestions off to another micro-service without manually requesting them.
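If it helps to see the composition stripped of the DSL, here’s how I think about those two queries in plain Python (toy data and names of my own invention; I’m reading Join as a set intersection for illustration, which is not necessarily how the real Scala operators behave):

```python
# Hypothetical per-user relationship data.
contacts = {"me": {"ann", "bob", "cat"}}
friends = {"me": {"bob", "cat", "dan"}}
subscriptions = {"bob": {"TechWeekly"},
                 "cat": {"TechWeekly", "CatFacts"},
                 "me": {"CatFacts"}}

def similar_accounts(user: str) -> set:
    # Join contacts with friends; Not(InputEqualsOutput) drops the user.
    return (contacts.get(user, set()) & friends.get(user, set())) - {user}

def recommended_publishers(user: str) -> set:
    # Traverse similar accounts' subscriptions (the --> step), then filter
    # out publishers the user already follows (the InputRelatedBy filter).
    candidates = set()
    for acct in similar_accounts(user):
        candidates |= subscriptions.get(acct, set())
    return candidates - subscriptions.get(user, set())
```

Each “query” is just a function from a user ID to a set of IDs, so queries compose like ordinary functions, which is exactly what makes the second query able to build on the first.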
The Circle of Wall Headbutts
Here’s a simplified version of where this can fit into your architecture. I started drawing a slightly more realistic one, but I think this is ugly enough: