Finally, I made a recommendation engine using Spark!
One of the things I’ve wanted to do for a really long time is apply machine learning and more advanced analytics to my own datasets. For a while it wasn’t that feasible: R cannot handle very large datasets on its own, MATLAB and Octave take a long time to program in, and when I first heard about Spark, I didn’t have the time to install it on my own. Fortunately, Cloudera Manager now offers Spark for CDH5, so we can use Spark (as well as its machine learning libraries) on our own data using our own cluster. Installing Spark took just a few button clicks in Cloudera Manager, and suddenly we were ready to start applying some data science to our own datasets!
When I first started using Spark, I found Spark’s quick start guide extremely helpful. It was very straightforward: the guide starts with running commands in Spark’s CLI and ends with running your own scripts using spark-submit. There were a few gotchas in the quick start because we were using the CDH version of Spark, such as getting the README.md file (it comes with Spark when downloaded from the website, but not with Cloudera Manager), but it wasn’t too bad. Spark also offers a choice of three programming languages (Scala, Java, and Python). Although I found going through the quick start guide with Python easier and more straightforward (there’s no need to compile a Python script), I eventually had to program in Scala and compile my code because the machine learning library wasn’t working well with Python and YARN. At first I was slightly confused by the compile process for Scala, but I eventually learned that it comes down to writing a .sbt file and a .scala file and compiling them with the sbt tool, which is available as a separate download.
After getting my feet wet with Spark, the next step was creating the recommendation engine, for which I used Spark’s machine learning library and its collaborative filtering algorithm. I used a recent tutorial as a guide for building a recommendation engine with our own data, which made the process smooth and, honestly, fairly painless.
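For anyone curious what a collaborative filtering model is doing under the hood, here is a toy sketch in plain Python. To be clear, this is not the Spark code from the project: Spark's machine learning library fits its collaborative filtering model with alternating least squares (ALS), while this sketch swaps in plain stochastic gradient descent to keep things short. The ratings, user names, item names, and the functions train_mf, predict, and rmse are all made up for illustration.

```python
import math
import random

# Hypothetical toy ratings as (user, item, rating) triples.
# Names and numbers are invented purely for illustration.
RATINGS = [
    ("alice", "matrix", 5), ("alice", "titanic", 1), ("alice", "notebook", 1),
    ("bob", "matrix", 5), ("bob", "alien", 5), ("bob", "titanic", 1), ("bob", "notebook", 2),
    ("carol", "matrix", 1), ("carol", "alien", 1), ("carol", "titanic", 5), ("carol", "notebook", 5),
    ("dave", "matrix", 2), ("dave", "alien", 1), ("dave", "titanic", 5),
]


def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def train_mf(ratings, n_factors=2, n_epochs=1000, lr=0.02, reg=0.02, seed=0):
    """Learn one small factor vector per user and per item with plain SGD.

    This is a stand-in for ALS: both aim to make dot(user, item)
    approximate the observed ratings, they just optimize differently.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    P = {u: [rng.uniform(0.0, 0.5) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.uniform(0.0, 0.5) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            pu, qi = P[u], Q[i]
            err = r - dot(pu, qi)  # how far off the current guess is
            for f in range(n_factors):
                pu_f, qi_f = pu[f], qi[f]
                # Nudge both vectors toward a smaller error, with L2 shrinkage.
                pu[f] += lr * (err * qi_f - reg * pu_f)
                qi[f] += lr * (err * pu_f - reg * qi_f)
    return P, Q


def predict(P, Q, user, item):
    """Predicted rating for a (user, item) pair, observed or not."""
    return dot(P[user], Q[item])


def rmse(P, Q, ratings):
    """Root mean squared error over the given rating triples."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(total / len(ratings))


P, Q = train_mf(RATINGS)
print("training RMSE:", round(rmse(P, Q, RATINGS), 3))
# alice never rated "alien"; the model fills in a guess from similar users.
print("alice -> alien:", round(predict(P, Q, "alice", "alien"), 2))
```

In this toy data, alice never rated "alien", but her other ratings look a lot like bob's, so the model should score it on the high side for her. That is the basic idea behind the recommendations, just at a much smaller scale than what Spark handles across a cluster.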
I can now see why Spark is becoming such a big deal: it’s a very powerful tool for running data science algorithms, which used to be quite difficult to do with big data. I also got to apply more of my programming skills by writing my own Spark scripts, which is a nice change of pace from making SQL scripts and Tableau dashboards. As my coworker once said while we were implementing Flume for streaming and collecting log data, onward and upward!