Friday, October 16, 2015

Code Halos and Rethinking Data Analytics - Spark is HOT!

Technical Guru
Peter Rogers
Our resident mobile technology and digital guru, Peter Rogers, introduces us to his latest research and findings. If understanding how to incorporate big data analytics in mobile and IoT apps is not your thing, go no further. Seriously, go NO further. Peter gets excited about...actually I have no idea. He digs deep into Spark - an open source big data processing framework built around speed, ease of use and sophisticated analysis, with support for modern programming languages. Enjoy!

Three years ago, a professor told me Spark was the future, but I never really thought to look into it until now. I am amazed at how powerful it is! If you look at the top three books on Safari Books Online today, you will see Spark, JavaScript and Micro-Services. I am not the only one who is impressed!

In a Digital World the currency is data, but this requires an extensive processing facility. It is not just IoT and its associated world of sensors that are generating tidal waves of data; analytics is also being built into the majority of software, and our whole lifestyles are being set against a backdrop of continual statistical analysis, predictive analytics and artificial intelligence. This is why the single most read technical book at the moment is on Spark, the future of data processing.

Unfortunately, before I can even introduce Spark I have to give you a Big Data 101, and this will either be the most painful or the most incredible (free) guide you will find on the subject. I hereby thank the referenced blogs in advance.

We have to start at a place called Functional Programming. You may have heard of the following functional programming languages: Haskell; Erlang; Elixir; Lisp; D; R; Scala; Wolfram; Standard ML; and Clojure [Kevin B: No, sorry missed that class]. You would probably be more surprised to know that CoffeeScript, the later versions of JavaScript and Underscore.js have aspects of functional programming too.

If this strikes fear in your heart then do not panic, as we have an easy path. Even though it has roots in Lambda Calculus, you really do not need to know anything about that. I am going to teach you everything you need to know, but you will have to do some reading and there is no better place than this guide.

The purpose of functional programming is to break our programs down into smaller and simpler units that are more reliable and easy to understand. This is the perfect paradigm for being able to distribute tasks across a network of dedicated computers all working together to solve the task faster.

The core concepts of functional programming are as follows:

  • Functional programs are immutable – nothing can change
  • Functional programs are stateless – there is an ignorance of the past
  • Functions are first class and higher order – functions can be passed in as arguments to other functions, can be defined in a single line and are pure (have no side effects)
  • Iteration is performed by recursion as opposed to loops
  • Referential transparency – there are no assignment statements as the value of a variable never changes once defined
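As a rough illustration of the concepts above, here is a minimal JavaScript sketch. The names (original, doubled, multiplyBy, tripled, countDown) are made up for this example and do not come from any library.

```javascript
// Immutability: freeze the input and build a new array rather than changing it.
const original = Object.freeze([1, 2, 3]);
const doubled = original.map((n) => n * 2);

// Higher order: multiplyBy returns a new function, which we then pass to map().
const multiplyBy = (factor) => (n) => n * factor;
const tripled = original.map(multiplyBy(3));

// Recursion instead of a loop: count down from n to 0.
const countDown = (n) => (n <= 0 ? [0] : [n].concat(countDown(n - 1)));

console.log(original);     // [ 1, 2, 3 ], unchanged
console.log(doubled);      // [ 2, 4, 6 ]
console.log(tripled);      // [ 3, 6, 9 ]
console.log(countDown(3)); // [ 3, 2, 1, 0 ]
```

Every function here is pure: given the same input it always returns the same output and touches nothing outside itself.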

We generally start by defining functions that perform all the grunt work and these follow the general rules:

  • All functions must accept at least one argument
  • All functions must return data or another function
  • Use recursion instead of loops
  • Use single line (arrow) functions where utility methods require a function as a parameter
  • Chain your functions together to achieve the final result

As an example, let us consider a 3D visualisation that takes an API response from a Mensa API Web Service and shows the average IQ for each town. We start by defining a set of utility functions that combine to do the grunt work. We will end up with a single line of code that obtains the coordinates for drawing the graph.

  • A simple function that adds two numbers together
  • A recursive function that adds everything up in a numerical array
  • A simple function that takes an average given a total and a count
  • A function that builds on the last to take the average for a whole array
  • A function that returns a function that finds an item in a data structure
  • A recursive function that combines arrays

With this in place we can start to look at the built-in methods that often exist in seemingly non-functional languages. In JavaScript, we can find the map() and reduce() methods on the Array object. Each takes a processing function: map() converts a set of values to another set of values, and reduce() reduces an array to a single value. Now we can use our freshly defined utility functions along with the Array methods to create a single line of code that retrieves the coordinates - of course we would probably be using Scala here – but WOW!

var pts = combArr( find(data, 'IQ').map(averageForArray), find(data, 'town') );
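To make that one-liner concrete, here is one possible sketch of the utility functions from the list above. Only combArr, find and averageForArray are names taken from the line itself; the other names (add, totalForArray, average, getItem) and the shape of the sample data are assumptions made for illustration, since the real Mensa API response is not shown here.

```javascript
// Hypothetical sample of what the API response might look like.
const data = [
  { town: 'Bath', IQ: [100, 120, 140] },
  { town: 'Leeds', IQ: [90, 110] }
];

// A simple function that adds two numbers together.
const add = (a, b) => a + b;

// A recursive function that adds everything up in a numerical array.
const totalForArray = (arr, total = 0) =>
  arr.length === 0 ? total : totalForArray(arr.slice(1), add(total, arr[0]));

// A simple function that takes an average given a total and a count.
const average = (total, count) => total / count;

// Builds on the last two to take the average for a whole array.
const averageForArray = (arr) => average(totalForArray(arr), arr.length);

// A function that returns a function that finds an item in a data structure.
const getItem = (key) => (item) => item[key];

// find() as used in the one-liner: pull one property out of every record.
const find = (records, key) => records.map(getItem(key));

// A recursive function that combines two arrays into an array of pairs.
const combArr = (xs, ys, out = []) =>
  xs.length === 0 || ys.length === 0
    ? out
    : combArr(xs.slice(1), ys.slice(1), out.concat([[xs[0], ys[0]]]));

const pts = combArr(find(data, 'IQ').map(averageForArray), find(data, 'town'));
console.log(pts); // [ [ 120, 'Bath' ], [ 100, 'Leeds' ] ]
```

Note that nothing is ever reassigned: each function takes values in and hands new values back, which is what makes the final line so easy to chain.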

See, functional programming was not that hard after all! Now we come to MapReduce, which is “a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.” Google famously used MapReduce to regenerate its index of the World Wide Web but has now moved on to streaming solutions (Percolator, Flume and MillWheel) for live results.

A lot of people regularly mention MapReduce in conversation but few actually know what it is or how it works. Luckily for you, I will provide a few guides to get you up and running. It helps to remember that the map() function converts a set of values into a new set of values, and the reduce() function combines a set of values into a single new value. Now imagine that you can split all of your tasks up into map() and reduce() functions that run across a distributed network of computers. Each worker node runs the map() function against its local data and writes the output to temporary storage. Next comes a shuffle, in which worker nodes redistribute data based on the keys output by the map() function, so that all data belonging to one key ends up on the same worker node. Each worker node then runs the reduce() function on its groups of data in parallel, and a combined result is returned.

In a nutshell, we use the map() and reduce() functions from functional programming but we apply them in a five-step distributed program context.

  1. We prepare the map() input by designating map processors and providing each processor with the input data it will work on
  2. We run the custom map() code exactly once for each input key
  3. We shuffle the map() output to the reduce processors, so that all values for one key arrive at the same processor
  4. We run the custom reduce() code exactly once for each key produced by the map() output
  5. We collect all the reduce() output, sort it and return it as the final outcome
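The five steps can be imitated on a single machine in plain JavaScript. This is only a toy sketch of the idea, not how Hadoop or any real framework is implemented: the two inner arrays stand in for data held on two worker nodes, and every name (partitions, mapFn, shuffle, reduceFn) is hypothetical.

```javascript
// Step 1: hypothetical input, already split across two "worker nodes".
const partitions = [
  [{ town: 'Bath', iq: 100 }, { town: 'Leeds', iq: 90 }],
  [{ town: 'Bath', iq: 140 }, { town: 'Leeds', iq: 110 }]
];

// Step 2: each worker runs the custom map() code over its local records,
// emitting [key, value] pairs.
const mapFn = (record) => [record.town, record.iq];
const mapped = partitions.map((part) => part.map(mapFn));

// Step 3: shuffle, so that all values for one key end up together.
const shuffle = (mappedPartitions) => {
  const groups = new Map();
  for (const part of mappedPartitions) {
    for (const [key, value] of part) {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    }
  }
  return groups;
};
const grouped = shuffle(mapped);

// Step 4: run the custom reduce() code once per key (here: average the values).
const reduceFn = (key, values) =>
  [key, values.reduce((a, b) => a + b, 0) / values.length];
const reduced = Array.from(grouped, ([key, values]) => reduceFn(key, values));

// Step 5: collect the output, sort it by key and return it.
const result = reduced.sort(([a], [b]) => a.localeCompare(b));
console.log(result); // [ [ 'Bath', 120 ], [ 'Leeds', 100 ] ]
```

In a real cluster the shuffle in step 3 is the part that moves data between machines, which is why it dominates the running time.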

This has worked fine in the past, but the problem is that shuffling takes time: it involves reading and writing to a file system, along with HTTP network connections pulling in remote data. It is also a sequential batch processing system, so it cannot handle live data, which is why Google moved on some time back.

Finally we arrive at Spark, which is an open source big data processing framework built around speed, ease of use and sophisticated analysis, with support for modern programming languages like Java, Scala and Python.

Spark has the following advantages over Hadoop and Storm:

  1. It offers a comprehensive framework for big data processing with a variety of diverse data sets
  2. It offers real-time streaming data support as well as batch processing
  3. It enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk
  4. It supports Java, Scala and Python, and comes with a built in set of high level operations in a concise API for quickly creating applications
  5. In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing.

Excited yet? Well, beyond the core API there are a number of additional libraries:

  • Spark Streaming for processing real-time data
  • Spark SQL exposes Spark datasets over JDBC and runs SQL-like queries
  • Spark MLlib offers common learning algorithms for Machine Learning
  • Spark GraphX offers graphs and graph-parallel computation
  • BlinkDB allows running interactive SQL queries on large volumes of data
  • Tachyon offers a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks
  • Integration adaptors connect other products like Cassandra and Kafka

Now if I told you that the following snippet could count the words in a file of theoretically any size imaginable, then you would probably be left scratching your head at how the code is so short, and then ask for a programming guide to this wizardry immediately.

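val t = sc.textFile("input.txt") // assumption: "input.txt" is a placeholder path and sc is an existing SparkContext
val d = t.flatMap(l => l.split(" ")).map(w => (w, 1)).reduceByKey(_ + _) // split lines into words, pair each word with 1, sum the 1s per word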
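If Scala is new to you, the same chain can be mimicked in plain JavaScript on an in-memory array. This is only an analogy, assuming t holds the lines of a text file: there is no Spark here and nothing is distributed. The reduceByKey helper below is a hypothetical stand-in written for this sketch, and the sample lines are made up.

```javascript
// Hypothetical stand-in for the lines of a text file.
const lines = ['to be or not to be', 'that is the question'];

// Hypothetical helper that imitates the idea of reduceByKey:
// combine all values that share a key using the supplied function.
const reduceByKey = (pairs, combine) =>
  pairs.reduce((acc, [key, value]) => {
    acc.set(key, acc.has(key) ? combine(acc.get(key), value) : value);
    return acc;
  }, new Map());

const counts = reduceByKey(
  lines
    .flatMap((l) => l.split(' ')) // one flat list of words
    .map((w) => [w, 1]),          // pair each word with a count of 1
  (a, b) => a + b                 // add up the counts for each word
);

console.log(counts.get('to'));       // 2
console.log(counts.get('question')); // 1
```

The difference with Spark is that each of those steps can run in parallel across a cluster, with the shuffle happening inside reduceByKey.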

Kevin Benedict
Writer, Speaker, Analyst and World Traveler

***Full Disclosure: These are my personal opinions. No company is silly enough to claim them. I am a mobility and digital transformation analyst, consultant and writer. I work with and have worked with many of the companies mentioned in my articles.