In a Digital World the currency is data, but that currency requires an extensive processing facility. It is not just the IoT and its associated world of sensors that are propagating tidal waves of data; analytics is being built into the majority of software, and our whole lifestyles are set against a backdrop of continual statistical analysis, predictive analytics and artificial intelligence. This is why the most-read technical book of the moment is on Spark, the future of data processing.
Unfortunately before I can even introduce Spark I have to give you a Big Data 101 and this will either be the most painful or the most incredible (free) guide you will find on the subject. I hereby thank the referenced blogs in advance.
If functional programming strikes fear in your heart then do not panic, as we have an easy path. Even though it has its roots in lambda calculus, you really do not need to know anything about that. I am going to teach you everything you need to know, but you will have to do some reading and there is no better place than this guide.
The purpose of functional programming is to break our programs down into smaller, simpler units that are more reliable and easier to understand. This is the perfect paradigm for distributing tasks across a network of dedicated computers all working together to solve the problem faster.
The core concepts of functional programming are as follows:
- Functional programs are immutable – data never changes once it has been created
- Functional programs are stateless – functions carry no memory of past calls
- Functions are first class and higher order – functions can be passed as arguments to other functions, can be defined in a single line, and are pure (they have no side effects)
- Iteration is performed by recursion as opposed to loops
- Referential transparency – there are no assignment statements, as the value of a variable never changes once it is defined
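To make these concepts concrete, here is a minimal JavaScript sketch (the function names are my own, purely for illustration): each function is pure, nothing is ever reassigned, and iteration is done by recursion.

```javascript
// Pure and referentially transparent: the result depends only on the
// argument, so double(4) can always be replaced by the value 8.
const double = (x) => x * 2;

// Iteration by recursion instead of a loop: no counter is ever mutated.
const factorial = (n) => (n <= 1 ? 1 : n * factorial(n - 1));

// Higher order: map() takes the double function as an argument
// and returns a new array, leaving the original untouched.
const doubledAll = [1, 2, 3].map(double); // → [2, 4, 6]
```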
We generally start by defining functions that perform all the grunt work and these follow the general rules:
- All functions must accept at least one argument
- All functions must return data or another function
- Use recursion instead of loops
- Use single line (arrow) functions where utility methods require a function as a parameter
- Chain your functions together to achieve the final result
As an example, let us consider a 3D visualisation that takes an API response from a Mensa API web service and visualises the average IQ per town. We start by defining a set of utility functions that are combined to do the grunt work. We will end up with a single line of code that obtains the coordinates for drawing the graph.
- A simple function that adds two numbers together
- A recursive function that adds everything up in a numerical array
- A simple function that takes an average given a total and a count
- A function that builds on the last to take the average for a whole array
- A function that returns a function that finds an item in a data structure
- A recursive function that combines arrays
var pts = combArr(find(data, 'IQ').map(averageForArray), find(data, 'town'));
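Here is one possible JavaScript implementation of those six utility functions, so the one-liner above actually runs. The data shape is my own assumption (an array of records like { town: 'Leeds', IQ: [98, 120, 104] }), and I have given find the two-argument signature the one-liner uses rather than the curried, function-returning form described in the list.

```javascript
// A simple function that adds two numbers together
const add = (a, b) => a + b;

// A recursive function that adds everything up in a numerical array
const sum = (xs) => (xs.length === 0 ? 0 : add(xs[0], sum(xs.slice(1))));

// A simple function that takes an average given a total and a count
const average = (total, count) => total / count;

// A function that builds on the last to take the average for a whole array
const averageForArray = (xs) => average(sum(xs), xs.length);

// A function that finds (extracts) the named item from each record
const find = (records, key) => records.map((record) => record[key]);

// A recursive function that combines two arrays into pairs of coordinates
const combArr = (a, b) =>
  a.length === 0 ? [] : [[a[0], b[0]]].concat(combArr(a.slice(1), b.slice(1)));

// Sample data, then the single line that obtains the graph coordinates
const data = [
  { town: 'Leeds', IQ: [100, 110] },
  { town: 'York', IQ: [90, 100, 110] },
];
var pts = combArr(find(data, 'IQ').map(averageForArray), find(data, 'town'));
// pts → [[105, 'Leeds'], [100, 'York']]
```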
See, functional programming was not that hard after all? Now we come to MapReduce, which is "a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster." Google famously used MapReduce to regenerate its index of the World Wide Web, but has since moved on to streaming solutions (Percolator, Flume and MillWheel) for live results.
A lot of people regularly mention MapReduce in conversation, but few actually know what it is or how it works. Luckily for you, I will provide a few guides to get you up and running. It helps to remember that the map() function converts a set of values into a new set of values, while the reduce() function combines a set of values into a single value. Now consider that you can split all of your tasks up into map() and reduce() functions that run across a distributed network of computers. Each worker node runs the map() function against its local data and writes the results to temporary storage. Next comes a shuffle, in which worker nodes redistribute data based on the output of the map() function, so that all data belonging to one key ends up on the same worker node. Each worker node then processes its keys with the reduce() function in parallel, and a combined result is returned.
In a nutshell, we take the map() and reduce() functions from functional programming and apply them in a five-step distributed programming context.
- We prepare the map() input by designating map processors and providing each with its share of the input data
- We run the custom map() code exactly once for each input value
- We shuffle the map() output to the reduce processors
- We run the custom reduce() code exactly once for each key grouped from the map() output
- We collect all the reduce() output, sort it and return it as the final outcome
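The five steps above can be sketched in plain JavaScript as a single-process word count. The function and variable names here are illustrative, and a real cluster would run the map and reduce phases on separate worker nodes:

```javascript
// Step 1: the input split across "workers" (here, just array partitions)
const partitions = [['the cat sat'], ['the mat']];

// Step 2: the custom map() – emit a (word, 1) pair for every word
const mapFn = (line) => line.split(' ').map((w) => [w, 1]);

// Step 3: the shuffle – group every emitted pair by its key
const shuffle = (pairs) =>
  pairs.reduce((groups, [key, value]) => {
    (groups[key] = groups[key] || []).push(value);
    return groups;
  }, {});

// Step 4: the custom reduce() – run once per key, combining its values
const reduceFn = (values) => values.reduce((a, b) => a + b, 0);

// Step 5: feed the partitions through and collect the sorted result
const mapped = partitions.flat().flatMap(mapFn);
const grouped = shuffle(mapped);
const counts = Object.keys(grouped)
  .sort()
  .map((key) => [key, reduceFn(grouped[key])]);
// counts → [['cat', 1], ['mat', 1], ['sat', 1], ['the', 2]]
```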
This has worked fine in the past, but the shuffle takes time because it reads from and writes to the file system, and remote data has to be pulled in over HTTP connections. MapReduce is also a sequential batch processing system, so it cannot handle live data – which is why Google moved on some time back.
Finally we arrive at Spark, which is an open source big data processing framework built around speed, ease of use and sophisticated analysis, with support for modern programming languages like Java, Scala and Python.
Spark has the following advantages over Hadoop and Storm:
- It offers a comprehensive framework for big data processing with a variety of diverse data sets
- It offers real-time streaming data support as well as batch processing
- It enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk
- It supports Java, Scala and Python, and comes with a built-in set of high-level operations in a concise API for quickly creating applications
- In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning and graph data processing.
Excited yet? Well other than the core API there are a number of additional libraries:
- Spark Streaming for processing real-time data
- Spark SQL exposes Spark datasets over JDBC and runs SQL-like queries
- Spark MLlib offers common learning algorithms for Machine Learning
- Spark GraphX offers graphs and graph-parallel computation
- BlinkDB allows running interactive SQL queries on large volumes of data
- Tachyon offers a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks
- Integration adaptors connect other products like Cassandra and Kafka
Now, if I told you that the following two lines of code could print out the word count of a file of theoretically any size imaginable, you would probably be left scratching your head at how the code can be so short – and then immediately ask for a programming guide to this wizardry.
val t = sc.textFile("input.txt") // assumes a SparkContext named sc; the file name is illustrative
val d = t.flatMap(l => l.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
Writer, Speaker, Analyst and World Traveler
***Full Disclosure: These are my personal opinions. No company is silly enough to claim them. I am a mobility and digital transformation analyst, consultant and writer. I work with and have worked with many of the companies mentioned in my articles.