Comparison of Big Data Frameworks: Spark vs. Hadoop
Learn the difference between the Spark and Hadoop frameworks for working with big data, and determine which to use for your project.
Before we go into the tech nitty-gritty – here’s one interesting story for you.
In Ashburn, Virginia there sits a scientific research facility called the Janelia Research Campus, a center dedicated almost entirely to neuroscience. A researcher, Jeremy Freeman, began his work there. What Freeman ultimately wants to do is to record and look at every neuron in the human brain simultaneously, in real time, as the brain responds to stimuli.
Freeman started with the zebrafish – a fish that is transparent – and began to collect data about the neuron activity as the fish was given different stimuli. Over time, the data sets he collected became quite large – after all, he was recording every neuron individually and the activity as neurons interacted with one another, over time, in real time.
What he needed was technology that was fast, could record in real time, and could analyze enormous volumes of data very quickly. He found that technology in Apache Spark.
Now, most business owners are not neuroscientists. But they do have a need to gather data and to have big data analysis, often in real time. Think about the retailer or the fintech enterprise that needs to be alerted when the activity within its system is triggering a potential cyber threat. Or the medical practice that needs real-time monitoring data on a large number of patients.
Data science can provide all of these solutions if the right framework and architecture are in place. It can gather unstructured data from multiple sources, organize that data, analyze it, and deliver it in formats that the non-techie can understand.
But, which is the best big data framework? The one that will give optimal performance for a variety of data collection and analysis needs?
The two predominant frameworks to date are Hadoop and Apache Spark.
Both have advantages and disadvantages, and it bears taking a look at the pros and cons of each before making a decision on which best meets your business needs.
Apache Spark vs. Apache Hadoop: Comparison
Both frameworks are open source and made available through the Apache Software Foundation, a non-profit organization that supports software development projects. Hadoop came first, with its 0.1.0 release in April 2006. Apache Spark reached its 1.0 release in May 2014, having already become a Top-Level Apache Project earlier that year.
The important difference for a business enterprise really comes down to how each framework gathers, sorts, stores, and analyzes large amounts of unstructured data.
Pros and Cons of Hadoop
There are four key components of the Hadoop framework ecosystem:
- Common utilities and libraries.
- HDFS (Hadoop Distributed File System) – data is stored in blocks (128 MB by default) that are distributed across the cluster (a group of machines).
- YARN – the technology that manages the cluster. It allocates resources to prevent overloads.
- MapReduce – data is written to the HDFS disk, then read from that disk, mapped according to a function, reduced, and stored back on the disk (see the sketch below).
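To make that flow concrete, here is a minimal word-count sketch in the style of Hadoop Streaming, which lets you write the map and reduce steps as plain scripts. This is illustrative only – the scripts, paths, and launch command will vary by installation.

```python
#!/usr/bin/env python3
# mapper.py - the "map" step: read raw text from stdin, emit "word<TAB>1".
# Hadoop sorts these pairs by key on disk before the reduce step runs.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - the "reduce" step: sum the counts for each word.
# Input arrives from disk, already sorted by key.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word and current_word is not None:
        print(f"{current_word}\t{count}")
        count = 0
    current_word = word
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

A job like this is typically launched with the hadoop-streaming jar, passing the two scripts plus HDFS input and output paths. Notice how every stage round-trips through disk – that detail matters for the speed discussion below.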
There are certainly pros to this distributed file system.
- Data can be transferred between nodes (devices) as necessary.
- Different users can execute applications without worry of overloading the system.
- Data can still be processed if a node fails – replicated blocks on other nodes keep it available, so nothing is lost.
- The framework can store and process complex data sets, reducing the risk of data loss or failure.
- It is scalable – the size of a cluster can be increased from a single machine to hundreds of servers without extensive amounts of work.
- Security can be layered in through Apache Sentry, which provides role-based authorization for data access.
With all of these benefits, it is difficult to imagine any shortcomings of this framework. There is, however, one major drawback that can significantly impact a business.
And that disadvantage is the disk-bound MapReduce model built on HDFS (the Hadoop Distributed File System).
Go back for a minute to Jeremy Freeman and his neurological research. He wanted to be able to livestream his data and see it as it occurred. With the MapReduce component, data manipulation and analysis cannot happen in real time. Data must be read from the disk, mapped with a function, reduced, and stored all over again. This is slow, and much has already changed by the time the data is actually delivered.
The healthcare provider cannot afford to wait for the data from monitoring devices; and by the time a fintech enterprise receives data that signals a potential threat, the cyber crime may already be in progress.
Pros and Cons of Spark
When Apache Spark software became available in 2014, Hadoop vendors added the technology to their arsenals quickly, promoting it along with the many other compatible tools for Hadoop. It was seen as an “add-on” to Hadoop, to perform certain functions that the original framework could not.
Now, however, it is understood that Spark can operate as a standalone service, using features and elements of Hadoop as a user chooses. Thus, the growing debate in data science circles about which framework is preferable.
The major components of Apache Spark are the following:
- The Core – the processing engine of the application. Its key features are in-memory processing and data referencing from external sources.
- Streaming – the ability to ingest data at high speed and provide analytics in near real time. This is accomplished through RDDs (Resilient Distributed Datasets) processed as a continuous series of micro-batches, or through DataFrames.
- GraphX – it can deliver data and data analysis in the form of graphs, which can be customized.
- MLlib – its machine learning library, which makes Spark usable for ML applications and is very fast compared to Apache's disk-based alternatives such as Mahout.
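For contrast with the MapReduce version above, here is the same word count written against the Spark Core API. This is a minimal PySpark sketch – the input path is a placeholder – and the entire pipeline runs in memory, with no intermediate writes to disk.

```python
# A minimal PySpark word count, for contrast with the MapReduce version.
# The input path is a placeholder; assumes a working Spark installation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (
    spark.sparkContext.textFile("hdfs:///data/input.txt")  # placeholder path
    .flatMap(lambda line: line.split())   # split each line into words
    .map(lambda word: (word, 1))          # pair each word with a count of 1
    .reduceByKey(lambda a, b: a + b)      # sum counts per word, in memory
)

print(counts.take(10))  # trigger the computation and sample the result
spark.stop()
```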
The Pros of Apache Spark
- It’s compatible with the Java, Scala, R, and Python programming languages, meaning that you are free to choose the most suitable one for your project.
- Because it operates on RAM rather than an HDFS disk, it is incredibly fast.
- It is simple – Spark has some great APIs that make it easy to work with huge data sets (think neuron activity of zebrafish).
- Compatibility with Hadoop YARN and the ability to use Hadoop storage functions when necessary.
- Offers API integrations with Kafka and Twitter streaming, particularly handy for analyzing social data (see the streaming sketch after this list).
- It is inexpensive because it is open source. Even hiring a developer will cost less because of this factor.
- Incorporates a machine learning module that facilitates continuous training of your algorithms and ultimately improves data quality.
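To give a taste of the streaming and Kafka points above, here is a hedged sketch of Spark Structured Streaming reading from a Kafka topic. The broker address and topic name are placeholders, and the job needs the spark-sql-kafka connector package available on the classpath.

```python
# Illustrative sketch of real-time ingestion with Structured Streaming.
# Broker address and topic are placeholders; requires the spark-sql-kafka
# connector package.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-alerts").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "transactions")               # placeholder topic
    .load()
    .selectExpr("CAST(value AS STRING) AS payload")
)

# Print each micro-batch as it arrives; a real deployment would write to
# a dashboard, database, or alerting system instead.
query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```

This is exactly the shape of the fintech fraud-alert or patient-monitoring scenario from the beginning of the article: data is analyzed as it arrives rather than after a disk round-trip.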
The Major Con
Spark does not have a storage layer of its own. That is still left to Hadoop’s HDFS or some other cloud storage platform.
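In practice, this means a Spark job points at external storage for both input and output. A minimal sketch, assuming data in HDFS and an S3 bucket – the paths and the "region" column are hypothetical, and writing to S3 requires the hadoop-aws libraries:

```python
# Spark brings the compute; the data lives elsewhere. Paths and column
# names are hypothetical; S3 access needs the hadoop-aws libraries.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-demo").getOrCreate()

sales = spark.read.parquet("hdfs:///warehouse/sales")      # read from HDFS
by_region = sales.groupBy("region").count()                # compute in memory
by_region.write.parquet("s3a://example-bucket/sales-by-region")  # write to S3
spark.stop()
```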
Side by Side Comparison
If we try to encapsulate the difference between Spark and Hadoop, it comes down to a handful of things.
The key difference between Apache Spark and Hadoop is that the latter was originally designed some 10 years earlier. At that time, in-memory computation was impractical because of the cost of RAM, so Hadoop was built around disk storage. Apache Spark, on the contrary, is capable of using as much RAM as is available, and at today's prices you can always add more whenever needed.
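The practical payoff shows up whenever a dataset is reused: cache it once, and subsequent computations run against RAM instead of re-reading from disk. A small sketch – the file path and the "user_id" column are illustrative:

```python
# Illustrative sketch of Spark's in-memory reuse. The path and the
# "user_id" column are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
events = spark.read.csv("hdfs:///data/events.csv", header=True)

events.cache()            # ask Spark to keep the data in executor memory
events.count()            # the first action materializes the cache
top_users = events.groupBy("user_id").count().orderBy("count", ascending=False)
print(top_users.take(5))  # this second pass is served from RAM, not disk
spark.stop()
```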
Speed is very different. Spark is much faster and, for that reason, is the preferred framework in a number of industries – fintech, healthcare, scientific research, and, in some cases, marketing and manufacturing. Hadoop must operate in disk-bound steps, while Spark reads data, analyzes it in memory, and writes the results – boom.
For this reason, Spark is often said to be 10x faster in batch processing and as much as 100x faster in analytics, according to Kirk Borne, a principal data scientist at Booz Allen Hamilton.
Failure recovery is different, too. Hadoop is naturally resilient to system failures – data is written to disk and replicated, and if one node fails, another takes over. Spark has resiliency as well, despite early concerns about it: its RDDs keep track of how each data object was built, and those objects can be held in memory or stored on disk through integration with Hadoop or another data platform.
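One way to see both recovery models at once is to persist a Spark dataset with a storage level that spills to disk: lineage lets Spark recompute any partition lost to a node failure, while the on-disk copy gives Hadoop-style durability. A sketch with an illustrative input path:

```python
# Sketch of Spark's resiliency knobs; the input path is illustrative.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("resilience-demo").getOrCreate()
logs = spark.sparkContext.textFile("hdfs:///data/app.log")

# MEMORY_AND_DISK keeps partitions in RAM and spills to disk when full;
# if a node dies, lineage information lets Spark rebuild lost partitions.
logs.persist(StorageLevel.MEMORY_AND_DISK)
print(logs.count())
spark.stop()
```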
Both are open source and thus will be less expensive than other enterprise systems, even though developers will be required to build the system.
Now think about this – every day, 2.5 quintillion bytes of data are created. And 90% of all the data that exists was created in the past two years. Businesses use data to reveal trends and patterns in their industry as a whole and within their own organizations. When data gets just too big to be collected, organized, and analyzed by traditional methods, frameworks must be built to do it. This is how businesses will remain competitive.
At Romexsoft, we believe that Apache Spark satisfies the needs of businesses of any size, now and as they grow. If you are looking for a big data solution, we are here to discuss your options.