A Detailed Procedure For Migrating From Hadoop To Apache Spark

Data is continuously generated from a variety of sources, and this flow keeps producing ever more data. This huge volume of data is called Big Data, and storing it is a real challenge.

Hadoop became one of the most popular tools to solve this problem: it uses a distributed system to store and process the data. But now we have a newer tool, Apache Spark, which is more efficient and can run on top of the Hadoop Distributed File System (HDFS).

So, what makes Apache Spark better?

Performance 

Apache Spark can be up to 100x faster than Hadoop, especially for in-memory workloads.

Data Processing

Apache Spark supports batch processing as well as real-time stream processing, whereas Hadoop can deal only with batch processing of data.

Execution Model

Apache Spark uses lazy evaluation, whereas Hadoop does not.

Suppose we have a 128 MB file containing some number of words, and we have to count the occurrences of each word in the file. Let's look at how Hadoop and Apache Spark execute this program.

Words.txt (128 MB)

Hadoop Execution – 

  1. Mapper phase – every word in the file is emitted as a key with the value one.
  2. Sort and shuffle phase – the mapper output is grouped and sorted into the format required by the reducer phase.
  3. Reducer phase – aggregates the values for each key to produce the final output.

(Figure: Hadoop Execution Phases)

Hadoop uses HDFS (Hadoop Distributed File System) to store data in a distributed fashion. Here the 128 MB file gets divided into two parts of 64 MB each, stored on two different machines. Now, to count the occurrences of words, Hadoop executes the three phases discussed above in parallel on the two machines.
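To make these phases concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets the mapper and reducer be written as plain Python scripts reading from stdin and writing to stdout. The script names and the sample command below are illustrative assumptions, not code from the original setup.

```python
#!/usr/bin/env python3
# mapper.py -- Mapper phase: emit "<word>\t1" for every word in the input.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Reducer phase: the sort/shuffle phase delivers lines grouped
# by word, so consecutive counts for the same word can simply be summed.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job like this would typically be launched with something like `hadoop jar hadoop-streaming.jar -input Words.txt -output counts -mapper mapper.py -reducer reducer.py`. Between the two scripts, Hadoop writes the mapper output to disk, sorts and shuffles it, and reads it back for the reducer, which is exactly the I/O cost discussed below.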

Parallel execution is okay, but don't you see a disadvantage here?

Yes, we have a problem here: in Hadoop, the output of every phase is written to the machine's disk and read back from the disk to execute the next phase. This means many more I/O (input/output) operations on disk, which degrade the speed of the program, and this is the main reason Hadoop execution becomes slow.

Can we avoid this? Yes, by using Apache Spark. Let's look at how Apache Spark solves this problem.

Apache Spark Execution –

(Figure: Apache Spark Execution)

As shown in the figure above, the 128 MB file is again divided into two parts of 64 MB each, which reside on the disks of the two different machines. We will call them blocks B1 and B2.

The Magic Of Apache Spark – 

As the Spark execution begins, the very first line of code reads the file from the respective machines. As soon as Apache Spark reads the file (block B1) on the first machine, it allocates a new block in that machine's memory; we will call it block B3. The same happens on the second machine, where block B4 gets created. These in-memory blocks B3 and B4 are called an RDD.

Once blocks B3 and B4 are created, the next step is to count the words in B3 and B4, which hold the data read from the file. As execution continues on B3 and B4, new blocks B5 and B6 get created, containing the output of the program.

This execution happens only in the memory of the machines. No intermediate output is written to disk and read back for the next step. It is not hard to imagine how much faster the computation becomes when the whole execution stays in memory, and this is why Apache Spark can be up to 100x faster than Hadoop.
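As a sketch, here is the same word count written with the standard PySpark API; the HDFS path is assumed for illustration.

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

# Reading the file defines an RDD partitioned across the worker machines;
# once the job runs, its in-memory partitions are the B3/B4 blocks above.
lines = sc.textFile("hdfs:///Words.txt")

# These transformations are chained together; unlike Hadoop, Spark does not
# materialize each step's output as files on disk before the next step.
counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum counts per word

# Only this action triggers the actual computation and returns the result.
for word, count in counts.collect():
    print(word, count)
```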

What Else Does Apache Spark Offer?

1) RDD (Resilient Distributed Dataset) –

It is called a distributed dataset because the file (dataset) is distributed over two different machines, and the term resilient refers to fault tolerance. If a block (B1 or B2) or a machine gets corrupted or shuts down, Apache Spark can still get the data back, because copies of the blocks are stored on different machines; hence it is called resilient.
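A minimal sketch of keeping an RDD in memory across jobs, reusing the `sc` context and the assumed `Words.txt` path from the earlier sketch:

```python
from pyspark import StorageLevel

# The file is split into partitions distributed across the cluster.
lines = sc.textFile("hdfs:///Words.txt")
print(lines.getNumPartitions())   # e.g. 2 partitions for a 128 MB file with 64 MB blocks

# Persist the RDD so later actions reuse the in-memory copy. If a partition
# (or the machine holding it) is lost, Spark rebuilds it from the recorded
# lineage and the replicated HDFS blocks -- this is the "resilient" part.
words = lines.flatMap(lambda line: line.split()).persist(StorageLevel.MEMORY_ONLY)
print(words.count())   # first action computes and caches the partitions
print(words.count())   # second action reuses the cached data
```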

2) Lazy Evaluation –

Let's say we have to count word occurrences based only on the first five lines of the file. Hadoop will read the whole file and then give you the word occurrences for the first five lines. In Apache Spark, however, a DAG (Directed Acyclic Graph) is built for the code you run, so Spark knows it only has to produce the word-count output for the first five lines. Using this DAG, during execution it reads only the first five lines and gives the word counts for just those lines.

This kind of execution is called lazy evaluation; it is implemented in Apache Spark and not in Hadoop, and it further increases Apache Spark's performance.
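A small sketch of lazy evaluation in PySpark, mirroring the five-line example above (again assuming the `sc` context and the `Words.txt` path):

```python
# Transformations are lazy: this does not read the file yet, it only
# records the operation in the DAG.
lines = sc.textFile("hdfs:///Words.txt")

# take(5) is an action. Spark reads only as much of the file as it needs
# to return five lines, rather than scanning the whole 128 MB.
first_five = lines.take(5)

# Count the words in just those five lines (a plain local computation).
from collections import Counter
counts = Counter(word for line in first_five for word in line.split())
print(counts)
```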

Conclusion 

Hadoop was created as an engine for processing large amounts of existing data, and it offers a low level of abstraction. Spark, on the other hand, I found easier and faster, with a lot of convenient high-level tools such as Spark SQL, the MLlib machine-learning library, and functions that can simplify your work.

Thank you for reading!!!
