CHAPTER – 1
In this chapter, Big Data, Hadoop and its architecture, MapReduce and its execution overview, and scheduling algorithms are explained. The need for the study, its scope, objectives, research methodology and chapter plan are also discussed.
We live in a digital world where data is growing rapidly because of the ever-increasing use of the internet, sensors and heavy machines. All industries and organizations are affected by this large amount of data. Handling it was once seen purely as a technical problem, but it is now viewed as an opportunity that can be used for many purposes. Analysing this data supports better decision making in an organization and can help predict future values. Some organizations need to keep records for better functionality, and these records must be stored in large databases that occupy a great deal of space. Services such as social-networking sites, e-commerce and retail websites, and market research organizations generate such data. It becomes very complex to handle this data in traditional databases with simple analytical processes. As the size and availability of data increase, advances in technology become essential.
Although the term big data may seem to refer to the volume of data, that is not always the case. The term may also refer to the technology, including the tools and processes, that an organization or company requires to handle large amounts of data and storage facilities. A widely recognized definition comes from the International Data Corporation (IDC): “big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling the high velocity capture, discovery, analysis” 1. Big Data is not the created content, nor is it even its consumption. It is the analysis of all the data surrounding humans and devices. The most fundamental challenge for Big Data applications is to explore large volumes of data and extract useful information or knowledge for future actions 2.
The Apache Hadoop project was created by Doug Cutting and Mike Cafarella in 2005. The name of the project came from the toy elephant of Cutting’s young son. The now-famous yellow elephant has become a household word in just a few years and is a foundational part of almost all big data strategies 3.
Hadoop is one of the technologies used to process Big Data. It is defined as an open-source platform that provides the analytical technologies and computational power required to work with such large sets of data. Hadoop is an Apache Software Foundation project written in Java. The Hadoop core project provides the basic services for building a cloud computing environment with commodity hardware and the APIs for developing software that will run on that cloud 4.
The Hadoop platform provides an improved programming model, which is used to create and run distributed systems quickly and efficiently. Hadoop distributes the data in advance. Data is replicated across a cluster of computers for reliability and availability. Processing occurs where the data is stored.
1.4.1 Features of Hadoop
New nodes can be added to a Hadoop cluster without changing the existing cluster or its data, so a cluster can grow to many nodes without difficulty.
Since Hadoop is used to store large amounts of data, it is more affordable than conventional server-based storage.
Hadoop is flexible to work with, and data from multiple sources can be combined in Hadoop.
Hadoop can perform map and reduce jobs very effectively.
Hadoop is a fault-tolerant system.
Since Hadoop performs parallel computing, the system derives results more effectively and efficiently.
Hadoop offers a large cluster of local servers to store large amounts of data.
1.4.2 Cluster Architecture of Hadoop
A Hadoop cluster consists of a single MasterNode and multiple SlaveNodes. The master node consists of a NameNode and a JobTracker. Each slave node, or worker node, acts as both a DataNode and a TaskTracker.
Figure 1.1 Hadoop Cluster Architecture 5
The cluster uses a heartbeat mechanism: each DataNode sends a “heartbeat” signal to the NameNode at a regular, configurable interval (a few seconds by default), so that the NameNode knows which DataNodes are active and which are inactive 6. The core of Apache Hadoop consists of a storage part, known as HDFS (Hadoop Distributed File System), and a processing part, known as the MapReduce programming model.
The MapReduce master is responsible for scheduling computational work on its slave nodes, while the HDFS master is responsible for partitioning storage across the slave nodes and keeping track of where data is located. More slave nodes can easily be added to a Hadoop system to increase its storage and computational capacity.
MapReduce 7 is a programming model for processing large data sets, and it is also the name of Google's implementation of that model.
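The heartbeat mechanism described above can be illustrated with a minimal Python sketch. It is a toy model, not Hadoop's implementation: the class name `HeartbeatMonitor`, the 30-second timeout and the node names are all assumptions made for illustration.

```python
class HeartbeatMonitor:
    """Toy model of a NameNode tracking DataNode liveness via heartbeats."""

    def __init__(self, timeout_seconds):
        # Hypothetical threshold: a node silent for longer is treated as inactive.
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # DataNode id -> time of its last heartbeat

    def heartbeat(self, node_id, now):
        """Record that node_id reported in at time `now` (in seconds)."""
        self.last_seen[node_id] = now

    def active_nodes(self, now):
        """Nodes whose last heartbeat falls within the timeout window."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= self.timeout_seconds)

    def inactive_nodes(self, now):
        """Nodes that have been silent for longer than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_seconds)


monitor = HeartbeatMonitor(timeout_seconds=30)
monitor.heartbeat("datanode-1", now=0)
monitor.heartbeat("datanode-2", now=0)
monitor.heartbeat("datanode-1", now=25)  # datanode-2 stays silent
```

Querying the monitor at a later time then separates the DataNodes that are still reporting from those that have gone silent.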
Figure 1.2 Execution Overview of MapReduce 7
A Hadoop MapReduce job consists of two user-defined functions: map and reduce 8. The input of a Hadoop MapReduce job is given as a set of key-value pairs (k, v), and the map function is called for each of these pairs. The map function produces intermediate key-value pairs (k’, v’). The Hadoop MapReduce framework then groups these intermediate key-value pairs by the intermediate key k’ and calls the reduce function for each group. The reduce function produces zero or more aggregated results. Hadoop MapReduce uses a distributed file system to read and write its data: the Hadoop Distributed File System (HDFS), which is the open-source counterpart of the Google File System 9.
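The map, group and reduce steps described above can be sketched in a few lines of Python. This is a single-process illustration of the programming model only, not the Hadoop API; the word-count task, the function names `map_fn`, `reduce_fn` and `run_job`, and the sample input are assumptions chosen for the example.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: aggregate all counts that share the same intermediate key."""
    yield key, sum(values)


def run_job(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)  # group intermediate values by key k'
    results = []
    for k2 in sorted(groups):
        results.extend(reduce_fn(k2, groups[k2]))
    return results


# Input pairs (k, v): here k is a line number and v is the line's text.
lines = [(0, "big data needs big clusters"), (1, "hadoop stores big data")]
word_counts = run_job(lines, map_fn, reduce_fn)
```

In a real cluster the map calls run in parallel on different nodes and the grouping step involves moving data between them; the sketch keeps only the logical flow of key-value pairs.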
HADOOP DISTRIBUTED FILE SYSTEM
The Hadoop Distributed File System (HDFS) is a distributed file system designed to store very large amounts of data and to provide high-throughput access to the data sets 10. In the HDFS design, files are stored redundantly across multiple nodes to ensure high availability for parallel applications. HDFS stores file system metadata and application data separately 11.
An HDFS cluster includes two types of nodes, NameNodes and DataNodes, which operate in a master-slave relationship. In the HDFS design, the NameNode acts as the master: it manages the file system namespace and maintains the file system tree as well as the metadata for all the files and directories in the tree. All this information is stored persistently on a local disk in two files, the namespace image and the edit log 12.
Figure 1.3 Execution Overview of HDFS 5
The NameNode keeps track of all the DataNodes where the blocks of a given file are located. Because this information is dynamic in nature, it is reconstructed every time the system starts up. Any client can access the file system on behalf of a user task by communicating with the NameNode and the DataNodes. The DataNodes store and retrieve blocks based on requests made by the clients or the NameNode, and they periodically report to the NameNode the list of blocks that they are responsible for 13.
Scheduling plays an important role in big data optimization because it helps reduce processing time. The aim of job scheduling 14 is to enable faster processing of jobs and to reduce the response time as much as possible by choosing scheduling techniques suited to the jobs, while making the best use of resources. Scheduling in big data platforms involves processing and completing multiple tasks by handling and changing data efficiently with a minimum number of migrations. The requirements of traditional scheduling models came from applications, databases and storage resources that have grown over the years. As a result, the cost and complexity of adapting traditional scheduling models to big data platforms have increased, prompting changes in the way data is stored, analysed and accessed. The traditional model is being expanded to incorporate new building blocks, which address the challenges of big data with new information processing frameworks built to meet its requirements.
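The bookkeeping described above, in which the NameNode records which DataNodes hold each block of a file, can be sketched as follows. The block size, replication factor, node names and the round-robin placement are assumptions made for illustration; the real HDFS placement policy is more elaborate.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(file_name, file_size, block_size, datanodes, replication):
    """Build a NameNode-style map: block id -> DataNodes holding a replica.

    Replicas are assigned round-robin, which is a simplification of this
    sketch and not the actual HDFS placement policy.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    block_map = {}
    for i, _ in enumerate(split_into_blocks(file_size, block_size)):
        block_map[f"{file_name}#blk{i}"] = [
            datanodes[(i + r) % len(datanodes)] for r in range(replication)
        ]
    return block_map


# Hypothetical file of 300 units stored in blocks of 128 units, 3 replicas each.
block_map = place_blocks("logs.txt", file_size=300, block_size=128,
                         datanodes=["dn1", "dn2", "dn3", "dn4"], replication=3)
```

Because every block is held by several DataNodes, the loss of a single node leaves at least one replica of each block reachable, which is the redundancy the HDFS design relies on.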
NEED OF STUDY
For many years the same traditional methods have been used for storing data, but processing that data is difficult. Hadoop is therefore used to store and process data efficiently. In its early years, Hadoop supported a single scheduler whose logic was intermixed with the JobTracker. This implementation was well suited to Hadoop's traditional batch jobs.
As the years have passed, data has come to be generated in huge amounts, so advancement is the need of the hour. Hadoop now supports more than one scheduling algorithm. It has become a multi-user data warehouse that supports a variety of different types of processing jobs, with a pluggable scheduler framework providing greater control. MapReduce acts as a programming model for processing large data sets. A number of scheduling algorithms compatible with MapReduce exist, each governed by its own parameters, which makes their performance difficult to evaluate. This study is proposed to find the best possible scheduling algorithm.
SCOPE OF STUDY
Big Data is a term used for the voluminous amount of data generated every second by different sources. The main aim of Big Data analytics is to process large amounts of data within a reasonable time period. Hadoop is a programming framework for the analysis of Big Data. Hadoop uses MapReduce for processing data sets and HDFS to provide the necessary services and basic structure to fulfil core requirements. Scheduling is defined as the process by which specified work is assigned to resources for completion.
The schedulers considered are of two kinds: static schedulers and dynamic schedulers. The static schedulers include the First In First Out (FIFO) scheduler, the Fair scheduler and the Capacity scheduler. The dynamic schedulers include the Deadline – Constraint scheduler.
OBJECTIVES OF THE STUDY
To extensively study the concepts of Big Data, Hadoop, MapReduce, HDFS and Scheduling Algorithms.
To evaluate the best Hadoop scheduling algorithm from among the extensively used algorithms using a tool.
To improve and validate the best evaluated Hadoop scheduling algorithm.
Chapter 1: This chapter gives the basic information of Big Data, Hadoop, MapReduce, HDFS and Scheduling algorithms. Also, Need of study, Scope of study and Objectives of study are discussed.
Chapter 2: This chapter provides the literature review on Big Data, Hadoop, MapReduce, HDFS and Scheduling Algorithms.
Chapter 3: This chapter gives an overview of scheduling, scheduling policies, scheduling models and algorithms, parameters, the classification of Hadoop schedulers and the further classification of different types of scheduling algorithms.
Chapter 4: This chapter introduces the static schedulers and describes the software used, the different types of static schedulers, and their advantages and drawbacks.
Chapter 5: This chapter introduces the dynamic schedulers and describes their types, advantages and drawbacks.
Chapter 6: This chapter provides the research methodology used to evaluate different Hadoop schedulers and the evaluation of the final results.
Chapter 7: This chapter introduces the improved Deadline – Constraint scheduler and describes the improved system, its algorithm, and its advantages and drawbacks.
Chapter 8: This chapter gives the analysis, comparing different algorithms on the basis of different parameters and providing the final results.
Chapter 9: This chapter gives the conclusion of the work done and the future scope.
This chapter provided an overview of Big Data, Hadoop, MapReduce, HDFS, and Scheduling Algorithms and their types. It also briefly discussed the need for the study, its scope, objectives, research methodology and chapter plan.
CHAPTER – 2
In this chapter, a literature review is presented for Big Data, Hadoop, MapReduce, HDFS and Scheduling algorithms, and the different types of scheduling algorithms are discussed in depth.
A literature review is a critical analysis of published sources, or literature, on a particular topic. It is an assessment of the literature and provides a summary, classification, comparison and evaluation of a topic of interest. A literature review is the part of a scholarly paper that presents the current knowledge on a particular topic, including substantive findings as well as theoretical and methodological contributions.
2.3 BIG DATA
Jonathan Stuart Ward et al. 53 proposed that big data is mainly associated with two ideas: data storage and data analysis. Big implies significance, complexity and challenge. The paper attempts to compare and analyse the various definitions which have gained some degree of traction and to provide a clear and concise definition of an otherwise ambiguous term.
Bernice Purcell et al. 54 proposed that Big Data comprises large data sets that cannot be handled by traditional systems. Big data includes structured, semi-structured and unstructured data. The data storage techniques used for big data include multiple clustered Network Attached Storage (NAS) and Object Based Storage (OBS). The Hadoop architecture is used to process unstructured and semi-structured data, using MapReduce to locate all relevant data and then select only the data that directly answers the query.
Vidyasagar S. D 55 surveyed Big Data and the Hadoop system and found that organizations need to process and handle petabytes of data in an efficient and inexpensive manner. Hadoop is an efficient, reliable, open-source framework under the Apache License that is used to deal with large data sets. Hadoop is designed to run on cheap commodity hardware; it automatically handles data replication and node failure, and it offers cost savings together with efficient and reliable data processing.
Mrs. Mereena Thomas 56 proposed that firms like Google, eBay, LinkedIn and Facebook were built around big data from the beginning. Big data is a collection of massive and complex data sets that involve huge quantities of data, social media analytics, data management capabilities, real-time data and so on. Big Data is data whose complexity requires new techniques, algorithms and analytics to manage it and to extract value and hidden knowledge from it. Hadoop is the core platform for structuring Big Data and solves the problem of making it useful for analytics.
Natalija Koseleva et al. 57 stated that data generation has increased drastically over the past few years. Data management has also grown in importance because extracting significant value from raw data is an important issue. Collecting large amounts of data and applying different kinds of big data analysis can help to improve the construction process from the energy efficiency perspective. The article reviews the understanding of Big Data, the methods used for Big Data analysis and the main problems with Big Data in the field of energy.
Jason Venner 4 presented that Hadoop is one of the technologies used to process Big Data. It is defined as an open-source platform that provides the analytical technologies and computational power required to work with such large sets of data. Hadoop is an Apache Software Foundation project written in Java. The Hadoop core project provides the basic services for building a cloud computing environment with commodity hardware and the APIs for developing software that will run on that cloud.
Chuck Lam 58 states that Big data can be difficult to handle using traditional databases. Apache Hadoop is a NoSQL applications framework that runs on distributed clusters. Hadoop is an efficient platform for writing and running distributed applications that use large data sets. Hadoop distributes data in advance. Data is replicated across a cluster of components and processed where it is stored.
C. Lee et al. 59 state that the current Hadoop implementation assumes that every node in a cluster has the same computing capacity and that the tasks are data local, which may increase overhead and reduce MapReduce performance. In a homogeneous cluster, the Hadoop strategy can make use of the resources of each node. For heterogeneous clusters, they proposed a data placement algorithm to resolve the unbalanced node workload problem.
S. Suresh et al. 19 proposed that MapReduce is a parallel programming model used to solve a wide range of Big Data applications. Hadoop is an open-source implementation which provides an abstracted environment for running large-scale data-intensive applications in a scalable and fault-tolerant manner. Several Hadoop scheduling algorithms have been proposed with various performance goals.
Rahul Pawar et al. 60 state that various applications hold huge amounts of data in different formats. All databases maintain log files that keep records of database changes in the system. This can include tracking various user events and activity. Apache Hadoop can be used for log processing at the scale of a whole system. Log files have become a standard part of large applications, large-scale industry, computer networks and distributed systems. Log files are often the only way to identify and locate an error in software or in a system, because log file analysis is not affected by time-based issues such as the probe effect.
Jeffrey Dean and Sanjay Ghemawat 7 proposed that MapReduce is a programming model for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs and a reduce function that merges all intermediate values associated with the same intermediate key. MapReduce provides fine-grain fault tolerance for large jobs; failure in the middle of a multi-hour execution does not require restarting the job from scratch.
Jens Dittrich 8 states that there are many techniques that can be used with Hadoop MapReduce jobs to boost performance by orders of magnitude. Hadoop MapReduce is the most popular open-source implementation of the MapReduce framework proposed by Google. A Hadoop MapReduce job mainly consists of two user-defined functions: map and reduce. The input of a Hadoop MapReduce job is a set of key-value pairs (k, v), and the map function is called for each of these pairs.
Abdelrahman Elsayed et al. 61 state that MapReduce was invented by Google to deal with huge volumes of data. The paper presents an overview of the MapReduce programming model along with its capabilities and limitations. Further, the performance of MapReduce was evaluated and enhanced by managing data skew. In addition, power enhancement methodologies for clusters that run MapReduce jobs were introduced and discussed.
Shafali Agarwal 62 proposed that MapReduce provides distributed parallel computing across multiple nodes and returns the result on a particular node. MapReduce plays a vital role in parallel data processing because of its salient features such as scalability, flexibility and fault tolerance. A lot of research has been done on extending MapReduce with new functionalities and mechanisms to optimize it for new sets of problems. The review found that extended versions of MapReduce for more data-intensive applications, such as HaLoop and Spark, work well for iterative computation.
DT Editorial Services, Black Book 6 states that Hadoop is an open-source platform that provides the analytical technologies and computational power required to work with such large volumes of data. A Hadoop cluster consists of a single MasterNode and multiple SlaveNodes. The master node consists of a NameNode and a JobTracker. Each slave node, or worker node, acts as both a DataNode and a TaskTracker. Hadoop MapReduce is the computational framework used in Hadoop to perform all the mathematical computations. It is based on a parallel and distributed implementation of the MapReduce algorithm that provides high performance.
P. Sudha et al. 63 proposed that Big data has to deal with large and complex datasets that can be structured, semi-structured or unstructured and will typically not fit into memory to be processed. MapReduce is a programming model for processing large datasets distributed on large clusters. The MapReduce framework is designed to run data-intensive applications in support of effective decision making. Since its introduction, considerable research effort has gone into making it more accessible to users and into using it to support the execution of enormous data-intensive applications. The survey paper highlights and investigates various applications using recent MapReduce models.
2.6 HADOOP DISTRIBUTED FILE SYSTEM
Harshawardhan S. Bhosale et al. 5 proposed that Big data can be structured, unstructured or semi-structured, which makes conventional data management methods inadequate. Hadoop is the core platform for structuring Big Data and solves the problem of making it useful for analytics purposes. Hadoop is an open-source software project which is designed to scale up from a single server to thousands of machines, with a high degree of fault tolerance, using the Hadoop Distributed File System (HDFS) and a processing pillar known as MapReduce.
Sumit Kumari 10 proposed that larger volumes and new assets of data are known as Big Data. Technologies such as MapReduce and Hadoop are used to extract value from Big Data. Hadoop is a well-adopted, standards-based, open-source software framework built on the foundation of Google’s MapReduce. HDFS is used to support these new architectures, including very large file systems running on commodity hardware.
Gayathri Ravichandran 12 proposes that the rate of data generation is so alarming that there is a need to implement easy and cost-effective data storage and retrieval mechanisms. Furthermore, big data needs to be analysed for insights and attribute relationships, which can lead to better decision-making and efficient business strategies. The paper discusses a formal definition of Big Data and its industrial applications, along with the Hadoop architecture, its components and its underlying functionalities.
Hiral M. Patel 64 proposed that “Big Data” is data of massive volume, variety and velocity. Hadoop is a framework that supports HDFS for storing and MapReduce for processing large data sets in a distributed computing environment. Schedulers provide fair allocation of resources among users. Proper assignment of tasks reduces job completion time and can improve the performance of the job. The paper discusses the MapReduce model and task scheduling algorithms such as FIFO, Fair share, Capacity and Delay.
2.7 SCHEDULING ALGORITHMS
Matei Zaharia et al. 65 state that MapReduce has proven a popular execution model for large batch jobs. The FAIR scheduler was proposed to provide isolation, guarantee a minimum share to each user (job) and achieve statistical multiplexing. During its initial deployment, two aspects of MapReduce were identified that considerably affect FAIR’s throughput: data locality and task interdependence. To address the issue, two techniques were proposed: delay scheduling and copy-compute splitting. Results have shown that FAIR achieves isolation, low response time and high throughput.
Anjana Sharma 66 presented that the Hadoop framework has been widely used to process large-scale datasets on computing clusters. In Hadoop, the MapReduce framework is a programming model which processes terabytes of data in very little time. This framework uses a task scheduling method to schedule tasks, and various methods are available for scheduling tasks in the MapReduce framework. Scheduling jobs in parallel across nodes is a major concern in a distributed file system. The objective is to study MapReduce and analyse different scheduling algorithms that can be used to achieve better scheduling performance.
Seyed Reza Pakize 52 proposed that Hadoop is a Java-based programming framework that supports the storing and processing of large data sets in a distributed computing environment. The main objective of the MapReduce programming model is to parallelize job execution across multiple nodes. To this end, scheduling algorithms have been proposed. There are three important scheduling issues in MapReduce: locality, synchronization and fairness. The paper discusses different scheduling algorithms along with their advantages and disadvantages.
J. V. Gautam et al. 14 proposed that the Apache Hadoop framework has emerged as the most widely adopted framework for distributed data processing because it is open source and allows the use of commodity hardware. Job scheduling has become an important factor in achieving high performance in a Hadoop cluster. Several scheduling algorithms have been developed for the Hadoop MapReduce model which vary in design and behavior, handling different issues such as locality of data, user share fairness and resource awareness. The paper highlights fundamental issues in job scheduling, classifies Hadoop schedulers and presents a survey of existing scheduling algorithms. It also discusses the features, advantages and limitations of the scheduling algorithms.
Praveen T et al. 41 proposed that a huge amount of data is being generated with the arrival of new updates in various fields such as social networks, e-commerce, finance and education. Retrieval of data becomes complex, requiring efficient algorithms to manage the tasks. The MapReduce framework involves task-level computation, which is used in the analysis of big data. This involves re-computation, which results in long processing times and lower performance, so the framework appears less efficient. Therefore, a bipartite graph is used to perform MapReduce for efficient and fast processing. The main focus is on reducing time complexity to make the system competent at processing large data sets using a bipartite graph.
M. Senthilkumar et al. 67 proposed that scheduling for Big Data applications has become an active research area in recent years. The Hadoop framework has become one of the most popular and most widely used frameworks for distributed data processing. The various scheduling algorithms of the MapReduce model in Hadoop vary in design and behavior and are used for handling many issues such as data locality, resource awareness, energy and time. The paper gives an outline of job scheduling and a classification of schedulers. To overcome the scheduling issues, several job scheduling algorithms are presented: the FIFO scheduler, Fair scheduler, Capacity scheduler and Deadline – Constraint scheduler. The advantages and disadvantages of the respective algorithms are discussed.
Adhishtha Tyagi et al. 68 proposed in their study that scheduling has been an active area of research in computing systems since their inception. The main objective is to study the MapReduce framework, the MapReduce model, scheduling in Hadoop, various scheduling algorithms and various optimization techniques in job scheduling. Scheduling algorithms of the MapReduce model in Hadoop vary in design and behaviour and are used for handling many issues such as data locality, resource awareness, energy and time.
Mohd Usama et al. 38 presented a comprehensive survey on big data and job scheduling algorithms in the Hadoop environment. Job scheduling is a key factor in achieving high performance in big data processing. Issues in big data include data volume, data variety, data velocity, security and privacy, cost, connectivity and data sharing. For handling these issues, various job schedulers have been designed. The paper presents a comparative study of various job schedulers for big data processing in the Hadoop environment, such as the FIFO, Fair, Capacity and Deadline – Constraint scheduling algorithms. Each scheduler considers resources such as CPU, memory, user constraints and IO. The features and drawbacks of each scheduler are also discussed. From the comparative study and experimental results, it was concluded that the Fair and Capacity schedulers are designed for short jobs and equal utilization of resources, while Deadline – Constraint schedulers can be used in both homogeneous and heterogeneous environments.
Chien – Hung Chen et al. 40 studied the MapReduce framework for processing data-intensive applications in parallel in cloud computing systems. The paper proposes a new MapReduce scheduler based on bipartite graph modelling, known as the BGMRS. The BGMRS can obtain the optimal solution of the deadline-constrained scheduling problem by transforming the problem into a well-known graph problem: minimum weighted bipartite matching. The BGMRS minimizes the number of jobs that violate their deadline. The node group technique effectively decreases the computational time of BGMRS in a large-scale cloud computing system.
Vanika et al. 69 proposed an algorithm to address the problems of the existing Deadline – Constraint Scheduler, namely varying node performance and dynamic task execution time. In the paper, the bipartite graph modelling technique is used to propose a new MapReduce scheduler called the Bipartite Graph MapReduce Scheduler (BGMRS). The BGMRS transforms the problem into a graph problem, i.e., minimum weighted bipartite matching. It is found that the BGMRS takes into account the heterogeneous cloud computing environment along with data locality in order to meet deadline requirements and shorten the data access time of a job.
N. Deshai et al. 70 studied different schedulers in open-source Apache Hadoop under cloud conditions and described their highlights and issues. Schedulers require both processing and storage of data, for which MapReduce and HDFS are used respectively. The types of schedulers examined are FIFO, Fair, Capacity, Deadline – Constraint, Delay and others. Their features are also discussed and a comparison table is drawn. The paper summarizes the characteristics of each scheduler, how different types of system can be scheduled by suitable schedulers, and the research gap between homogeneous and heterogeneous systems.
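The graph problem named above, minimum weighted bipartite matching, can be illustrated with a small Python sketch. This is not the BGMRS algorithm: it is an exhaustive search over a square cost matrix, used only to show what the matching problem asks for. The cost values and the reading of rows as tasks and columns as slots are assumptions made for the example.

```python
from itertools import permutations


def min_weight_matching(cost):
    """Exhaustively find the cheapest one-to-one assignment of rows to columns.

    cost[i][j] is the (hypothetical) cost of running task i on slot j, and
    the matrix is assumed to be square. Brute force is only practical for
    tiny inputs; it is used here for clarity, not efficiency.
    """
    n = len(cost)
    best_total, best_assignment = None, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best_total is None or total < best_total:
            best_total, best_assignment = total, perm
    return best_total, list(best_assignment)


cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
total, assignment = min_weight_matching(cost)
# assignment[i] is the slot chosen for task i.
```

For realistic cluster sizes a polynomial-time method such as the Hungarian algorithm would be needed, since the number of permutations grows factorially with the number of tasks.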
In this chapter, a literature survey of Big Data, Hadoop, MapReduce, HDFS and Scheduling Algorithms has been provided. Various researchers and authors have contributed substantially to this area.
CHAPTER – 3
In this chapter, scheduling of jobs, scheduling policies and scheduling models and algorithms are described extensively. The different types of scheduling algorithms are briefly discussed.
Scheduling has been an active area of research in computing systems since their inception. Several job scheduling algorithms have been developed for the Hadoop MapReduce model, which vary in design and behaviour and handle different issues such as data locality, fairness and resource awareness. MapReduce uses two terms: job and task. A MapReduce job consists of many tasks, each of which carries out either a map or a reduce process. The scheduler runs on the JobTracker node and decides where in the cluster the tasks of a specific job will be processed. MapReduce uses runtime scheduling: the scheduler assigns data blocks to the available nodes for processing. This increases the runtime cost and slows down the execution of MapReduce jobs.
3.2.1 Scheduling Policies
Scheduling policies are used to determine the relative ordering of requests. Distributed systems in different domains can have different resource utilization policies. A policy can take into consideration priorities, deadlines, budgets and dynamic behaviour 15.
For big data platforms, dynamic scheduling with soft deadlines and hard budget constraints on hybrid clouds is an open issue. A general-purpose resource management approach for big data processing in a cluster has to make some assumptions about the policies that are combined in service-level agreements. For example, such an approach is expected to support interactive tasks, distributed and parallel applications, and non-interactive batch tasks, all with high performance. To some extent, however, this is difficult to achieve, because tasks have different attributes and the requirements they place on the scheduler may contradict one another in a shared environment. For example, a real-time task requiring a short response time prefers space-sharing scheduling, whereas a non-interactive batch task requiring high throughput may prefer time-sharing scheduling 16, 17. The scheduling method focuses on parallel tasks, while providing acceptable performance to other kinds of tasks.
3.2.2 Scheduling Models and Algorithms
A scheduling model consists of a scheduling policy, an algorithm, a programming model and a performance analysis model. The design of a scheduler that follows a model should specify the design, the communication model between entities involved in scheduling, the process type (static or dynamic), the objective function and the state estimation 16.
It is important that every application completes within its specified time and receives the resources it needs; an application with a deadline should be given priority over applications that can be finished later.
Several approaches to the scheduling problem have been considered over time. These approaches have different scenarios, which take into account the applications’ types, the execution platform, the types of algorithms used and the various constraints that might be imposed.
Managing a Hadoop cluster that runs multiple MapReduce tasks on multiple nodes requires effective and efficient scheduling policies to achieve good performance and resource utilization 17. Performance is affected by issues such as fairness, data locality and synchronization.
The different types of parameters related to Hadoop schedulers are:
Data Locality
The distance between the data node that holds the input and the task node is called locality. The cost of data transfer depends on the distance between the input data and the computation node; if the input data node is very near, the data transfer cost is low.
Job Allocation
It is defined as the process by which the system allocates resources to many different tasks. Job scheduling is a key factor for achieving high performance in big data processing. The challenges related to job scheduling include data volume (storage), data variety (format), data velocity (speed), connectivity and data sharing, cost, and security and privacy.
Fairness
The resources are shared among the users, and fair measures are required so that all jobs are scheduled without starvation. A MapReduce job with a heavy workload can use the resources of the entire cluster, so the workload must be fairly distributed among the jobs in the cluster. Fairness deals with locality and MapReduce phases.
Resource Utilization
It refers to making the most of the resources available to the Hadoop cluster in order to achieve the desired computational results for an allocated job.
Environment
The environment includes two types of cluster: homogeneous and heterogeneous. In a homogeneous cluster, all nodes have the same processing power and capability, so all nodes can finish the computation at roughly the same time. A heterogeneous cluster contains nodes of differing capability, so a high-performance node can finish processing its local data faster than a low-performance node. Hadoop performs poorly in heterogeneous clusters, where the nodes have different computing capacity.
File Size
It is defined as a measure of how much data a computer file contains or how much storage it consumes. File sizes can be measured in bytes (B), kilobytes (KB), megabytes (MB), gigabytes (GB), terabytes (TB) and so on.
Number of Records
It is defined as the sequential number assigned to each physical record in a file. Record numbers change when the file is sorted or when records are added or deleted.
Execution Time
It is defined as the time spent by the system executing the task, including the time spent executing run-time or system services on its behalf. The mechanism used to measure execution time is implementation-defined, as is the way in which execution time consumed by interrupt handlers and run-time services on behalf of the system is charged.
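The locality parameter can be illustrated with a minimal sketch. It assumes that every node is described by a hypothetical topology path of the form /rack/host and that distance is counted as the number of hops through the closest common ancestor; the path format and the function name are assumptions made for illustration and are not taken from the study.

```python
def topology_distance(a: str, b: str) -> int:
    """Hops from node a up to the closest common ancestor and down to node b."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


# Hypothetical cluster: the input block lives on /rack1/host1.
data_node = "/rack1/host1"
candidates = ["/rack2/host7", "/rack1/host3", "/rack1/host1"]

# Prefer the candidate task node that is closest to the data.
ranked = sorted(candidates, key=lambda n: topology_distance(data_node, n))
print(ranked)
```

Under these assumptions the same node has distance 0, a node in the same rack has distance 2 and a node in another rack has distance 4, so a locality-aware scheduler would try the candidates in that order.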
3.4 CLASSIFICATION OF HADOOP SCHEDULERS
The Hadoop job schedulers 18, 19 can be classified in terms of the following aspects: environment, priority, resource awareness (such as CPU time, free slots, disk space and I/O utilization), time and strategies. The main idea behind scheduling is to minimize overhead, resource usage and completion time and to maximize throughput by allocating jobs to processors 20.
A number of Hadoop schedulers exist, which fall into two categories: Static Schedulers and Dynamic Schedulers. These categories are further subdivided into a number of scheduling algorithms, as shown in Figure 3.1.
Figure 3.1 Hadoop Schedulers
3.5 STATIC SCHEDULERS
In static scheduling, the allocation of jobs to processors is done before the program execution begins. The information regarding job execution time and processing of resources is known at compile time. The aim of static scheduling is to minimize the overall execution time of current programs. Different types of static scheduling algorithms are:
3.5.1 FIFO (FIRST IN FIRST OUT) SCHEDULER
FIFO is the default Hadoop scheduler. The job submitted first is given preference over jobs submitted later. The JobTracker always pulls the oldest job first from the job queue and processes it without considering the priority or size of the job. This scheduler is mostly used when the execution order of jobs is not important.
3.5.2 FAIR SCHEDULER
The Fair scheduler was developed by Facebook. The main idea is to assign resources such that each job gets an equal share of the available resources. The Fair scheduler groups jobs into named pools based on different attributes.
3.5.3 CAPACITY SCHEDULER
The Capacity scheduler was developed by Yahoo. It is designed for multiple organizations sharing a large cluster. In this scheduler, several queues are created instead of pools, each with a defined number of map and reduce slots.
3.6 DYNAMIC SCHEDULERS
In dynamic scheduling, the allocation of jobs to processors is done during execution time. Little is known in advance about the resource needs of a job, and it is also unknown in what type of environment the job will execute during its lifetime. The decision is made when a job begins its execution in the dynamic environment of the system.
3.6.1 DEADLINE – CONSTRAINT SCHEDULER
The Deadline – Constraint scheduler schedules jobs based on the deadline constraints specified by users. This type of algorithm ensures that only jobs whose deadlines can be met are scheduled for execution. It mainly focuses on increasing system utilization.
In this chapter, scheduling algorithms and their various categories were discussed. There are two main categories: static schedulers, which include the FIFO, FAIR and Capacity schedulers, and dynamic schedulers, which include the Deadline – Constraint scheduler.
CHAPTER – 4
In this chapter, Static Schedulers are described extensively. The different types of static scheduling algorithms (FIFO, FAIR and Capacity) are discussed in detail. The software description is also given, and the background of the algorithms along with their scope, advantages and disadvantages is described.
In static scheduling, the allocation of jobs to processors is completed before the program execution begins. The information relating to job execution time and processing resources is known at compile time. The aim of static scheduling is to reduce the overall execution time of current programs. Different types of static scheduling algorithms are:
First In First Out (FIFO) Scheduler
Fair Scheduler
Capacity Scheduler
4.3 SOFTWARE DESCRIPTION
4.3.1 Hadoop 2.4.1
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation 21. It also provides a software framework for distributed storage and processing of big data using the MapReduce programming model. The Apache Hadoop project develops open-source software for reliable, scalable and distributed computing.
The Apache Hadoop software library is a framework that allows distributed processing of large data sets across clusters of computers using simple programming models 22. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. The library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop Yet Another Resource Negotiator (YARN): A framework used for job scheduling and cluster resource management.
Hadoop MapReduce: A system for parallel processing of large data sets.
The different types of services provided by Hadoop are:
Java Virtual Machine Process Status Tool (JPS): It is a JDK command that lists the Java processes running on the machine. It is used to check which Hadoop services, such as NameNode, DataNode, ResourceManager and NodeManager, are running.
NameNode: It is the centrepiece of HDFS. NameNode is also known as the Master. NameNode only stores the metadata of HDFS, such as namespace information and block information, and tracks the files across the cluster. NameNode does not store the actual data or the dataset. The data itself is stored in the DataNodes.
Secondary NameNode: It is a dedicated node in the HDFS cluster whose main function is to take checkpoints of the file system metadata present on the NameNode. It is not a backup NameNode; it is only a helper to the primary NameNode and cannot replace it.
DataNode: It stores data in the Hadoop File System. A functional filesystem has more than one DataNode, with data replicated across them. During start-up it connects to the NameNode, retrying until that service comes up, and then responds to requests from the NameNode for file system operations.
ResourceManager: It is known as the master. It knows where the slaves are located and how many resources they have. It runs several services; the most important is the Resource Scheduler, which decides how to assign the resources.
NodeManager: It is known as the slave of the Hadoop infrastructure. When it starts, it announces its existence to the ResourceManager. It continuously sends a heartbeat signal to the ResourceManager. Each NodeManager offers some resources to the cluster.
Figure 4.1 Hadoop 2.4.1 Overview
4.3.1.1 Hadoop 2.4.1 Start and Stop DFS Services
The $HADOOP_INSTALL/hadoop/bin directory contains some scripts used to launch Hadoop DFS and Hadoop MapReduce daemons. These are:
start-dfs.sh – It starts the Hadoop DFS daemons, the namenode and datanodes. Use this command before start-mapred.sh.
stop-dfs.sh – It stops the Hadoop DFS daemons.
start-mapred.sh – It starts the Hadoop MapReduce daemons, the jobtracker and tasktrackers.
stop-mapred.sh – It stops the Hadoop MapReduce daemons.
start-all.sh – It starts all Hadoop daemons that are namenode, datanodes, the jobtracker and tasktrackers. Firstly, use start-dfs.sh then start-mapred.sh.
stop-all.sh – It stops all Hadoop daemons. Firstly, use stop-mapred.sh then stop-dfs.sh.
Figure 4.2 Hadoop 2.4.1 Start DFS
Start ResourceManager daemon and NodeManager daemon:
$ sbin/start-yarn.sh – It starts ResourceManager daemon and NodeManager daemon.
4.3.1.2 How to Run MapReduce Job
The following instructions are used to run a MapReduce job locally:
$ bin/hdfs namenode -format – It formats the file system.
$ sbin/start-dfs.sh – It starts NameNode daemon and DataNode daemon.
The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
NameNode – http://localhost:50070/ – It browses the web interface for the NameNode.
Figure 4.3 Hadoop 2.4.1 Directory Overview
$ bin/hdfs dfs -mkdir /user and $ bin/hdfs dfs -mkdir /user/<username> – It makes the HDFS directories required to execute MapReduce jobs.
$ bin/hdfs dfs -put etc/hadoop input – It copies the input files into the distributed filesystem.
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.1.jar grep input output 'dfs[a-z.]+' – It runs one of the bundled example jobs on the input.
To examine the output files: Copy the output files from the distributed filesystem to the local filesystem.
$ bin/hdfs dfs -get output output
$ cat output/*
4.3.1.3 YARN on a Single Node
To run a MapReduce job on YARN in pseudo-distributed mode, set a few parameters and run the ResourceManager and NodeManager daemons.
Configure parameters –
Run ResourceManager daemon and NodeManager daemon:
$ sbin/start-yarn.sh – It starts ResourceManager daemon and NodeManager daemon.
http://localhost:8088/ – By default, the web interface for the ResourceManager is available at this address.
Run a MapReduce job.
$ sbin/stop-yarn.sh – It stops the ResourceManager daemon and NodeManager daemon.
4.4 FIRST IN FIRST OUT (FIFO) SCHEDULER
FIFO is the default Hadoop scheduler and operates using a FIFO queue. The job submitted first is given preference over the jobs submitted later. This scheduler is mostly used when the execution order of jobs is not important.
4.4.1 Background of FIFO Scheduler
Early versions of Hadoop had a very simple approach to scheduling the jobs submitted by users: they ran in order of submission, using a FIFO scheduler. Each job used the whole cluster, so jobs had to wait for their turn. Although a shared cluster offers great capacity for providing large resources to many users, the problem of sharing resources fairly between users needs a more robust scheduler. Production jobs need to complete in a timely manner, while users making smaller ad hoc queries should still get results back in a reasonable time.
4.4.2 Execution Process
In the First In First Out approach, a job is first partitioned into individual tasks, which are then loaded into the queue and assigned to free slots on the TaskTrackers (slaves). Each job can use the complete cluster, and as a result other jobs have to wait for their turn. FIFO can be used in homogeneous systems. It does not consider the priority of a task; there is support for assigning priorities to jobs, but this facility is not active by default. Because each single job can use the entire cluster, smaller user queries may have to wait an unfairly long time to get a result.
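The waiting behaviour described above can be sketched in a few lines of Python. This is a simplified illustration only: it assumes that all jobs are submitted at time 0, that each job occupies the whole cluster for its full runtime, and the job names and runtimes are hypothetical.

```python
from collections import deque


def fifo_schedule(jobs):
    """jobs: list of (name, runtime) in submission order, all submitted at t = 0.

    Returns {name: (wait_time, finish_time)} when every job uses the whole cluster.
    """
    queue = deque(jobs)
    clock = 0
    result = {}
    while queue:
        name, runtime = queue.popleft()   # oldest job first, priority and size ignored
        result[name] = (clock, clock + runtime)
        clock += runtime
    return result


timeline = fifo_schedule([("long_etl", 100), ("short_query", 5)])
print(timeline)
```

In this example the short query needs only 5 time units of work but waits 100 units behind the long job, which is the unfair waiting time mentioned above.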
To use the FIFO Scheduler 23, firstly configure the ResourceManager in the conf/yarn-site.xml:
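As a hedged illustration of such a configuration entry, the following Python sketch generates a property element for yarn-site.xml. The property name yarn.resourcemanager.scheduler.class and the scheduler class names are taken from the Apache Hadoop YARN documentation rather than from this study, and the helper function scheduler_property is hypothetical.

```python
import xml.etree.ElementTree as ET

# Scheduler classes as documented for Apache Hadoop YARN (not listed in this text).
PACKAGE = "org.apache.hadoop.yarn.server.resourcemanager.scheduler"
SCHEDULER_CLASSES = {
    "fifo": PACKAGE + ".fifo.FifoScheduler",
    "fair": PACKAGE + ".fair.FairScheduler",
    "capacity": PACKAGE + ".capacity.CapacityScheduler",
}


def scheduler_property(kind: str) -> str:
    """Return the <property> element that selects the given scheduler."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "yarn.resourcemanager.scheduler.class"
    ET.SubElement(prop, "value").text = SCHEDULER_CLASSES[kind]
    return ET.tostring(prop, encoding="unicode")


print(scheduler_property("fifo"))
```

The generated element would be placed inside the configuration element of yarn-site.xml before the ResourceManager is restarted.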
Once the installation and configuration are completed, start the YARN cluster and use the web UI to review the status of the selected scheduler 28.
Start the YARN cluster.
Open the ResourceManager web UI.
The FIFO scheduler web-page shows the resource usages of individual queues.
Figure 4.4 FIFO Scheduler Application Overview
Figure 4.5 FIFO Scheduler Job Execution time
4.4.3 Advantages of FIFO Scheduler
The FIFO scheduling technique is the simplest and most efficient among all the schedulers 24.
The jobs are executed in the same order in which they are submitted.
It is suited only to a single type of job.
It can be used only in homogeneous systems.
4.4.4 Disadvantages of FIFO Scheduler
A major drawback of FIFO scheduler is that it is not preemptive. Therefore, it is not suitable for interactive jobs.
A long-running process will delay all the jobs behind it.
It has poor response time for short jobs in comparison to large jobs.
It does not consider the balance of resource allocation between long jobs and short jobs.
It has low performance when multiple types of jobs are run.
4.5 FAIR SCHEDULER
The Fair scheduler was developed by Facebook. The main idea is to assign resources such that each job gets an equal share of the available resources over time.
4.5.1 Background of FAIR Scheduler
The Fair Scheduler aims to give every user a fair share of the cluster. Fair scheduling is a method of assigning resources to jobs such that all jobs get an equal share of resources over time. The Fair Scheduler can be configured to base its scheduling decisions on both memory and CPU. When there is a single job running, it uses the entire cluster. When other jobs are submitted, resources that free up are assigned to the new jobs, so that each job eventually gets the same amount of resources.
Unlike the default Hadoop scheduler (FIFO), which forms a queue of jobs, the Fair Scheduler lets short jobs finish in a reasonable time while not starving long-lived jobs. It is also a reasonable way to share a cluster between a number of users. Finally, fair sharing can also work with job priorities: the priorities are used as weights to determine the fraction of total resources that each job should get.
4.5.2 Execution Process
Fair scheduler groups jobs into named pools based on different attributes. If there is a single job running, the job uses the entire cluster. When other jobs are submitted, free task slots are assigned to the new jobs, so that each job gets the same amount of CPU time. It lets short jobs complete within a reasonable time while not starving long jobs 25. The objective of Fair scheduling algorithm is to provide an equal distribution of resources among the users/jobs in the system 22, 26. In reality, the scheduler organizes jobs by resource pool and shares resources fairly between these pools. By default, there is a separate pool for each user.
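The pool-based sharing described above can be sketched as follows. This is a minimal illustration that assumes equal weights, one pool per user and slots that can be divided fractionally; the function name and the example jobs are hypothetical, and the real scheduler additionally handles weights, minimum shares and demand.

```python
from collections import defaultdict


def fair_shares(total_slots, jobs):
    """jobs: list of (job_name, pool_name).

    Slots are split equally between pools, then equally between the jobs of a pool.
    """
    if not jobs:
        return {}
    pools = defaultdict(list)
    for job, pool in jobs:
        pools[pool].append(job)
    per_pool = total_slots / len(pools)
    return {job: per_pool / len(members)
            for members in pools.values() for job in members}


shares = fair_shares(120, [("j1", "alice"), ("j2", "alice"), ("j3", "bob")])
print(shares)
```

Here the two pools receive 60 slots each, so the single job of the second user gets twice as many slots as each of the two jobs of the first user; fairness is enforced between pools rather than between individual jobs.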
To use the Fair Scheduler 27, firstly assign the ResourceManager scheduler class in yarn-site.xml:
4.5.2.1 Placing Applications in Queue
The Fair Scheduler allows administrators to configure policies that automatically place submitted applications into appropriate queues. The placement can depend on the user and groups of the submitter and on the requested queue passed by the application. A set of rules is applied sequentially to classify an incoming application. Each rule either places the application into a queue, rejects it, or continues on to the next rule.
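The sequential rule evaluation can be sketched as a chain of small functions. The rule functions, the queue names and the dictionary used to describe an application are hypothetical and only illustrate the control flow; they are not the configuration syntax of the Fair Scheduler.

```python
REJECT = object()   # sentinel returned by a rule that refuses the application


def specified(app):
    """Use the queue requested by the application, if any."""
    return app.get("queue")


def by_user(app):
    """Fall back to a queue named after the submitting user."""
    return "root." + app["user"] if app.get("user") else None


def default(app):
    """Catch-all rule."""
    return "root.default"


def place(app, rules):
    """Apply the rules in order; each one places, rejects or passes on."""
    for rule in rules:
        decision = rule(app)
        if decision is REJECT:
            raise ValueError("application rejected")
        if decision is not None:
            return decision
    raise ValueError("no rule matched")


rules = [specified, by_user, default]
print(place({"user": "alice"}, rules))
```

A rule returns a queue name to place the application, the REJECT sentinel to refuse it, or None to hand the decision to the next rule.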
Figure 4.6 Fair Scheduler Application Overview
Figure 4.7 Fair Scheduler Job Execution Time
4.5.3 Advantages of FAIR Scheduler
It supports scheduling by job category, so that different types of tasks receive different resource allocations.
This scheduler makes a fair and dynamic resource reallocation.
It provides faster response time to small jobs.
It has the ability to fix the number of concurrent running jobs from each user and pool.
4.5.4 Disadvantages of FAIR Scheduler
It ignores the load-balance state of the nodes, which can result in imbalance.
This scheduler does not consider the weight of each job, which leads to unbalanced performance in each pool/node.
Pools have a limitation on the number of running jobs.
4.6 CAPACITY SCHEDULER
The Capacity scheduler was developed by Yahoo. It is designed for multiple organizations sharing a large cluster. In this scheduler, several queues are created instead of pools, each with a defined number of map and reduce slots.
4.6.1 Background of CAPACITY Scheduler
The Capacity Scheduler takes a slightly different approach to multiuser scheduling. The design of the capacity scheduling algorithm is very similar to fair scheduling, but this scheduler makes use of queues instead of pools. Each queue is assigned to an organization and resources are divided among these queues. A cluster is made up of a number of queues, which may be hierarchical, and each queue has an allocated capacity associated with it. The Capacity Scheduler allows users or organizations to simulate a separate MapReduce cluster with FIFO scheduling for each user or organization.
4.6.2 Execution Process
The Capacity scheduler puts jobs into multiple queues in accordance with the conditions and allocates a certain system capacity to each queue. If a queue is heavily loaded, it seeks unallocated resources and distributes the surplus resources evenly among its jobs 28, 22. To maximize resource utilization, it allows the resources of free queues to be re-allocated to queues that are using their full capacity. When jobs arrive in the lending queue, the borrowed resources are given back to it once the running tasks complete. It also allows priority-based scheduling of jobs within an organization queue 29.
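The lending of idle capacity can be sketched as follows. The sketch assumes integer slots, capacities given in percent and a simple policy that lends spare slots in queue order; the queue names, the numbers and the function allocate are hypothetical, and the even distribution among jobs mentioned above is not modelled.

```python
def allocate(total_slots, capacity_pct, demand):
    """Give each queue min(guarantee, demand), then lend idle slots to busy queues.

    capacity_pct: {queue: guaranteed share in percent}
    demand:       {queue: number of slots the queue currently wants}
    """
    guaranteed = {q: total_slots * pct // 100 for q, pct in capacity_pct.items()}
    alloc = {q: min(guaranteed[q], demand[q]) for q in capacity_pct}
    spare = total_slots - sum(alloc.values())
    for q in capacity_pct:                      # lend spare capacity in queue order
        extra = min(spare, demand[q] - alloc[q])
        alloc[q] += extra
        spare -= extra
    return alloc


capacity = {"prod": 60, "dev": 40}
# The dev queue is nearly idle, so prod borrows its unused slots.
busy = allocate(100, capacity, {"prod": 90, "dev": 10})
# Jobs arrive in dev: after running tasks finish, dev gets its guarantee back.
back = allocate(100, capacity, {"prod": 90, "dev": 40})
print(busy, back)
```

The second call shows the state once the borrowed slots have been returned: each queue is back at its guaranteed capacity even though the first queue still has unmet demand.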
To use the Capacity Scheduler 30, firstly configure the ResourceManager in the conf/yarn-site.xml:
4.6.2.1 Setting up Queues
The Capacity Scheduler has a pre-defined queue known as root. All queues in the system are children of the root queue. Further queues can be set up by configuring yarn.scheduler.capacity.root.queues with a comma-separated list of child queues. The Capacity Scheduler configuration uses the queue path to define the hierarchy of queues. The queue path is the full path of the queue’s hierarchy, starting at root, with a dot (.) as the delimiter. A given queue’s children can be defined with the configuration: yarn.scheduler.capacity.<queue-path>.queues.
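The queue-path convention can be illustrated with a short sketch that derives the property names from a nested description of the hierarchy. The queue names prod, dev, eng and science and the helper queue_properties are hypothetical; only the property pattern yarn.scheduler.capacity.<queue-path>.queues comes from the text above.

```python
def queue_properties(tree, path="root"):
    """tree: nested dict mapping child queue names to their own subtrees.

    Returns a list of (property_name, comma_separated_children) pairs.
    """
    props = []
    if tree:
        props.append((f"yarn.scheduler.capacity.{path}.queues", ",".join(tree)))
        for child, subtree in tree.items():
            props.extend(queue_properties(subtree, f"{path}.{child}"))
    return props


hierarchy = {"prod": {}, "dev": {"eng": {}, "science": {}}}
for name, value in queue_properties(hierarchy):
    print(name, "=", value)
```

For this hierarchy the sketch produces one entry for root listing prod and dev, and one entry for root.dev listing eng and science; leaf queues need no queues property.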
Figure 4.8 Capacity Scheduler Application Overview
Figure 4.9 Capacity Scheduler Job Execution Time
4.6.3 Advantages of Capacity Scheduler
It maximizes resource utilization and throughput in cluster environment.
This scheduler guarantees that capacity left unused by the jobs within a queue is reused.
It also supports the features of hierarchical queues, elasticity and operability.
It reallocates resources left idle by jobs in a queue to other queues.
4.6.4 Disadvantages of Capacity Scheduler
The Capacity scheduler is the most complex among the three schedulers described above.
It is difficult to choose proper queues.
With regard to pending jobs, it has limitations in ensuring the stability and fairness of the cluster across queues and individual users.
4.7 EVALUATION OF ALGORITHM
The First In First Out (FIFO) scheduler has low data locality and low resource utilization. It works only in a homogeneous cluster environment and performs non-preemptive scheduling. It has high performance for small clusters. It provides no fairness.
The FAIR scheduler has high resource utilization but low data locality. It also works in a homogeneous cluster environment and provides good fairness, with high performance for both large and small clusters. It performs preemptive scheduling.
The Capacity scheduler also has high resource utilization but low data locality. It performs non-preemptive scheduling when a job fails. It performs parallel execution and provides high performance for large clusters.
In this chapter, static schedulers and their different types (FIFO, FAIR and Capacity) were discussed in detail. The software description was also provided. The backgrounds, examples, algorithms, advantages and disadvantages of the different schedulers were also explained.
CHAPTER – 5
In this chapter, Dynamic Schedulers are explained extensively. The Deadline – Constraint scheduler and its background, along with its advantages and disadvantages, are explained.
In dynamic scheduling, the allocation of jobs to processors is done during execution time. Only basic knowledge about the resource needs of a job is required. It is unknown in what type of environment the job will execute during its lifetime. The decision is made when a job begins its execution in the dynamic environment of the system. The type of dynamic scheduling algorithm considered is:
Deadline – Constraint Scheduler
5.3 DEADLINE – CONSTRAINT SCHEDULER
Deadline – Constraint scheduler schedules jobs based on the deadline constraints mentioned by users 31. This type of algorithm ensures that the jobs whose deadlines can be met are scheduled for execution. It mainly focuses on increasing system utilization.
5.3.1 Background of DEADLINE – CONSTRAINT Scheduler
In this scheduling strategy 32, 33, the user specifies deadline constraints at the time of scheduling the jobs, to ensure that the jobs scheduled for execution meet their deadlines 34, 35.
It deals with the deadline requirement by the cost model of job execution, which considers parameters such as input size of data, data distribution, map and reduce runtime etc. Whenever any job is scheduled, it is checked by the scheduler whether it will be completed within the time specified by the deadline or not.
The Deadline – Constraint scheduler 32 works on deadline constraints specified by the user; it tries to meet the job deadlines while increasing the utilization of the system. The scheduler computes the minimum number of map and reduce slots required for job completion (the cost). If the schedulability test fails, the user is notified to enter another deadline value; if it passes, the job is scheduled. To deal with the deadline requirement, data processing relies on:
1) A job execution cost model – it considers various parameters like map and reduce tasks runtimes, input data sizes, data distribution, etc.
2) A Constraint-Based Hadoop Scheduler – it takes user deadlines as part of its input.
When a job is submitted 36, a schedulability test is performed to determine whether the job can be finished within the specified deadline. A job is schedulable if the minimum number of tasks for both map and reduce is less than or equal to the number of available slots. When jobs have different deadlines, the scheduler assigns different numbers of tasks to the TaskTrackers and makes sure that the specified deadlines are met.
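The schedulability test can be sketched under a deliberately simple cost model. The sketch assumes that all tasks of a phase take the same known time, that the job deadline has already been split into a map budget and a reduce budget, and that tasks run in whole rounds; these assumptions, the field names and the numbers are hypothetical and stand in for the job execution cost model described above.

```python
import math


def min_slots(num_tasks, task_time, time_budget):
    """Fewest slots that finish num_tasks tasks of task_time each within time_budget."""
    rounds = time_budget // task_time     # full task rounds that fit in the budget
    if rounds == 0:
        return None                       # not even one task fits
    return math.ceil(num_tasks / rounds)


def schedulable(job, free_map, free_reduce):
    """A job passes if its minimum map and reduce slot needs fit the free slots."""
    m = min_slots(job["maps"], job["map_time"], job["map_budget"])
    r = min_slots(job["reduces"], job["reduce_time"], job["reduce_budget"])
    return m is not None and r is not None and m <= free_map and r <= free_reduce


job = {"maps": 20, "map_time": 10, "map_budget": 50,
       "reduces": 4, "reduce_time": 15, "reduce_budget": 30}
print(schedulable(job, free_map=4, free_reduce=2))
print(schedulable(job, free_map=3, free_reduce=2))
```

With these numbers the job needs at least 4 map slots and 2 reduce slots, so it is accepted in the first call and rejected in the second, where the user would be asked for a longer deadline.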
5.3.2.1 Steps of Constraint Based Scheduler
1: a TaskTracker reports N free slots
2: repeat
3: select the next job J in the priority queue
4: if no. of J’s map tasks running < J’s minimum no. of map tasks then
5: launch J’s map task on a free slot
6: reduce number of available free map slots
7: end if
8: if (J’s all map tasks are completed) AND (no. of J’s reduce tasks running < J’s minimum no. of reduce tasks) then
9: launch J’s reduce task on a free slot
10: reduce number of available free reduce slots
11: end if
12: until (no map/reduce slots are available) OR (no jobs remain in the priority queue)
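The steps above can be sketched in Python as follows. The sketch assumes that the jobs are already held in priority order, that each job is described by a small dictionary with hypothetical field names, and that the loop stops when a full pass over the queue launches nothing; it is an illustration of the control flow, not the scheduler implementation used in the study.

```python
def assign_slots(free_map, free_reduce, jobs):
    """Hand out the free slots reported by a TaskTracker.

    jobs: list of dicts in priority order. Returns a list of (job_name, kind) launches.
    """
    launched = []
    progress = True
    while progress and (free_map > 0 or free_reduce > 0):
        progress = False
        for job in jobs:                       # select the next job in the queue
            if (not job["maps_done"] and free_map > 0
                    and job["running_maps"] < job["min_maps"]):
                job["running_maps"] += 1       # launch a map task on a free slot
                free_map -= 1
                launched.append((job["name"], "map"))
                progress = True
            elif (job["maps_done"] and free_reduce > 0
                    and job["running_reduces"] < job["min_reduces"]):
                job["running_reduces"] += 1    # launch a reduce task on a free slot
                free_reduce -= 1
                launched.append((job["name"], "reduce"))
                progress = True
    return launched


jobs = [
    {"name": "A", "min_maps": 2, "running_maps": 0, "maps_done": False,
     "min_reduces": 1, "running_reduces": 0},
    {"name": "B", "min_maps": 1, "running_maps": 0, "maps_done": True,
     "min_reduces": 2, "running_reduces": 1},
]
launches = assign_slots(free_map=3, free_reduce=2, jobs=jobs)
print(launches)
```

Each job receives only as many tasks as its minimum requirement, so one map slot and one reduce slot remain free in this example and could be offered to jobs that arrive later.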
5.3.3 Advantages of Deadline – Constraint Scheduler
Deadline scheduler focuses more on the optimization of Hadoop implementation 33.
This scheduling technique also increases system utilization.
5.3.4 Disadvantages of Deadline – Constraint Scheduler
There is a restriction that the nodes should be uniform in nature, which incurs cost.
The deadline has to be specified by the user for each job, which brings its own restrictions and issues.
5.4 EVALUATION OF ALGORITHM
The Deadline – Constraint scheduler has high resource utilization but low data locality. It works in both homogeneous and heterogeneous cluster environments and performs non-preemptive scheduling of jobs.
In this chapter, dynamic schedulers and the Deadline – Constraint scheduler in particular were discussed in detail. The background, working, algorithm, advantages and disadvantages were also explained.
CHAPTER – 6
In this chapter, Research Methodology is described extensively. The mechanism opted for the study is explained in this chapter.
The chapter provides different types of Hadoop Schedulers, tools and parameters. The scope of the study is categorized into different areas:
Selection of Tool for evaluation
6.2.1 Hadoop Schedulers
The scope of the study is limited to Hadoop Schedulers namely:
The Static scheduler includes FIFO scheduler, Fair scheduler and Capacity scheduler while the Dynamic scheduler includes Deadline – Constraint scheduler.
The mechanism is based on a literature survey to study the schedulers, evaluate them and improve them. A tool and a simulator are also used for evaluation.
6.3.1 Literature Survey
The literature survey is based on reports, research papers by different authors, theses, websites etc., and is used to study the different types of Hadoop schedulers, to evaluate them and to improve them.
6.3.2 Hadoop 2.4.1
The tool Hadoop 2.4.1 is used to run the static schedulers, namely the FIFO, Fair and Capacity schedulers. The tool can be downloaded from the official Hadoop repository, i.e. https://hadoop.apache.org/releases.html. The system-generated log files, for example btmp100, btmp500 and btmp2000, are used to evaluate these schedulers.
6.3.3 Simulator Design
A Java-based simulator is designed to run the dynamic scheduler, i.e. the Deadline – Constraint scheduler, in which a user can submit a job along with its deadline; based on the available resource slots, the map and reduce phases are then performed. The system-generated log files are used to evaluate the results and to improve the Deadline – Constraint scheduler in heterogeneous cloud computing systems.
Among the FIFO, Fair, Capacity and Deadline – Constraint schedulers, the Deadline – Constraint scheduler performs best, and it can be improved for even better results. This is concluded from the experiments performed using system-generated log files.
CHAPTER – 7
IMPROVED DEADLINE – CONSTRAINT SCHEDULER
In this chapter, Improved Deadline – Constraint Scheduler is explained extensively. Also the existing system with its disadvantages and the proposed system along with its advantages and execution process are explained in this chapter.
In dynamic scheduling, allocation of jobs to the processors is done during execution time. Deadline constraint scheduler schedules jobs based on the deadline constraints mentioned by users 31. This type of algorithm ensures that the jobs whose deadlines can be met are scheduled for execution.
7.3 EXISTING DEADLINE – CONSTRAINT SCHEDULER
In the Deadline – Constraint MapReduce scheduler, the job deadline is divided into task deadlines using a static approach 36. Instead of the static approach, a dynamic deadline approach is used in the existing system to improve slot resource utilization and reduce the deadline violation ratio. Initially, the job deadline is divided into a map task deadline and a reduce task deadline using the estimated map and reduce execution times. When a map or reduce task completes, its real execution time is used to extend or shorten the original deadlines of the pending map or reduce tasks through the adaptive deadline setting 37.
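The idea of partitioning and then adapting the deadline can be sketched as follows. The proportional split and the rule of passing the slack or overrun of a finished task on to the pending tasks are assumptions chosen for illustration; the function names are hypothetical and the exact formulas of the cited approach may differ.

```python
def partition_deadline(deadline, est_map, est_reduce):
    """Split a job deadline in proportion to the estimated map and reduce times."""
    map_deadline = deadline * est_map / (est_map + est_reduce)
    return map_deadline, deadline - map_deadline


def adjust_pending(pending_deadline, task_deadline, real_time):
    """Shift the deadline of pending tasks by the slack or overrun of a finished task."""
    return pending_deadline + (task_deadline - real_time)


map_dl, reduce_dl = partition_deadline(100, est_map=30, est_reduce=20)
early = adjust_pending(reduce_dl, map_dl, real_time=50)   # map phase finished early
late = adjust_pending(reduce_dl, map_dl, real_time=70)    # map phase overran
print(map_dl, reduce_dl, early, late)
```

In this example a job deadline of 100 is split into 60 for the map phase and 40 for the reduce phase; finishing the map phase 10 units early extends the reduce deadline to 50, while a 10 unit overrun shortens it to 30.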
7.3.1 Advantages of Deadline – Constraint Scheduler
Deadline scheduler focuses more on the optimization of Hadoop implementation 33.
This scheduling technique also improves resource utilization.
7.3.2 Disadvantages of Deadline – Constraint Scheduler
During the execution of a short-deadline job, if its map or reduce tasks are allocated to slots with fewer resources, the job cannot be completed within its specified deadline.
Here, fewer resources means a slot with low CPU performance and small memory size.
Considering different parameters such as data locality, job allocation and fairness, the different Hadoop schedulers are compared to identify the best scheduling algorithm, as shown in Table 7.3.1.
Table 7.3.1 Comparison of Hadoop Schedulers 38, 39
Table 7.3.1 shows the comparison of the Hadoop scheduling algorithms across the parameters mentioned above. In the default First In First Out (FIFO) scheduler, the JobTracker pulls the oldest job first from the job queue without considering the priority or size of the job. To address the issues discovered in FIFO, scheduling algorithms such as the FAIR and Capacity schedulers were introduced. FAIR assigns a fair share of resources to all jobs, and the Capacity scheduler works on the same principle but divides resources among queues: it puts jobs into multiple queues according to the conditions and allocates a certain system capacity to each queue. The Deadline – Constraint scheduler focuses more on increasing system utilization; jobs are scheduled only if the specified deadline can be met. The scheduler places newly arrived jobs in a waiting queue, and a test run of these jobs is used to predict their workload requirements. Deadline – Constraint performs best among these scheduling algorithms, as it provides high resource utilization and executes jobs in homogeneous as well as heterogeneous cluster environments.
7.4 IMPROVED DEADLINE – CONSTRAINT SCHEDULER
A new scheduler is proposed that uses the bipartite graph modelling technique to integrate the points discussed above into MapReduce scheduling. The proposed MapReduce scheduler is called BGMRS (Bipartite Graph MapReduce Scheduler) 40.
Figure 7.1 System Architecture
In comparison to the previously used schemes, the BGMRS can dynamically adjust the map and reduce task deadlines of a job according to the execution time of the map and reduce tasks that have already run. The given job deadline is divided into a map deadline and a reduce deadline. Initially, no slots are assigned to any task of the job. When the job enters the map phase, its map tasks are allocated to appropriate slots according to the associated map deadline. If the number of appropriate slots is not sufficient, the map phase takes two or more map rounds. In each map round, when map tasks are completed, the task completion time is used to adjust the map deadline for the pending map tasks in the next map round, and so on.
7.4.1 The BGMRS Algorithm:
Input: A set J of jobs with different deadline requirements and a set S of slots with heterogeneous performance.
Output: Deadline-constrained scheduling for the tasks of J.
while jobs run their map or reduce tasks do
    /* Deadline partition */
    for each job j of J in the ready queue do
        Perform the deadline partition to set the deadlines for the map and reduce tasks of j.
    /* Bipartite graph modelling */
    T ← Collect the running map and reduce tasks of such jobs.
    for each task t of T do
        Find the feasible slots of t.
        Collect feasible slots in F.
    Form a weighted bipartite graph based on the T and F sets.
    /* Scheduling problem transformation */
    Apply the MWBM (Minimum Weighted Bipartite Matching) to obtain the optimal task scheduling of T.
When a job is submitted to the ready queue, it is provided with a deadline that specifies the expected execution time and a data retrieval limit that restricts the distance between the processing node and the storage node 41. After the deadline partition, the job deadline is divided into two sub-deadlines: a map deadline and a reduce deadline. If the execution time of a map or reduce task exceeds the corresponding map or reduce deadline, the task is known as a deadline-over task. When one or more jobs in the ready queue run their map or reduce tasks simultaneously, such tasks are collected in a task set T. A slot s is a feasible slot of a task t if it can provide the execution performance needed to meet the deadline of t and the data retrieval limit of the original job 42. The BGMRS then finds the feasible slots of each running task, and the feasible slots of all running tasks are kept in the set F. Based on the sets T and F, a weighted bipartite graph is formed to represent the scheduling of tasks between T and F. By transforming the Deadline Constrained MapReduce Scheduling (DCMRS) problem into the Minimum Weighted Bipartite Matching (MWBM) problem, an optimal task scheduling is obtained that minimizes both the number of deadline-over tasks and the total task execution time 43.
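The matching step can be illustrated with a small sketch. It is not the BGMRS implementation: the edge weights, the penalty constant OVERDUE_PENALTY and the brute-force search are assumptions made here to keep the example self-contained. A practical scheduler would use a polynomial-time assignment algorithm such as the Hungarian method in place of the exhaustive search.

```python
from itertools import permutations

# Hypothetical penalty: makes any deadline-over assignment far more costly
# than any feasible one, so the matching first avoids deadline-over tasks
# and only then minimizes the total execution time.
OVERDUE_PENALTY = 10_000.0


def build_weights(exec_times, deadlines):
    """exec_times[t][s] is the estimated time of task t on slot s.

    An edge to a feasible slot keeps its execution time as weight; an edge
    to a slot that would miss the task deadline also carries the penalty.
    """
    weights = []
    for t, row in enumerate(exec_times):
        weights.append([
            cost if cost <= deadlines[t] else cost + OVERDUE_PENALTY
            for cost in row
        ])
    return weights


def min_weight_matching(weights):
    """Exhaustive minimum weighted bipartite matching (needs tasks <= slots)."""
    n_tasks = len(weights)
    n_slots = len(weights[0])
    best_cost, best_assignment = float("inf"), None
    for slots in permutations(range(n_slots), n_tasks):
        cost = sum(weights[t][s] for t, s in enumerate(slots))
        if cost < best_cost:
            best_cost, best_assignment = cost, slots
    return best_assignment, best_cost
```

For two tasks with deadlines 5 and 7 and estimated times [[4, 9, 7], [6, 3, 8]] on three slots, the sketch assigns task 0 to slot 0 and task 1 to slot 1, with a total execution time of 7 and no deadline-over task.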
7.4.2 Software Description
Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems’ Java platform 44. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities. Java applications are typically compiled to bytecode (class files) that can run on any Java Virtual Machine (JVM) regardless of computer architecture. Java is a general-purpose, concurrent, class-based, object-oriented language that is specifically designed to let application developers “write once, run anywhere” 45. Java is currently one of the most popular programming languages in use, particularly for client-server web applications.
End-users commonly use a Java Runtime Environment (JRE) installed on their own machine for standalone Java applications, or in a Web browser for Java applets 46. Standardized libraries provide a generic way to access host-specific features such as graphics, threading and networking.
7.4.2.1 NetBeans
The NetBeans Platform is a reusable framework for simplifying the development of Java Swing desktop applications. The NetBeans Integrated Development Environment (IDE) bundle for Java SE contains what is needed to start developing NetBeans plug-ins and NetBeans Platform based applications; no additional Software Development Kit (SDK) is required 47. Applications can install modules dynamically. Any application can include the Update Center module to allow its users to download digitally signed upgrades and new features directly into the running application.
The main features of the platform are:
User interface management (e.g. menus and toolbars)
User settings management
Storage management (saving and loading any kind of data)
Wizard framework (supports step-by-step dialogs)
NetBeans Visual Library
Integrated Development Tools
7.4.2.2 WAMP Server
WAMPs (Windows, Apache, MySQL, PHP) are packages of independently created programs installed on computers that use a Microsoft Windows operating system 48. Apache is a web server. MySQL is an open-source database. PHP is a scripting language that can manipulate information held in a database and generate web pages dynamically each time content is requested by a browser. A package may also include other programs, such as phpMyAdmin, which provides a graphical user interface for the MySQL database manager, or the alternative scripting languages Python or Perl.
The MySQL development project has made its source code available under the terms of the General Public License, as well as under a variety of proprietary agreements 49. MySQL was owned and sponsored by a single for-profit firm, the Swedish company MySQL AB, now owned by Oracle Corporation 50. Free software and open source projects that require a full featured database management system often use MySQL. Applications which use MySQL databases include: Joomla, WordPress, Drupal etc.
7.4.3 Execution Process of Improved Deadline – Constraint Scheduler
The execution process of the improved Deadline – Constraint scheduler proceeds in the following stages:
7.4.3.1 Cloud User Login
The user first registers and then logs in to the cloud storage. Cloud storage shares resources among multiple jobs. The scheduler takes the job deadline from the cloud user: the user uploads the job together with its deadline to the cloud storage. The job deadline is then divided into a map task deadline and a reduce task deadline using the estimated map and reduce execution times.
Figure 7.2 Registering into the Cloud
Figure 7.3 Login Window
7.4.3.2 Map and Reduce Phase:
In the map and reduce phase, the job deadline is divided into a map task deadline and a reduce task deadline using the estimated map and reduce execution times. On completion of a map or reduce task, the real execution time of that task is fed back to extend or shorten the original deadlines of the pending map or reduce tasks.
Figure 7.4 Uploading and Checking the File
Figure 7.5 Tasks Sent to Scheduler and Deadline Specified
7.4.3.3 Deadline Partition:
Deadline partitioning sets the map and reduce task deadlines of a job and adjusts them according to the execution time of already running map and reduce tasks. The given job deadline is divided into a map deadline and a reduce deadline. In the ready phase, no slots are assigned to any tasks of the job. On entering the map phase, the map tasks of the job are allocated to appropriate slots according to the associated map deadline.
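A minimal sketch of the deadline partition is given below. The source states only that the estimated map and reduce execution times are used; the proportional split and the rule for re-deriving the reduce deadline from the actual map finish time are assumptions made for illustration, and both function names are hypothetical.

```python
def partition_deadline(job_deadline, est_map_time, est_reduce_time):
    """Split a job deadline into (map_deadline, reduce_deadline).

    Assumed rule: each phase receives a share of the job deadline
    proportional to its estimated execution time.
    """
    total = est_map_time + est_reduce_time
    if total <= 0:
        raise ValueError("estimated execution times must be positive")
    map_deadline = job_deadline * est_map_time / total
    return map_deadline, job_deadline - map_deadline


def adjust_reduce_deadline(job_deadline, map_finish_time):
    """Re-derive the reduce deadline once the map phase has really finished.

    Assumed rule: the reduce phase gets whatever time is left in the job
    deadline, and never a negative amount.
    """
    return max(job_deadline - map_finish_time, 0.0)
```

For a job deadline of 120 seconds with estimated map and reduce times of 30 and 10 seconds, the split is 90 seconds for the map phase and 30 seconds for the reduce phase. If the map phase actually finishes after 100 seconds, the reduce deadline shrinks to 20 seconds.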
Figure 7.6 Mapping of the Task
Figure 7.7 File Sent and File Received
Figure 7.8 Scheduling the Task
7.4.3.4 Bipartite Graph Modelling:
The bipartite graph technique is used to dynamically adjust the map and reduce task deadlines of a job. The map deadline is used to find slots for pending map tasks. After all map rounds in the map phase are completed, the final map task completion time is used to adjust the reduce deadline for all reduce tasks. The graph model is used to find the minimum time needed to finish the job, subject to its deadline.
Figure 7.9 Reducing the Task (File: btmp100)
Figure 7.10 Storage of the Task in Cloud (File: btmp100)
Figure 7.11 Reducing the Task (File: btmp500)
Figure 7.12 Storage of the Task in Cloud (File: btmp500)
7.4.4 Advantages of Improved Deadline – Constraint Scheduler
The new map deadline is used to find appropriate slots for pending map tasks.
After all map rounds in the map phase are completed, the final map task completion time is used to adjust the reduce deadline for all reduce tasks.
7.4.5 Disadvantages of Improved Deadline – Constraint Scheduler
The new Deadline – Constraint scheduler works only in a heterogeneous environment.
No slots are assigned to any tasks of a job in the ready phase, so the map phase may have to take multiple rounds to find appropriate slots.
In this chapter, the improved Deadline – Constraint scheduler is discussed in depth. The existing system and the improved system, along with the software description and the execution process, are explained.
CHAPTER – 8
In this chapter, an experimental analysis of the Hadoop schedulers is carried out. A comparative analysis is also performed on the basis of the different parameters used, and the results are evaluated.
Scheduling plays an important role in big data optimization, as it helps to reduce the processing time. Scheduling in big data platforms involves processing and completing multiple tasks by handling and changing data efficiently with a minimum number of migrations. The main objectives of scheduling are to maximize throughput and minimize the completion time, while balancing overhead against the available resources when allocating jobs to processors 51.
Three main scheduling issues were taken into consideration: fairness, data locality and synchronization. Fairness involves trade-offs with locality and with the dependency between the map and reduce phases of processing jobs. Locality is defined as the distance between the input data node and the task-assigned node. Synchronization, the process of transmitting the intermediate output of the map processes to the reduce processes as input, is also considered a factor that affects performance 52.
8.3 EVALUATION CRITERIA
The scheduling algorithms are evaluated using the following parameters: data locality, job allocation, fairness, resource utilization, environment, file size, number of records and execution time.
Figure 8.1 Hadoop Scheduler Parameters
Data Locality: it is defined as the distance between the data node which holds the input and the task node.
Job allocation: it is defined as the process of allocating system resources to many different tasks by the system.
Fairness: it deals with providing fair or equal share of resources to all the available jobs.
Resource Utilization: it refers to the process of making resources available to Hadoop cluster to achieve the desired computational results of allocated job.
Environment: it includes both homogeneous and heterogeneous cluster environment.
File Size: it is defined as a measure of how much data a computer file contains or how much storage it consumes.
Number of Records: it is defined as the sequence of numbers assigned to each physical record in a file.
Execution Time: it is defined as the time spent by the system executing a particular task.
8.4 THE VALIDATION
A comparison of scheduling algorithms is performed to analyse the working of the different Hadoop schedulers. Task scheduling is a factor that directly affects the overall performance of the Hadoop platform and the utilization of system resources. Various algorithms have been designed to resolve this issue with different techniques and approaches. Some of them improve data locality and some provide synchronized processing. Many were designed to minimize the total completion time, while other schedulers allocate capacity fairly among users and jobs. Some also address resource utilization, such as CPU utilization, I/O utilization and system utilization.
The table shows the comparison of different algorithms along with parameters.
Table 8.4.1 Comparison of Hadoop Schedulers Based on Performance using Dataset
Table – 8.4.1 shows the analysis of the Hadoop scheduling algorithms based on performance using various parameters and dataset (system log files) already mentioned.
Figure 8.2 Graphical Representation of Hadoop Schedulers Based on Performance using Dataset
For the comparison, a common dataset is chosen: a system log file containing a large amount of data such as system errors, warnings and authentication logs. To test the cluster and the scheduling algorithms, sample datasets of sizes varying from 100MB to 3000MB are used, and the total execution time needed to process each dataset is recorded. The resulting graph in Fig. 8.2 shows the performance difference between the schedulers and leads to the conclusion that Deadline – Constraint is the best-performing scheduling algorithm.
In this chapter, an analysis of the Hadoop schedulers is performed, and the parameters on which the scheduling algorithms were evaluated are explained. After the comparison, the results are evaluated and Deadline – Constraint is chosen as the best of the algorithms considered.
CHAPTER – 9
CONCLUSIONS AND FUTURE SCOPE
In this chapter, conclusions about Big Data, Hadoop, MapReduce, HDFS and scheduling algorithms are presented. The best Hadoop scheduling algorithm is chosen by analysing the algorithms using different parameters, and the future work is discussed.
With the expansion of the internet, the size of data being stored on servers is increasing at a very high rate, which calls for better computational methods to process such large amounts of data. Big data technology was introduced to solve these data processing problems. It uses a cluster of processing units, and the task is distributed and executed across these nodes to achieve optimal results. Data distribution and execution among the nodes are controlled using different scheduling algorithms. These algorithms are analysed on the basis of different parameters such as file size and number of records.
9.3 CONCLUSIONS AND FUTURE SCOPE
Big Data is a term used to describe voluminous amounts of data that grow so large and so fast that they are not manageable by a traditional Relational Database Management System (RDBMS) or conventional statistical tools. Big data analytics is the process of collecting, organizing and analyzing large sets of data to discover patterns and useful information. Hadoop is an open-source Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. The core of Hadoop consists of a storage part known as HDFS and a processing part known as MapReduce.
Scheduling plays an important role in big data optimization, as it helps to reduce the processing time. The main aim of scheduling in big data platforms is to process and complete multiple tasks by handling and changing data efficiently with a minimum number of migrations. The study is conducted on different types of Hadoop schedulers on the basis of the different parameters used.
The analysis of Hadoop scheduling algorithms shows that the Deadline – Constraint scheduler is the best of the scheduling algorithms considered. It has high resource utilization and executes jobs in parallel among all nodes in a non-preemptive scheduling mode. With this algorithm, each task gets a fair share of all the available resources in both homogeneous and heterogeneous cluster environments. To test the cluster and the scheduling algorithms, sample datasets of sizes varying from 100MB to 3000MB are used, and the total execution time needed to process each dataset is recorded.
The Deadline – Constraint algorithm gives users the option to submit a new deadline when the previously specified deadline cannot be met. To verify this, a Java-based scheduling simulator was designed that executes multiple jobs according to the deadline provided by the end user and then maps resources so that the jobs are executed across the nodes. The algorithm maximizes the number of jobs that can be run on a cluster while satisfying the time requirements of all running jobs.
Future work may include validating the Deadline – Constraint scheduling algorithm using a tool. Experiments may be constructed for a cluster of nodes processing real-time data, with the aim of reducing the computational time spent waiting for a map or reduce slot.
1. Dr. M. Moorthy, R. Baby and S. Senthamaraiselvi, “An Analysis for Big Data and its Technologies”, International Journal of Computer Science Engineering and Technology (IJCSET), Vol. 04, Issue 12, pp. 412-418, 2014.
2. A. A. Pandagale and A. R. Surve, “Big Data Analysis Using Hadoop Framework”, International Journal of Research and Analytical Reviews (IJRAR), Vol. 03, Issue 01, pp. 87-91, 2016.
3. J. Dean, “Big data, Data Mining and Machine Learning: value creation for business leaders and practitioners”, John Wiley & Sons, ISBN: 978-1-118-61804-2, 2014.
4. J. Venner, “Pro Hadoop”, Apress, ISBN 978-1-4302-1943-9, 2009.
5. H. S. Bhosale and Prof. D. P. Gadekar, “A Review Paper on Big Data and Hadoop”, International Journal of Scientific and Research Publications (IJSRP), Vol. 04, Issue 10, pp. 1-7, 2014.
6. DT Editorial Services, “Big Data”, Black Book, Dreamtech Press, ISBN 978-93-5119-931-1, 2016.
7. J. Dean and Sanjay Ghemawat, “MapReduce: simplified data processing on large clusters”, Communications of the ACM, Vol. 51, Issue 01, pp. 107-113, 2008.
8. J. Dittrich, J. Arnulfo and Q. Ruiz, “Efficient big data processing in Hadoop MapReduce”, Proceedings of the VLDB Endowment, Vol. 05, No. 12, pp. 419-429, 2012.
9. S. Ghemawat, H. Gobioff and S. Leung, “The Google File System”, ACM SIGOPS operating systems review, Vol. 37, Issue 05, pp. 20-43, 2003.
10. S. Kumari, “A Review Paper on Big data Analytics”, International Journal of Recent Advances in Engineering & Technology (IJRAET), Vol. 04, Issue 01, pp. 139-142, 2016.
11. K. Shvachko, H. Kuang, S. Radia and R. Chansler, “The Hadoop Distributed File System”, IEEE 26th Symposium on Mass Storage Systems and Technologies, Vol. 35, No. 02, pp. 6-16, 2010.
12. G. Ravichandran, “Big Data Processing with Hadoop: A Review”, International Research Journal of Engineering and Technology (IRJET), Vol. 04, Issue 02, pp. 448-451, 2017.
13. I. Polato, R. Ré and A. Goldman, “A comprehensive view of Hadoop research – A systematic literature review”, International Journal of Computer Networks and Communications Security, Vol. 46, pp. 308-317, 2014.
14. J. V. Gautam, H. B. Prajapati, V. K. Dabhi and S. Chaudhary, “A Survey on Job Scheduling Algorithms in Big Data Processing”, IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT’15), Coimbatore, pp. 1-11, 2015.
15. R. V. Bossche, K. Vanmechelen and J. Broeckhove, “Online cost-efficient scheduling of deadline-constrained workloads on hybrid clouds”, Future Generation Computer Systems, pp. 973-985, 2013.
16. V. Cristea, C. Dobre, C. Stratan, F. Pop and A. Costan, “Large-Scale Distributed Computing and Applications: Models and Trends”, IGI Global, Hershey, ISBN 978-1-6152-0703-9, pp. 1-276, 2010.
17. H. Karatza, “Scheduling in distributed systems: In Performance Tools and Applications to Networked Systems”, Springer Berlin, Heidelberg, ISBN 978-3-5402-1945-3, pp. 336-356, 2004.
18. J. Xie, F. Meng, H. Wang, H. Pan, J. Ceng and X. Qin, “Research on scheduling scheme for Hadoop Clusters”, International Conference on Computational Science, SciVerse ScienceDirect, pp. 2648-2471, 2013.
19. S. Suresh and N.P. Gopalan, “An optimal Task Selection Scheme for Hadoop Scheduling”, International Conference on Future Information Engineering, pp. 70-75, 2014.
20. L. S. Dias and M. G. Ierapetritou, “Integration of scheduling and control under uncertainties: review and challenges”, pp. 98-113, 2016.
21. Apache Hadoop, https://en.wikipedia.org/wiki/Apache_Hadoop accessed on 22/07/2018 at 1400 hrs.
22. The Apache Hadoop Project, http://hadoop.apache.org/ accessed on 25/07/2018 at 1100 hrs.
23. Resource Manager API, https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html accessed on 04/08/2018 at 1100 hrs.
24. S. Divya, R. K. Rajesh, R. M. Nithila and I. Vinothini, “Big data analysis and its scheduling policy – Hadoop”, IOSR Journal Computer Engineering (IOSR-JCE), Vol. 17, Issue 01, pp. 36-40, 2015.
25. B. P. Andrews and A. Binu, “Survey on Job Schedulers in Hadoop Cluster”, IOSR Journal of Computer Engineering (IOSR-JCE), Vol. 15, Issue 01, pp. 46-50, 2013.
26. M. Pastorelli, A. Barbuzzi, D. Carra, M. Dell’Amico and P. Michiardi, “Practical size based scheduling for MapReduce workloads”, CoRR, 2013.
27. Hadoop: Fair Scheduler, https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/FairScheduler.html accessed on 06/08/2018 at 900 hrs.
28. J. Chen, D. Wang and W. Zhao, “A Task Scheduling Algorithm for Hadoop Platform”, Journal of Computers, Vol. 08, Issue 08, pp. 929-936, 2013.
29. N. Tiwari and U. Bellur, “Scheduling and Energy Efficiency Improvement Techniques for Hadoop Mapreduce: State of Art and Directions for Future Research”, Doctoral dissertation, Indian Institute of Technology, Mumbai, 2014.
30. Hadoop: Capacity Scheduler, https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html accessed on 09/08/2018 at 1500 hrs.
31. K. Kc and K. Anyanwu, “Scheduling Hadoop jobs to meet deadlines”, IEEE 2nd International Conference, pp. 388-392, 2010.
32. D. Cheng, J. Rao, C. Jiang and X. Zhou, “Resource and deadline aware job scheduling in dynamic Hadoop Clusters”, IEEE 29th International Parallel and Distributed Processing Symposium, pp. 956-965, 2015.
33. X. Dai and B. Bensaou, “Scheduling for response time in Hadoop MapReduce”, IEEE International Conference on Communications (ICC), 2016.
34. S. Rashmi and A. Basu, “Deadline Constrained Cost Effective Workflow scheduler for Hadoop clusters in cloud datacentre”, International Conference on Computation System and Information Technology for Sustainable Solutions (CSITSS), pp. 409-415, 2016.
35. N. Lim, S. Majumdar and P. A. Smith, “MRCP-RM: a technique for resource allocation and scheduling of MapReduce jobs with deadlines”, IEEE Trans. Parallel Distributed System, Vol. 28, Issue 05, pp. 1375-1389, 2016.
36. Z. Tang, J. Zhou, K. Li and R. Li, “A MapReduce task scheduling algorithm for deadline constraints”, Springer Science Business Media, pp. 651-662, 2012.
37. M. Mattess, R. N. Calheiros and R. Buyya, “Scaling MapReduce Applications Across Hybrid Clouds to Meet Soft Deadlines”, IEEE 27th International Conference on Advanced Information Networking and Applications (AINA), Spain, pp. 629-636, 2013.
38. M. Usama, M. Liu and M. Chen, “Job schedulers for Big data processing in Hadoop environment: testing real-life schedulers using benchmark programs”, Digital Communications and Networks, pp. 260-273, 2017.
39. Vanika and A. K. Sharma, “A Speculative Study on Hadoop Scheduling Algorithms”, International Journal of Computer Sciences and Engineering (IJCSE), Vol. 06, Issue 06, pp. 1171-1176, 2018.
40. C. Chen, J. Lin and S. Kuo, “MapReduce Scheduling for Deadline-Constrained Jobs in Heterogeneous Cloud Computing Systems”, IEEE Transactions on Cloud Computing, Vol. 06, Issue 01, pp. 127-140, 2015.
41. P. T, B. Priyatharsini, M. N. Preethi and R. Sujatha, “Dynamic Processing for large data sets using Bipartite Graph”, Journal of Chemical and Pharmaceutical Sciences, Vol. 09, Issue 03, pp. 1197-1200, 2016.
42. Y. Yao, J. Wang, B. Sheng, C. C. Tan and N. Mi, “Self-Adjusting Slot Configurations for Homogeneous and Heterogeneous Hadoop Clusters”, Vol. 05, Issue 02, pp. 344-357, 2015.
43. B. T. Rao and L. S. S. Reddy, “Survey on Improved Scheduling in Hadoop MapReduce in Cloud Environments”, International Journal of Computer Applications (IJCA), Vol. 34, Issue 09, pp. 29-33, 2011.
44. Herbert Schildt, “The Complete Reference Java”, 7th Edition, ISBN 978-00-7063-677-4, pp. 13, 2007.
45. Java (Programming Language), https://en.wikipedia.org/wiki/Java_(programming_language) accessed on 14/08/2018 at 1500 hrs.
46. Java (Software Platform), https://en.wikipedia.org/wiki/Java_(software_platform) accessed on 16/08/2018 at 1200 hrs.
47. Net Beans, https://en.wikipedia.org/wiki/NetBeans#NetBeans_platform accessed on 20/08/2018 at 0400 hrs.
48. Wamp Server, https://en.wikipedia.org/wiki/WampServer accessed on 21/08/2018 at 0900 hrs.
49. MySQL, https://en.wikipedia.org/wiki/MySQL accessed on 23/08/2018 at 0800 hrs.
50. MySQL on Windows, https://www.mysql.com/why-mysql/windows/ accessed on 25/08/2018 at 1100 hrs.
51. L. Liu, Y. Zhou, M. Liu, G. Xu, X. Chen, D. Fan and Q. Wang, “Preemptive Hadoop Jobs Scheduling under a Deadline”, 8th International Conference on Semantics, Knowledge and Grids, pp. 72-89, 2012.
52. S. R. Pakize, “A Comprehensive View of Hadoop Map Reduce Scheduling Algorithms”, International Journal of computer networks and communications security, Vol. 95, Issue 23, pp. 308-317, 2014.
53. J. S. Ward and A. Barker, “Undefined By Data: A Survey of Big Data Definitions”, Stamford, CT: Gartner, 2012.
54. B. Purcell, “The emergence of Big Data technology and Analytics”, Journal of Technology Research, pp. 1-7, 2013.
55. S. D. Vidyasagar, “A Study on Role of Hadoop in Information Technology Era”, Global Research Analysis (GRA), Vol. 02, Issue 02, 2013.
56. M. Thomas, “A Review Paper on Big Data and Hadoop”, International Research Journal of Engineering and Technology (IRJET), Vol. 02, Issue 09, 2015.
57. N. Koseleva and G. Ropaite, “Big data in building energy efficiency: understanding of big data and main challenges”, Science Direct, pp. 544-549, 2017.
58. Chuck Lam, “Hadoop in Action”, Manning publications, ISBN 978-1-9351-8219-1, 2011.
59. C. Lee, K. Hsieha, S. Hsieha and H. Hsiao, “A Dynamic Data Placement Strategy for Hadoop in Heterogeneous Environments”, Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan, Vol. 01, Issue 01, pp. 14-22, 2014.
60. R. Pawar, R. Bhosale, A. Panhalkar, “Different type of log file using Hadoop MapReduce Technology”, International Journal of Technical Research and Applications (IJTRA), Vol. 04, Issue 03, pp. 5-8, 2016.
61. A. Elsayed, O. Ismail and M. E. Sharkawi, “MapReduce: State-of-the-Art and Research Directions”, International Journal of Computer and Electrical Engineering (IJCEE), Vol. 06, Issue 01, pp. 34-39, 2014.
62. S. Agarwal, “Map Reduce: A Survey Paper on Recent Expansion”, International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 06, Issue 08, pp. 209-215, 2015.
63. P. Sudha, Dr. R. Gunavathi, “A Survey Paper on Map Reduce in Big Data”, International Journal of Science and Research (IJSR), Vol. 05, Issue 09, pp. 1103-1107, 2016.
64. H. M. Patel, “A Comprehensive Analysis of MapReduce Scheduling Algorithms for Hadoop”, International Journal of Innovative and Emerging Research in Engineering (IJIERE), Vol. 02, Issue 02, pp. 27-31, 2015.
65. M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker and I. Stoica, “Job Scheduling for Multi-User MapReduce Clusters”, Electrical Engineering and Computer Sciences, University of California, pp. 1-18, 2009.
66. A. Sharma, “Hadoop MapReduce Scheduling Algorithms – A Survey”, International Journal of Computer Science and Mobile Computing (IJCSMC), Vol. 04, Issue 12, pp. 171-176, 2015.
67. M. Senthilkumar and P. Ilango, “A Survey on Job Scheduling in Big Data”, Cybernetics and Information Technologies, Vol. 16, Issue 03, pp. 35-51, 2016.
68. A. Tyagi and S. Sharma, “A Brief Review of Scheduling Algorithms of MapReduce Model using Hadoop”, International Journal of Engineering Trends and Technology (IJETT), Vol. 45, Issue 01, pp. 37-42, 2017.
69. Vanika, A. K. Sharma and K. Thakur, “MapReduce Scheduling for Deadline – Constraint jobs in Heterogeneous Cloud Computing Systems”, International Journal of Research and Analytical Reviews (IJRAR), Vol. 05, Issue 03, pp. 282-287, 2018.
70. N. Deshai, S. Venkataraman and Dr. G. P. Saradhi Varma, “Research Paper on Big Data Hadoop Map Reduce Job Scheduling”, International Journal of Innovative Research in Computer and Communication Engineering (IJIRCCE), Vol. 06, Issue 01, pp. 103-114, 2018.