
¹ Source: http://hadoop.apache.org/ accessed 16 Feb 2018.
² Source: https://spark.



0/index.html accessed 16 Feb 2018.

ABSTRACT
Over the past years, flight delays have had negative effects on passengers, airlines, and airports. It is now possible to predict that a flight will be delayed based on the statistics of past flights. This paper focuses on passenger satisfaction, unlike most previous research, which is concerned with airlines and airports.

In this work, a new Dynamic Double Delay Flight Predicting Web (3DFPW) model is created to give a passenger the predicted delay status, and its probability, at the origin and destination airports for a chosen airline through a website, even before booking an airline ticket. Most previous studies focused on either departure delay or arrival delay only. This work addresses both delays at the same time. Spark is used as a cluster computing layer over a Hadoop cluster and is driven through the SparkR library from R. This work answers two questions. First, what is the best classification algorithm to use from SparkR MLlib? Second, which SparkR caching level gives the best performance and robustness, and why?

Keywords: Machine Learning, Big data, Sparkr, Caching, Flight Delay, R, Classification, Prediction, Naïve Bayes.

1- INTRODUCTION
Delays in air travel can be very expensive for both passengers and airlines. While many delays are due to weather or mechanical failures and are unpredictable, it may be possible to predict that a flight will be delayed based on the statistics of past flights.

Flight delays have adverse effects on passengers, airlines, and airports, especially economic ones. Delay estimates can improve tactical and operational decisions by airport and airline executives, and can alert passengers so that they can adjust their plans [1].

The passenger is the paying client; consequently, a passenger who is dissatisfied with an airport or an airline will stop using it and will choose only flights he is satisfied with. Airlines and airports therefore compete to offer flights that satisfy the passenger, which is why this paper focuses on passenger satisfaction, whereas most previous research is concerned with airlines and airports. A new Dynamic Double Delay Flight Predicting Web (3DFPW) model is proposed to give a passenger the predicted status of his flight, and its probability, before booking an airline ticket. Most previous studies focused on either departure delay or arrival delay only.

This work focuses on both departure and arrival delays at the same time, to give the passenger full information about delays at the origin and destination airports. The 3DFPW model is built on big data machine learning prediction algorithms, so it needs to learn from many years of historical flights to make a good prediction. Therefore, a Hadoop cluster is used to store the big data and to support fast prediction. Spark is used as a cluster computing layer over Hadoop and is driven through the SparkR library from RStudio.

SparkR is a distributed system. It is simpler than Hadoop and easier to read. Algorithms built in this system are fast and scalable because the data is loaded into Spark memory. SparkR can run faster for projects with large-scale data files that require parallel solutions [2].

Implementing a 3DFPW model on a commodity cluster requires answering two questions. First, what is the best classification algorithm to use from SparkR MLlib? Second, which SparkR caching level gives the best performance and robustness, and why? This work answers these questions.

The rest of this paper is organized as follows. Section 2 gives background on the technologies used in the 3DFPW model. Section 3 briefly reviews related work on flight delay. Methods and design are presented in Section 4. Results and discussion are offered in Section 5.

Conclusion and future work are presented in Section 6.

Best Caching Storage Technique Using Sparkr for Big Data
Ahmed Elsayed, College of Computing and Information Technology, AAST, [email protected]
Prof. Dr. Mohamed Shaheen, College of Computing and Information Technology, AAST, [email protected]
Prof. Dr. Osama Badawy, College of Computing and Information Technology, AAST, [email protected]

2- BACKGROUND
Machine learning is the field of research that explores the development of algorithms that can learn from data and make predictions based on it. Work on flight systems increasingly uses machine learning methods [1]. Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.¹ Apache Spark is a fast and general cluster computing system. It offers high-level APIs in Java, Scala, Python, and R, as well as an optimized engine that supports general execution graphs. It also supports a number of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.² R is an open-source programming language and software environment widely used for statistical computation in data-intensive roles such as data mining and statistics.³ RStudio is an integrated development environment (IDE) for R. It includes a console that supports direct code execution, a syntax-highlighting editor, and tools for plotting, history, debugging, and workspace management.⁴ SparkR is an R package that provides a lightweight interface for using Apache Spark from R.

³ Source: https://www.r-project.org/about.html accessed 16 Feb 2018.
⁴ Source: https://www.rstudio.com/ accessed 16 Feb 2018.

SparkR provides a distributed data frame implementation that supports operations such as selection, filtering, and aggregation (similar to R data frames and dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib.⁵ A SparkDataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood. SparkDataFrames can be created from a wide variety of sources, such as structured data files, Hive tables, external databases, or existing local R data frames.⁵ Shiny is an R package that makes it easy to build interactive web applications directly from R. Applications can be hosted standalone on a web page, embedded in R Markdown documents, or used to build dashboards. Shiny applications can also be extended with CSS themes, HTML widgets, and JavaScript actions.⁴ Apache Zeppelin is a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala, and more.⁶

3- RELATED WORK
Flight delay imposes significant costs on passengers, airlines, and society. Such high delay costs motivate the analysis and prediction of air traffic delays and the development of better mechanisms for handling delays.

Predicting flight delays has been the topic of several previous efforts. Sternberg, Soares, Carvalho & Ogasawara (2017) developed a taxonomy and classified delay prediction models by their detailed components, drawing on previous flight delay research. That work analyzes these models from a data science perspective and is based on arrival delay [1]. Mazzeo (2003) examined the hypothesis that the market power enjoyed by dominant airlines allows them to provide lower service quality through increased flight delays, based on arrival delay [3].

Yi Ding (2017) carried out regression and ordinal classification tasks based on multiple linear regression models to predict delay, implemented the model, and compared it with the Naïve Bayes and C4.5 approaches, based on arrival delay [4]. Ugwu, Ntuk & Ekaete (2016) observed that airline carriers had the highest impact on predicting on-time and delayed flight status. Their research therefore aimed to predict on-time and delayed flight status using the interpretability of a decision tree model; the reported accuracy of the system is 74.3%, based on departure delay [5]. Tu, Ball & Jank (2006) estimated a flight departure delay distribution, focusing exclusively on downstream delays caused by factors such as weather conditions and estimates of airport surface congestion, among others. Specifically, they built a model that responds to changes in real-time parameter measurements, based on departure delay [6].

Montfort & Berg (2017) used two measures of delay, the delay in minutes past schedule and whether the delay exceeded 15 minutes. Their results suggest that the larger an airline's nationwide size, the shorter and less frequent its delays. This result appears robust to the choice of specification, controls, and variable set-up. Larger airlines have more resources, and efficient use of these may decrease delays, based on arrival delay [7]. Cole & Donoghue (2017) trained a logistic regression model to predict whether a flight will be delayed by more than 15 minutes, based on departure delay [8]. Venkataraman et al. (2016) found results in line with previous studies that measured the importance of caching in Spark: the benefits come not only from using faster storage media but also from avoiding the CPU time spent decompressing data and parsing CSV files. Caching helps achieve the low latencies that make SparkR suitable for interactive query processing from the R shell, and caching the data can improve performance by 10x to 30x for this workload [9].

4- METHODS AND DESIGN
4.1- System Components
Hadoop Cluster Specs (version 2.6 on one namenode and 5 datanodes):
1 Master: Processor: AMD Phenom(TM) 8600B, Cores: 3, Memory: 8 GB, Hard disk: 120 GB, Network card: Gigabit, OS: Linux (Ubuntu 14), System type: 64-bit.
5 Slaves: Processor: Intel Core 2 Duo CPU E8400 3.00GHz, Cores: 2, Memory: 4 GB, Hard disk: 40 GB, Network card: Gigabit, OS: Linux (Ubuntu 14), System type: 64-bit.
6 Machines: connected together on 1 Gigabit switch, speed approximately 600 Mbit.

Spark Cluster Specs: 1 Driver and 6 Workers on 6 machines with 13 cores in total, Memory in use: 19.6 GB total, 14.9 GB used, Spark Master at spark://hdmaster:7077, Spark version 2.1.0 installed on the same cluster as Hadoop, in standalone mode.

Dataset: The data were obtained from the Bureau of Transportation Statistics, a federal agency of the United States of America.⁷ The dataset consists of records of all US domestic flights of major carriers, the Airline On-Time Performance dataset, downloaded as a CSV file. It is based on details of the arrival and departure of all commercial flights in the US from October 1987 to April 2008. This is an extensive dataset: nearly 123 million records in total and 12 gigabytes of unpacked data.⁸

Variable descriptions (29 variables):
Year: 1990-2008
Month: 1-12
DayofMonth: 1-31
DayOfWeek: 1 (Monday) to 7 (Sunday)
DepTime: actual departure time
CRSDepTime: scheduled departure time
ArrTime: actual arrival time
CRSArrTime: scheduled arrival time
UniqueCarrier: unique carrier code (lookup CSV)⁷
FlightNum: flight number
TailNum: plane tail number
ActualElapsedTime: in minutes
CRSElapsedTime: in minutes
AirTime: in minutes
ArrDelay: arrival delay, in minutes
DepDelay: departure delay, in minutes
Origin: origin IATA airport code (lookup CSV)⁷
Dest: destination IATA airport code (lookup CSV)⁷
Distance: in miles
TaxiIn: taxi in time, in minutes
TaxiOut: taxi out time, in minutes
Cancelled: was the flight cancelled?
CancellationCode: reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
Diverted: 1 = yes, 0 = no
CarrierDelay: in minutes
WeatherDelay: in minutes
NASDelay: in minutes
SecurityDelay: in minutes
LateAircraftDelay: in minutes⁸

Figure 1: 3DFPW model

⁵ Source: https://spark.apache. … html accessed 16 Feb 2018.
⁶ Source: https://zeppelin.apache.org/ accessed 16 Feb 2018.

4.2- Dataset Preparing and Preprocessing
SparkR initiating: SparkR is initiated by loading the SparkR library, specifying the Spark cluster IP and port, and starting a new session from R and RStudio. Reading dataset: a CSV file is read from the Hadoop cluster and converted into a SparkR DataFrame whose partitions are distributed across all Spark cluster machines and cores; the full dataset has 123534969 rows and 12 gigabytes. Preprocessing: SparkR SQL makes it easy to preprocess and prepare the dataset. This work focuses on both departure delay and arrival delay.

The dataset contained many attributes, some of which are irrelevant; the irrelevant attributes were pruned during extensive preprocessing. The resulting data was partitioned into training and test sets. SQL select statement: using a select statement, new columns were created from the dataset to make the data more meaningful to an ordinary passenger who wants to know whether his travel selection will be on time or delayed, as follows. Month: a Month column was derived from the Month column, with month numbers converted into nominal names (Jan, Feb, Mar, etc.). Weekday: a WeekDay column was derived from the DayOfWeek column, with day numbers converted into short day names (1=’Mo’, 2=’Tu’, etc.). UniqueCarrier: taken from the dataset, it is a unique code for each airline company.

Origin: origin airport code. Dest: destination airport code. CRSDepTime: a CRSDepTime column was derived from the dataset CRSDepTime column by grouping times into three short meaningful names: times between 0001 and 1159 into Morning (‘MO_01_to_12’), times between 1200 and 1759 into Afternoon (‘AN_12_to_18’), and times between 1800 and 2359 into Night (‘NI_18_to_24’).⁹ Because canceled flights have no actual DepTime, CRSDepTime (scheduled departure time) was used as the departure time instead. CRSArrTime: treated in the same way as CRSDepTime.
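The recoding described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the SparkR SQL select statement used in this work: the function names are hypothetical, the weekday names after ‘Mo’ and ‘Tu’ and the month names after ‘Mar’ are assumed to follow the same pattern, and times outside the three stated ranges (such as 0000 or 2400) are rejected because the text does not say how they are handled.

```python
# Hypothetical recoding helpers; only the three time labels and the
# 'Mo'/'Tu' and 'Jan'/'Feb'/'Mar' abbreviations come from the text.
MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
WEEKDAY_NAMES = {1: "Mo", 2: "Tu", 3: "We", 4: "Th", 5: "Fr", 6: "Sa", 7: "Su"}


def time_bucket(hhmm: int) -> str:
    """Map an hhmm integer (e.g. 1430) to one of the three period labels."""
    if 1 <= hhmm <= 1159:
        return "MO_01_to_12"
    if 1200 <= hhmm <= 1759:
        return "AN_12_to_18"
    if 1800 <= hhmm <= 2359:
        return "NI_18_to_24"
    # Assumption: values outside the stated ranges are treated as invalid.
    raise ValueError(f"time outside the stated ranges: {hhmm}")


def recode(row: dict) -> dict:
    """Recode the numeric calendar and time columns of one flight record."""
    return {
        "Month": MONTH_NAMES[row["Month"] - 1],
        "WeekDay": WEEKDAY_NAMES[row["DayOfWeek"]],
        "CRSDepTime": time_bucket(row["CRSDepTime"]),
        "CRSArrTime": time_bucket(row["CRSArrTime"]),
    }


if __name__ == "__main__":
    print(recode({"Month": 1, "DayOfWeek": 2, "CRSDepTime": 730, "CRSArrTime": 1905}))
```

Recoding the times into three coarse periods keeps the number of distinct feature values small, which suits a classifier that estimates one probability per category.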

Canceled (0/1): canceled flights were considered delayed flights. Class: the dataset has no class column, so a class was built according to the U.S. Department of Transportation Federal Aviation Administration (FAA) Air Traffic Organization policy on delays to instrument flight rules (IFR) traffic: airborne delays are reported for all aircraft that incur delays of 15 minutes or more.¹¹ Ontime binary class: if the departure delay is less than 15 minutes then ‘yes’; if the departure delay is 15 minutes or more, or the flight is canceled, then ‘no’. Criteria: records with an invalid CRSDepTime were discarded; only rows whose CRSDepTime is less than 2401 are selected, and the same rule applies to CRSArrTime.

4.3- Dynamic Double Delay Flight Predicting Web (3DFPW) model
Criteria: the selected range of years is 1989 to 2008 (19 years), nearly 112M rows. Splitting dataset: after preprocessing, the dataset resulting from the SQL select statement was separated into training and test sets, 80% for training (89585408 rows) and 20% for testing (22390987 rows); both sets are cached in the Spark DataFrame cluster. Naïve Bayes algorithm: the SparkR Naïve Bayes model was run on the training set with the ontime column as the class and the following feature columns: for departure delay (Month, WeekDay, UniqueCarrier, Origin, Dest, Cancelled, and CRSDepTime) in iteration 1, and for arrival delay (Month, WeekDay, UniqueCarrier, Origin, Dest, Cancelled, and CRSArrTime) in iteration 2. Each iteration uses the same select statement and the same criteria, and each was run from beginning to end separately from the other.
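To make the classification step concrete, the following is a minimal single-machine sketch of a Naïve Bayes classifier over categorical columns, together with the 15-minute on-time rule. It is not the SparkR MLlib implementation used in this work: the class name, the helper `ontime_label`, the Laplace smoothing parameter `alpha`, and the six toy records are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def ontime_label(delay_minutes: float, cancelled: bool) -> str:
    """'no' if the flight is cancelled or delayed 15+ minutes, else 'yes'."""
    return "no" if cancelled or delay_minutes >= 15 else "yes"


class CategoricalNB:
    """Toy Naive Bayes for categorical features with Laplace smoothing."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha  # smoothing strength (assumed, not from the paper)

    def fit(self, rows, labels):
        self.n = len(labels)
        self.class_counts = Counter(labels)
        # counts[feature][class][value] = number of training rows
        self.counts = defaultdict(lambda: defaultdict(Counter))
        self.values = defaultdict(set)
        for row, label in zip(rows, labels):
            for feature, value in row.items():
                self.counts[feature][label][value] += 1
                self.values[feature].add(value)
        return self

    def predict_proba(self, row):
        log_post = {}
        for label, n_label in self.class_counts.items():
            # Start from the class prior, then add one log-likelihood per feature.
            lp = math.log(n_label / self.n)
            for feature, value in row.items():
                k = len(self.values[feature])
                seen = self.counts[feature][label][value]
                lp += math.log((seen + self.alpha) / (n_label + self.alpha * k))
            log_post[label] = lp
        # Normalise in log space for numerical stability.
        top = max(log_post.values())
        unnorm = {c: math.exp(lp - top) for c, lp in log_post.items()}
        z = sum(unnorm.values())
        return {c: u / z for c, u in unnorm.items()}

    def predict(self, row):
        proba = self.predict_proba(row)
        return max(proba, key=proba.get)


if __name__ == "__main__":
    # Six made-up training records: (carrier, period, delay minutes, cancelled).
    raw = [("AA", "MO_01_to_12", 3, False), ("AA", "MO_01_to_12", 0, False),
           ("AA", "NI_18_to_24", 10, False), ("UA", "NI_18_to_24", 40, False),
           ("UA", "NI_18_to_24", 0, True), ("UA", "MO_01_to_12", 5, False)]
    rows = [{"UniqueCarrier": c, "CRSDepTime": t} for c, t, _, _ in raw]
    labels = [ontime_label(d, x) for _, _, d, x in raw]
    model = CategoricalNB().fit(rows, labels)
    query = {"UniqueCarrier": "UA", "CRSDepTime": "NI_18_to_24"}
    print(model.predict(query), model.predict_proba(query))
```

Returning a probability alongside the predicted class mirrors what the 3DFPW website presents to the passenger: a status and how likely that status is.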

Prediction: after learning from a training set, prediction was run on the test set to obtain the predicted class for random combinations of feature columns, again for each iteration separately. Confusion matrix: the R confusion matrix library is compatible only with a local R data frame, which works on a standalone machine only; it cannot work with a Spark cluster (Spark DataFrame) and cannot read data larger than the machine’s RAM. Therefore, in this work a confusion matrix was written in R to read from big data and produce results such as accuracy, recall, precision, and F-score, based on related research [2].

Shiny: as shown in Figure 1, interacting online with passengers required a website that can work dynamically with R and SparkR machine learning to achieve the goal of the 3DFPW big data model; Shiny was used for this. The Shiny file has two sections, UI and Server. The select statement of the model is used as the dataset. The Naïve Bayes algorithm was run using the full dataset as the training set, without splitting it into training and test sets. The DepDelay iteration and the ArrDelay iteration were run one after the other.

⁷ Source: https://www.transtats.bts.gov/Fields.asp?Table_ID=236 accessed 17 Feb 2018.
⁸ Source: http://stat-computing.org/dataexpo/2009/the-data.html accessed 17 Feb 2018.
⁹ Source: https://www.fluentu.com/blog/english/how-to-tell-time-in-english/ accessed 17 Feb 2018.
¹⁰ Source: http://spark.apache.org/docs/latest/rdd-programming-guide.html accessed 19 Feb 2018.
¹¹ Source: https://www.faa.gov/documentlibrary/media/order/7210.55fbasic.pdf accessed 17 Feb 2018.

Likewise, both predictions were made in the Server section from the input data entered by the passenger in the UI section. The input data is used as the test set for both predictions.

4.4- Classification Algorithms Comparison
To choose the best classification algorithm for implementing the 3DFPW model, three classification algorithms from the standard SparkR MLlib were tested and compared: Naïve Bayes (NB), Random Forest (RF), and Gradient Boosted Tree (GBT). Accuracy was also compared with related research [5] to strengthen confidence in the process. Criteria: January 2004 instances were selected (583944 rows). Splitting dataset: the output of the same SQL select statement used in the 3DFPW model was separated into training and test sets, 70% for training (407761 rows) and 30% for testing (176183 rows); both sets are cached in the Spark DataFrame cluster. Columns: the three classification algorithms were run on the training set with the ontime column as the class and the columns (ontime, Month, WeekDay, UniqueCarrier, Origin, Dest, Cancelled, and CRSDepTime).

Related research [5]: the author of that work used the same criteria and the same columns as the model in this paper, with a C4.5 algorithm.

4.5- Persisting Performance Evaluation
One of the most important capabilities in Spark is persisting (caching) a dataset in memory across operations.

When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use. You can mark an RDD to be persisted using the persist() or cache() methods.
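The effect of persisting can be illustrated with a toy, single-process stand-in for a lazily evaluated dataset. This is not the Spark API: the class `LazyDataset` and its methods are hypothetical and only mimic the idea that an unpersisted dataset is recomputed by every action, while a persisted one is computed once and reused.

```python
class LazyDataset:
    """Toy lazily evaluated dataset; a conceptual stand-in, not Spark."""

    def __init__(self, compute):
        self._compute = compute      # function that rebuilds the data
        self._persist = False
        self._cached = None
        self.computations = 0        # how many times the data was rebuilt

    def persist(self):
        """Mark the dataset so the first computed result is kept for reuse."""
        self._persist = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        data = self._compute()
        if self._persist:
            self._cached = data
        return data

    # Two "actions" that each need the full data.
    def count(self):
        return len(self._materialize())

    def total(self):
        return sum(self._materialize())


if __name__ == "__main__":
    def build():
        return [x * x for x in range(1000)]

    uncached = LazyDataset(build)
    uncached.count()
    uncached.total()

    cached = LazyDataset(build).persist()
    cached.count()
    cached.total()

    print("uncached recomputations:", uncached.computations)
    print("persisted recomputations:", cached.computations)
```

In an iterative algorithm that runs many actions over the same data, avoiding this recomputation is where the time saving comes from.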

The first time it is computed in an action, it is kept in memory on the nodes. Spark’s cache is fault-tolerant: if any partition of an RDD is lost, it is automatically recomputed using the transformations that originally created it.¹⁰ The caching storage levels are Memory_Only, Memory_And_Disk, Disk_Only, Memory_Only_Ser, and Memory_And_Disk_Ser.¹⁰ To choose the best caching level in both the best case (fully functional Hadoop and Spark clusters) and the worst case (a low number of Spark cores or dead cores), the caching storage levels had to be tested. Performance evaluation: the test measured the processing time of the NB algorithm on all columns of the dataset (29 variables) and over the full range of years (21 years), in order to maximize the load on the algorithm. Each caching level was tested individually in five stages: the first stage runs 6 Hadoop datanodes and 13 Spark cores (executors), the second stage runs 5 Hadoop datanodes and 11 Spark cores, and so on until the last stage with 2 Hadoop datanodes and 5 Spark cores. The cause was then identified by decreasing or increasing the number of nodes and the number of rows for a variety of machine learning algorithms.

5- RESULTS AND DISCUSSIONS
As shown in Figures 2 and 3, some airlines were selected as samples to compare the class label with the predicted class and illustrate the difference.

As a result of the 3DFPW model, and as shown in Tables 1 and 2, the true positive count for DepDelay is better than the true positive count for ArrDelay. As shown in Table 3, the model time in both iterations is nearly 2.5 minutes; however, the accuracy for DepDelay (82%) is better than the accuracy for ArrDelay (78%). Likewise, precision and F-score are better for DepDelay.

Figure 2: Actual flight status against the carriers
Figure 3: Prediction flight status against the carriers
Table 1: DepDelay
Table 2: ArrDelay
Table 3: the test for DepDelay and ArrDelay prediction
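Accuracy, precision, recall, and F-score all follow from the four confusion-matrix counts. The sketch below accumulates those counts from a stream of (actual, predicted) pairs, so the full prediction set never has to sit in memory at once. It is an illustration only, not the R code written for this work; the function names are hypothetical, treating ‘yes’ (on time) as the positive class is an assumption, and the ten example pairs are made up.

```python
from collections import Counter


def confusion_counts(pairs, positive="yes"):
    """Accumulate tp/fp/fn/tn from an iterable of (actual, predicted) pairs."""
    counts = Counter()
    for actual, predicted in pairs:
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts


def metrics(counts):
    """Derive accuracy, precision, recall and F-score from the four counts."""
    tp, fp, fn, tn = counts["tp"], counts["fp"], counts["fn"], counts["tn"]
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}


if __name__ == "__main__":
    # Made-up example: 6 true positives, 2 false positives,
    # 1 false negative, 1 true negative.
    pairs = ([("yes", "yes")] * 6 + [("no", "yes")] * 2
             + [("yes", "no")] + [("no", "no")])
    print(metrics(confusion_counts(pairs)))
```

Because only four counters are kept, the same idea carries over to a distributed setting, where each partition can count locally and the counts are summed.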

The a priori probabilities for the DepDelay iteration are 0.82 for Ontime and 0.18 for Delayed, and for the ArrDelay iteration 0.78 for Ontime and 0.22 for Delayed. The whole dataset is used as the training set to take advantage of the full knowledge and to increase the chance of correctly predicting the incoming data from the passenger, which arrives as a single row; this row is treated as the test set for the prediction process in both iterations.

Prediction time is almost the same in the two iterations. Consequently, the iteration 1 results are better.

Once the passenger enters his combination of flight data and presses the ‘predict ontime status’ button, the status (ontime or delayed) and its probability are displayed in the browser for both delays, as shown in Figure 4.

In answer to the first question, and as shown in Table 4, the NB model has the lowest time (8 seconds) and the highest accuracy (79.8%). RF and GBT have the same accuracy (79.6%) and a higher precision (79.3%). RF takes 10 times as long as NB. GBT is the slowest model at 505 seconds, 63 times the time of NB. When a full range of years, or even more than two years, was selected on the current cluster hardware configuration, RF and GBT could not complete processing and threw errors about connections and executors. NB, however, handled a range of 19 years well. The SparkR NB algorithm has an accuracy (79.8%) better than the C4.5 accuracy (74.3%) of the related work [5].

Prediction time is almost the same in all tests and is calculated for one row only. Therefore, NB is the best algorithm to use in the 3DFPW model because of its accuracy and its time.

In answer to the second question, Table 5 shows the results of running the Naïve Bayes algorithm over the full dataset (123M rows). In the first stage, Memory_Only, Disk_Only, and Memory_And_Disk take almost the same time, whereas Memory_Only_Ser, Memory_And_Disk_Ser, and Uncached take almost the same time as each other, 3.3 times that of the first three caching levels. Memory_Only_Ser and Memory_And_Disk_Ser were therefore not tested further as caching levels, because their time is almost the same as the Uncached time.

In the second stage, Memory_Only, Disk_Only, and Memory_And_Disk take almost the same time, while the Uncached time is 3 times that of these three caching levels. In the third stage, Memory_Only and Memory_And_Disk take almost the same time, but they take 1.9 times as long as Disk_Only and 2.6 times as long as they did in the second stage, whereas the Disk_Only time is just 1.4 times its time in the second stage. This means Memory_Only and Memory_And_Disk are overloaded, because processing the dataset reached the cluster’s memory limit. The Uncached time is 1.4 times that of Memory_Only and Memory_And_Disk and 2.6 times that of Disk_Only.

In the fourth stage, Memory_Only was no longer available: it exceeded the cluster memory limit, raised errors, and did not complete processing. Memory_And_Disk takes 2.2 times as long as Disk_Only and 1.5 times its own third-stage time; Disk_Only takes 1.3 times its third-stage time; the Uncached time is 1.2 times that of Memory_And_Disk and 2.7 times that of Disk_Only. In the fifth stage, Memory_And_Disk also became unavailable, as Memory_Only had in the fourth stage, and Uncached did not complete either.

Figure 4: 3DFPW model webpage
Table 4: Performance classification comparison
Table 5: Persisting performance comparison (time in minutes)
Figure 6: Persisting performance comparison (time in minutes)
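The ratios quoted for the third and fourth stages can be cross-checked against each other, since the Uncached-to-Disk_Only ratio should be roughly the product of the Uncached-to-memory ratio and the memory-to-Disk_Only ratio. The short sketch below does this arithmetic using only the rounded ratios from the text; the function name is hypothetical, and small mismatches are expected because the inputs are themselves rounded.

```python
import math


def chain_ratio(*ratios: float) -> float:
    """Multiply successive time ratios, e.g. (A/B) * (B/C) = A/C."""
    return math.prod(ratios)


if __name__ == "__main__":
    # Stage 3: Uncached is 1.4x the memory levels, which are 1.9x Disk_Only.
    stage3 = chain_ratio(1.4, 1.9)
    # Stage 4: Uncached is 1.2x Memory_And_Disk, which is 2.2x Disk_Only.
    stage4 = chain_ratio(1.2, 2.2)
    print(f"stage 3 implied Uncached/Disk_Only: {stage3:.2f} (text reports 2.6)")
    print(f"stage 4 implied Uncached/Disk_Only: {stage4:.2f} (text reports 2.7)")
```

Both implied values land close to the reported ones, which suggests the quoted ratios are mutually consistent up to rounding.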

As shown in Figure 6, the only caching level that survived in the worst case was Disk_Only. Disk_Only therefore had the best time in all stages, and it is the best caching level to use in the 3DFPW model, giving the best performance and robustness.

To find out why Disk_Only was the best caching storage once the Naïve Bayes algorithm reached the overload limit, a test on a variety of other algorithms was needed. Running ML algorithms on halves and quarters of the dataset showed that if the portion of the dataset does not overload the ML algorithm and the cluster configuration, the three caching storage levels (memory, memory and disk, disk) take the same time. The ML algorithms therefore had to be run at the overload limit, defined as the point at which all three caching storage levels (memory, memory and disk, disk) still complete successfully and take their greatest time. Each algorithm thus had to be run many times, decreasing or increasing the number of nodes and the number of rows, to reach the overload limit. The results are shown in Table 6:

Logit reached the overload limit running on 6 nodes with 43.8M rows, beyond which it fails; Memory_Only has the best caching storage time. Naïve Bayes reached the overload limit running on 4 nodes with 123.4M rows; Disk_Only has the best caching storage time. Random forest reached the overload limit running on 6 nodes with 3.5M rows, beyond which it fails; Memory_And_Disk has the worst caching storage time. K-means reached the overload limit running on 5 nodes with 61.7M rows, beyond which it fails; the three caching levels take almost the same time. Consequently, the best caching storage evidently depends on the ML technique and how it accesses the data when it is overloaded.

Some limitations were found in SparkR version 2.1.0 during the experiments for this paper. For instance, the R confusion matrix does not support Spark DataFrames, so it was implemented programmatically instead.

The ggplot2 library does not support Spark DataFrames, so Apache Zeppelin was used to make bar charts instead. Likewise, the plot library does not support Spark DataFrames. Some well-known classification algorithms, such as C4.5 (decision tree), are not supported in SparkR, although they are supported in PySpark and Scala.

CONCLUSION
Flight delays are a hot topic for passengers. This research introduces a model that presents both the departure delay and arrival delay prediction status to the passenger at the same time through a website, unlike most previous studies, which focused on either departure delay or arrival delay only.

The experiments in this paper showed that predicting departure delay is more accurate than predicting arrival delay, although both are used in the 3DFPW model. After testing RF, GBT, and NB, and comparing with the related research results for the C4.5 algorithm, the NB classification algorithm was the best in SparkR MLlib. Disk_Only had the best time and robustness in all test stages of the Naïve Bayes algorithm, and it is the best caching level to use in the 3DFPW model for best performance.

Pushing a variety of machine learning algorithms to their overload limit, in order to learn why Disk_Only is the best caching storage for the Naïve Bayes algorithm, showed that the best caching storage depends on the ML technique and how it accesses the data when it is overloaded.

In future work, offering the passenger alternatives in the form of the top ten on-time carriers and airports will be considered, as will using Spark from PySpark instead of SparkR for greater efficiency, flexibility, and reach.

REFERENCES
[1] Alice Sternberg, Jorge Soares, Diego Carvalho, Eduardo Ogasawara. A Review on Flight Delay Prediction. arXiv:1703.06118v1 [cs.CY], 2017.
[2] Udeh Tochukwu Livinus, Rachid Chelouah, and Houcine Senoussi. Recommender System in Big Data Environment. IJCSI, ISSN (Online): 1694-0784, 2016.
[3] Michael J. Mazzeo. Competition and Service Quality in the U.S. Airline Industry. Kluwer Academic Publishers, 2003.
[4] Yi Ding. Predicting flight delay based on multiple linear regression. Earth and Environmental Science, 10.1088/1755-1315/81/1/012198, 2017.
[5] C. Ugwu, Ntuk, Ekaete. Dynamic Decision Tree Based Ensembled Learning Model to Forecast Flight Status. European Centre for Research Training and Development UK, Vol. 4, No. 6, pp. 15-24, 2016.
[6] Yufeng Tu, Michael Ball, Wolfgang Jank. Estimating Flight Departure Delay Distributions: A Statistical Approach With Long-term Trend and Short-term Pattern. Robert H. Smith School Research Paper No. RHS 06-034, 2006.
[7] Joep van Montfort & Vincent A.C. van den Berg. The total size of an airline and the quality of its flights. 2017.
[8] Scott Cole, Thomas Donoghue. Predicting departure delays of US domestic flights. 2017.
[9] Shivaram Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael Franklin, Ion Stoica, Matei Zaharia (AMPLab UC Berkeley, Databricks Inc., MIT CSAIL). SparkR: Scaling R Programs with Spark. SIGMOD, San Francisco, CA, USA. ACM, ISBN 978-1-4503-3531-7/16/06, 2016.

Table 6: Persisting performance comparison in overload status

