
Introduction

This project analyses tweeting events recorded during several football matches. Since the dataset is comparatively large, big data tools and techniques such as Impala, Hive, Spark and Hadoop have to be used.

In the previous semester, we learned how R and SAS can be used to analyse datasets. In this unit, we extended this knowledge and learned how to analyse big data: not only analysing data passively, but also processing it actively and cleaning it so that it can be analysed. In this project, we apply the big data tools and techniques mentioned above that we learned in this module. The dataset we were given describes Twitter activity and, as the sections below show, contains seven tables, some of which had to be cleaned and prepared before the events could be analysed successfully. Completing this project required various tools and techniques to process large arrays of data.


Of course, in real life datasets will be much larger than this one; however, all the principles are the same. I ran queries that sometimes took several minutes to execute, which is much bigger than anything I have done so far.

Analysis

"Definition: Process Analysis can be understood as the rational breakdown of the production process into different phases that turn input into output. It refers to the full-fledged analysis of the business process, which incorporates a series of logically linked routine activities that use the resources of the organisation, to transform an object, with the aim of achieving and maintaining process excellence."

Data loading is the technique of copying and loading data sets from a source file, folder or application into a database or similar application. It is implemented by copying digital data from a source and pasting or loading it into a data storage or processing application. Data loading is used in database-based extraction and loading techniques.

Typically, such data is loaded into the destination application in a different format from the original source. For example, when data is copied from a word processing file to a database application, the format is changed from .doc or .txt to .csv or .dat. Usually this happens during the last phase of the Extract, Transform and Load (ETL) process: the data is extracted from an external source, transformed into the destination application's supported format, and then loaded.
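As a minimal illustration of the format change described above, the sketch below converts tab-separated text (as might come from a .txt export) into CSV. The function name and the sample data are my own inventions, not part of the project's pipeline.

```python
import csv
import io

def tab_to_csv(tab_text):
    """Convert tab-separated text into CSV, the format a load tool
    typically expects during the Load phase of ETL."""
    out = io.StringIO()
    writer = csv.writer(out)
    for line in tab_text.splitlines():
        writer.writerow(line.split("\t"))
    return out.getvalue()

# A field containing a comma must be quoted in CSV, which csv.writer handles.
sample = "id\tname\n1\tBarcelona, FC"
print(tab_to_csv(sample))
```

Note that the writer quotes the club name because it contains a comma; a naive string replacement of tabs with commas would silently corrupt such records.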

Loading the data

The first time the virtual machine is opened, it gives a warning message saying "The virtual machine appears to be in use". You need to take ownership of the machine and then start it. On slow computers like mine, it may take several minutes to start.

After the first start (and only once), the environment needs to be set up:

/home/training/training_materials/dev1/scripts/training_setup_dev1.sh

Before starting the import I needed to work out how to delete the already loaded data from a batch file. I found it a little harder than I expected.

After referring to the lab notes I figured it out: prior to the data imports, the virtual machine does not need to be wiped, since the scripts can be run and executed several times; each run erases the data first and then starts importing.

mysql -utraining -ptraining -e "DROP DATABASE IF EXISTS Assignment;"
mysql -utraining -ptraining -e "CREATE DATABASE Assignment;"
# remove from hive storage
#hive -e "show tables"
hive -e "show tables '*'" | xargs -I '{}' hive -e 'drop table {}'
hive -e "show tables '*'" | xargs -I '{}' hive -e 'drop view {}'
#hive -e "show tables"
# remove from HDFS
hdfs dfs -rmr /Assignment &> /dev/null
printf '\e[1;32m%-6s\e[0m\n' " done! "
clear

# the commands below copy files to a temporary folder
rm -rf /home/training/Downloads/Ibrahim_Hawjeen/
mkdir /home/training/Downloads/Ibrahim_Hawjeen/
cp 'data.sql(1).gz' /home/training/Downloads/Ibrahim_Hawjeen/
cp SQL/* /home/training/Downloads/Ibrahim_Hawjeen/
cd /home/training/Downloads/Ibrahim_Hawjeen/
mkdir OUTPUT
printf '\e[1;32m%-6s\e[0m\n' " done! "

# this part extracts and renames the data file
gzip -d 'data.sql(1).gz'
mv -f 'data.sql(1)' data.sql
printf '\e[1;32m%-6s\e[0m\n' " done! "

# now we need to load the data into MySQL
mysql -utraining -ptraining Assignment < data.sql &> /dev/null

Repeat this step for the remaining 6 tables. After that, repeat the same 7 import commands without the '--hive-import' option, as importing into Hive removes the data from HDFS. I do not include the code here to stay within the word limit; the script file contains all those commands.

I am not sure how long this operation took, as I left it overnight. To confirm that the data has been successfully imported:

hdfs dfs -ls /Assignment/
hive -e "show tables"

There should be 7 elements in the output, one directory for each table in the database.

Cleansing the data

While cleaning the data, I noticed there were two football clubs, Barcelona and Sevilla, that, as my colleagues said, do not belong to the Premier League. However, I decided not to remove those records for the following reason. This analysis relates to security during football matches. Even if those clubs do not take part in the Premier League, they still play, and therefore there will be security concerns during those games too, whether they belong to the Premier League or not. Looking at the game table, I discovered that the start and end times for game 17 were inconsistent: the end time was before the start time.
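A consistency check of this kind is easy to automate. The sketch below flags any game whose recorded end time is not after its start time; the sample timestamps are invented for illustration, only the inconsistency for game 17 is from the dataset.

```python
from datetime import datetime

def inconsistent_games(games):
    """Return ids of games whose recorded end time is not after the start."""
    fmt = "%Y-%m-%d %H:%M:%S"
    bad = []
    for game_id, start, end in games:
        if datetime.strptime(end, fmt) <= datetime.strptime(start, fmt):
            bad.append(game_id)
    return bad

# Toy rows shaped like the game table (id, start, end); the times are
# made up, but game 17 reproduces the kind of inconsistency found.
rows = [
    (16, "2017-11-25 15:00:00", "2017-11-25 16:50:00"),
    (17, "2017-11-26 14:00:00", "2017-11-26 13:50:00"),
]
print(inconsistent_games(rows))  # -> [17]
```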

Having examined the tweets table, I found that the end time for this game was wrong, and updated it accordingly. I did this immediately after importing into MySQL but before importing into the Hive metastore.

printf "Cleaning data......"
mysql -utraining -ptraining -e "use Assignment; update game set officialend='2017-11-26 15:50:00' where id = 17;"
# any more cleaning to be done?
printf '\e[1;32m%-6s\e[0m\n' " done! "

Problem 1

To solve this problem I used several tables to build a multi-level query; each table can be seen as a stepping stone towards solving the complex problem. To begin, I needed to find all the tweets that were about a particular game, but to do that I first had to discover which tweets were related to a particular team and sent during a game. I decided to use only the tweets that carried the exact hashtag of an official football club and that were created between the game's start and end times.

This could be improved by searching for tweets whose hashtags contain the actual team name, partially match an official club hashtag, or use other nicknames; however, that would mean searching through all the tweet hashtags with more queries and possibly creating a look-up table (LUT) of all the alternate names.

With this information available, I needed to match all the tweets carrying official team hashtags to the games that involved those teams, within the game times. For this I needed a couple more tables: one showing all the games with each team on its own record (row), and another matching every game to a tweet by its official club hashtag, provided the tweet fell within the game time. The view 'gamez' was simply two copies of the game table placed on top of one another using a UNION: the first SELECT returned each game with team 1's name and times, the second did the same for team 2. Hive did not support this UNION directly, so I created the table first, inserted the 48 records from the first SELECT, and then inserted the second part of the UNION with a separate INSERT statement. Then, using the 'gamez' view, matching the hashtag ids to the tweets, checking that each tweet fell between the start and end time of a game, and grouping by game id produced the view 'tweets_per_game', which matched every tweet to its related game.

The next step was to calculate the tweets per minute; this is where problems began to occur.
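The matching logic above can be sketched in plain Python on toy data. The field names, hashtags and times below are assumptions for illustration; the real work was done in SQL views, not in Python.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def tweets_per_game(games, tweets):
    """games: (game_id, team_hashtag, start, end) -- one row per team,
    i.e. the flattened 'gamez' shape with each team on its own record.
    tweets: (hashtag, created_at). Returns {game_id: tweet_count} for
    tweets whose hashtag matches a team and whose time falls in the game."""
    counts = {}
    for game_id, tag, start, end in games:
        s, e = datetime.strptime(start, FMT), datetime.strptime(end, FMT)
        for t_tag, created in tweets:
            if t_tag == tag and s <= datetime.strptime(created, FMT) <= e:
                counts[game_id] = counts.get(game_id, 0) + 1
    return counts

games = [
    (1, "#MUFC", "2017-11-25 15:00:00", "2017-11-25 16:50:00"),
    (1, "#CFC",  "2017-11-25 15:00:00", "2017-11-25 16:50:00"),
]
tweets = [
    ("#MUFC", "2017-11-25 15:30:00"),
    ("#CFC",  "2017-11-25 16:00:00"),
    ("#MUFC", "2017-11-25 18:00:00"),  # after full time -- excluded
]
print(tweets_per_game(games, tweets))  # -> {1: 2}
```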

Impala doesn't support all the functions that MySQL offers, especially around dates. To calculate how long each game lasted, I had to cast the strings holding the start and end dates and times to the TIMESTAMP data type, then use the unix_timestamp function to convert them into seconds since '1970-01-01 00:00:00' UTC. From there I subtracted the start seconds from the end seconds to get the difference, and divided the result by 60 to get minutes. Dividing the total number of tweets by this duration gave the average number of tweets per minute. I did not create a view for the last step; I simply used a query to return a table that answered the problem statement.
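The arithmetic of that calculation is worth spelling out. The sketch below mirrors the query logic in Python; the 110-minute game and the tweet count are illustrative numbers, not a claim about any particular game in the dataset.

```python
from datetime import datetime

def avg_tweets_per_minute(start, end, total_tweets):
    """Mirror of the query logic: parse the start/end strings,
    take the difference in seconds, divide by 60 to get minutes,
    then divide the tweet count by that duration."""
    fmt = "%Y-%m-%d %H:%M:%S"
    seconds = (datetime.strptime(end, fmt)
               - datetime.strptime(start, fmt)).total_seconds()
    return total_tweets / (seconds / 60.0)

# A 110-minute game with 63606 tweets averages about 578 tweets/minute.
print(avg_tweets_per_minute("2017-11-26 14:00:00", "2017-11-26 15:50:00", 63606))
```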

Having calculated the tweets per minute, I ordered the table in descending order of tweets during the game and limited the results to the top 10, also showing the tweets per minute. A limitation of taking only the top ten is that it doesn't account for multiple games having exactly the same number of tweets, but the likelihood of that is minimal, so I don't believe it is worth handling.

Problem 2

Task 2 is very similar to task 1, but instead of unique tweets it counts unique users (user ids). The rest of the logic is the same. However, because these are not the same values, the games returned at the top of the two queries are different.

Problem 4

While solving this query, I ran into a very strange problem. I converted the logic from MySQL into Hive exactly, but the output turned out to be very different. I managed to establish that this is because Hive handles the date-time format very differently from MySQL, but unfortunately I did not have time to solve it.

The query outputs 6 records even though I asked for the top 10, and only a few of them coincide with the records returned by the original statement designed for MySQL.

Problem 6 and Visualisation

This query was really challenging. The only way I found to calculate the number of tweets per minute was to use a cursor. However, I discovered that cursors in MySQL can only be used within a stored procedure.

Therefore neither Impala nor Hive would be able to run it. As I could not figure out how to calculate the number of tweets for every minute in those tools, I decided to run the MySQL code just to generate a CSV file with the tweets and then plot them using a Python program. I understand that we were supposed to use either Impala or Hive, but since I could not find a way to run this query there, I used MySQL so that I could at least apply my Python skills to plot the data.
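For what it is worth, the per-minute series the cursor produced can also be computed without cursors, by grouping timestamps into minute buckets. The sketch below shows the idea in Python on invented timestamps; it is an alternative approach, not the method actually used in the assignment.

```python
from collections import Counter
from datetime import datetime

def tweets_each_minute(created_times):
    """Count tweets in each minute bucket -- the per-minute series the
    cursor-based stored procedure produced, done here by grouping."""
    fmt = "%Y-%m-%d %H:%M:%S"
    buckets = Counter(
        datetime.strptime(t, fmt).strftime("%Y-%m-%d %H:%M")
        for t in created_times
    )
    return dict(sorted(buckets.items()))

times = [
    "2017-11-25 15:00:05",
    "2017-11-25 15:00:59",
    "2017-11-25 15:01:10",
]
print(tweets_each_minute(times))
# -> {'2017-11-25 15:00': 2, '2017-11-25 15:01': 1}
```

The same grouping can be expressed in SQL with a GROUP BY on the timestamp truncated to the minute, which avoids the stored-procedure restriction entirely.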

The beauty of the stored procedure is that it takes a parameter, the game id, and displays the game progress for that given game. The stored procedure is called at the very end. If called directly, it sends its output to the screen. However, it is also possible to send the output to a text file:

mysql -utraining -ptraining Assignment -e "SELECT COUNT(*) FROM vx6; SELECT MAX(tweets) FROM vx6; SELECT tweets FROM vx6 ORDER BY id;" &> OUTPUT/q6.csv

This line redirects the output into the file q6.csv under the folder OUTPUT. Once that file has been created, we can call the Python program:

python2.7 q6.py

The program opens the text file (which must be located in the sub-folder OUTPUT and named q6.csv). It reads the number of tweets first, then the maximum number of tweets per minute. This value is used later to determine the vertical scale: if it is too high, the graph goes off screen.

If it is too low, all values stay close to the horizontal line. So I looked at various games and came up with the following rules: if the maximum number of tweets per minute is between 350 and 700, leave the scale unchanged (variable scale is set to 1). If it is over 700, divide by 2 (scale is 0.5). If it is over 1500, divide by 4 (scale is 0.25). If, however, the maximum number of tweets is less than 350, the graph is scaled up by setting scale to 2.
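These scaling rules can be sketched as a single function. The thresholds are taken from the text; the exact behaviour at the boundary values (e.g. exactly 700 or 1500) is my assumption, as the original q6.py is not shown.

```python
def choose_scale(max_per_minute):
    """Pick the vertical scale factor from the maximum tweets-per-minute,
    following the thresholds described in the text. Boundary handling
    at the exact threshold values is an assumption."""
    if max_per_minute >= 1500:
        return 0.25
    if max_per_minute > 700:
        return 0.5
    if max_per_minute >= 350:
        return 1
    if max_per_minute >= 150:
        return 2
    return 4

for m in (100, 200, 500, 1000, 2000):
    print(m, choose_scale(m))
```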

If it is less than 150, all the numbers are multiplied by 4. This way the code can adapt to different scenarios and display nicely games where the number of tweets per minute differs greatly.

Visualisation turned out to be fun, as I had learnt a little Python in the past. I used to draw using the graphics library Turtle, and that helped a lot! Having investigated how to transfer data from the query to the Python program, it was very easy to plot the data points on a plane.

Once the program has calculated the number of tweets per minute, it sums them up and cross-checks the total against the number of tweets returned in query 1. As you can see in the on-screen output, they are the same:

+--------------------+
| tweets_guring_game |
+--------------------+
| 63606              |
+--------------------+
+-----------------------------------------------------------------------------------+
| CROSS CHECKING                                                                    |
+-----------------------------------------------------------------------------------+
| Comparing tweets_guring_game with number of tweets for that game calculated in Q1 |
+-----------------------------------------------------------------------------------+
+--------+
| tweets |
+--------+
| 63606  |
+--------+

Conclusion

This was a fantastic assignment. Even though it turned out to be very challenging and much more difficult than I expected, I truly enjoyed it. Importing the data was difficult, but writing the queries turned out to be even more difficult. If I had to do it again, I would start working on the assignment much earlier, and would also pay more attention to the labs. Having revisited them while working on the assignment, I discovered that we covered absolutely all of it during the labs; I wish I had paid more attention!

