FLIGHT DELAY PREDICTION: MACHINE LEARNING TO THE RESCUE

Tolulope Oladeji
9 min read · May 20, 2021

Echoing the words of the great naturalist Charles Darwin, “A man who dares to waste one hour has not discovered the value of life”.

Imagine having to transport a critically ill patient to a better-equipped hospital for proper treatment, only for the booked flight to be delayed because of negligence on the part of the airline’s management team. You would not want to be in such a situation.

William Shakespeare once said, “Defer no time, delays have dangerous ends”. Time is crucial to everyone, irrespective of gender, race or color.

credit: safetyskills.com

BRIEF OVERVIEW

Flight delays adversely affect not only passengers but also the companies involved.

Over the years, most airlines have refused to compensate passengers affected by delays, and this has often resulted in conflicts between airline management teams and the passengers involved.

Delays have often led to flight cancellations, forcing airlines to pay heavily to reimburse the passengers involved.

Like any other business, the airline business is profit-oriented, so losing money to delays and cancellations hurts the airline more with every recurring case. Passengers have often filed court claims against airlines. This not only damages the company’s reputation but also reduces public confidence and patronage from potential future passengers.

TUNISAIR

Tunisair is the flag carrier airline of Tunisia. Formed in 1948, it operates scheduled international services to four continents. Its main base is Tunis–Carthage International Airport. The airline’s head office is in Tunis, near Tunis Airport. Tunisair is a member of the Arab Air Carriers Organization.

credit:behance.net

The challenge addressed here was posed by Tunisair. Flight delays not only irritate air passengers and disrupt their schedules but also cause:

> a reduction in efficiency

> a rise in capital costs, rearrangement of flight crews and aircraft

> additional crew expenses

As a result, an airline’s overall record of flight delays may reduce passenger demand.

What is Machine Learning?

Machine learning is a subset of artificial intelligence: it is the ability of computers to adapt to new circumstances and to detect and extrapolate patterns. It consists of techniques that enable computers to figure things out from data and deliver AI applications (Russell and Norvig 2016).

Credit: Prattay Chakrabarty

Machine learning is a broad field. It supports decision making by examining patterns in past data to predict future occurrences. It can be defined as the process of solving a practical problem by gathering a dataset and algorithmically building a statistical model based on that dataset (Russell and Norvig 2016).

Types of Machine Learning

· Supervised learning

· Unsupervised learning

· Reinforcement learning

1.) Supervised Learning

This deals with labelled datasets. Each dataset has a set of input variables (x) and an output variable (y). An algorithm learns the mapping function between the input and output variables. The relationship is y = f(x).

The learning is supervised in the sense that the correct output is already known, and the algorithm is improved at each step to optimize its results. The algorithm is trained on the dataset and adjusted until it attains an adequate level of performance.

Supervised learning problems can be grouped as:

Classification: This deals with non-continuous (discrete) values. Labelled examples train the algorithm to assign items to a specific category, e.g., yes or no, alive or dead, male or female.

Regression: This deals with continuous values (e.g., height, weight, price). It estimates how variables relate to one another, e.g., predicting a student’s exam score from their test score.
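To make the idea of learning y = f(x) concrete, here is a minimal regression sketch using the exam-score example above. The numbers are made up for illustration, and a straight-line fit with NumPy is only one of many ways to learn such a mapping.

```python
import numpy as np

# Hypothetical labelled data: test scores (input x) and exam scores (output y).
test_scores = np.array([40.0, 50.0, 60.0, 70.0, 80.0])
exam_scores = np.array([45.0, 55.0, 65.0, 75.0, 85.0])

# Learn the mapping y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(test_scores, exam_scores, deg=1)


def predict_exam_score(test_score):
    """Apply the learned mapping f to a new, unseen input."""
    return slope * test_score + intercept


predicted = predict_exam_score(65.0)
print(round(predicted, 2))
```

Because the output (exam score) is known for every training row, the fit can be checked directly against the labels, which is what makes the problem supervised.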

2.) Unsupervised Learning

The datasets are unlabelled: the output (target) is unspecified and there are numerous input variables. The algorithm learns by itself and detects structure in the data.

Unsupervised learning problems can be grouped as follows:

Clustering: This means grouping inputs with similar features together, e.g., grouping users based on their search history.

Similarities
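As a rough illustration of clustering, the sketch below groups four imaginary users by how often they search for two topics. The data and the choice of k-means from scikit-learn are assumptions made for the example; no labels are given to the algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical search counts per user: [sports searches, cooking searches].
search_counts = np.array([
    [20, 1],
    [18, 2],
    [1, 22],
    [2, 19],
])

# No target column is supplied; k-means groups similar rows on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(search_counts)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is that the two sports-heavy users end up in one group and the two cooking-heavy users in the other.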

3.) Reinforcement learning

With this type of machine learning approach, models are trained to make a chain of decisions based on the feedback they receive for their actions.

Reinforcement learning differs from supervised learning in that no correct answer is available, so the reinforcement agent decides which steps to take to perform the task. The machine learns from its own experience when no training dataset is available.

MACHINE LEARNING FLOW CHART

Credit: www.imcorp.jp

Good quality data is fed to the machines, and different algorithms are used to build ML models to train the machines on this data. The choice of algorithm depends on the type of data at hand, and the type of activity that needs to be automated.

Steps of Machine Learning

1.) Data Gathering

2.) Preparing the data (data cleaning and preprocessing)

3.) Model selection

4.) Training of the model

5.) Evaluation

6.) Hyperparameter Tuning

7.) Prediction

8.) Deployment

credit: stockphotos

PROGRAMMING LANGUAGE USED (PYTHON)

Python is arguably the most popular programming language for machine learning applications because of the benefits explained below. Other programming languages that can be used for machine learning include R, C++, JavaScript, Java, C#, Julia, Shell and TypeScript.

Credit: stoodnt.com

Python is known for its readability and relatively low complexity compared with other programming languages. Machine learning applications involve complex mathematical concepts such as calculus, matrices and linear algebra, which take a lot of effort and time to implement. Python eases this burden by letting the ML engineer implement and validate an idea quickly. Another advantage of using Python for machine learning is its rich ecosystem of third-party libraries. There are packages for different types of applications: NumPy for numerical arrays, OpenCV for computer vision, scikit-learn for classical machine learning, Matplotlib and Seaborn for visualization, TensorFlow and PyTorch for deep learning, SciPy for scientific computing, Django for integrating web applications, and Pandas for high-level data structures and analysis.

Python is a flexible programming language that runs on all major platforms, including Windows, macOS and Linux. When moving from one platform to another, the code may need some minor adjustments before it works on the new platform.

OBJECTIVE OF THE STUDY

This study aims to predict the estimated duration of the delay for each flight.

METHODOLOGY

The data used for this study were produced by Tunisair in conjunction with Zindi. Here is the link to the dataset used for this study. Below is a list of variable definitions for terms peculiar to the airline industry.

STEP1: Data Gathering

The dataset used for this study was made available by Tunisair in conjunction with Zindi through the AI Hack challenge in Tunisia. Check the link provided to read more about the dataset. Data is the new oil, and like oil and its end products, it is not easy to obtain. Hence, every data scientist should learn web scraping and SQL; no one will spoon-feed you.

STEP2: Preparing the data (data cleaning, preprocessing, visualization)

The dataset used for this study was fairly clean; it had evidently been worked on before being made available to the public. This is not common, as most real-life big data is extremely dirty and requires thorough exploratory work to clean it and make it ready for use. A machine learning model is only as good as the dataset used to build it: garbage in, garbage out. It is advisable to ensure the dataset is clean before modelling.

This dataset is dominated by date fields, which made it hard to visualize clearly, though with seaborn and Matplotlib I was still able to visualize it to some extent. Tableau and Power BI proved extremely limited for this data, so I decided to stick with seaborn and Matplotlib. It is advisable to visualize as much as possible for a better graphical understanding of the dataset. You can check here for a few of my visualizations on Tableau Public with other datasets.

Importing the necessary libraries

Checking for missing values
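A check like this can be sketched as follows. The small table and its column names are hypothetical stand-ins, not the actual Tunisair data; the same isnull().sum() call works on any pandas DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the flight table; the real column names may differ.
flights = pd.DataFrame({
    "departure_airport": ["TUN", "DJE", "TUN"],
    "scheduled_departure": [
        "2016-01-03 10:30:00",
        "2016-01-13 15:05:00",
        "2016-01-16 04:10:00",
    ],
    "delay_minutes": [260.0, 20.0, 0.0],
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = flights.isnull().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
```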

This dataset has no missing values, so no imputation is needed here. Pandas provides datetime utilities, such as pandas.to_datetime, for the various datetime conversions we may need, depending on what we want to achieve.

Datetime conversion
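One way such a conversion might look is sketched below. The column names (scheduled_departure, scheduled_arrival) and the derived features are illustrative assumptions, not necessarily the ones used in this study.

```python
import pandas as pd

# Hypothetical schedule columns stored as plain strings.
flights = pd.DataFrame({
    "scheduled_departure": ["2016-01-03 10:30:00", "2016-01-13 15:05:00"],
    "scheduled_arrival": ["2016-01-03 12:55:00", "2016-01-13 16:55:00"],
})

# Convert the string columns to proper datetime values.
for column in ["scheduled_departure", "scheduled_arrival"]:
    flights[column] = pd.to_datetime(flights[column])

# Derive numeric features that a regression model can consume.
flights["departure_month"] = flights["scheduled_departure"].dt.month
flights["departure_hour"] = flights["scheduled_departure"].dt.hour
flights["departure_weekday"] = flights["scheduled_departure"].dt.dayofweek  # Monday = 0
flights["planned_duration_min"] = (
    flights["scheduled_arrival"] - flights["scheduled_departure"]
).dt.total_seconds() / 60

print(flights[["departure_hour", "departure_weekday", "planned_duration_min"]])
```

Splitting a timestamp into parts like month, hour and weekday gives the model numeric inputs it can use directly, which a raw date string does not.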

Computers work in binary; likewise, most machine learning models handle numerical values more easily than categorical variables. Categorical encoding techniques can be used to convert categorical variables into numerical ones. pandas.get_dummies was used for this study.
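A minimal sketch of one-hot encoding with pandas.get_dummies is shown here. The departure_airport column and its values are hypothetical; the real dataset has its own categorical columns.

```python
import pandas as pd

# Hypothetical table with one categorical and one numeric column.
flights = pd.DataFrame({
    "departure_airport": ["TUN", "DJE", "TUN"],
    "planned_duration_min": [145.0, 110.0, 155.0],
})

# Replace the categorical column with one 0/1 indicator column per category.
encoded = pd.get_dummies(flights, columns=["departure_airport"], dtype=int)
print(encoded)
```

Each category becomes its own column, so a column with many distinct values (for example, hundreds of airports) can widen the table considerably.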

STEP3: Model Selection

Model selection wasn’t an easy task. Since this study addresses a regression problem rather than a classification problem, the search was limited to regression models. Linear regression, XGBoost regressor, RandomForestRegressor and LightGBM regressor were tried to determine the best performer.

STEP4: Training of the Model

Models were trained using linear regression, random forest regressor, XGBoost (extreme gradient boosting) and LightGBM (light gradient boosting). This was done after splitting the data into train and test sets.
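The split-then-train pattern can be sketched as below. The data is synthetic, the 80/20 split ratio is an assumption, and only the two scikit-learn models are shown; XGBoost and LightGBM regressors are left out to keep the sketch dependent on scikit-learn alone.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix X and delay target y.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 1, size=200)

# Hold out 20% of the rows so the models can be checked on unseen data.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit each candidate regressor on the training portion only.
models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "trained on", X_train.shape[0], "rows")
```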

STEP5: Evaluation

Once the model has been trained on a defined training set, it needs to be checked for discrepancies and errors. We use a fresh set of data to accomplish this task. The evaluation metric used for this dataset is Root Mean Square Error (RMSE), as required by Tunisair and Zindi.
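RMSE is the square root of the mean of the squared differences between actual and predicted values, so large errors are penalized more heavily than small ones. A minimal sketch with made-up delay values (in minutes):

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Square Error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical actual and predicted delays for three flights.
actual_delays = [0.0, 20.0, 60.0]
predicted_delays = [10.0, 20.0, 40.0]

score = rmse(actual_delays, predicted_delays)
print(round(score, 2))
```

The result is in the same unit as the target, so a lower RMSE means the predicted delays are, on average, closer to the actual ones.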

STEP6: Hyperparameter Tuning

Hyperparameters influence how the model’s parameters are learned during training. The output of your model depends on its learned parameters, which are updated repeatedly during the training phase. If you set the right hyperparameters, your model will learn the best weights it can with a given training algorithm and data.

Finding good hyperparameters is often done manually. It’s a process of trial and error, guided by some intelligent estimation.

Finding better hyperparameters may boost the model’s performance.
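A manual trial-and-error search can be as simple as the loop sketched below. The data is synthetic, and tuning max_depth of a random forest over three candidate values is an assumption chosen for illustration, not the exact search used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and delay target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 1, size=200)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Try a few candidate values and record the validation RMSE for each.
results = {}
for max_depth in [2, 4, 8]:
    model = RandomForestRegressor(
        n_estimators=50, max_depth=max_depth, random_state=42
    )
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)
    results[max_depth] = float(np.sqrt(mean_squared_error(y_valid, predictions)))

# Keep the setting with the lowest validation error.
best_max_depth = min(results, key=results.get)
print(results, "best:", best_max_depth)
```

scikit-learn also offers GridSearchCV and RandomizedSearchCV, which automate the same idea with cross-validation.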

STEP7: Prediction

The best-performing model was used to predict on the test dataset after a series of hyperparameter tuning runs.

STEP8: Deployment of the model for production.

The model wasn’t deployed to production; only a submission file was made available to the partners involved.
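Producing predictions and writing them to a submission file might look like the sketch below. The toy model, the flight identifiers and the "ID" and "target" column names are all hypothetical; the required submission format is defined by the competition.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data standing in for the prepared features and delays.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([10.0, 20.0, 30.0, 40.0])
model = LinearRegression().fit(X_train, y_train)

# Hypothetical test rows with their identifiers.
test_ids = ["flight_a", "flight_b"]
X_test = np.array([[5.0], [6.0]])

# Pair each identifier with its predicted delay.
submission = pd.DataFrame({"ID": test_ids, "target": model.predict(X_test)})

# Write to an in-memory buffer here; pass a file path to to_csv to save to disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```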

RESULTS

This study set out to build a flight delay predictive model using machine learning techniques. Accurate prediction of flight delays will help all players in the air travel ecosystem to set up effective action plans to reduce the impact of delays and avoid loss of time, capital and resources. The code accompanying this study can be found in my GitHub repository. The machine learning model built can be used to address this flight delay challenge. Though it is not 100% accurate, it can be improved with further hyperparameter tuning and ensemble methods.

CONCLUSION

In conclusion, Tunisair can now ascertain which specific airlines cause delays and which are more efficient and time conscious. Passengers can also choose the airline that best fits their travel time. Tunisair and its passengers can now relate without conflicts and disagreements, and the airlines causing delays can be advised to improve their services.

My warmest appreciation goes to the entire management of She Code Africa for this wonderful and rare opportunity. The community is filled with beautiful and great minds. To the Amazon herself, Ada Nduka Oyom, and the other members of the team: God bless you all for this selfless contribution to the growth of African women in tech. I look forward to giving back to the community soon.

REFERENCES

Russell, S. J. and Norvig, P. (2016). Artificial Intelligence: A Modern Approach, 3rd ed., Pearson. ISBN 9780136042594.

Tolulope Oladeji

Environmental Scientist || Geospatial Data Science || Remote sensing || GIS|| Climate Change||