How To Build A Machine Learning Model For Medical Insurance Classification

Tolulope Oladeji
Nov 7, 2022

“No one plans to get sick or hurt”, but we all need medical care at some point.

Medical insurance is a type of insurance that covers part or all of the medical expenses a person incurs. It covers basic health benefits that are critical to maintaining our health, including the treatment of illnesses and accidents. With medical insurance, one is protected against unforeseen health problems and exorbitant medical costs.

Medical insurance ensures you pay less and get preventive care such as vaccines, routine check-ups, and screenings, even before you meet your deductible. However, not everyone can afford the cost of medical insurance.

This article explains, step by step, how to build a machine learning model using a dataset from a medical insurance company. The model predicts whether or not a person will purchase medical insurance, based on the following features in our dataset:

  • Estimated Salary
  • Age
  • Gender

Now let’s dive in.

STEP 1: Data Collection

The first step in any machine learning project is data collection and understanding. A machine learning model must be trained on enough data to avoid making poor predictions. This article uses a publicly available dataset in .csv format from the repository of a medical insurance company. After data collection, the next step is to set up the local environment.

STEP 2: Setting up the local environment

It is recommended to create an Anaconda environment when setting up your local environment. This helps ensure you have everything needed for your project before you proceed with the task ahead:

  • Ensure you have Python installed (through Anaconda or python.org), along with Jupyter Notebook or Visual Studio Code.
  • Create a local folder to store your code and data.
  • Ensure the Excel or .csv file of your downloaded data is saved to that folder.
  • Lastly, in the local folder, create an empty notebook named medicalinsurance.ipynb, or any name of your choice.

Data Upload:

After installing Python, the necessary extensions, and an IDE you are comfortable with (e.g. VS Code, PyCharm, or Jupyter Notebook), proceed with the data upload. The process of uploading files differs slightly depending on the editor used.

NB: This article uses Jupyter Notebook as the editor

  • Open Anaconda Navigator, then click on Jupyter Notebook.
  • Go to the Jupyter Notebook window that opens in your browser.
  • Navigate to the project folder on your local computer where you saved your CSV file.
  • Press Upload, then select your CSV file in the dialog that opens.
  • Confirm the upload.

STEP 3: Importing necessary libraries and modules.

After uploading the dataset, proceed with importing the necessary libraries. To perform machine learning tasks effectively, certain packages must be installed, for instance NumPy, Pandas, Matplotlib, Seaborn, SciPy, and scikit-learn. They can be installed with the pip command in your Anaconda Prompt or from the integrated terminal in VS Code. After installation, you can import them in your notebook.
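
A typical import cell for this project might look like the sketch below. It assumes the packages are already installed; GaussianNB is an assumption for the Naive Bayes variant, since the article does not say which one was used. Seaborn and SciPy can be imported the same way (for example `import seaborn as sns`) if you need them.

```python
# Core data handling and numerical libraries
import numpy as np
import pandas as pd

# Plotting library
import matplotlib.pyplot as plt

# scikit-learn utilities and the three models explored later in this article
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB  # assumed Naive Bayes variant
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
```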

STEP 4: Data Reading

Now that the libraries are imported, use Pandas to load the dataset. Call pd.read_csv() or pd.read_excel() to read the data and save it in a variable. Then call the .head() method to print the first five rows and confirm that the data has been read correctly.
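
Here is a minimal sketch of the reading step. The real file is not reproduced here, so a few made-up rows stand in for it; the column names (User_ID, Gender, Age, EstimatedSalary, Purchased) and the file name in the comment are assumptions based on the features described in this article.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV file. With a real file you
# would pass its path instead, e.g. pd.read_csv("medical_insurance.csv").
csv_text = """User_ID,Gender,Age,EstimatedSalary,Purchased
1001,Male,19,19000,Not-purchased
1002,Female,35,20000,Not-purchased
1003,Female,26,43000,Not-purchased
1004,Male,47,25000,Purchased
1005,Female,45,26000,Purchased
1006,Male,32,150000,Purchased
"""

df = pd.read_csv(io.StringIO(csv_text))

# Preview the first five rows to confirm the data was read correctly
print(df.head())
```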

The preview of the dataset shows the features and the target variable. Since this is a classification problem, the target variable (the purchased / not-purchased column) was re-encoded as Not-purchased = 0 and Purchased = 1.
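
One way to do this re-encoding is with the Pandas .map() method, as in the sketch below. The column name Purchased and the exact label spellings are assumptions; adjust them to match your file.

```python
import pandas as pd

# Small illustrative frame; the column name and labels are assumed
df = pd.DataFrame(
    {"Purchased": ["Not-purchased", "Purchased", "Not-purchased", "Purchased"]}
)

# Map each text label to its numeric class
df["Purchased"] = df["Purchased"].map({"Not-purchased": 0, "Purchased": 1})

print(df["Purchased"].tolist())
```

Any label that is missing from the mapping becomes NaN, so it is worth checking the column for nulls after this step.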

STEP 5: Data Cleaning and Preprocessing

The goal of data cleaning is to handle errors, missing values, and outliers so that the data is fit for modeling. Check for missing (null) values using .isnull().sum(). There were no missing values in this dataset, so the data is relatively clean and ready for modeling.
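
The check itself is a one-liner. In the sketch below, one value is deliberately left missing in a made-up frame so the output shows what a non-zero count looks like; the dataset used in this article had none.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing Age value (not from the real dataset)
df = pd.DataFrame(
    {
        "Age": [19, 35, np.nan, 47],
        "EstimatedSalary": [19000, 20000, 43000, 25000],
    }
)

# Count the missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```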

STEP 6: Exploratory Data Analysis

It is advisable to generate descriptive statistics (such as the mean, median, mode, and standard deviation) for a dataset before proceeding with modeling. They give a quick summary of each feature.
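
In Pandas, .describe() produces most of these statistics in one call. The sketch below uses a few made-up rows with assumed column names.

```python
import pandas as pd

# Illustrative numeric columns (values are made up)
df = pd.DataFrame(
    {
        "Age": [19, 35, 26, 47, 45],
        "EstimatedSalary": [19000, 20000, 43000, 25000, 26000],
    }
)

# count, mean, std, min, quartiles (the 50% row is the median), and max
summary = df.describe()
print(summary)
```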

STEP 7: Dropping the User_ID column

The User_ID column was dropped because it is only an identifier and is irrelevant to the modeling.
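
A minimal sketch of dropping the column with Pandas, using a small made-up frame:

```python
import pandas as pd

# Illustrative frame containing the identifier column
df = pd.DataFrame(
    {
        "User_ID": [1001, 1002, 1003],
        "Age": [19, 35, 26],
        "EstimatedSalary": [19000, 20000, 43000],
    }
)

# Drop the identifier column; drop() returns a new DataFrame
df = df.drop(columns=["User_ID"])
print(df.columns.tolist())
```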

STEP 8: Data Visualization

Visualization is a significant step that helps in gaining insights into the data. It shows whether the data is normally distributed and helps reveal skewness and potential outliers. Data visualization can be carried out using count plots, box plots, bar charts, histograms, etc.
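
As a rough sketch, the three plots below (a histogram, a box plot, and a bar chart of class counts) can be drawn with Matplotlib alone. The data and column names are made up for illustration; Seaborn offers similar plots such as countplot and boxplot.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data (values are made up)
df = pd.DataFrame(
    {
        "Age": [19, 35, 26, 47, 45, 32, 58, 41],
        "EstimatedSalary": [19000, 20000, 43000, 25000, 26000, 150000, 52000, 61000],
        "Purchased": [0, 0, 0, 1, 1, 1, 1, 0],
    }
)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: shape of the Age distribution (skewness shows up here)
axes[0].hist(df["Age"], bins=5)
axes[0].set_title("Age distribution")

# Box plot: potential outliers in salary appear as isolated points
axes[1].boxplot(df["EstimatedSalary"])
axes[1].set_title("Estimated salary")

# Bar chart of class counts, equivalent to a count plot
df["Purchased"].value_counts().sort_index().plot(kind="bar", ax=axes[2])
axes[2].set_title("Purchased (0 = no, 1 = yes)")

fig.tight_layout()
```

In a Jupyter Notebook the figure is rendered when the cell finishes running.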

STEP 9: Modelling

Ensure the necessary modules are imported before modeling. Split the dataset into the features (Gender, Age, Estimated Salary) and the target variable (Purchase).
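
Separating features from the target can be sketched as follows. The column names are assumptions, and the rows are made up.

```python
import pandas as pd

# Illustrative frame with assumed column names
df = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Female", "Male"],
        "Age": [19, 35, 26, 47],
        "EstimatedSalary": [19000, 20000, 43000, 25000],
        "Purchased": [0, 0, 0, 1],
    }
)

# X holds the input features, y holds the target to predict
X = df[["Gender", "Age", "EstimatedSalary"]]
y = df["Purchased"]

print(X.shape, y.shape)
```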

Also ensure that all categorical variables, such as the Gender feature, are encoded as numerical variables. This can be achieved using label encoding, one-hot encoding, or the Pandas get_dummies() function.
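
Here is a sketch of the get_dummies() route on a made-up feature frame. Using drop_first=True keeps a single 0/1 column for a two-category feature; that choice, and the column names, are assumptions rather than the exact steps used in this article.

```python
import pandas as pd

# Illustrative features with one categorical column
X = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Female", "Male"],
        "Age": [19, 35, 26, 47],
        "EstimatedSalary": [19000, 20000, 43000, 25000],
    }
)

# One-hot encode Gender; drop_first=True leaves one column, Gender_Male,
# and dtype=int stores it as 0/1 instead of booleans
X_encoded = pd.get_dummies(X, columns=["Gender"], drop_first=True, dtype=int)

print(X_encoded)
```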

Train-Test-Split:

Split the dataset into 80% training data and 20% test data using the train_test_split() function. You can also use a 70:30 split, depending on your preference.
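
A minimal sketch of the 80/20 split on randomly generated stand-in data. The random_state value is an arbitrary choice that only makes the split repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data: 100 rows, two features, binary target
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "Age": rng.integers(18, 60, size=100),
        "EstimatedSalary": rng.integers(15000, 150000, size=100),
    }
)
y = pd.Series(rng.integers(0, 2, size=100), name="Purchased")

# test_size=0.2 gives the 80/20 split; use test_size=0.3 for 70/30
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```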

This medical insurance classification problem uses three machine learning models:

  1. Logistic Regression model
  2. Naive Bayes model
  3. Random Forest Classifier

We will explore the three models and go with the best-performing one. Aside from these three, several other models could be used, such as Decision Tree, other ensemble models, CatBoost, etc.

Logistic Regression Model

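The accuracy score is 0.6625, while the F1 score is 0.0. An F1 score of 0.0 means the model did not correctly identify a single purchaser in the test set, so accuracy alone overstates how useful this model is.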
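
Training and scoring a logistic regression model can be sketched as below. The original dataset is not reproduced here, so the sketch uses synthetic data from make_classification as a stand-in, and its scores will not match the figures quoted above. The max_iter and random_state values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with three features and a binary target
X, y = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the model on the training data, then predict on the held-out test data
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print("f1:", f1_score(y_test, y_pred))
```

If you see an F1 score of 0.0 on your own data, one thing worth trying is scaling the features (for example with StandardScaler) before fitting, since Age and Estimated Salary are on very different scales.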

Naive Bayes Model

The accuracy score is 0.85, while the F1 score is 0.77.
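
The same pattern applies to Naive Bayes. The sketch below assumes the Gaussian variant (GaussianNB), which suits continuous features such as age and salary; the article does not state which variant was used. It again runs on synthetic stand-in data, so the scores will differ from those above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data with three features and a binary target
X, y = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit Gaussian Naive Bayes and predict on the test set
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
y_pred = nb_model.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print("f1:", f1_score(y_test, y_pred))
```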

Random Forest Classifier
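
No scores are quoted for the random forest, so the sketch below only shows how such a model could be trained and scored. It uses synthetic stand-in data, and the n_estimators and random_state values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with three features and a binary target
X, y = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a forest of 100 trees; random_state makes the result repeatable
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print("f1:", f1_score(y_test, y_pred))
```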

STEP 10: Model Deployment
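
Deployment details depend on where the model will run. One common first step, sketched here as an assumption rather than the approach used in this article, is to save the trained model to disk so that another program can load it and make predictions. The file name insurance_model.pkl is hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Train a small model on synthetic stand-in data
X, y = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
model = GaussianNB().fit(X, y)

# Save the fitted model to a file (hypothetical name, temporary folder)
model_path = os.path.join(tempfile.gettempdir(), "insurance_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later, or in another program: load the model and predict
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(X[:5]))
```

Only load pickle files that come from a source you trust, and use the same scikit-learn version for saving and loading where possible.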

STEP 11: Model Evaluation
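
Beyond accuracy and F1, a confusion matrix and a classification report show where a classifier goes wrong. The sketch below uses small hand-written label lists, not results from this article's models, so the numbers are easy to verify by hand.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)

# Hand-written true labels and predictions for illustration only
y_test = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(cm)
print("accuracy:", accuracy)
print("f1:", f1)
print(classification_report(y_test, y_pred))
```

Here the matrix shows 5 true negatives, 1 false positive, 1 false negative, and 3 true positives, which gives an accuracy of 0.8 and an F1 score of 0.75.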

Conclusion

Of the three machine learning models used, the Naive Bayes model performed best on our dataset, with an accuracy score of 85% and an F1 score of 0.77. I hope this step-wise guide on how to build and deploy an ML model for medical insurance prediction helps you build ML models that are of interest to you.

Feel free to comment and ask your questions.

Thank you.

Tolulope Oladeji

Environmental Scientist || Geospatial Data Science || Remote sensing || GIS|| Climate Change||