Summary Of Insights Generated From A Digital Lending Company
Generating insights from a large dataset can be a herculean task, because it demands an analytical mindset and the skills to do so effectively and without bias.
As a data scientist, you are most likely a researcher and thinker who explores and models data, but a few software engineering practices still help:
1.) Study how to write clean and modular code
2.) Improve code efficiency
3.) Add effective documentation
4.) Use Version Control
Most data scientists work in teams and share code, ideas, and insights across their organizations in their area of expertise.
The most effective way to communicate your discoveries is through simple, unambiguous analysis; the more insight you can generate and the cleaner your model, the better your results and predictions.
Below is a summary of the insights generated from a dataset collected from a digital lending company, which prides itself on its effective use of credit risk models to deliver profitable and high-impact loan alternatives. The company's assessment approach is based on two main risk drivers of loan default prediction:
i) Ability to pay
ii) Willingness to pay.
Since not all customers repay their loans, the company needs experienced data scientists to build robust models that predict the odds of repayment. Matplotlib and Seaborn will be used for visualization.
#IMPORTING THE NECESSARY LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
#reading the datasets
df1 = pd.read_csv('traindemographics.csv')
df2 = pd.read_csv('trainperf.csv')
df3 = pd.read_csv('trainprevloans.csv')
After dropping the columns that were repeated across the datasets, they were merged using the concat() method:
df = pd.concat([df1, df2, df3], axis=1)
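The drop-then-concatenate step above can be sketched on toy data (the column names here are assumptions, not the competition's real schema):

```python
import pandas as pd

# Toy stand-ins for two of the CSVs; 'customerid' is the shared key
df1 = pd.DataFrame({"customerid": [1, 2], "birthdate": ["1990-01-01", "1985-05-05"]})
df2 = pd.DataFrame({"customerid": [1, 2], "loanamount": [10000, 20000]})

# Drop the repeated key column from one frame before concatenating side by side
df2 = df2.drop(columns=["customerid"])
df = pd.concat([df1, df2], axis=1)
```

Note that concat with axis=1 aligns rows by index, so it only works if the rows of each frame already correspond one-to-one; when they do not, pd.merge on the shared key is the safer choice.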
Data Cleaning
Duplicate rows were also dropped, and the date/time columns were converted from strings into datetime objects so that the machine can work with them and features like year, month, day, and quarter can be extracted.
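These two cleaning steps look roughly like this on toy data (the sample values are illustrative, not from the real dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "customerid": [1, 1, 2],
    "approveddate": ["2017-07-01 10:00:00", "2017-07-01 10:00:00", "2017-07-05 14:30:00"],
})

df = df.drop_duplicates()                                # remove exact duplicate rows
df["approveddate"] = pd.to_datetime(df["approveddate"])  # string -> datetime64
```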
date_column = ['firstrepaiddate', 'firstduedate', 'closeddate', 'approveddate', 'creationdate']
def extract_date(df, cols):
    for x in cols:
        df[x] = pd.to_datetime(df[x], errors='coerce')  # ensure datetime dtype before using .dt
        df[x + '_year'] = df[x].dt.year
        df[x + '_month'] = df[x].dt.month
        df[x + '_day'] = df[x].dt.day
        df[x + '_quarter'] = df[x].dt.quarter
    df.drop(columns=cols, inplace=True)  # drop the original date columns

extract_date(df, date_column)
Missing values in categorical columns were filled with the mode or by backward/forward fill, while missing values in numerical columns were filled with the median or mean.
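A minimal sketch of that imputation strategy, using assumed column names and toy values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "bank_account_type": ["Savings", None, "Savings", "Current"],
    "loanamount": [10000.0, np.nan, 30000.0, 20000.0],
})

# Categorical column: fill with the mode (most frequent value)
df["bank_account_type"] = df["bank_account_type"].fillna(df["bank_account_type"].mode()[0])
# Numerical column: fill with the median
df["loanamount"] = df["loanamount"].fillna(df["loanamount"].median())
```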
LabelEncoder() and OneHotEncoder() were also used to transform some categorical data into numerical form. After ensuring the dataset was clean and fit, Matplotlib and Seaborn were used for data visualization.
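For example, LabelEncoder maps each category to an integer (the column name and values below are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"bank_account_type": ["Savings", "Current", "Savings", "Other"]})

# LabelEncoder assigns integers to the sorted unique categories
le = LabelEncoder()
df["bank_account_type_enc"] = le.fit_transform(df["bank_account_type"])
```

OneHotEncoder (or the pandas equivalent, pd.get_dummies) would instead expand the column into one binary indicator column per category, which avoids implying an ordering between categories.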
import seaborn as sns
sns.countplot(data=df, x='loannumber', hue='bank_account_type')
Customers with savings accounts tend to apply for loans more than those with current or other account types.
Also, most of their customers do not default.
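A claim like this can be checked quickly with value_counts; here is a sketch on toy data (the good_bad_flag labels are assumptions about the target column):

```python
import pandas as pd

df = pd.DataFrame({"good_bad_flag": ["Good", "Good", "Good", "Bad"]})

# normalize=True returns the share of each class rather than raw counts
share = df["good_bad_flag"].value_counts(normalize=True)
```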
plt.figure(figsize=(7,5))
sns.countplot(y='loanamount', hue='good_bad_flag', data=df)
plt.title('loanamount & good_bad_flag')
plt.show()
Customers with average loan amounts tend not to default compared to those with the highest or lowest loan amounts.
Also, customers with fewer loans tend to repay early compared to those with higher loan counts.
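Comparisons like this can also be made numerically with a groupby, as in this sketch on toy data (column names and values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "loannumber": [1, 2, 2, 5, 6],
    "good_bad_flag": ["Good", "Good", "Good", "Bad", "Bad"],
})

# Average loan number per repayment class
avg = df.groupby("good_bad_flag")["loannumber"].mean()
```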