Tolulope Oladeji
3 min read · Mar 23, 2021

Summary Of Insights Generated From A Digital Lending Company

Generating insights from a large dataset can be a herculean task, because it takes an analytical mindset and the skills to work through the data without prejudice.

A data scientist is most likely a researcher and thinker who explores and models data, while also picking up some software engineering skills that help to:

1.) Write clean and modular code

2.) Improve code efficiency

3.) Add effective citations

4.) Use version control

Most data scientists also work in teams and share code, ideas, and insights across their organizations in their area of expertise.

A more effective way to get your discoveries across is through simple and unambiguous data analysis. As a data scientist, the more insights you can generate and the cleaner your model, the better your results and predictions.

Below is a summary of the insights generated from a dataset collected from a digital lending company, which prides itself on its effective use of credit risk models to deliver profitable and high-impact loan alternatives. The company's assessment approach is based on two main risk drivers of loan default prediction:

i) Ability to pay

ii) Willingness to pay.

Since not all customers repay their loans, the company needs to invest in experienced data scientists to build robust models that effectively predict the odds of repayment. Matplotlib and Seaborn will be used for the visualizations.

#IMPORTING THE NECESSARY LIBRARIES

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import datetime

from sklearn.preprocessing import LabelEncoder,OneHotEncoder

#reading the datasets

df1 = pd.read_csv('traindemographics.csv')

df2 = pd.read_csv('trainperf.csv')

df3 = pd.read_csv('trainprevloans.csv')

After dropping the columns that were repeated across the datasets, the datasets were merged using the concat() method:

df = pd.concat([df1, df2, df3], axis=1)
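One thing worth checking with this step: concat() with axis=1 places tables side by side by row index, not by customer. The sketch below uses made-up toy tables to show the difference from a key-based merge. The `customerid` key and the sample values are assumptions for illustration, and it is not known from the article whether the original rows were already aligned.

```python
import pandas as pd

# Toy tables; the column names and values here are illustrative assumptions.
demographics = pd.DataFrame({
    "customerid": ["a", "b", "c"],
    "bank_account_type": ["Savings", "Current", "Savings"],
})
performance = pd.DataFrame({
    "customerid": ["c", "a", "b"],  # same customers, different row order
    "good_bad_flag": ["Good", "Bad", "Good"],
})

# concat(axis=1) pairs rows by index position, so customer "a" ends up
# next to the performance record that belongs to customer "c".
side_by_side = pd.concat([demographics, performance], axis=1)

# merge pairs rows by the shared key instead.
merged = demographics.merge(performance, on="customerid", how="inner")

print(side_by_side)
print(merged)
```

If the tables share a key and are not guaranteed to be in the same order, a merge on that key is usually the safer choice.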

Data Cleaning

Duplicates were also dropped. The date/time columns were converted to datetime values and then split into numeric parts (year, month, day, quarter), a form that a model can work with.

date_column = ['firstrepaiddate', 'firstduedate', 'closeddate', 'approveddate', 'creationdate']

def extract_date(df, cols):
    # Split each date column into numeric parts, then drop the original columns
    for x in cols:
        df[x] = pd.to_datetime(df[x])  # the .dt accessor needs a datetime dtype
        df[x + '_year'] = df[x].dt.year
        df[x + '_month'] = df[x].dt.month
        df[x + '_day'] = df[x].dt.day
        df[x + '_quarter'] = df[x].dt.quarter
    df.drop(columns=cols, inplace=True)

extract_date(df, date_column)
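A variant of the same idea, as a minimal sketch: it returns a new DataFrame instead of modifying the input, and turns unparseable dates into missing values rather than raising an error. The function name `add_date_parts` and the toy values are hypothetical and not part of the original analysis.

```python
import pandas as pd

def add_date_parts(frame, cols):
    """Return a copy of frame with each date column split into numeric parts."""
    out = frame.copy()
    for col in cols:
        # errors="coerce" turns values that cannot be parsed into NaT
        parsed = pd.to_datetime(out[col], errors="coerce")
        out[col + "_year"] = parsed.dt.year
        out[col + "_month"] = parsed.dt.month
        out[col + "_day"] = parsed.dt.day
        out[col + "_quarter"] = parsed.dt.quarter
    return out.drop(columns=cols)

# Toy input with one valid timestamp and one bad value (made up for illustration)
toy = pd.DataFrame({"approveddate": ["2017-07-25 08:22:56", "not a date"]})
parts = add_date_parts(toy, ["approveddate"])
print(parts)
```

Rows with bad dates end up with missing date parts, which can then be handled in the missing-value step.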

Categorical columns with missing data were filled using the mode, backfill, or forward fill, while numerical columns were filled with the median or mean.
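The filling strategies mentioned above might look roughly like this in pandas. This is a sketch on made-up data; the column names and values are assumptions, and which strategy was applied to which column in the original analysis is not stated.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; names and values are illustrative only.
toy = pd.DataFrame({
    "employment_status": ["Permanent", None, "Permanent", "Student"],
    "loanamount": [10000.0, np.nan, 30000.0, 20000.0],
})

# Categorical column: fill with the most frequent value (the mode).
mode_value = toy["employment_status"].mode().iloc[0]
toy["employment_status"] = toy["employment_status"].fillna(mode_value)

# Numerical column: fill with the median, which is less sensitive
# to extreme values than the mean.
toy["loanamount"] = toy["loanamount"].fillna(toy["loanamount"].median())

# Forward fill followed by backward fill, for data with a meaningful order.
ordered = pd.Series([np.nan, 1.0, np.nan, 3.0]).ffill().bfill()

print(toy)
print(ordered.tolist())
```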

LabelEncoder() and OneHotEncoder() were also used to transform some categorical data into numerical data. After ensuring that the datasets were clean and fit for use, Matplotlib and Seaborn were used for data visualization.
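For the encoding step, a minimal sketch of how the two scikit-learn encoders are typically applied. The toy values and the choice of which column gets which encoder are assumptions; the article does not say how the columns were split between them.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy data; values are made up for illustration.
toy = pd.DataFrame({
    "good_bad_flag": ["Good", "Bad", "Good"],
    "bank_account_type": ["Savings", "Current", "Other"],
})

# LabelEncoder maps each class to an integer (classes are sorted alphabetically).
label_encoder = LabelEncoder()
toy["good_bad_flag_encoded"] = label_encoder.fit_transform(toy["good_bad_flag"])

# OneHotEncoder creates one 0/1 column per category and expects 2-D input.
one_hot = OneHotEncoder(handle_unknown="ignore")
encoded = one_hot.fit_transform(toy[["bank_account_type"]]).toarray()
one_hot_columns = ["bank_account_type_" + c for c in one_hot.categories_[0]]
one_hot_df = pd.DataFrame(encoded, columns=one_hot_columns, index=toy.index)

print(toy)
print(one_hot_df)
```

Label encoding suits a binary target such as the default flag, while one-hot encoding avoids implying an order between unordered categories such as account types.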

import seaborn as sns

sns.countplot(data=df, x='loannumber', hue='bank_account_type')

Customers with savings accounts tend to apply for loans more often than those with current accounts or other account types.

Also, most of the company's customers do not default.
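Observations like these can also be backed with numbers rather than read off a chart. The following is a sketch on invented toy data, so the proportions it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy data; values are made up for illustration.
toy = pd.DataFrame({
    "bank_account_type": ["Savings", "Savings", "Savings", "Current", "Other"],
    "good_bad_flag": ["Good", "Good", "Bad", "Good", "Good"],
})

# Share of loans per account type
account_share = toy["bank_account_type"].value_counts(normalize=True)

# Share of good vs. bad loans
flag_share = toy["good_bad_flag"].value_counts(normalize=True)

print(account_share)
print(flag_share)
```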

plt.figure(figsize=(7,5))

sns.countplot(y='loanamount', hue='good_bad_flag', data=df)

plt.title('loanamount & good_bad_flag')

plt.show()

Customers with average loan amounts tend not to default, compared to those with the highest or even the lowest loan amounts.

Also, customers with lower loan numbers tend to pay early compared to those with higher loan numbers.
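One way to check a pattern like this numerically is to group loan amounts into bands and compare the share of bad loans in each band. This is a sketch under assumptions: the band edges and the toy rows are invented for illustration and do not reproduce the figures from the real dataset.

```python
import pandas as pd

# Toy data; amounts and flags are made up for illustration.
toy = pd.DataFrame({
    "loanamount": [5000, 10000, 20000, 30000, 40000, 60000],
    "good_bad_flag": ["Bad", "Good", "Good", "Good", "Good", "Bad"],
})

# 1 for a bad loan, 0 for a good one, so the mean is the default rate
toy["is_bad"] = (toy["good_bad_flag"] == "Bad").astype(int)

# Hypothetical band edges; intervals are closed on the right
toy["amount_band"] = pd.cut(
    toy["loanamount"],
    bins=[0, 10000, 40000, float("inf")],
    labels=["low", "medium", "high"],
)

default_rate = toy.groupby("amount_band", observed=True)["is_bad"].mean()
print(default_rate)
```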

Tolulope Oladeji

Environmental Scientist || Geospatial Data Science || Remote Sensing || GIS || Climate Change