
A Beginner’s Level Guide to Statistics and Probability for Newbies in Data Science

Have you ever wondered, back in high school, why topics such as statistics, probability, calculus, algebra, and permutations and combinations were part of the curriculum?😘 Some students would even ask: why do we have to take mathematics at all? Many students run away from mathematics for diverse reasons; it can be exhausting for some people.

Well, I loved mathematics back in school, but I never really knew where, and how much of it, could be applied to daily happenings around us.

If you are diving into data science as a field, you cannot avoid learning some of these fundamental mathematical concepts, because what data scientists do goes beyond writing clean, efficient and modular code: it involves the practical application of statistics, probability and related tools for modelling and prediction. This article introduces some fundamental concepts in statistics and probability.

photo from dreamstime.com

1. STATISTICS

In everyday use, statistics often just means ‘data’ or ‘information’. One often hears assertions such as college enrollment statistics (referring to various forms of record keeping in colleges, such as numbers on roll, graduation records and admissions), financial statistics (disbursement records, budgets and expenditure) and health statistics (natality records, mortality records, hospital attendance, morbidity). While these are undeniably related to statistics, they are not on their own adequate to exemplify statistics as a discipline.

What is statistics?

Statistics is a branch of mathematics that deals with the analysis and interpretation of numerical data from samples and/or populations. Your sample must be representative of the population as a whole.

The three major objectives of statistics are thus to:

· reduce large volumes of data to manageable sizes that make understanding easier and decision making better;

· summarize data in an unbiased and definitive manner, free of sentiment and ambiguity;

· make general statements about a larger population based on a smaller sample.

VARIABLES

A variable is a characteristic or attribute that occurs in two or more forms or levels, i.e. it shows variability in level, form or value (e.g. height, sex, weight, ethnicity, temperature, nationality, precipitation). Variables can be categorized as either quantitative or qualitative.

There are two main statistical methods used in analyzing data:

· Descriptive Statistics

· Inferential Statistics

In descriptive statistics, sample data is summarized using indexes such as the mean, median, mode and variance, while in inferential statistics we draw conclusions about a population from a sample.
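As a quick illustration (not from the original article), here is a minimal Python sketch of descriptive statistics using the standard library's statistics module; the sample values are made up.

```python
# Descriptive statistics on a made-up sample using Python's built-in statistics module.
import statistics

sample = [4, 8, 6, 5, 3, 8, 9, 7, 8, 5]

print("Mean:    ", statistics.mean(sample))      # arithmetic average
print("Median:  ", statistics.median(sample))    # middle value of the sorted data
print("Mode:    ", statistics.mode(sample))      # most frequently occurring value
print("Variance:", statistics.variance(sample))  # sample variance (n - 1 denominator)
print("Std dev: ", statistics.stdev(sample))     # square root of the variance
```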

MEASURES OF CENTRAL TENDENCY

One of the most important characteristics of samples is the tendency to cluster around a central value, i.e. to lie around the middle or center of the observed range of values. This disposition to concentrate or cluster around the center is known as the average, measure of location or measure of central tendency. The measures of central tendency that are useful parameters in statistics include the arithmetic average of a set of data (mean), the point exactly mid-way through an array of the data (median) and the most frequently occurring observation or value (mode). The characteristics of these parameters and the sample statistics used to estimate them are discussed below.

Arithmetic Mean of Grouped Data

The arithmetic mean can also be calculated from grouped data, especially when the individual observations are not available. This situation is most often encountered in published articles, where all that is available is grouped data presented in tables and figures. For such situations, the arithmetic mean is computed as

Mean = Σ(f × X) / Σf

where X is the class mark (the midpoint of each class interval) and f is the class frequency.
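To make the grouped-data formula concrete, here is a small sketch in Python; the class intervals and frequencies are invented for illustration.

```python
# Grouped-data mean: sum(f * X) / sum(f), where X is the class mark (interval midpoint).
# The intervals and frequencies below are made up for illustration.
intervals = [(0, 10), (10, 20), (20, 30), (30, 40)]
frequencies = [5, 12, 8, 3]

class_marks = [(low + high) / 2 for low, high in intervals]   # 5, 15, 25, 35
grouped_mean = sum(f * x for f, x in zip(frequencies, class_marks)) / sum(frequencies)
print("Grouped mean:", grouped_mean)   # about 18.21 for these made-up numbers
```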

THE MEDIAN

The median, m, is the middle or central measurement in a set of data; it divides a distribution into two equal parts so that equal numbers of observations or measurements fall above and below the point of division. That is, the median is the ‘half-way value’, the score in a distribution above and below which one half of the frequency lies. Note that the emphasis is on frequency, not numerical values.

Thus, in an array of numbers (i.e. numbers arranged in order of magnitude):

Median = the middle value (if the number of observations is odd) or the arithmetic mean of the two middle values (if it is even). This is expressed as: Median = the ((n + 1)/2)th observation if n is odd, and the mean of the (n/2)th and (n/2 + 1)th observations if n is even.

THE MODE

The mode of a set of numbers is the number that occurs with the highest frequency, i.e. the most frequently occurring measurement or observation. Where two adjacent measurements or observations in a set of data share the same highest frequency, the mode of that set of data is the sum of the two observations divided by two. But where the two measurements or observations with equally high frequencies are non-adjacent, each is listed as a mode and the set of data is said to have two modes, or to be bi-modal. If more than two equally high frequencies occur, the set of data is said to be multi-modal.

Relationship between mean, median and mode

· For a symmetrical or normally distributed dataset, the mean, median and mode are equal.

· For a skewed dataset, the mean, median and mode occur at different points, and they are related approximately by the empirical formula:

Mean − Mode = 3 × (Mean − Median)
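As a quick illustrative calculation with made-up numbers: if a skewed dataset has a mean of 50 and a median of 48, the empirical relationship estimates the mode as Mode ≈ Mean − 3 × (Mean − Median) = 50 − 3 × 2 = 44.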

QUARTILES, DECILES AND PERCENTILES

If a set of data is divided into four equal parts, the values marking the divisions are called quartiles, denoted Q1, Q2 and Q3. Similarly, the values that divide the dataset into ten equal parts are the deciles, denoted D1, D2, D3, …, D9, and the values dividing the dataset into 100 equal parts are the percentiles, denoted P1, P2, P3, …, P99, or simply percentage points. Thus, the median = Q2 = D5 = P50. Similarly, the 25th and 75th percentiles are equal to the 1st and 3rd quartiles respectively.
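As a minimal sketch (assuming NumPy is available; the data values are made up), quartiles and other percentiles can be read off with numpy.percentile:

```python
# Quartiles, deciles and percentiles with NumPy on a made-up dataset.
import numpy as np

data = np.array([12, 15, 14, 10, 8, 20, 16, 11, 9, 13, 18, 7])

q1, q2, q3 = np.percentile(data, [25, 50, 75])
print("Q1 (25th percentile):", q1)
print("Q2 (50th percentile = median):", q2)
print("Q3 (75th percentile):", q3)
print("D9 (90th percentile):", np.percentile(data, 90))
```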

MEASURES OF DISPERSION

Another characteristic of a dataset that is very important in understanding its distribution is its measure of variability, variation or dispersion. This is the degree of scatter of measurements around the center; it is the opposite of clustering and describes the tendency of measurements to depart from the central pull. Knowledge of dispersion is important for determining not only the degree to which numerical data cluster around the average value (the measure of central tendency) but also the degree to which they spread or scatter away from it. It is determined using several methods, namely the range, mean deviation, variance, standard deviation, standard error, etc.

RANGE

This is the simplest measure of dispersion, spread or variation and is equal to the difference between the largest and smallest observations, measured quantities or values. However, because the range is based on only two values out of an array of values, it does not give any idea of the actual distribution being considered.

MEAN DEVIATION

The mean deviation (MD) of a set of data X1, X2, X3, …, Xn is a measure of how closely a distribution clusters about its mean: MD = Σ|Xi − X̄| / n. It gives a very good idea of the relationship between the central tendency and the dispersion of a dataset. Note that absolute values are used because the sum of all signed deviations from the mean is always 0.

VARIANCE

Variance is a measure of spread: it tells us how our data are spread out around the mean. The variance of a set of data X1, X2, X3, X4, …, Xn is the mean of the squared deviations from the mean.

Generally, variance increases the farther individual observations are from their mean and decreases as they come closer to it. That is, the more the individual observations or measurements cluster around the mean, the smaller the variance, and the more they are scattered, the larger it becomes. For instance, if all observations or measurements have equal values, then the variance s² = 0, but as the values differ or vary, the variance increases.

STANDARD DEVIATION

This is the square root of the variance.
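The following is a small sketch, with made-up values, of the dispersion measures discussed above computed in Python with NumPy:

```python
# Range, mean deviation, variance and standard deviation for a made-up sample.
import numpy as np

x = np.array([6.0, 8.0, 5.0, 9.0, 7.0, 10.0, 4.0, 7.0])
mean = x.mean()

data_range = x.max() - x.min()             # largest minus smallest value
mean_deviation = np.abs(x - mean).mean()   # average absolute deviation from the mean
variance = x.var()                         # mean of squared deviations (population form)
std_dev = x.std()                          # square root of the variance

print("Range:", data_range)
print("Mean deviation:", round(mean_deviation, 3))
print("Variance:", round(variance, 3))
print("Standard deviation:", round(std_dev, 3))
```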

STANDARD ERROR OF THE MEAN

The standard error of the mean (SE) measures how far a sample mean is likely to deviate from the population mean µ; it is estimated as SE = s / √n, where s is the sample standard deviation and n the sample size. If the distribution of sample means is normal, with most sample means clustered around the true population mean µ, each sample mean has a

68% chance of being within 1 SE of µ

95% chance of being within 2 SE of µ

99.7% chance of being within 3 SE of µ

However, to allow for the error introduced by using sample estimates, a confidence interval is usually constructed around the sample mean, for example mean ± 1.96 × SE for approximately 95% confidence.
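Here is a minimal sketch (made-up sample values, assuming SE = s/√n and the usual 1.96 multiplier for an approximate 95% interval):

```python
# Standard error of the mean and an approximate 95% confidence interval on made-up data.
import numpy as np

sample = np.array([23.0, 25.0, 21.0, 30.0, 27.0, 24.0, 26.0, 22.0])
n = len(sample)
mean = sample.mean()
s = sample.std(ddof=1)        # sample standard deviation (n - 1 denominator)
se = s / np.sqrt(n)           # standard error of the mean

# About 95% of sample means fall within roughly 2 SE of the population mean,
# so a common approximate interval is mean +/- 1.96 * SE.
ci_low, ci_high = mean - 1.96 * se, mean + 1.96 * se
print(f"mean = {mean:.2f}, SE = {se:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```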

OUTLIERS

In data science, outliers are almost unavoidable in some or most of our datasets, and there are various methods for treating them. When outliers are present, the best measure of central tendency is the median rather than the mean, because the mean gets pulled towards the outlier, skewing it to the right or left. We can have both low and high outliers:

§ Low outlier: any value below Q1 − 1.5 × IQR

§ High outlier: any value above Q3 + 1.5 × IQR

These fences are well represented graphically using a boxplot.
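As a sketch of the fences described above (made-up data, with one deliberately extreme value), assuming NumPy:

```python
# Flagging outliers with the 1.5 * IQR boxplot rule on made-up data.
import numpy as np

data = np.array([10, 12, 11, 13, 12, 14, 11, 12, 35, 13])   # 35 is deliberately extreme

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr      # values below this are low outliers
high_fence = q3 + 1.5 * iqr     # values above this are high outliers

outliers = data[(data < low_fence) | (data > high_fence)]
print("Fences:", low_fence, high_fence)
print("Outliers:", outliers)    # flags the value 35
```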

GAUSSIAN DISTRIBUTION (NORMAL DISTRIBUTION)

Normal Distribution

A normal distribution is one in which the frequencies have a preponderance of values around the mean, with progressively fewer values or observations towards the extremes. The curve defined by such frequencies is universally given as:

f(X) = (1 / (σ√(2π))) × e^(−(X − µ)² / (2σ²))

where µ = the population mean, σ = the standard deviation, π = 3.14159 (a constant) and e = 2.71828 (the base of the natural logarithm). The equation implies that a distribution defined by it has a mean of µ and a standard deviation of σ. Thus, for any given σ there is an infinite number of curves depending on the value of µ, and for any given µ several curves are possible as the value of σ changes. In each case the shape of the curve may change, becoming steeper or broader, defining what is generally known as kurtosis. Kurtosis, which simply describes shape, is the measure of the peakedness of a curve or distribution. It is defined from the fourth moment of deviations about the mean, µ4 = E[(X − µ)⁴], together with the second moment µ2 = E[(X − µ)²] = σ² (the population variance).

A distribution is normally distributed when µ4 / σ⁴ = 3. Such a distribution is said to be mesokurtic; it is leptokurtic if the ratio is > 3 and platykurtic if it is < 3. Mesokurtic curves are symmetrical and bell shaped, leptokurtic curves are noticeably peaked (higher than they are wide), while platykurtic curves are broader than they are high.
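A quick way to see the "kurtosis ≈ 3 for normal data" claim is a small simulation; this is a sketch with an arbitrary seed and sample size, not part of the original article:

```python
# Estimate kurtosis (fourth moment over sigma^4) from simulated normal data.
import numpy as np

rng = np.random.default_rng(0)                      # arbitrary seed for repeatability
x = rng.normal(loc=0.0, scale=1.0, size=100_000)    # simulated normal sample

mu = x.mean()
sigma = x.std()
fourth_moment = np.mean((x - mu) ** 4)

print("kurtosis:", round(fourth_moment / sigma**4, 3))   # close to 3 for normal data
```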

Features of a Normal Distribution

photo from dreamstime.com

1. For normally distributed data:

i. 68% of samples lie within 1σ of the mean (i.e. between µ − σ and µ + σ)

ii. 95.45% of samples lie within 2σ (i.e. between µ − 2σ and µ + 2σ)

iii. 99.73% of samples lie within 3σ (i.e. between µ − 3σ and µ + 3σ)

2. A normal curve has a mean of µ, a standard deviation of σ, a variance of σ² and a moment coefficient of kurtosis of 3. A quick numerical check of point 1 follows below.
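These percentages can be checked empirically with a short simulation; the mean, standard deviation and seed below are arbitrary choices for illustration:

```python
# Verify the 68 / 95.45 / 99.73 rule on simulated normal data.
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=50.0, scale=5.0, size=100_000)
mu, sigma = x.mean(), x.std()

for k in (1, 2, 3):
    within = np.mean(np.abs(x - mu) <= k * sigma) * 100
    print(f"within {k} standard deviation(s): {within:.2f}%")   # ~68%, ~95.45%, ~99.73%
```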

2. PROBABILITY

Probability allows us to determine how reliable our statistical results actually are.

What is Probability?

Probability is the likelihood of whether or not something will happen. The probability of any event can be expressed as a decimal, a fraction or a percentage, and its value always lies between zero and one, [0, 1].

A probability of 1 (100 percent) means our event is guaranteed to happen. A probability of 0 means it is impossible. For instance, if I roll a die, what is the probability that I get a 9? It is zero.

A probability of 0.5 (or ½) means an event is just as likely to happen as not (i.e. an equal chance of getting a head or a tail when a fair coin is tossed).

I EXPERIMENTAL PROBABILITY: Examples of experiments include:

i Flipping a coin one time is an experiment.

ii Rolling a die one time is an experiment.

So, each time you perform an action, that is one experiment. Let's say I have a fair coin and flip it three times: we are performing three (3) experiments on that coin.

If we perform an experiment over and over again, what we develop is an experimental probability, and it can change depending on our results.

II THEORETICAL PROBABILITY:

Theoretical probability is the likelihood that something will occur if we run an infinite number of experiments; it is the true or actual probability. The more experiments we run, the closer the experimental probability gets to the theoretical probability, which leads to what we call the law of large numbers (see the simulation sketch below).
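To make the law of large numbers concrete, here is a small coin-flip simulation (my own sketch, with an arbitrary seed); the experimental probability of heads drifts towards the theoretical 0.5 as the number of flips grows:

```python
# Experimental vs theoretical probability: the proportion of heads approaches 0.5.
import random

random.seed(1)   # arbitrary seed so the run is repeatable

for n in (10, 100, 1_000, 10_000, 100_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(f"{n:>7} flips -> experimental P(heads) = {heads / n:.4f}")
```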

III INDEPENDENT AND DEPENDENT EVENTS:

In independent events, one trial does not affect the result of another trial, which leads to the multiplication rule, whereas in dependent events one trial has an effect on another, which brings about conditional probability and Bayes' Theorem.

Addition Rule

If A and B are events, the probability of obtaining either of them is equal to the sum of the probabilities of their individual occurrences minus the probability of their joint occurrence:

p(A or B) = p(A) + p(B) − p(A and B)

Multiplication rule

The probability of the simultaneous or successive occurrence of two events A and B is the product of their separate probabilities. Thus, for independent events:

p(A and B) = p(A)p(B)
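A tiny sanity check of both rules, using one roll of a fair die for the addition rule and two independent coin tosses for the multiplication rule (events of my own choosing):

```python
# Addition and multiplication rules checked with exact fractions.
from fractions import Fraction

outcomes = range(1, 7)                        # one roll of a fair die
A = {x for x in outcomes if x % 2 == 0}       # event A: roll is even -> {2, 4, 6}
B = {x for x in outcomes if x > 4}            # event B: roll is greater than 4 -> {5, 6}

def p(event):
    return Fraction(len(event), 6)

# Addition rule: p(A or B) = p(A) + p(B) - p(A and B)
print(p(A | B) == p(A) + p(B) - p(A & B))     # True

# Multiplication rule for independent events, e.g. heads on two separate coin tosses:
print(Fraction(1, 2) * Fraction(1, 2))        # 1/4
```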

With this article, I have explained some fundamental but important mathematical concepts that should aid your knowledge in data science. I hope you have seen how learning about probability helps in making informed decisions based on patterns in collected data, and how statistical inference is used to analyze and predict trends from data.

Let me know in the comments below if you have learned a thing or two. Don't forget to leave 👏👏👏👏👏 claps.



Written by Tolulope Oladeji

Environmental Scientist || Geospatial Data Science || Remote sensing || GIS|| Climate Change||
