P-value, A Beginner’s Nightmare

Praveen Jalaja
Jun 17, 2020

Why This Post??

The P-value is one of the concepts in statistics that I have revisited at least 10 times to get a better understanding. When I first learned about the P-value in high school, I was so confused that I couldn’t wrap my 16-year-old mind around it.

The way I remembered the P-value:

If the P-value is less than 0.05, reject the null hypothesis. Otherwise, fail to reject the null hypothesis.

The intuition behind P-value and why it is used was not at all clear to me.

The concept is very confusing and bizarrely counter-intuitive.

So, in this article, I am going to try to explain the P-value and hypothesis testing intuitively, both to understand them better myself and for whoever cares to read it.

To understand the P-value, we first have to understand what hypothesis testing is.

Hypothesis Testing

We will start with the old-fashioned Wikipedia definition:

A statistical hypothesis, sometimes called confirmatory data analysis, is a hypothesis that is testable on the basis of observing a process that is modeled via a set of random variables.

Let’s define a hypothesis to understand the definition:

There is no difference in average water consumption (liters) per year of households in Kansas and Missouri.

Let’s scrutinize the above hypothesis with the help of the wiki definition.

Is the hypothesis testable by a process modeled by a set of random variables?

Yes. The statement is testable by collecting water consumption data (the random variable) from the two states, Kansas and Missouri, and testing whether the averages are different or not.

A brute-force approach is to collect the water consumption of every household in Kansas and Missouri and find the average water consumption in each of the two states. If the two averages (means) do not differ from each other, then the hypothesis is true; otherwise it is false.

The brute-force approach is insanely time-consuming and costly. Think about it: if a company wants to test this hypothesis, it has to spend money and time collecting data from every county of both states. Uh! Tiring!!

For this problem, hypothesis testing is a sweet solution: it takes only a sample of data from the whole population and uses it to test the hypothesis.

First Step: Define the Null Hypothesis and Alternative hypothesis

Null Hypothesis: The mean water consumption of households per year is the same in both states.

But we also have to define an alternative, the opposite of the null hypothesis, so that we have something to test our hypothesis against.

Alternative Hypothesis: The mean water consumption of households per year is different in the two states. This contradicts our null hypothesis.

Second Step: Collect the sample data

Instead of collecting water consumption data from all the households of all the counties across the two states, we can randomly choose households from different counties of both states and collect their water consumption data.
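To make the sampling step concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the population sizes, the means and spreads, and the names kansas_population and missouri_population are my assumptions, not real data. It draws a simple random sample, which is simpler than a real survey that would probably sample county by county.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical populations: yearly water use (liters) for every household.
# The sizes, means and spreads below are invented for illustration only.
kansas_population = [random.gauss(300_000, 40_000) for _ in range(10_000)]
missouri_population = [random.gauss(300_000, 40_000) for _ in range(10_000)]

# Draw a simple random sample of households from each state,
# without replacement, instead of measuring every household.
sample_size = 200
kansas_sample = random.sample(kansas_population, sample_size)
missouri_sample = random.sample(missouri_population, sample_size)

print(len(kansas_sample), len(missouri_sample))
```

The point is only that we now work with 200 households per state instead of all of them.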

Third Step: Test our Null Hypothesis

From our collected sample data, we find the mean water consumption per year in each of the two states.

Then we find the difference between the sample means of the two states. In our case, the mean difference in water consumption per year is 300 liters.
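As a small sketch of this step, the snippet below computes the two sample means and their difference. The ten household values are hypothetical numbers I picked so that the difference comes out to 300 liters, matching the example in this post.

```python
from statistics import mean

# Hypothetical yearly water consumption (liters) for a few sampled households.
kansas_sample = [310_500, 298_200, 305_700, 301_400, 299_900]
missouri_sample = [309_800, 298_100, 305_300, 301_200, 299_800]

mean_kansas = mean(kansas_sample)      # sample mean for Kansas
mean_missouri = mean(missouri_sample)  # sample mean for Missouri

# Difference between the two sample means, in liters.
mean_difference = mean_kansas - mean_missouri
print(mean_difference)
```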

Remember, the above mean difference comes from the sample data we collected, but our objective is to reject or fail to reject the null hypothesis for the population, not for the sample.

But how can we reject or fail to reject the null hypothesis for the entire population (all the households in both states)?

This is where P-value comes into the picture to Save us.

Now the definition:

The P-value is the probability of observing a result at least as extreme as the one in our sample, assuming the null hypothesis is true in the population.

Let’s again dive into the definition with the help of the hypothesis we defined for water consumption.

In our case, the mean difference is 300 liters for the sample data.

We now assume the null hypothesis is true, which means there is no difference in mean water consumption between the two populations. Yet the mean difference calculated from the sample data is 300.

Now, what is the probability that a mean difference of 300 or more will occur in sample data collected from the population, if there is no difference in the mean water consumption of the two states?

So, the P-value answers the above question by assuming that the difference in sample means follows a normal distribution when the null hypothesis is true, and gives the probability of getting a difference of 300 or more.
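Here is a rough sketch of that calculation under the normal assumption. The function name two_sided_p_value, the standard deviations (1500 liters) and the sample sizes (200 households per state) are all my assumptions for illustration; the post itself only gives the mean difference of 300. It is two-sided because our alternative hypothesis says "different", not "larger".

```python
import math

def two_sided_p_value(mean_diff, sd_a, n_a, sd_b, n_b):
    """Two-sided P-value for a difference in sample means (normal approximation)."""
    # Standard error of the difference between two independent sample means.
    standard_error = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    # How many standard errors the observed difference is away from zero.
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical inputs: only the mean difference of 300 comes from the example.
p = two_sided_p_value(mean_diff=300, sd_a=1500, n_a=200, sd_b=1500, n_b=200)
print(round(p, 4))
```

With these made-up numbers the standard error is 150 liters, so a difference of 300 sits two standard errors from zero and the P-value lands a little under 0.05. With a larger spread or smaller samples, the same difference of 300 would give a larger P-value.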

And the P-value is a probability, so it always lies between 0 and 1.

If the P-value is above 0.05 (the 5% significance level), then, if our null hypothesis is true (no mean difference between the states), there is more than a 5% probability of seeing a mean difference of 300 liters or more in a sample. So we cannot reject our null hypothesis; there is not enough evidence against it.

But if our P-value is below 0.05 (the 5% significance level), then there is less than a 5% probability of seeing a sample mean difference of 300 liters or more if our null hypothesis is true. In other words, a difference this large would be rare if the two states really had the same mean. So our collected sample data and calculated P-value give us evidence to reject our null hypothesis.
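The decision rule from the two paragraphs above fits in a few lines. This is just a sketch; the function name decide and the default alpha of 0.05 are my choices, and 0.05 is a convention, not a law.

```python
def decide(p_value, alpha=0.05):
    """Apply the usual decision rule at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a P-value must lie between 0 and 1")
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

print(decide(0.03))  # below the 5% level
print(decide(0.20))  # above the 5% level
```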

Hypothesis testing can only tell us whether to reject or fail to reject the null hypothesis, based on the evidence we have collected. It never proves that the null hypothesis is true.

The P-value can be calculated in two ways: with resampling and a permutation test, or by assuming the difference in means follows a normal distribution when the null hypothesis is true and finding the probability of a mean difference of 300 liters or more on the bell curve.
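For the resampling route, a minimal permutation test could look like the sketch below. The idea: if the null hypothesis is true, the state labels do not matter, so we shuffle the labels many times and count how often the shuffled difference is at least as large as the one we observed. The function name permutation_p_value, its parameters and the sample values are hypothetical, chosen only for illustration.

```python
import random

def permutation_p_value(sample_a, sample_b, n_permutations=2000, seed=0):
    """Estimate a two-sided P-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)
    n_a = len(sample_a)
    observed = abs(sum(sample_a) / n_a - sum(sample_b) / len(sample_b))
    pooled = list(sample_a) + list(sample_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the labels are interchangeable,
        # so reassign households to the two groups at random.
        rng.shuffle(pooled)
        group_a, group_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
        if diff >= observed:
            at_least_as_extreme += 1
    # Share of shuffles that look at least as extreme as the real data.
    return at_least_as_extreme / n_permutations

# Hypothetical samples of yearly water consumption (liters).
kansas_sample = [310_500, 298_200, 305_700, 301_400, 299_900]
missouri_sample = [309_800, 298_100, 305_300, 301_200, 299_800]
print(permutation_p_value(kansas_sample, missouri_sample))
```

This version needs no normal assumption, but the answer is an estimate that depends on the number of shuffles and the random seed.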

I hope the above intuitive approach to the P-value helps make the nightmare less scary than before. The best way to understand the P-value is to keep defining and testing our own hypotheses, like I did in this post.

adiós
