Inferential statistics help to identify relationships, test hypotheses and draw conclusions about sets of data.
Most of these statistical methods will be new to you, but none of the maths behind them is tricky! If you follow the instructions step-by-step, they're really worthwhile techniques to incorporate in your Geographical Study. You will also find they come up in university courses such as Geography, Psychology, Science and Humanities subjects.
These are not statistical methods, but are important foundations.
Hypothesis
Significance
Chi² Analysis
Linear regression Analysis
Nearest Neighbour Analysis
Pearson's Product Moment Correlation Coefficient
Spearman's Rank correlation coefficient
A hypothesis is a statement or a hunch. It is used to set up a question which can then be tested - data gathered, processed and a conclusion drawn.
Most of the time in Geography, research is centred around testing whether a geographical model is true.
For example, understanding the process of attrition means that we would expect that the size of sediment should get smaller as you move downstream.
To test whether this is true for a particular river, a hypothesis would be written.
"There is a relationship between the size of the bedload in River X and the distance from the source at which it is found."
However, as in a court of law, it's "innocent until proven guilty". It's better to assume there's no relationship and then look for evidence of one!
A better hypothesis would be:
"There is no relationship between the size of the bedload in River X and the distance from the source at which it is found." This is an example of a null hypothesis.
To test a hypothesis we first write the null hypothesis (H0).
This is the "no change" position. Phrases used in null hypotheses include...
"There is no difference between..."
"There has been no change since..."
"There is no relationship between..."
We do this because it's good practice to assume a relationship doesn't exist, and then look for evidence that one does.
If your test later shows that a relationship does exist, you can reject the null hypothesis and accept the alternative hypothesis (H1).
"There is a difference between..."
"There has been a change since..."
"There is a relationship between..."
The Bradshaw Model is commonly used by Geographers to predict how river characteristics will change downstream. Practise writing out some null and alternative hypotheses for other river characteristics. You can use the one above, for sediment size, to help you.
H0: There is no relationship between the channel depth in River X and the distance from the source.
H1: There is a relationship between the channel depth in River X and the distance from the source.
H0: There is no relationship between the discharge of River X and the distance from the source.
H1: There is a relationship between the discharge of River X and the distance from the source.
H0: There is no relationship between the channel bed roughness of River X and the distance from the source.
H1: There is a relationship between the channel bed roughness of River X and the distance from the source.
For each of the investigations, write out the null and alternative hypotheses.
1. The James Hutton Institute want to find out if the yield of wheat on Scottish farms has changed after using brand X fertiliser instead of using brand Y fertiliser.
2. The Walk Wheel Cycle Trust found that the mean travel time to work for Glasgow residents in 2015 was 20 minutes. A transport official wants to use a sample of this year's travel times to see if they have changed since 2015.
3. The team at Our World in Data are writing an article about improvements in healthcare. They want to establish whether a relationship exists between the GDP per capita ($) of ten low-income countries and the provision of doctors per 1,000 people in those countries.
Significance is our measure of the point at which we accept or reject the null hypothesis. Using significance helps ensure your data isn't just showing a relationship by chance.
When you calculate the result of a statistical test, you will have to check it against a significance table.
We will use data tables sourced from the Island Geographer website. These should be referenced if you use them in your Geographical Study.
Significance tables show values of probability.
This example uses the Pearson's Product significance table.
A significance level of 0.01 means there is only a 1 in 100 chance that a result this strong would occur by chance alone. This is usually described as being 99% certain that the relationship exists.
If the value that you've calculated in your statistical test is more than the critical value in the table (e.g. more than 0.995), then you can say that you're highly certain the relationship exists.
e.g. The result of 0.998 exceeds the critical value of 0.995 at the 0.01 significance level. The null hypothesis must be rejected and replaced by the alternative hypothesis - that there is a relationship between the size of sediment in River X and the distance from the source.
However, the critical value also depends on the degrees of freedom, which reflects the number of values that have been studied to reach this conclusion. The more data the better - but also the more likely the data is to have anomalies. This means that at higher degrees of freedom, the critical value which needs to be reached is lower. Be sure to calculate the degrees of freedom correctly - it's done differently depending on the test conducted.
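The comparison described above can be sketched in a few lines of Python. This is only an illustration: the function name `is_significant` is made up for this example, and the critical value still has to be read from the significance table for your test, significance level and degrees of freedom.

```python
def is_significant(calculated_value, critical_value):
    """Return True if the calculated value exceeds the critical value.

    The sign of the calculated value is ignored, because significance
    tables list positive critical values only.
    """
    return abs(calculated_value) > critical_value


# Figures from the example above: a result of 0.998 against a critical value of 0.995.
if is_significant(0.998, 0.995):
    decision = "Reject the null hypothesis and accept the alternative hypothesis."
else:
    decision = "Accept the null hypothesis."

print(decision)
```

The same check works for a negative correlation, because only the size of the value is compared with the table.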
Based on a significance level of 0.02 (98%), do you:
A - Reject the null hypothesis and accept the alternative hypothesis.
B - Accept the null hypothesis.
C - Change the level of significance you're looking at because although it doesn't meet it at 98%, it would at 95%.
In this case you would do A - reject the null hypothesis and accept the alternative hypothesis. This is because the critical value you were looking for, at the 0.02 significance level, was 0.810 and your result was higher than this.
The Scottish Geology Trust are investigating the feasibility of a new path and interpretation centre at Siccar Point. They've compared the occurrences of path erosion at Siccar Point to locations on other parts of the coastal path. They want to see if visitors to Siccar Point are causing an increase in path erosion.
A chi squared analysis was undertaken to statistically test the data. The result (calculated value) of this was 24.57. The degrees of freedom was 18.
Write down the null hypothesis and the alternative hypothesis.
Open the significance table for Chi squared. Is the calculated value of 24.57 high enough to pass at any of the listed significance levels? If so, which ones?
Write down whether you accept or reject the null hypothesis.
Suggest two other data gathering techniques, other than measuring path erosion, that they might want to use to collect data to inform their plans.
Pearson's Product Moment Correlation Coefficient, to give it its full name, is a way to examine two sets of data and find out if there is a significant relationship between them.
It gives us an idea of the strength of the relationship and its direction - i.e. whether it is a positive or negative relationship.
Before learning about how to do the statistical calculation, it might be worth reminding yourself about correlation. This was covered in the Graphical Data Presentation part of the course.
A Pearson's Product calculation will give us a value which will allow us to decide what kind of correlation the data has - and if it's strong enough for us to accept a hypothesis.
The formula for Pearson's Product looks complicated, but it just has to be calculated in steps. Each of these steps is represented with a heading in the table.
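For reference, the standard textbook form of the formula is written out below, using the same components that appear as table headings. The layout may differ slightly from the version you have been given, but the parts are the same.

```latex
r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \; \sum (y - \bar{y})^2}}
```

Here r is the Pearson's Product value, x̅ is the mean of the x values and y̅ is the mean of the y values.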
A geographical researcher is examining how air pollution levels vary with distance from the CBD in Inverness. Nitrogen dioxide levels were recorded and mapped across the city at regular intervals along a transect.
The researcher believed that as the centre of the city was quite old and contained few roads wide enough for large volumes of traffic, it would not have as high a level of nitrogen dioxide as the city outskirts.
The researcher therefore formulated the following null hypothesis:
“There is no correlation between the level of nitrogen dioxide in the air and the distance from the CBD.”
Step 1: Use the data to create a scattergraph. This will allow you to determine what kind of correlation to expect.
In this case, the data shows a negative correlation. We will expect the result of our Pearson's Product calculation to be a negative number.
Step 2: Create a table as shown. This gives you all the components you need to work out Pearson's Product using the formula.
If you need more rows (for a higher number of data sets), then just add them!
What are all the columns for?
In the first column, put the values for your independent variable x (in this case, distance from the CBD).
In the second column, put the values for your dependent variable y (in this case, nitrogen dioxide levels).
The (x - x̅) column is how different the x values are from the mean.
The (y - y̅) column is how different the y values are from the mean.
The (x - x̅)(y - y̅) column is these two figures multiplied together, for each data set.
The (x - x̅)² and (y - y̅)² columns are the squares of the (x - x̅) and (y - y̅) values.
Step 3: Calculate the mean of x (xbar) and the mean of y (ybar).
Step 4: Use the table to work through and calculate all the components you will need to use the formula.
These are:
Total (∑) of (x - x̅)(y - y̅)
Total (∑) of (x - x̅)²
Total (∑) of (y - y̅)²
The completed table is shown.
Step 5: Slot the components into the formula.
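Steps 3 to 5 can be sketched in Python as a way of checking a hand calculation. This is a minimal sketch, assuming the standard form of the formula (the total of (x - x̅)(y - y̅), divided by the square root of the total of (x - x̅)² multiplied by the total of (y - y̅)²). The readings are made up for illustration and are not the Inverness data, and the function name `pearson_r` is just a label chosen here.

```python
import math

# Hypothetical readings, invented for illustration only.
distance_km = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]      # x: distance from the CBD
no2_level = [42.0, 40.0, 35.0, 33.0, 28.0, 26.0]  # y: nitrogen dioxide level


def pearson_r(x, y):
    """Work through the table columns and return the Pearson's Product value."""
    # Step 3: the means of x and y.
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)

    # Step 4: the (x - x_bar) and (y - y_bar) columns, then the three totals.
    dx = [xi - x_bar for xi in x]
    dy = [yi - y_bar for yi in y]
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)

    # Step 5: slot the totals into the formula.
    return sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)


r = pearson_r(distance_km, no2_level)
print(round(r, 3))
```

Because the made-up nitrogen dioxide values fall as distance increases, the result is negative, which is what the scattergraph step would lead you to expect.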
The negative value confirms to us that there is a negative correlation. The value is also close to -0.9, which tells us that there is a strong, negative correlation. But, before we can accept or reject the hypothesis we should check its significance.
Step 6: Check the calculated value against the critical value, using the significance table. The degrees of freedom for Pearson's Product is n-1 so in this case would be 12. When checking the level of significance, ignore any negative signs.
We can see from the table that the calculated value has to be above the critical value of 0.661 for the relationship to be accepted as significant. Our value of 0.908 is above this, so we can say there is a significant relationship.
Step 7: Write out your answer in full, referring to the calculated value, critical value, significance level and hypothesis.
The Pearson's Product result of -0.908 (ignoring the negative sign) exceeds the critical value of 0.661 at the 0.005 or 99.5% significance level with 12 degrees of freedom. The null hypothesis must be rejected and replaced by the alternative hypothesis - that there is a significant, strong, negative correlation between the level of nitrogen dioxide in the air and the distance from the CBD.
A - To find out if there's a relationship between the depth of soil and distance up a slope
B - To find out if two towns are seeing a similar rise in incidences of coastal erosion
C - To find out if there is a correlation between the GDP per capita and carbon dioxide emissions in South American countries
Both A and C could be analysed using Pearson's Product. This is because they look at whether there's a relationship, or correlation, between sets of data.
B looks at whether there's a significant difference between sets of data. One of the data types (the name of the town) is nominal data, so we need to use something other than Pearson's Product. We'll come across the more suitable statistical method later - Chi squared analysis.
A researcher for a health charity is looking to find out if there is a relationship between the % of people who work in agriculture and the life expectancy. She has randomly sampled 8 countries and sourced data from Gapminder.
Write out the null hypothesis for the investigation and undertake a Pearson's Product analysis to determine whether it should be accepted or rejected.
Step 1: Scattergraph
Step 2: Table
Step 3: Calculate mean
Step 4: Complete table
Step 5: Formula
Step 6: Significance table
Step 7: Write-out in full
Spearman's Rank Correlation Coefficient, to give it its full name, is another way to examine two sets of data and find out if there is a significant relationship between them.
It gives us an idea of the strength of the relationship and its direction - i.e. whether it is a positive or negative relationship.
It uses ranked data rather than the true values.
Spearman's Rank has a lot of similarities to Pearson's Product and is used with similar data sets. However, it has some key differences:
It uses the data's rank, rather than its true value.
This is an advantage because it reduces the impact of extreme values.
However, discarding the true values makes the technique less powerful than Pearson's Product, which gives more reliability and precision.
It has fewer steps than Pearson's Product, which makes it an easier statistical test to complete.
The formula for Spearman's Rank has fewer components:
∑d² = the total of the squared differences in rank for each set of paired data
n = the number of sets of paired data
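For reference, the standard textbook form of the formula using these two components is written out below. Some versions write the bottom line as n(n² - 1), which gives exactly the same answer.

```latex
r_s = 1 - \frac{6 \sum d^2}{n^3 - n}
```

Here r_s is the Spearman's Rank value.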
However, there are still a number of steps to follow to get the information you need.
A Geography class are investigating whether there is a relationship between the adult literacy of countries and their fertility rate.
The adult literacy rate is the percentage of people aged 15 and above who can both read and write, with understanding, a short simple statement about their everyday life.
Total fertility rate represents the number of children that would be born to a woman if she were to live to the end of her childbearing years and bear children in accordance with age-specific fertility rates of the specified year.
In order to see if there is a relationship, and if it's significant, they will undertake a Spearman's Rank calculation. Their teacher has provided them with a dataset featuring 68 countries.
Null hypothesis: There is no relationship between the adult literacy rate and the total fertility rate in select countries.
Data sampling: As the data set is very large, the students decided to use a sampling strategy to select 14 countries each. That is because ranking very large data sets is difficult and there is a higher chance of an error occurring.
They could do this using a:
systematic sampling approach, taking every 5th country
stratified sampling approach, separating the countries into those with high, medium and low human development and then selecting 4 from each
random sampling approach, generating 14 random numbers between 1 and 68 to select their countries.
This worked example, shown in the table, used a systematic sampling method and selected countries on every 5th line, starting with line 2.
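Two of the sampling approaches can be sketched in Python as a quick way to generate line numbers. This is an illustration only: the variable names and the seed value are arbitrary choices made here, and the stratified approach is left out because it depends on human development groupings that are not listed on this page.

```python
import random

TOTAL_COUNTRIES = 68  # lines in the teacher's data set
SAMPLE_SIZE = 14

# Systematic sampling: every 5th line, starting with line 2.
systematic_lines = list(range(2, TOTAL_COUNTRIES + 1, 5))

# Random sampling: 14 different line numbers between 1 and 68.
# The seed is arbitrary; it just makes the example repeatable.
rng = random.Random(2024)
random_lines = sorted(rng.sample(range(1, TOTAL_COUNTRIES + 1), SAMPLE_SIZE))

print(systematic_lines)  # lines 2, 7, 12 ... up to 67
print(random_lines)
```

Starting at line 2 and taking every 5th line gives exactly 14 line numbers, which matches the sample size used in the worked example.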
Step 1: Use the data to create a scattergraph. This will allow you to determine what kind of correlation to expect.
In this case, the data shows a negative correlation. We will expect the result of our Spearman's Rank calculation to be a negative number.
Step 2: Create a table as shown. This gives you all the components you need to work out Spearman's Rank using the formula.
If you need more rows (for a higher number of data sets), then just add them!
What are all the columns for?
In the first column, put the values for your independent variable x (in this case, adult literacy rate).
In the second column, put the values for your dependent variable y (in this case, total fertility rate).
The rank x and rank y columns are the rank given to the data. There is more information about how to do this below.
The d column is the difference between the ranks for x and y (difference between column 3 and 5)
The d² column is the d column squared. We do this to remove negative numbers.
Step 3: Rank the x and y data.
You can do this in ascending (going up) or descending (going down) order. It doesn't matter, as long as it's the same for both x and y.
If there are two identical values (like the values in red) then you give them a rank half way between the two. For example, the value of 4.8 would have come at rank positions 2 and 3, so it's given the rank of 2.5. Importantly, you then continue ranking from rank 4. I like to keep a track of this by writing in brackets - otherwise I get in a mess!
If there are three (or more!) identical values, like those shown in blue, you give them a rank that is the average of the ranks they would have taken. In this case, the adult literacy rate of 100 would have been ranks 1, 2 and 3 - the average of which is 2. You then continue ranking from rank 4. Again, I add the numbers in brackets just to help me keep track.
You should know that you've done this step correctly as your last rank will be the same as the number of data sets you have.
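The ranking rules above, including tied values, can be sketched in Python. This is a minimal sketch rather than a required method: the function name `rank_with_ties` and the two short lists of values are made up for illustration, chosen to mirror the tied examples described above.

```python
def rank_with_ties(values, descending=True):
    """Rank values from 1 to n, giving tied values the average of the ranks they span."""
    # Positions of the values once sorted (largest first when descending=True).
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value in the sorted order is identical.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Positions pos..end are counted from 0, so the ranks spanned are pos+1..end+1.
        average_rank = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1  # continue ranking after the tied group
    return ranks


# Hypothetical values, invented to show the two tie rules.
fertility = [6.0, 4.8, 4.8, 3.2, 2.1]
literacy = [100, 100, 100, 95, 90]

print(rank_with_ties(fertility))  # [1.0, 2.5, 2.5, 4.0, 5.0]
print(rank_with_ties(literacy))   # [2.0, 2.0, 2.0, 4.0, 5.0]
```

In both lists the ranking carries on from rank 4 after the tied values, just as in the written instructions.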
Step 4: Calculate the difference in ranks and square, to complete the table.
Step 5: Use the calculated ∑d² value in the formula. Remember to do the 1 - part of the calculation.
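Steps 4 and 5 can be sketched in Python to check a hand calculation. This sketch assumes the standard form of the formula, 1 - (6 × ∑d²) ÷ (n³ - n). The ranks are made up for illustration and are not taken from the class data set, and the function name `spearman_from_ranks` is just a label chosen here.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Return the Spearman's Rank value from two lists of ranks."""
    n = len(rank_x)
    # Step 4: difference in ranks for each pair, squared, then totalled.
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    # Step 5: slot the total into the formula, remembering the "1 -" part.
    return 1 - (6 * sum_d2) / (n ** 3 - n)


# Hypothetical ranks for five countries, invented for illustration.
rank_literacy = [1, 2, 3, 4, 5]
rank_fertility = [5, 3, 4, 2, 1]

r_s = spearman_from_ranks(rank_literacy, rank_fertility)
print(round(r_s, 3))  # -0.9
```

For these made-up ranks, ∑d² is 38 and n is 5, so the calculation is 1 - 228 ÷ 120, which gives -0.9.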
The negative value confirms to us that there is a negative correlation. The value is -0.715, which tells us that the relationship isn't strong, but it is there. But, before we can accept or reject the hypothesis we should check its significance.
Step 6: Check the calculated value against the critical value, using the significance table. The degrees of freedom for Spearman's Rank is simply n, so in this case would be 14. When checking the level of significance, ignore any negative signs.
We can see from the table that the value has to be above 0.675 to be accepted at a 0.01 level of significance. Ignoring the negative sign, our value of 0.715 is above this, so we can say there is a significant relationship.
Step 7: Write out your answer in full, referring to the calculated value, critical value, significance level and hypothesis.
The Spearman's Rank result of -0.715 (ignoring the negative sign) exceeds the critical value of 0.675 at the 0.01 or 99% significance level with 14 degrees of freedom. The null hypothesis must be rejected and replaced by the alternative hypothesis - that there is a significant negative relationship between the adult literacy rate and total fertility rate of select countries.
Conduct a Spearman's Rank analysis on another 14 countries from the data set.
You could choose to sample these with random or stratified sampling.
Sampling
Step 1: Scattergraph
Step 2: Table
Step 3: Ranking
Step 4: Complete table
Step 5: Formula
Step 6: Significance table
Step 7: Write-out in full