Bartlett Centre for Advanced Spatial Analysis, University College London
November 21, 2023
Professor of Urban Analytics @ Bartlett Centre for Advanced Spatial Analysis (CASA), UCL
Geographer by background - ex-Secondary School Teacher - back in HE for 16+ years
Taught GIS / Spatial Data Science at postgrad level for last 11 years
Whistle-stop tour of some of the key concepts relating to spatial data
An illustrative example analysing some spatial data in London - demonstrating the “spatial is special” idiom and how we might account for spatial factors in our analysis
All slides and examples are produced in RMarkdown using Quarto and R so everything can be forked and reproduced in your own time later - just go to the Github Repo link below
By the end I hope you’ll all leave with a better introductory understanding of why and how we should pay attention to the influence of space in any analysis
Everything happens somewhere
Coordinates are more reliable than names, which are rarely unique or reference fuzzy locations
The earth is roughly spherical and points anywhere on its surface can be described using the World Geodetic System (WGS) - a geographic (spherical) coordinate system
Points can be referenced according to their position on a grid of latitudes (degrees north or south of the equator) and longitudes (degrees east or west of the Prime - Greenwich - meridian)
The last major revision of the World Geodetic System was in 1984 and WGS84 is still used today as the standard system for referencing places on the globe.
Projected Coordinate Reference Systems convert the 3D globe to a 2D plane and can do so in a huge variety of different ways
Most national mapping agencies have their own projected coordinate systems - in Britain the Ordnance Survey maintain the British National Grid which locates places according to 6-digit Easting and Northing coordinates
Every coordinate system can be referenced by its EPSG code, e.g. WGS84 = 4326 or British National Grid = 27700 with mathematical transformations to convert between them
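As an illustration, a minimal sketch (assuming the sf package; the UCL coordinate pair is approximate) of converting between the two systems via their EPSG codes:

```r
library(sf)

# a point in longitude/latitude on WGS84 (EPSG:4326) - roughly UCL
ucl_wgs84 <- st_sfc(st_point(c(-0.1340, 51.5246)), crs = 4326)

# reproject to British National Grid Easting/Northing (EPSG:27700)
ucl_bng <- st_transform(ucl_wgs84, crs = 27700)
st_coordinates(ucl_bng)
```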
Once we have a coordinate reference system we can locate objects accurately in space
Most objects that spatial data scientists are concerned with (apart from gridded representations, which we will ignore for now!) can be simplified to either a point, a line or a polygon in that space
Polygons and lines are just multiple point coordinates joined together!
The examples on the right store geometries in the ‘well-known text’ (WKT) format for representing vector (point, line, polygon) geometries
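A minimal sketch (assuming the sf package; the coordinates are made up for illustration) of WKT strings for the three geometry types and how they can be parsed back into geometries:

```r
library(sf)

# illustrative WKT strings for a point, a line and a polygon
wkt_examples <- c(
  "POINT (530000 180000)",
  "LINESTRING (530000 180000, 531000 181000, 532000 180500)",
  "POLYGON ((530000 180000, 531000 180000, 531000 181000, 530000 180000))"
)

# st_as_sfc() parses WKT into simple feature geometries
st_as_sfc(wkt_examples, crs = 27700)
```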
Impossible to talk about spatial data without mentioning the shapefile
Developed in the 1980s by ESRI and has become, pretty much, the de facto standard for storing and sharing spatial data - even though it’s a terrible format!
Shapefiles store geometries (shapes) and attributes (information about those shapes)
Not a single file, actually a collection of files
.shp - geometries
.shx - index
.dbf - attributes
+some others!
Superseded by LOTS of alternative formats - GeoJSON (web), GeoPackage (everything) - which do the same thing in better ways for different applications
Simple Features - OGC (Open Geospatial Consortium) standard that specifies a common storage and access model for 2D geometries
2 part standard:
Part 1 - Common Architecture defining geometries, attributes etc. via WKT
Part 2 - supports storage, retrieval, query and update of simple geospatial feature collections via SQL (structured query language – been around since the 1970s)
Simple Features implemented in most spatially enabled database management systems (e.g. PostGIS extension for PostgreSQL, Oracle Spatial etc.)
The sf package in R enables storage of spatial data and attributes in a single data frame object
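A minimal sketch (assuming the sf package and a hypothetical london_wards.gpkg file) of reading spatial data into a single sf data frame:

```r
library(sf)

# read a GeoPackage (st_read also handles shapefiles, GeoJSON and many others)
london_wards <- st_read("london_wards.gpkg")   # hypothetical file

head(london_wards)     # attributes and geometries together in one data frame
st_crs(london_wards)   # the coordinate reference system of the geometry column
```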
“Everything is related to everything else, but near things are more related than distant things.”
This observation underpins much of what spatial data scientists do
Being able to locate something in space, relative to something else, allows us to:
explain why something may be occurring where it is
make better predictions about nearby or further away things
Underpins the whole Geodemographics (customer segmentation) industry!!
Near and distant can mean different things in different contexts
In spatial data science one way of separating near from distant can simply be to define their topological relationship - Dimensionally Extended 9-Intersection Model (DE-9IM) is the standard topological model used in GIS
Touching or overlapping objects = ‘near’
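A minimal sketch (assuming the sf package and a wards object like the LondonWardsMerged one used later) of querying these topological relationships:

```r
library(sf)

# the full DE-9IM relationship between the first ward and every other ward
st_relate(LondonWardsMerged[1, ], LondonWardsMerged)

# convenience predicates built on the same model
st_touches(LondonWardsMerged[1, ], LondonWardsMerged)      # which wards touch it?
st_intersects(LondonWardsMerged[1, ], LondonWardsMerged)   # which overlap or touch it?
```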
If we measure the distance from the centre (centroid) of one ward to another, then we might decide that the 1st, 2nd, 3rd, …, kth closest wards are near, the others are far
These neighbour relationships can be stored in an \(n \times n\) ‘spatial weights’ matrix
The spdep package in R will calculate a range of spatial weights matrices given a set of geometries
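A minimal sketch (assuming spdep and sf, and the LondonWardsMerged wards used throughout) of building neighbour lists and a spatial weights matrix:

```r
library(spdep)
library(sf)

# contiguity: wards sharing a boundary (queen's case) are 'near'
LWard_nb <- poly2nb(LondonWardsMerged, queen = TRUE)

# k-nearest neighbours: the 4 closest ward centroids are 'near'
coords <- st_coordinates(st_centroid(st_geometry(LondonWardsMerged)))
LWard_knn <- knn2nb(knearneigh(coords, k = 4))

# convert a neighbour list into a row-standardised spatial weights matrix
LWard_lw <- nb2listw(LWard_nb, style = "W")
```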
Other conceptions of near might include any contiguous ward with distant simply being those which are not contiguous
Near or distant could also be defined by some distance threshold
Is there any pattern? Do better scores and worse scores appear to be clustered? How can we tell?
Spatial Autocorrelation - phenomenon of near things being more similar than distant things.
Can test for spatial autocorrelation by comparing the GCSE Scores in any given ward with the GCSE scores in neighbouring wards (however we choose to define our neighbours - k-nearest, those that are contiguous etc.)
Average value of GCSE scores in the neighbouring wards is known as the spatial lag of GCSE scores
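A minimal sketch (assuming the spdep package and the LWard_nb neighbour list referenced in the output below) of code that could produce output like that shown beneath it - computing the spatial lag and testing for spatial autocorrelation:

```r
library(spdep)

lw <- nb2listw(LWard_nb)

# average GCSE score across each ward's neighbours (the spatial lag)
LondonWardsMerged$gcse_lag <- lag.listw(lw,
  LondonWardsMerged$average_gcse_capped_point_scores_2014)

# slope of the lag against the scores themselves (the Moran scatterplot slope)
coef(lm(gcse_lag ~ average_gcse_capped_point_scores_2014,
        data = LondonWardsMerged))

# formal test: Moran's I under randomisation
moran.test(LondonWardsMerged$average_gcse_capped_point_scores_2014, lw)
```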
(Intercept) average_gcse_capped_point_scores_2014
190.2624075 0.4190508
Moran I test under randomisation
data: LondonWardsMerged$average_gcse_capped_point_scores_2014
weights: nb2listw(LWard_nb)
Moran I statistic standard deviate = 17.785, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.4190507533 -0.0016025641 0.0005594495
Moran’s I = 0.42
Moderate, positive spatial autocorrelation in average GCSE scores across London - some clustering of both low and high scores
Spatial autocorrelation might be expected when the distribution of schools is overlaid and one realises that pupils from multiple neighbouring wards might attend the same school
Having observed some spatial patterns in school exam performance in London, we might next want to explain these patterns, perhaps using another variable measured for the same spatial units.
Our own experience might tell us that missing class could negatively impact our ability to learn things in that class
Hypothesis: wards with higher rates of absence from school will tend to experience lower average exam grades
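A minimal sketch (assuming the LondonWardsMerged object) of testing this hypothesis with an ordinary least squares regression, producing the coefficients shown below:

```r
# ordinary least squares: GCSE score as a function of unauthorised absence
model1 <- lm(average_gcse_capped_point_scores_2014 ~
               unauthorised_absence_in_all_schools_percent_2013,
             data = LondonWardsMerged)

coef(model1)
```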
(Intercept)
371.71500
unauthorised_absence_in_all_schools_percent_2013
-41.40264
Taking the whole of London, it would appear that there is a moderately strong, negative relationship between missing school and exam performance
For every additional 1% of school days missed, we might expect a decrease of around 41 points in GCSE score.
But does this relationship hold true across all wards in the city?
Moran’s I of GCSE scores means that we already know that the observations are probably not independent of each other (violating one assumption of regression)
Mapping the residual values from the regression model allows us to observe any spatial clustering in the errors
Clustering of residuals could also indicate a violation of the independence assumption of errors
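A minimal sketch (assuming the OLS model and spdep weights from above, and the tmap package for the map) of inspecting the residuals:

```r
library(spdep)
library(tmap)

# attach the residuals to the wards
LondonWardsMerged$model1_resids <- residuals(model1)

# map them to look for clustering by eye
tm_shape(LondonWardsMerged) + tm_polygons("model1_resids")

# and test formally with Moran's I
moran.test(LondonWardsMerged$model1_resids, nb2listw(LWard_nb))
```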
Moran I test under randomisation
data: LondonWardsMerged$model1_resids
weights: nb2listw(LWard_nb)
Moran I statistic standard deviate = 12.183, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.2862894906 -0.0016025641 0.0005583971
Call:
lagsarlm(formula = average_gcse_capped_point_scores_2014 ~ unauthorised_absence_in_all_schools_percent_2013,
data = LondonWardsMerged, listw = nb2listw(LWard_nb, style = "W"),
method = "eigen")
Residuals:
Min 1Q Median 3Q Max
-68.70402 -9.44615 -0.64207 8.53417 58.56788
Type: lag
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value
(Intercept) 207.4009 15.0053 13.822
unauthorised_absence_in_all_schools_percent_2013 -30.7843 2.0792 -14.806
Pr(>|z|)
(Intercept) < 2.2e-16
unauthorised_absence_in_all_schools_percent_2013 < 2.2e-16
Rho: 0.46705, LR test value: 104.93, p-value: < 2.22e-16
Asymptotic standard error: 0.041738
z-value: 11.19, p-value: < 2.22e-16
Wald statistic: 125.22, p-value: < 2.22e-16
Log likelihood: -2581.93 for lag model
ML residual variance (sigma squared): 217.21, (sigma: 14.738)
Number of observations: 625
Number of parameters estimated: 4
AIC: 5171.9, (AIC for lm: 5274.8)
LM test for residual autocorrelation
test value: 3.0949, p-value: 0.078537
One way of coping with spatial dependence in the dependent variable is to include the spatial lag of that variable as an independent explanatory variable
The spatialreg package in R allows us to easily incorporate a spatial lag of the dependent variable as an additional explanatory variable, with coefficient \(\rho\) (Rho), in a standard linear regression model
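A minimal sketch (assuming the spatialreg package and the objects above, mirroring the call shown in the output earlier) of fitting the spatial lag model:

```r
library(spdep)
library(spatialreg)

lag_model <- lagsarlm(average_gcse_capped_point_scores_2014 ~
                        unauthorised_absence_in_all_schools_percent_2013,
                      data = LondonWardsMerged,
                      listw = nb2listw(LWard_nb, style = "W"),
                      method = "eigen")

summary(lag_model)   # Rho is the coefficient on the spatial lag of GCSE scores
```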
Running the spatial lag model reveals that the spatial lag is statistically significant and has the effect of reducing the estimated impact of missing 1% of school days from around -41 points to -31 points.
One reason behind a clustering of residuals could be that the relationship between dependent and independent variables might not remain constant across space
In some parts of London, it could be that as unauthorised absence from school rises, exam grades also rise (as unlikely as that might be!).
Or, more plausibly, that in some parts of the city, absence has an even more pronounced negative effect than in others.
It’s also likely that the intercept values (the average GCSE score, given no days of unauthorised absence) will be different in different parts of the city - some areas, on average, doing better than others
We can test for the presence of such phenomena by running a series of smaller, more localised regressions and comparing the coefficients that emerge
GWR is a method for systematically running a series of localised regression analyses across a study area, collecting coefficients and other diagnostics for an independent variable in each zone of interest.
Something similar can be achieved through spatial sub-setting - i.e. running analyses for groups of zones within a higher level geography
In a GWR analysis, kernel weighting functions of different bandwidths (diameters) and shapes are used to include and weight or exclude neighbouring observations
Adaptive weighting can be used to adjust the size of the kernel according to some threshold of observations
For every point in the dataset a regression is run including the values within the kernel (which, of course, can only be achieved effectively through understanding the coordinate reference system of the observations)
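A minimal sketch (assuming the spgwr package, one of several R implementations of GWR; spgwr expects sp-class objects, hence the conversion) of running a geographically weighted regression across the wards:

```r
library(spgwr)
library(sf)

# spgwr works with sp objects, so convert the sf wards
LondonWardsSP <- as(LondonWardsMerged, "Spatial")

# choose an adaptive bandwidth (a proportion of observations) by cross-validation
gwr_bw <- gwr.sel(average_gcse_capped_point_scores_2014 ~
                    unauthorised_absence_in_all_schools_percent_2013,
                  data = LondonWardsSP, adapt = TRUE)

# run a local regression at every ward, collecting local coefficients
gwr_model <- gwr(average_gcse_capped_point_scores_2014 ~
                   unauthorised_absence_in_all_schools_percent_2013,
                 data = LondonWardsSP, adapt = gwr_bw, hatmatrix = TRUE)

# local coefficients for absence, one per ward, ready to join back and map
head(gwr_model$SDF$unauthorised_absence_in_all_schools_percent_2013)
```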
Plotting coefficient values for each ward reveals noticeable non-stationarity in the relationship between absence and GCSE scores
In well-off central London boroughs (particularly Hammersmith and Fulham, Kensington and Chelsea and Camden) we see evidence that absence is positively related to GCSE performance
In some of the outer-London boroughs (Barnet, Sutton, Richmond etc.) the effect of missing school is even more severe than it is elsewhere in the city
Methods which accommodate space explicitly can help us better understand spatial phenomena, but the arrangement of space can alter perceptions and the outcomes of analyses
The Modifiable Areal Unit Problem (MAUP) - popularised in the 1980s by Stan Openshaw - describes issues that relate to the shape, scale and aggregation of underlying phenomena into artificial spatial units
Politicians have known about the issues of scale and aggregation for a long time and have used it to their advantage
The practice of gerrymandering is widespread wherever there is a first-past-the-post electoral system and has been used to manipulate constituency boundaries to influence election outcomes
Related to the MAUP, the Ecological Fallacy describes a confusion between patterns revealed at one level of aggregation and the assumption that they apply either to individuals or lower levels of aggregation
The basic idea: just because patterns of educational attainment are revealed at Borough level, they won’t necessarily translate down to neighbourhood level
“Simpson’s Paradox” - a type of ecological fallacy where the statistical association or correlation between two variables at one level of aggregation disappears or reverses at another - think back to the Geographically Weighted Regression example from earlier
Knowing where something occurs underpins everything spatial data scientists do
Various conventions around how to locate something on the earth’s surface and store information about it have emerged
Near things are more related than distant things and being aware of this when analysing data with a spatial dimension is fundamental to carrying out a robust analysis
Accounting for spatial clustering in data can help analysts:
more correctly interpret relationships between variables
avoid making erroneous generalisations that do not apply in local contexts
be aware of potentially significant consequences in statistical outcomes that are the result of a particular arrangement of space