Welcome to your crash course in spatial statistics for public health! We’re going to cover a lot today in terms of analysis of spatial data. We’ll explore mapping, clustering, and how relationships between variables vary across geography. Buckle up!
The objectives of this guide are to teach you to:
- Map county-level health data
- Test for global spatial autocorrelation with Moran’s I
- Identify local clusters and outliers with LISA and the Getis-Ord Gi* statistic
- Fit a spatial lag regression model
- Run geographically weighted regression (GWR) to explore spatially varying relationships
As always, we need to load our packages. We have a couple of new friends joining us this week.
- spdep: This package handles spatial dependence — creating neighborhood structures and calculating statistics like Moran’s I and Local Indicators of Spatial Association (LISA). Think of it as your go-to for spatial autocorrelation.
- spatialreg: This package builds on spdep and provides maximum likelihood estimation for spatial regression models like spatial lag and spatial error models. It’s your go-to tool for formal spatial regressions in R.
- GWmodel: A robust package for running Geographically Weighted Regression (GWR), including model diagnostics, bandwidth selection, and mapping local coefficients. It’s the main engine behind spatially varying relationships.
- spgwr: A lighter package also used for GWR. In this tutorial, we use it mainly for its convenient gwr.sel() function to find the best adaptive bandwidth via cross-validation.

Together, these packages help us detect patterns, clusters, and relationships that change across space — essential for spatial health analysis.
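The setup chunk itself isn’t shown above, so here is a minimal loading block — a sketch assuming these packages (plus tidyverse, sf, tmap, and tidycensus, which the later code relies on) are already installed via install.packages():

```r
# Data wrangling, geometries, and mapping
library(tidyverse)   # read_csv(), dplyr verbs, stringr helpers
library(sf)          # simple features for county geometries
library(tmap)        # thematic maps (tmap_mode, tm_shape, tm_polygons)
library(tidycensus)  # get_acs() for ACS income data

# Spatial statistics
library(spdep)       # neighbors, weights, Moran's I, LISA, Gi*
library(spatialreg)  # lagsarlm() for spatial lag models
library(GWmodel)     # GWR diagnostics and mapping
library(spgwr)       # gwr.sel() and gwr()
```

Note that get_acs() also requires a Census API key (see tidycensus’s census_api_key()) if you have not set one up before.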
First, we need data. We’ll grab county-level diabetes prevalence for 2021 from the CDC and merge it with county geometries and median household income from the American Community Survey 2021. For the diabetes data, I downloaded this from the CDC US Diabetes Surveillance System. It comes in a little ugly, so we will have to clean up the column names, reformat the FIPS codes, and filter to just California counties.
# Load and clean CDC diabetes CSV
download.file("https://raw.githubusercontent.com/pjames-ucdavis/SPH215/refs/heads/main/DiabetesAtlas_CountyData.csv", destfile = "DiabetesAtlas_CountyData.csv", mode = "wb")
diab_raw <- read_csv("DiabetesAtlas_CountyData.csv", skip = 2) |>
slice(1:(n() - 1)) # remove last row
# Clean column names
names(diab_raw) <- str_trim(names(diab_raw))
# Format FIPS codes with leading zeros
diab_raw <- diab_raw |>
mutate(CountyFIPS = str_pad(as.character(as.integer(CountyFIPS)), 5, pad = "0"))
# Filter for California counties only
ca_diabetes <- diab_raw |>
filter(str_starts(CountyFIPS, "06")) |>
mutate(Percentage = as.numeric(Percentage)) |>
select(FIPS = CountyFIPS, County, Diabetes = Percentage)
Now let’s get our good ole median income data from the US Census. Because it’s 2021 diabetes data, we will pull the 2017–2021 five-year estimates from the ACS.
# Download median income and geometries for California counties
ca_geo <- get_acs(
geography = "county",
variables = c(income = "B19013_001"), # Median household income
state = "CA",
geometry = TRUE,
year = 2021
) %>%
rename(income = estimate) %>%
select(GEOID, NAME, income, geometry)
Now let’s join our diabetes dataset with our income dataset at the county level. Then we will take a quick look at our data.
# Merge diabetes and ACS income data
ca_diab_data <- ca_geo |>
left_join(ca_diabetes, by = c("GEOID" = "FIPS"))
glimpse(ca_diab_data)
## Rows: 58
## Columns: 6
## $ GEOID <chr> "06059", "06111", "06063", "06015", "06023", "06043", "06037"…
## $ NAME <chr> "Orange County, California", "Ventura County, California", "P…
## $ income <dbl> 100485, 94150, 57885, 53280, 53350, 53304, 76367, 76066, 6700…
## $ County <chr> "Orange County", "Ventura County", "Plumas County", "Del Nort…
## $ Diabetes <dbl> 8.0, 7.8, 7.5, 8.1, 9.0, 7.3, 9.9, 9.8, 8.4, 7.9, 7.8, 8.5, 9…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-118.1144 3..., MULTIPOLYGON (((…
summary(ca_diab_data$Diabetes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.800 7.325 7.900 8.059 8.600 11.200
summary(ca_diab_data$income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 42206 58833 70037 75898 89428 140258
Let’s make a map and see what diabetes prevalence looks like across counties in California.
tmap_mode("view")
tm_shape(ca_diab_data) +
tm_polygons("Diabetes", palette = "Reds", style = "quantile",
title = "% Diagnosed Diabetes")
Now let’s ask: are diabetes rates clustered in space, or scattered randomly like confetti in a strong breeze? This is where spatial autocorrelation steps in. We’ll start with Moran’s I — the granddaddy of spatial pattern detection.
Moran’s I is a measure of overall spatial autocorrelation — that is, how similar or dissimilar values (like diabetes prevalence) are in nearby areas (i.e., in nearby counties).
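Formally (a standard definition, spelled out here for reference), for values x_i across n counties with spatial weights w_ij:

```latex
I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot
    \frac{\sum_i \sum_j w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}
```

Values near +1 indicate clustering of similar values, values near 0 suggest a spatially random pattern, and negative values indicate a checkerboard-like pattern of dissimilar neighbors.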
Let’s calculate Moran’s I and find out if diabetes prevalence in
California follows any spatial logic. First, we use
poly2nb() in the spdep package to figure
out which polygons are neighbors, that is, which counties are adjacent to
each other. Next, we use nb2listw() from
spdep to create a spatial weights object. We are
basically giving weights to each neighbor. Then we run the Moran’s I
test using moran.test() and the weights we just
created.
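As a toy sketch of what nb2listw(nb, style = "W") produces: imagine three counties A, B, C in a row, where A and C each touch only B. Row-standardization divides each neighbor’s weight by the row total:

```latex
B = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}
\quad \longrightarrow \quad
W = \begin{pmatrix} 0 & 1 & 0 \\ \tfrac{1}{2} & 0 & \tfrac{1}{2} \\ 0 & 1 & 0 \end{pmatrix}
```

With row-standardized weights, each county’s “spatial lag” Wx is simply the average of its neighbors’ values, which makes Moran’s I easy to interpret as a correlation between each value and its neighborhood average.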
nb <- poly2nb(ca_diab_data)
lw <- nb2listw(nb, style = "W")
moran.test(ca_diab_data$Diabetes, lw)
##
## Moran I test under randomisation
##
## data: ca_diab_data$Diabetes
## weights: lw
##
## Moran I statistic standard deviate = 3.7581, p-value = 8.559e-05
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.296544968 -0.017543860 0.006984903
👉 Take home: Diabetes rates aren’t randomly distributed across California. There are clusters of high and low values, and we’ve got statistical proof! This result tells us that there is statistically significant spatial clustering of diabetes prevalence across California counties. Counties with high diabetes rates tend to be near other high-rate counties, and the same is true for low-rate counties. The Moran’s I value of ~0.3 indicates moderate positive spatial autocorrelation, and the tiny p-value confirms that this pattern is not due to chance.
While Moran’s I tells us whether there’s spatial autocorrelation in general, LISA zooms in and tells us where it’s happening.
LISA stands for Local Indicators of Spatial Association, and it helps us detect:
- Clusters: counties whose values resemble their neighbors’ (high-high or low-low)
- Outliers: counties that differ sharply from their neighbors (high-low or low-high)
Mapping LISA helps target interventions more precisely, and explain those spatial oddballs.
To run LISA, we use the weights we just created, but now use the
function localmoran(). We will then pull out Ii,
which is the local Moran’s I statistic for each county — a measure of
local spatial autocorrelation. A high positive value indicates strong
clustering with neighbors (e.g., a high-rate county surrounded by other
high-rate counties). We also pull out Pval, which gives the
p-value for that local statistic, testing whether the observed local
clustering is statistically significant. Small p-values suggest that the
clustering is unlikely due to random chance.
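The local statistic that localmoran() computes for county i takes the standard form (with m_2 the variance of the values over all counties):

```latex
I_i = \frac{x_i - \bar{x}}{m_2} \sum_j w_{ij}\,(x_j - \bar{x}),
\qquad m_2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2
```

With row-standardized weights, the average of the I_i equals the global Moran’s I, which is why LISA is often described as decomposing the global statistic into local contributions.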
local_moran <- localmoran(ca_diab_data$Diabetes, lw)
ca_diab_data$Ii <- local_moran[, "Ii"]
ca_diab_data$Pval <- local_moran[, "Pr(z != E(Ii))"]
# Map LISA
cat("\n### Local Spatial Clustering of Diabetes\n")
##
## ### Local Spatial Clustering of Diabetes
tm_shape(ca_diab_data) +
tm_polygons("Ii", style = "quantile", palette = "PuOr", title = "Local Moran's I")
Interpreting the LISA Map
This map shows where spatial clustering of diabetes is strongest — not just overall, but locally. Here’s how to read it:
- Large positive Ii values (dark purple): strong local clustering — the county’s diabetes rate resembles its neighbors’ (high-high or low-low)
- Values near zero (pale shades): little local spatial association
- Negative Ii values (orange): spatial outliers — counties that differ sharply from their neighbors
- Check Pval before drawing conclusions: only counties with small p-values show statistically significant local clustering
👉 Take home: Counties with strong clustering (dark purple) may benefit from regional strategies, while outliers may need more localized investigation. This helps tailor public health interventions to spatial realities.
Now we spice things up with the Getis-Ord Gi* statistic — pronounced “gee-eye-star.” This method identifies statistically significant hot and cold spots in the data.
Gi* doesn’t just look at one value — it considers the neighborhood around each observation too. This is powerful for public health mapping, especially when prioritizing interventions or allocating resources. Hot and cold spots are areas with significantly higher or lower rates than expected.
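For reference, the standard Ord–Getis standardized form (assumed here rather than taken from the original) expresses Gi* for county i as a z-score, where the weights w_ij include county i itself:

```latex
G_i^* = \frac{\sum_j w_{ij} x_j - \bar{x} \sum_j w_{ij}}
             {S \sqrt{\dfrac{n \sum_j w_{ij}^2 - \left(\sum_j w_{ij}\right)^2}{n - 1}}},
\qquad S = \sqrt{\frac{\sum_j x_j^2}{n} - \bar{x}^2}
```

Values above roughly +2 flag statistically significant hot spots and values below −2 flag cold spots at conventional significance levels.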
We run Gi* using the function localG() from the
spdep package and the same weights as before. (One
subtlety: Gi* proper includes each county in its own neighborhood;
with plain contiguity weights like ours, localG() technically
returns the closely related Gi statistic — wrap the neighbor list in
include.self() before building the weights if you want Gi* exactly.)
We can then map the gi_star values by county.
gi_star <- localG(ca_diab_data$Diabetes, lw)
ca_diab_data$Gi_star <- as.numeric(gi_star)
tm_shape(ca_diab_data) +
tm_polygons("Gi_star", palette = "-RdBu", style = "pretty", title = "Gi* Hot Spots")
This map uses the Getis-Ord Gi* statistic to highlight statistically significant spatial clusters of diabetes — also known as hot and cold spots:
👉 Take home: The red-hot lower half of the state (e.g., Inyo, Orange) may benefit from coordinated, regional interventions, while blue-cold northern counties may offer models of prevention or service delivery worth emulating.
Now that we’ve explored autocorrelation and hot spots, let’s try a classic spatial regression: the Spatial Lag Model. This model is useful when the outcome in one location might be influenced by nearby locations — like diabetes in one county being affected by its neighbors, which we have seen above!
Based on the exploratory mapping and the global Moran’s I, there appears to be spatial autocorrelation in diabetes. This means that if a spatial lag process is at work and we fit a normal regression model, our coefficients will be biased and inefficient — that is, the coefficient sizes and signs may be far from their true values, and their standard errors will be underestimated. This means trouble. Big trouble. Real big trouble.
In a spatial lag model, we include a spatially lagged dependent variable as a predictor: - It captures the “spillover” effect of nearby values. - It helps address spatial dependence that ordinary regression would miss.
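In matrix form, the spatial lag model we are about to fit is:

```latex
y = \rho W y + X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I)
```

where W is the row-standardized weights matrix and rho measures how strongly each county’s outcome depends on its neighbors’ outcomes; setting rho to 0 collapses the model back to ordinary least squares.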
We’ll use the lagsarlm() function from the
spatialreg package and will use the same spatial weights as
before.
# Define neighbors and spatial weights
nb <- poly2nb(ca_diab_data)
lw <- nb2listw(nb, style = "W")
# Run spatial lag model
lag_model <- lagsarlm(Diabetes ~ income, data = ca_diab_data, listw = lw)
summary(lag_model)
##
## Call:lagsarlm(formula = Diabetes ~ income, data = ca_diab_data, listw = lw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.75134 -0.66941 -0.17221 0.51176 2.63300
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.7377e+00 1.3120e+00 3.6112 0.0003048
## income -8.0421e-06 5.8702e-06 -1.3700 0.1706906
##
## Rho: 0.4808, LR test value: 9.6355, p-value: 0.0019085
## Asymptotic standard error: 0.14347
## z-value: 3.3512, p-value: 0.00080451
## Wald statistic: 11.231, p-value: 0.00080451
##
## Log likelihood: -84.05424 for lag model
## ML residual variance (sigma squared): 1.0024, (sigma: 1.0012)
## Number of observations: 58
## Number of parameters estimated: 4
## AIC: 176.11, (AIC for lm: 183.74)
## LM test for residual autocorrelation
## test value: 0.034085, p-value: 0.85353
Let’s break down the results of our spatial lag model:
- Rho = 0.48 (p ≈ 0.0008): strong, statistically significant spatial dependence — each county’s diabetes rate is pulled toward its neighbors’
- income coefficient ≈ −8.0e-06 (p = 0.17): negative, but not statistically significant once spatial dependence is modeled
- AIC = 176.11 vs. 183.74 for the ordinary lm: the lag model fits the data better
- LM test for residual autocorrelation (p = 0.85): the lag term has soaked up the spatial structure, leaving no significant residual autocorrelation
👉 Take home: There is strong spatial dependence in diabetes rates, and including spatial effects accounts for this spatial autocorrelation. However, income may not be a key driver of diabetes once those spatial influences are considered.
GWR — Geographically Weighted Regression — is like regular regression but with a twist: the coefficients can change depending on where you are.
Traditional regression assumes one-size-fits-all relationships. But in public health, the impact of income on diabetes may differ from San Francisco to Fresno.
GWR fits a separate regression model at each location — no more assuming one relationship is consistent across all areas in your analysis!
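Concretely, instead of one global coefficient, GWR estimates coefficients as functions of each county’s location (u_i, v_i):

```latex
y_i = \beta_0(u_i, v_i) + \beta_1(u_i, v_i)\,\mathrm{income}_i + \varepsilon_i
```

Each local fit weights nearby counties more heavily than distant ones, using a kernel whose bandwidth (the size of the local neighborhood) we select next via cross-validation.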
We will run GWR first by setting up our data to have the geometric
centroid of each county using st_centroid() from the
sf package. Then, we will use the
gwr.sel() function from spgwr and will use
the geometric centroid of our counties to find the optimal bandwidth for
our GWR model. We will then fit the GWR using the gwr()
function from the spgwr package, using the bandwidth and
centroids that we set up in the previous steps.
coords <- st_coordinates(st_centroid(ca_diab_data))
ca_diab_data$X <- coords[, 1]
ca_diab_data$Y <- coords[, 2]
bw <- gwr.sel(Diabetes ~ income, data = ca_diab_data, coords = coords, adapt = TRUE)
## Adaptive q: 0.381966 CV score: 64.59459
## Adaptive q: 0.618034 CV score: 67.46468
## Adaptive q: 0.236068 CV score: 60.33242
## Adaptive q: 0.145898 CV score: 54.84612
## Adaptive q: 0.09016994 CV score: 51.36862
## Adaptive q: 0.05572809 CV score: 52.55573
## Adaptive q: 0.0889909 CV score: 51.25455
## Adaptive q: 0.07731792 CV score: 51.04182
## Adaptive q: 0.06907134 CV score: 51.67776
## Adaptive q: 0.08166285 CV score: 50.95911
## Adaptive q: 0.08136232 CV score: 50.96031
## Adaptive q: 0.08205407 CV score: 50.95845
## Adaptive q: 0.08210682 CV score: 50.95844
## Adaptive q: 0.08214751 CV score: 50.95844
## Adaptive q: 0.08210682 CV score: 50.95844
# Run GWR using adaptive bandwidth
model <- gwr(Diabetes ~ income, data = ca_diab_data, coords = coords,
adapt = bw, hatmatrix = TRUE)
ca_diab_data$gwr_income <- model$SDF$income # Store model coefficients
The output shows the model testing different ‘adaptive q’ values, which represent the proportion of neighboring counties used in each local model. It chooses the q value with the lowest cross-validation (CV) score — best predictive fit. For example: q = 0.082 means ~8% of counties (≈5 counties) are used in each local regression. Lower CV = better model. It’s finding the ‘just right’ neighborhood size for each regression.
Let’s map those locally varying regression coefficients and see where income matters more (or less) for diabetes.
tm_shape(ca_diab_data) +
tm_polygons("gwr_income", palette = "RdYlGn", style = "quantile", title = "GWR: Income-Diabetes Association")
This map shows how the relationship between income and diabetes varies across California — it’s a spatial view of the regression coefficient between income and diabetes.
All the values are negative, meaning higher income is associated with lower diabetes prevalence — but the strength of this relationship varies across space.
Dark red counties (most negative values): These are the places where income has the strongest inverse relationship with diabetes. In other words, improving economic conditions here could have a big public health payoff.
Lighter yellow to green counties: These areas still show a negative relationship, but it’s weaker. That might mean other social or environmental factors are playing a bigger role than income in these counties.
Dark green areas: The weakest negative associations — here, income might matter less for diabetes prevention, or other factors may dominate.
👉 Take home: This map helps public health professionals tailor interventions. In counties where income is a strong driver of diabetes, economic support programs might reduce prevalence. In others, different strategies may be needed.
Boom! You just crunched real public health data using cutting-edge spatial tools to look at spatial clustering / autocorrelation. And we even learned some regression approaches to account for spatial autocorrelation. Well done! We’ve just scratched the surface on these spatial statistics approaches, but we’ve come so far from Week 1! I’m proud of you!