Introduction
Welcome to your crash course in spatial statistics for public health!
We’re going to cover a lot today in terms of analysis of spatial data.
We’ll explore mapping, clustering, and how relationships between
variables vary across geography. Buckle up!
The objectives of this guide are to teach you to:
- Use multiple methods to identify spatial autocorrelation
- Account for spatial autcorrelation in regression using spatial lag
models
- Conduct geographically weighted regression to model how associations
vary across space
Load packages
As always, we need to load our packages. We have a couple of new
friends joining us this week.
spdep: This package handles spatial
dependence — creating neighborhood structures and calculating statistics
like Moran’s I and Local Indicators of Spatial Association (LISA). Think
of it as your go-to for spatial autocorrelation.
spatialreg: This package builds on
spdep and provides maximum likelihood estimation for
spatial regression models like spatial lag and spatial error models.
It’s your go-to tool for formal spatial regressions in R.
GWmodel: A robust package for running
Geographically Weighted Regression (GWR), including model diagnostics,
bandwidth selection, and mapping local coefficients. It’s the main
engine behind spatially varying relationships.
spgwr: A lighter package also used for
GWR. In this tutorial, we use it mainly for its convenient
gwr.sel() function to find the best adaptive bandwidth via
cross-validation.
Together, these packages help us detect patterns, clusters, and
relationships that change across space — essential for spatial health
analysis.
Data Collection and Preparation
First, we need data. We’ll grab county-level diabetes prevalence for
2021 from the CDC and merge it with county geometries and median
household income from the American Community Survey 2021. For the
diabetes data, I downloaded this from the CDC
US Diabetes Surveillance System. It comes in a little ugly, so we
will have to clean up the column names, reformat the FIPS codes, and
filter to just California counties.
# Load and clean CDC diabetes CSV
download.file("https://raw.githubusercontent.com/pjames-ucdavis/SPH215/refs/heads/main/DiabetesAtlas_CountyData.csv", destfile = "DiabetesAtlas_CountyData.csv", mode = "wb")
diab_raw <- read_csv("DiabetesAtlas_CountyData.csv", skip = 2) |>
slice(1:(n() - 1)) # remove last row
# Clean column names
names(diab_raw) <- str_trim(names(diab_raw))
# Format FIPS codes with leading zeros
diab_raw <- diab_raw |>
mutate(CountyFIPS = str_pad(as.character(as.integer(CountyFIPS)), 5, pad = "0"))
# Filter for California counties only
ca_diabetes <- diab_raw |>
filter(str_starts(CountyFIPS, "06")) |>
mutate(Percentage = as.numeric(Percentage)) |>
select(FIPS = CountyFIPS, County, Diabetes = Percentage)
Now lets get our good ole median income data from the US Census.
Because it’s 2021 Diabetes data, we will pull the 2017-2021 five year
estimates from the ACS.
# Download median income and geometries for California counties
ca_geo <- get_acs(
geography = "county",
variables = c(income = "B19013_001"), # Median household income
state = "CA",
geometry = TRUE,
year = 2021
) %>%
rename(income = estimate) %>%
select(GEOID, NAME, income, geometry)
Now let’s join our diabetes dataset with our income dataset at the
county level. Then we will take a quick look at our data.
# Merge diabetes and ACS income data
ca_diab_data <- ca_geo |>
left_join(ca_diabetes, by = c("GEOID" = "FIPS"))
glimpse(ca_diab_data)
## Rows: 58
## Columns: 6
## $ GEOID <chr> "06059", "06111", "06063", "06015", "06023", "06043", "06037"…
## $ NAME <chr> "Orange County, California", "Ventura County, California", "P…
## $ income <dbl> 100485, 94150, 57885, 53280, 53350, 53304, 76367, 76066, 6700…
## $ County <chr> "Orange County", "Ventura County", "Plumas County", "Del Nort…
## $ Diabetes <dbl> 8.0, 7.8, 7.5, 8.1, 9.0, 7.3, 9.9, 9.8, 8.4, 7.9, 7.8, 8.5, 9…
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-118.1144 3..., MULTIPOLYGON (((…
summary(ca_diab_data$Diabetes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.800 7.325 7.900 8.059 8.600 11.200
summary(ca_diab_data$income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 42206 58833 70037 75898 89429 140258
🗺️ Exploratory Mapping
Let’s make a map and see what diabetes prevalence looks like across
counties in California.
tmap_mode("view")
tm_shape(ca_diab_data) +
tm_polygons(fill = "Diabetes",
fill.scale = tm_scale_intervals(style = "quantile",
values = "brewer.reds"),
fill.legend = tm_legend(title = "% Diagnosed Diabetes"))
Methods to Identify Spatial Autocorrelation
Now let’s ask: are diabetes rates clustered in space, or scattered
randomly like confetti in a strong breeze? This is where spatial
autocorrelation steps in. We’ll start with Moran’s I — the granddaddy of
spatial pattern detection.
Global Moran’s I
Moran’s I is a measure of overall spatial autocorrelation — that is,
how similar or dissimilar values (like diabetes prevalence) are in
nearby areas (i.e., in nearby counties).
- Values close to +1 mean strong positive spatial
autocorrelation — high values cluster with other high values, and low
with low (think: like attracts like).
- Values near 0 mean no spatial pattern — just
noise.
- Values near -1 mean high and low values are
actively repelling each other (rare in public health).
Let’s calculate Moran’s I and find out if diabetes prevalence in
California follows any spatial logic. First, we use
poly2nb() in the spdep package to figure
out which polygons are neighbors–that is which counties are adjacent to
each other. Next, we use nb2listw() from
spdep to create a spatial weights object. We are
basically giving weights to each neighbor. Then we run the Moran’s I
test using moran.test() and the weights we just
created.
nb <- poly2nb(ca_diab_data)
lw <- nb2listw(nb, style = "W")
moran.test(ca_diab_data$Diabetes, lw)
##
## Moran I test under randomisation
##
## data: ca_diab_data$Diabetes
## weights: lw
##
## Moran I statistic standard deviate = 3.7581, p-value = 8.559e-05
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.296544968 -0.017543860 0.006984903
Interpreting Moran’s I
- A Moran’s I value of ~0.3 means moderate positive
spatial autocorrelation — counties with similar diabetes rates are near
each other.
- A p-value < 0.001 tells us this clustering is
statistically significant — not random.
👉 Take home: Diabetes rates aren’t randomly
distributed across California. There are clusters of high and low
values, and we’ve got statistical proof! This result tells us that there
is statistically significant spatial clustering of diabetes prevalence
across California counties. Counties with high diabetes rates tend to be
near other high-rate counties, and the same is true for low-rate
counties. The Moran’s I value of ~0.3 indicates moderate positive
spatial autocorrelation, and the tiny p-value confirms that this pattern
is not due to chance.
Local Moran’s I (LISA)
While Moran’s I tells us whether there’s spatial autocorrelation in
general, LISA zooms in and tells us where it’s
happening.
LISA stands for Local Indicators of Spatial Association, and it helps
us detect:
- High-High clusters: High diabetes rates surrounded
by other high-rate counties (hot clusters).
- Low-Low clusters: Healthy zones with low diabetes
prevalence.
- High-Low or Low-High outliers: Counties that buck
the trend — like a donut hole in a cake.
Mapping LISA helps target interventions more precisely, and explain
those spatial oddballs.
To run LISA, we use the weights we just created, but now use the
function localmoran(). We will then pull out Ii,
which is the local Moran’s I statistic for each county — a measure of
local spatial autocorrelation. A high positive value indicates strong
clustering with neighbors (e.g., a high-rate county surrounded by other
high-rate counties). We also pull out Pval, which gives the
p-value for that local statistic, testing whether the observed local
clustering is statistically significant. Small p-values suggest that the
clustering is unlikely due to random chance.
local_moran <- localmoran(ca_diab_data$Diabetes, lw)
ca_diab_data$Ii <- local_moran[, "Ii"]
ca_diab_data$Pval <- local_moran[, "Pr(z != E(Ii))"]
# Map LISA
tm_shape(ca_diab_data) +
tm_polygons(fill = "Ii",
fill.scale = tm_scale_intervals(style = "quantile",
values = "brewer.pu_or"),
fill.legend = tm_legend(title = "Local Moran's I")) +
tm_title_out(text = "Local Spatial Clustering of Diabetes")
Interpreting the LISA Map
This map shows where spatial clustering of diabetes is strongest —
not just overall, but locally. Here’s how to read
it:
- Dark purple counties have strong local clustering —
likely high-high or low-low patterns. These are areas where diabetes
rates mirror their neighbors and form significant clusters.
- Orange or tan counties may be spatial
outliers — places where diabetes rates differ from
nearby counties. For example, a healthy county surrounded by high-rate
neighbors.
- White areas or muted tones indicate weak or
non-significant clustering.
👉 Take home: Counties with strong clustering (dark
purple) may benefit from regional strategies, while outliers may need
more localized investigation. This helps tailor public health
interventions to spatial realities.
🔥 Hot Spot Analysis (Gi*)
Now we spice things up with the Getis-Ord Gi* statistic — pronounced
“gee-eye-star.” This method identifies statistically significant
hot and cold spots in the data.
- A hot spot is a county with a high value (e.g.,
diabetes rate) surrounded by other high-value neighbors.
- A cold spot is the opposite: a low-value county in
a sea of low values.
Gi* doesn’t just look at one value — it considers the neighborhood
around each observation too. This is powerful for public health mapping,
especially when prioritizing interventions or allocating resources.
Hot and cold spots are areas with
significantly higher or lower rates than expected.
We run Gi* by using the function localG() from the
spdep package and using the same weights as before. We
can then map the gi_star values by county.
gi_star <- localG(ca_diab_data$Diabetes, lw)
ca_diab_data$Gi_star <- as.numeric(gi_star)
tm_shape(ca_diab_data) +
tm_polygons(fill = "Gi_star",
fill.scale = tm_scale_intervals(style = "pretty",
values = "-brewer.rd_bu"),
fill.legend = tm_legend(title = "Gi* Hot Spots"))
Interpreting the Gi* Map
This map uses the Getis-Ord Gi* statistic to highlight
statistically significant spatial clusters of diabetes
— also known as hot and cold spots:
- Dark red counties (Gi* > 2) are hot
spots: areas with high diabetes prevalence that are also
surrounded by other high-prevalence counties. These are prime targets
for regional public health efforts.
- Lighter shades (e.g., peach or sky blue) represent
less intense clustering — not as statistically extreme.
- Dark blue counties (Gi* < -2) are cold
spots: clusters of lower-than-average diabetes rates. Great for
understanding what’s going right.
👉 Take home: The red-hot lower half of the state
(e.g., Inyo, Orange) may benefit from coordinated, regional
interventions, while blue-cold northern counties may offer models of
prevention or service delivery worth emulating.
Spatial Lag Model
Now that we’ve explored autocorrelation and hot spots, let’s try a
classic spatial regression: the Spatial Lag Model. This
model is useful when the outcome in one location might be influenced by
nearby locations — like diabetes in one county being affected by its
neighbors, which we have seen above!
Based on the exploratory mapping, Moran scatterplot, and the global
Moran’s I, there appears to be spatial autocorrelation in diabetes. This
means that if there is a spatial lag process going on and we fit a
normal regression model our coefficients will be biased and inefficient.
That is, the coefficient sizes and signs are not close to their true
value and its standard errors are underestimated. This means trouble.
Big trouble. Real big trouble.
In a spatial lag model, we include a spatially lagged
dependent variable as a predictor: - It captures the
“spillover” effect of nearby values. - It helps address spatial
dependence that ordinary regression would miss.
We’ll use the lagsarlm() function from the
spatialreg package and will use the same spatial weights as
before.
# Define neighbors and spatial weights
nb <- poly2nb(ca_diab_data)
lw <- nb2listw(nb, style = "W")
# Run spatial lag model
lag_model <- lagsarlm(Diabetes ~ income, data = ca_diab_data, listw = lw)
summary(lag_model)
##
## Call:lagsarlm(formula = Diabetes ~ income, data = ca_diab_data, listw = lw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.75134 -0.66941 -0.17221 0.51176 2.63300
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.7377e+00 1.3120e+00 3.6112 0.0003048
## income -8.0421e-06 5.8702e-06 -1.3700 0.1706906
##
## Rho: 0.4808, LR test value: 9.6355, p-value: 0.0019085
## Asymptotic standard error: 0.14347
## z-value: 3.3512, p-value: 0.00080451
## Wald statistic: 11.231, p-value: 0.00080451
##
## Log likelihood: -84.05424 for lag model
## ML residual variance (sigma squared): 1.0024, (sigma: 1.0012)
## Number of observations: 58
## Number of parameters estimated: 4
## AIC: 176.11, (AIC for lm: 183.74)
## LM test for residual autocorrelation
## test value: 0.034085, p-value: 0.85353
Interpreting the Spatial Lag Model
Let’s break down the results of our spatial lag model:
- The spatial lag coefficient (rho) is about
0.48 and statistically significant (p <
0.002). That means diabetes rates in a county are strongly
influenced by neighboring counties — there’s real spatial
autocorrelation.
- The income coefficient is
negative, suggesting that higher income is associated
with lower diabetes, but it’s not statistically
significant (p = 0.171). So, in this model, income
alone doesn’t explain much variation in diabetes once spatial effects
are accounted for.
- The AIC (Akaike Information Criterion) is lower for
the spatial lag model than for the regular OLS model (176 vs. 184),
suggesting that the spatial lag model is a better fit—this means we
really should be accounting for spatial autocorrelation!
- The residual autocorrelation test is not
significant (p = 0.85), which means the model has successfully
addressed spatial dependence in the residuals.
👉 Take home: There is strong spatial dependence in
diabetes rates, and including spatial effects accounts for this spatial
autocorrelation. However, income may not be a key driver of diabetes
once those spatial influences are considered.
📉 Geographically Weighted Regression (GWR)
GWR — Geographically Weighted Regression — is like regular regression
but with a twist: the coefficients can change depending on
where you are.
Traditional regression assumes one-size-fits-all relationships. But
in public health, the impact of income on diabetes may differ from San
Francisco to Fresno.
GWR fits a separate regression model–no more assuming one
relationship is consistent across all areas in your analysis!
We will run GWR first by setting up our data to have the geometric
centroid of each county using st_centroid() from the
sf package. Then, we will use the
gwr.sel() function from spgwr and will use
the geometric centroid of our counties to find the optimal bandwidth for
our GWR model. We will then model the GWR using the gwr()
function from the GWmodel package. We will use the
bandwidths and centroids that we set up in the previous steps.
coords <- st_coordinates(st_centroid(ca_diab_data))
ca_diab_data$X <- coords[, 1]
ca_diab_data$Y <- coords[, 2]
bw <- gwr.sel(Diabetes ~ income, data = ca_diab_data, coords = coords, adapt = TRUE)
## Adaptive q: 0.381966 CV score: 64.59459
## Adaptive q: 0.618034 CV score: 67.46468
## Adaptive q: 0.236068 CV score: 60.33242
## Adaptive q: 0.145898 CV score: 54.84612
## Adaptive q: 0.09016994 CV score: 51.36862
## Adaptive q: 0.05572809 CV score: 52.55573
## Adaptive q: 0.0889909 CV score: 51.25455
## Adaptive q: 0.07731792 CV score: 51.04182
## Adaptive q: 0.06907134 CV score: 51.67776
## Adaptive q: 0.08166285 CV score: 50.95911
## Adaptive q: 0.08136232 CV score: 50.96031
## Adaptive q: 0.08205407 CV score: 50.95845
## Adaptive q: 0.08210682 CV score: 50.95844
## Adaptive q: 0.08214751 CV score: 50.95844
## Adaptive q: 0.08210682 CV score: 50.95844
# Run GWR using adaptive bandwidth
model <- gwr(Diabetes ~ income, data = ca_diab_data, coords = coords,
adapt = bw, hatmatrix = TRUE)
ca_diab_data$gwr_income <- model$SDF$income # Store model coefficients
Interpreting the GWR results
The output shows the model testing different ‘adaptive q’ values,
which represent the proportion of neighboring counties used in each
local model. It chooses the q value with the lowest cross-validation
(CV) score — best predictive fit. For example: q = 0.082 means ~8% of
counties (≈5 counties) are used in each local regression. Lower CV =
better model. It’s finding the ‘just right’ neighborhood size for each
regression.
🗾 Map Local GWR Coefficients
Let’s map those locally varying regression coefficients and see where
income matters more (or less) for diabetes.
tm_shape(ca_diab_data) +
tm_polygons(fill = "gwr_income",
tm_scale_intervals(style = "quantile",
values = "brewer.rd_yl_gn"),
fill.legend = tm_legend(title = "GWR: Income-Diabetes Association"))
🧩 Interpreting the GWR Map
This map shows how the relationship between income and diabetes
varies across California — it’s a spatial view of the
regression coefficient between income and diabetes.
All the values are negative, meaning higher
income is associated with lower diabetes prevalence —
but the strength of this relationship varies across
space.
Dark red counties (most negative values): These
are the places where income has the strongest inverse
relationship with diabetes. In other words, improving economic
conditions here could have a big public health
payoff.
Lighter yellow to green counties: These areas
still show a negative relationship, but it’s weaker.
That might mean other social or environmental factors are playing a
bigger role than income in these counties.
Dark green areas: The weakest negative
associations — here, income might matter less for diabetes prevention,
or other factors may dominate.
👉 Take home: This map helps public health
professionals tailor interventions. In counties where income is a strong
driver of diabetes, economic support programs might reduce prevalence.
In others, different strategies may be needed.
Take home
- Moran’s I: Measures overall spatial autocorrelation
(is there a pattern?).
- LISA/Gi: Reveal local clusters and hot/cold spots
(where’s the action?).
- Spatial Lag Model: Quantifies spatial
autocorrelation and adjusts for spatial autocorrelation in regression
models.
- GWR: Shows where exposure-outcome relationships are
strong, weak, or flipping direction.
Summary
Boom! You just crunched real public health data using cutting-edge
spatial tools to look at spatial clustering / autocorrelation. And we
even learned some regression approaches to account for spatial
autocorrelation. Well done! We’ve just scratched the surface on these
spatial statistics approaches, but we’ve come so far from Week 1! I’m
proud of you!
---
title: 'Lab 8: Spatial Statistics and Geographically Weighted Regression in Public
  Health'
---

\

# Introduction

Welcome to your crash course in spatial statistics for public health! We're going to cover a lot today in terms of analysis of spatial data. We'll explore mapping, clustering, and how relationships between variables vary across geography. Buckle up!

The objectives of this guide are to teach you to:

1. Use multiple methods to identify spatial autocorrelation
2. Account for spatial autcorrelation in regression using spatial lag models
3. Conduct geographically weighted regression to model how associations vary across space

\

# Load packages

As always, we need to load our packages. We have a couple of new friends joining us this week.

- **`spdep`**: This package handles spatial dependence — creating neighborhood structures and calculating statistics like Moran's I and Local Indicators of Spatial Association (LISA). Think of it as your go-to for spatial autocorrelation.
- **`spatialreg`**: This package builds on **spdep** and provides maximum likelihood estimation for spatial regression models like spatial lag and spatial error models. It's your go-to tool for formal spatial regressions in R.
- **`GWmodel`**: A robust package for running Geographically Weighted Regression (GWR), including model diagnostics, bandwidth selection, and mapping local coefficients. It’s the main engine behind spatially varying relationships.
- **`spgwr`**: A lighter package also used for GWR. In this tutorial, we use it mainly for its convenient `gwr.sel()` function to find the best adaptive bandwidth via cross-validation.

Together, these packages help us detect patterns, clusters, and relationships that change across space — essential for spatial health analysis.


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
library(tidyverse)
library(tidycensus)
library(sf)
library(tmap)
library(spdep)
library(GWmodel)
library(spatialreg)
library(spgwr)
```


\

# Data Collection and Preparation

First, we need data. We'll grab county-level diabetes prevalence for 2021 from the CDC and merge it with county geometries and median household income from the American Community Survey 2021. For the diabetes data, I downloaded this from the [CDC US Diabetes Surveillance System](https://gis.cdc.gov/grasp/diabetes/diabetesatlas-surveillance.html#). It comes in a little ugly, so we will have to clean up the column names, reformat the FIPS codes, and filter to just California counties.

```{r load-data}
# Load and clean CDC diabetes CSV
download.file("https://raw.githubusercontent.com/pjames-ucdavis/SPH215/refs/heads/main/DiabetesAtlas_CountyData.csv", destfile = "DiabetesAtlas_CountyData.csv", mode = "wb")
diab_raw <- read_csv("DiabetesAtlas_CountyData.csv", skip = 2) |> 
  slice(1:(n() - 1))  # remove last row

# Clean column names
names(diab_raw) <- str_trim(names(diab_raw))

# Format FIPS codes with leading zeros
diab_raw <- diab_raw |> 
  mutate(CountyFIPS = str_pad(as.character(as.integer(CountyFIPS)), 5, pad = "0"))

# Filter for California counties only
ca_diabetes <- diab_raw |> 
  filter(str_starts(CountyFIPS, "06")) |> 
  mutate(Percentage = as.numeric(Percentage)) |> 
  select(FIPS = CountyFIPS, County, Diabetes = Percentage)
```

\

Now lets get our good ole median income data from the US Census. Because it's 2021 Diabetes data, we will pull the 2017-2021 five year estimates from the ACS.
```{r get-acs, results=FALSE}
# Download median income and geometries for California counties
ca_geo <- get_acs(
  geography = "county",
  variables = c(income = "B19013_001"),  # Median household income
  state = "CA",
  geometry = TRUE,
  year = 2021
) %>%
  rename(income = estimate) %>%
  select(GEOID, NAME, income, geometry)
```

\

Now let's join our diabetes dataset with our income dataset at the county level. Then we will take a quick look at our data.
```{r}
# Merge diabetes and ACS income data
ca_diab_data <- ca_geo |> 
  left_join(ca_diabetes, by = c("GEOID" = "FIPS"))
glimpse(ca_diab_data)
summary(ca_diab_data$Diabetes)
summary(ca_diab_data$income)
```

\

# 🗺️ Exploratory Mapping

Let's make a map and see what diabetes prevalence looks like across counties in California. 

```{r exploratory-mapping}

tmap_mode("view")
tm_shape(ca_diab_data) +
  tm_polygons(fill = "Diabetes", 
              fill.scale = tm_scale_intervals(style = "quantile",
                                              values = "brewer.reds"),
              fill.legend = tm_legend(title = "% Diagnosed Diabetes"))
```

\

# Methods to Identify Spatial Autocorrelation

Now let’s ask: are diabetes rates clustered in space, or scattered randomly like confetti in a strong breeze? This is where spatial autocorrelation steps in. We'll start with Moran’s I — the granddaddy of spatial pattern detection.

\

## Global Moran’s I

Moran’s I is a measure of overall spatial autocorrelation — that is, how similar or dissimilar values (like diabetes prevalence) are in nearby areas (i.e., in nearby counties). 

- Values close to **+1** mean strong positive spatial autocorrelation — high values cluster with other high values, and low with low (think: like attracts like).
- Values near **0** mean no spatial pattern — just noise.
- Values near **-1** mean high and low values are actively repelling each other (rare in public health).

Let's calculate Moran’s I and find out if diabetes prevalence in California follows any spatial logic. First, we use `poly2nb()` in the **spdep** package to figure out which polygons are neighbors--that is which counties are adjacent to each other. Next, we use `nb2listw()` from **spdep** to create a spatial weights object. We are basically giving weights to each neighbor. Then we run the Moran's I test using `moran.test()` and the weights we just created. 

```{r morans-i}
nb <- poly2nb(ca_diab_data) 
lw <- nb2listw(nb, style = "W")
moran.test(ca_diab_data$Diabetes, lw)
```

### Interpreting Moran's I

- A **Moran’s I value of ~0.3** means moderate positive spatial autocorrelation — counties with similar diabetes rates are near each other.
- A **p-value < 0.001** tells us this clustering is **statistically significant** — not random.

👉 **Take home:** Diabetes rates aren’t randomly distributed across California. There are clusters of high and low values, and we’ve got statistical proof! This result tells us that there is statistically significant spatial clustering of diabetes prevalence across California counties. Counties with high diabetes rates tend to be near other high-rate counties, and the same is true for low-rate counties. The Moran’s I value of ~0.3 indicates moderate positive spatial autocorrelation, and the tiny p-value confirms that this pattern is not due to chance.

\

## Local Moran’s I (LISA)

While Moran’s I tells us whether there's spatial autocorrelation in general, LISA zooms in and tells us **where** it's happening. 

LISA stands for Local Indicators of Spatial Association, and it helps us detect:

- **High-High clusters**: High diabetes rates surrounded by other high-rate counties (hot clusters).
- **Low-Low clusters**: Healthy zones with low diabetes prevalence.
- **High-Low or Low-High outliers**: Counties that buck the trend — like a donut hole in a cake.

Mapping LISA helps target interventions more precisely, and explain those spatial oddballs.

To run LISA, we use the weights we just created, but now use the function `localmoran()`. We will then pull out *Ii*, which is the local Moran's I statistic for each county — a measure of local spatial autocorrelation. A high positive value indicates strong clustering with neighbors (e.g., a high-rate county surrounded by other high-rate counties). We also pull out *Pval*, which gives the p-value for that local statistic, testing whether the observed local clustering is statistically significant. Small p-values suggest that the clustering is unlikely due to random chance.

```{r lisa}
local_moran <- localmoran(ca_diab_data$Diabetes, lw)
ca_diab_data$Ii <- local_moran[, "Ii"]
ca_diab_data$Pval <- local_moran[, "Pr(z != E(Ii))"]

# Map LISA
tm_shape(ca_diab_data) +
  tm_polygons(fill = "Ii", 
              fill.scale = tm_scale_intervals(style = "quantile", 
                                              values = "brewer.pu_or"), 
              fill.legend = tm_legend(title = "Local Moran's I")) +
  tm_title_out(text = "Local Spatial Clustering of Diabetes")
```

\

### Interpreting the LISA Map

This map shows where spatial clustering of diabetes is strongest — not just overall, but **locally**. Here's how to read it:

- **Dark purple counties** have strong local clustering — likely high-high or low-low patterns. These are areas where diabetes rates mirror their neighbors and form significant clusters.
- **Orange or tan counties** may be spatial **outliers** — places where diabetes rates differ from nearby counties. For example, a healthy county surrounded by high-rate neighbors.
- **White areas or muted tones** indicate weak or non-significant clustering.

👉 **Take home:** Counties with strong clustering (dark purple) may benefit from regional strategies, while outliers may need more localized investigation. This helps tailor public health interventions to spatial realities.

\

## 🔥 Hot Spot Analysis (Gi*)

Now we spice things up with the Getis-Ord Gi* statistic — pronounced "gee-eye-star." This method identifies **statistically significant hot and cold spots** in the data.

- A **hot spot** is a county with a high value (e.g., diabetes rate) surrounded by other high-value neighbors.
- A **cold spot** is the opposite: a low-value county in a sea of low values.

Gi* doesn't just look at one value — it considers the neighborhood around each observation too. This is powerful for public health mapping, especially when prioritizing interventions or allocating resources. **Hot** and **cold** spots are areas with significantly higher or lower rates than expected.

We run Gi* by using the function `localG()` from the **spdep** package and using the same weights as before. We can then map the *gi_star* values by county.

```{r gi-star}
gi_star <- localG(ca_diab_data$Diabetes, lw)
ca_diab_data$Gi_star <- as.numeric(gi_star)

tm_shape(ca_diab_data) +
  tm_polygons(fill = "Gi_star", 
              fill.scale = tm_scale_intervals(style = "pretty",
                                              values = "-brewer.rd_bu"),
              fill.legend = tm_legend(title = "Gi* Hot Spots"))
```

### Interpreting the Gi* Map

This map uses the Getis-Ord Gi* statistic to highlight **statistically significant spatial clusters** of diabetes — also known as hot and cold spots:

- **Dark red counties** (Gi* > 2) are **hot spots**: areas with high diabetes prevalence that are also surrounded by other high-prevalence counties. These are prime targets for regional public health efforts.
- **Lighter shades** (e.g., peach or sky blue) represent less intense clustering — not as statistically extreme.
- **Dark blue counties** (Gi* < -2) are **cold spots**: clusters of lower-than-average diabetes rates. Great for understanding what’s going right.


👉 **Take home:** The red-hot lower half of the state (e.g., Inyo, Orange) may benefit from coordinated, regional interventions, while blue-cold northern counties may offer models of prevention or service delivery worth emulating.

\

# Spatial Lag Model

Now that we've explored autocorrelation and hot spots, let’s try a classic spatial regression: the **Spatial Lag Model**. This model is useful when the outcome in one location might be influenced by nearby locations — like diabetes in one county being affected by its neighbors, which we have seen above!

Based on the exploratory mapping, Moran scatterplot, and the global Moran’s I, there appears to be spatial autocorrelation in diabetes. This means that if there is a spatial lag process going on and we fit a normal regression model our coefficients will be biased and inefficient. That is, the coefficient sizes and signs are not close to their true value and its standard errors are underestimated. This means trouble. Big trouble. Real big trouble.

In a spatial lag model, we include a **spatially lagged dependent variable** as a predictor:
- It captures the "spillover" effect of nearby values.
- It helps address spatial dependence that ordinary regression would miss.

We'll use the `lagsarlm()` function from the `spatialreg` package and will use the same spatial weights as before.

```{r spatial-lag-model}
# Define neighbors and spatial weights
nb <- poly2nb(ca_diab_data)
lw <- nb2listw(nb, style = "W")

# Run spatial lag model
lag_model <- lagsarlm(Diabetes ~ income, data = ca_diab_data, listw = lw)
summary(lag_model)
```

## Interpreting the Spatial Lag Model

Let’s break down the results of our spatial lag model:

- The **spatial lag coefficient (rho)** is about **0.48** and statistically significant (*p < 0.002*). That means diabetes rates in a county are **strongly influenced by neighboring counties** — there's real spatial autocorrelation.
- The **income coefficient** is **negative**, suggesting that higher income is associated with lower diabetes, but it’s **not statistically significant** (*p = 0.171*). So, in this model, income alone doesn't explain much variation in diabetes once spatial effects are accounted for.
- The **AIC (Akaike Information Criterion)** is lower for the spatial lag model than for the regular OLS model (176 vs. 184), suggesting that the spatial lag model is a better fit---this means we really should be accounting for spatial autocorrelation!
- The **residual autocorrelation test** is not significant (*p = 0.85*), which means the model has successfully addressed spatial dependence in the residuals.

👉 **Take home:** There is strong spatial dependence in diabetes rates, and including spatial effects accounts for this spatial autocorrelation. However, income may not be a key driver of diabetes once those spatial influences are considered.

\

# 📉 Geographically Weighted Regression (GWR)

GWR — Geographically Weighted Regression — is like regular regression but with a twist: the coefficients can change depending on **where** you are.

Traditional regression assumes one-size-fits-all relationships. But in public health, the impact of income on diabetes may differ from San Francisco to Fresno.

GWR fits a separate regression model--no more assuming one relationship is consistent across all areas in your analysis!

We will run GWR first by setting up our data to have the geometric centroid of each county using `st_centroid()` from the **sf** package. Then, we will use the `gwr.sel()` function from **spgwr** and will use the geometric centroid of our counties to find the optimal bandwidth for our GWR model. We will then model the GWR using the `gwr()` function from the **GWmodel** package. We will use the bandwidths and centroids that we set up in the previous steps.

```{r gwr}
coords <- st_coordinates(st_centroid(ca_diab_data))
ca_diab_data$X <- coords[, 1]
ca_diab_data$Y <- coords[, 2]

bw <- gwr.sel(Diabetes ~ income, data = ca_diab_data, coords = coords, adapt = TRUE)

# Run GWR using adaptive bandwidth
model <- gwr(Diabetes ~ income, data = ca_diab_data, coords = coords, 
             adapt = bw, hatmatrix = TRUE)

ca_diab_data$gwr_income <- model$SDF$income # Store model coefficients
```

\

## Interpreting the GWR results

The output shows the model testing different 'adaptive q' values, which represent the proportion of neighboring counties used in each local model. It chooses the q value with the lowest cross-validation (CV) score — best predictive fit. For example: q = 0.082 means ~8% of counties (≈5 counties) are used in each local regression. Lower CV = better model. It's finding the 'just right' neighborhood size for each regression.

\

## 🗾 Map Local GWR Coefficients

Let’s map those locally varying regression coefficients and see where income matters more (or less) for diabetes.

```{r gwr-map}
tm_shape(ca_diab_data) +
  tm_polygons(fill = "gwr_income", 
              tm_scale_intervals(style = "quantile",
                                 values = "brewer.rd_yl_gn"), 
              fill.legend = tm_legend(title = "GWR: Income-Diabetes Association"))
```

\

### 🧩 Interpreting the GWR Map

This map shows how the relationship between income and diabetes **varies across California** — it's a spatial view of the regression coefficient between income and diabetes.

- All the values are **negative**, meaning higher income is associated with **lower diabetes prevalence** — but **the strength** of this relationship varies across space.

- **Dark red counties** (most negative values): These are the places where income has the **strongest inverse relationship** with diabetes. In other words, improving economic conditions here could have a **big public health payoff**.
- **Lighter yellow to green counties**: These areas still show a negative relationship, but it's **weaker**. That might mean other social or environmental factors are playing a bigger role than income in these counties.
- **Dark green areas**: The weakest negative associations — here, income might matter less for diabetes prevention, or other factors may dominate.

👉 **Take home**: This map helps public health professionals tailor interventions. In counties where income is a strong driver of diabetes, economic support programs might reduce prevalence. In others, different strategies may be needed.

\

# Take home

- **Moran's I**: Measures overall spatial autocorrelation (is there a pattern?).
- **LISA/Gi**: Reveal local clusters and hot/cold spots (where’s the action?).
- **Spatial Lag Model**: Quantifies spatial autocorrelation and adjusts for spatial autocorrelation in regression models.
- **GWR**: Shows where exposure-outcome relationships are strong, weak, or flipping direction.

# Summary

Boom! You just crunched real public health data using cutting-edge spatial tools to look at spatial clustering / autocorrelation. And we even learned some regression approaches to account for spatial autocorrelation. Well done! We've just scratched the surface on these spatial statistics approaches, but we've come so far from Week 1! I'm proud of you!

