Applied Spatial Statistics（七）：Python 中的空间回归

本笔记本演示了如何使用 pysal 的 spreg 库拟合空间滞后模型和空间误差模型。

OLS
空间误差模型
空间滞后模型
三种模型的比较
探索滞后模型中的直接和间接影响

import numpy as np
import pandas as pdimport geopandas as gpd
import seaborn as sns
import matplotlib.pyplot as plt

from libpysal.weights import Queen
from splot.esda import plot_moran
from esda.moran import Moran
import spreg

1.数据

在此笔记本中，我将使用 2020 年美国总统选举数据集进行演示。
voting 数据框包含县级投票给民主党的人数百分比（编码为 new_pct_dem）以及该县的一些社会经济变量。该数据集仅包含美国本土 48 个州的统计数据。

voting = pd.read_csv('https://raw.github.com/Ziqi-Li/gis5122/master/data/voting_2020.csv')voting[['median_income']] = voting[['median_income']]/10000

voting.head()

	county_id	state	county	NAME	proj_X	proj_Y	total_pop	new_pct_dem	sex_ratio	pct_black	...	median_income	pct_65_over	pct_age_18_29	gini	pct_manuf	ln_pop_den	pct_3rd_party	turn_out	pct_fb	pct_uninsured
0	17051	17	51	Fayette County, Illinois	597979.5531	1796861.993	21565	18.445122	113.6	4.7	...	4.6650	18.8	14.899142	0.4373	14.9	3.392715	1.923652	58.930984	1.3	8.2
1	17107	17	107	Logan County, Illinois	559814.6766	1920479.975	29003	29.420030	97.2	6.9	...	5.7308	18.0	17.256836	0.4201	12.4	3.847224	2.332850	56.631552	1.6	4.5
2	17165	17	165	Saline County, Illinois	650278.3579	1660709.808	23994	25.601911	96.9	2.6	...	4.4090	19.9	13.586730	0.4692	8.7	4.128654	1.778139	59.147937	1.0	4.2
3	17097	17	97	Lake County, Illinois	654010.9262	2174576.605	701473	62.275888	99.8	6.8	...	8.9427	13.7	15.823132	0.4847	16.3	7.308201	1.954177	71.151975	18.7	6.8
4	17127	17	127	Massac County, Illinois	640398.9863	1599902.491	14219	25.626118	89.5	5.8	...	4.7481	20.8	12.370772	0.4097	7.4	4.067788	1.396443	62.281425	1.0	5.4

5 rows × 22 columns

然后我们阅读了美国的县边界文件。

shp = gpd.read_file("https://raw.github.com/Ziqi-Li/gis5122/master/data/us_counties.geojson")

#Merge the shapefile with the voting data by the common county_id
shp_voting = shp.merge(voting, on ="county_id")#Dissolve the counties to obtain boundary of states, used for mapping
state = shp_voting.dissolve(by='STATEFP').geometry.boundary

选择本练习中要使用的变量，我从列表中选择了 6 个预测因子。

variable_names = ['sex_ratio', 'pct_black', 'pct_hisp','pct_bach', 'median_income','ln_pop_den']y = shp_voting[['new_pct_dem']].valuesX = shp_voting[variable_names].values

2.OLS model (baseline)

这里我演示如何使用 spring 来拟合 OLS 模型。当然你也可以使用 statsmodels。

#In the spreg.OLS() you need to specify the y and X, also variable names (optional)ols = spreg.OLS(y, X, name_y='new_pct_dem', name_x=variable_names)

print(ols.summary)

REGRESSION RESULTS
------------------SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  : new_pct_dem                Number of Observations:        3103
Mean dependent var  :     33.7616                Number of Variables   :           7
S.D. dependent var  :     16.2257                Degrees of Freedom    :        3096
R-squared           :      0.6091
Adjusted R-squared  :      0.6083
Sum squared residual:      319249                F-statistic           :    803.9833
Sigma-square        :     103.117                Prob(F-statistic)     :           0
S.E. of regression  :      10.155                Log likelihood        :  -11592.001
Sigma-square ML     :     102.884                Akaike info criterion :   23198.003
S.E of regression ML:     10.1432                Schwarz criterion     :   23240.284------------------------------------------------------------------------------------Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------CONSTANT         5.83676         1.96534         2.96984         0.00300sex_ratio         0.00613         0.01754         0.34970         0.72659pct_black         0.48310         0.01401        34.47681         0.00000pct_hisp         0.23952         0.01329        18.02612         0.00000pct_bach         0.97537         0.02854        34.17057         0.00000median_income        -1.66008         0.19755        -8.40329         0.00000ln_pop_den         2.13283         0.12925        16.50160         0.00000
------------------------------------------------------------------------------------REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           33.073TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2        1166.636           0.0000DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test                6         268.617           0.0000
Koenker-Bassett test              6         118.745           0.0000
================================ END OF REPORT =====================================

我们可以与 statsmodels 进行比较，结果相同

查看 OLS 残差图

ols.u

array([[ 0.98102642],[-7.61122909],[ 3.44911401],...,[-3.82424613],[-0.7422498 ],[-1.11507546]])

from matplotlib import colors#For creating a discrete color classification
norm = colors.BoundaryNorm([-20, -10, -5, 0, 5, 10, 20],ncolors=256)ax = shp_voting.plot(column=ols.u.reshape(-1),legend=True,figsize=(15,8), norm=norm, linewidth=0.0)state.plot(ax=ax,linewidth=0.3,edgecolor="black")plt.title("Map of residuals of the OLS model",fontsize=15)

Text(0.5, 1.0, 'Map of residuals of the OLS model')

在这里插入图片描述

从 OLS 残差图中，我们可以看到空间自相关性很强，高/低残差聚集在一起。这强烈表明我们的模型缺少空间结构，并且违反了 OLS 的独立性假设。

然后让我们通过计算残差的 Moran’s I 来更定量地评估空间自相关性。

#Here we use the Queen contiguity
w = Queen.from_dataframe(shp_voting)#row standardization
w.transform = 'R'#The warning is saying there are two counties without neighbors, lets don't worry about this for now.

<ipython-input-14-9c7ca81e50b6>:2: FutureWarning: `use_index` defaults to False but will default to True in future. Set True/False directly to control this behavior and silence this warningw = Queen.from_dataframe(shp_voting)('WARNING: ', 2441, ' is an island (no neighbors)')
('WARNING: ', 2701, ' is an island (no neighbors)')/usr/local/lib/python3.10/dist-packages/libpysal/weights/weights.py:224: UserWarning: The weights matrix is not fully connected: There are 3 disconnected components.There are 2 islands with ids: 2441, 2701.warnings.warn(message)

#Here, lets calculate the Moran's I value, and plot it.
#ols.u is the residuals from the OLS modelols_moran = Moran(ols.u, w, permutations = 199) #199 permutationsplot_moran(ols_moran, figsize=(10,4))

在这里插入图片描述

我们发现 Moran’s I 值等于 0.6，这让我们确信 OLS 残差图上确实存在很强的空间模式。

现在有两个选择：我们可以使用 滞后模型，或者我们可以使用 误差模型。

一个便利之处在于，如果您将权重矩阵传递给 OLS 函数，同时指定 spat_diag=True，那么您将获得一些额外的空间诊断，可以帮助您做出决定。如果您指定 moran=True，这还包括残差的 Moran’s I。

ols = spreg.OLS(y, X, w=w, spat_diag=True, moran=True,name_y='pct_dem', name_x=variable_names)print(ols.summary)

REGRESSION RESULTS
------------------SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :     pct_dem                Number of Observations:        3103
Mean dependent var  :     33.7616                Number of Variables   :           7
S.D. dependent var  :     16.2257                Degrees of Freedom    :        3096
R-squared           :      0.6091
Adjusted R-squared  :      0.6083
Sum squared residual:      319249                F-statistic           :    803.9833
Sigma-square        :     103.117                Prob(F-statistic)     :           0
S.E. of regression  :      10.155                Log likelihood        :  -11592.001
Sigma-square ML     :     102.884                Akaike info criterion :   23198.003
S.E of regression ML:     10.1432                Schwarz criterion     :   23240.284------------------------------------------------------------------------------------Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------CONSTANT         5.83676         1.96534         2.96984         0.00300sex_ratio         0.00613         0.01754         0.34970         0.72659pct_black         0.48310         0.01401        34.47681         0.00000pct_hisp         0.23952         0.01329        18.02612         0.00000pct_bach         0.97537         0.02854        34.17057         0.00000median_income        -1.66008         0.19755        -8.40329         0.00000ln_pop_den         2.13283         0.12925        16.50160         0.00000
------------------------------------------------------------------------------------REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           33.073TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2        1166.636           0.0000DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test                6         268.617           0.0000
Koenker-Bassett test              6         118.745           0.0000DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST                           MI/DF       VALUE           PROB
Moran's I (error)              0.6000        56.164           0.0000
Lagrange Multiplier (lag)         1        1952.063           0.0000
Robust LM (lag)                   1          18.792           0.0000
Lagrange Multiplier (error)       1        3119.345           0.0000
Robust LM (error)                 1        1186.074           0.0000
Lagrange Multiplier (SARMA)       2        3138.137           0.0000================================ END OF REPORT =====================================

从 L-M 检验中，我们可以预期误差模型比滞后模型更为合适（比较稳健得分：1186 比 18）。

3.Spatial Error Model (SEM)

现在让我们使用“spreg.ML_Error()”拟合空间错误模型，其中您需要指定 y、X 和权重矩阵 w。

sem = spreg.ML_Error(y, X, w=w, name_x=variable_names, name_y='new_pct_dem')print(sem.summary)

/usr/local/lib/python3.10/dist-packages/scipy/optimize/_minimize.py:913: RuntimeWarning: Method 'bounded' does not support relative tolerance in x; defaulting to absolute tolerance.warn("Method 'bounded' does not support relative tolerance in x; "REGRESSION RESULTS
------------------SUMMARY OF OUTPUT: ML SPATIAL ERROR (METHOD = full)
---------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  : new_pct_dem                Number of Observations:        3103
Mean dependent var  :     33.7616                Number of Variables   :           7
S.D. dependent var  :     16.2257                Degrees of Freedom    :        3096
Pseudo R-squared    :      0.5167
Log likelihood      : -10170.5905
Sigma-square ML     :     33.4938                Akaike info criterion :   20355.181
S.E of regression   :      5.7874                Schwarz criterion     :   20397.462------------------------------------------------------------------------------------Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------CONSTANT        19.00473         1.49728        12.69280         0.00000sex_ratio        -0.04056         0.00988        -4.10353         0.00004pct_black         0.74856         0.01685        44.43261         0.00000pct_hisp         0.31866         0.01734        18.37383         0.00000pct_bach         0.72854         0.02000        36.42952         0.00000median_income        -2.42279         0.15714       -15.41832         0.00000ln_pop_den         1.83430         0.13375        13.71399         0.00000lambda         0.86984         0.00994        87.51130         0.00000
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================

空间滞后误差项的 lambda（或其他使用 rho 的软件或符号）系数非常显著，并且其幅度相当大，这表明残差中存在很强的空间自相关性，这被滞后误差项捕获。

请注意，sem 的 sem.e_filtered 属性应该是 iid 误差。而 sem.u 是自回归误差 + iid 误差。现在让我们再次查看残差的 Moran’s I。

sem.e_filtered

array([[-2.82249844],[-2.93425648],[ 1.76293602],...,[ 0.58894001],[ 4.25853052],[-4.82251595]])

sem_moran = Moran(sem.e_filtered, w, permutations = 199) #199 permutations
plot_moran(sem_moran, zstandard=True, figsize=(10,4))

在这里插入图片描述

非常低的 Moran’s I -> 随机

ax = shp_voting.plot(column=sem.e_filtered.reshape(-1),legend=True,figsize=(15,8), norm=norm, linewidth=0.0)
state.plot(ax=ax,linewidth=0.3,edgecolor="black")
plt.title("Map of filtered residuals of the SEM model",fontsize=15)

Text(0.5, 1.0, 'Map of filtered residuals of the SEM model')

在这里插入图片描述

随机模式！太棒了！

4.Spatial Lag Model

类似地，让我们将此重复到空间滞后模型

slm = spreg.ML_Lag(y, X, w=w, name_y='new_pct_dem', name_x=variable_names)print(slm.summary)

REGRESSION RESULTS
------------------SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL LAG (METHOD = FULL)
-----------------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  : new_pct_dem                Number of Observations:        3103
Mean dependent var  :     33.7616                Number of Variables   :           8
S.D. dependent var  :     16.2257                Degrees of Freedom    :        3095
Pseudo R-squared    :      0.7744
Spatial Pseudo R-squared:  0.5591
Log likelihood      : -10853.5647
Sigma-square ML     :     59.4991                Akaike info criterion :   21723.129
S.E of regression   :      7.7136                Schwarz criterion     :   21771.450------------------------------------------------------------------------------------Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------CONSTANT         1.07097         1.50321         0.71245         0.47618sex_ratio         0.01044         0.01333         0.78274         0.43378pct_black         0.27765         0.01295        21.43381         0.00000pct_hisp         0.16619         0.01060        15.68552         0.00000pct_bach         0.83467         0.02200        37.93947         0.00000median_income        -2.99112         0.15043       -19.88429         0.00000ln_pop_den         1.47061         0.10082        14.58655         0.00000W_new_pct_dem         0.58112         0.01342        43.31362         0.00000
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================

空间滞后项“W_new_pct_dem”的 rho 系数显著，且幅度很大，这表明因变量具有很强的空间溢出效应。

slm_moran = Moran(slm.u, w, permutations = 199) #199 permutations
plot_moran(slm_moran, zstandard=True, figsize=(10,4))

在这里插入图片描述

ax = shp_voting.plot(column=slm.u.reshape(-1),legend=True,figsize=(15,8), norm=norm, linewidth=0.0)state.plot(ax=ax,linewidth=0.3,edgecolor="black")
plt.title("Map of residuals of the spatial lag model",fontsize=15)

Text(0.5, 1.0, 'Map of residuals of the spatial lag model')

在这里插入图片描述

5.滞后、误差和 OLS 模型的交叉比较。

总体而言，我们看到尽管有一些变化（例如，在 SEM 模型中，%black 的影响更大），但估计值是一致的。OLS 模型不可靠，因为我们知道假设被违反了。在滞后模型中，即使我们考虑了邻近投票偏好，残差仍然显示出一些弱自相关性。而在误差模型中，我们确实观察到了随机残差。

所以如果我需要做出决定，我会使用误差模型。这也得到了 LM 测试的证据以及误差模型具有最低 AIC 值的支持。

Predictor	OLS Estimates	SLM Estimates	SEM Estimates
CONSTANT	5.83*	1.07	19.00*
sex_ratio	0.00	0.01	-0.04*
pct_black	0.48*	0.27*	0.74*
pct_hisp	0.23*	0.16*	0.31*
pct_bach	0.97*	0.83*	0.72*
median_income	-1.66*	-2.99*	-2.42*
ln_pop_den	2.13*	1.47*	1.83*
lambda	NA	0.58*	0.86*

AIC	23198.00	21723.12	20355.18
Moran’s I of residuals	0.60	0.15	-0.08

6.更多关于 SLM 模型的内容：间接影响的检查。

场景：如果莱昂的 bach 百分比增加 1%，会怎样？附近县的 dem 百分比会发生什么变化？

步骤：

使用 w.full() 获取完整的 n x n 矩阵
计算 (I-pW)^-1*beta（此处的估计值是 SLM 模型中的 bach 百分比，因此为 0.83），现在您获得了完整的 n x n 变化交互。
找到任何感兴趣的县的行索引。
现在您可以选择该县的列并检查这将如何影响其他县。

#1.
w.full()[0]

array([[0., 0., 0., ..., 0., 0., 0.],[0., 0., 0., ..., 0., 0., 0.],[0., 0., 0., ..., 0., 0., 0.],...,[0., 0., 0., ..., 0., 0., 0.],[0., 0., 0., ..., 0., 0., 0.],[0., 0., 0., ..., 0., 0., 0.]])

np.identity(3103)

array([[1., 0., 0., ..., 0., 0., 0.],[0., 1., 0., ..., 0., 0., 0.],[0., 0., 1., ..., 0., 0., 0.],...,[0., 0., 0., ..., 1., 0., 0.],[0., 0., 0., ..., 0., 1., 0.],[0., 0., 0., ..., 0., 0., 1.]])

#2.
effects = np.linalg.inv(np.identity(3103) - 0.58*w.full()[0])*0.83 #n=3103, rho=0.58, est_pct_bach = 0.83effects

array([[8.92141250e-01, 1.32877317e-03, 1.52996497e-14, ...,3.14050249e-31, 5.12287114e-06, 1.79564935e-10],[9.49123691e-04, 8.98379566e-01, 1.43986921e-13, ...,4.16193409e-29, 1.14493906e-03, 1.32924287e-10],[1.27497081e-14, 1.67984741e-13, 9.03375896e-01, ...,8.63302503e-23, 5.33348558e-13, 2.63149009e-21],...,[2.61708541e-31, 4.85558978e-29, 8.63302503e-23, ...,8.88070682e-01, 8.55288372e-27, 1.32652812e-25],[3.65919367e-06, 1.14493906e-03, 4.57155907e-13, ...,7.33104319e-27, 8.95542764e-01, 4.10759600e-10],[1.12228084e-10, 1.16308751e-10, 1.97361756e-21, ...,9.94896087e-26, 3.59414650e-10, 9.05420698e-01]])

#3. find the row index for Leon, which is 67.
shp_voting[shp_voting['NAME_x'] == "Leon"]

	GEOID	STATEFP	NAME_x	county_id	geometry	state	county	NAME_y	proj_X	proj_Y	...	median_income	pct_65_over	pct_age_18_29	gini	pct_manuf	ln_pop_den	pct_3rd_party	turn_out	pct_fb	pct_uninsured
67	12073	12	Leon	12073	POLYGON ((1080730.82089 870592.41110, 1086551....	12	73	Leon County, Florida	1.120705e+06	889449.6751	...	5.3106	12.9	30.040722	0.4896	2.0	6.023568	1.202364	71.910474	6.8	8.1
1050	48289	48	Leon	48289	POLYGON ((-30255.42040 920246.79226, -30598.62...	48	289	Leon County, Texas	3.831304e+02	913423.2253	...	4.3045	24.2	12.342525	0.5271	5.7	2.767577	0.899446	68.074417	5.2	17.0

2 rows × 26 columns

#Total effects for Leon can be obtained in the diagnoal of the full effects matrix.effects[67,67]

0.903899945968712

这表明，大学毕业生人数每增加 1%，民主党的投票份额将增加约 0.90%（直接 + 间接）。请注意，这大于滞后模型的系数（即 0.83），该系数仅捕捉直接影响。
间接影响是其自身与邻居之间的空间相互作用的结果，约为 0.07%（0.90 - 0.83）

现在让我们来看看莱昂的变化如何影响周边县市。

#get the effects for leon and plot it
shp_voting['d_pct_bach_leon'] = effects[:,67]ax = shp_voting[shp_voting['state'] == 12].plot(column='d_pct_bach_leon',legend=True,figsize=(15,8), linewidth=0.0,aspect=1)plt.title("% increase in Dem share if Leon has \n1% more college graduates",fontsize=15,y=1.08)

Text(0.5, 1.08, '% increase in Dem share if Leon has \n1% more college graduates')

在这里插入图片描述

shp_voting[(shp_voting['state'] == 12) & (shp_voting['NAME_x'] == "Jefferson")].d_pct_bach_leon

2808    0.121902
Name: d_pct_bach_leon, dtype: float64

因此，我们基本上可以看出，如果莱昂的大学毕业生人数增加 1%，预计附近县的 %dem 份额将增加约 0.12%。例如，受莱昂变化的影响，杰斐逊县的 dem 份额可能会增加 0.12%。

shp_voting[(shp_voting['state'] == 12) & (shp_voting['NAME_x'] == "Miami-Dade")].d_pct_bach_leon

2607    1.163942e-08
Name: d_pct_bach_leon, dtype: float64

然而，我们可以看到，对于迈阿密戴德等较远的县，间接溢出效应基本为零。这是因为效应的幅度 (rho) 以及指定的 W 矩阵非常局部。