Applied Spatial Statistics(七):Python 中的空间回归
本笔记本演示了如何使用 pysal 的 spreg 库拟合空间滞后模型和空间误差模型。
- OLS
- 空间误差模型
- 空间滞后模型
- 三种模型的比较
- 探索滞后模型中的直接和间接影响
import numpy as np
import pandas as pdimport geopandas as gpd
import seaborn as sns
import matplotlib.pyplot as plt
from libpysal.weights import Queen
from splot.esda import plot_moran
from esda.moran import Moran
import spreg
1.数据
在此笔记本中,我将使用 2020 年美国总统选举数据集进行演示。
 voting 数据框包含县级投票给民主党的人数百分比(编码为 new_pct_dem)以及该县的一些社会经济变量。该数据集仅包含美国本土 48 个州的统计数据。
voting = pd.read_csv('https://raw.github.com/Ziqi-Li/gis5122/master/data/voting_2020.csv')voting[['median_income']] = voting[['median_income']]/10000
voting.head()
| county_id | state | county | NAME | proj_X | proj_Y | total_pop | new_pct_dem | sex_ratio | pct_black | ... | median_income | pct_65_over | pct_age_18_29 | gini | pct_manuf | ln_pop_den | pct_3rd_party | turn_out | pct_fb | pct_uninsured | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17051 | 17 | 51 | Fayette County, Illinois | 597979.5531 | 1796861.993 | 21565 | 18.445122 | 113.6 | 4.7 | ... | 4.6650 | 18.8 | 14.899142 | 0.4373 | 14.9 | 3.392715 | 1.923652 | 58.930984 | 1.3 | 8.2 | 
| 1 | 17107 | 17 | 107 | Logan County, Illinois | 559814.6766 | 1920479.975 | 29003 | 29.420030 | 97.2 | 6.9 | ... | 5.7308 | 18.0 | 17.256836 | 0.4201 | 12.4 | 3.847224 | 2.332850 | 56.631552 | 1.6 | 4.5 | 
| 2 | 17165 | 17 | 165 | Saline County, Illinois | 650278.3579 | 1660709.808 | 23994 | 25.601911 | 96.9 | 2.6 | ... | 4.4090 | 19.9 | 13.586730 | 0.4692 | 8.7 | 4.128654 | 1.778139 | 59.147937 | 1.0 | 4.2 | 
| 3 | 17097 | 17 | 97 | Lake County, Illinois | 654010.9262 | 2174576.605 | 701473 | 62.275888 | 99.8 | 6.8 | ... | 8.9427 | 13.7 | 15.823132 | 0.4847 | 16.3 | 7.308201 | 1.954177 | 71.151975 | 18.7 | 6.8 | 
| 4 | 17127 | 17 | 127 | Massac County, Illinois | 640398.9863 | 1599902.491 | 14219 | 25.626118 | 89.5 | 5.8 | ... | 4.7481 | 20.8 | 12.370772 | 0.4097 | 7.4 | 4.067788 | 1.396443 | 62.281425 | 1.0 | 5.4 | 
5 rows × 22 columns
然后我们阅读了美国的县边界文件。
shp = gpd.read_file("https://raw.github.com/Ziqi-Li/gis5122/master/data/us_counties.geojson")
#Merge the shapefile with the voting data by the common county_id
shp_voting = shp.merge(voting, on ="county_id")#Dissolve the counties to obtain boundary of states, used for mapping
state = shp_voting.dissolve(by='STATEFP').geometry.boundary
选择本练习中要使用的变量,我从列表中选择了 6 个预测因子。
variable_names = ['sex_ratio', 'pct_black', 'pct_hisp','pct_bach', 'median_income','ln_pop_den']y = shp_voting[['new_pct_dem']].valuesX = shp_voting[variable_names].values
2.OLS model (baseline)
这里我演示如何使用 spring 来拟合 OLS 模型。当然你也可以使用 statsmodels。
#In the spreg.OLS() you need to specify the y and X, also variable names (optional)ols = spreg.OLS(y, X, name_y='new_pct_dem', name_x=variable_names)
print(ols.summary)
REGRESSION RESULTS
------------------SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  : new_pct_dem                Number of Observations:        3103
Mean dependent var  :     33.7616                Number of Variables   :           7
S.D. dependent var  :     16.2257                Degrees of Freedom    :        3096
R-squared           :      0.6091
Adjusted R-squared  :      0.6083
Sum squared residual:      319249                F-statistic           :    803.9833
Sigma-square        :     103.117                Prob(F-statistic)     :           0
S.E. of regression  :      10.155                Log likelihood        :  -11592.001
Sigma-square ML     :     102.884                Akaike info criterion :   23198.003
S.E of regression ML:     10.1432                Schwarz criterion     :   23240.284------------------------------------------------------------------------------------Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------CONSTANT         5.83676         1.96534         2.96984         0.00300sex_ratio         0.00613         0.01754         0.34970         0.72659pct_black         0.48310         0.01401        34.47681         0.00000pct_hisp         0.23952         0.01329        18.02612         0.00000pct_bach         0.97537         0.02854        34.17057         0.00000median_income        -1.66008         0.19755        -8.40329         0.00000ln_pop_den         2.13283         0.12925        16.50160         0.00000
------------------------------------------------------------------------------------REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           33.073TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2        1166.636           0.0000DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test                6         268.617           0.0000
Koenker-Bassett test              6         118.745           0.0000
================================ END OF REPORT =====================================
我们可以与 statsmodels 进行比较,结果相同
查看 OLS 残差图
ols.u
array([[ 0.98102642],[-7.61122909],[ 3.44911401],...,[-3.82424613],[-0.7422498 ],[-1.11507546]])
from matplotlib import colors#For creating a discrete color classification
norm = colors.BoundaryNorm([-20, -10, -5, 0, 5, 10, 20],ncolors=256)ax = shp_voting.plot(column=ols.u.reshape(-1),legend=True,figsize=(15,8), norm=norm, linewidth=0.0)state.plot(ax=ax,linewidth=0.3,edgecolor="black")plt.title("Map of residuals of the OLS model",fontsize=15)
Text(0.5, 1.0, 'Map of residuals of the OLS model')

从 OLS 残差图中,我们可以看到空间自相关性很强,高/低残差聚集在一起。这强烈表明我们的模型缺少空间结构,并且违反了 OLS 的独立性假设。
然后让我们通过计算残差的 Moran’s I 来更定量地评估空间自相关性。
#Here we use the Queen contiguity
w = Queen.from_dataframe(shp_voting)#row standardization
w.transform = 'R'#The warning is saying there are two counties without neighbors, lets don't worry about this for now.
<ipython-input-14-9c7ca81e50b6>:2: FutureWarning: `use_index` defaults to False but will default to True in future. Set True/False directly to control this behavior and silence this warningw = Queen.from_dataframe(shp_voting)('WARNING: ', 2441, ' is an island (no neighbors)')
('WARNING: ', 2701, ' is an island (no neighbors)')/usr/local/lib/python3.10/dist-packages/libpysal/weights/weights.py:224: UserWarning: The weights matrix is not fully connected: There are 3 disconnected components.There are 2 islands with ids: 2441, 2701.warnings.warn(message)
#Here, lets calculate the Moran's I value, and plot it.
#ols.u is the residuals from the OLS modelols_moran = Moran(ols.u, w, permutations = 199) #199 permutationsplot_moran(ols_moran, figsize=(10,4))

我们发现 Moran’s I 值等于 0.6,这让我们确信 OLS 残差图上确实存在很强的空间模式。
现在有两个选择:我们可以使用 滞后模型,或者我们可以使用 误差模型。
一个便利之处在于,如果您将权重矩阵传递给 OLS 函数,同时指定 spat_diag=True,那么您将获得一些额外的空间诊断,可以帮助您做出决定。如果您指定 moran=True,这还包括残差的 Moran’s I。
ols = spreg.OLS(y, X, w=w, spat_diag=True, moran=True,name_y='pct_dem', name_x=variable_names)print(ols.summary)
REGRESSION RESULTS
------------------SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :     pct_dem                Number of Observations:        3103
Mean dependent var  :     33.7616                Number of Variables   :           7
S.D. dependent var  :     16.2257                Degrees of Freedom    :        3096
R-squared           :      0.6091
Adjusted R-squared  :      0.6083
Sum squared residual:      319249                F-statistic           :    803.9833
Sigma-square        :     103.117                Prob(F-statistic)     :           0
S.E. of regression  :      10.155                Log likelihood        :  -11592.001
Sigma-square ML     :     102.884                Akaike info criterion :   23198.003
S.E of regression ML:     10.1432                Schwarz criterion     :   23240.284------------------------------------------------------------------------------------Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------CONSTANT         5.83676         1.96534         2.96984         0.00300sex_ratio         0.00613         0.01754         0.34970         0.72659pct_black         0.48310         0.01401        34.47681         0.00000pct_hisp         0.23952         0.01329        18.02612         0.00000pct_bach         0.97537         0.02854        34.17057         0.00000median_income        -1.66008         0.19755        -8.40329         0.00000ln_pop_den         2.13283         0.12925        16.50160         0.00000
------------------------------------------------------------------------------------REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           33.073TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2        1166.636           0.0000DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test                6         268.617           0.0000
Koenker-Bassett test              6         118.745           0.0000DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST                           MI/DF       VALUE           PROB
Moran's I (error)              0.6000        56.164           0.0000
Lagrange Multiplier (lag)         1        1952.063           0.0000
Robust LM (lag)                   1          18.792           0.0000
Lagrange Multiplier (error)       1        3119.345           0.0000
Robust LM (error)                 1        1186.074           0.0000
Lagrange Multiplier (SARMA)       2        3138.137           0.0000================================ END OF REPORT =====================================
从 L-M 检验中,我们可以预期误差模型比滞后模型更为合适(比较稳健得分:1186 比 18)。
3.Spatial Error Model (SEM)
现在让我们使用“spreg.ML_Error()”拟合空间错误模型,其中您需要指定 y、X 和权重矩阵 w。
sem = spreg.ML_Error(y, X, w=w, name_x=variable_names, name_y='new_pct_dem')print(sem.summary)
/usr/local/lib/python3.10/dist-packages/scipy/optimize/_minimize.py:913: RuntimeWarning: Method 'bounded' does not support relative tolerance in x; defaulting to absolute tolerance.warn("Method 'bounded' does not support relative tolerance in x; "REGRESSION RESULTS
------------------SUMMARY OF OUTPUT: ML SPATIAL ERROR (METHOD = full)
---------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  : new_pct_dem                Number of Observations:        3103
Mean dependent var  :     33.7616                Number of Variables   :           7
S.D. dependent var  :     16.2257                Degrees of Freedom    :        3096
Pseudo R-squared    :      0.5167
Log likelihood      : -10170.5905
Sigma-square ML     :     33.4938                Akaike info criterion :   20355.181
S.E of regression   :      5.7874                Schwarz criterion     :   20397.462------------------------------------------------------------------------------------Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------CONSTANT        19.00473         1.49728        12.69280         0.00000sex_ratio        -0.04056         0.00988        -4.10353         0.00004pct_black         0.74856         0.01685        44.43261         0.00000pct_hisp         0.31866         0.01734        18.37383         0.00000pct_bach         0.72854         0.02000        36.42952         0.00000median_income        -2.42279         0.15714       -15.41832         0.00000ln_pop_den         1.83430         0.13375        13.71399         0.00000lambda         0.86984         0.00994        87.51130         0.00000
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================
空间滞后误差项的 lambda(或其他使用 rho 的软件或符号)系数非常显著,并且其幅度相当大,这表明残差中存在很强的空间自相关性,这被滞后误差项捕获。
请注意,sem 的 sem.e_filtered 属性应该是 iid 误差。而 sem.u 是自回归误差 + iid 误差。现在让我们再次查看残差的 Moran’s I。
sem.e_filtered
array([[-2.82249844],[-2.93425648],[ 1.76293602],...,[ 0.58894001],[ 4.25853052],[-4.82251595]])
sem_moran = Moran(sem.e_filtered, w, permutations = 199) #199 permutations
plot_moran(sem_moran, zstandard=True, figsize=(10,4))

非常低的 Moran’s I -> 随机
ax = shp_voting.plot(column=sem.e_filtered.reshape(-1),legend=True,figsize=(15,8), norm=norm, linewidth=0.0)
state.plot(ax=ax,linewidth=0.3,edgecolor="black")
plt.title("Map of filtered residuals of the SEM model",fontsize=15)Text(0.5, 1.0, 'Map of filtered residuals of the SEM model')

随机模式!太棒了!
4.Spatial Lag Model
类似地,让我们将此重复到空间滞后模型
slm = spreg.ML_Lag(y, X, w=w, name_y='new_pct_dem', name_x=variable_names)print(slm.summary)
REGRESSION RESULTS
------------------SUMMARY OF OUTPUT: MAXIMUM LIKELIHOOD SPATIAL LAG (METHOD = FULL)
-----------------------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  : new_pct_dem                Number of Observations:        3103
Mean dependent var  :     33.7616                Number of Variables   :           8
S.D. dependent var  :     16.2257                Degrees of Freedom    :        3095
Pseudo R-squared    :      0.7744
Spatial Pseudo R-squared:  0.5591
Log likelihood      : -10853.5647
Sigma-square ML     :     59.4991                Akaike info criterion :   21723.129
S.E of regression   :      7.7136                Schwarz criterion     :   21771.450------------------------------------------------------------------------------------Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------CONSTANT         1.07097         1.50321         0.71245         0.47618sex_ratio         0.01044         0.01333         0.78274         0.43378pct_black         0.27765         0.01295        21.43381         0.00000pct_hisp         0.16619         0.01060        15.68552         0.00000pct_bach         0.83467         0.02200        37.93947         0.00000median_income        -2.99112         0.15043       -19.88429         0.00000ln_pop_den         1.47061         0.10082        14.58655         0.00000W_new_pct_dem         0.58112         0.01342        43.31362         0.00000
------------------------------------------------------------------------------------
================================ END OF REPORT =====================================
空间滞后项“W_new_pct_dem”的 rho 系数显著,且幅度很大,这表明因变量具有很强的空间溢出效应。
slm_moran = Moran(slm.u, w, permutations = 199) #199 permutations
plot_moran(slm_moran, zstandard=True, figsize=(10,4))

ax = shp_voting.plot(column=slm.u.reshape(-1),legend=True,figsize=(15,8), norm=norm, linewidth=0.0)state.plot(ax=ax,linewidth=0.3,edgecolor="black")
plt.title("Map of residuals of the spatial lag model",fontsize=15)Text(0.5, 1.0, 'Map of residuals of the spatial lag model')

5.滞后、误差和 OLS 模型的交叉比较。
总体而言,我们看到尽管有一些变化(例如,在 SEM 模型中,%black 的影响更大),但估计值是一致的。OLS 模型不可靠,因为我们知道假设被违反了。在滞后模型中,即使我们考虑了邻近投票偏好,残差仍然显示出一些弱自相关性。而在误差模型中,我们确实观察到了随机残差。
所以如果我需要做出决定,我会使用误差模型。这也得到了 LM 测试的证据以及误差模型具有最低 AIC 值的支持。
| Predictor | OLS Estimates | SLM Estimates | SEM Estimates | 
|---|---|---|---|
| CONSTANT | 5.83* | 1.07 | 19.00* | 
| sex_ratio | 0.00 | 0.01 | -0.04* | 
| pct_black | 0.48* | 0.27* | 0.74* | 
| pct_hisp | 0.23* | 0.16* | 0.31* | 
| pct_bach | 0.97* | 0.83* | 0.72* | 
| median_income | -1.66* | -2.99* | -2.42* | 
| ln_pop_den | 2.13* | 1.47* | 1.83* | 
| lambda | NA | 0.58* | 0.86* | 
| AIC | 23198.00 | 21723.12 | 20355.18 | 
| Moran’s I of residuals | 0.60 | 0.15 | -0.08 | 
6.更多关于 SLM 模型的内容:间接影响的检查。
场景:如果莱昂的 bach 百分比增加 1%,会怎样?附近县的 dem 百分比会发生什么变化?
步骤:
- 使用 w.full() 获取完整的 n x n 矩阵
- 计算 (I-pW)^-1*beta(此处的估计值是 SLM 模型中的 bach 百分比,因此为 0.83),现在您获得了完整的 n x n 变化交互。
- 找到任何感兴趣的县的行索引。
- 现在您可以选择该县的列并检查这将如何影响其他县。
#1.
w.full()[0]
array([[0., 0., 0., ..., 0., 0., 0.],[0., 0., 0., ..., 0., 0., 0.],[0., 0., 0., ..., 0., 0., 0.],...,[0., 0., 0., ..., 0., 0., 0.],[0., 0., 0., ..., 0., 0., 0.],[0., 0., 0., ..., 0., 0., 0.]])
np.identity(3103)
array([[1., 0., 0., ..., 0., 0., 0.],[0., 1., 0., ..., 0., 0., 0.],[0., 0., 1., ..., 0., 0., 0.],...,[0., 0., 0., ..., 1., 0., 0.],[0., 0., 0., ..., 0., 1., 0.],[0., 0., 0., ..., 0., 0., 1.]])
#2.
effects = np.linalg.inv(np.identity(3103) - 0.58*w.full()[0])*0.83 #n=3103, rho=0.58, est_pct_bach = 0.83effects
array([[8.92141250e-01, 1.32877317e-03, 1.52996497e-14, ...,3.14050249e-31, 5.12287114e-06, 1.79564935e-10],[9.49123691e-04, 8.98379566e-01, 1.43986921e-13, ...,4.16193409e-29, 1.14493906e-03, 1.32924287e-10],[1.27497081e-14, 1.67984741e-13, 9.03375896e-01, ...,8.63302503e-23, 5.33348558e-13, 2.63149009e-21],...,[2.61708541e-31, 4.85558978e-29, 8.63302503e-23, ...,8.88070682e-01, 8.55288372e-27, 1.32652812e-25],[3.65919367e-06, 1.14493906e-03, 4.57155907e-13, ...,7.33104319e-27, 8.95542764e-01, 4.10759600e-10],[1.12228084e-10, 1.16308751e-10, 1.97361756e-21, ...,9.94896087e-26, 3.59414650e-10, 9.05420698e-01]])
#3. find the row index for Leon, which is 67.
shp_voting[shp_voting['NAME_x'] == "Leon"]
| GEOID | STATEFP | NAME_x | county_id | geometry | state | county | NAME_y | proj_X | proj_Y | ... | median_income | pct_65_over | pct_age_18_29 | gini | pct_manuf | ln_pop_den | pct_3rd_party | turn_out | pct_fb | pct_uninsured | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 67 | 12073 | 12 | Leon | 12073 | POLYGON ((1080730.82089 870592.41110, 1086551.... | 12 | 73 | Leon County, Florida | 1.120705e+06 | 889449.6751 | ... | 5.3106 | 12.9 | 30.040722 | 0.4896 | 2.0 | 6.023568 | 1.202364 | 71.910474 | 6.8 | 8.1 | 
| 1050 | 48289 | 48 | Leon | 48289 | POLYGON ((-30255.42040 920246.79226, -30598.62... | 48 | 289 | Leon County, Texas | 3.831304e+02 | 913423.2253 | ... | 4.3045 | 24.2 | 12.342525 | 0.5271 | 5.7 | 2.767577 | 0.899446 | 68.074417 | 5.2 | 17.0 | 
2 rows × 26 columns
#Total effects for Leon can be obtained in the diagnoal of the full effects matrix.effects[67,67]
0.903899945968712
这表明,大学毕业生人数每增加 1%,民主党的投票份额将增加约 0.90%(直接 + 间接)。请注意,这大于滞后模型的系数(即 0.83),该系数仅捕捉直接影响。
 间接影响是其自身与邻居之间的空间相互作用的结果,约为 0.07%(0.90 - 0.83)
现在让我们来看看莱昂的变化如何影响周边县市。
#get the effects for leon and plot it
shp_voting['d_pct_bach_leon'] = effects[:,67]ax = shp_voting[shp_voting['state'] == 12].plot(column='d_pct_bach_leon',legend=True,figsize=(15,8), linewidth=0.0,aspect=1)plt.title("% increase in Dem share if Leon has \n1% more college graduates",fontsize=15,y=1.08)
Text(0.5, 1.08, '% increase in Dem share if Leon has \n1% more college graduates')

shp_voting[(shp_voting['state'] == 12) & (shp_voting['NAME_x'] == "Jefferson")].d_pct_bach_leon
2808    0.121902
Name: d_pct_bach_leon, dtype: float64
因此,我们基本上可以看出,如果莱昂的大学毕业生人数增加 1%,预计附近县的 %dem 份额将增加约 0.12%。例如,受莱昂变化的影响,杰斐逊县的 dem 份额可能会增加 0.12%。
shp_voting[(shp_voting['state'] == 12) & (shp_voting['NAME_x'] == "Miami-Dade")].d_pct_bach_leon
2607    1.163942e-08
Name: d_pct_bach_leon, dtype: float64
然而,我们可以看到,对于迈阿密戴德等较远的县,间接溢出效应基本为零。这是因为效应的幅度 (rho) 以及指定的 W 矩阵非常局部。