气泡图、桑基图的绘制

1、气泡图

使用气泡图分析某一年中国同欧洲各国之间的贸易情况。
气泡图分析的三个维度：
• 进口额：横轴
• 出口额：纵轴
• 进出口总额：气泡大小

数据来源：链接: 国家统计局数据

数据概览（进出口总额，进口总额和出口总额数据格式与总额类似）
在这里插入图片描述

（1）数据预处理

1. 知识点：

通过 .str 来调用字符串处理方法
extract 方法的作用是根据指定的正则表达式模式，从字符串中提取出符合模式的部分。它会返回一个新的 DataFrame 或者 Series
正则表达式 r’同(.*?)进出口总额’：整体意思是捕获 “同” 后面直到遇到 “进出口总额” 之前的任意字符内容。

2. 代码：

#数据预处理：填充缺失值、提取国家名称
# 提取国家名称替换 country 列
df['country'] = df['country'].str.extract(r'同(.*?)进出口总额')
# 将 input 列中的缺失值用 0 填充
df['input'] = df['input'].fillna(0)
# 将结果保存为 Excel 文件
df.to_excel('/初始数据_2023_预处理.xlsx', index=False)

3. 结果：

备注：删除 “欧洲” 这样一行数据，避免造成数据量级差别较大造成的不美观
在这里插入图片描述

（2）可视化

1. 知识点：

plt.cm.tab20：cm 是 matplotlib 中颜色映射（colormap）模块。tab20 是 matplotlib 内置的一种颜色映射表，它包含 20 种不同的颜色，这些颜色在视觉上有较好的区分度，适用于区分多个类别。
linspace 函数用于在指定的区间内生成均匀间隔的数值序列。
使用plotly.express绘制气泡图并添加悬停提示

2. 代码：

import pandas as pd
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np# 读取Excel文件数据
df = pd.read_excel('初始数据_2023_预处理.xlsx', engine='openpyxl')# 使用seaborn设置风格
sns.set_style("whitegrid")# 自定义颜色映射
colors = plt.cm.tab20(np.linspace(0, 5, len(df)))
cmap = ListedColormap(colors)# 使用plotly.express绘制气泡图并添加悬停提示
fig = px.scatter(df, x='input', y='output', size='total',color='country',color_discrete_sequence=cmap.colors,labels={'input': '进口额(万美元)', 'output': '出口额(万美元)', 'total': '进出口总额(万美元)','country': '国家/地区'},title='中国与部分欧洲国家进出口气泡图')
# 更新标记点为圆形
fig.update_traces(marker=dict(symbol='circle'))# 添加进出口平衡辅助线
fig.add_shape(type="line",x0=df['input'].min(),  # 辅助线起点x坐标为进口额最小值y0=df['input'].min(),  # 辅助线起点y坐标（与进口额相等，保证在平衡线上 ）x1=df['input'].max(),  # 辅助线终点x坐标为进口额最大值y1=df['input'].max(),  # 辅助线终点y坐标（与进口额相等，保证在平衡线上 ）line=dict(color="red",  # 辅助线颜色设为红色width=2,dash="dash"  # 辅助线样式设为虚线)
)# 为顺差区域添加注释
fig.add_annotation(xref="x",yref="y",x=df['input'].max() * 0.7,  # 注释x坐标位置y=df['input'].max() * 1.3,  # 注释y坐标位置text="顺差",  # 注释文本font=dict(size=12,color="green"  # 注释文字颜色),showarrow=False  # 不显示箭头
)# 为逆差区域添加注释
fig.add_annotation(xref="x",yref="y",x=df['input'].max() * 0.7,y=df['input'].max() * 0.4,text="逆差",font=dict(size=12,color="orange"),showarrow=False
)# 显示图形
fig.show()

3. 结果：

在这里插入图片描述

2、动态气泡图：

（1）数据预处理

1. 知识点：

melt 函数用于将数据从宽格式转换为长格式
- id_vars=[‘指标’] ：指定在重塑过程中保持不变的列，这里 ‘指标’ 列的内容会被保留。
- var_name=‘年份’ ：将原来宽格式数据中的列名（除 id_vars 列外）转换为长格式中的一列，并将该列命名为 ‘年份’ 。
- value_name=‘进出口总额’ ：将原来宽格式数据中对应的值转换为长格式中的一列，并将该列命名为 ‘进出口总额’ 。通过这一步，数据的结构变得更便于后续分析。
提取年份中的数字并转换类型
- df_total[‘年份’].str.extract(‘(\d+)’) ：使用 str.extract 方法，结合正则表达式 (\d+) 从 ‘年份’ 列的字符串中提取连续的数字部分。(\d+) 表示捕获一个或多个数字。
- .astype(int) ：将提取出的数字字符串转换为整数类型，这样 ‘年份’ 列的数据类型就变为整数，方便后续进行数值相关的操作或分析。
dropna 函数用于删除包含缺失值的行。subset=[‘国家’] 表示只检查 ‘国家’ 这一列，如果这一列存在缺失值（NaN ），则删除对应的行。这样可以保证数据集中的 ‘国家’ 列没有缺失值，使后续基于该列的分析更加可靠。

2. 代码：

import pandas as pdpd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
# 读取数据
def read_data():# 读取出口数据df_export = pd.read_csv('年度数据(1).csv', encoding='utf-8')df_export = df_export.melt(id_vars=['指标'], var_name='年份', value_name='出口额')df_export['年份'] = df_export['年份'].str.extract('(\d+)').astype(int)df_export['国家'] = df_export['指标'].str.extract('中国向(.*?)出口总额')df_export = df_export.dropna(subset=['国家'])# 读取进口数据df_import = pd.read_csv('年度数据(2).csv', encoding='utf-8')df_import = df_import.melt(id_vars=['指标'], var_name='年份', value_name='进口额')df_import['年份'] = df_import['年份'].str.extract('(\d+)').astype(int)df_import['国家'] = df_import['指标'].str.extract('中国从(.*?)进口总额')df_import = df_import.dropna(subset=['国家'])# 读取进出口总额数据df_total = pd.read_csv('年度数据.csv', encoding='utf-8')df_total = df_total.melt(id_vars=['指标'], var_name='年份', value_name='进出口总额')df_total['年份'] = df_total['年份'].str.extract('(\d+)').astype(int)df_total['国家'] = df_total['指标'].str.extract('中国同(.*?)进出口总额')df_total = df_total.dropna(subset=['国家'])# 合并数据df = pd.merge(df_export, df_import, on=['国家', '年份'])df = pd.merge(df, df_total, on=['国家', '年份'])# 计算贸易差额df['贸易差额'] = df['出口额'] - df['进口额']df['顺差/逆差'] = df['贸易差额'].apply(lambda x: '顺差' if x > 0 else '逆差')return df# 读取数据
df = read_data()# 将 DataFrame 存储为 CSV 文件
df.to_csv('中国进出口贸易数据.csv', index=False, encoding='utf-8-sig')  # utf-8-sig 支持 Excel 中文显示
print("\n数据已保存到 '中国进出口贸易数据.csv'")

3. 结果：

在这里插入图片描述

（2）可视化V1 —— 展示贸易顺差/逆差随着年份的变化

1. 实现步骤

a. 数据分组：

函数 create_bubble_chart 接受一个 DataFrame 对象 df 作为输入。
使用 groupby 方法按 ‘年份’ 和 ‘国家’ 对数据进行分组，然后使用 agg 方法对分组后的数据进行聚合操作：
‘出口额’、‘进口额’、‘进出口总额’ 和 ‘贸易差额’ 列使用 ‘sum’ 方法进行求和。
‘顺差/逆差’ 列使用 ‘first’ 方法，即取每组中的第一个值（假设每组中该值是相同的）。
最后使用 reset_index 方法重置索引，使分组的 ‘年份’ 和 ‘国家’ 变为普通列。

df_grouped = df.groupby(['年份', '国家']).agg({'出口额': 'sum','进口额': 'sum','进出口总额': 'sum','贸易差额': 'sum','顺差/逆差': 'first'}).reset_index()

b. 创建气泡图

使用 plotly.express 库的 scatter 函数创建一个散点图（气泡图）：
x 和 y 分别指定为 ‘进口额’ 和 ‘出口额’ 列。
size 指定为 ‘进出口总额’ 列，用于表示气泡的大小。
color 指定为 ‘顺差/逆差’ 列，用于根据贸易状况给气泡上色。
hover_name 指定为 ‘国家’ 列，当鼠标悬停在气泡上时显示国家名称。
animation_frame 指定为 ‘年份’ 列，使图表按年份进行动态变化。
animation_group 指定为 ‘国家’ 列，确保每个国家的数据在动画中保持一致。
size_max 设置气泡的最大大小为 60。
range_x 和 range_y 设置 x 轴和 y 轴的范围，分别为进口额和出口额最大值的 1.1 倍。
labels 字典用于自定义图表中各轴和图例的标签。
title 设置图表的标题。
color_discrete_map 字典指定了 ‘顺差’ 和 ‘逆差’ 对应的颜色。

    fig = px.scatter(df_grouped,x="进口额",y="出口额",size="进出口总额",color="顺差/逆差",hover_name="国家",animation_frame="年份",animation_group="国家",size_max=60,range_x=[0, df_grouped['进口额'].max() * 1.1],range_y=[0, df_grouped['出口额'].max() * 1.1],labels={"进口额": "进口额 (万美元)","出口额": "出口额 (万美元)","进出口总额": "进出口总额","顺差/逆差": "贸易状况"},title="中国与欧洲各国贸易情况 (2014-2023)",color_discrete_map={"顺差": "blue","逆差": "red"})

c. 添加辅助线

for frame in fig.frames:：这是一个循环，遍历 fig（即创建的气泡图对象）中的每一个 frame（帧）。因为这个气泡图是动态的，按年份作为动画帧展示数据变化，所以这里要对每一个帧都添加辅助线，以保证在动画的每一帧中都能显示平衡线。
frame.data += (…)：frame.data 表示每一帧中的数据集合，这里使用 += 操作符向每一帧的数据集合中添加一个新的 go.Scatter 对象。
go.Scatter 对象用于创建一个散点图（在这里用于创建一条线）：
x=[0, max_value] 和 y=[0, max_value]：指定了这条线的起点 (0, 0) 和终点 (max_value, max_value)，这样就形成了 y = x 的直线。
mode=‘lines’：表示这个 go.Scatter 对象的模式是绘制线。
line=dict(color=‘gray’, dash=‘dash’)：设置线的属性，颜色为灰色，样式为虚线。
name=‘平衡线 (出口=进口)’：给这条线命名为 ‘平衡线 (出口=进口)’，用于标识这条线的含义。
showlegend=False：设置这条线不显示在图例中，因为这条辅助线主要是为了视觉上的参考，不需要在图例中占据空间。

# 获取最大值的110%用于辅助线 这样可以确保所有的数据点都在辅助线所界定的区域内显示，使图表更加完整和美观
max_value = max(df_grouped['进口额'].max(), df_grouped['出口额'].max()) * 1.1# 添加辅助线 (y = x)
for frame in fig.frames:frame.data += (go.Scatter(x=[0, max_value],y=[0, max_value],mode='lines',line=dict(color='gray', dash='dash'),name='平衡线 (出口=进口)',showlegend=False

d. 更新布局

hovermode 用于设置当鼠标悬停在图形上时的交互模式。这里设置为 “closest”，表示当鼠标悬停在图表上时，会显示离鼠标位置最近的数据点的详细信息（例如国家名称、进口额、出口额等，这些信息是在创建气泡图时通过 hover_name 等参数设置的）。
updatemenus 用于在图形中添加一些交互按钮或菜单。这里创建了一个类型为 “buttons” 的 updatemenus，即添加按钮。
buttons 是一个列表，用于定义按钮的具体属性。这里列表中只有一个按钮，通过 dict 来设置按钮的属性。
label=“播放” 设置按钮的显示文本为 “播放”。
method=“animate” 表示当点击这个按钮时，执行的操作是启动动画。
args 是传递给 animate 方法的参数。[None, {“frame”: {“duration”: 1000, “redraw”: True}, “fromcurrent”: True}] 中，None 表示不指定特定的帧序列来播放动画；{“frame”: {“duration”: 1000, “redraw”: True}, “fromcurrent”: True} 表示设置动画帧的持续时间为 1000 毫秒，并且在播放动画时重新绘制图形（redraw": True），同时从当前帧开始播放动画（“fromcurrent”: True）。

	fig.update_layout(xaxis_title="进口额 (万美元)",yaxis_title="出口额 (万美元)",legend_title="贸易状况",hovermode="closest",transition={'duration': 1000},updatemenus=[dict(type="buttons",buttons=[dict(label="播放",method="animate",args=[None, {"frame": {"duration": 1000, "redraw": True}, "fromcurrent": True}]),dict(label="暂停",method="animate",args=[[None], {"frame": {"duration": 0, "redraw": False}, "mode": "immediate","transition": {"duration": 0}}])])])

e. 添加初始辅助线

fig.add_trace(…)：
fig 是 plotly 里的图形对象，代表了整个图表。add_trace 方法的作用是往图表里添加一个新的绘图轨迹（trace）。绘图轨迹可以理解成图表中的一个独立绘图元素，例如散点图、折线图等。这里添加的是一条直线。

    fig.add_trace(go.Scatter(x=[0, max_value],y=[0, max_value],mode='lines',line=dict(color='gray', dash='dash'),name='平衡线 (出口=进口)'))

3. 结果

图片：

在这里插入图片描述

视频：

动态气泡图_顺逆差

（3）可视化V1

1. 改进

体现具体国家而不仅仅是顺逆差

2. 结果

动态气泡图_彩色

3、桑基图 Sankey diagram：分析与可视化不同因素之间的关系，借助绘制桑基图来呈现这些因素之间的关联和流动情况

（1）桑基图简介

组成要素

节点：代表不同的类别或状态，通常用矩形或其他形状表示。例如，在能源流动的桑基图中，节点可以是不同的能源来源（如煤炭、石油、天然气）和能源使用部门（如工业、交通、居民生活）。
边或流线：连接节点的线条，用于表示数据的流动方向和数量。边的宽度与所代表的数据量成正比，因此可以直观地看出不同路径上数据的相对大小。

特点

可视化数据流动：能够清晰地展示数据从一个状态或类别到另一个状态或类别的流动过程，使复杂的流程和关系变得直观易懂。
定量展示：通过边的宽度准确地展示数据的数量，让观众可以直观地比较不同部分的数据大小，了解各部分在整体中所占的比例。
整体守恒：桑基图中所有流入节点的流量总和等于所有流出节点的流量总和，体现了数据在整个系统中的守恒关系，有助于分析数据在各个环节的分配和转化情况。

（2）数据集

credit_record.csv
application_record.csv:
数据说明：

（3）实现步骤

1.导入包、读取数据

import pandas as pd
import plotly.graph_objects as go# 读取数据
app = pd.read_csv("application_record.csv")
credit = pd.read_csv("credit_record.csv")

2.定义函数用于信用状态分类——优先考虑逾期情况最严重的状态

如果列表中有多种状态，首先使用 any(s in [‘1’, ‘2’, ‘3’, ‘4’, ‘5’] for s in status_list) 判断列表中是否存在逾期状态（‘1’ 到 ‘5’）。any() 函数用于判断可迭代对象中是否有任何一个元素满足条件。
如果存在逾期状态，优先考虑逾期最严重的情况，按照逾期天数从多到少的顺序进行判断，返回对应的信用状态分类描述
如果列表中不存在逾期状态，使用 all(s in [‘C’, ‘0’] for s in status_list) 判断列表中的所有状态是否都为 ‘C’（本月已还清）或 ‘0’（逾期 1 - 29 天）。
all() 函数用于判断可迭代对象中的所有元素是否都满足条件。如果是，则返回 ‘Paid off’ 表示已还清。
credit.groupby(‘ID’) 按照用户 ID 对信用记录数据 credit 进行分组。
.apply(classify_credit_status) 对每个分组应用 classify_credit_status 函数，得到每个用户的信用状态分类结果。
.reset_index() 重置索引，将结果转换为一个数据框。
credit_status.columns = [‘ID’, ‘CREDIT_STATUS’] 为数据框的列设置名称，分别为 ‘ID’ 和 ‘CREDIT_STATUS’。

f classify_credit_status(group):status_list = group['STATUS'].tolist()# 如果列表中只有一种状态，直接返回该状态对应的分类if len(set(status_list)) == 1:status = status_list[0]if status == '0':return '1-29 days due'elif status == '1':return '30-59 days due'elif status == '2':return '60-89 days due'elif status == '3':return '90-119 days due'elif status == '4':return '120-149 days due'elif status == '5':return 'Over 150 days due or bad debt'elif status == 'C':return 'Paid off this month'elif status == 'X':return 'No loan this month'# 如果列表中有多种状态，需要根据规则判断else:if any(s in ['1', '2', '3', '4', '5'] for s in status_list):# 优先考虑逾期严重的情况if '5' in status_list:return 'Over 150 days due or bad debt'elif '4' in status_list:return '120-149 days due'elif '3' in status_list:return '90-119 days due'elif '2' in status_list:return '60-89 days due'elif '1' in status_list:return '30-59 days due'elif all(s in ['C', '0'] for s in status_list):return 'Paid off'elif 'X' in status_list:return 'No loan this month'else:return 'Complex status'credit_status = credit.groupby('ID').apply(classify_credit_status).reset_index()
credit_status.columns = ['ID', 'CREDIT_STATUS']

3.合并数据

merged = pd.merge(app, credit_status, on='ID')

4.创建字段

依据 FLAG_OWN_REALTY 列创建新列 OWN_REALTY，将 Y 映射为 Yes Property，N 映射为 No Property。

merged['OWN_REALTY'] = merged['FLAG_OWN_REALTY'].map({'Y': 'Yes Property', 'N': 'No Property'})

5.选取需要的字段

df = merged[['NAME_HOUSING_TYPE', 'NAME_INCOME_TYPE', 'OWN_REALTY', 'CREDIT_STATUS']]

6.汇总数据

按照 NAME_HOUSING_TYPE、NAME_INCOME_TYPE、OWN_REALTY 和 CREDIT_STATUS 进行分组，统计每组的数量，将结果保存到 count 列。

df_grouped = df.groupby(['NAME_HOUSING_TYPE', 'NAME_INCOME_TYPE', 'OWN_REALTY', 'CREDIT_STATUS']).size().reset_index(name='count')

7.设置标签和映射

labels：将所有可能的标签（房屋类型、收入类型、是否拥有房产、信用状态）合并，去除重复项后转换为列表。
label_index：创建一个字典，将每个标签映射到一个唯一的索引。

labels = pd.concat([df_grouped['NAME_HOUSING_TYPE'], df_grouped['NAME_INCOME_TYPE'],df_grouped['OWN_REALTY'], df_grouped['CREDIT_STATUS']]).unique().tolist()
label_index = {label: i for i, label in enumerate(labels)}

8.构建链接

source、target 和 value：分别代表桑基图中链接的起始节点索引、结束节点索引和链接的值（即每组的数量）。
通过三次循环构建三组链接：房屋类型到收入类型、收入类型到是否拥有房产、是否拥有房产到信用状态。

source = []
target = []
value = []# Housing ➝ Income Type
for _, row in df_grouped.iterrows():source.append(label_index[row['NAME_HOUSING_TYPE']])target.append(label_index[row['NAME_INCOME_TYPE']])value.append(row['count'])# Income Type ➝ Own Realty
for _, row in df_grouped.iterrows():source.append(label_index[row['NAME_INCOME_TYPE']])target.append(label_index[row['OWN_REALTY']])value.append(row['count'])# Own Realty ➝ Credit Status
for _, row in df_grouped.iterrows():source.append(label_index[row['OWN_REALTY']])target.append(label_index[row['CREDIT_STATUS']])value.append(row['count'])

9.绘制桑基图

go.Sankey：创建一个桑基图对象。
node：设置节点的属性，如节点间距、厚度、线条颜色和宽度、标签等。
link：设置链接的属性，如起始节点索引、结束节点索引和链接的值。
fig.update_layout：更新图表的布局，设置标题和字体大小。
fig.show()：显示桑基图。

fig = go.Figure(data=[go.Sankey(node=dict(pad=15,thickness=20,line=dict(color="black", width=0.5),label=labels),link=dict(source=source,target=target,value=value))])fig.update_layout(title_text="House Type ➝ Income Type ➝ Owns Property ➝ Credit Status", font_size=12)
fig.show()

（4）结果

House Type → Income Type → Owns Property → Credit Status

4、桑基图2

同理根据以下代码绘制结果如图

import pandas as pd
import plotly.graph_objects as go# 读取数据
app = pd.read_csv("application_record.csv")
credit = pd.read_csv("credit_record.csv")# 一、处理收入分组
def map_income_level(income):if income < 50000:return "Very Low income"elif income < 100000:return "Low income"elif income < 150000:return "Lower - middle income"elif income < 200000:return "Middle income"elif income < 300000:return "Upper - middle income"elif income < 400000:return "Moderate - high income"elif income < 600000:return "High income"else:return "Very High income"app["INCOME_LEVEL"] = app["AMT_INCOME_TOTAL"].apply(map_income_level)# 二、处理信用记录中的月收支平衡时间段
def map_month_balance(group):earliest = group['MONTHS_BALANCE'].min()if earliest >= -6:return "Within 6 months"elif earliest >= -12:return "6 - 12 months ago"elif earliest >= -24:return "1 - 2 years ago"elif earliest >= -36:return "2 - 3 years ago"elif earliest >= -48:return "3 - 4 years ago"elif earliest >= -60:return "4 - 5 years ago"else:return "Over 5 years ago"credit_time = credit.groupby('ID').apply(map_month_balance).reset_index()
credit_time.columns = ['ID', 'MONTH_BALANCE_GROUP']# 三、合并数据
merged = pd.merge(app, credit_time, on='ID')# 四、选取需要的字段
df = merged[['CODE_GENDER', 'INCOME_LEVEL', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'MONTH_BALANCE_GROUP']]# 五、分组统计
df_grouped = df.groupby(['CODE_GENDER', 'INCOME_LEVEL', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'MONTH_BALANCE_GROUP']).size().reset_index(name='count')# 六、构建节点标签及索引映射
labels = pd.concat([df_grouped['CODE_GENDER'],df_grouped['INCOME_LEVEL'],df_grouped['NAME_EDUCATION_TYPE'],df_grouped['NAME_FAMILY_STATUS'],df_grouped['MONTH_BALANCE_GROUP']
]).unique().tolist()label_index = {label: i for i, label in enumerate(labels)}# 七、构建桑基图的 source、target、value
source, target, value = [], [], []# Gender ➝ Income Level
for _, row in df_grouped.iterrows():source.append(label_index[row['CODE_GENDER']])target.append(label_index[row['INCOME_LEVEL']])value.append(row['count'])# Income Level ➝ Education Level
for _, row in df_grouped.iterrows():source.append(label_index[row['INCOME_LEVEL']])target.append(label_index[row['NAME_EDUCATION_TYPE']])value.append(row['count'])# Education Level ➝ Marital Status
for _, row in df_grouped.iterrows():source.append(label_index[row['NAME_EDUCATION_TYPE']])target.append(label_index[row['NAME_FAMILY_STATUS']])value.append(row['count'])# Marital Status ➝ Month Balance
for _, row in df_grouped.iterrows():source.append(label_index[row['NAME_FAMILY_STATUS']])target.append(label_index[row['MONTH_BALANCE_GROUP']])value.append(row['count'])# 八、绘制桑基图
fig = go.Figure(data=[go.Sankey(node=dict(pad=15,thickness=20,line=dict(color="black", width=0.5),label=labels),link=dict(source=source,target=target,value=value))])fig.update_layout(title_text="Gender / Income Level / Education Level / Marital Status / Month Balance",font_size=12,height=700
)
fig.show()

在这里插入图片描述

5、存在的问题——信用状态的判断

最近一个月的信用状态？
逾期最严重的信用状态？

# 图2 House Type → Income Type →  Owns Property →  Credit Statusimport pandas as pd
import plotly.graph_objects as go# 读取数据
app = pd.read_csv("application_record.csv")
credit = pd.read_csv("credit_record.csv")# 信用状态分类
def classify_credit_status(group):status_list = group['STATUS'].tolist()# 如果列表中只有一种状态，直接返回该状态对应的分类if len(set(status_list)) == 1:status = status_list[0]if status == '0':return '1-29 days due'elif status == '1':return '30-59 days due'elif status == '2':return '60-89 days due'elif status == '3':return '90-119 days due'elif status == '4':return '120-149 days due'elif status == '5':return 'Over 150 days due or bad debt'elif status == 'C':return 'Paid off this month'elif status == 'X':return 'No loan this month'# 如果列表中有多种状态，需要根据规则判断else:if any(s in ['1', '2', '3', '4', '5'] for s in status_list):# 优先考虑逾期严重的情况if '5' in status_list:return 'Over 150 days due or bad debt'elif '4' in status_list:return '120-149 days due'elif '3' in status_list:return '90-119 days due'elif '2' in status_list:return '60-89 days due'elif '1' in status_list:return '30-59 days due'elif all(s in ['C', '0'] for s in status_list):return 'Paid off'elif 'X' in status_list:return 'No loan this month'else:return 'Complex status'credit_status = credit.groupby('ID').apply(classify_credit_status).reset_index()
credit_status.columns = ['ID', 'CREDIT_STATUS']# 合并
merged = pd.merge(app, credit_status, on='ID')# 创建字段
merged['OWN_REALTY'] = merged['FLAG_OWN_REALTY'].map({'Y': 'Yes Property', 'N': 'No Property'})df = merged[['NAME_HOUSING_TYPE', 'NAME_INCOME_TYPE', 'OWN_REALTY', 'CREDIT_STATUS']]# 汇总
df_grouped = df.groupby(['NAME_HOUSING_TYPE', 'NAME_INCOME_TYPE', 'OWN_REALTY', 'CREDIT_STATUS']).size().reset_index(name='count')# 标签和映射
labels = pd.concat([df_grouped['NAME_HOUSING_TYPE'], df_grouped['NAME_INCOME_TYPE'],df_grouped['OWN_REALTY'], df_grouped['CREDIT_STATUS']]).unique().tolist()
label_index = {label: i for i, label in enumerate(labels)}# 构建链接
source = []
target = []
value = []# Housing ➝ Income Type
for _, row in df_grouped.iterrows():source.append(label_index[row['NAME_HOUSING_TYPE']])target.append(label_index[row['NAME_INCOME_TYPE']])value.append(row['count'])# Income Type ➝ Own Realty
for _, row in df_grouped.iterrows():source.append(label_index[row['NAME_INCOME_TYPE']])target.append(label_index[row['OWN_REALTY']])value.append(row['count'])# Own Realty ➝ Credit Status
for _, row in df_grouped.iterrows():source.append(label_index[row['OWN_REALTY']])target.append(label_index[row['CREDIT_STATUS']])value.append(row['count'])# 绘制桑基图
fig = go.Figure(data=[go.Sankey(node=dict(pad=15,thickness=20,line=dict(color="black", width=0.5),label=labels),link=dict(source=source,target=target,value=value))])fig.update_layout(title_text="House Type ➝ Income Type ➝ Owns Property ➝ Credit Status", font_size=12)
fig.show()

所有的信用状态全部考虑的数据流向和流量是否准确（因为会把重复的用户计入）

import pandas as pd
import plotly.graph_objects as go# 读取数据
app = pd.read_csv("application_record.csv")
credit = pd.read_csv("credit_record.csv")# 按 ID 合并数据集
merged = pd.merge(app, credit, on='ID')# 创建新字段 OWN_REALTY
merged['OWN_REALTY'] = merged['FLAG_OWN_REALTY'].map({'Y': 'Yes Property', 'N': 'No Property'})# 定义状态映射字典
status_mapping = {'0': '1-29 days due','1': '30-59 days due','2': '60-89 days due','3': '90-119 days due','4': '120-149 days due','5': 'Over 150 days due or bad debt','C': 'Paid off this month','X': 'No loan this month'
}# 将 STATUS 列映射为描述性状态
merged['CREDIT_STATUS'] = merged['STATUS'].map(status_mapping)# 选择用于绘制桑基图的列
columns = ['NAME_HOUSING_TYPE', 'NAME_INCOME_TYPE', 'OWN_REALTY', 'CREDIT_STATUS']# 分组统计每个组合的数量
df_grouped = merged.groupby(columns).size().reset_index(name='count')# 生成所有唯一的标签
labels = pd.concat([df_grouped[col] for col in columns]).unique().tolist()
# 为每个标签分配一个索引
label_index = {label: i for i, label in enumerate(labels)}# 初始化存储链接信息的列表
source = []
target = []
value = []# 构建所有可能的链接
for i in range(len(columns) - 1):for _, row in df_grouped.iterrows():source.append(label_index[row[columns[i]]])target.append(label_index[row[columns[i + 1]]])value.append(row['count'])# 创建桑基图对象
fig = go.Figure(data=[go.Sankey(node=dict(pad=15,thickness=20,line=dict(color="black", width=0.5),label=labels),link=dict(source=source,target=target,value=value))])# 更新图表布局，设置标题和字体大小
fig.update_layout(title_text="House Type ➝ Income Type ➝ Owns Property ➝ Credit Status", font_size=12)
# 显示桑基图
fig.show()