PySpark - OneHotEncoder


 

from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder

spark = SparkSession.builder.getOrCreate()

data = [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0)]
columns = ["input1", "input2"]
df = spark.createDataFrame(data, columns)

encoder = OneHotEncoder(
    inputCols=["input1", "input2"],
    outputCols=["output1", "output2"]
)

encoder_model = encoder.fit(df)
encoded_df = encoder_model.transform(df)
encoded_df.show(truncate=False)

 

+------+------+-------------+-------------+
|input1|input2|output1      |output2      |
+------+------+-------------+-------------+
|0.0   |1.0   |(2,[0],[1.0])|(1,[],[])    |
|1.0   |0.0   |(2,[1],[1.0])|(1,[0],[1.0])|
|2.0   |1.0   |(2,[],[])    |(1,[],[])    |
+------+------+-------------+-------------+

 

 

OneHotEncoder output format

Each encoded column is a SparseVector, shown as:

(size, [indices], [values])

Where:

  • size = the length of the one-hot vector = number of categories minus 1
    (with the default dropLast=True, OneHotEncoder drops the last category to avoid multicollinearity)

  • indices = the position of the single 1.0 for this row's category (empty for the dropped category)

  • values = [1.0] at that position (also empty for the dropped category, which encodes as an all-zero vector)
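To make the notation concrete, here is a minimal sketch that builds the same sparse vectors by hand with pyspark.ml.linalg.Vectors (independent of the encoder) and expands them:

from pyspark.ml.linalg import Vectors

# (size, [indices], [values]) — the same notation show() prints above
v = Vectors.sparse(2, [0], [1.0])    # displayed as (2,[0],[1.0])
print(v.toArray())                   # [1. 0.]

empty = Vectors.sparse(2, [], [])    # a dropped category: all zeros
print(empty.toArray())               # [0. 0.]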


📌 Vocabulary sizes in your case

From the output:

For output1

(2,[0],[1.0]) → size = 2  

Meaning input1 has 3 categories:

0.0 → index 0
1.0 → index 1
2.0 → index dropped (all zeros)

For output2

(1,[0],[1.0]) → size = 1

Meaning input2 has 2 categories:

0.0 → index 0
1.0 → dropped (all zeros)
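As a quick check, the fitted OneHotEncoderModel exposes the detected category counts via its categorySizes attribute (covered in more detail in section B below):

print(encoder_model.categorySizes)   # [3, 2]
# input1: 3 categories → output1 length = 3 - 1 = 2 (dropLast=True)
# input2: 2 categories → output2 length = 2 - 1 = 1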

 


🔍 Now decode your rows


Row 1

input1 = 0.0
input2 = 1.0

output1 = (2,[0],[1.0])
output2 = (1,[],[])
  • input1=0 → category index 0 → one-hot at index 0
    (2,[0],[1.0])

  • input2=1 is the last category, so the encoder drops it
    → all zeros (1,[],[])


Row 2

input1 = 1.0
input2 = 0.0

output1 = (2,[1],[1.0])
output2 = (1,[0],[1.0])
  • input1=1 → index 1 → (2,[1],[1.0])

  • input2=0 → index 0 → (1,[0],[1.0])


Row 3

input1 = 2.0
input2 = 1.0

output1 = (2,[],[])
output2 = (1,[],[])
  • input1=2 → last category, dropped → (2,[],[])

  • input2=1 → last category, dropped → (1,[],[])
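If you would rather decode the rows programmatically than by eye, here is a minimal sketch (collecting this tiny example DataFrame to the driver; row order can vary across partitions):

for row in encoded_df.collect():
    print(row.input1, row.input2,
          row.output1.toArray(), row.output2.toArray())
# 0.0 1.0 [1. 0.] [0.]
# 1.0 0.0 [0. 1.] [1.]
# 2.0 1.0 [0. 0.] [0.]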


🎯 Summary

output1

  • 3 categories in input1 → output vector size = 2

  • Encoded as:

    input1=0.0 → (2,[0],[1.0])
    input1=1.0 → (2,[1],[1.0])
    input1=2.0 → (2,[],[])     # dropped category

output2

  • 2 categories in input2 → output vector size = 1

  • Encoded as:

    input2=0.0 → (1,[0],[1.0])
    input2=1.0 → (1,[],[])     # dropped category

Beyond the basics above, it is also useful to know how to:

✔ keep all categories (no drop)
✔ extract the category → vector index mapping
✔ convert SparseVector to dense vector for readability

 

Here are those three things, clearly explained with ready-to-use PySpark code:

(A) Keep all categories (no drop)
(B) Get the category → vector index mapping
(C) Convert SparseVector to DenseVector for readability


🅰 A. Keep all categories (no drop)

By default, PySpark OneHotEncoder drops the last category to avoid multicollinearity (dummy variable trap).

To keep all categories:

from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(
    inputCols=["input1", "input2"],
    outputCols=["output1", "output2"],
    dropLast=False   # ← keep every category
)

encoder_model = encoder.fit(df)
encoded_df = encoder_model.transform(df)
encoded_df.show(truncate=False)

If a column has k categories, the output vector will have k elements (instead of k-1).
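For the example DataFrame above, each category now gets its own index and no row encodes as all zeros, so the output should look like this:

+------+------+-------------+-------------+
|input1|input2|output1      |output2      |
+------+------+-------------+-------------+
|0.0   |1.0   |(3,[0],[1.0])|(2,[1],[1.0])|
|1.0   |0.0   |(3,[1],[1.0])|(2,[0],[1.0])|
|2.0   |1.0   |(3,[2],[1.0])|(2,[1],[1.0])|
+------+------+-------------+-------------+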


🅱 B. Get category → vector index mapping

After fitting:

encoder_model.categorySizes

This returns the number of categories for each input column.

Example output:

[3, 2]

Meaning:

  • input1 has 3 distinct categories → output1 vector length = 2 (or 3 if dropLast=False)

  • input2 has 2 distinct categories → output2 vector length = 1 (or 2 if dropLast=False)

You can also get full mapping:

1. Extract labels for input columns

If the input columns were originally strings (string → indexed → encoded),
you can inspect the fitted StringIndexerModel to recover the labels.

But since your inputs are already numeric category indices, Spark treats the value itself as the category index.

So the mapping is simply:

Category value → vector index

Note that this only holds if the category values are dense (0, 1, 2, ...). OneHotEncoder does not remap sparse values such as 10, 20, 50: each value is used as a vector index directly, so the fitted category size grows to the largest value plus one, with most slots unused. To get compact, contiguous indices, run the values through StringIndexer first.
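Here is a minimal sketch of that preprocessing step, assuming a hypothetical column "raw" holding the sparse values (StringIndexer and OneHotEncoder are both standard pyspark.ml.feature classes):

from pyspark.ml.feature import StringIndexer, OneHotEncoder

raw_df = spark.createDataFrame([(10.0,), (20.0,), (50.0,)], ["raw"])

# StringIndexer casts values to strings and assigns contiguous indices
indexer_model = StringIndexer(inputCol="raw", outputCol="raw_idx").fit(raw_df)
print(indexer_model.labels)   # e.g. ['10.0', '20.0', '50.0']

indexed_df = indexer_model.transform(raw_df)
encoded_df2 = (OneHotEncoder(inputCols=["raw_idx"], outputCols=["raw_vec"])
               .fit(indexed_df)
               .transform(indexed_df))
encoded_df2.show(truncate=False)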

To see mapping explicitly:

for i, size in enumerate(encoder_model.categorySizes):
    print(f"Column {i} categories: 0..{size - 1}")

Example output:

Column 0 categories: 0..2
Column 1 categories: 0..1

If you want a fully explicit mapping:

drop_last = encoder.getDropLast()

for col, size in zip(["input1", "input2"], encoder_model.categorySizes):
    print(f"Mapping for {col}:")
    for idx in range(size):
        if drop_last and idx == size - 1:
            print(f"  category {idx} → all-zero vector (dropped)")
        else:
            print(f"  category {idx} → vector index {idx}")
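With the default dropLast=True, this prints:

Mapping for input1:
  category 0 → vector index 0
  category 1 → vector index 1
  category 2 → all-zero vector (dropped)
Mapping for input2:
  category 0 → vector index 0
  category 1 → all-zero vector (dropped)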

 


🅲 C. Convert SparseVector to DenseVector

Sometimes you want to see the full vector (e.g., [0,1,0]) instead of:

(3,[1],[1.0])

Use a UDF:

from pyspark.sql import functions as F
from pyspark.ml.linalg import DenseVector, VectorUDT

# The UDF's return type must be VectorUDT, not DenseVector
to_dense = F.udf(lambda v: DenseVector(v.toArray()), VectorUDT())

df_dense = (encoded_df
            .withColumn("output1_dense", to_dense("output1"))
            .withColumn("output2_dense", to_dense("output2")))
df_dense.show(truncate=False)

 

Example output

 
+------+------+-------------+-------------+-------------+-------------+
|input1|input2|output1      |output2      |output1_dense|output2_dense|
+------+------+-------------+-------------+-------------+-------------+
|0.0   |1.0   |(2,[0],[1.0])|(1,[],[])    |[1.0,0.0]    |[0.0]        |
|1.0   |0.0   |(2,[1],[1.0])|(1,[0],[1.0])|[0.0,1.0]    |[1.0]        |
|2.0   |1.0   |(2,[],[])    |(1,[],[])    |[0.0,0.0]    |[0.0]        |
+------+------+-------------+-------------+-------------+-------------+

This makes the encoding much easier to inspect.
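As an alternative to a Python UDF, Spark 3.0+ ships a built-in pyspark.ml.functions.vector_to_array that converts ML vectors to plain arrays and avoids the Python UDF round-trip:

from pyspark.ml.functions import vector_to_array

df_arr = (encoded_df
          .withColumn("output1_arr", vector_to_array("output1"))
          .withColumn("output2_arr", vector_to_array("output2")))
df_arr.show(truncate=False)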


🎉 Summary

✔ Keep all categories

 
dropLast=False

✔ See number of categories per column

 
encoder_model.categorySizes

✔ Convert SparseVector → DenseVector

 
DenseVector(v.toArray())

 
