```python
from pyspark.ml.feature import OneHotEncoder

data = [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0)]
columns = ["input1", "input2"]
df = spark.createDataFrame(data, columns)

encoder = OneHotEncoder(inputCols=["input1", "input2"],
                        outputCols=["output1", "output2"])

encoder_model = encoder.fit(df)
encoded_df = encoder_model.transform(df)
encoded_df.show(truncate=False)
```
```
+------+------+-------------+-------------+
|input1|input2|output1      |output2      |
+------+------+-------------+-------------+
|0.0   |1.0   |(2,[0],[1.0])|(1,[],[])    |
|1.0   |0.0   |(2,[1],[1.0])|(1,[0],[1.0])|
|2.0   |1.0   |(2,[],[])    |(1,[],[])    |
+------+------+-------------+-------------+
```
OneHotEncoder output format
Each encoded column is a SparseVector, shown as (size, [indices], [values]):

- size = number of categories minus 1 (with the default dropLast=True, OneHotEncoder drops the last category to avoid multicollinearity)
- indices = the position of this row's category in the vector
- values = always [1.0] for the single active category
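The dropLast behavior described above can be sketched in plain Python (a hypothetical `one_hot` helper for illustration, not part of PySpark):

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index, mimicking OneHotEncoder."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:              # the last category becomes the all-zeros vector
        vec[index] = 1.0
    return vec

print(one_hot(0, 3))  # [1.0, 0.0]  i.e. (2,[0],[1.0])
print(one_hot(2, 3))  # [0.0, 0.0]  i.e. (2,[],[])
```

With drop_last=False the vector keeps one slot per category, so no row maps to all zeros.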
📌 Vocabulary sizes in your case

From the output:

- output1 has size 2 → input1 has 3 categories (0, 1, 2)
- output2 has size 1 → input2 has 2 categories (0, 1)
🔍 Now decode your rows

▶ Row 1

- input1 = 0 → category index 0 → one-hot at index 0 → (2,[0],[1.0])
- input2 = 1 is the last category, so the encoder drops it → all zeros: (1,[],[])
▶ Row 2

- input1 = 1 → index 1 → (2,[1],[1.0])
- input2 = 0 → index 0 → (1,[0],[1.0])
▶ Row 3

- input1 = 2 → last category, dropped → (2,[],[])
- input2 = 1 → last category, dropped → (1,[],[])
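Going the other way, the row decodings above can be inverted with a small helper (hypothetical, assuming the default dropLast=True, under which the all-zeros vector always means the dropped last category):

```python
def decode_one_hot(size, indices):
    """Recover the original category index from a (size, [indices], [values]) triple."""
    if not indices:            # all-zeros vector -> the dropped last category
        return float(size)
    return float(indices[0])   # otherwise, the single active index is the category

print(decode_one_hot(2, [0]))  # 0.0  <- (2,[0],[1.0])
print(decode_one_hot(2, []))   # 2.0  <- (2,[],[])
print(decode_one_hot(1, []))   # 1.0  <- (1,[],[])
```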
🎯 Summary

output1

- 3 categories in input1 → output vector size = 2
- Encoded as: 0 → (2,[0],[1.0]), 1 → (2,[1],[1.0]), 2 → (2,[],[])

output2

- 2 categories in input2 → output vector size = 1
- Encoded as: 0 → (1,[0],[1.0]), 1 → (1,[],[])