PySpark - OneHotEncoder


 

from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder

spark = SparkSession.builder.getOrCreate()

data = [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0)]
columns = ["input1", "input2"]
df = spark.createDataFrame(data, columns)

encoder = OneHotEncoder(
    inputCols=["input1", "input2"],
    outputCols=["output1", "output2"]
)

encoder_model = encoder.fit(df)
encoded_df = encoder_model.transform(df)
encoded_df.show(truncate=False)

 

+------+------+-------------+-------------+
|input1|input2|output1      |output2      |
+------+------+-------------+-------------+
|0.0   |1.0   |(2,[0],[1.0])|(1,[],[])    |
|1.0   |0.0   |(2,[1],[1.0])|(1,[0],[1.0])|
|2.0   |1.0   |(2,[],[])    |(1,[],[])    |
+------+------+-------------+-------------+

 

 

OneHotEncoder output format

Each encoded column is a SparseVector, shown as:

(size, [indices], [values])

Where:

  • size = the length of the one-hot vector = number of categories minus 1
    (with the default dropLast=True, OneHotEncoder drops the last category to avoid multicollinearity)

  • indices = the position of the single 1.0 for this row's category (empty for the dropped category)

  • values = [1.0] at that position (also empty for the dropped category, which encodes as an all-zero vector)
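To make the notation concrete, here is a minimal sketch that builds the same sparse vectors by hand with pyspark.ml.linalg.Vectors (independent of the encoder) and expands them:

from pyspark.ml.linalg import Vectors

# (size, [indices], [values]) — the same notation show() prints above
v = Vectors.sparse(2, [0], [1.0])    # displayed as (2,[0],[1.0])
print(v.toArray())                   # [1. 0.]

empty = Vectors.sparse(2, [], [])    # a dropped category: all zeros
print(empty.toArray())               # [0. 0.]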


📌 Vocabulary sizes in your case

From the output:

For output1

(2,[0],[1.0]) → size = 2  

Meaning input1 has 3 categories:

0.0 → index 0
1.0 → index 1
2.0 → index dropped (all zeros)

For output2

(1,[0],[1.0]) → size = 1

Meaning input2 has 2 categories:

0.0 → index 0
1.0 → dropped (all zeros)
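As a quick check, the fitted OneHotEncoderModel exposes the detected category counts via its categorySizes attribute (covered in more detail in section B below):

print(encoder_model.categorySizes)   # [3, 2]
# input1: 3 categories → output1 length = 3 - 1 = 2 (dropLast=True)
# input2: 2 categories → output2 length = 2 - 1 = 1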

 


🔍 Now decode your rows


Row 1

input1 = 0.0
input2 = 1.0

output1 = (2,[0],[1.0])
output2 = (1,[],[])
  • input1=0 → category index 0 → one-hot at index 0
    (2,[0],[1.0])

  • input2=1 is the last category, so the encoder drops it
    → all zeros (1,[],[])


Row 2

input1 = 1.0
input2 = 0.0

output1 = (2,[1],[1.0])
output2 = (1,[0],[1.0])
  • input1=1 → index 1 → (2,[1],[1.0])

  • input2=0 → index 0 → (1,[0],[1.0])


Row 3

input1 = 2.0
input2 = 1.0

output1 = (2,[],[])
output2 = (1,[],[])
  • input1=2 → last category, dropped → (2,[],[])

  • input2=1 → last category, dropped → (1,[],[])
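If you would rather decode the rows programmatically than by eye, here is a minimal sketch (collecting this tiny example DataFrame to the driver; row order can vary across partitions):

for row in encoded_df.collect():
    print(row.input1, row.input2,
          row.output1.toArray(), row.output2.toArray())
# 0.0 1.0 [1. 0.] [0.]
# 1.0 0.0 [0. 1.] [1.]
# 2.0 1.0 [0. 0.] [0.]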


🎯 Summary

output1

  • 3 categories in input1 → output vector size = 2

  • Encoded as:

    input1=0.0 → (2,[0],[1.0])
    input1=1.0 → (2,[1],[1.0])
    input1=2.0 → (2,[],[])     # dropped category

output2

  • 2 categories in input2 → output vector size = 1

  • Encoded as:

    input2=0.0 → (1,[0],[1.0])
    input2=1.0 → (1,[],[])     # dropped category

Beyond the basics above, it is also useful to know how to:

✔ keep all categories (no drop)
✔ extract the category → vector index mapping
✔ convert SparseVector to dense vector for readability

 

Here are those three things, clearly explained with ready-to-use PySpark code:

(A) Keep all categories (no drop)
(B) Get the category → vector index mapping
(C) Convert SparseVector to DenseVector for readability


🅰 A. Keep all categories (no drop)

By default, PySpark OneHotEncoder drops the last category to avoid multicollinearity (dummy variable trap).

To keep all categories:

from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(
    inputCols=["input1", "input2"],
    outputCols=["output1", "output2"],
    dropLast=False   # ← keep every category
)

encoder_model = encoder.fit(df)
encoded_df = encoder_model.transform(df)
encoded_df.show(truncate=False)

If a column has k categories, the output vector will have k elements (instead of k-1).
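For the example DataFrame above, each category now gets its own index and no row encodes as all zeros, so the output should look like this:

+------+------+-------------+-------------+
|input1|input2|output1      |output2      |
+------+------+-------------+-------------+
|0.0   |1.0   |(3,[0],[1.0])|(2,[1],[1.0])|
|1.0   |0.0   |(3,[1],[1.0])|(2,[0],[1.0])|
|2.0   |1.0   |(3,[2],[1.0])|(2,[1],[1.0])|
+------+------+-------------+-------------+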


🅱 B. Get category → vector index mapping

After fitting:

encoder_model.categorySizes

This returns the number of categories for each input column.

Example output:

[3, 2]

Meaning:

  • input1 has 3 distinct categories → output1 vector length = 2 (or 3 if dropLast=False)

  • input2 has 2 distinct categories → output2 vector length = 1 (or 2 if dropLast=False)

You can also get full mapping:

1. Extract labels for input columns

If the input columns were originally strings (string → indexed → encoded),
you can inspect the fitted StringIndexerModel to recover the labels.

But since your inputs are already numeric category indices, Spark treats the value itself as the category index.

So the mapping is simply:

Category value → vector index

Note that this only holds if the category values are dense (0, 1, 2, ...). OneHotEncoder does not remap sparse values such as 10, 20, 50: each value is used as a vector index directly, so the fitted category size grows to the largest value plus one, with most slots unused. To get compact, contiguous indices, run the values through StringIndexer first.
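Here is a minimal sketch of that preprocessing step, assuming a hypothetical column "raw" holding the sparse values (StringIndexer and OneHotEncoder are both standard pyspark.ml.feature classes):

from pyspark.ml.feature import StringIndexer, OneHotEncoder

raw_df = spark.createDataFrame([(10.0,), (20.0,), (50.0,)], ["raw"])

# StringIndexer casts values to strings and assigns contiguous indices
indexer_model = StringIndexer(inputCol="raw", outputCol="raw_idx").fit(raw_df)
print(indexer_model.labels)   # e.g. ['10.0', '20.0', '50.0']

indexed_df = indexer_model.transform(raw_df)
encoded_df2 = (OneHotEncoder(inputCols=["raw_idx"], outputCols=["raw_vec"])
               .fit(indexed_df)
               .transform(indexed_df))
encoded_df2.show(truncate=False)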

To see mapping explicitly:

for i, size in enumerate(encoder_model.categorySizes):
    print(f"Column {i} categories: 0..{size - 1}")

Example output:

Column 0 categories: 0..2
Column 1 categories: 0..1

If you want a fully explicit mapping:

drop_last = encoder.getDropLast()

for col, size in zip(["input1", "input2"], encoder_model.categorySizes):
    print(f"Mapping for {col}:")
    for idx in range(size):
        if drop_last and idx == size - 1:
            print(f"  category {idx} → all-zero vector (dropped)")
        else:
            print(f"  category {idx} → vector index {idx}")
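With the default dropLast=True, this prints:

Mapping for input1:
  category 0 → vector index 0
  category 1 → vector index 1
  category 2 → all-zero vector (dropped)
Mapping for input2:
  category 0 → vector index 0
  category 1 → all-zero vector (dropped)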

 


🅲 C. Convert SparseVector to DenseVector

Sometimes you want to see the full vector (e.g., [0,1,0]) instead of:

(3,[1],[1.0])

Use a UDF:

from pyspark.sql import functions as F
from pyspark.ml.linalg import DenseVector, VectorUDT

# The UDF's return type must be VectorUDT, not DenseVector
to_dense = F.udf(lambda v: DenseVector(v.toArray()), VectorUDT())

df_dense = (encoded_df
            .withColumn("output1_dense", to_dense("output1"))
            .withColumn("output2_dense", to_dense("output2")))
df_dense.show(truncate=False)

 

Example output

 
+------+------+-------------+-------------+-------------+-------------+
|input1|input2|output1      |output2      |output1_dense|output2_dense|
+------+------+-------------+-------------+-------------+-------------+
|0.0   |1.0   |(2,[0],[1.0])|(1,[],[])    |[1.0,0.0]    |[0.0]        |
|1.0   |0.0   |(2,[1],[1.0])|(1,[0],[1.0])|[0.0,1.0]    |[1.0]        |
|2.0   |1.0   |(2,[],[])    |(1,[],[])    |[0.0,0.0]    |[0.0]        |
+------+------+-------------+-------------+-------------+-------------+

This makes the encoding much easier to inspect.
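As an alternative to a Python UDF, Spark 3.0+ ships a built-in pyspark.ml.functions.vector_to_array that converts ML vectors to plain arrays and avoids the Python UDF round-trip:

from pyspark.ml.functions import vector_to_array

df_arr = (encoded_df
          .withColumn("output1_arr", vector_to_array("output1"))
          .withColumn("output2_arr", vector_to_array("output2")))
df_arr.show(truncate=False)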


🎉 Summary

✔ Keep all categories

 
dropLast=False

✔ See number of categories per column

 
encoder_model.categorySizes

✔ Convert SparseVector → DenseVector

 
DenseVector(v.toArray())

 
