```python
from pyspark.ml.feature import MinMaxScaler

scaler = MinMaxScaler(inputCol='features', outputCol='scaled_features')
scaler_model = scaler.fit(df)
scaled_df = scaler_model.transform(df)
scaled_df.show(truncate=False)
```
```
+---+--------------+---------------+
|id |features      |scaled_features|
+---+--------------+---------------+
|0  |[1.0,0.1,-1.0]|(3,[],[])      |
|1  |[2.0,1.1,1.0] |[0.5,0.1,0.5]  |
|2  |[3.0,10.1,3.0]|[1.0,1.0,1.0]  |
+---+--------------+---------------+
```
Please help me understand the values of the scaled_features column.
Here is what’s happening in your scaled_features column and why you see values like (3,[],[]) for the first row.
✅ How MinMaxScaler Works
MinMaxScaler rescales each feature to the range [0, 1] using that column's own minimum and maximum:

scaled = (x − col_min) / (col_max − col_min)

The min and max are computed per column (per feature index), not per row.
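As a formula in code, the default [0, 1] scaling looks like this (a minimal sketch; the function and argument names are just illustrative, not Spark's API):

```python
def min_max_scale(x, col_min, col_max):
    """Rescale a single value into [0, 1] using its column's min and max."""
    return (x - col_min) / (col_max - col_min)

# Example: feature index 0 has values 1.0, 2.0, 3.0, so min=1.0 and max=3.0.
print(min_max_scale(2.0, col_min=1.0, col_max=3.0))  # 0.5
```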
📌 Your Input
| Feature Index | Values | Min | Max | Range |
|---|---|---|---|---|
| 0 | 1.0, 2.0, 3.0 | 1.0 | 3.0 | 2.0 |
| 1 | 0.1, 1.1, 10.1 | 0.1 | 10.1 | 10.0 |
| 2 | -1.0, 1.0, 3.0 | -1.0 | 3.0 | 4.0 |
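As a sanity check, this small plain-Python sketch (not Spark's implementation) reproduces the per-column statistics above and the scaled rows:

```python
# The three feature vectors from the example DataFrame.
rows = [
    [1.0, 0.1, -1.0],   # id 0
    [2.0, 1.1, 1.0],    # id 1
    [3.0, 10.1, 3.0],   # id 2
]

# Min and max per column (per feature index), not per row.
col_min = [min(col) for col in zip(*rows)]   # [1.0, 0.1, -1.0]
col_max = [max(col) for col in zip(*rows)]   # [3.0, 10.1, 3.0]

# Apply (x - min) / (max - min) column-wise to every row.
for row in rows:
    print([(x - lo) / (hi - lo) for x, lo, hi in zip(row, col_min, col_max)])
# [0.0, 0.0, 0.0], [0.5, ~0.1, 0.5], [1.0, 1.0, 1.0]
# (the middle value of row 1 shows tiny floating-point round-off).
```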
📌 Compute Scaled Values
Row 1 (id = 1) → [2.0, 1.1, 1.0]
Index 0:
(2.0−1.0)/(3.0−1.0)=1/2=0.5
Index 1:
(1.1−0.1)/(10.1−0.1)=1/10=0.1
Index 2:
(1.0−(−1.0))/(3.0−(−1.0))=2/4=0.5
✔ Output: [0.5, 0.1, 0.5]
Row 2 (id = 2) → [3.0, 10.1, 3.0]
All are max values → scale to 1:
✔ [1.0, 1.0, 1.0]
❗ Why Row 0 → (3,[],[]) (all zeros)
For row 0:
[1.0, 0.1, -1.0]
Index 0:
(1.0−1.0)/2.0=0
Index 1:
(0.1−0.1)/10.0=0
Index 2:
(−1.0−(−1.0))/4.0=0
👉 All scaled values are 0: [0.0, 0.0, 0.0]

But Spark stores a vector that contains only zeros in SparseVector format:

(3,[],[])

This means:

- size = 3
- no non-zero indices = []
- no non-zero values = []

This is equivalent to the dense vector [0.0, 0.0, 0.0].
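You can check the equivalence directly with the vector classes from pyspark.ml.linalg (a standalone check, independent of the DataFrame above):

```python
from pyspark.ml.linalg import Vectors

sv = Vectors.sparse(3, [], [])   # size 3, no non-zero indices, no non-zero values
print(sv)                        # (3,[],[])
print(sv.toArray())              # array of three zeros
print(Vectors.dense([0.0, 0.0, 0.0]).toArray())  # the same three zeros
```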
✅ Summary
| id | features | scaled_features | Meaning |
|---|---|---|---|
| 0 | [1.0,0.1,-1.0] | (3,[],[]) | all zeros → min values |
| 1 | [2.0,1.1,1.0] | [0.5,0.1,0.5] | mid-range values |
| 2 | [3.0,10.1,3.0] | [1.0,1.0,1.0] | max values |
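One more way to see where these numbers come from: the fitted MinMaxScalerModel keeps the per-column minima and maxima it learned. If I'm not mistaken, they are exposed as originalMin and originalMax (worth verifying on your Spark version):

```python
# Continues from the snippet at the top; scaler_model is the fitted MinMaxScalerModel.
print(scaler_model.originalMin)   # per-column minima, expected [1.0,0.1,-1.0]
print(scaler_model.originalMax)   # per-column maxima,  expected [3.0,10.1,3.0]
```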