from pyspark.ml.feature import PCA

pca = PCA(k=2, inputCol='features', outputCol='pca_features')
pca_model = pca.fit(df)
pca_df = pca_model.transform(df)
pca_df.show(truncate=False)
+---+--------------+-----------------------------------------+
|id |features      |pca_features                             |
+---+--------------+-----------------------------------------+
|0  |[1.0,0.1,-1.0]|[0.06466700238304013,-0.4536718845187467]|
|1  |[2.0,1.1,1.0] |[-1.6616789696362084,1.2840650302335732] |
|2  |[3.0,10.1,3.0]|[-10.87075006221038,0.19181523649833354] |
+---+--------------+-----------------------------------------+
In the PCA transformer:
pca = PCA(k=2, inputCol="features", outputCol="pca_features")

the parameter k means:
✅ k = number of principal components you want to keep
📌 What PCA does in PySpark
PCA takes your original feature vector (maybe 10, 100, or 1000 dimensions) and compresses it into k orthogonal components that capture the most variance in the data.
Example
If your input vector is 5-dimensional:
features = [x1, x2, x3, x4, x5]

and you set:

k = 2

then PCA will output a 2-dimensional compressed vector:

pca_features = [pc1, pc2]

Each of these two values is a linear combination of the original five features.
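To make that concrete, here is a minimal sketch of the 5-feature case. The data values, the DataFrame name df5, and the variable pca5 are hypothetical, purely for illustration:

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Three hypothetical 5-dimensional rows (values made up for illustration).
df5 = spark.createDataFrame(
    [(Vectors.dense([1.0, 0.5, 2.0, 0.1, 3.0]),),
     (Vectors.dense([2.0, 1.5, 1.0, 0.2, 4.0]),),
     (Vectors.dense([3.0, 2.5, 0.5, 0.3, 5.0]),)],
    ["features"],
)

# k=2: each 5-dimensional vector is compressed into 2 principal components.
pca5 = PCA(k=2, inputCol="features", outputCol="pca_features").fit(df5)
pca5.transform(df5).show(truncate=False)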
📌 Meaning of choosing a value for k
k < number of original features
→ dimensionality reduction
→ you keep the most informative components
→ remove noise and redundancy
k = number of original features
→ no reduction; PCA just rotates the axes
k > number of original features
❌ not allowed; Spark will throw an error
📌 How to choose k?
Common strategies:
1. Based on explained variance
You can fit PCA with k equal to the number of input features and examine how much variance each component explains:
model = pca.fit(df)
print(model.explainedVariance)
This returns e.g.:
[0.92, 0.06, 0.01, 0.01]
→ PC1 explains 92% of the variation
→ PC2 explains 6%
→ PC3–PC4 explain almost nothing.
Then choose k = 1 or 2.
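As a minimal sketch of this strategy, assuming a DataFrame df whose features column has 4 dimensions (so it matches the numbers above):

from pyspark.ml.feature import PCA

# Fit with k equal to the number of input features (assumed to be 4 here)
# so the explained variance of every component is reported.
probe = PCA(k=4, inputCol="features", outputCol="pca_probe").fit(df)
print(probe.explainedVariance)   # e.g. [0.92, 0.06, 0.01, 0.01]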
📌 Summary
| Item | Meaning |
|---|---|
| k | Number of principal components to return |
| Output dimension | Equal to k |
| Constraint | k must be ≤ the number of input features |
If you want, I can also explain:
- how PCA results are computed
- how to interpret pca_features
- how to view the PCA loadings (eigenvectors) in PySpark
Just tell me!
You said:
ChatGPT said:
Great — here is a clear, practical explanation of how to interpret PCA results in PySpark, including:
- How PCA computes the result vectors
- How to interpret pca_features
- How to view PCA loadings (eigenvectors)
- A complete example
✅ 1. How PCA computes the results
Given an input vector:
features = [x1, x2, x3, ..., xn]
PCA finds a set of orthogonal directions (principal components):
PC1 = [w11, w12, ..., w1n]
PC2 = [w21, w22, ..., w2n]
...
These are eigenvectors of the covariance matrix.
The PCA output is:
pca_features = [PC1 ⋅ features, PC2 ⋅ features, ..., PCk ⋅ features]
This is a projection of your original vector onto the new PCA axes.
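To see this concretely, here is a small check using the pca_model and pca_df from the first snippet: the output column should equal the raw feature vector multiplied by the matrix of principal components.

# Check the formula above on the first row of pca_df:
# Spark projects the feature vector onto the columns of pca_model.pc.
row = pca_df.select("features", "pca_features").first()
x = row["features"].toArray()        # original 3-dimensional vector
W = pca_model.pc.toArray()           # 3 x 2 matrix: one column per component
print(x @ W)                         # same values as row["pca_features"]
print(row["pca_features"])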
✅ 2. How to interpret pca_features
Example
If you set:
pca = PCA(k=2, inputCol="features", outputCol="pca_features")
Then each row gets:
3-dimensional input → 2-dimensional PCA output
Example:
input: [1.0, 2.0, 3.0]
output: [4.53, -0.22]
Interpretation:
- PC1 value (4.53) → projection onto the direction that explains the MOST variance
- PC2 value (-0.22) → projection onto the second-most important direction
So PCA transforms your data into a coordinate system where:
- PC1 contains the most information
- PC2 contains the second most
- ...
- PCk contains the least information among the components kept
✅ 3. How to view PCA loadings (eigenvectors)
PySpark stores eigenvectors in the model:
model = pca.fit(df)
model.pc
This returns a matrix of size:
num_features × k
To view them in a readable form:
pc = model.pc.toArray()   # already a NumPy array of shape (num_features, k)
print(pc)
Example output (for k=2 and 3 input features):
[[ 0.57,  0.82],
 [ 0.57, -0.41],
 [ 0.57, -0.41]]
Interpretation:
PC1 loading vector (first column) = [0.57, 0.57, 0.57]
- All features contribute equally
- PC1 measures overall magnitude

PC2 loading vector (second column) = [0.82, -0.41, -0.41]
- PC2 separates feature1 from feature2 & feature3
- PC2 measures differences between the features
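To make the loadings easier to read, here is a small sketch that prints them next to feature names. The names are hypothetical, since model.pc only knows column positions:

# Print each principal component's loadings next to the input feature names.
feature_names = ["feature1", "feature2", "feature3"]   # hypothetical names
loadings = model.pc.toArray()                          # shape (num_features, k)
for j in range(loadings.shape[1]):
    print(f"PC{j + 1}:")
    for name, weight in zip(feature_names, loadings[:, j]):
        print(f"  {name}: {weight:.3f}")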
✅ 4. How to print explained variance
print(model.explainedVariance)
Example:
[0.92, 0.07, 0.01]
Meaning:
- PC1 explains 92% of the variance → very important
- PC2 explains 7% → somewhat important
- PC3 explains 1% → noise
Often you choose k so that:
sum(variance[:k]) ≥ 0.95
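A minimal sketch of that rule with NumPy, using the hypothetical variance numbers above (with a real model you would take model.explainedVariance.toArray()):

import numpy as np

# Hypothetical explained-variance values for illustration.
variances = np.array([0.92, 0.07, 0.01])
cumulative = np.cumsum(variances)            # [0.92, 0.99, 1.00]

# Smallest k whose components together explain at least 95% of the variance.
k = int(np.argmax(cumulative >= 0.95)) + 1   # -> 2 for these numbers
print(k)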
📌 Complete Practical Example
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (Vectors.dense([1.0, 2.0, 3.0]),),
    (Vectors.dense([4.0, 6.0, 8.0]),),
    (Vectors.dense([9.0, 13.0, 17.0]),),
]
df = spark.createDataFrame(data, ["features"])

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(df)
result = model.transform(df)

result.show(truncate=False)
print("Explained variance:", model.explainedVariance)
print("Principal components:\n", model.pc.toArray())
🎉 Summary
| Parameter | Meaning |
|---|---|
| k | Number of principal components to keep |
| model.pc | Eigenvectors (loadings) |
| model.explainedVariance | Variance explained by each PC |
| pca_features | Projection of original vector onto PC axes |
You now fully understand how PCA in PySpark works.