from pyspark.ml.feature import PCA

pca = PCA(k=2, inputCol='features', outputCol='pca_features')
pca_model = pca.fit(df)
pca_df = pca_model.transform(df)
pca_df.show(truncate=False)
+---+--------------+-----------------------------------------+
|id |features      |pca_features                             |
+---+--------------+-----------------------------------------+
|0  |[1.0,0.1,-1.0]|[0.06466700238304013,-0.4536718845187467]|
|1  |[2.0,1.1,1.0] |[-1.6616789696362084,1.2840650302335732] |
|2  |[3.0,10.1,3.0]|[-10.87075006221038,0.19181523649833354] |
+---+--------------+-----------------------------------------+
In the PCA transformer:
pca = PCA(k=2, inputCol="features", outputCol="pca_features")

the parameter k means:
✅ k = number of principal components you want to keep
📌 What PCA does in PySpark
PCA takes your original feature vector (maybe 10, 100, or 1000 dimensions) and compresses it into k orthogonal components that capture the most variance in the data.
Example
If your input vector is 5-dimensional:
features = [x1, x2, x3, x4, x5]

and you set:

k = 2

then PCA will output a 2-dimensional compressed vector:

pca_features = [pc1, pc2]

Each of these two values is a linear combination of the original five features.
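To make that concrete, here is a minimal sketch of the 5-feature case. The data values, the DataFrame name df5, and the variable pca5 are hypothetical, purely for illustration:

from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Three hypothetical 5-dimensional rows (values made up for illustration).
df5 = spark.createDataFrame(
    [(Vectors.dense([1.0, 0.5, 2.0, 0.1, 3.0]),),
     (Vectors.dense([2.0, 1.5, 1.0, 0.2, 4.0]),),
     (Vectors.dense([3.0, 2.5, 0.5, 0.3, 5.0]),)],
    ["features"],
)

# k=2: each 5-dimensional vector is compressed into 2 principal components.
pca5 = PCA(k=2, inputCol="features", outputCol="pca_features").fit(df5)
pca5.transform(df5).show(truncate=False)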
📌 Meaning of choosing a value for k
k < number of original features
→ dimensionality reduction
→ you keep the most informative components
→ remove noise and redundancy
k = number of original features
→ no reduction; PCA just rotates the axes
k > number of original features
❌ not allowed; Spark will throw an error
📌 How to choose k?
Common strategies:
1. Based on explained variance
You can fit PCA with k equal to the number of input features and examine how much variance each component explains:
model = pca.fit(df)
print(model.explainedVariance)
This returns e.g.:
[0.92, 0.06, 0.01, 0.01]
→ PC1 explains 92% of the variation
→ PC2 explains 6%
→ PC3–PC4 explain almost nothing.
Then choose k = 1 or 2.
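As a minimal sketch of this strategy, assuming a DataFrame df whose features column has 4 dimensions (so it matches the numbers above):

from pyspark.ml.feature import PCA

# Fit with k equal to the number of input features (assumed to be 4 here)
# so the explained variance of every component is reported.
probe = PCA(k=4, inputCol="features", outputCol="pca_probe").fit(df)
print(probe.explainedVariance)   # e.g. [0.92, 0.06, 0.01, 0.01]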
📌 Summary
| Item | Meaning |
|---|---|
| k | Number of principal components to return |
| Output dimension | Equal to k |
| Constraint | k must be ≤ the number of input features |
If you want, I can also explain:
- how PCA results are computed
- how to interpret pca_features
- how to view the PCA loadings (eigenvectors) in PySpark
Just tell me!
You said:
ChatGPT said:
Great — here is a clear, practical explanation of how to interpret PCA results in PySpark, including:
- How PCA computes the result vectors
- How to interpret pca_features
- How to view PCA loadings (eigenvectors)
- A complete example
✅ 1. How PCA computes the results
Given an input vector:
features = [x1, x2, x3, ..., xn]
PCA finds a set of orthogonal directions (principal components):
PC1 = [w11, w12, ..., w1n]
PC2 = [w21, w22, ..., w2n]
...
These are eigenvectors of the covariance matrix.
The PCA output is:
pca_features = [PC1 ⋅ features, PC2 ⋅ features, ..., PCk ⋅ features]
This is a projection of your original vector onto the new PCA axes.
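To see this concretely, here is a small check using the pca_model and pca_df from the first snippet: the output column should equal the raw feature vector multiplied by the matrix of principal components.

# Check the formula above on the first row of pca_df:
# Spark projects the feature vector onto the columns of pca_model.pc.
row = pca_df.select("features", "pca_features").first()
x = row["features"].toArray()        # original 3-dimensional vector
W = pca_model.pc.toArray()           # 3 x 2 matrix: one column per component
print(x @ W)                         # same values as row["pca_features"]
print(row["pca_features"])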
✅ 2. How to interpret pca_features
Example
If you set:
pca = PCA(k=2, inputCol="features", outputCol="pca_features")
Then each row gets:
3-dimensional input → 2-dimensional PCA output
Example:
input: [1.0, 2.0, 3.0]
output: [4.53, -0.22]
Interpretation:
- PC1 value (4.53) → projection onto the direction that explains the MOST variance
- PC2 value (-0.22) → projection onto the second-most important direction
So PCA transforms your data into a coordinate system where:
- PC1 contains the most information
- PC2 contains the second most
- ...
- PCk contains the least information among the components kept
✅ 3. How to view PCA loadings (eigenvectors)
PySpark stores eigenvectors in the model:
model = pca.fit(df)
model.pc
This returns a matrix of size:
num_features × k
To view them in a readable form:
pc = model.pc.toArray()   # already a NumPy array of shape (num_features, k)
print(pc)
Example output (for k=2 and 3 input features):
[[ 0.57,  0.82],
 [ 0.57, -0.41],
 [ 0.57, -0.41]]
Interpretation:
PC1 loading vector (first column) = [0.57, 0.57, 0.57]
- All features contribute equally
- PC1 measures overall magnitude

PC2 loading vector (second column) = [0.82, -0.41, -0.41]
- PC2 separates feature1 from feature2 & feature3
- PC2 measures differences between the features
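To make the loadings easier to read, here is a small sketch that prints them next to feature names. The names are hypothetical, since model.pc only knows column positions:

# Print each principal component's loadings next to the input feature names.
feature_names = ["feature1", "feature2", "feature3"]   # hypothetical names
loadings = model.pc.toArray()                          # shape (num_features, k)
for j in range(loadings.shape[1]):
    print(f"PC{j + 1}:")
    for name, weight in zip(feature_names, loadings[:, j]):
        print(f"  {name}: {weight:.3f}")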
✅ 4. How to print explained variance
print(model.explainedVariance)
Example:
[0.92, 0.07, 0.01]
Meaning:
- PC1 explains 92% of the variance → very important
- PC2 explains 7% → somewhat important
- PC3 explains 1% → noise
Often you choose k so that:
sum(variance[:k]) ≥ 0.95
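A minimal sketch of that rule with NumPy, using the hypothetical variance numbers above (with a real model you would take model.explainedVariance.toArray()):

import numpy as np

# Hypothetical explained-variance values for illustration.
variances = np.array([0.92, 0.07, 0.01])
cumulative = np.cumsum(variances)            # [0.92, 0.99, 1.00]

# Smallest k whose components together explain at least 95% of the variance.
k = int(np.argmax(cumulative >= 0.95)) + 1   # -> 2 for these numbers
print(k)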
📌 Complete Practical Example
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (Vectors.dense([1.0, 2.0, 3.0]),),
    (Vectors.dense([4.0, 6.0, 8.0]),),
    (Vectors.dense([9.0, 13.0, 17.0]),),
]
df = spark.createDataFrame(data, ["features"])

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(df)
result = model.transform(df)

result.show(truncate=False)
print("Explained variance:", model.explainedVariance)
print("Principal components:\n", model.pc.toArray())
🎉 Summary
| Parameter | Meaning |
|---|---|
| k | Number of principal components to keep |
| model.pc | Eigenvectors (loadings) |
| model.explainedVariance | Variance explained by each PC |
| pca_features | Projection of original vector onto PC axes |
You now fully understand how PCA in PySpark works.