```python
from pyspark.sql.functions import expr

df_filtered = df_filtered.withColumn('filtered_array', expr('filter(filtered_doc, x -> len(x) >= 4)'))
```
Please help me understand the expr() and filter() functions.
✅ What expr() Does
expr() lets you write Spark SQL expressions inside PySpark code.
Example:
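A minimal sketch, assuming a DataFrame df with numeric columns a and b (the column names here are illustrative):

```python
from pyspark.sql.functions import expr

# 'a + b' is parsed as a Spark SQL expression string, not as Python code
df = df.withColumn('total', expr('a + b'))
```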
👉 It is especially useful for higher-order functions, which are easier to express in SQL syntax than as Python UDFs, and which run natively in Spark's engine instead of round-tripping data through Python.
🧩 What filter() Does (Spark SQL Higher-Order Function)
filter(array, function) is a higher-order function in Spark SQL that:
- Takes an array column
- Applies a lambda function to every element
- Keeps only the elements for which the function returns true
Syntax inside expr():
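In general form (written here as the string you would pass to expr()):

```python
expr("filter(array_column, x -> boolean_expression_on_x)")
```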
🧠 Understanding Your Line
✔ filtered_doc
An array column (probably an array of words, like ["this", "is", "text"]).
✔ filter(...)
Keeps only array elements that satisfy the condition.
✔ x -> len(x) >= 4
This is a lambda function:
- `x` is each element of the array
- `len(x)` is the length of the string (if your Spark version doesn't recognize len, use the standard length(x) function instead)
- Condition: keep words with 4 or more characters
✔ The result
If filtered_doc is:
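```python
["this", "is", "text"]
```

the result is:

```python
["this", "text"]
```

because "is" has only 2 characters, while "this" and "text" each have 4.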
📝 Equivalent Without expr()
You could use PySpark’s built-in functions:
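A sketch using the native column API instead of a SQL string (Spark 3.1+ provides pyspark.sql.functions.filter, which accepts a Python lambda over Column expressions):

```python
from pyspark.sql import functions as F

# Same logic without expr(): keep only strings of length >= 4
df_filtered = df_filtered.withColumn(
    'filtered_array',
    F.filter('filtered_doc', lambda x: F.length(x) >= 4)
)
```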
🎯 Summary
expr()
- Runs SQL expressions in PySpark
- Useful for complex expressions and higher-order functions
filter()
- Spark SQL higher-order function
- Filters an array based on a lambda predicate
- Used here to keep only words of length ≥ 4