Dynamically select columns by type
Generic function to select columns by type in Spark
In pandas, it is really easy to select only the columns matching a certain data type:
df.select_dtypes(include=['float64'])
In Spark, no such function is included by default. However, it can easily be written by hand:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, IntegerType}
import spark.implicits._ // for toDF; already in scope in the spark-shell, assuming a SparkSession named spark otherwise

val df = Seq(
  (1, 2, "hello")
).toDF("id", "count", "name")

// Keep only the columns whose data type matches colType
def selectByType(colType: DataType, df: DataFrame): DataFrame = {
  val cols = df.schema.toList
    .filter(field => field.dataType == colType)
    .map(field => col(field.name))

  df.select(cols: _*)
}

val res = selectByType(IntegerType, df)
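As a quick check (a sketch, assuming the df, selectByType and res defined above), the call keeps only the two integer columns, and the same helper can be reused for any other Spark SQL type, for example StringType:

import org.apache.spark.sql.types.StringType

// Only the integer columns survive; everything else is dropped
res.columns // Array(id, count)

// The same helper works for any DataType, e.g. string columns
val strings = selectByType(StringType, df)
strings.columns // Array(name)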