Descriptive names for cached DataFrames in Spark
Concise names for cached tables.
Have you ever wondered what the cryptic names of cached DataFrames and RDDs in Spark’s web UI refer to?
Usually no explicit name is set.
When you call df.cache,
Spark auto-generates the name from a snippet of the query plan.
This is not very descriptive, especially when there are many cached tables or the Spark cluster is shared by several users.
However, there is a better way. Spark's internal CacheManager accepts an optional table name, so a small helper can set it explicitly:

import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK

def namedCache(name: String, storageLevel: StorageLevel = MEMORY_AND_DISK)(
    df: DataFrame): DataFrame = {
  // cacheQuery takes an optional table name, which is displayed in the web UI
  df.sparkSession.sharedState.cacheManager
    .cacheQuery(df, Some(name), storageLevel)
  df
}
One can simply pass a name explicitly, and it shows up in the web UI's Storage tab. This greatly simplifies debugging for me. Note that cacheManager is an internal API, so its signature may change between Spark versions.
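A minimal usage sketch, assuming a local session; the sample data and the name "users" are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("named-cache-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical data; any DataFrame works
val users = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

// Register the cache under an explicit name
val cachedUsers = namedCache("users")(users)

// Caching is lazy: the first action materializes it,
// after which "users" appears in the web UI's Storage tab
cachedUsers.count()
```

Because namedCache returns the DataFrame unchanged, it can be dropped into an existing pipeline wherever df.cache was used before.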