Spark and Hive 3
A tale of two metastores
Recently I experienced a cluster upgrade to the next major version of hadoop. Many component versions have been updated. In particular hive moved from 1.2.x to 3.x line. This introduces many changes as hive tries to behave more and more like a traditional database. In particular, hive introduced ACID tables.
By default Hive tries to optimize (i.e. compact and sort) the data automatically so users do not need to worry about this. Additionally, they do support UPSERTS as a slowly changing dimension, which could be suited for GDPR deletions. Unfortunately, spark as of 2.3 or 2.4 does not support reading these tables. In fact, distributors like Cloudera have introduced a second warehouse only used by spark - for a reason.
If you still want to keep everyone using the same warehouse, setting --conf spark.hadoop.metastore.catalog.default=hive
gets the job done.
The first catch is when one of the tables is ACID spark will only be able to read it corrently after a full major compaction. Otherwise, Spark might receive outdated data as it does not understand the delta files. The second and more subtle is: by not using LLAP’s ranger plugin the columnar ranger policies do not apply.
Both spark and hive remain interoperable using:
- external tables (the traditional approach)
- HiveWarehouseConnector
Using the HiveWarehouseConnector is slower than the old file access methodology. This also holds true for Databricks delta, where directly writing to parquet would be faster but not ACID. But in addition managed tables automatically manage the size of the partitions and collect statistics - which might be beneficial overall.
Contrary to what is found in most of the documentation about the warehouse connector it can work with both LLAP and Tez, though there are some restrictions in case of the Tez execution path.
The warehouse connector is not a drop-in replacement. I.e. sparks SQL optimization is broken at the boundaries:
hwx.sql("SELECT * from db.table).count
might run out of memory, whereas sparks native counterpart will work just fine.
summary
Spark and Hive have drifted apart and do not play well together in HDP 3.x.
By separating read and write paths you get the best of both worlds: Spark can write without LLAP for maximum performance to an external table. On the read path by applying the HiveWarehouseConnector data is transferred via LLAP for added security.