talk-spark-and-parquet-in-depth

https://databricks.com/session/spark-parquet-in-depth

https://youtu.be/_0Wpwj_gvzg

talk-spark-and-parquet-in-depth#bottlenecks1Spark and Parquet provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (I/O bound) and data decoding (CPU bound). talk-spark-and-parquet-in-depth#bottlenecks1
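A toy illustration (not Parquet's actual format) of the I/O side of this: in a columnar layout a query over one column reads only that column's bytes, while a row-oriented layout has to scan every field of every row. The schema and sizes here are made up for the sketch.

```python
# Toy comparison of row-oriented vs column-oriented storage for a table of
# (id: int64, score: float64, flag: bool) rows. Illustrative only.
import struct

rows = [(i, float(i) * 1.5, i % 2 == 0) for i in range(1000)]

# Row-oriented file: fields interleaved, so reading just "id" scans everything.
row_oriented = b"".join(struct.pack("<qd?", *r) for r in rows)

# Column-oriented file: each column stored contiguously; read only what you need.
id_column = b"".join(struct.pack("<q", r[0]) for r in rows)

bytes_scanned_row_layout = len(row_oriented)  # 17 bytes/row * 1000 rows
bytes_scanned_col_layout = len(id_column)     # 8 bytes/row * 1000 rows
assert bytes_scanned_col_layout < bytes_scanned_row_layout
```

Scanning only the `id` column touches less than half the bytes here; with wide tables and selective queries the gap grows much larger.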

At 6:20: Parquet compressed with Snappy

Binary format; each column is separately encoded and then compressed
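A minimal sketch of that encode-then-compress pipeline: a lightweight encoding (dictionary encoding here) is applied per column first, then a general-purpose codec. `zlib` stands in for Snappy, which is not in the Python standard library; the column data is invented for the example.

```python
# Per-column pipeline: dictionary-encode, then compress the small indexes.
import zlib

def dictionary_encode(column):
    """Replace repeated values with small integer indexes into a dictionary."""
    dictionary, indexes, positions = [], [], {}
    for v in column:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indexes.append(positions[v])
    return dictionary, indexes

col = ["us", "us", "de", "us", "fr", "de"] * 100
dictionary, indexes = dictionary_encode(col)
encoded = bytes(indexes)              # indexes all fit in one byte here
compressed = zlib.compress(encoded)
assert len(compressed) < len(encoded) < len("".join(col).encode())
```

Because one column holds values of a single type with lots of repetition, the encoded stream is far more compressible than interleaved row data would be.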

At 8:40: Parquet can handle arbitrarily nested data
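Parquet flattens nested data into flat columns using Dremel-style repetition and definition levels. This is a minimal sketch for the simplest case, a single repeated int field per record; real Parquet generalizes the same idea to arbitrary nesting depth.

```python
# Dremel-style record shredding for one level of nesting (a list per record).
def shred(records):
    """Flatten lists of ints into parallel (value, repetition, definition) columns.

    repetition level: 0 starts a new record, 1 continues the current list.
    definition level: 1 means a value is present, 0 marks an empty list.
    """
    values, rep_levels, def_levels = [], [], []
    for record in records:
        if not record:                       # empty list: placeholder, def=0
            values.append(None)
            rep_levels.append(0)
            def_levels.append(0)
            continue
        for i, v in enumerate(record):
            values.append(v)
            rep_levels.append(0 if i == 0 else 1)
            def_levels.append(1)
    return values, rep_levels, def_levels

def assemble(values, rep_levels, def_levels):
    """Reconstruct the original records from the shredded columns."""
    records = []
    for v, r, d in zip(values, rep_levels, def_levels):
        if r == 0:
            records.append([])
        if d == 1:
            records[-1].append(v)
    return records

data = [[1, 2, 3], [], [4]]
assert assemble(*shred(data)) == data
```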

At 21:00: pushdown filters, where the filters are pushed down to the Parquet reader
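A sketch of what pushdown buys: Parquet stores min/max statistics per row group, so a pushed-down predicate can skip whole row groups without decoding them. The structures and names below are illustrative, not the actual Parquet reader API.

```python
# Skip row groups whose min/max statistics prove they cannot match the filter.
row_groups = [
    {"min": 0,   "max": 99,  "values": list(range(0, 100))},
    {"min": 100, "max": 199, "values": list(range(100, 200))},
    {"min": 200, "max": 299, "values": list(range(200, 300))},
]

def scan_with_pushdown(groups, lo, hi):
    """Return values in [lo, hi], decoding only row groups that can match."""
    out, groups_read = [], 0
    for g in groups:
        if g["max"] < lo or g["min"] > hi:
            continue                  # statistics prove no match: skip decode
        groups_read += 1
        out.extend(v for v in g["values"] if lo <= v <= hi)
    return out, groups_read

result, groups_read = scan_with_pushdown(row_groups, 150, 160)
assert result == list(range(150, 161))
assert groups_read == 1               # two of three row groups never decoded
```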

At maybe 22:50: immutability of the files

At 23:10: use Cassandra to gather the data, and when it is sufficiently historical, write it out to Parquet

talk-spark-and-parquet-in-depth#stream-collect-until-watermark-condition1At 23:40, in a streaming context: collect until a watermark condition is met (time, size, number of rows, etc.) talk-spark-and-parquet-in-depth#stream-collect-until-watermark-condition1
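The collect-then-flush pattern can be sketched as a buffer that flushes a batch once any trigger fires: elapsed time, accumulated bytes, or row count. The class name and thresholds are made up for the sketch; a real sink would write each flushed batch out as a Parquet file.

```python
# Buffer incoming rows; flush when a time, size, or row-count trigger fires.
import time

class StreamBuffer:
    def __init__(self, max_rows=1000, max_bytes=1 << 20, max_seconds=60.0):
        self.max_rows = max_rows
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self._reset()

    def _reset(self):
        self.rows, self.nbytes, self.start = [], 0, time.monotonic()

    def add(self, row: bytes):
        """Add one row; return the flushed batch if a trigger fired, else None."""
        self.rows.append(row)
        self.nbytes += len(row)
        if self._should_flush():
            return self.flush()
        return None

    def _should_flush(self):
        return (len(self.rows) >= self.max_rows
                or self.nbytes >= self.max_bytes
                or time.monotonic() - self.start >= self.max_seconds)

    def flush(self):
        batch = self.rows
        self._reset()
        return batch          # a real sink would write this batch to Parquet

buf = StreamBuffer(max_rows=3)
assert buf.add(b"a") is None
assert buf.add(b"b") is None
assert buf.add(b"c") == [b"a", b"b", b"c"]   # row-count trigger fired
```

Batching this way matters precisely because Parquet files are immutable: you want each flush to produce a reasonably sized, never-rewritten file.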

Referring Pages

data-architecture-glossary