https://databricks.com/session/spark-parquet-in-depth
Spark + Parquet provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (I/O bound) and data decoding (CPU bound).
At 6:20: Parquet compressed with Snappy
Binary format; each column is separately encoded and then compressed
At 8:40: can handle arbitrarily nested data
At 21:00: pushed filters (predicate pushdown), where filters are pushed down to the storage layer
At ~22:50: immutability of the files
At 23:10: use Cassandra to gather the data, and when it is sufficiently historical, write it out to Parquet
At 23:40, in a streaming context: collect until a watermark condition is met (time, size, number of rows, etc.), then write out