http://tech.marksblogg.com/billion-nyc-taxi-rides-sqlite-parquet-hdfs.html
Closely related to blog-post-query-parquet-files-in-sqlite
Includes a link to taxi-query-benchmarks
blog-post-1-billion-taxi-rides-parquet-sqlite-hdfs#optimizations
Other optimisations include dictionary encoding for columns whose number of unique values is in the five-figure range or fewer, bit packing, where small integers are stored together as a single larger integer, and run-length encoding, where sequentially repeating values are stored as the value followed by its number of occurrences.
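As a minimal sketch of what those encodings look like in practice (assuming pyarrow, which the blog post itself does not use; the column and file names here are illustrative), the snippet below writes a low-cardinality string column with dictionary encoding enabled and then inspects the column-chunk metadata to see which encodings the writer actually applied.

```python
# Sketch: dictionary encoding in Parquet via pyarrow (pyarrow is an
# assumption here; the post builds its Parquet files with other tooling).
import pyarrow as pa
import pyarrow.parquet as pq

# A low-cardinality column: three distinct cab types repeated across many
# rows, so a dictionary of unique values plus small integer codes is far
# cheaper than storing every string verbatim.
cab_type = ["yellow", "green", "uber"] * 100_000
table = pa.table({"cab_type": cab_type})

# use_dictionary=True asks the writer to dictionary-encode eligible columns;
# run-length and bit-packed encodings are applied to the codes automatically.
pq.write_table(table, "cab_type.parquet", use_dictionary=True)

# Inspect the column chunk to confirm which encodings were used.
meta = pq.ParquetFile("cab_type.parquet").metadata
col = meta.row_group(0).column(0)
print(col.encodings)               # e.g. ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
print(col.total_compressed_size)   # bytes on disk for this column chunk
```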
blog-post-1-billion-taxi-rides-parquet-sqlite-hdfs#io-bound-vs-compute-bound
Where decompression is I/O or network bound, it makes sense to keep the compressed data as compact as possible. That being said, there are cases where decompression is compute bound and compression schemes like Snappy play a useful role in lowering the overhead.
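A rough way to see that trade-off (again assuming pyarrow, plus the availability of the Snappy and gzip codecs; the column names are made up) is to write the same table with different codecs and compare the file sizes: a heavier codec wins when reads are I/O or network bound, while Snappy's cheaper decompression tends to win when reads are compute bound.

```python
# Sketch: comparing Parquet codecs to weigh I/O cost against CPU cost.
# pyarrow and the presence of the SNAPPY/GZIP codecs are assumptions.
import os
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "trip_id": list(range(1_000_000)),
    "total_amount": [round(5.0 + (i % 400) * 0.25, 2) for i in range(1_000_000)],
})

for codec in ("NONE", "SNAPPY", "GZIP"):
    path = f"trips_{codec.lower()}.parquet"
    pq.write_table(table, path, compression=codec)
    print(f"{codec:>8}: {os.path.getsize(path):>12,} bytes")

# Smaller files cost less to read off disk or the network; faster codecs
# cost less CPU to decompress. Which matters depends on where the query
# is actually bottlenecked.
```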
blog-post-1-billion-taxi-rides-parquet-sqlite-hdfs#multiple-tables-views
Because the 1.1 billion records sit across 56 virtual tables, I've had to create views in SQLite federating the data I'm querying. The SELECT queries themselves have been modified to work with the semi-aggregated data.
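A minimal sketch of that federation pattern, using Python's sqlite3 with ordinary in-memory tables standing in for the Parquet-backed virtual tables (the parquet virtual-table extension, the real table names, and the full 56-way split are not reproduced here):

```python
# Sketch: federating several per-chunk tables behind one SQLite view.
# Plain tables stand in for the Parquet-backed virtual tables; names and
# columns are illustrative, and only 3 of the 56 chunks are simulated.
import sqlite3

conn = sqlite3.connect(":memory:")

chunk_tables = [f"trips_{i:02d}" for i in range(3)]
for i, name in enumerate(chunk_tables):
    conn.execute(f"CREATE TABLE {name} (cab_type TEXT, passenger_count INTEGER)")
    conn.executemany(
        f"INSERT INTO {name} VALUES (?, ?)",
        [("yellow", 1 + i), ("green", 2 + i), ("yellow", 2)],
    )

# One view UNION ALLs every chunk so queries can treat it as a single table.
union_sql = " UNION ALL ".join(f"SELECT * FROM {name}" for name in chunk_tables)
conn.execute(f"CREATE VIEW trips AS {union_sql}")

# A simple count-by-cab-type query phrased against the federated view.
for row in conn.execute("SELECT cab_type, COUNT(*) FROM trips GROUP BY cab_type"):
    print(row)
```

In the post the per-chunk data is semi-aggregated, so the real queries sum pre-computed counts rather than counting raw rows as this sketch does for simplicity.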