blog-post-1-billion-taxi-rides-parquet-sqlite-hdfs

http://tech.marksblogg.com/billion-nyc-taxi-rides-sqlite-parquet-hdfs.html

Closely related to blog-post-query-parquet-files-in-sqlite

Includes a link to taxi-query-benchmarks

blog-post-1-billion-taxi-rides-parquet-sqlite-hdfs#optimizations

Other optimisations include dictionary encoding for columns where the number of unique values is in the five-figure range or lower; bit packing, where small integers are stored together as a single, larger integer; and run-length encoding, where sequentially repeating values are stored as the value followed by the number of occurrences.
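
The three encodings named in this excerpt can be illustrated with a minimal Python sketch. This is not Parquet's actual implementation (its on-disk layout is more involved); the function names and the sample `payment_types` column are assumptions made up for illustration.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer index into a list of unique values."""
    dictionary = []   # unique values, in order of first appearance
    index_of = {}     # value -> position in dictionary
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def run_length_encode(values):
    """Store each run of repeated values as (value, number of occurrences)."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def bit_pack(codes, width):
    """Pack small non-negative integers into one larger integer, `width` bits each."""
    packed = 0
    for position, code in enumerate(codes):
        if code < 0 or code >= (1 << width):
            raise ValueError(f"{code} does not fit in {width} bits")
        packed |= code << (position * width)
    return packed


def bit_unpack(packed, width, count):
    """Recover `count` integers of `width` bits each from a packed integer."""
    mask = (1 << width) - 1
    return [(packed >> (position * width)) & mask for position in range(count)]


# Hypothetical low-cardinality column, similar in spirit to a payment type field.
payment_types = ["CSH", "CSH", "CSH", "CRD", "CRD", "CSH", "DIS", "DIS"]

dictionary, codes = dictionary_encode(payment_types)
runs = run_length_encode(codes)
packed = bit_pack(codes, width=2)  # three distinct codes fit in 2 bits each
```

Here eight strings reduce to a three-entry dictionary plus eight 2-bit codes that fit in a single 16-bit integer, and the same codes can alternatively be stored as four runs.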

blog-post-1-billion-taxi-rides-parquet-sqlite-hdfs#io-bound-vs-compute-bound

Where decompression is I/O or network bound it makes sense to keep the compressed data as compact as possible. That being said, there are cases where decompression is compute bound and compression schemes like Snappy play a useful role in lowering the overhead.
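
The trade-off in this excerpt can be made concrete with a rough back-of-the-envelope model. Every number below (compression ratios, read bandwidth, decompression throughput) is a hypothetical placeholder, not a measurement from the blog post or of any real codec, and the model assumes reading and decompressing do not overlap.

```python
def scan_seconds(raw_gb, ratio, read_gb_per_s, decompress_gb_per_s):
    """Estimate seconds to read and then decompress a dataset.

    raw_gb:              uncompressed size in GB
    ratio:               compression ratio (raw size / compressed size)
    read_gb_per_s:       storage or network bandwidth for compressed bytes
    decompress_gb_per_s: rate at which uncompressed bytes are produced
    """
    compressed_gb = raw_gb / ratio
    read_s = compressed_gb / read_gb_per_s
    decompress_s = raw_gb / decompress_gb_per_s
    return read_s + decompress_s


RAW_GB = 100.0

# Hypothetical codecs: "heavy" compresses smaller but decompresses slowly,
# "light" (Snappy-like in spirit) compresses less but decompresses quickly.
HEAVY = {"ratio": 4.0, "decompress_gb_per_s": 0.5}
LIGHT = {"ratio": 2.0, "decompress_gb_per_s": 2.0}

# Slow storage or network: reading dominates, so the smaller files win.
heavy_slow = scan_seconds(RAW_GB, read_gb_per_s=0.1, **HEAVY)
light_slow = scan_seconds(RAW_GB, read_gb_per_s=0.1, **LIGHT)

# Fast storage: the CPU dominates, so the cheaper decompression wins.
heavy_fast = scan_seconds(RAW_GB, read_gb_per_s=5.0, **HEAVY)
light_fast = scan_seconds(RAW_GB, read_gb_per_s=5.0, **LIGHT)
```

Under these assumed figures the heavier codec is faster end to end on the slow link, while the lighter codec is faster once the read path stops being the bottleneck.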

blog-post-1-billion-taxi-rides-parquet-sqlite-hdfs#multiple-tables-views

Because the 1.1 billion records sit across 56 virtual tables I've had to create views in SQLite federating the data I'm querying. The SELECT queries themselves have been modified to work with the semi-aggregated data.
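
One way to read this excerpt is sketched below with Python's built-in `sqlite3` module: a view aggregates each table separately and glues the partial results together with `UNION ALL`, and the outer query then re-aggregates those partial rows. The table names (`trips_0` to `trips_2`), the view name, the columns, and the rows are all hypothetical, and ordinary tables stand in for the Parquet-backed virtual tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Ordinary tables standing in for the per-file virtual tables (hypothetical names).
table_names = ["trips_0", "trips_1", "trips_2"]
for name in table_names:
    conn.execute(f"CREATE TABLE {name} (cab_type TEXT, passenger_count INTEGER)")

conn.executemany("INSERT INTO trips_0 VALUES (?, ?)", [("yellow", 1), ("green", 2)])
conn.executemany("INSERT INTO trips_1 VALUES (?, ?)", [("yellow", 1), ("yellow", 3)])
conn.executemany("INSERT INTO trips_2 VALUES (?, ?)", [("green", 1)])

# The view federates the tables: each branch produces per-table partial counts.
branches = [
    f"SELECT cab_type, COUNT(*) AS cnt FROM {name} GROUP BY cab_type"
    for name in table_names
]
conn.execute("CREATE VIEW trips_by_cab_type AS " + " UNION ALL ".join(branches))

# The query is adapted to the semi-aggregated rows: it sums the partial counts
# rather than counting rows again.
rows = conn.execute(
    "SELECT cab_type, SUM(cnt) FROM trips_by_cab_type "
    "GROUP BY cab_type ORDER BY cab_type"
).fetchall()

for cab_type, total in rows:
    print(cab_type, total)
```

Summing partial counts works because counts are additive across tables; an average would instead need the per-table sums and counts carried through the view and divided in the outer query.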

Referring Pages

data-architecture-glossary optimal-query-format

People

person-mark-litwintschik