blog-post-inside-capacitor-next-gen-column-storage

https://cloud.google.com/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format

blog-post-inside-capacitor-next-gen-column-storage#techniques-and-encodings
various techniques and encodings, such as Run Length Encoding (RLE), Dictionary encoding, Bit-Vector encoding, Frame of Reference encoding, etc.
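To make the four named encodings concrete, here is a minimal Python sketch of each, operating on a plain list that stands in for one column. The function names and the in-memory representations are assumptions made for illustration; the blog post does not describe how Capacitor implements these encodings.

```python
def rle_encode(values):
    """Run Length Encoding: collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def dictionary_encode(values):
    """Dictionary encoding: store each distinct value once, then small integer codes."""
    dictionary, index, codes = [], {}, []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def bit_vector_encode(values):
    """Bit-Vector encoding: one 0/1 vector per distinct value, marking where it occurs."""
    vectors = {}
    for i, v in enumerate(values):
        vectors.setdefault(v, [0] * len(values))[i] = 1
    return vectors


def frame_of_reference_encode(values):
    """Frame of Reference: store a base plus small offsets (expects a non-empty list)."""
    base = min(values)
    return base, [v - base for v in values]
```

For example, `frame_of_reference_encode([1000, 1003, 1001])` yields the base `1000` and the offsets `[0, 3, 1]`, which need far fewer bits per value than the original integers.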

blog-post-inside-capacitor-next-gen-column-storage#columns-not-all-equal-for-rle
not all columns are born equal. Some might be very long strings, where shorter RLE runs are more beneficial than longer runs on the small integers column
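A rough way to see why a short run of long strings can beat a long run of small integers is to count the bytes a single run saves. The sketch below is a back-of-the-envelope model, not Capacitor's actual cost function: `rle_bytes_saved` is a hypothetical helper, and the 4-byte run counter is an assumed value.

```python
def rle_bytes_saved(value_size, run_length, count_size=4):
    """Bytes saved by storing one run as (value, count) instead of repeating the value.

    value_size  -- size in bytes of one value in the column
    run_length  -- number of consecutive identical values
    count_size  -- assumed size in bytes of the stored run counter
    """
    plain = value_size * run_length      # every repeat stored in full
    encoded = value_size + count_size    # one copy of the value plus the counter
    return plain - encoded


# A run of only 3 repeats of a 200-byte string...
long_string_saving = rle_bytes_saved(value_size=200, run_length=3)
# ...versus a run of 50 repeats of a 4-byte integer.
small_int_saving = rle_bytes_saved(value_size=4, run_length=50)
```

Under these assumptions the three-element string run saves 396 bytes while the fifty-element integer run saves 192, which matches the point of the quote.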

blog-post-inside-capacitor-next-gen-column-storage#best-number-of-shards
We don't want too few shards, because we would like to take advantage of distributed processing capabilities of BigQuery, processing a table in parallel using potentially thousands of machines, each one reading individual shards. But we also don't want too many shards, because every unit of storage and processing has constant overhead
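The trade-off in this excerpt can be illustrated with a toy cost model: parallel work shrinks as the shard count grows, while fixed per-shard overhead adds up. The formula, the function names, and the numbers below are all assumptions chosen for illustration; the blog post does not say how BigQuery actually picks a shard count.

```python
def estimated_cost(num_shards, total_work, per_shard_overhead):
    """Toy model: parallel work per shard plus overhead that grows with shard count."""
    return total_work / num_shards + per_shard_overhead * num_shards


def best_shard_count(total_work, per_shard_overhead, max_shards):
    """Brute-force search for the shard count that minimizes the toy cost."""
    return min(
        range(1, max_shards + 1),
        key=lambda n: estimated_cost(n, total_work, per_shard_overhead),
    )


# With 10,000 units of work and 1 unit of overhead per shard,
# both extremes are worse than the middle.
too_few = estimated_cost(1, 10_000.0, 1.0)
too_many = estimated_cost(1_000, 10_000.0, 1.0)
balanced = best_shard_count(10_000.0, 1.0, max_shards=1_000)
```

In this model the minimum sits near the square root of `total_work / per_shard_overhead`, so the search settles on 100 shards for the numbers above.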

blog-post-inside-capacitor-next-gen-column-storage#background-data-improvement
BigQuery has background processes that constantly look at all the stored data and check if it can be optimized even further. Perhaps initially data was loaded in small chunks, and without seeing all the data, some decisions were not globally optimal. Or perhaps some parameters of the system have changed, and there are new opportunities for storage restructuring. Or perhaps Capacitor models got more trained and tuned, and it is possible to enhance existing data. Whatever the case might be, when the system detects an opportunity to improve storage, it kickstarts data conversion tasks. These tasks do not compete with queries for resources; they run completely in parallel and don't degrade query performance. Once the new, optimized storage is complete, it atomically replaces old storage data without interfering with running queries. Old data will be garbage-collected later

References

paper-storing-and-querying-tree-structured-data-in-dremel

wiki-reed-solomon-error-correction

Referring Pages

data-architecture-glossary