talk-runaway-complexity-in-big-data

https://www.youtube.com/watch?v=ucHjyb6jv08

At 3:20 he talks about human fault tolerance.

At 5:20 he says the worst mistakes you can make are data loss and data corruption.

At 9:15 he notes that having an immutable system means you don't have to index your data (because you never have to find a record in order to update it).
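
A minimal sketch of that idea (all names here are mine, not from the talk): in an append-only store, an "update" is just a newer fact, so writes never need to locate existing records.

```python
class AppendOnlyStore:
    """Immutable, append-only fact store: updates never seek existing records."""

    def __init__(self):
        self.facts = []  # grows forever; nothing is modified in place

    def record(self, entity, attribute, value):
        # No index lookup at write time: an "update" is just a newer fact.
        seq = len(self.facts)  # monotonically increasing sequence number
        self.facts.append((seq, entity, attribute, value))

    def current_value(self, entity, attribute):
        # Reads scan here; a real system would serve reads from a view
        # built for queries, not from the master dataset.
        matches = [f for f in self.facts if f[1] == entity and f[2] == attribute]
        return max(matches)[3] if matches else None

store = AppendOnlyStore()
store.record("user-1", "location", "NYC")
store.record("user-1", "location", "SF")   # supersedes, never overwrites
assert store.current_value("user-1", "location") == "SF"
```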

At 12:00 the way you store, model, and query data is complected. They are fundamentally intertwined.

At 12:15 he talks about systems where those parts are disassociated and could be scaled independently.

At 17:25 he uses Apache Thrift to define his schemas.
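
The talk defines these in Thrift IDL; as a rough stand-in, here is a Python dataclass sketch of the same fact-based shape (the field names are illustrative, not from the talk):

```python
from dataclasses import dataclass
from typing import Union

# Illustrative only: the talk expresses these as Thrift structs/unions;
# these frozen dataclasses just mirror the fact-based shape.

@dataclass(frozen=True)  # frozen: facts are immutable once recorded
class PersonID:
    user_id: int

@dataclass(frozen=True)
class PersonProperty:
    id: PersonID
    property_name: str   # e.g. "location"
    value: str
    timestamp: int       # when the fact became true

@dataclass(frozen=True)
class FriendEdge:
    id1: PersonID
    id2: PersonID
    timestamp: int

Fact = Union[PersonProperty, FriendEdge]
```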

At 27:25 he talks about ElephantDB, which he wrote, and Voldemort, which LinkedIn wrote.

At 28:30 you have the ideas of transformation and normalization; however, they are disassociated.

At 32:40 he talks about the architecture as facilitating complexity isolation, because most of the complexity goes into the real-time, incremental processing. Eventually the batch layer overrides that, and the batch layer is much simpler.

At 35:20 he talks about eventual accuracy. Basically, it means the batch layer will eventually override the realtime system.
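
A sketch of how a read merges the two layers (the interface is my assumption, not from the talk): the realtime view only covers data since the last batch run, so its errors are transient.

```python
def query_pageviews(url, batch_view, realtime_view):
    """Lambda-architecture read: batch result + realtime delta.

    batch_view: accurate but hours stale (recomputed from all data).
    realtime_view: approximate, covers only data since the last batch run;
    any errors here are transient because the next batch run replaces it.
    """
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

batch_view = {"/home": 10_000}   # from the last full recompute
realtime_view = {"/home": 42}    # incremental counts since then
assert query_pageviews("/home", batch_view, realtime_view) == 10_042
```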

At 39:20 he talks about how data storage is separated from querying, and how that helps with normalization and transformation. You get a fully normalized schema in storage, but you also get views that are optimized (potentially very de-normalized) for your queries (the batch views).

At 39:45 it gives you flexibility in your batch views. If you realize you need a very different set of batch views, you can change your recompute algorithm or target and create totally new batch views. These batch views could even live on a completely different new system.

At 40:05 your needs change over time. As long as you are able to use a function on all of your data to recompute batch views, you will be able to satisfy your future needs as well.
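
A sketch of the "batch view = function(all data)" idea from 39:20 to 40:05 (the data shapes and function names are mine): the master dataset stays normalized, and each recompute function emits whatever denormalized view queries need, so adding a new view is just adding a new function.

```python
from collections import Counter

# Master dataset: immutable, fully normalized pageview facts.
master = [
    {"user": "alice", "url": "/home",  "ts": 100},
    {"user": "bob",   "url": "/home",  "ts": 105},
    {"user": "alice", "url": "/about", "ts": 110},
]

def pageviews_per_url(facts):
    """One batch view: denormalized counts, optimized for URL lookups."""
    return Counter(f["url"] for f in facts)

def urls_per_user(facts):
    """A later, totally different view: same master data, new function."""
    view = {}
    for f in facts:
        view.setdefault(f["user"], set()).add(f["url"])
    return view

assert pageviews_per_url(master)["/home"] == 2
assert urls_per_user(master)["alice"] == {"/home", "/about"}
```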

At 44:38 this is the same idea as event sourcing.
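
Event sourcing in miniature (a hypothetical account log, not from the talk): the event log is the source of truth, and current state is derived by replaying it.

```python
# Event sourcing: the log is the source of truth, and current
# state is a fold over it rather than a value mutated in place.
events = [
    {"type": "deposited", "amount": 50},
    {"type": "withdrawn", "amount": 20},
    {"type": "deposited", "amount": 5},
]

def replay(events):
    balance = 0
    for e in events:
        if e["type"] == "deposited":
            balance += e["amount"]
        elif e["type"] == "withdrawn":
            balance -= e["amount"]
    return balance

assert replay(events) == 35  # state derived, never stored directly
```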

data-architecture-glossary#glossary

One huge thing I've learned: continuations are THE key building block for asynchronous, parallel, and reactive code.
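
A tiny continuation-passing sketch in plain Python (no framework; the functions are hypothetical): each step receives "the rest of the computation" as a callback instead of returning, which is what lets async/reactive runtimes schedule the continuation later.

```python
# Continuation-passing style: each function takes "what to do next"
# instead of returning, so a runtime is free to invoke the continuation
# later (on another thread, after I/O, on an event, ...).

def fetch_user(user_id, k):
    k({"id": user_id, "name": "alice"})   # k is the continuation

def fetch_orders(user, k):
    k([{"user": user["id"], "total": 30}, {"user": user["id"], "total": 12}])

def sum_totals(orders, k):
    k(sum(o["total"] for o in orders))

# Chaining by nesting continuations (callbacks):
fetch_user(1, lambda user:
    fetch_orders(user, lambda orders:
        sum_totals(orders, print)))        # prints 42
```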

In short, CRDTs are objects that can be updated without expensive synchronization/consensus. They are guaranteed to converge eventually if all concurrent updates are commutative and if every update is eventually executed by each replica.
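
A grow-only counter (G-Counter), one of the simplest CRDTs, sketched below: each replica increments only its own slot, and merge is an elementwise max, which is commutative, associative, and idempotent, so replicas converge regardless of merge order.

```python
class GCounter:
    """Grow-only counter CRDT: per-replica slots, merge by elementwise max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}          # replica_id -> count

    def increment(self, n=1):
        # Local update: no coordination with other replicas needed.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        # Commutative, associative, idempotent: merge order doesn't matter.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    @property
    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b); b.merge(a)            # any merge order converges
assert a.value == b.value == 5
```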


So we're somehow going to take the backend of MySQL (InnoDB) and introduce a variant that sits on top of a distributed storage subsystem. Once we've done that, network I/O becomes the bottleneck, so we also need to rethink how chatty network communications are.
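
A rough illustration of one way to cut that chattiness (my sketch of the general log-shipping idea, not the actual InnoDB or storage-tier protocol): ship compact redo-log records to the storage tier and let it materialize pages, instead of writing out whole dirty pages.

```python
# Illustration only (not the real protocol): shipping compact redo-log
# records instead of full 16 KB pages slashes per-write network traffic.

PAGE_SIZE = 16 * 1024

class DistributedStorage:
    """Storage tier that materializes pages by applying log records."""
    def __init__(self):
        self.pages = {}           # page_id -> dict of offset -> bytes

    def apply(self, record):
        page = self.pages.setdefault(record["page_id"], {})
        page[record["offset"]] = record["payload"]

storage = DistributedStorage()

def write_row(page_id, offset, payload):
    # Send only the delta (tens of bytes), not the whole dirty page.
    record = {"page_id": page_id, "offset": offset, "payload": payload}
    storage.apply(record)         # in reality: sent over the network
    return len(payload)           # bytes shipped vs PAGE_SIZE per write

bytes_shipped = write_row(page_id=7, offset=128, payload=b"alice")
assert bytes_shipped < PAGE_SIZE  # 5 bytes instead of 16384
```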


Referring Pages

data-architecture-glossary new-data-architecture

People

person-nathan-marz