https://www.youtube.com/watch?v=4Spo2QRTz1k
Related to blog-post-functional-data-engineering
At 1:50 Superset is really taking off at the moment
At 8:10 an idempotent function will get you to a desired state.
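A minimal sketch of what "idempotent gets you to a desired state" could mean; the table name and schema are made up:

```python
# The function describes a desired end state, so running it once or many
# times leaves the system identical.
import sqlite3

def ensure_supplier_table(conn: sqlite3.Connection) -> None:
    """Idempotent: safe to run any number of times."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS supplier (id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
ensure_supplier_table(conn)
ensure_supplier_table(conn)  # second call is a no-op; state is unchanged
```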
At 11:25 data lineage is not a graph of tables, but rather it's a graph of partitions.
At 11:40 any row can figure out its partition, so you can infer its lineage and get provenance, traceability, and all that good stuff.
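A sketch of partition-level lineage; the table names and the `ds` partition key (an Airflow convention) are illustrative:

```python
# Each row carries its partition key, so any row maps back to exactly one
# partition, and lineage is a graph whose nodes are partitions, not tables.
from dataclasses import dataclass

@dataclass(frozen=True)
class Partition:
    table: str
    ds: str  # partition key, e.g. the schedule date

# lineage edges between partitions, not tables
lineage: dict[Partition, list[Partition]] = {
    Partition("fct_orders", "2018-01-01"): [
        Partition("raw_orders", "2018-01-01"),
        Partition("dim_supplier", "2018-01-01"),
    ],
}

row = {"order_id": 42, "supplier_id": 7, "ds": "2018-01-01"}
row_partition = Partition("fct_orders", row["ds"])
print(lineage[row_partition])  # provenance of every row in that partition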
At 12:05 pure ETL tasks are like pure functions: idempotent, etc.
At 12:40 pure ETL tasks are deterministic. Given the same source partition, they will create the same target partition.
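A minimal sketch of a pure ETL task, assuming pandas and a local Parquet layout; the paths and column names are made up:

```python
# Reads exactly one source partition, applies a deterministic transform,
# and overwrites exactly one target partition. Re-running with the same
# input always reproduces the same output (idempotent).
from pathlib import Path
import pandas as pd

def load_orders_partition(ds: str) -> None:
    # read exactly one source partition
    src = pd.read_parquet(f"staging/raw_orders/ds={ds}")
    # deterministic transform: same input partition -> same output rows
    out = src.assign(total=src["qty"] * src["unit_price"])
    # full overwrite of the single target partition, never an append
    target = Path(f"warehouse/fct_orders/ds={ds}")
    target.mkdir(parents=True, exist_ok=True)
    out.to_parquet(target / "part-0.parquet", index=False)
```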
At 14:20 a persistent staging area. A staging area is where data is brought into the warehouse untransformed. There used to be a debate about whether you kept it there temporarily or indefinitely; he says that since storage is so cheap now, you should keep it forever.
At 14:50 put it into a read-optimized file format like Parquet or ORC and leave it there forever.
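A sketch of the persistent staging idea, assuming pandas/pyarrow; the source name and paths are hypothetical:

```python
# Raw extracts land untransformed in one Parquet partition per load date
# and are never deleted.
from pathlib import Path
import pandas as pd

def stage_raw_extract(rows: list[dict], source: str, ds: str) -> None:
    df = pd.DataFrame(rows)  # no transformation: staged exactly as received
    target = Path(f"staging/{source}/ds={ds}")
    target.mkdir(parents=True, exist_ok=True)
    df.to_parquet(target / "part-0.parquet", index=False)

stage_raw_extract(
    [{"supplier_id": 1, "name": "Acme"}], source="erp_suppliers", ds="2018-01-01"
)
```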
At 18:50 he talks about SCD (slowly changing dimensions) type 2, where you add a surrogate key alongside the natural key and maintain an effective date.
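A minimal SCD type 2 sketch with made-up data: each version of a record gets its own surrogate key, the natural key repeats across versions, and effective/end dates bracket the period each version was current:

```python
rows = [
    # surrogate_key, natural_key, city,     effective_date, end_date
    (1,              "ACME",      "Boston", "2018-01-01",   "2018-03-01"),
    (2,              "ACME",      "Denver", "2018-03-01",   "9999-12-31"),
]
# "As of" lookups resolve the natural key through the date range:
as_of = "2018-02-01"
current = [r for r in rows if r[3] <= as_of < r[4]]
print(current)  # -> the Boston version, surrogate key 1
```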
At 20:24 slowly changing dimensions: instead of using the type 1, type 2, or type 3 approaches, simply snapshot all the data each day, or at whatever granularity. People think he is crazy for doing this because the dimensions change slowly and yet he is snapshotting the entirety of the data every day. His response is that storage is cheap and compute is cheap.
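A sketch of the full-snapshot alternative, assuming pandas; the table name and paths are hypothetical:

```python
# Every run copies the entire dimension into that day's partition, trading
# cheap storage and compute for much simpler logic than SCD types 1/2/3.
from pathlib import Path
import pandas as pd

def snapshot_dim_supplier(dim: pd.DataFrame, ds: str) -> None:
    target = Path(f"warehouse/dim_supplier/ds={ds}")
    target.mkdir(parents=True, exist_ok=True)
    dim.to_parquet(target / "part-0.parquet", index=False)  # entire table, every day
```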
At 23:20 it's pretty common to have macros in SQL tools that will select the latest partition or limit by the latest partition. Or, you might maintain a view that is something like dim_supplier_current.
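A minimal sketch of such a view, with the SQL inlined as a Python string; dim_supplier and dim_supplier_current are illustrative names, and dialects vary on CREATE OR REPLACE VIEW:

```python
# Convenience view over the snapshot table: consumers query
# dim_supplier_current and never need to know the partition scheme.
CREATE_CURRENT_VIEW = """
CREATE OR REPLACE VIEW dim_supplier_current AS
SELECT *
FROM dim_supplier
WHERE ds = (SELECT MAX(ds) FROM dim_supplier)
"""
```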
At 23:50 a nice side effect of having snapshots is that you can do time series analysis; for example, if you wanted to see how many suppliers were in the dimension table on every day in the past, you can do that.
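Continuing the same hypothetical dim_supplier layout, that supplier-count query is just an aggregation over the partition key:

```python
# One row per retained snapshot: supplier count per day.
SUPPLIERS_PER_DAY = """
SELECT ds, COUNT(*) AS supplier_count
FROM dim_supplier
GROUP BY ds
ORDER BY ds
"""
```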
At 27:10, if you have partitioned by event processing time (because that allows you to close the loop on processing more quickly) but you still want to be able to query by event time and take advantage of partition pruning, then you may be able to sub-partition by event time within the partitions that are partitioned by processing time.
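A sketch of that two-level partitioning, assuming pandas and a Hive-style directory layout; names are illustrative and event_ts is assumed to be an ISO-8601 string:

```python
# Outer key is processing time (a run can close as soon as its batch lands);
# inner key is event time (event-time queries can still prune partitions).
from pathlib import Path
import pandas as pd

def write_events(events: pd.DataFrame, ds: str) -> None:
    # one outer partition per processing day; inner partitions per event day
    for event_ds, chunk in events.groupby(events["event_ts"].str[:10]):
        target = Path(f"warehouse/fct_events/ds={ds}/event_ds={event_ds}")
        target.mkdir(parents=True, exist_ok=True)
        chunk.to_parquet(target / "part-0.parquet", index=False)
```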