At 0:30 the Dataflow model is a unified model, so it can express both batch and streaming operations.
At 2:25 data shapes.
At 2:50 she refers to daily logs, where each day you have another log of all of the data, as a "repetitive data structure".
At 2:55 "that repetitive data structure is just a cheap way to represent a continuous data source quote
At 3:50 "so we have to be able to handle this unordered, infinite data, and there are a few tensions that come into play".
At 4:25 you want the results as soon as possible. In fact you may even want speculative results.
At 4:30-ish she talks about three knobs, completeness, latency, and cost, and how for different use cases you want to be able to tweak those knobs in different ways. If you tweak for very low latency, it starts to look like a streaming system, but if you don't care about latency and you care more about completeness, it starts to look like a batch system.
At 7:10 there are three papers: MapReduce, FlumeJava, and MillWheel.
At 8:35 in FlumeJava you use high-level, intuitive primitives describing the shape of your computation.
At 10:30 because of log rolling, session information was distributed across two separate files, creating an "artificial break".
At 11:40 to do aggregations on continuous data you have to window your data up into finite-sized chunks.
At 12:25 you want to put the events into event-time-based windows rather than processing-time-based windows.
There are arbitrary delays between event time and processing time, called skew.
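Not from the talk, just a sketch to make this concrete for myself, written with the Apache Beam Java SDK (the open-source successor of the Dataflow SDK she's demoing). It assumes the event time can be parsed out of the record itself (the CSV layout is my invention): each element gets stamped with its event time, then windowed by that time instead of by arrival time.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.WithTimestamps;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class EventTimeWindows {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Toy log lines whose first CSV field is the event time (my assumption
    // about the record layout, not something shown in the talk).
    PCollection<String> lines = p.apply(Create.of(
        "2015-01-01T00:00:05Z,user1,click",
        "2015-01-01T00:00:59Z,user2,click"));

    lines
        // Stamp each element with its event time rather than its arrival time.
        .apply(WithTimestamps.of((String line) -> Instant.parse(line.split(",")[0])))
        // One-minute fixed windows, defined in event time.
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));

    p.run().waitUntilFinish();
  }
}
```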
At 14:05 the watermark is an indicator of progress. It basically says that as of this time, I don't expect any older events to show up.
At 14:45 the heuristics that drive the watermark have to deal with trade-offs. If you move the watermark forward too fast, you might miss some slightly stale events. If you move it forward too slowly, your processing will be delayed as you wait around for late events that never actually materialize.
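Again my own sketch (Beam Java, not slides from the talk) of how that trade-off surfaces in the API: the watermark drives the on-time firing, and allowed lateness decides how long to keep waiting for stragglers behind it. This is a fragment meant to be applied to an upstream PCollection; the durations are arbitrary.

```java
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Fire when the watermark passes the end of the window, but keep the window
// state around for two extra minutes so events that straggle in behind the
// watermark are still accounted for instead of being dropped.
Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterWatermark.pastEndOfWindow())
    .withAllowedLateness(Duration.standardMinutes(2))
    .accumulatingFiredPanes();
```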
At 17:30 PCollections: parallel collections.
At 17:30 you build up the entire graph, then that graph is executed as a single unit on the system. That allows the system to do optimizations like function composition, etc.
At 17:50 the PCollections are homogeneously typed sets of data.
At 18:20 in a PCollection every element has an implicit timestamp. For elements where a semantically relevant timestamp is not available, it just uses the creation time of the element, for example if it's just part of the calculation of the number of words in all of Shakespeare's plays.
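A tiny sketch of that deferred graph building, in Beam Java rather than the exact SDK in the talk: each apply() only adds a node to the graph, and nothing runs until the whole thing is handed to the runner.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class GraphBuilding {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // A PCollection<String>: a homogeneously typed parallel collection.
    // These elements have no semantically relevant event time, so they just
    // get a default timestamp, like the Shakespeare word-count case above.
    PCollection<String> lines =
        p.apply(Create.of("To be or not to be", "that is the question"));

    // apply() only adds nodes to the pipeline graph; nothing has executed yet.
    PCollection<Integer> lengths = lines.apply(
        MapElements.into(TypeDescriptors.integers()).via((String s) -> s.length()));

    // Only here is the whole graph handed to the runner as a single unit,
    // which is what lets it do optimizations like fusing functions together.
    p.run().waitUntilFinish();
  }
}
```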
Left off at 23:51
At 23:50 she talks about how you need to process in windows if you have continuous data.
At 23:55 fixed windows, sliding windows, and sessions.
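The three windowing strategies written out in Beam Java (my sketch, not code from the talk); each of these is a fragment you'd apply to a PCollection, and the durations are just examples.

```java
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Fixed (tumbling) one-hour windows.
Window.<String>into(FixedWindows.of(Duration.standardHours(1)));

// One-hour windows that slide every ten minutes.
Window.<String>into(SlidingWindows.of(Duration.standardHours(1))
    .every(Duration.standardMinutes(10)));

// Per-key sessions that close after a 30-minute gap in activity.
Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(30)));
```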
At 25:20 triggers are what decide when the results are emitted. One way to trigger is based on the watermark.
At 26:30 customize the trigger to fire at the watermark, but also every minute for speculative results. Nice visualization here.
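My guess at what that trigger looks like in Beam Java (the window size and exact durations are assumptions; the idea is from the talk): the watermark firing gives the complete result and the early firings give the speculative ones.

```java
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Emit the on-time result when the watermark passes the end of the window,
// but also fire a speculative pane once a minute while waiting.
Window.<String>into(FixedWindows.of(Duration.standardHours(1)))
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1))))
    .withAllowedLateness(Duration.ZERO)
    .accumulatingFiredPanes();
```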
At 2825 "part of the same result that is good any refined overtime.
At 28:40 accumulate and retract. One way of doing refinement is to accumulate the records, and when there is a refinement, send the accumulation and a retraction of the previous value.
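For reference (my sketch, Beam Java): the accumulation side of this is exposed as accumulatingFiredPanes() vs discardingFiredPanes(); full retractions as described in the talk aren't part of the public Beam API as far as I know, so this only contrasts the two available modes.

```java
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Accumulating: each new pane for a window contains everything seen so far,
// so a later pane is a refinement that supersedes the earlier one.
Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterWatermark.pastEndOfWindow())
    .withAllowedLateness(Duration.standardMinutes(10))
    .accumulatingFiredPanes();

// Discarding: each pane only contains the elements that arrived since the
// last firing; the downstream consumer has to combine panes itself.
Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterWatermark.pastEndOfWindow())
    .withAllowedLateness(Duration.standardMinutes(10))
    .discardingFiredPanes();
```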
At 43:05 when answering a question about Spark Streaming, she said Spark Streaming's windowing semantics are not quite as powerful as ours, so there were some things we couldn't do on Spark Streaming.
At 44:00 she talks about sources and sinks.
At 46:00 when she did the batch operation to gather all the words from all of Shakespeare's plays earlier, she used a thing they call the "global window".
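Putting the batch case together as a sketch in Beam Java (the paths are made up): TextIO is the source and sink, and because the input is bounded everything lands in the single default global window, which is why no explicit windowing shows up in the word count.

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class GlobalWindowWordCount {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    p.apply(TextIO.read().from("gs://some-bucket/shakespeare/*"))   // source (made-up path)
        // Bounded input: all elements sit in the single global window by default.
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.toLowerCase().split("[^a-z']+"))))
        .apply(Filter.by((String word) -> !word.isEmpty()))
        .apply(Count.perElement())
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
        .apply(TextIO.write().to("gs://some-bucket/output/wordcounts"));  // sink (made-up path)

    p.run().waitUntilFinish();
  }
}
```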
At 47:30 in batch mode, if a node fails, Google Cloud Dataflow uses the standard MapReduce fault-tolerance mechanism of retrying the entire bundle.
At 48:20 we do exactly-once processing, modulo any side effects that are introduced by retrying a bundle.