At 2:50 pandas uses Apache Arrow, and Spark 2.3 uses Apache Arrow.
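A quick sketch (my example, not from the talk) of the pandas/Arrow bridge using pyarrow; Spark 2.3 uses this same Arrow representation to exchange data with pandas UDFs:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"title": ["Dune", "Emma"], "sold": [120, 45]})

# pandas DataFrame -> Arrow table: a columnar, language-neutral layout.
table = pa.Table.from_pandas(df)

# ... and back. The Arrow representation is what gets handed between
# systems without per-row serialization.
df_again = table.to_pandas()
```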
At 3:40 Apache Calcite is a SQL engine without storage that can be used as a library. For example, if you only need portions of the overall SQL engine, like the ability to parse SQL queries, it can do that. It is used by a large number of big data platforms to provide their SQL capabilities.
At 4:10 lattices are a nice, structured way of dealing with materialized views.
At 5:30 caching is reducing the distance to data (DTD). Distance is both the time and the resources required.
At 6:10 when talking about caching, another key piece is the consumability of the data. Is it formatted in such a way that it is readily consumable?
Apache Arrow, Calcite, and Parquet.
At 6:15 related to caching: "is the data designed for efficient consumption? Is it very expensive to process the data?", etc.
At 6:30 how close is the data to the questions I want to be able to answer? He calls this relevance. If he wants to query how many books were sold in New York and the only information he has is all the books sold in the world, then he has to do lots of work.
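A toy version of that relevance gap (my example): a per-region rollup derived once answers the New York question with a lookup instead of a full scan:

```python
import pandas as pd

raw = pd.DataFrame({"region": ["NY", "NY", "CA", "DE"],
                    "sold":   [120, 45, 300, 80]})

# Low relevance: filter and sum all the books sold in the world, every time.
ny_from_raw = raw.loc[raw["region"] == "NY", "sold"].sum()

# High relevance: a cached per-region rollup; the answer is a single lookup.
by_region = raw.groupby("region")["sold"].sum()   # derived once, reused often
ny_from_cache = by_region["NY"]

assert ny_from_raw == ny_from_cache == 165
```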
At 7:20 six kinds of caching: 1: in-memory file pinning, 2: columnar disk caching, 3: in-memory block caching, 4: near-CPU data caching, 5: cube relational caching, 6: arbitrary relational caching.
At 8:20 if your performance is bound by your I/O throughput, then in-memory file pinning can be very helpful.
At 8:40 he talks about how if you are persisting to disk, you might use compression, because if you are bottlenecked on the I/O of writing to disk, you won't be bottlenecked on the compression. That is not true if you are writing to memory.
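A sketch of that trade-off with pyarrow/Parquet (codec choice illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"n": list(range(1_000_000))})

# Persisting to disk: CPU spent compressing is usually cheaper than the
# disk I/O it saves, so a codec like zstd pays for itself.
pq.write_table(table, "cache_disk.parquet", compression="zstd")

# For a memory-speed cache there is no disk I/O to hide behind, so the
# same compression work can become the new bottleneck.
pq.write_table(table, "cache_mem.parquet", compression="none")
```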
At 9:10 you are moving the bottleneck, as you do with any kind of performance analysis.
At 9:40 you share the throughput benefits, but not the memory management benefits, when everyone uses a different in-memory representation to do their actual work on.
At 10:15 if you can figure out how to improve both the medium and the consumability, that is very powerful.
At 11:30 columnar stores require a lot more buffering before inserting. With Parquet he says between 256 and 512 MB makes sense. This is because you are doing a lot of re-ordering of the data into columns.
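A rough sketch of that buffering with pyarrow's ParquetWriter (the flush threshold here is a row-count stand-in for his 256–512 MB figure):

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("user", pa.string()), ("clicks", pa.int64())])

with pq.ParquetWriter("events.parquet", schema) as writer:
    buffered = []
    for i in range(1_000_000):            # stand-in for an incoming row stream
        buffered.append({"user": f"u{i}", "clicks": i % 7})
        if len(buffered) >= 250_000:      # flush only once the buffer is large
            writer.write_table(pa.Table.from_pylist(buffered, schema=schema))
            buffered = []
    if buffered:                          # final partial row group
        writer.write_table(pa.Table.from_pylist(buffered, schema=schema))
```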
At 12 while you are buffering you have to think about things like durability. Do you still need to maintain a log before you put stuff into the columnar format?
At around 12:20 to get around this, some people will immediately store the data in rows, then ETL it into a columnar format using something like Spark.
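A minimal sketch of that pattern, assuming rows land in a line-delimited JSON log first (file names invented; at scale the second step would be the Spark job):

```python
import pyarrow.json as pj
import pyarrow.parquet as pq

# Write path: appending a row to a log is cheap and durable.
with open("rows.jsonl", "a") as f:
    f.write('{"user": "u1", "clicks": 3}\n')

# Batch ETL path: later, read the row log and rewrite it columnar.
table = pj.read_json("rows.jsonl")
pq.write_table(table, "rows.parquet")
```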
At 13 in-memory block caching. This is like the Linux page cache.
At 14 near-CPU caching: can you hold the data in a representation that is very similar to what you will actually use when running your calculations?
At 15 the downside of a near-CPU cache is that the data is far larger than when stored in a compressed format. The Arrow format is maybe 10 times larger than the same data in Parquet.
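One way to see the size difference yourself (my example; the ~10x figure is his, and the real ratio depends heavily on the data):

```python
import os
import pyarrow as pa
import pyarrow.feather as feather
import pyarrow.parquet as pq

table = pa.table({"city": ["NYC", "SF"] * 500_000,
                  "sold": list(range(1_000_000))})

# Arrow IPC file: the uncompressed, near-CPU representation.
feather.write_feather(table, "t.arrow", compression="uncompressed")
# Parquet: compressed and encoded for storage.
pq.write_table(table, "t.parquet", compression="zstd")

print(os.path.getsize("t.arrow"), os.path.getsize("t.parquet"))
```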
At 16 cube-based relational caching has been around for a long time; basically, think MOLAP.
At 16:20 pre-aggregate some pieces of information into cuboids, and when the user comes in looking for information, answer from those pre-aggregations when possible.
At 16:50 interaction time for cube-based caching is highly variable. If he gets a cube hit, then it's basically instantaneous, but if not, then it can take a long time.
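A hypothetical sketch of that hit/miss behavior (names and structure invented):

```python
import pandas as pd

raw = pd.DataFrame({"region": ["NY", "NY", "CA"],
                    "year":   [2017, 2018, 2018],
                    "sold":   [120, 45, 300]})

# Cuboids pre-aggregated ahead of time for the expected query shapes.
cuboids = {("region",): raw.groupby("region")["sold"].sum()}

def total_sold(dims):
    key = tuple(sorted(dims))
    if key in cuboids:                              # cube hit: near-instant
        return cuboids[key]
    return raw.groupby(list(dims))["sold"].sum()    # miss: full scan of raw

total_sold(["region"])          # served from the cuboid
total_sold(["region", "year"])  # not pre-aggregated, computed from raw
```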
At 17:10 it's kind of hard to get a cube-based cache to satisfy arbitrary query patterns.
At 20:30 he starts talking about an X/Y plot of the trade-offs associated with the various techniques. Basically, from the raw data you can answer any question, but it will take you a while; from the derived/cached data you can answer your question quickly, but maybe only very narrowly, and if you have to ask a different question you have to take a long time or set up a new cache, etc.
At 22:10 he shows a slide of the decisions that they made for Dremio, his company, and how they adopted five of the six caching techniques, all except file pinning.
At 23 he talks about relational algebra.
At 24 a scan followed by a projection: I only want three of the columns of data.
At 24:30 one thing you can do to limit the data processing is to put the filter before the projection. That is a straightforward thing to do in relational algebra.
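The same idea shows up at the storage layer; for example, pyarrow can push both the filter and the projection into the Parquet scan itself (file name invented):

```python
import pyarrow.parquet as pq

table = pq.read_table(
    "sales.parquet",
    columns=["title", "region", "sold"],   # projection: only 3 columns
    filters=[("region", "=", "NY")],       # filter applied during the scan
)
```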
At 25:30 the idea of relational caching is that there might be some shared state that speeds up the queries that you run. You don't have to derive the entirety of the calculation every time; there's some shared state that benefits many of your queries.
At 27:30 people use the copy-and-pick strategy. Copy is where you do some sort of rollup, and then analysts are expected to know which rollup to use to fit their queries. Once they pick a rollup, they are unlikely to be able to change the innards of their query after the fact.
At 28:10 relational caching moves the ownership of the caches to the system, rather than the data engineer who sets up the various rollups.
At 29:10 they use Parquet to persist the relational caches.
At 30:40 he mentions a change-detection database as part of the overall architecture for the system.
At 32:15 relational caching is basically materialized views outside the database.
At 33 there are reflections. A raw reflection is basically a structural normalization. It may remove some of the columns, sort the data, or potentially partition it differently, but the data remains the same except for these simple transformations.
At 33:15 aggregate reflections are where you define dimensions and measures and roll the data up.
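A sketch of the rollup an aggregate reflection maintains (my example using pyarrow's group_by, not Dremio's API):

```python
import pyarrow as pa
import pyarrow.parquet as pq

sales = pa.table({"region": ["NY", "NY", "CA"],
                  "year":   [2017, 2018, 2018],
                  "sold":   [120, 45, 300]})

# Dimensions = (region, year); measure = sum(sold).
reflection = sales.group_by(["region", "year"]).aggregate([("sold", "sum")])

# Reflections are persisted, here as Parquet.
pq.write_table(reflection, "agg_reflection.parquet")
```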
At 33:40 the reflections are powerful because they can be based on a virtual dataset. The reflections are persisted.
At 34 you can think of raw reflections as being similar to a covering index on real or virtual data, whereas aggregate reflections act more like cube relational caching.
At 37:50 refresh management. They wanted to be able to specify what the ideal update timeframe is, for example 10 minutes, but also what the worst allowable time is, for example three hours, before the system falls back to the sources and bypasses the cache.
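A hypothetical sketch of that policy (all names and thresholds invented):

```python
from datetime import datetime, timedelta

IDEAL_REFRESH = timedelta(minutes=10)   # target update cadence
MAX_STALENESS = timedelta(hours=3)      # worst allowable age

def serve(built_at: datetime, now: datetime, cache, source, schedule_refresh):
    age = now - built_at
    if age > MAX_STALENESS:
        return source()          # too stale: bypass the cache, hit the sources
    if age > IDEAL_REFRESH:
        schedule_refresh()       # stale but acceptable: refresh in background
    return cache()               # serve from the cache
```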
At 38:50 the refresh graph, where raw reflections are done earliest, then the downstream stuff is done afterwards against a more performant source of data.
At 40 multiple update modes depending on the mutation pattern. The one they found most effective was partition-based updates.
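A sketch of a partition update with pyarrow datasets (paths and columns invented; existing_data_behavior requires a recent pyarrow): only the partition whose source data changed gets rewritten:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Fresh data for the one partition that mutated.
changed = pa.table({"day":  ["2018-03-02", "2018-03-02"],
                    "sold": [17, 4]})

# Rewrites only the day=2018-03-02 directory; other partitions are untouched.
pq.write_to_dataset(changed, root_path="cache/", partition_cols=["day"],
                    existing_data_behavior="delete_matching")
```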