talk-using-apache-arrow-for-relational-cache

https://youtu.be/KMl9Py8o3pk

At 2:50 pandas uses Apache Arrow, and Spark 2.3 uses Apache Arrow
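
A minimal pyarrow sketch of that pandas-to-Arrow interchange (my own illustration, not from the talk; the column names are made up):

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"title": ["Dune", "Emma"], "sold": [120, 80]})
table = pa.Table.from_pandas(df)  # pandas -> Arrow columnar format
roundtrip = table.to_pandas()     # Arrow -> pandas, often without copying numeric data
```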

At 3:40 Apache Calcite is a SQL engine without storage that can be used as a library. For example, if you only need portions of the overall Calcite engine, like the ability to parse SQL queries, it can do that. It is used by a large number of big data platforms to provide their SQL capabilities.

At 4:10 lattices are a nice, structured way of dealing with materialized views.

At 5:30 caching is reducing the distance to data (DTD). Distance is both the time and the resources required.

At 6:10 when talking about caching, another key piece is the consumability of the data. Is it formatted in such a way that it is readily consumable?

Apache Arrow, Calcite, and Parquet

At 6:15 related to caching: "Is the data designed for efficient consumption? Is it very expensive to process the data?", etc.

At 6:30 how close is the data to the questions I want to be able to answer? He calls this relevance. If he wants to query how many books were sold in New York and the only information he has is all the books sold in the world, then he has to do lots of work.

At 7:20 six kinds of caching: 1: in-memory file pinning, 2: columnar disk caching, 3: in-memory block caching, 4: near-CPU data caching, 5: cube relational caching, 6: arbitrary relational caching

At 8:20 if your performance is bound by your I/O throughput, then in-memory file pinning can be very helpful
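
A hedged sketch of the pinning idea using pyarrow's IPC file format plus a memory map, so repeated scans are served from memory rather than disk I/O (file name and data are illustrative, not from the talk):

```python
import pyarrow as pa

table = pa.table({"x": [1, 2, 3]})

# Write an Arrow IPC file once...
with pa.OSFile("pinned.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# ...then memory-map it; reads are zero-copy against the mapping.
with pa.memory_map("pinned.arrow", "r") as source:
    pinned = pa.ipc.open_file(source).read_all()
```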

At 8:40 he talks about how, if you are persisting to disk, you might use compression: if you are bottlenecked on the I/O of writing to disk, you won't be bottlenecked on the compression. That is not true if you are writing to memory.
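
For example, pyarrow's Parquet writer lets you choose per destination (my illustration; zstd is just one codec option):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(1_000_000))})

# Compression costs CPU but pays off when disk I/O is the bottleneck...
pq.write_table(table, "on_disk.parquet", compression="zstd")

# ...whereas for a memory-resident cache you might skip it, since the
# compression step itself can become the new bottleneck.
pq.write_table(table, "uncompressed.parquet", compression="none")
```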

At 9:10 you are moving the bottleneck, as you do with any kind of performance analysis.

At 9:40 you share throughput benefits, but not memory-management benefits, when everyone uses a different in-memory representation to do their actual work.

At 10:15 if you can figure out how to improve both the medium and the consumability, that is very powerful.

At 11:30 columnar stores require a lot more buffering before inserting. With Parquet he says between 256 and 512 MB makes sense. This is because you are doing a lot of re-ordering of the data into columns.
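
Illustration (mine, not the talk's): pyarrow's writer exposes this row-group granularity, though its parameter counts rows rather than bytes:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(1_000_000))})

# Large row groups amortize the cost of re-ordering rows into columns;
# row_group_size here is in rows (the value is illustrative), while the
# talk quotes a 256-512 MB target in bytes.
pq.write_table(table, "buffered.parquet", row_group_size=500_000)
```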

At 12:00 while you are buffering, you have to think about things like durability. Do you still need to maintain a log before you put stuff into the columnar format?

At around 12:20, to get around this, some people will immediately store the data in rows, then ETL it into a columnar format using something like Spark.
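
A toy version of that row-first landing pattern (the talk mentions Spark for the ETL step; pyarrow here is just a stand-in, and the rows are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Rows arrive one at a time, e.g. appended to a row-oriented log...
rows = [{"title": "Dune", "city": "NYC", "sold": 3},
        {"title": "Emma", "city": "SF",  "sold": 1}]

# ...and a periodic ETL job batches them into the columnar format.
table = pa.Table.from_pylist(rows)
pq.write_table(table, "landed.parquet")
```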

At 13:00 in-memory block caching. This is like the Linux page cache.

At 14:00 near-CPU cache: can you hold the data in a representation that is very similar to what you will actually use when running your calculations?

At 15:00 the downside of near-CPU cache is that the data is far larger than when stored in a compressed format. The Arrow format is maybe 10 times larger than Parquet.
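
You can see the trade-off yourself with pyarrow (illustrative data; the 10x figure depends heavily on the dataset):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(1_000_000)) * 2})
pq.write_table(table, "data.parquet")

# Compare the in-memory Arrow footprint with the encoded, compressed
# Parquet file on disk.
print(table.nbytes, "bytes in Arrow")
print(os.path.getsize("data.parquet"), "bytes in Parquet")
```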

At 16:00 cube-based relational caching has been around for a long time; basically think wiki-molap.

At 16:20 pre-aggregate some pieces of information into cubes, and when a user comes in looking for matching information, answer it from the cube.
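
A tiny pyarrow sketch of the cube idea (dimension and measure names are my own):

```python
import pyarrow as pa
import pyarrow.compute as pc

sales = pa.table({"city":  ["NYC", "NYC", "SF"],
                  "title": ["Dune", "Emma", "Dune"],
                  "sold":  [3, 1, 5]})

# Roll the raw rows up along a dimension into a small cube.
cube = sales.group_by("city").aggregate([("sold", "sum")])

# A matching question ("how many books sold in NYC?") becomes a cheap
# lookup against the cube instead of a scan over the raw data.
nyc_total = cube.filter(pc.equal(cube["city"], "NYC"))
```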

At 16:50 interaction time for cube-based caching varies a lot. If a query gets a cube hit, then it's basically instantaneous, but if not, then it can take a long time.

At 17:10 it's kind of hard to get a cube-based cache to satisfy arbitrary query patterns.

At 20:30 he starts talking about an XY plot of the trade-offs associated with the various techniques. Basically, from the raw data you can answer any question, but it will take you a while to answer it; from the derived, cached data you can answer your question quickly, but maybe only very narrowly, and if you have to ask another question you have to wait a long time or set up a new cache, etc.

At around 22:10 he shows a slide of the decisions that they made for Dremio, his company, and how they adopted five of the six caching techniques; the exception is file pinning.

At 23:00 he talks about relational algebra.

At 24:00 the scan is followed by a projection: I only want three of the columns of the data.

At 24:31 one thing you can do to limit the data processing is to put the filter before the projection. That is a straightforward rewrite in relational algebra.
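
pyarrow's Parquet reader exposes both moves directly, projection via columns= and a pushed-down filter via filters= (my example, not from the talk):

```python
import pyarrow as pa
import pyarrow.parquet as pq

sales = pa.table({"city": ["NYC", "SF"], "title": ["Dune", "Emma"],
                  "year": [2023, 2023], "sold": [3, 1]})
pq.write_table(sales, "sales.parquet")

# Read only the needed columns, and apply the filter during the scan,
# mirroring the relational-algebra rewrite described above.
nyc = pq.read_table("sales.parquet",
                    columns=["title", "sold"],
                    filters=[("city", "=", "NYC")])
```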

At 25:30 the idea of relational caching is that there might be some shared state that speeds up the queries that you run. You don't have to derive the entirety of the calculation every time; there is some shared state that benefits many of your queries.

At 27:30 people use the copy-and-pick strategy. Copy is where you do some sort of rollup, and then analysts are expected to know which rollup to use to fit their queries. If they pick one rollup, they are unlikely to want to change the innards of their query after the fact.

At 28:10 relational caching moves the ownership of the caches to the system, rather than the data engineer who sets up the various rollups.

At 29:10 they use Parquet to persist the relational caches.

At 30:40 he mentions a change-detection database as part of the overall architecture for the system.

At 32:15 relational caching is basically materialized views outside the database.

At 33:00 there are reflections. A raw reflection is basically a structural normalization. It may remove some of the columns, sort the data, or partition it differently, but the data remains the same apart from these simple transformations.
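
A raw-reflection-like copy sketched with pyarrow (the paths, columns, and layout are my assumptions, not Dremio's implementation):

```python
import pyarrow as pa
import pyarrow.parquet as pq

raw = pa.table({"city": ["SF", "NYC", "NYC"], "sold": [5, 3, 1]})

# Same rows, but sorted and re-partitioned so later scans touch less data.
sorted_copy = raw.sort_by([("city", "ascending")])
pq.write_to_dataset(sorted_copy, root_path="reflection",
                    partition_cols=["city"])
```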

At 33:15 aggregate reflections are where you define dimensions and measures and roll the data up.

At 33:40 the reflections are powerful because they can be based on a virtual dataset. The reflections are persisted.

At 34:00 you can think of raw reflections as being similar to a covering index on real or virtual data, whereas aggregate reflections act more like cube relational caching.

At 37:50 refresh management. They wanted to be able to specify the ideal update timeframe, for example 10 minutes, but also the worst allowable staleness, for example three hours, before the system falls back to the sources and bypasses the cache.
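
A hypothetical sketch of that two-threshold policy (none of these names or numbers beyond the 10 minutes / 3 hours come from the talk):

```python
from datetime import datetime, timedelta

TARGET_AGE = timedelta(minutes=10)  # ideal refresh interval
MAX_AGE = timedelta(hours=3)        # worst allowable staleness

def route_query(cache_refreshed_at: datetime) -> str:
    age = datetime.utcnow() - cache_refreshed_at
    if age > MAX_AGE:
        # Cache is too stale to trust: bypass it entirely.
        return "query the sources directly"
    if age > TARGET_AGE:
        return "serve from cache and schedule a refresh"
    return "serve from cache"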

At 38:50 the refresh graph: raw reflections are done earliest, then the later stuff is done afterwards against a more performant source of data.

At 40:00 multiple update modes depending on the mutation pattern. The one they found most effective was partition-based updates.
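
A partition-based update might look like this (a sketch only; the dataset layout and paths are assumed, not from the talk):

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# Rewrite only the partition whose source data mutated, leaving the
# rest of the cached dataset untouched.
os.makedirs("reflection/city=NYC", exist_ok=True)
fresh_nyc = pa.table({"sold": [4, 2]})
pq.write_table(fresh_nyc, "reflection/city=NYC/part-0.parquet")
```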

Referring Pages

data-architecture-glossary

People

person-jacques-nadeau