blog-post-functional-data-engineering

https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a

It's important to insulate compute logic changes from data changes and have control over all of the moving parts.

As ETL pipelines grow in complexity, and as data teams grow in numbers, using methodologies that provide clarity isn't a luxury, it's a necessity.

To put it simply, immutable data along with versioned logic are key to reproducibility.

In the context of a SQL ELT-type approach, which has become common nowadays, a task is likely to simply overwrite a portion of a table (a partition).
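A minimal sketch of the overwrite idea, assuming an in-memory dict stands in for a partitioned table. The names `run_task_instance` and `compute_daily_sales` are hypothetical and not from the post. Because the task replaces its partition rather than appending to it, rerunning it leaves the table in the same state.

```python
from typing import Callable, Dict, List

Row = dict
Table = Dict[str, List[Row]]  # partition key (date string) -> rows

def compute_daily_sales(ds: str) -> List[Row]:
    # Hypothetical stand-in for a deterministic query over immutable source data.
    source = {
        "2018-01-01": [("a", 10), ("b", 5)],
        "2018-01-02": [("a", 7)],
    }
    return [{"ds": ds, "product": p, "amount": amt} for p, amt in source.get(ds, [])]

def run_task_instance(table: Table, ds: str, compute: Callable[[str], List[Row]]) -> None:
    # Overwrite exactly one partition; never append to it or touch other partitions.
    table[ds] = compute(ds)

sales: Table = {}
run_task_instance(sales, "2018-01-01", compute_daily_sales)
first_run = list(sales["2018-01-01"])
run_task_instance(sales, "2018-01-01", compute_daily_sales)  # rerun of the same instance
```

In a SQL engine the same effect would typically come from an overwrite-style statement scoped to one partition; the dict assignment here only illustrates the semantics.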

In some cases it may appear difficult to avoid side effects. One potential solution is to re-think the size of the unit of work. Can the task become pure if it encompasses other related tasks? Can it be purified by breaking it down into a set of smaller tasks?

If temporary tables or dataframes are used, they should be implemented in a way that task instances cannot interfere with one another so that they can be parallelized.
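One way to keep task instances from interfering, sketched here under the assumption that scratch tables are named after the task and the partition being computed. SQLite stands in for the warehouse, and `temp_table_name` and `run_instance` are hypothetical helpers.

```python
import sqlite3

def temp_table_name(task_id: str, ds: str) -> str:
    # Scratch-table name unique to one task instance (task + partition).
    return f"tmp_{task_id}_{ds.replace('-', '')}"

def run_instance(conn: sqlite3.Connection, task_id: str, ds: str, values: list) -> int:
    tmp = temp_table_name(task_id, ds)
    conn.execute(f"DROP TABLE IF EXISTS {tmp}")  # leftover from a failed earlier run
    conn.execute(f"CREATE TABLE {tmp} (v INTEGER)")
    conn.executemany(f"INSERT INTO {tmp} VALUES (?)", [(v,) for v in values])
    total = conn.execute(f"SELECT SUM(v) FROM {tmp}").fetchone()[0]
    conn.execute(f"DROP TABLE {tmp}")
    return total

conn = sqlite3.connect(":memory:")
total_day1 = run_instance(conn, "agg_sales", "2018-01-01", [1, 2, 3])
total_day2 = run_instance(conn, "agg_sales", "2018-01-02", [10, 20])
```

Since no two instances share a scratch-table name, instances for different dates could run at the same time without reading each other's intermediate rows.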

To work around this limitation, one option is to implement your own partitioning scheme by using different physical tables as the output of your pure tasks, which can then be combined with UNION ALL into views that act as logical tables.
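A rough sketch of that scheme, using SQLite as a stand-in for an engine without native partitions. The table naming convention (`sales__YYYYMMDD`) and the helpers `write_partition_table` and `rebuild_view` are assumptions made for illustration.

```python
import sqlite3

def write_partition_table(conn, base: str, ds: str, amounts: list) -> str:
    # One physical table per partition, fully rewritten by its task instance.
    table = f"{base}__{ds.replace('-', '')}"
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} (ds TEXT, amount INTEGER)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", [(ds, a) for a in amounts])
    return table

def rebuild_view(conn, base: str) -> None:
    # The view is the logical table: a UNION ALL over every physical partition table.
    # Assumes at least one partition table already exists.
    names = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    parts = [n for n in names if n.startswith(f"{base}__")]
    select = " UNION ALL ".join(f"SELECT ds, amount FROM {n}" for n in parts)
    conn.execute(f"DROP VIEW IF EXISTS {base}")
    conn.execute(f"CREATE VIEW {base} AS {select}")

conn = sqlite3.connect(":memory:")
write_partition_table(conn, "sales", "2018-01-01", [10, 5])
write_partition_table(conn, "sales", "2018-01-02", [7])
rebuild_view(conn, "sales")
totals = conn.execute(
    "SELECT ds, SUM(amount) FROM sales GROUP BY ds ORDER BY ds").fetchall()
```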

The staging area is the conceptual loading dock of your data warehouse, and while in the case of a real, physical retail-type warehouse you'd want to use a transient staging area and keep it unobstructed, in most modern data warehouses you'll want to accumulate and persist all of your source data there, and keep it unchanged forever.

Also note that in many cases business rule changes over time are best expressed with data as opposed to code. In those cases it's desirable to store this information in parameter tables using effective dates, and have logic that joins and applies the right parameters for the facts being processed.
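A small sketch of an effective-dated parameter table, assuming SQLite and hypothetical `tax_rates` and `sales` tables. Each fact is joined to the most recent parameter row whose effective date is on or before the fact's date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tax_rates (effective_from TEXT, rate REAL)")
conn.execute("CREATE TABLE sales (ds TEXT, amount INTEGER)")
conn.executemany("INSERT INTO tax_rates VALUES (?, ?)",
                 [("2017-01-01", 0.05), ("2018-01-01", 0.07)])
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("2017-12-31", 100), ("2018-01-01", 100)])

# For each fact, pick the latest rate whose effective date is not after the fact date.
rows = conn.execute("""
    SELECT s.ds, s.amount, r.rate
    FROM sales AS s
    JOIN tax_rates AS r
      ON r.effective_from = (
          SELECT MAX(p.effective_from)
          FROM tax_rates AS p
          WHERE p.effective_from <= s.ds
      )
    ORDER BY s.ds
""").fetchall()
```

A rule change then becomes a new row in `tax_rates` rather than a code change, and reprocessing an old partition still picks up the rate that applied at the time.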

Given a persistent immutable staging area and pure tasks, in theory it's possible to recompute the state of the entire warehouse from scratch (not that you should), and get to the exact same state. Knowing this, the retention policy on derived tables can be shorter, since historical data can be backfilled at will.

The solution here is to apply conditional logic within the task with a certain effective date, and depending on the slice of data getting computed, the extra tax logic would apply where needed.
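As an illustration only, the conditional could look like the sketch below. The effective date, the rates, and the function name `compute_tax` are all made up for the example.

```python
from datetime import date

EXTRA_TAX_EFFECTIVE = date(2018, 1, 1)  # hypothetical date the new rule starts
BASE_RATE = 0.05                        # hypothetical base tax rate
EXTRA_TAX_RATE = 0.02                   # hypothetical extra tax

def compute_tax(ds: date, amount: float) -> float:
    # The slice being computed (ds) decides which version of the rule applies,
    # so backfilling an old partition reproduces the old behaviour.
    rate = BASE_RATE
    if ds >= EXTRA_TAX_EFFECTIVE:
        rate += EXTRA_TAX_RATE
    return round(amount * rate, 2)

tax_before = compute_tax(date(2017, 12, 31), 100.0)
tax_after = compute_tax(date(2018, 1, 1), 100.0)
```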

The dimension table becomes a collection of dimension snapshots where each partition contains the full dimension as of a point in time.
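A toy sketch of joining facts to dimension snapshots, assuming daily snapshots keyed by date string. The `dim_user` and `fact_visits` structures and the `enrich` helper are hypothetical.

```python
# Each partition of the dimension holds the full dimension as of that date.
dim_user = {
    "2018-01-01": {1: {"country": "FR"}, 2: {"country": "CA"}},
    "2018-01-02": {1: {"country": "US"}, 2: {"country": "CA"}},
}

fact_visits = [
    {"ds": "2018-01-01", "user_id": 1},
    {"ds": "2018-01-02", "user_id": 1},
]

def enrich(facts: list, dim: dict) -> list:
    # Join each fact to the snapshot taken on the fact's own date, so the
    # attributes reflect what was true at the time of the event.
    return [{**f, **dim[f["ds"]][f["user_id"]]} for f in facts]

enriched = enrich(fact_visits, dim_user)
countries = [row["country"] for row in enriched]
```

User 1 shows up with a different country on each day because each fact reads the snapshot for its own date, with no in-place updates to the dimension.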

Now that storage and compute are dirt cheap compared to engineering time, snapshotting dimensions makes sense in most cases.

The unit of work should map directly to the output of a single partition. This makes it trivial to map each logical table to a task, and each partition to a task instance.

While you may think of your workflow as a directed acyclic graph (DAG) of tasks, and of your data lineage as a graph made of tables as nodes, the functional approach allows you to conceptualize a more accurate picture made out of task instances and partitions.

In this more detailed graph, we move away from individual rows or cells being the "atomic state" that can be mutated to a place where partitions are the smallest unit that can be changed by tasks.

The lineage of any given row can be mapped to a specific task instance through its partition, and by following the graph upstream it's possible to understand the full lineage as a set of partitions and related task instances.
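The partition-level lineage can be sketched as a graph walk. This example assumes one task per table and that each partition depends only on the same-date partition of its upstream tables; the table names and the `lineage` function are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical table-level dependencies: table -> upstream tables.
UPSTREAM: Dict[str, List[str]] = {
    "staging_orders": [],
    "staging_users": [],
    "fact_orders": ["staging_orders", "staging_users"],
    "agg_daily_revenue": ["fact_orders"],
}

Partition = Tuple[str, str]  # (table, ds); also identifies the task instance

def lineage(table: str, ds: str) -> Set[Partition]:
    # Walk upstream from one partition and collect every partition it was built from.
    seen: Set[Partition] = set()
    stack = [table]
    while stack:
        current = stack.pop()
        for parent in UPSTREAM[current]:
            if (parent, ds) not in seen:
                seen.add((parent, ds))
                stack.append(parent)
    return seen

result = lineage("agg_daily_revenue", "2018-01-01")
```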

Given that backfills are common and that past dependencies lead to high-depth DAGs with limited parallelization, it's a good practice to avoid modeling using past dependencies whenever possible.

To bring clarity around this not-so-special case (of late-arriving facts), the first thing to do is to dissociate the event's time from the event's reception or processing time. Where late-arriving facts may exist and need to be handled, time partitioning should always be done on event processing time. This allows for landing immutable blocks of data without delays, in a predictable fashion.
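A minimal sketch of landing data by processing time, assuming an in-memory dict as the staging table. `land_batch` and `events_for` are hypothetical helpers, and the refusal to rewrite an existing partition is one possible way to enforce immutability.

```python
def land_batch(staging: dict, processing_ds: str, events: list) -> None:
    # Partition key is the day the data was received, not the day it happened.
    if processing_ds in staging:
        raise ValueError(f"partition {processing_ds} already landed and is immutable")
    staging[processing_ds] = list(events)

def events_for(staging: dict, event_ds: str) -> list:
    # Without event-time partitions, a query on event time scans every partition.
    return [e for part in staging.values() for e in part if e["event_ds"] == event_ds]

staging: dict = {}
land_batch(staging, "2018-01-01", [{"event_ds": "2018-01-01", "id": 1}])
day1_snapshot = list(staging["2018-01-01"])

land_batch(staging, "2018-01-03", [
    {"event_ds": "2018-01-03", "id": 2},
    {"event_ds": "2018-01-01", "id": 3},  # late-arriving fact, lands in today's partition
])

day1_event_ids = [e["id"] for e in events_for(staging, "2018-01-01")]
```

The late event ends up in the `2018-01-03` partition, so the `2018-01-01` partition never has to be reopened.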

When you define your partitioning scheme based on event-processing time, your data is no longer partitioned on event time, which means that queries that typically have predicates on event time won't be able to benefit from partition pruning (where the database only bothers to read a subset of the partitions). It's clearly an expensive tradeoff.

One option is to partition on both time dimensions. This should lead to a relatively low factor of partition multiplication, but it raises the complexity of the model.
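To make the two-dimension option concrete, here is a toy sketch in which partitions are keyed by `(processing_ds, event_ds)`. The layout and the `scan` helper are assumptions for illustration, not a description of any particular engine.

```python
# Partitions keyed by (processing_ds, event_ds).
partitions = {
    ("2018-01-01", "2018-01-01"): [{"id": 1}],
    ("2018-01-03", "2018-01-03"): [{"id": 2}],
    ("2018-01-03", "2018-01-01"): [{"id": 3}],  # late fact for 2018-01-01
}

def scan(parts: dict, event_ds: str):
    # Pruning: only partitions whose event_ds matches the predicate are read.
    read = [key for key in parts if key[1] == event_ds]
    rows = [row for key in read for row in parts[key]]
    return rows, len(read)

rows, partitions_read = scan(partitions, "2018-01-01")
ids = [r["id"] for r in rows]
```

A day with late data produces an extra partition per distinct event date it received, which is the partition multiplication mentioned above.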

For example, if your engine is processing ORC or Parquet files, the execution will be limited to reading the file header before moving on to the next file.

Perhaps we care more about our aggregate landing early in the day than we care about accuracy. Given this preference, it's possible to join onto the latest partition available at the time the other dependencies are met.
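A small sketch of picking the latest available partition, assuming partition keys are ISO date strings (which sort chronologically as plain strings). The function name `latest_available` is hypothetical.

```python
from typing import Iterable, Optional

def latest_available(partitions: Iterable[str], as_of: str) -> Optional[str]:
    # Most recent partition that is not after the date being computed, or None.
    candidates = [p for p in partitions if p <= as_of]
    return max(candidates) if candidates else None

# The dimension snapshot for 2018-01-03 has not landed yet.
dim_partitions = ["2018-01-01", "2018-01-02"]
chosen = latest_available(dim_partitions, "2018-01-03")
```

The result depends on which partitions exist when the task runs, so a later rerun may pick a different snapshot; that is the accuracy given up in exchange for an earlier landing time.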

Referring Pages

data-architecture-glossary talk-functional-data-engineering

People

person-maxime-beauchemin