Data pipelines are widely used to support business needs. As a business expands, both the volume and the diversity of its data grow. This calls for a system of data pipelines, each channelling data for a specific business area that may interact in various ways with other areas. How do we manage such a system of data pipelines? If the pipelines share a common structure, is it worth maintaining that structure in a separate framework package? And how do we manage the framework package together with the data pipelines that use it?
If a project depends on an external project, how do we manage that dependency in sbt?
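One common approach, sketched below, is to declare the dependency in the pipeline's build.sbt, either on a published artifact of the framework or on its source checked out locally; the organization, project names, and version here are hypothetical.

```scala
// build.sbt — a minimal sketch; names and version are hypothetical.

// Option 1: depend on a published artifact of the framework package.
// libraryDependencies += "com.example" %% "pipeline-framework" % "0.1.0"

// Option 2: depend on the framework's source, checked out as a sibling
// directory, so local framework changes are picked up without publishing.
lazy val framework = RootProject(file("../pipeline-framework"))

lazy val salesPipeline = (project in file("."))
  .settings(name := "sales-pipeline")
  .dependsOn(framework)
```

Option 2 is convenient while the framework and the pipelines evolve together; once the framework stabilizes, publishing versioned artifacts (Option 1) decouples their release cycles.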
Recently, our data infrastructure team deployed a new version of Spark, called Spark Magnet. It is said to offer a 30% to 50% performance improvement over the original Spark 3.0.
Spark Magnet is a patch to Spark 3 that improves shuffle efficiency:
Provided by LinkedIn’s data infrastructure team, it makes use of the Magnet shuffle service, a novel push-based shuffle mechanism built on top of Spark’s native shuffle service. It improves shuffle efficiency by addressing several major bottlenecks with reasonable trade-offs.
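For context, Magnet was later upstreamed into open-source Spark 3.2 as push-based shuffle (SPARK-30602). The sketch below shows how it is enabled on such a build; on a custom internal patch the exact config keys may differ.

```scala
// A minimal sketch, assuming a Spark 3.2+ build with push-based shuffle.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("magnet-shuffle-demo")
  // Push-based shuffle builds on the external shuffle service.
  .config("spark.shuffle.service.enabled", "true")
  // Turns on the push-based (Magnet) shuffle mechanism.
  .config("spark.shuffle.push.enabled", "true")
  .getOrCreate()
```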
Spark partitions are important for parallelism: Spark splits a dataset into partitions and runs one task per partition, so the number of partitions bounds how many tasks of a stage can run concurrently.
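A minimal sketch of inspecting and adjusting the partition count (the dataset and numbers are illustrative):

```scala
// A minimal sketch of inspecting and tuning partition counts.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitions-demo")
  .master("local[4]")
  .getOrCreate()

val df = spark.range(0L, 1000000L)
// The default partition count depends on spark.default.parallelism.
println(df.rdd.getNumPartitions)

// With 8 partitions, up to 8 tasks of the stage can be scheduled,
// though only 4 run at a time given the 4 local cores requested above.
val repartitioned = df.repartition(8)
println(repartitioned.rdd.getNumPartitions) // 8
```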
The concept of variance is rooted in subtyping. Subtyping is a relation between two types S and T such that if S is a subtype of T (written S <: T), then a value of type S can be used in any context where a value of type T is expected. E.g., if Cat is a subtype of Animal, a Cat can replace an Animal.
But how should the following pairs be related?

- Collection[Cat] vs. Collection[Animal]
- get_cat: () -> Cat vs. get_animal: () -> Animal
- print_cat: Cat -> () vs. print_animal: Animal -> ()
This is where variance comes in. Variance describes how the subtyping of a complex type (such as Collection[Cat]) relates to the subtyping of its component type (Cat).
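The sketch below shows these cases in Scala, reusing Cat and Animal from the text; the trait names Producer and Consumer are made up for illustration.

```scala
// A minimal sketch of variance in Scala; Producer and Consumer are hypothetical names.
class Animal
class Cat extends Animal

// Covariant (+A): Producer[Cat] is a subtype of Producer[Animal],
// mirroring get_cat vs. get_animal above.
trait Producer[+A] { def get(): A }

// Contravariant (-A): Consumer[Animal] is a subtype of Consumer[Cat],
// mirroring print_cat vs. print_animal above.
trait Consumer[-A] { def accept(a: A): Unit }

object VarianceDemo extends App {
  val catProducer: Producer[Cat] = () => new Cat
  val animalProducer: Producer[Animal] = catProducer // compiles: covariance

  val animalConsumer: Consumer[Animal] = a => println(a)
  val catConsumer: Consumer[Cat] = animalConsumer    // compiles: contravariance

  // An invariant Collection[A] (no +/- annotation) would leave Collection[Cat]
  // and Collection[Animal] unrelated, answering the first question above.
}
```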