Data pipelines are widely used to support business needs. As a business expands, both the volume and the diversity of its data grow. This calls for a system of data pipelines, each channelling data for a specific business area that may interact in various ways with other areas. How do we manage such a system of data pipelines? If the pipelines share a common structure, is it worth maintaining that structure in a separate framework package? And how do we manage the framework package together with the data pipelines that use it?
If a project depends on an external project, how do we manage that dependency in sbt?
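One common approach, sketched below, is to declare the dependency in the pipeline's build.sbt, either on a published artifact of the framework or on its source checked out locally; the organization, project names, and version here are hypothetical.

```scala
// build.sbt — a minimal sketch; names and version are hypothetical.

// Option 1: depend on a published artifact of the framework package.
// libraryDependencies += "com.example" %% "pipeline-framework" % "0.1.0"

// Option 2: depend on the framework's source, checked out as a sibling
// directory, so local framework changes are picked up without publishing.
lazy val framework = RootProject(file("../pipeline-framework"))

lazy val salesPipeline = (project in file("."))
  .settings(name := "sales-pipeline")
  .dependsOn(framework)
```

Option 2 is convenient while the framework and the pipelines evolve together; once the framework stabilizes, publishing versioned artifacts (Option 1) decouples their release cycles.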
Recently, our data infrastructure team deployed a new version of Spark, called Spark Magnet. It is said to offer a 30% to 50% performance improvement over the original Spark 3.0.
Spark Magnet is a patch to Spark 3 that improves shuffle efficiency:
Provided by LinkedIn’s data infrastructure team, it makes use of the Magnet shuffle service, a novel push-based shuffle mechanism built on top of Spark’s native shuffle service. It improves shuffle efficiency by addressing several major bottlenecks with reasonable trade-offs.
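For context, Magnet was later upstreamed into open-source Spark 3.2 as push-based shuffle (SPARK-30602). The sketch below shows how it is enabled on such a build; on a custom internal patch the exact config keys may differ.

```scala
// A minimal sketch, assuming a Spark 3.2+ build with push-based shuffle.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("magnet-shuffle-demo")
  // Push-based shuffle builds on the external shuffle service.
  .config("spark.shuffle.service.enabled", "true")
  // Turns on the push-based (Magnet) shuffle mechanism.
  .config("spark.shuffle.push.enabled", "true")
  .getOrCreate()
```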
Spark partitions are important for parallelism: Spark splits a dataset into partitions and runs one task per partition, so the number of partitions bounds how many tasks of a stage can run concurrently.
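A minimal sketch of inspecting and adjusting the partition count (the dataset and numbers are illustrative):

```scala
// A minimal sketch of inspecting and tuning partition counts.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitions-demo")
  .master("local[4]")
  .getOrCreate()

val df = spark.range(0L, 1000000L)
// The default partition count depends on spark.default.parallelism.
println(df.rdd.getNumPartitions)

// With 8 partitions, up to 8 tasks of the stage can be scheduled,
// though only 4 run at a time given the 4 local cores requested above.
val repartitioned = df.repartition(8)
println(repartitioned.rdd.getNumPartitions) // 8
```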
The concept of variance is rooted in subtyping. Subtyping is a relation between two types S and T such that if S is a subtype of T (written S <: T), then a value of type S can be used in any context where a value of type T is expected. E.g., if Cat is a subtype of Animal, a Cat can replace an Animal.
But how should the following pairs be related?

- Collection[Cat] vs. Collection[Animal]
- get_cat: () -> Cat vs. get_animal: () -> Animal
- print_cat: Cat -> () vs. print_animal: Animal -> ()
This is where variance comes in. Variance describes how the subtyping of a complex type (such as Collection[Cat]) relates to the subtyping of its component type (Cat).
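The sketch below shows these cases in Scala, reusing Cat and Animal from the text; the trait names Producer and Consumer are made up for illustration.

```scala
// A minimal sketch of variance in Scala; Producer and Consumer are hypothetical names.
class Animal
class Cat extends Animal

// Covariant (+A): Producer[Cat] is a subtype of Producer[Animal],
// mirroring get_cat vs. get_animal above.
trait Producer[+A] { def get(): A }

// Contravariant (-A): Consumer[Animal] is a subtype of Consumer[Cat],
// mirroring print_cat vs. print_animal above.
trait Consumer[-A] { def accept(a: A): Unit }

object VarianceDemo extends App {
  val catProducer: Producer[Cat] = () => new Cat
  val animalProducer: Producer[Animal] = catProducer // compiles: covariance

  val animalConsumer: Consumer[Animal] = a => println(a)
  val catConsumer: Consumer[Cat] = animalConsumer    // compiles: contravariance

  // An invariant Collection[A] (no +/- annotation) would leave Collection[Cat]
  // and Collection[Animal] unrelated, answering the first question above.
}
```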