Spark’s caching mechanism can be leveraged to optimize performance. Here are some facts and caveats about caching.
Timeout errors may occur while the Spark application is running or even after the Spark application has finished. Below are some common timeout errors and their solutions.
Spark has two deploy modes: client mode and cluster mode. Cluster mode is ideal for batch ETL jobs submitted from the same “driver server,” because the driver programs run on the cluster instead of on the driver server, which prevents the driver server from becoming a resource bottleneck. In cluster mode, however, the driver server is only responsible for running a client process that submits the application; the driver program itself then runs on a different machine in the cluster. This poses the following challenges:
We want to make use of cluster mode’s advantage while finding workarounds for the following:
A cached RDD can only be unpersisted through a variable referencing it.
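For illustration, here is a minimal PySpark sketch (the RDD and variable names are made up for this example) showing that `unpersist()` has to be called on the object you cached:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Keep a variable that references the cached RDD.
rdd = spark.sparkContext.parallelize(range(1000)).map(lambda x: x * x)
rdd.cache()    # marks the RDD for caching
rdd.count()    # first action materializes the cache

# Unpersisting goes through the same reference.
rdd.unpersist()

# If the variable had been reassigned (e.g. rdd = rdd.filter(...)) before
# unpersist(), there would be no handle left to explicitly free the old
# cached data.
```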
Definition. The two’s complement of an $n$-bit binary number $x$ is given by $\bar{x}+1$, where $\bar{x}$ is constructed from $x$ by inverting the $0$s and $1$s.
Definition. The two’s complement representation of a signed $n$-bit number $a_{n-1}\cdots a_1a_0$ (or simply “two’s complement number”) is given by
\[-2^{n-1} a_{n-1} + 2^{n-2}a_{n-2} + \cdots + 2^1 a_1 + 2^0 a_0,\]where the first bit $a_{n-1}$ is the sign bit.
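For example (a 4-bit pattern chosen just for illustration), the two’s complement number $1010$ has the value
\[-2^{3}\cdot 1 + 2^{2}\cdot 0 + 2^{1}\cdot 1 + 2^{0}\cdot 0 = -8 + 2 = -6.\]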
Theorem. Let $x_c$ denote the two’s complement of an $n$-bit binary number $x$. Then $x + x_c = 2^n$ in decimal.
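A quick check with $n = 4$ and $x = 0110$ (decimal $6$), chosen only for illustration:
\[\bar{x} = 1001, \qquad x_c = \bar{x} + 1 = 1010 = 10_{10}, \qquad x + x_c = 6 + 10 = 16 = 2^4.\]
More generally, $x + \bar{x}$ is a string of $n$ ones, i.e. $2^n - 1$, so $x + x_c = x + \bar{x} + 1 = 2^n$.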
A user-defined function (udf) is a feature in (Py)Spark that allows users to define custom functions that operate on column arguments. This post summarizes some pitfalls when using udfs.
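As a minimal sketch of a Python udf (the DataFrame, column, and function names here are illustrative, not taken from this post):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])

# Wrap an ordinary Python function as a udf; the return type must be declared.
square = udf(lambda x: x * x, IntegerType())

# The udf is applied to a column, row by row, in Python worker processes,
# which incurs Python-JVM serialization overhead.
df.withColumn("x_squared", square(df["x"])).show()
```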