Spark’s caching mechanism can be leveraged to optimize performance. Here are some facts and caveats about caching.
Timeout errors may occur while the Spark application is running or even after the Spark application has finished. Below are some common timeout errors and their solutions.
Spark has two deploy modes: client mode and cluster mode. Cluster mode is ideal for batch ETL jobs submitted from the same “driver server,” because the driver programs run on the cluster instead of on the driver server, which prevents the driver server from becoming a resource bottleneck. In cluster mode, however, the driver server is only responsible for running a client process that submits the application; the driver program itself then runs on a different machine in the cluster. This poses the following challenges:
We want to make use of cluster mode’s advantage while finding workarounds for the following:
A cached RDD can only be unpersisted through a variable referencing it.
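For illustration, here is a minimal PySpark sketch (the RDD and variable names are made up for this example) showing that `unpersist()` has to be called on the object you cached:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Keep a variable that references the cached RDD.
rdd = spark.sparkContext.parallelize(range(1000)).map(lambda x: x * x)
rdd.cache()    # marks the RDD for caching
rdd.count()    # first action materializes the cache

# Unpersisting goes through the same reference.
rdd.unpersist()

# If the variable had been reassigned (e.g. rdd = rdd.filter(...)) before
# unpersist(), there would be no handle left to explicitly free the old
# cached data.
```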
Definition. The two’s complement of an $n$-bit binary number $x$ is given by $\bar{x}+1$, where $\bar{x}$ is constructed from $x$ by inverting the $0$s and $1$s.
Definition. The two’s complement representation of a signed $n$-bit number $a_{n-1}\cdots a_1a_0$ (or simply “two’s complement number”) is given by
\[-2^{n-1} a_{n-1} + 2^{n-2}a_{n-2} + \cdots + 2^1 a_1 + 2^0 a_0,\]where the first bit $a_{n-1}$ is the sign bit.
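For example (a 4-bit pattern chosen just for illustration), the two’s complement number $1010$ has the value
\[-2^{3}\cdot 1 + 2^{2}\cdot 0 + 2^{1}\cdot 1 + 2^{0}\cdot 0 = -8 + 2 = -6.\]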
Theorem. Let $x_c$ denote the two’s complement of an $n$-bit binary number $x$. Then $x + x_c = 2^n$ in decimal.
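A quick check with $n = 4$ and $x = 0110$ (decimal $6$), chosen only for illustration:
\[\bar{x} = 1001, \qquad x_c = \bar{x} + 1 = 1010 = 10_{10}, \qquad x + x_c = 6 + 10 = 16 = 2^4.\]
More generally, $x + \bar{x}$ is a string of $n$ ones, i.e. $2^n - 1$, so $x + x_c = x + \bar{x} + 1 = 2^n$.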
A user-defined function (udf) is a feature in (Py)Spark that allows users to define custom functions that operate on column arguments. This post summarizes some pitfalls when using udfs.
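As a minimal sketch of a Python udf (the DataFrame, column, and function names here are illustrative, not taken from this post):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])

# Wrap an ordinary Python function as a udf; the return type must be declared.
square = udf(lambda x: x * x, IntegerType())

# The udf is applied to a column, row by row, in Python worker processes,
# which incurs Python-JVM serialization overhead.
df.withColumn("x_squared", square(df["x"])).show()
```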