
PySpark on Kubernetes

TarrantRo
8 min read · Dec 11, 2020


Introduction

Spark is a fast, general-purpose cluster computing system: by design, computation is distributed across a number of interconnected nodes.

We are going to deploy Spark on AKS (Azure Kubernetes Service) in client mode, because PySpark appears to support only client mode.

First, here is a quick introduction to client mode and cluster mode.

Source: https://blog.knoldus.com/cluster-vs-client-execution-modes-for-a-spark-application/

In cluster mode, the Spark driver (the application master) starts on one of the worker machines. The driver runs the application's main function and creates the SparkContext, which connects to the cluster manager to schedule tasks, request resources, and monitor execution. Once the executors finish their jobs, the driver is terminated as well.

In client mode, the client that submits the Spark application starts the driver and maintains the SparkContext. The driver manages the tasks until the job finishes, so the client must stay connected to the cluster and remain online until that job completes.

Because PySpark supports only client mode (as far as I know), this document covers client mode only.
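
To make this concrete, here is a minimal sketch of starting a client-mode PySpark session against a Kubernetes cluster. The API server address, namespace, container image, and executor count are placeholders rather than values from this article; adjust them for your own cluster.

```python
from pyspark.sql import SparkSession

# Minimal client-mode session against Kubernetes.
# The driver runs here, on the client; executors run as pods in the cluster.
# <k8s-apiserver>, the namespace, and the image are placeholder values.
spark = (
    SparkSession.builder
    .master("k8s://https://<k8s-apiserver>:443")
    .appName("pyspark-client-mode")
    .config("spark.kubernetes.namespace", "spark")
    .config("spark.kubernetes.container.image", "my-registry/spark-py:latest")
    .config("spark.executor.instances", "2")
    .getOrCreate()
)

# Quick smoke test: sum 0..99 on the executors.
print(spark.sparkContext.parallelize(range(100)).sum())

spark.stop()
```

Because the driver runs on the client, the session ends (and the executor pods are torn down) as soon as the client process exits, which is exactly the client-mode behavior described above.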

Infrastructure
