Data SDK for Python with Spark is a tool for data scientists who want to use Spark for large-scale analysis of platform data, or who already plan to implement their solution in a pipeline and prefer to use the same language and framework for analysis to simplify deployment to production.
The SDK uses the Sparkmagic extension for Jupyter to run Spark jobs, either in local mode or in cluster mode.
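Sparkmagic reaches Spark through a Livy REST endpoint, so the practical difference between local mode and cluster mode is largely which endpoint the notebook points at. The following is a minimal sketch under stated assumptions, not the SDK's actual setup: the helper `build_sparkmagic_config` and the cluster host name are hypothetical, and the keys follow Sparkmagic's example `config.json` (typically stored at `~/.sparkmagic/config.json`). The SDK's installation steps may generate this file differently.

```python
import json


def build_sparkmagic_config(livy_url="http://localhost:8998"):
    """Return a minimal Sparkmagic-style config dict for one Livy endpoint.

    Hypothetical helper for illustration; only the Python kernel
    credentials section is filled in.
    """
    return {
        "kernel_python_credentials": {
            "username": "",
            "password": "",
            "url": livy_url,   # Livy endpoint that Sparkmagic sends jobs to
            "auth": "None",    # no authentication in this sketch
        }
    }


# Local mode: a Livy server assumed to run on the same machine.
local_config = build_sparkmagic_config()

# Cluster mode: a Livy server on the cluster (host name is a placeholder).
cluster_config = build_sparkmagic_config("http://emr-master.example.com:8998")

print(json.dumps(local_config, indent=2))
```

Keeping the endpoint as the only parameter makes it easy to switch the same notebook between a local Spark instance and a cluster without changing the analysis code.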
You can find the steps to install and configure the Data SDK for Python with Spark here.
Data SDK for Python with local Spark works only on Linux and macOS. The EMR Spark cluster option is available on all platforms. To configure the SDK, do the following: