Get started with Spark for Data Engineering in Azure Synapse Analytics
The Apache Spark engine has become increasingly prevalent in the data engineering space, and for good reason: it enables data engineering at scale in the cloud, is welcoming to data professionals of various backgrounds (Python, Scala, SQL, Java, R), and works great in unison with Delta Lake.
This article provides a list of resources for anyone looking to get started with Spark, particularly in the context of Azure Synapse Analytics. Spark is featured as one of Synapse’s main tools for data platform development, hence the focus on this particular service.
With that said, the Databricks and Synapse implementations of Spark are largely in parity, so if you come across Databricks resources online, most of the content should apply to Synapse as well.
What is Spark?
Okay, but what is Spark? This engine can be used across data engineering, data analysis, and data science (machine learning). It’s quite a powerful tool that we, data professionals, can leverage across all data-related activities, with a special emphasis on data engineering and data science workloads.
With that said, most of the resources below are geared towards data engineering, with a heavy focus on the Delta Lake architecture (more on that later).
I recommend watching the following video to get a better grasp of what Spark is and how you can get started with it in Synapse: https://www.youtube.com/watch?v=SPOQzcbTgvQ
First steps with Spark in Azure Synapse Analytics
You now have an idea of what Spark is, but how can you develop with it? That’s the purpose of the next few links: hands-on exercises and reference documentation for your first couple of hours with Spark!
- PySpark documentation: https://spark.apache.org/docs/latest/api/python/
- Basic DataFrame operations (see the short sketch after this list): https://learn.microsoft.com/en-gb/azure/databricks/getting-started/dataframes-python
- Microsoft Spark utilities (super handy in any Synapse Spark notebook): https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/microsoft-spark-utilities?pivots=programming-language-python
- Microsoft learning path for Spark under the Data Engineering on Microsoft Azure certification: https://learn.microsoft.com/en-us/training/paths/perform-data-engineering-with-azure-synapse-apache-spark-pools/
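To make the DataFrame link above a bit more concrete, here is a minimal sketch of the kind of DataFrame and Spark SQL operations it covers. The data and column names are made up for illustration; in a Synapse notebook a SparkSession called `spark` already exists, so the builder line is only needed when running PySpark elsewhere.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Only needed outside Synapse; in a Synapse notebook `spark` is already defined.
spark = SparkSession.builder.appName("first-steps").getOrCreate()

# A small in-memory DataFrame (columns are hypothetical, for illustration only).
df = spark.createDataFrame(
    [("Alice", "Sales", 4200), ("Bob", "Sales", 3800), ("Carol", "HR", 4500)],
    ["name", "department", "salary"],
)

# Typical first operations: select, filter, aggregate, inspect.
df.select("name", "salary").show()
df.filter(F.col("salary") > 4000).show()
df.groupBy("department").agg(F.avg("salary").alias("avg_salary")).show()

# Spark SQL over the same data: register a temporary view and query it.
df.createOrReplaceTempView("employees")
spark.sql("SELECT department, COUNT(*) AS headcount FROM employees GROUP BY department").show()
```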
The Delta Lake Architecture
Now that you have a taste of working with Spark (especially PySpark and Spark SQL), you can move on to the Delta Lake architecture.
I obviously recommend reading through the official documentation to get a better understanding of Delta Lake, but here is a quick list of its key features:
- Takes the best features of the data warehouse and the data lake to build the data lakehouse: big data is stored in the data lake, but with all the capabilities of a data warehouse
- Data is stored in the Delta format, a derivative of Parquet that provides ACID transactions, i.e. data validity as the data changes (a brief sketch follows this list)
- Provides a unified location for data, i.e., data does not leave the data lake throughout the ingestion process and is consumed directly from there
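To make the Delta point more tangible, here is a hedged sketch of writing and reading a Delta table with PySpark, reusing the `spark` session and `df` DataFrame from the earlier snippet. The abfss:// path and storage account are placeholders for your own data lake location; on a Synapse Spark pool the Delta Lake libraries are available out of the box, while elsewhere you would need the delta-spark package configured.

```python
# Placeholder path: replace with your own data lake container and storage account.
delta_path = "abfss://lakehouse@<storage-account>.dfs.core.windows.net/silver/employees"

# Writing in Delta format gives ACID guarantees: readers keep seeing a
# consistent snapshot of the table while this write commits.
df.write.format("delta").mode("overwrite").save(delta_path)

# Read the table back, and optionally query an earlier version (time travel).
current = spark.read.format("delta").load(delta_path)
previous = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
```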
Guy in a Cube has a nice video describing lakehouses: https://www.youtube.com/watch?v=_fCKsorBnTg
They also have a short hands-on series on how to build a lakehouse with Spark and Synapse (a rough sketch of the Bronze to Silver step follows these links):
- Getting started with Spark for Delta Lake: https://www.youtube.com/watch?v=3ef985a0Veg
- Bronze to Silver processing: https://www.youtube.com/watch?v=TGIP4w61Tyc
- Silver to Gold processing: https://www.youtube.com/watch?v=6aXO_zOMP1U
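As a rough idea of what the Bronze to Silver step in that series looks like in code, here is a sketch with hypothetical paths and column names: read the raw landing data from the bronze layer, apply some basic cleaning, and write the result as a Delta table in the silver layer.

```python
from pyspark.sql import functions as F

# Hypothetical bronze (raw) and silver (cleaned) locations in the data lake.
bronze_path = "abfss://lakehouse@<storage-account>.dfs.core.windows.net/bronze/orders"
silver_path = "abfss://lakehouse@<storage-account>.dfs.core.windows.net/silver/orders"

# Raw landing data as ingested, e.g. Parquet or CSV files.
raw = spark.read.format("parquet").load(bronze_path)

cleaned = (
    raw.dropDuplicates(["order_id"])                       # remove duplicate ingestions
       .withColumn("order_date", F.to_date("order_date"))  # enforce types
       .filter(F.col("amount").isNotNull())                # basic data quality rule
)

# Persist the cleaned data as a Delta table in the silver layer.
cleaned.write.format("delta").mode("overwrite").save(silver_path)
```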