Get started with Spark for Data Engineering in Azure Synapse Analytics
The Apache Spark engine has become increasingly prevalent in the data engineering space, and for good reason: it enables data engineering at scale in the cloud, is welcoming to data professionals of various backgrounds (Python, Scala, SQL, Java, R), and works great in unison with Delta Lake.
This article provides a list of resources for anyone looking to get started with Spark, particularly in the context of Azure Synapse Analytics. Spark is featured as one of Synapse’s main tools for data platform development, hence the focus on this particular service.
With that said, the Databricks and Synapse implementations of Spark are largely in parity, so if you come across Databricks resources online, most of the content should apply to Synapse as well.
What is Spark?
Okay, but what is Spark? This engine can be used across data engineering, data analysis, and data science (machine learning). It’s quite a powerful tool that we, data professionals, can leverage across all data-related activities, with a special emphasis on data engineering and data science workloads.
With that said, most of the resources below are geared towards data engineering, with a heavy focus on the Delta Lake architecture (more on that later).
I recommend watching the following video to get a better grasp of what Spark is and how you can get started with it in Synapse: https://www.youtube.com/watch?v=SPOQzcbTgvQ
First steps with Spark in Azure Synapse Analytics
You now have an idea of what Spark is, but how can you develop with it? That’s the purpose of the next few links: hands-on exercises and reference documentation for your first couple of hours with Spark!
- PySpark documentation: https://spark.apache.org/docs/latest/api/python/
- Basic DataFrame operations (see the short sketch after this list): https://learn.microsoft.com/en-gb/azure/databricks/getting-started/dataframes-python
- Microsoft Spark utilities (super handy in any Synapse Spark notebook): https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/microsoft-spark-utilities?pivots=programming-language-python
- Microsoft learning path for Spark under the Data Engineering on Microsoft Azure certification: https://learn.microsoft.com/en-us/training/paths/perform-data-engineering-with-azure-synapse-apache-spark-pools/
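To make the DataFrame link above a bit more concrete, here is a minimal sketch of the kind of DataFrame and Spark SQL operations it covers. The data and column names are made up for illustration; in a Synapse notebook a SparkSession called `spark` already exists, so the builder line is only needed when running PySpark elsewhere.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Only needed outside Synapse; in a Synapse notebook `spark` is already defined.
spark = SparkSession.builder.appName("first-steps").getOrCreate()

# A small in-memory DataFrame (columns are hypothetical, for illustration only).
df = spark.createDataFrame(
    [("Alice", "Sales", 4200), ("Bob", "Sales", 3800), ("Carol", "HR", 4500)],
    ["name", "department", "salary"],
)

# Typical first operations: select, filter, aggregate, inspect.
df.select("name", "salary").show()
df.filter(F.col("salary") > 4000).show()
df.groupBy("department").agg(F.avg("salary").alias("avg_salary")).show()

# Spark SQL over the same data: register a temporary view and query it.
df.createOrReplaceTempView("employees")
spark.sql("SELECT department, COUNT(*) AS headcount FROM employees GROUP BY department").show()
```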
The Delta Lake Architecture
Now that you have a taste of working with Spark (especially PySpark and Spark SQL), you can move on to the Delta Lake architecture.
I obviously recommend reading through the official documentation to get a better understanding of Delta Lake, but here is a quick list of its key features:
- Takes the best features of the data warehouse and the data lake to build the data lakehouse: big data is stored in the data lake, but with all the capabilities of a data warehouse
- Data is stored in the Delta format, a derivative of Parquet that provides ACID transactions, i.e. data validity as the data changes (a brief sketch follows this list)
- Provides a unified location for data, i.e., data does not leave the data lake throughout the ingestion process and is consumed directly from there
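To make the Delta point more tangible, here is a hedged sketch of writing and reading a Delta table with PySpark, reusing the `spark` session and `df` DataFrame from the earlier snippet. The abfss:// path and storage account are placeholders for your own data lake location; on a Synapse Spark pool the Delta Lake libraries are available out of the box, while elsewhere you would need the delta-spark package configured.

```python
# Placeholder path: replace with your own data lake container and storage account.
delta_path = "abfss://lakehouse@<storage-account>.dfs.core.windows.net/silver/employees"

# Writing in Delta format gives ACID guarantees: readers keep seeing a
# consistent snapshot of the table while this write commits.
df.write.format("delta").mode("overwrite").save(delta_path)

# Read the table back, and optionally query an earlier version (time travel).
current = spark.read.format("delta").load(delta_path)
previous = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
```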
Guy in a Cube has a nice video describing lakehouses: https://www.youtube.com/watch?v=_fCKsorBnTg
They also have a short hands-on series on how to build a lakehouse with Spark and Synapse (a rough sketch of the Bronze to Silver step follows these links):
- Getting started with Spark for Delta Lake: https://www.youtube.com/watch?v=3ef985a0Veg
- Bronze to Silver processing: https://www.youtube.com/watch?v=TGIP4w61Tyc
- Silver to Gold processing: https://www.youtube.com/watch?v=6aXO_zOMP1U
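As a rough idea of what the Bronze to Silver step in that series looks like in code, here is a sketch with hypothetical paths and column names: read the raw landing data from the bronze layer, apply some basic cleaning, and write the result as a Delta table in the silver layer.

```python
from pyspark.sql import functions as F

# Hypothetical bronze (raw) and silver (cleaned) locations in the data lake.
bronze_path = "abfss://lakehouse@<storage-account>.dfs.core.windows.net/bronze/orders"
silver_path = "abfss://lakehouse@<storage-account>.dfs.core.windows.net/silver/orders"

# Raw landing data as ingested, e.g. Parquet or CSV files.
raw = spark.read.format("parquet").load(bronze_path)

cleaned = (
    raw.dropDuplicates(["order_id"])                       # remove duplicate ingestions
       .withColumn("order_date", F.to_date("order_date"))  # enforce types
       .filter(F.col("amount").isNotNull())                # basic data quality rule
)

# Persist the cleaned data as a Delta table in the silver layer.
cleaned.write.format("delta").mode("overwrite").save(silver_path)
```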