When you see the word cluster, your mind should navigate to the Apache Spark context. When running your data analytics on the Azure platform, you then need to determine if you are in the Azure Synapse Analytics or Azure Databricks context, both of which give you access to running data analytics on Apache Spark clusters. In Figure 9.14, you saw a pipeline run that included a notebook, written in Scala, that ran on an Apache Spark cluster in Azure Synapse Analytics. It is also obvious in Figure 9.22 that this activity took the longest of all the activities in the pipeline. When you hover over the Write to ADLS activity, you will see glasses like you saw for the Azure SQL activity in Figure 9.14. Clicking them will open the notebook, allowing you to look at the Scala code and see what might be taking so long. Instead of clicking the glasses linked to Write to ADLS, when you select the activity name itself, the Apache Spark applications log for the job opens, as shown in Figure 9.16. When you select the Spark History Server menu item, you are provided some Apache Spark cluster environment metrics similar, but not identical, to that shown in Figure 9.21. Lastly, Apache Spark pools that run in the Azure Synapse Analytics workspace have the standard monitoring features like alerts, metrics, diagnostics settings, and logs.
To find cluster performance data, the recommended option is to use Azure Monitor Log Analytics from the Azure portal. Log Analytics can query data logged from your Diagnostics Settings configuration that focuses on the workspace, clusters, jobs, notebook, and metrics. You can find cluster performance metrics in the Azure Databricks workspace, as shown in Figure 9.21. Job metrics are also available from the Workflows navigation menu. Select a specific job to drill down to the specific details, including time of execution, duration, status, and compute cluster consumption.
Measure Performance of Data Movement
Chapter 2 introduced the hash, round‐robin, and replicated table distributions (refer to Figure 2.10 and Figure 2.11). You also learned about the Data Movement Service (DMS), which helps you optimize the location of data across the nodes so that queries perform as optimally as possible (refer to Figure 2.9). A reason for data movement across nodes and/or tables has to do with skewing, which can result in excess load on one node. If skewing is detected, then configuring a more even distribution of the data across nodes would improve performance. Executing the DBCC command PDW_SHOWSPACEUSED will render the number of rows, reserved disk space, and disk space for all tables on a database or for a single table. For example, the following snippet would be helpful in gathering those insights from the FactREADING table:
DBCC PDW_SHOWSPACEUSED (‘[SQLPool].[brainwaves].[FactREADING]’);
Another DBCC command, PDW_SHOWEXECUTIONPLAN, will show the execution plan for a query on a specific dedicated SQL pool node. This can identify query performance details in general but is also helpful in identifying skewing. You can use that information, in combination with output from a DMV named sys.dm_pdw_dms_workers, to drill into data distributions and determine if one is working harder than another. Comparing the number of rows processed and the duration across the distributions on a node will identify the one or more that are suffering. For example, if one distribution is processing a factor of 10 more rows when compared to other distributions, some data needs to be redistributed. Also, if the same number of rows are queried but one distribution takes much longer than another, there needs to be a reconfiguration or redistribution of the data to resolve those pressure points.