Introduction to Azure Synapse Dedicated SQL pools (formerly SQL DW)
by Johan Åhlén • Updated
Azure Synapse is a super-scalable analytics service for enterprise data warehousing and Big Data analytics. It gives you two options: Dedicated SQL pool or Serverless on-demand resources. This article will describe the Dedicated SQL pool.
On the surface it may look similar to an ordinary SQL Database, but inside the Azure Synapse Dedicated SQL pools is a tremendously powerful Massively Parallel Processing (MPP) engine. Originally there was the SQL Server Parallel Data Warehouse (PDW) applicance, which cost in the $1 million USD range (depending on number of compute nodes). So, with the Azure Synapse Dedicated SQL pools you can get the potential capacity of a multi-million dollar server. First when introduced it was called Azure SQL DW (or SQL Data Warehouse), and you still see that name in some places (for example if you type SELECT @@VERSION).
The figure describes the general architecture. Basically, the SQL queries (from users and applications) comes into the control node. The control node then distributes the query to the compute nodes and assembles the results. Data is transferred between nodes using the Data Movement Service (DMS).
Creating your first Azure Synapse Dedicated SQL Pool
You can easily create a Dedicated Pool like any other resource through the Azure Portal or through Azure Resource Manager. Just search for "sql pool", and it should appear in the list.
The performance depends on the number of DWU units you choose (or cDWU as they are called). The current minimum is 100 DWU, which means you will get one compute node. At 200 DWU you will get two compute nodes (which gives you about twice the performance of a single compute node), and so on. Dedicated SQL pools splits your data into a minimum of 60 distributions (databases), which will get evenly distributed among your compute nodes. The current maximum is 30000 DWUs.
The price is currently around $1.20/hour for 100 DWUs. For 200 DWUs, it will be twice as much, and so on.
You can pause your Dedicated SQL Pool anytime to suspend costs (although you still pay for the storage) and then resume it again when you need it. So you can use it to process a batch of data and then pause it after you have finished. If you still need the data to be readable, you can scale down to 100 DWU:s instead of pausing.
The options are similar to creating an ordinary Azure SQL Database. You need to select a server (or create a new one) and you need to select the performance level.
Note that you easily can change the performance level later. For your first testing, I recommend starting with a minimum of 200 DWU:s, so you get at least 2 compute nodes.
The next step is the Networking. It is probably easiest for you to test it if you choose a public endpoint. You can always change these settings later after the Dedicated SQL Pool has been created.
The most important setting is the SQL pool collation. This setting can't be changed later.
Finally done! Creating the SQL pool should not take more than 10 minutes.
You can connect through the Azure portal, but my preferred way to connect to the Azure Synapse Dedicated SQL Pool is to use Management Studio. You can connect to the SQL Pool just like any database.
To test the connection you can select your database and run this command:
It should return something like "Microsoft Azure SQL Data Warehouse - 10.0.15554.0 Dec 10 2020 03:11:10 Copyright (c) Microsoft Corporation".
Creating your first billion rows
CREATE TABLE dbo.Billion WITH ( DISTRIBUTION = ROUND_ROBIN ,CLUSTERED COLUMNSTORE INDEX ) AS WITH ten(a) AS ( SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 ) SELECT r = 1 FROM ten AS t1 CROSS JOIN ten AS t2 CROSS JOIN ten AS t3 CROSS JOIN ten AS t4 CROSS JOIN ten AS t5 CROSS JOIN ten AS t6 CROSS JOIN ten AS t7 CROSS JOIN ten AS t8 CROSS JOIN ten AS t9;
Why all the "UNION ALL"? The T-SQL language in Dedicated SQL Pools has some limitations compared to single SQL databases. Therefore you can't (at least currently) run recursive CTEs or use a Table Value Constructor.
Note the settings DISTRIBUTION = ROUND_ROBIN and CLUSTERED COLUMNSTORE INDEX. We'll talk about the later.
Check the distribution of your rows
To check the space used and distribution of your rows, run this command:
The first column shows the number of rows from your table. The RESERVED_SPACE tells you how many KBs are consumed. The PDW_NODE_ID tells you which node the data is stored on. Since we used ROUND_ROBIN distribution, the rows will be very equally distributed.
In a production environment, you probably wish to govern resources so they are distributed efficiently between users, ETL-tools, etc. You can govern the resources either from the Azure Portal or by T-SQL. For instance you can create a workload group for your ETL processes.
From the Workload Management page in the Azure Portal, you can also click "View Queries" to see current query activity. From there you can also view the query plans for your queries.
Distribution and table type
When you run the CREATE TABLE AS SELECT statement, you can choose distribution type and table type.
- ROUND_ROBIN (table is spread evenly among the distributions)
- HASH (table is distributed using a hash key)
- REPLICATE (the whole table is copied to all nodes, which means it will take more diskspace)
In most cases you should use HASH for fact tables and REPLICATE for dimension tables. The choice of HASH key is very important.
- CLUSTERED COLUMNSTORE INDEX
- CLUSTERED INDEX
The table types are the same as for ordinary SQL Server databases. HEAPS are usually the fastest for writes. CLUSTERED COLUMNSTORE INDEXES are usually fastest for reads.
Note that the MPP architecture doesn't allow PRIMARY KEYS or FOREIGN KEYS on your tables. Also IDENTITY columns should be avoided if possible.
Creating a fact table
Let's now create a fact table. We can use the previous table (dbo.Billion) as source to get a billing fact rows.
CREATE TABLE dbo.Fact_Sales WITH ( DISTRIBUTION = ROUND_ROBIN ,CLUSTERED COLUMNSTORE INDEX ) AS SELECT SalesID = a ,SalesDate = DATEADD(DAY, b, CAST('2017-01-01' AS DATE)) ,CustomerID = b / 2 ,ChannelID = b / 50 + 1 ,Amount = CAST(b AS BIGINT) FROM dbo.Billion CROSS APPLY ( SELECT a = NEWID() ) AS ca1 CROSS APPLY ( SELECT b = CAST(CAST(a AS BINARY(1)) AS INT) ) AS ca2;
Creating dimension tables
For simplicity, I will only use T-SQL to create our sample dimension tables.
CREATE TABLE dbo.Dim_Customer WITH ( DISTRIBUTION = REPLICATE ,CLUSTERED INDEX (CustomerID) ) AS WITH sixteen(i) AS ( SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10 UNION ALL SELECT 11 UNION ALL SELECT 12 UNION ALL SELECT 13 UNION ALL SELECT 14 UNION ALL SELECT 15 ), twohundredfiftysix(i) AS ( SELECT t1.i * 16 + t2.i FROM sixteen AS t1 CROSS JOIN sixteen AS t2 ) SELECT CustomerID = i ,CustomerName = 'Customer ' + CAST(i AS VARCHAR(3)) FROM twohundredfiftysix;
And the second dimension table:
CREATE TABLE dbo.Dim_Channel WITH ( DISTRIBUTION = REPLICATE ,HEAP ) AS SELECT ChannelID ,ChannelName FROM ( SELECT ChannelID = 1, ChannelName = 'Webshop' UNION ALL SELECT ChannelID = 2, ChannelName = 'Telemarketing' UNION ALL SELECT ChannelID = 3, ChannelName = 'Distributor' UNION ALL SELECT ChannelID = 4, ChannelName = 'Mailorder' UNION ALL SELECT ChannelID = 5, ChannelName = 'VIP store' UNION ALL SELECT ChannelID = 6, ChannelName = 'Educational' ) AS t; DBCC PDW_SHOWSPACEUSED('dbo.Dim_Channel');
The whole list of tables should now look like this. Note that the icon is different on DISTRIBUTED tables than on REPLICATED tables.
Querying your data
You can now try your first query.
SELECT COUNT(*) FROM dbo.Fact_Sales;
Let's also aggregate the Amount column.
SELECT [Amount] = SUM(Amount), [Count] = COUNT(*) FROM dbo.Fact_Sales;
Finally, let's do some joins and GROUP BY.
SELECT CustomerName, ChannelName, [Amount] = SUM(Amount), [Count] = COUNT(*) FROM dbo.Fact_Sales AS s JOIN dbo.Dim_Customer AS cu ON s.CustomerID = cu.CustomerID JOIN dbo.Dim_Channel AS ca ON s.ChannelID = ca.ChannelID GROUP BY cu.CustomerName, ca.ChannelName;
This last query will be a little slows the first time you run it because the replicated tables have to be copied to all nodes.
Using a BI client tool
Azure Synapse Dedicated SQL Pools support most BI tools, since it is mostly compatible with an ordinary SQL Server. It also supports loading the data into SQL Server Analysis Services (SSAS), which you could even run in Direct Query-mode to get real-time data and push the execution of query logic so it gets executed inside the SQL Pool.
With Power BI, it's very simple to connect. You just click Get Data, and then there is a connector in the Azure tab.
There are many Dynamic Management Views (DMVs) for monitoring the progress and status of your queries.
You can show all running queries with this command (it gives the same information you can find in the Azure Portal if you click View queries):
SELECT * FROM sys.dm_pdw_exec_requests;
Or you can get a list of your nodes by using this query:
SELECT * FROM sys.dm_pdw_nodes;
- Azure Synapse Analytics Documentation
- Data loading
- Guidance for distributed tables
- Dynamic Management Views (DMVs)