
In this next post in the series, we will take the data that has been transformed into a machine-readable form in Azure Databricks and store it in Azure SQL Database, ready to be used later for reporting.

If you missed it, here is a link to Part 3.

All code used in this post is available on my GitHub here.

Creating a logical Azure SQL server and database

You can skip this section if it’s not relevant to you; otherwise, read on. We’ll first need to create a logical Azure SQL server to host our database, and then create the database itself.

  1. Open a PowerShell Core terminal, either in Windows Terminal or a standalone PowerShell Core window, as an administrator or with elevated permissions
  2. Install the Az PowerShell module. If this is already installed, then check for updates. Press A and Enter to trust the repository as it downloads and installs the latest modules. This may take a few minutes
    ##Run if Az PowerShell module is not already installed
     Install-Module Az -AllowClobber 
    
    ##Run if Az PowerShell module is already installed but update check is required 
    Update-Module Az
  3. Import the Az PowerShell module into the current session
    Import-Module Az
  4. Authenticate to Azure as shown below. If the tenant or subscription is different from the default associated with your Azure AD identity, it can be specified on the same line of code with the -Tenant or -Subscription switches. To find the tenant GUID, open the Azure Portal, ensure that you are signed into the correct tenant and go to Azure Active Directory; the directory (tenant) GUID should be displayed on the Overview blade. To find the subscription GUID, switch to the correct tenant in the Azure Portal, type Subscriptions into the search bar and press Enter; the GUIDs for all subscriptions that you have access to in that tenant will be displayed.
    Connect-AzAccount
  5. Let’s set some variables that we’ll reuse throughout the process
    $resourceGroup = '<The same resource group that we have been using so far>'
    $serverName = $resourceGroup + '-sql'
    $databaseName = 'reportingDatabase'
    $location = '<The same location all the other resources have been deployed to e.g. uksouth>'
    ## currentUser is needed so we can sign in with our current user via Azure Active Directory authentication
    $currentUser = (Get-AzContext).Account
    $publicIpAddress = (Invoke-WebRequest -uri 'https://api.ipify.org/').Content
    
    ##We'll need to save the connection string in our key vault for accessing the SQL database from Azure Data Factory
    $keyVault = '<The key vault we have been using>'
  6. Next, we’ll need to set the variable we’ll use for the administrator credentials
    ##You'll be prompted to enter a username and password that will be stored in this variable as a secure string
    
    $sqlAdminCredential = Get-Credential
  7. Now that all the variables have been set, we can get started creating the logical SQL server
    New-AzSqlServer -ServerName $serverName -SqlAdministratorCredentials $sqlAdminCredential -Location $location -ServerVersion '12.0' -PublicNetworkAccess 'enabled' -MinimalTlsVersion '1.2' -ExternalAdminName $currentUser -ResourceGroupName $resourceGroup
  8. The server has now been created; next, we need a way of connecting to it from SQL Server Management Studio (SSMS) by creating a server-level firewall rule using the public IP address variable we set earlier. If you are using the public endpoint rather than a private one, also run the second block of code to allow Azure services to connect to your SQL server
    New-AzSqlServerFirewallRule -FirewallRuleName 'AllowMyAccess' -StartIpAddress $publicIpAddress -EndIpAddress $publicIpAddress -ServerName $serverName -ResourceGroupName $resourceGroup
    
    ##Only run if using a public endpoint for your SQL server
    
    New-AzSqlServerFirewallRule -AllowAllAzureIPs -ServerName $serverName -ResourceGroupName $resourceGroup
  9. You should now be able to connect to your server in SSMS using your Azure Active Directory (AAD) credentials. The SQL admin credentials will not work for the later steps, as you will need to be authenticated with AAD to set up service principal access for ADF
  10. Go back to your Windows Terminal and we’ll proceed with creating the database
    ##The max size shown is 2 GB, which at the time of writing is the smallest Azure SQL Database offering in the Basic tier of the DTU pricing model
    New-AzSqlDatabase -DatabaseName $databaseName -ResourceGroupName $resourceGroup -ServerName $serverName -BackupStorageRedundancy 'Local' -Edition 'Basic' -LicenseType 'BasePrice' -MaxSizeBytes 2147483648
  11. If you go back to your SSMS object explorer and refresh, you should see the newly created database. It should also appear in the Azure Portal in your resource group

Assigning the correct permissions for the Data Factory managed identity on the SQL Server/Database

  1. Open a new query in the context of the master database and run below
    --1. Create SQL Server login for the service principal for the Azure Data Factory instance in the context of the master database
    
    CREATE LOGIN [<Name of our data factory>] FROM EXTERNAL PROVIDER
  2. Open a new query in the context of the reportingDatabase database
    --2. Create database user for the login that was just created in the context of the reportingDatabase
    CREATE USER [<Name of our data factory>] FROM LOGIN [<Name of our data factory>]
    
    --3. Add database user to database reader and writer roles 
    ALTER ROLE [db_datareader] ADD MEMBER [<Name of our data factory>]
    ALTER ROLE [db_datawriter] ADD MEMBER [<Name of our data factory>]
    
    --4. Create database role to execute stored procedures that we'll need to merge in the data on an ongoing basis to avoid data duplication
    CREATE ROLE [db_spexector]
    GRANT EXECUTE TO [db_spexector]
    ALTER ROLE [db_spexector] ADD MEMBER [<Name of our data factory>]
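
If you want to confirm the permissions took effect, the following query is a minimal check (assuming you substitute your Data Factory name) that lists the database roles the new user belongs to; it should return db_datareader, db_datawriter and db_spexector.

    --Optional check: list the database roles the Data Factory user is a member of (run in the context of reportingDatabase)
    SELECT dp.[name] AS [Database_User], r.[name] AS [Role_Name]
    FROM [sys].[database_role_members] drm
    JOIN [sys].[database_principals] dp ON drm.[member_principal_id] = dp.[principal_id]
    JOIN [sys].[database_principals] r ON drm.[role_principal_id] = r.[principal_id]
    WHERE dp.[name] = '<Name of our data factory>'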

Storing the connection string in Azure Key Vault

  1. Go back to your PowerShell session
  2. Run the below command to save the connection string in your key vault
    ##Save the connection string to our SQL database in Azure Key Vault
    $connectionString = ConvertTo-SecureString -String ('Data Source=' + $serverName + '.database.windows.net;Initial Catalog=' + $databaseName + ';') -AsPlainText
    
    Set-AzKeyVaultSecret -VaultName $keyVault -Name 'sql-connection-string' -SecretValue $connectionString

Adding the Azure SQL Database as a linked service in Azure Data Factory

  1. Open your previously created Data Factory
  2. Click on the Azure Data Factory Manage icon (the toolbox with a spanner)
  3. Click on the Linked services blade
  4. Click on + New
  5. Choose Azure SQL Database
  6. Give your linked service a name such as LS_Reporting_SQL
  7. Change the connection type to be Azure Key Vault
  8. Select the key vault linked service we created earlier
  9. Choose the secret we just created in PowerShell
  10. Ensure the Authentication type is set to Managed Identity, as the connection string we stored contains no credentials and we granted the Data Factory’s managed identity access to the database earlier
  11. Click Test connection to ensure Data Factory can connect to your database. If not, make sure that the relevant firewall rules are in place on the SQL server to allow access from Data Factory
  12. Click Create

Populating the database with dimension tables and creating staging/fact tables

  1. In the case of our fact dataset, there are some dimension tables we would want to import to help slice the data: DimDate and DimHTTPCode. Here’s the code that I will use to create and populate these tables
    --Run in the context of reportingDatabase
    --1. Create DimDate table
    
    CREATE TABLE [dbo].[DimDate] 
    (
        [Date_Value] DATE NOT NULL,
        [Day_Number_Month] TINYINT NOT NULL,
        [Day_Number_Week] TINYINT NOT NULL,
        [Week_Number_CY] TINYINT NOT NULL,
        [Month_Number_CY] TINYINT NOT NULL,
        [Quarter_Number_CY] TINYINT NOT NULL,
        [Day_Name_Long] NVARCHAR(9) NOT NULL,
        [Day_Name_Short] NCHAR(3) NOT NULL,
        [Month_Name_Long] NVARCHAR(9) NOT NULL,
        [Month_Name_Short] NCHAR(3) NOT NULL,
        [Quarter_Name_CY] NCHAR(2) NOT NULL,
        [Year_Quarter_Name_CY] NVARCHAR(7) NOT NULL,
        [IsWeekday] BIT NOT NULL,
        [IsWeekend] BIT NOT NULL,
    
    CONSTRAINT [PK_DimDate] PRIMARY KEY CLUSTERED 
    (
        [Date_Value] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
    ) ON [PRIMARY]
    GO
    
    --Populate DimDate
    
    DECLARE @StartDate AS DATE
    SET @StartDate = '2020-01-01'
    
    DECLARE @EndDate AS DATE
    SET @EndDate = '2025-12-31'
    
    DECLARE @LoopDate AS DATE
    SET @Loopdate = @StartDate
    
    --Sets Monday to be the first day of the week
    SET DATEFIRST 1
    
    WHILE @Loopdate <= @EndDate
    BEGIN
        INSERT INTO [dbo].[DimDate]
        VALUES (
            -- [Date_Value]
            @LoopDate,
            --[Day_Number_Month]
            DATEPART(DAY, @LoopDate),
            -- [Day_Number_Week]
            DATEPART(WEEKDAY, @LoopDate),
            -- [Week_Number_CY]		
            DATEPART(WEEK, @LoopDate),
            -- [Month_Number_CY]
            MONTH(@LoopDate),
            --[Quarter_Number_CY]
            DATEPART(QUARTER, @LoopDate),
            -- [Day_Name_Long]
            DATENAME(WEEKDAY, @LoopDate),
            --[Day_Name_Short] 
            FORMAT(@LoopDate, 'ddd'),
            -- [Month_Name_Long]
            DATENAME(MONTH, @LoopDate),
            -- [Month_Name_Short]
            FORMAT(@LoopDate, 'MMM'),
            -- [Quarter_Name_CY]
            CONCAT ('Q',DATEPART(QUARTER, @LoopDate)),
            -- [Year_Quarter_Name_CY]
            CONCAT (YEAR(@LoopDate),'-Q',DATEPART(QUARTER, @LoopDate)),
            --[IsWeekday]
            CASE WHEN DATEPART(WEEKDAY, @LoopDate) IN (1,2,3,4,5) THEN 1 ELSE 0 END,
            -- [IsWeekend]
            CASE WHEN DATEPART(WEEKDAY, @LoopDate) IN (6,7) THEN 1 ELSE 0 END
            )
        SET @Loopdate = DATEADD(DAY, 1, @LoopDate)
    END
    
    --Run in the context of reportingDatabase
    --1.Create DimHttpCode table
    CREATE TABLE [dbo].[DimHTTPCode] (
        
        [Http_Code_Key] INT IDENTITY(1,1) NOT NULL,
        [Http_Error_Code] INT NOT NULL,
        [Http_Error_Code_Description] NVARCHAR(150) NOT NULL,
    
    CONSTRAINT [PK_DimHTTPCode] PRIMARY KEY CLUSTERED 
    (
        [Http_Code_Key] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
    ) ON [PRIMARY]
    
    --2.Populate DimHttpCode table
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('100','Continue')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('101','Switching Protocols')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('102','Processing')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('103','Early Hints')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('200','OK')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('201','Created')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('202','Accepted')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('203','Non-Authoritative Information')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('204','No Content')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('205','Reset Content')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('206','Partial Content')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('207','Multi-Status')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('208','Already Reported')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('226','IM Used')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('300','Multiple Choices')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('301','Moved Permanently')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('302','Found')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('303','See Other')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('304','Not Modified')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('305','Use Proxy')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('306','(Unused)')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('307','Temporary Redirect')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('308','Permanent Redirect')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('400','Bad Request')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('401','Unauthorized')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('402','Payment Required')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('403','Forbidden')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('404','Not Found')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('405','Method Not Allowed')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('406','Not Acceptable')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('407','Proxy Authentication Required')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('408','Request Timeout')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('409','Conflict')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('410','Gone')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('411','Length Required')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('412','Precondition Failed')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('413','Content Too Large')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('414','URI Too Long')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('415','Unsupported Media Type')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('416','Range Not Satisfiable')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('417','Expectation Failed')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('418','(Unused)')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('421','Misdirected Request')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('422','Unprocessable Content')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('423','Locked')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('424','Failed Dependency')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('425','Too Early')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('426','Upgrade Required')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('427','Unassigned')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('428','Precondition Required')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('429','Too Many Requests')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('430','Unassigned')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('431','Request Header Fields Too Large')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('451','Unavailable For Legal Reasons')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('500','Internal Server Error')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('501','Not Implemented')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('502','Bad Gateway')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('503','Service Unavailable')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('504','Gateway Timeout')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('505','HTTP Version Not Supported')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('506','Variant Also Negotiates')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('507','Insufficient Storage')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('508','Loop Detected')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('509','Unassigned')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('510','Not Extended (OBSOLETED)')
    INSERT INTO [dbo].[DimHTTPCode] ([Http_Error_Code], [Http_Error_Code_Description]) VALUES ('511','Network Authentication Required')

     

  2. Let’s create the staging and fact tables
    --Run in the context of reportingDatabase
    
    CREATE TABLE [dbo].[staging_factHubnetCloudWebsiteStats](
        [timeGenerated] DATETIME2 NULL,
        [userAction] NVARCHAR(500) NULL,
        [appUrl] NVARCHAR(500) NULL,
        [successFlag] BIT NULL,
        [httpResultCode] INT NULL,
        [durationOfRequestMs] FLOAT NULL,
        [clientType] NVARCHAR(50) NULL,
        [clientOS] NVARCHAR(50) NULL,
        [clientCity] NVARCHAR(50) NULL,
        [clientStateOrProvince] NVARCHAR(50) NULL,
        [clientCountryOrRegion] NVARCHAR(50) NULL,
        [clientBrowser] NVARCHAR(50) NULL,
        [appRoleName] NVARCHAR(50) NULL,
        [requestDate] DATE NULL,
        [requestHour] TINYINT NULL,
        [adfPipelineRunId] NVARCHAR(50) NOT NULL,
        [adfCopyTimestamp] DATETIME2 NOT NULL
    )
    
    ALTER TABLE [dbo].[staging_factHubnetCloudWebsiteStats] ADD DEFAULT GETUTCDATE() FOR [adfCopyTimestamp]
    
    --Run in the context of reportingDatabase
    
    CREATE TABLE [dbo].[FactHubnetCloudWebsiteStats](
        [Request_Timestamp_UTC] DATETIME2 NULL,
        [Request_Date_UTC] DATE NULL,
        [Request_Hour_UTC] TINYINT NULL,
        [Request_User_Action] NVARCHAR(500) NULL,
        [Request_App_URL] NVARCHAR(500) NULL,
        [Request_Success_Flag] BIT NULL,
        [Request_HTTP_Code] INT NULL,
        [Request_Duration_Milliseconds] FLOAT NULL,
        [Request_Client_Type] NVARCHAR(50) NULL,
        [Request_Client_OS] NVARCHAR(50) NULL,
        [Request_Client_Browser] NVARCHAR(50) NULL,
        [Request_Client_City] NVARCHAR(50) NULL,
        [Request_Client_State_Or_Province] NVARCHAR(50) NULL,
        [Request_Client_Country_Or_Region] NVARCHAR(50) NULL,
        [Request_App_Role_Name] NVARCHAR(50)
    )
    GO
    
    ALTER TABLE [dbo].[FactHubnetCloudWebsiteStats] ADD CONSTRAINT [FK_FactHubnetCloudWebsiteStats_DimDate] FOREIGN KEY([Request_Date_UTC])
    REFERENCES [dbo].[DimDate] ([Date_Value])
    GO
    
    ALTER TABLE [dbo].[FactHubnetCloudWebsiteStats] ADD CONSTRAINT [FK_FactHubnetCloudWebsiteStats_DimHTTPCode] FOREIGN KEY([Request_HTTP_Code])
    REFERENCES [dbo].[DimHTTPCode] ([Http_Code_Key])
    GO
  3. If we were storing the data in a Kimball data warehouse, we would want to ensure that the schema conforms to a star schema. We have partially followed that by adding the date and HTTP code dimensions and implementing foreign key relationships. To follow the Kimball methodology completely, we would also create dimensions for other data points such as the user action type, client OS, client type and client city/state/country. This optimises data storage as much as possible: it takes much less space in the fact table to store an integer foreign key than a string value, and SQL Server also performs much better when joining on integer values compared to strings. We also would not allow nullable values in the final fact table. A sketch of what one of those extra dimensions could look like is shown below
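
As an illustration only (this is not part of the pipeline we are building; DimClientOS and Request_Client_OS_Key are hypothetical names), an extra dimension and its surrogate key column in the fact table might look something like this:

    --Illustrative sketch of an additional dimension if fully following the Kimball approach
    CREATE TABLE [dbo].[DimClientOS] (
        [Client_OS_Key] INT IDENTITY(1,1) NOT NULL,
        [Client_OS_Name] NVARCHAR(50) NOT NULL,
    CONSTRAINT [PK_DimClientOS] PRIMARY KEY CLUSTERED ([Client_OS_Key] ASC)
    )
    
    --The fact table would then store the integer surrogate key instead of the NVARCHAR(50) value
    --ALTER TABLE [dbo].[FactHubnetCloudWebsiteStats] ADD [Request_Client_OS_Key] INT NULL
    --ALTER TABLE [dbo].[FactHubnetCloudWebsiteStats] ADD CONSTRAINT [FK_FactHubnetCloudWebsiteStats_DimClientOS]
    --    FOREIGN KEY ([Request_Client_OS_Key]) REFERENCES [dbo].[DimClientOS] ([Client_OS_Key])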

Creating the stored procedure to merge staging data into the fact table

In ADF, we could simply have a Copy data activity that transfers the cleaned data produced by Databricks straight into the fact table. However, there is a large risk of data duplication depending on how often your ADF pipelines are executed and which time period has been specified in your Log Analytics function. Instead, we are going to create a stored procedure that selects any records newly copied to the staging table in the current pipeline run, joins that dataset to the relevant tables/columns (in this case just DimHTTPCode) to obtain the foreign/surrogate key values to be used in the fact table, and renames columns so they match our final table schema to make for cleaner code.

--Run in the context of reportingDatabase

CREATE PROCEDURE [dbo].[sp_MergeWebsiteStatsToLive]
    @adfPipelineRunId NVARCHAR(36)
AS

--Select all of our newly added records into a temp table and join them to the correct records in the dimension tables where required.
--In this case, DimHTTPCode is the only table we want to return a value from, as we are pulling the primary/surrogate key column from that table instead
--of writing the HTTP code value explicitly each time. We are not doing any joins to DimDate as it is a 1-1 join with the same value on each
--side of the relationship. We will also rename columns in the temp table so they match the schema of the final fact table

SELECT 
STG.[timeGenerated] AS [Request_Timestamp_UTC],
STG.[requestDate] AS [Request_Date_UTC],
STG.[requestHour] AS [Request_Hour_UTC],
STG.[userAction] AS [Request_User_Action],
STG.[appUrl] AS [Request_App_URL],
STG.[successFlag] AS [Request_Success_Flag],
DHC.[Http_Code_Key] AS [Request_HTTP_Code],
STG.[durationOfRequestMs] AS [Request_Duration_Milliseconds],
STG.[clientType] AS [Request_Client_Type],
STG.[clientOS] AS [Request_Client_OS],
STG.[clientBrowser] AS [Request_Client_Browser],
STG.[clientCity] AS [Request_Client_City],
STG.[clientStateOrProvince] AS [Request_Client_State_Or_Province],
STG.[clientCountryOrRegion] AS [Request_Client_Country_Or_Region],
STG.[appRoleName] AS [Request_App_Role_Name]
INTO #WebsiteStatsNewRecords
FROM [dbo].[staging_factHubnetCloudWebsiteStats] STG 
LEFT JOIN [dbo].[DimHTTPCode] DHC ON STG.[httpResultCode] = DHC.[Http_Error_Code] 
WHERE STG.[adfPipelineRunId] = @adfPipelineRunId

--Now we're going to merge in any records that do not already exist in the fact table so we avoid data duplication

MERGE INTO [dbo].[FactHubnetCloudWebsiteStats] DST
USING #WebsiteStatsNewRecords SRC
ON DST.[Request_Timestamp_UTC] = SRC.[Request_Timestamp_UTC]
AND DST.[Request_Date_UTC] = SRC.[Request_Date_UTC]
AND DST.[Request_Hour_UTC] = SRC.[Request_Hour_UTC]
AND DST.[Request_User_Action] = SRC.[Request_User_Action]
AND DST.[Request_App_URL] = SRC.[Request_App_URL]
AND DST.[Request_Success_Flag] = SRC.[Request_Success_Flag]
AND DST.[Request_HTTP_Code] = SRC.[Request_HTTP_Code]
AND DST.[Request_Duration_Milliseconds] = SRC.[Request_Duration_Milliseconds]
AND DST.[Request_Client_Type] = SRC.[Request_Client_Type]
AND DST.[Request_Client_OS] = SRC.[Request_Client_OS]
AND DST.[Request_Client_Browser] = SRC.[Request_Client_Browser]
AND DST.[Request_Client_City] = SRC.[Request_Client_City]
AND DST.[Request_Client_State_Or_Province] = SRC.[Request_Client_State_Or_Province]
AND DST.[Request_Client_Country_Or_Region] = SRC.[Request_Client_Country_Or_Region]
AND DST.[Request_App_Role_Name] = SRC.[Request_App_Role_Name]
WHEN NOT MATCHED BY TARGET THEN INSERT
(
[Request_Timestamp_UTC],
[Request_Date_UTC],
[Request_Hour_UTC],
[Request_User_Action],
[Request_App_URL],
[Request_Success_Flag],
[Request_HTTP_Code],
[Request_Duration_Milliseconds],
[Request_Client_Type],
[Request_Client_OS],
[Request_Client_Browser],
[Request_Client_City],
[Request_Client_State_Or_Province],
[Request_Client_Country_Or_Region],
[Request_App_Role_Name]
)
VALUES
(
SRC.[Request_Timestamp_UTC],
SRC.[Request_Date_UTC],
SRC.[Request_Hour_UTC],
SRC.[Request_User_Action],
SRC.[Request_App_URL],
SRC.[Request_Success_Flag],
SRC.[Request_HTTP_Code],
SRC.[Request_Duration_Milliseconds],
SRC.[Request_Client_Type],
SRC.[Request_Client_OS],
SRC.[Request_Client_Browser],
SRC.[Request_Client_City],
SRC.[Request_Client_State_Or_Province],
SRC.[Request_Client_Country_Or_Region],
SRC.[Request_App_Role_Name]
);

--Some logic to control the size of your staging table by only keeping limited history, in this case, 1 week

DELETE FROM [dbo].[staging_factHubnetCloudWebsiteStats]
WHERE [adfCopyTimestamp] < DATEADD(WEEK,-1,GETUTCDATE())
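
When ADF calls this procedure later, it will pass its own pipeline run ID as the parameter. For reference, the equivalent manual call from SSMS would look like the sketch below (run it as a separate batch; the GUID is a placeholder, not a real run ID):

EXEC [dbo].[sp_MergeWebsiteStatsToLive] @adfPipelineRunId = '00000000-0000-0000-0000-000000000000'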

Implementing the final copy and merge logic into your ADF pipeline

We’re almost ready to implement the logic in our ADF pipeline; this is the final stretch, if you will, of this part of the process. First, we need to create some new datasets

  1. Open up the authoring mode in ADF Studio
  2. Go to Datasets
  3. Click the plus to add new
  4. Select Azure Data Lake Storage Gen 2 as the data store and click Continue
  5. Set the format to be Parquet (We are using Parquet as it is much more efficient than JSON/CSV at compressing datasets)
  6. Name your dataset DS_Datalake_Parquet
  7. Click OK
  8. Now we’re going to parametrise the file path so it can be used by multiple pipelines
  9. Go to the Parameters tab
  10. Create a new string parameter called container
  11. Create another string parameter called folderPath
  12. Create another string parameter called fileName
  13. Go back to the Connection tab and assign the container parameter to the File system field with the following dynamic content code
    @dataset().container
  14. Assign the folderPath parameter to the Directory field using the following dynamic content code
    @dataset().folderPath
  15. Assign the fileName parameter to the File field using the following dynamic content code
    @dataset().fileName
  16. Create another Parquet dataset, following the same instructions as before, called DS_Datalake_Parquet_Wildcard
  17. Go to the Parameters tab
  18. Create a new string parameter called container
  19. Create another string parameter called folderPath
  20. Go back to the Connection tab and assign the container parameter to the File system field with the following dynamic content code
    @dataset().container
  21. Assign the folderPath parameter to the Directory field using the following dynamic content code
    @dataset().folderPath
  22. Create another dataset
  23. Select Azure SQL Database as the data store and click Continue
  24. Set the name to be DS_Azure_SQL_Staging_Website_Stats
  25. Set the table name to be your staging table that you created earlier
  26. Open up your pipeline in ADF
  27. We need to set a new pipeline variable called outputFolderPath
    • We can set this in the Variables tab at the bottom of the authoring window as before
    • Set the type to be String
  28. Add a new Set variable activity and have it run after successful execution of your Databricks notebook activity. Name it Set output folder path in the General tab
  29. The value we want to give will be an expression; populate it using the Add dynamic content link that appears when you click into the value text field
    • @activity('Transform Source Data').output.runOutput
  30. Afterwards, we need to get the list of files in the folder specified in the variable we just set. To do this, drag a Get Metadata activity into your pipeline and have it as a successor to Set output folder path
  31. Call this activity Get files in output folder in the General tab
  32. In the Settings tab, set the Dataset to be DS_Datalake_Parquet_Wildcard
  33. For the container parameter we will pass in the text value ‘loganalytics’ as this is the name of our container in the datalake
  34. For the folderPath parameter we will pass in the following dynamic content value
    @variables('outputFolderPath')
  35. Under Field list, click New
  36. Choose Child items
  37. Next we need to add a Filter activity to our pipeline as a successor to Get files in output folder so we only show parquet files. This should be under the Iteration & conditionals header
  38. Name this activity Filter for parquet files
  39. In the Items field, assign the following dynamic content value
    @activity('Get files in output folder').output.childItems
  40. In the Condition field, assign the following dynamic content value
    @endswith(item().name,'snappy.parquet')
  41. Sometimes the datasets produced by Databricks are so large that multiple parquet files are produced. To account for this, we need to add a ForEach activity from under the Iteration & conditionals header rather than a copy data activity. This should be a successor to the Filter for parquet files activity
  42. Name this ForEach activity Copy to Staging Table in the General tab
  43. Go to the Settings tab and tick the Sequential check box
  44. In the Items field enter the following dynamic content
    @activity('Filter for parquet files').output.value
  45. Go to the Activities tab and click on the pencil icon
  46. Drag a Copy data activity from the Move & transform header into the canvas
  47. Name this activity Copy to Azure SQL in the General tab
  48. In the Source tab, set the Source dataset to be DS_Datalake_Parquet
  49. For the container parameter set this to be the static string ‘loganalytics’
  50. For the folderPath parameter set this to have the following dynamic content
    @variables('outputFolderPath')
  51. For the fileName parameter set this to have the following dynamic content
    @item().name
  52. Go to the Sink tab and set this to be DS_Azure_SQL_Staging_Website_Stats
  53. Go to the Mapping tab
  54. Click on Import schemas (this assumes you ran a Debug on the pipeline after implementing Databricks; if not, place a breakpoint on the Databricks activity, debug the pipeline, then remove the breakpoint once completed)
  55. Open Azure Storage Explorer, either in your browser or the desktop application
  56. Locate the parquet file from the output of the Databricks run and copy its path
  57. Paste the folder path into the @variables('outputFolderPath') field, ensuring that you leave a trailing /
  58. Paste the filename into the @item().name field and click OK
  59. All the columns should be mapped apart from the adfPipelineRunId and adfCopyTimestamp columns
  60. Ensure the mapping for adfPipelineRunId is set to the pipelineRunId column
  61. Remove the mapping for adfCopyTimestamp as that column has a default value already
  62. Go back to your main pipeline
  63. Add a Stored procedure activity from the General header as a successor to the Copy to Staging Table ForEach loop
  64. Name this activity Merge staging to fact table in the General tab
  65. In the Settings tab, set the Linked service field to be LS_Reporting_SQL, the linked service we created earlier
  66. Set the stored procedure to be the one we created earlier, in this case sp_MergeWebsiteStatsToLive
  67. Add a new stored procedure parameter of type String called adfPipelineRunId (or whatever you specified when creating your stored procedure)
  68. Add the following dynamic content as the value
    @pipeline().RunId
  69. Click Publish at the top of the authoring window to save your changes
  70. Click on Debug
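
Once the Debug run completes successfully, a quick check in SSMS (a minimal sketch, assuming the table names used above) should confirm that rows have landed in the staging table and been merged into the fact table:

    --Run in the context of reportingDatabase after the Debug run completes
    SELECT TOP (10) * FROM [dbo].[staging_factHubnetCloudWebsiteStats] ORDER BY [adfCopyTimestamp] DESC
    
    SELECT COUNT(*) AS [Fact_Row_Count] FROM [dbo].[FactHubnetCloudWebsiteStats]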

 

Summary

Screenshot of complete ADF pipeline with all steps succeeding

There we have it: we have now copied our data into the Azure SQL database using ADF, ready for reporting.

And with that, this concludes the series on extracting data from Log Analytics.