Reading Time: 9 minutes

You’ve heard of AdventureWorks right? The sample SQL Server database Microsoft have published as well as make it available as the sample database for use with Azure SQL Database. A few months ago I had the idea to make my own sample dataset that I could showcase for further projects I have planned. It’s taken a little longer than planned to complete due to life getting in the way but it’s finally ready.

This blog post will cover the creation process at a high level, for more technical information and source code, please check out my GitHub repository

First, let’s introduce the case study for this dataset, WingIt Airlines. WingIt Airlines is a fictitious medium-sized long-haul airline operating flights between its two bases in San Diego, California and London Heathrow, United Kingdom. Their main corporate offices are in Los Angeles, California. There is a requirement to produce a reporting database that can give both internal and external business stakeholders more insights into how the airline is performing. For the case of simplicity, the schema of the database will look something like the below:

Schema for the WingItAirlines-Reporting Database

Schema created using Microsoft Visio

Using Visual Studio SQL Database Projects, I replicated the above schema into a DACPAC solution, of which you can download from the repository link at the top of this page. Having the schema is all well and good but to make it even more realistic, it is important to enforce constraints such as foreign keys, check constraints, default values etc. These are all replicated in the DACPAC. I have also written scalar functions to perform activities such as calculating how much commission the agency will take from each booking, checking to see if there is sufficient capacity before the transaction is committed to the database. Making this was generally pretty straight forward and picked up knowledge on using scalar functions and different ways of calculating values in columns that I didn’t know before. In the airports table, I’ve also made use of the GEOGRAPHY data type to store the exact location of each airport calculated using longitude and latitude to calculate distances of flights. You can check out the source code for this here.

The schema has been created, we now need to populate it. We could have some T-SQL scripts that are executed manually in a tool such as SQL Server Management Studio (SSMS) but one of the challenges I set myself for this project to make the dataset generation process as automated as possible.

How did I do this? I still created the same T-SQL scripts as before but now they are executed using PowerShell with the SqlServer module installed. You may also need to install the NuGet package provider in addition if it is not already available.

You can get this from the PowerShell gallery

Install-Module SqlServer

You will also need to create a SQLAuth user with the correct permissions on the database; the user should be a member of db_spexecutor, db_datareader and db_datawriter. The scripts also assume that your SQL instance is the default instance i.e. MSSQLSERVER so if you’re using a named instance then will need to alter the value being passed into the -ServerInstance parameter accordingly.

First we run the lookup data population script with

Set-Location '<path of cloned repository>\sample-datasets\WingItAirlines-Reporting\Data Population Scripts'
.\PopulateLookupData.ps1

This will prompt you for username and password for your SQL Authentication user and then the script will populate the following tables. You can change the date range of the FlightSchedule script to create a smaller or larger dataset.

  • Agency
  • AgencyUser
  • AgencyCommission
  • Airplane
  • Airport
  • PassengerFareRate
  • TicketStatus
  • TicketType
  • Route
  • FlightSchedule

That should take a couple of minutes to complete

Next, we are going to populate the fact data with

.\CreateBulkBookings.ps1

This will prompt you for username and password for your SQL Authentication user and then the script will populate the TicketSale table

Let’s take a look at what the script does:

$databaseName = 'WingItAirlines-Reporting'

# This is generally dependent on how much compute resource your host machine has
$maxConcurrentJobs = 2
$sqlServerCredential = Get-Credential

# Get the id of the agency user that sells business class tickets
$businessClassAgencyUser = `
Invoke-Sqlcmd `
-Credential $sqlServerCredential `
-Database $databaseName `
-Query "SELECT TOP 1`
AU.[Agency_User_Id] `
FROM [dbo].[AgencyUser] AU  `
INNER JOIN [dbo].[Agency] A ON AU.[Agency_Id] = A.[Agency_Id] `
WHERE A.[Agency_Name] = 'Christopher Columbus' "`
-ServerInstance $env:COMPUTERNAME 

$businessClassAgencyUser = $businessClassAgencyUser.Agency_User_Id

# Get the id of the ticket type that is business class
$businessClassTicketType = `
Invoke-Sqlcmd `
-Credential $sqlServerCredential `
-Database $databaseName `
-Query "SELECT `
[Ticket_Type_Id] `
FROM [dbo].[TicketType]  `
WHERE [Ticket_Type] = 'Business Class' "`
-ServerInstance $env:COMPUTERNAME 

$businessClassTicketType = $businessClassTicketType.Ticket_Type_Id

# Get the id of the agency user that sells economy class tickets
$economyClassAgencyUser = `
Invoke-Sqlcmd `
-Credential $sqlServerCredential `
-Database $databaseName `
-Query "SELECT TOP 1`
AU.[Agency_User_Id] `
FROM [dbo].[AgencyUser] AU  `
INNER JOIN [dbo].[Agency] A ON AU.[Agency_Id] = A.[Agency_Id] `
WHERE A.[Agency_Name] = 'Sunchasers Ltd' "`
-ServerInstance $env:COMPUTERNAME 

$economyClassAgencyUser = $economyClassAgencyUser.Agency_User_Id

# Get the id of the ticket type that is economy class
$economyClassTicketType = `
Invoke-Sqlcmd `
-Credential $sqlServerCredential `
-Database $databaseName `
-Query "SELECT `
[Ticket_Type_Id] `
FROM [dbo].[TicketType]  `
WHERE [Ticket_Type] = 'Economy Class' "`
-ServerInstance $env:COMPUTERNAME 

$economyClassTicketType = $economyClassTicketType.Ticket_Type_Id

# Get all flights that don't have any bookings in TicketSales. This was an added feature to continue where you left off so you don't have to delete everything and start from the beginning
$flightSchedule = `
Invoke-Sqlcmd `
-Credential $sqlServerCredential `
-Database $databaseName `
-OutputAs DataTables `
-Query "SELECT `
[Flight_Schedule_Id] `
FROM [dbo].[FlightSchedule] `
WHERE NOT EXISTS ( `
SELECT `
[Flight_Schedule_Id] `
FROM [dbo].[TicketSale])" `
-ServerInstance $env:COMPUTERNAME

$flightSchedule = $flightSchedule.Flight_Schedule_Id

# Get the id of the agency user that sells premium economy class tickets
$premiumEconomyClassAgencyUser = `
Invoke-Sqlcmd `
-Credential $sqlServerCredential `
-Database $databaseName `
-Query "SELECT TOP 1`
AU.[Agency_User_Id] `
FROM [dbo].[AgencyUser] AU  `
INNER JOIN [dbo].[Agency] A ON AU.[Agency_Id] = A.[Agency_Id] `
WHERE A.[Agency_Name] = 'Suntours Vacaction LLC' "`
-ServerInstance $env:COMPUTERNAME 

$premiumEconomyClassAgencyUser = $premiumEconomyClassAgencyUser.Agency_User_Id

# Get the id of the ticket type that is premium economy class
$premiumEconomyClassTicketType = `
Invoke-Sqlcmd `
-Credential $sqlServerCredential `
-Database $databaseName `
-Query "SELECT `
[Ticket_Type_Id] `
FROM [dbo].[TicketType]  `
WHERE [Ticket_Type] = 'Premium Economy Class' "`
-ServerInstance $env:COMPUTERNAME 

$premiumEconomyClassTicketType = $premiumEconomyClassTicketType.Ticket_Type_Id


# For each flight in the flightschedule list where bookings do not already exist
foreach($flight in $flightSchedule)
{
    # Check to see if the are any background jobs running
    $running = @(Get-Job -State Running)
    {
        # Check to see if the maximum background job count threshold has been met, if it has, wait for a slot to be freed
        if($running.Count -ge $maxConcurrentJobs)
        {
            $null = $running | Wait-Job Any
        }
            # Start a background job to create the bookings for the flight using values obtained from previous queries
            Start-Job -Name CreateBulkBookings -ArgumentList $businessClassAgencyUser, $businessClassTicketType, $databaseName, $economyClassAgencyUser, $economyClassTicketType, $flight, $premiumEconomyClassAgencyUser, $premiumEconomyClassTicketType, $sqlServerCredential -ScriptBlock {
                param
                (
                    [int]$businessClassAgencyUser,
                    [int]$businessClassTicketType,
                    [string]$databaseName,
                    [int]$economyClassAgencyUser,
                    [int]$economyClassTicketType,
                    [int]$flight,
                    [int]$premiumEconomyClassAgencyUser,
                    [int]$premiumEconomyClassTicketType,
                    [pscredential]$sqlServerCredential
                )
                
                # Define a function that can get random dates between two given dates
                function Get-RandomDateBetween{
                    [Cmdletbinding()]
                    param(
                        [parameter(Mandatory=$True)][DateTime]$StartDate,
                        [parameter(Mandatory=$True)][DateTime]$EndDate
                        )
                
                    process{
                       return Get-Random -Minimum $StartDate.Ticks -Maximum $EndDate.Ticks | Get-Date -Format "dd/MM/yyyy HH:mm:ss"
                    }
                }
                
                # Define a function that can get random times between two given dates
                function Get-RandomTimeBetween{
                       [Cmdletbinding()]
                      param(
                          [parameter(Mandatory=$True)][string]$StartTime,
                          [parameter(Mandatory=$True)][string]$EndTime
                          )
                      begin{
                          $minuteTimeArray = @("00","15","30","45")
                      }    
                      process{
                          $rangeHours = @($StartTime.Split(":")[0],$EndTime.Split(":")[0])
                          $hourTime = Get-Random -Minimum $rangeHours[0] -Maximum $rangeHours[1]
                          $minuteTime = "00"
                          if($hourTime -ne $rangeHours[0] -and $hourTime -ne $rangeHours[1]){
                              $minuteTime = Get-Random $minuteTimeArray
                              return "${hourTime}:${minuteTime}"
                          }
                          elseif ($hourTime -eq $rangeHours[0]) { # hour is the same as the start time so we ensure the minute time is higher
                              $minuteTime = $minuteTimeArray | ?{ [int]$_ -ge [int]$StartTime.Split(":")[1] } | Get-Random # Pick the next quarter
                              #If there is no quarter available (eg 09:50) we jump to the next hour (10:00)
                              return (.{If(-not $minuteTime){ "${[int]hourTime+1}:00" }else{ "${hourTime}:${minuteTime}" }})               
                           
                          }else { # hour is the same as the end time
                              #By sorting the array, 00 will be pick if no close hour quarter is found
                              $minuteTime = $minuteTimeArray | Sort-Object -Descending | ?{ [int]$_ -le [int]$EndTime.Split(":")[1] } | Get-Random
                              return "${hourTime}:${minuteTime}"
                          }
                      }
                  }
                
                # Get the aircraft allocated to the flight, this is needed to work out whether the booking can be completed by checking the capacity of already sold tickets
                $airplaneFlight = `
                Invoke-SqlCmd `
                -Credential $sqlServerCredential `
                -Database $databaseName `
                -Query "SELECT `
                [Airplane_Id] `
                FROM [dbo].[FlightSchedule] `
                WHERE [Flight_Schedule_Id] = $flight" `
                -ServerInstance $env:COMPUTERNAME
            
                $airplaneFlight = $airplaneFlight.Airplane_Id
            
                # Get the datetime of when the flight is scheduled to depart
                $dateTimeFlight = `
                Invoke-SqlCmd `
                -Credential $sqlServerCredential `
                -Database $databaseName `
                -Query "SELECT `
                [Scheduled_Date_Time_Of_Departure_UTC] `
                FROM [dbo].[FlightSchedule] `
                WHERE [Flight_Schedule_Id] = $flight" `
                -ServerInstance $env:COMPUTERNAME
            
                $dateTimeFlight = $dateTimeFlight.Scheduled_Date_Time_Of_Departure_UTC

                # Get the route of the flight            
                $routeToBook = `
                Invoke-SqlCmd `
                -Credential $sqlServerCredential `
                -Database $databaseName `
                -Query "SELECT `
                [Route_Id] `
                FROM [dbo].[FlightSchedule] `
                WHERE [Flight_Schedule_Id] = $flight" `
                -ServerInstance $env:COMPUTERNAME
            
                $routeToBook = $routeToBook.Route_Id
            
                # Ensure that at least 65% of seats in business class on the flight are sold
                $businessClassBookingsCapacityUtilisation = Get-Random -Minimum 0.65 -Maximum 1
                $businessClassBookingsCapacityUtilisationRounded = [Math]::Round($businessClassBookingsCapacityUtilisation, 2)
            
                # Work out how many business class bookings to create based on the previous value
                $businessClassBookingsToCreate = `
                Invoke-SqlCmd `
                -Credential $sqlServerCredential `
                -Database $databaseName `
                -Query "SELECT `
                CAST(ROUND([Business_Class_Seat_Count] * $businessClassBookingsCapacityUtilisationRounded, 0) AS INT) AS [Bookings_To_Create] `
                FROM [dbo].[Airplane] `
                WHERE [Airplane_Id] = $airplaneFlight" `
                -ServerInstance $env:COMPUTERNAME
            
                $businessClassBookingsToCreate = $businessClassBookingsToCreate.Bookings_To_Create
                
                # Ensure that at least 65% of seats in economy class on the flight are sold
                $economyClassBookingsCapacityUtilisation = Get-Random -Minimum 0.65 -Maximum 1
                $economyClassBookingsCapacityUtilisationRounded = [Math]::Round($economyClassBookingsCapacityUtilisation, 2)
            
                # Work out how many economy class bookings to create based on the previous value
                $economyClassBookingsToCreate = `
                Invoke-SqlCmd `
                -Credential $sqlServerCredential `
                -Database $databaseName `
                -Query "SELECT `
                CAST(ROUND([Economy_Class_Seat_Count] * $economyClassBookingsCapacityUtilisationRounded, 0) AS INT) AS [Bookings_To_Create] `
                FROM [dbo].[Airplane] `
                WHERE [Airplane_Id] = $airplaneFlight" `
                -ServerInstance $env:COMPUTERNAME
            
                $economyClassBookingsToCreate = $economyClassBookingsToCreate.Bookings_To_Create
            
                # Ensure that at least 65% of seats in premium economy class on the flight are sold
                $premiumEconomyClassBookingsCapacityUtilisation = Get-Random -Minimum 0.65 -Maximum 1
                $premiumEconomyClassBookingsCapacityUtilisationRounded = [Math]::Round($premiumEconomyClassBookingsCapacityUtilisation, 2)
            
                # Work out how many premium economy class bookings to create based on the previous value
                $premiumEconomyClassBookingsToCreate = `
                Invoke-SqlCmd `
                -Credential $sqlServerCredential `
                -Database $databaseName `
                -Query "SELECT `
                CAST(ROUND([Premium_Economy_Class_Seat_Count] * $premiumEconomyClassBookingsCapacityUtilisationRounded, 0) AS INT) AS [Bookings_To_Create] `
                FROM [dbo].[Airplane] `
                WHERE [Airplane_Id] = $airplaneFlight" `
                -ServerInstance $env:COMPUTERNAME
            
                $premiumEconomyClassBookingsToCreate = $premiumEconomyClassBookingsToCreate.Bookings_To_Create
            
                # Create business class bookings   
                while($businessClassBookingsToCreate -gt 0)
                {
                    # A booking can be created up to 272 days before the flight and that last possible booking is 1 day before the flight using the previously defined Get-RandomDateBetween function
                    $dateOfBooking = Get-RandomDateBetween -StartDate $dateTimeFlight.AddDays(-272) -EndDate $dateTimeFlight.AddDays(-1)
                    $dateOfBooking = [Datetime]::ParseExact($dateOfBooking, 'dd/MM/yyyy HH:mm:ss', $null)

                    # A booking can only be created in business hours, I know this is not realistic but I didn't want the data to be too random
                    $timeOfBooking = Get-RandomTimeBetween -StartTime "08:00" -EndTime "18:00"
                    $timeOfBooking = [System.Timespan]::Parse($timeOfBooking)
                    
                    [datetime]$dateTimeTicketSale = $dateOfBooking.Add($timeOfBooking)
        
                    Invoke-SqlCmd `
                    -Credential $sqlServerCredential `
                    -Database $databaseName `
                    -Query " `
                    DECLARE @ticketSaleDateTime DATETIME2 `
                    SET @ticketSaleDateTime = (SELECT CAST('$dateTimeTicketSale' AS DATETIME2)) `
                    DECLARE @travelDateTime DATETIME2 `
                    SET @travelDateTime = (SELECT CAST('$dateTimeFlight' AS DATETIME2)) `
                    `
                    EXEC [dbo].[CreateBooking] @agencyUserId = $businessClassAgencyUser, `
                    @routeId = $routeToBook, `
                    @ticketSaleDateTime = @ticketSaleDateTime, `
                    @ticketTypeId = $businessClassTicketType, `
                    @travelDateTime = @travelDateTime" `
                    -ServerInstance $env:COMPUTERNAME
        
                    $businessClassBookingsToCreate = $businessClassBookingsToCreate - 1
                }
        
                # Create economy class bookings   
                while($economyClassBookingsToCreate -gt 0)
                {
                    # A booking can be created up to 272 days before the flight and that last possible booking is 1 day before the flight using the previously defined Get-RandomDateBetween function
                    $dateOfBooking = Get-RandomDateBetween -StartDate $dateTimeFlight.AddDays(-272) -EndDate $dateTimeFlight.AddDays(-1)
                    $dateOfBooking = [Datetime]::ParseExact($dateOfBooking, 'dd/MM/yyyy HH:mm:ss', $null)

                    # A booking can only be created in business hours, I know this is not realistic but I didn't want the data to be too random
                    $timeOfBooking = Get-RandomTimeBetween -StartTime "08:00" -EndTime "18:00"
                    $timeOfBooking = [System.Timespan]::Parse($timeOfBooking)
                    
                    [datetime]$dateTimeTicketSale = $dateOfBooking.Add($timeOfBooking)
        
                    Invoke-SqlCmd `
                    -Credential $sqlServerCredential `
                    -Database $databaseName `
                    -Query " `
                    DECLARE @ticketSaleDateTime DATETIME2 `
                    SET @ticketSaleDateTime = (SELECT CAST('$dateTimeTicketSale' AS DATETIME2)) `
                    DECLARE @travelDateTime DATETIME2 `
                    SET @travelDateTime = (SELECT CAST('$dateTimeFlight' AS DATETIME2)) `
                    `
                    EXEC [dbo].[CreateBooking] @agencyUserId = $economyClassAgencyUser, `
                    @routeId = $routeToBook, `
                    @ticketSaleDateTime = @ticketSaleDateTime, `
                    @ticketTypeId = $economyClassTicketType, `
                    @travelDateTime = @travelDateTime" `
                    -ServerInstance $env:COMPUTERNAME
        
                    $economyClassBookingsToCreate = $economyClassBookingsToCreate -1
                }
        
                # Create premium economy class bookings   
                while($premiumEconomyClassBookingsToCreate -gt 0)
                {
                    # A booking can be created up to 272 days before the flight and that last possible booking is 1 day before the flight using the previously defined Get-RandomDateBetween function
                    $dateOfBooking = Get-RandomDateBetween -StartDate $dateTimeFlight.AddDays(-272) -EndDate $dateTimeFlight.AddDays(-1)
                    $dateOfBooking = [Datetime]::ParseExact($dateOfBooking, 'dd/MM/yyyy HH:mm:ss', $null)
 
                    # A booking can only be created in business hours, I know this is not realistic but I didn't want the data to be too random
                    $timeOfBooking = Get-RandomTimeBetween -StartTime "08:00" -EndTime "18:00"
                    $timeOfBooking = [System.Timespan]::Parse($timeOfBooking)
                    
                    [datetime]$dateTimeTicketSale = $dateOfBooking.Add($timeOfBooking)
        
                    Invoke-SqlCmd `
                    -Credential $sqlServerCredential `
                    -Database $databaseName `
                    -Query " `
                    DECLARE @ticketSaleDateTime DATETIME2 `
                    SET @ticketSaleDateTime = (SELECT CAST('$dateTimeTicketSale' AS DATETIME2)) `
                    DECLARE @travelDateTime DATETIME2 `
                    SET @travelDateTime = (SELECT CAST('$dateTimeFlight' AS DATETIME2)) `
                    `
                    EXEC [dbo].[CreateBooking] @agencyUserId = $premiumEconomyClassAgencyUser, `
                    @routeId = $routeToBook, `
                    @ticketSaleDateTime = @ticketSaleDateTime, `
                    @ticketTypeId = $premiumEconomyClassTicketType, `
                    @travelDateTime = @travelDateTime" `
                    -ServerInstance $env:COMPUTERNAME
        
                    $premiumEconomyClassBookingsToCreate = $premiumEconomyClassBookingsToCreate -1
                }
        }
    }  
} 

#Wait for all background jobs to complete before completing script exeution
Get-Job | Wait-Job

Full disclaimer: This is probably a very inefficient way of generating this volume of data and should probably be processed in a multi-threaded tool such as Azure Databricks to improve performance.

I did set this to run on an Azure VM for a couple of months. Originally I had 10 concurrent jobs but this was too slow so have reduced this.

Resources

If you don’t want to follow all of those steps and patiently wait for a sizable dataset to be generated, I have published a backup of the database that you can use right away. You can download it here. Please note that you will need to be using SQL Server 2022 as the backup was generated on a SQL Server 2022 instance.

GitHub repository

Credits

Thanks to @emyann on GitHub for their Gist starter code to generate random dates and times in PowerShell of which I have adapted in the code when creating bulk bookings.

Thanks to @farkoo on GitHub for their example code for checking sports venue capacity of which provided inspiration when creating the flight capacity logic