Microsoft Fabric end to end project (part 1 – Architecture)

With the March 2024 Fabric release, Microsoft shipped many of the key features that were missing to push a complete Fabric project into production: CI/CD for Fabric Pipelines, gateway support for Fabric Pipelines, and even simple things like folders in Fabric workspaces.

No more excuses not to try a project using only Microsoft Fabric components!

As this is quite a long topic, the project will be divided into several blog posts:

  • Part 1: Overview and Architecture
  • Part 2: Data ingestion
  • Part 3: Data transformation
  • Part 4: Data visualisation
  • Part 5: CI/CD and push to production
  • Part 6: Logging and monitoring

In this first blog post, I’ll present the project, who its users will be, and some of the more technical aspects like the architecture and the git workflow.

Please note that most of the features we will use in this project are still preview features and subject to change.

Overview of the project

I can’t hide it anymore: I’m a League of Legends (LoL for the rest of the blog post) player. I’m not especially good at the game, but I’ve been playing for quite some time now (long enough to have a large amount of data to analyse, at least). Riot Games, the company behind LoL, provides APIs to get data from its games.

The goal of the project is to get data from games my friends and I played and display it in Power BI reports. And who knows, maybe I’ll discover some insight I was missing and improve at the game!

For this, my main source of data will be the LoL API. I’ll be making several API calls to get players’ basic info, games played and most used in-game characters. I’ll also be using some additional flat CSV and JSON files to hold the list of players I want to get data from and some other reference data (like the list of playable characters, which is provided through another channel).
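To give an idea of what those calls look like, here is a minimal sketch of the call chain (the endpoints and regional routing values follow the Riot developer portal documentation at the time of writing; the API key and player name are placeholders):

```python
import requests

API_KEY = "RGAPI-xxxxxxxx"  # placeholder key issued by the Riot developer portal
HEADERS = {"X-Riot-Token": API_KEY}

# 1. Basic info about a player; the returned puuid is the key used by the other endpoints
summoner = requests.get(
    "https://euw1.api.riotgames.com/lol/summoner/v4/summoners/by-name/SomePlayer",
    headers=HEADERS,
).json()

# 2. Ids of the player's most recent games
match_ids = requests.get(
    f"https://europe.api.riotgames.com/lol/match/v5/matches/by-puuid/{summoner['puuid']}/ids",
    params={"start": 0, "count": 20},
    headers=HEADERS,
).json()

# 3. Full detail of one game: a large nested json document
match_detail = requests.get(
    f"https://europe.api.riotgames.com/lol/match/v5/matches/{match_ids[0]}",
    headers=HEADERS,
).json()
```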

Personas

Let’s first list the tasks that the different types of users will need to perform with this data:

  • Ingest LoL data from APIs and flat files
  • Transform and clean data so that it can be used by Data Analysts and Data Scientists
  • Orchestrate, monitor and ensure data quality over all layers of ingestion and transformation
  • Create enterprise and self-service Power BI reports
  • Explore and analyse data using T-SQL
  • Create and train models out of cleaned data to predict several outcomes
  • Use reports created by Data Analysts
  • Control performance and costs of the platform
  • Grant access to users
  • Create Microsoft Fabric items

These tasks map to several personas. Let’s pretend this is a moderate-size project; in the development team, we’ve got:

  • 2 Data Engineers
  • 2 Data Analysts
  • 1 Data Scientist
  • 1 Admin (not necessarily tied to a specific project; more like a super admin who controls everything on the Fabric tenant)
  • Multiple Report Consumers: these are business users from any department who want to use that League of Legends API data (yeah, there are some people out there who really want to use that data)

Architecture of the project

Our project’s overall architecture is going to look something like this:

Let me go over some of the important topics of that architecture.

Lakehouse and medallion architecture

There are two main components in Fabric to “store” (I don’t like the word “store”, as data is in the end always stored in the same Fabric OneLake; “expose” would be a better fit) and transform data: Lakehouses and Warehouses. I won’t cover the differences between the two here, as it’s already been done a thousand times, here for example. I just want to stress one of the major differences: data transformations in Warehouses are done in T-SQL, while in Lakehouses they are done in a mix of PySpark and Spark SQL.
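To illustrate that difference, here is a minimal sketch of the same transformation written both ways in a Lakehouse notebook (table and column names are made up for the example; spark is the session pre-created in a Fabric notebook):

```python
# DataFrame API (PySpark): keep only games longer than 5 minutes
raw_games = spark.read.table("raw_games")
raw_games.filter("game_duration_s > 300").write.mode("overwrite").saveAsTable("clean_games")

# Spark SQL: same result, expressed as a query
spark.sql("""
    CREATE OR REPLACE TABLE clean_games AS
    SELECT * FROM raw_games
    WHERE game_duration_s > 300
""")
```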

I’m not very proficient in PySpark; I’m more used to T-SQL. But I still chose the Lakehouse over the Warehouse, for several reasons:

  • At the time of writing, many T-SQL commands are still not supported in Fabric Warehouses (TRUNCATE, ALTER TABLE, …), which makes them pretty hard to use, especially when it comes to deploying a Warehouse from one environment to another
  • I feel like Microsoft is investing more in the Lakehouse; for instance, Direct Lake mode for Power BI reports was first supported only on the Lakehouse and only became available on the Warehouse weeks later
  • It’s the recommended approach in the official Microsoft documentation: Implement medallion lakehouse architecture in Microsoft Fabric; the Warehouse is only recommended for the gold layer (which we’ll talk about in a minute), and even there the SQL analytics endpoint of the Lakehouse can do part of the job

With Lakehouses, the recommended architecture is the medallion architecture, in which you usually have 3 layers. As data moves through these layers, its quality and usability improve. Here’s what each layer is usually used for (a short PySpark sketch of the bronze-to-silver step follows the list):

  • Bronze: used to store raw data. No transformations are performed at all; data is often kept in its original format.
  • Silver: cleaned data. Data quality is enforced in this layer: we remove empty or abnormal values, merge several sources together and make sure data quality rules are applied. In this layer data is stored in the delta format (more on the delta format here)
  • Gold: data that’s ready to use. Data here is modeled to suit users’ needs. Data will often be modeled as star schemas for Business Intelligence needs or simply enriched with additional attributes for Machine Learning use cases. Data is stored in delta format as well here.
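Here is the sketch mentioned above: a rough PySpark illustration of the bronze-to-silver step (paths, table and column names are assumptions for the example; the real transformation logic will come in part 3):

```python
from pyspark.sql import functions as F

# spark is the session pre-created in a Fabric notebook
# Read the raw match json landed in the Files section of the bronze Lakehouse
raw_matches = spark.read.json("Files/riot/matches/*.json")

# Basic cleaning: drop records without an id, keep only the fields we need, deduplicate
clean_matches = (
    raw_matches
    .filter(F.col("metadata.matchId").isNotNull())
    .select(
        F.col("metadata.matchId").alias("match_id"),
        F.col("info.gameDuration").alias("game_duration_s"),
        F.col("info.gameVersion").alias("game_version"),
    )
    .dropDuplicates(["match_id"])
)

# Write the result as a delta table in the silver Lakehouse
clean_matches.write.format("delta").mode("overwrite").saveAsTable("silver_matches")
```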

What we stated above only scratches the surface of a lot of major concepts. I won’t dig into them any deeper in this blog post, however; many talented people already have. I feel like Medallion Architecture: What, Why and How is a good place to start.

Data Ingestion

Ingestion of the LoL API data will mainly be done using Fabric Notebooks in PySpark (the Python API for Spark). In short, Spark is an engine that distributes work across several nodes: instead of having a single machine do all the work, multiple machines work together to get tasks done faster.

Notebooks are probably the best fit to load the complex JSON data that the API sends us. Other options would have been Fabric Pipelines or Fabric Dataflows Gen2, but I feel neither is great at loading data from APIs (things like pagination, query parameters and securing API keys are more challenging with Pipelines or Dataflows Gen2).
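As a sketch of what such a notebook could look like, here is a minimal version that pages through the match list, keeps the API key in Azure Key Vault and lands the raw JSON in the bronze Lakehouse (the Key Vault URL, folder layout and page size are assumptions; the full notebook will come in part 2):

```python
import json
import requests
from notebookutils import mssparkutils  # utility library available in Fabric notebooks

# Fetch the Riot API key from an Azure Key Vault instead of hardcoding it in the notebook
api_key = mssparkutils.credentials.getSecret("https://my-keyvault.vault.azure.net/", "riot-api-key")
headers = {"X-Riot-Token": api_key}

puuid = "some-player-puuid"  # placeholder, obtained earlier from the summoner endpoint
start, page_size = 0, 100

# Make sure the target folder exists in the Lakehouse attached to the notebook
mssparkutils.fs.mkdirs("Files/riot/matches")

# The match ids endpoint is paginated through the start/count query parameters
while True:
    match_ids = requests.get(
        f"https://europe.api.riotgames.com/lol/match/v5/matches/by-puuid/{puuid}/ids",
        params={"start": start, "count": page_size},
        headers=headers,
    ).json()
    if not match_ids:
        break

    for match_id in match_ids:
        match = requests.get(
            f"https://europe.api.riotgames.com/lol/match/v5/matches/{match_id}",
            headers=headers,
        ).json()
        # Land the raw json as-is in the Files section of the default Lakehouse (mounted locally)
        with open(f"/lakehouse/default/Files/riot/matches/{match_id}.json", "w") as f:
            json.dump(match, f)

    start += page_size

# Note: real code would also need to respect the Riot API rate limits (e.g. back off on HTTP 429)
```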

Ingestion of flat files will be done through manual upload in the Fabric web portal or via the OneLake file explorer.

All this data will be loaded into the bronze Lakehouse in its original format (i.e. mainly JSON).

Workspace and access management

I’ll work with 3 environments for this project: dev, test and prod. Access will be the same for Data Analysts, Data Scientists and Data Engineers across the 3 environments. Report end-users will only access data in the prod environment. There will be one workspace for the test environment and one workspace for the prod environment. For the dev environment, there will be one workspace per Fabric developer (one for each data engineer, data analyst and data scientist). That makes quite a lot of workspaces, but since each workspace can only be connected to a single git branch at a time, I feel it’s the best way to go for now.

Each environment will have its own bronze, silver and gold layers. Each layer will be “stored” in its own Lakehouse but all three Lakehouses will be kept in the same Fabric workspace.

This setup works fine for a single project. If we had multiple projects, we’d probably keep the same approach for the bronze and silver layers but would split the gold layer across multiple workspaces to control access.

Data Analysts and Scientists will only have access to the gold layer of each environment. Data Engineers will have access to everything in each environment. Sensitive information (like player names, for instance) might need to be encrypted even when manipulated by data engineers.

There are several ways these accesses can be granted: workspace roles, direct access or custom Lakehouse roles. I’ll dive deeper into the subject in part 5 – CI/CD and push to production.
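Just to illustrate the workspace-role option, here is a hedged sketch using the Fabric REST API role assignment endpoint (endpoint and payload as documented at the time of writing; the ids and the token acquisition are placeholders):

```python
import requests

workspace_id = "00000000-0000-0000-0000-000000000000"  # placeholder workspace id
user_object_id = "11111111-1111-1111-1111-111111111111"  # placeholder Entra ID object id
token = "<Entra ID token with the Fabric API scope>"  # e.g. acquired with azure-identity

# Grant the user the Viewer role on the workspace
response = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/roleAssignments",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "principal": {"id": user_object_id, "type": "User"},
        "role": "Viewer",
    },
)
response.raise_for_status()
```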

Git integration

Last but not least, let’s talk briefly about git integration.

We mentioned before that all of our Fabric items will be kept in an Azure DevOps git repo. I won’t cover the benefits of doing so again, I already did in these blog posts: Power BI CI/CD & git integration and Which git workflow for Power BI development with pbip?. I’ll just say this: not keeping your Fabric items in an Azure DevOps repo is a crime. It’s so simple to do (except for Warehouses at the moment) and brings so much that you just have to do it.

In a previous blog post, I talked about which git workflow to use for Power BI report development. In that blog post we were also working with the same 3 environments. The dev workspace wasn’t linked to any git branch, the test workspace was linked to a test branch and the prod workspace was linked to a main branch. We created feature branches in VS Code and created pull requests to merge changes from the feature branch into the test and main branches.

Here we’re keeping the same approach. The only difference is that the dev environment is actually split across each developer’s personal workspace.

Conclusion

In this blog post, we covered the first part of our Fabric journey and established the foundations of our project. These foundations can and probably will evolve as we go through the next parts of the project.

Thanks for reading! Stay tuned for part 2.
