Microsoft Fabric end to end project (part 2 – Ingestion)

Let’s continue our Fabric end to end project! If you haven’t read part 1, I strongly suggest you read it first. In part 1, we laid the groundwork of our project. In part 2, we’re going to start development by ingesting data from the League of Legends (LoL) APIs.

Let’s go!

Setting up our development environment

Creating the Fabric workspace

Since Fabric is a Software as a Service (SaaS), this part should be fairly easy. All we have to do is make sure we have a Fabric capacity (or a Power BI capacity) up and running and create a workspace attached to that capacity. A Fabric capacity is basically a set of machines in Microsoft datacenters that will be doing the computation in the background. They are very similar to Power BI capacities for those of you coming from the Power BI world.

You can have multiple capacities, each having its own set of resources. It will be the task of Fabric admins to make sure every project has enough resources to consume. Many distribution strategies can be applied: per environment, per department, per project, or a mix of all of that.

You may be wondering, how much will that capacity cost me? Well, I won’t answer that question here (I’m deeply sorry). First, the financial aspect of Microsoft Fabric would require a blog of its own, and it has already been done, here for instance. In addition, I’m using a free trial capacity for the sake of this project, so money is not an issue. Just to give a very broad range, Fabric capacities go from F2 (around 300€/month) to F2048 (around 300 000€/month, yes, you read that correctly).

To set up my environment, I simply need to create a workspace. Again, if you’re coming from the Power BI world, you’re probably familiar with workspaces. Every element created in Fabric needs to be created within a workspace. It’s mainly at the workspace level that rights are going to be configured: can a user simply read data in that workspace, or should they be able to publish and grant other people access as well?

When creating the workspace, give it a friendly name and make sure it’s created in a capacity, either a Fabric or a Power BI (called premium in the screenshot below) one. Fabric items like Warehouses, Lakehouses or Notebooks can only be created in such workspaces.

This first workspace we create will be our first data engineer’s workspace. He will be the one responsible for ingesting data into the bronze layer of our architecture (go check part 1 if you don’t remember the architecture!).

Connecting the Fabric workspace to a git repo

Once this is done, we need to connect the workspace to a git repository. It’s completely unimaginable not to put our code in a git repo. When the code is backed up in a git repo, it eases collaboration, keeps every version of your code, and lets you see who has done what, why and when.

For now, the only type of repo a Fabric workspace can be linked to is an Azure DevOps repo. I already have an Azure DevOps project set up with a repo configured. If you haven’t, go check the Microsoft documentation: Create a new Git repo in your project. As long as your team is small enough (fewer than 6 users), this service is free; beyond that, it isn’t.

When we initialize our repo, a main branch is automatically created. This branch will be linked to our production workspace. We create a test branch from the main branch that will be linked to the test workspace. Finally, we create a feature/ingest-lol-data branch that will be linked to our data engineer’s personal workspace. Each time a new feature is developed, a new branch will have to be created.

You then simply have to go into the workspace settings and link the workspace to the feature branch you just created.

It might take a few minutes to set all this up the first time. But once you’re used to it, it goes very quickly. This is one of Microsoft Fabric’s strongest points.

Creating Fabric resources

If you remember part 1, you know that we are going to ingest LoL API data using Notebooks. We explained why we chose Notebooks over Pipelines or Dataflows Gen2 in the previous article.

These Notebooks will ingest data into the bronze layer. In that layer, data is raw: no transformations are applied yet, we simply load the data exactly as we receive it from the API. The bronze layer will be a Fabric Lakehouse. The data will be pushed in json format, which is the format the API returns.

We need to create multiple things: one Lakehouse and several Notebooks, one for each topic we’re interested in. Having one Notebook per topic is purely a personal choice; we could have had one big Notebook, but I felt it would be harder to maintain.

The creation of the Lakehouse is pretty straightforward so I’m not going to cover it. Just go check the Microsoft documentation if you need to. I’m also creating a Notebooks folder in which I’ll create all of my Notebooks.

Ingesting LoL data

Ok we’re all set. Let’s actually go and get that data!

Ingesting static data with OneLake Explorer

Most of the APIs we’re going to be using work with the player id. To get this id from the player’s name, we can use the https://europe.api.riotgames.com/riot/account/v1/accounts/by-riot-id/<playerName>/<playerRegion> endpoint. I did this for about 10 players and put their names as well as their ids in a csv file.
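As an illustration, here’s roughly what such a lookup could look like in Python with the requests library, assuming the endpoint above (the API key is a plain placeholder here; we’ll fetch it securely from Azure Key Vault later in this post):

```
import requests

# Placeholder key for illustration only; later in this post it comes from Azure Key Vault
API_KEY = "<your-riot-api-key>"

def get_player_id(player_name: str, player_region: str) -> str:
    """Look up a player's id from their name using the account endpoint quoted above."""
    url = (
        "https://europe.api.riotgames.com/riot/account/v1/accounts/by-riot-id/"
        f"{player_name}/{player_region}"
    )
    response = requests.get(url, headers={"X-Riot-Token": API_KEY})
    response.raise_for_status()
    return response.json()["puuid"]
```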

I then used OneLake Explorer (which you can download here) to very simply load that data to the Bronze Lakehouse. OneLake Explorer is just like OneDrive Explorer. It lets you directly manipulate files stored on OneLake in the traditional Windows File Explorer.

If the list of players I wanted to get data from changed more often, I would have uploaded that csv file to SharePoint (where business users could update the list of players they want to see) and regularly loaded that file into the bronze Lakehouse using a Dataflow Gen2.

I also used OneLake Explorer to upload the list of all champions (champions are the in-game characters that you play) that exist in the game. This list is not available through an API, so I manually downloaded it from there.

Ingesting evolving data with Notebooks

Ok, so we’ve got our list of players and the list of champions. Let’s actually get the API data now!

I’ll be needing several pieces of information from the Riot Games APIs:

  • Players’ ranks
  • Players’ champion information (related to the number of games a player has had with each champion)
  • Players’ games
  • Game details (when the game started, how long it lasted, how much damage each player did, and so on)

For each of these topics, a dedicated API exists. I’ll be creating a Notebook per topic, as well as an additional Functions Notebook to reuse common pieces of code across all Notebooks.
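One common way to reuse code across Fabric Notebooks is the %run magic, which executes another Notebook in the current Spark session and makes its functions available. Here is a minimal sketch of the kind of helper the Functions Notebook could expose (the function name is illustrative, not the exact one from the repo):

```
# In the Functions Notebook: a helper reused by every ingestion Notebook
import requests

def call_riot_api(url: str, api_key: str) -> dict:
    """Call a Riot Games API endpoint and return the parsed json response."""
    response = requests.get(url, headers={"X-Riot-Token": api_key})
    response.raise_for_status()
    return response.json()
```

Each topic Notebook can then start with %run Functions and call these helpers directly.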

Let’s go and create the Notebook to ingest players’ ranks.

I’m creating the Notebook, giving it a friendly name and getting a session (i.e. connecting my web page to a machine in a Microsoft datacenter so I have compute available). I’ll mainly be using pySpark to make API requests.

Authenticating to the Riot Games API

To make sure I’ve got the rights to use the Riot Games APIs, I need to authenticate to them. I’ve enrolled on the Riot Games Developer portal and got an API key. I need to pass that key every time I make an API call, so the API knows who I am and what rights I have.

Thing is, you don’t want to put your API key in plain text in your Notebooks. If you do, since our Notebooks will be synced to git, the key will be there forever, just waiting to be seen by unwanted people.

We need to secure access to that API key. For that purpose, I’m using an Azure Key Vault. Explaining the setup of Azure Key Vault and how to connect from a Fabric Notebook to Azure Key Vault secrets is out of the scope of this blog post. You can check this blog post for that purpose.

Here’s what the beginning of my Notebook looks like.
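Roughly, assuming the key is stored as a secret in Azure Key Vault and retrieved with mssparkutils, that beginning could look like this (the Key Vault URL and secret name are placeholders):

```
from notebookutils import mssparkutils

# Placeholders: replace with your own Key Vault URL and secret name
KEY_VAULT_URL = "https://<your-key-vault>.vault.azure.net/"
SECRET_NAME = "riot-api-key"

# Retrieve the Riot API key from Azure Key Vault; the key itself never appears in the code
API_KEY = mssparkutils.credentials.getSecret(KEY_VAULT_URL, SECRET_NAME)
```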

Making the API calls

Once this is done, we’re good to go! I won’t cover every part of the code in this blog post as it would be both long and boring. The complete code of all Notebooks is available in my GitHub repo: MS Fabric End to End.

What’s important is the main idea behind it. Here are the main steps (a simplified sketch follows the list):

  1. Import some python libraries to load a set of functions we will be needing
  2. Connect to the Azure Key Vault to securely grab the API Key
  3. Load the list of players from the csv file
  4. For each player in that file, make a call to the API to get the player’s rank
  5. Store all players’ ranks in a dataframe and write it as a json file in the bronze Lakehouse
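Put together, a simplified sketch of the players’ ranks Notebook could look like this. The endpoint, file paths and column names are illustrative (the actual code is in the repo linked above), and it assumes the bronze Lakehouse is attached as the Notebook’s default Lakehouse:

```
import requests
from datetime import date
from pyspark.sql import Row

# Steps 1-2: libraries are imported and API_KEY has been fetched from Azure Key Vault above

# Step 3: load the list of players previously uploaded with OneLake Explorer
players = spark.read.option("header", True).csv("Files/players/players.csv").collect()

# Step 4: call the rank API for each player (endpoint and column name are illustrative)
ranks = []
for player in players:
    url = f"https://euw1.api.riotgames.com/lol/league/v4/entries/by-summoner/{player['playerId']}"
    response = requests.get(url, headers={"X-Riot-Token": API_KEY})
    response.raise_for_status()
    ranks.extend(response.json())

# Step 5: store all ranks in a dataframe and write it as json in the bronze Lakehouse
ranks_df = spark.createDataFrame([Row(**r) for r in ranks])
ranks_df.write.mode("overwrite").json(f"Files/player_ranks/{date.today().isoformat()}")
```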

Every time this Notebook is executed, we get a json file written in our bronze layer with the rank of all players at the time of the API call. We’ll need to execute this Notebook frequently (for instance every day) to track the players’ rank evolution.

We do the same for every other topic we need and this is what we get. For each day we run the Notebooks, we get all the information we need.

One of the drawbacks of having loaded the data in the Files section is that we can’t use the SQL endpoint to query and explore it. For some topics, like the players’ ranks, we could easily have loaded the data into a delta table (since the schema was already defined when the data was loaded into a dataframe before being written as a json file).

For other topics however, like the game details, the structure of the json file is so complicated that it would be a pain to define the schema explicitly. Instead, we just wrote the file without going through a dataframe.
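For a simple, stable schema like the players’ ranks, making the data queryable through the SQL endpoint would only require writing the dataframe to a Delta table instead of a json file, something along these lines (a sketch, reusing the ranks_df dataframe from the sketch above; the table name is illustrative):

```
# Alternative for simple schemas: append the dataframe to a managed Delta table,
# which the Lakehouse SQL endpoint can then query directly
ranks_df.write.format("delta").mode("append").saveAsTable("player_ranks_bronze")
```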

Anyway, we can now commit the changes from our feature branch to the Azure DevOps repo if we want to at this stage. We’ll see how to push these items to the test and production workspaces in part 5. We’ll also cover orchestration (i.e. when these Notebooks are scheduled to run) in part 5.

Conclusion

In this blog post, we saw how we can use Fabric Notebooks to ingest data into a Fabric Lakehouse using the Riot Games APIs. In the next part, we’ll see how to transform that data to make it suitable for analysis.

Thanks for reading, stay tuned!
