And we’re back for the second blog post of the Fabric in Production series! If you haven’t already, I highly recommend checking out part 1 first, which lays the foundation for the whole series.
So we’ve got our dev environment, which is basically a workspace with several Fabric items: Notebooks, Data Pipelines, Lakehouses, etc. In this blog post, we’re going to talk about deploying Dataflows Gen2 from one workspace to another. Brace yourself: Dataflows Gen2 are not the easiest type of item to work with! Let’s dive in!
Note: we only talk about the new Fabric Dataflows Gen2 in this blog post, even though some concepts apply to Dataflows Gen1 (or simply Dataflows). Dataflows Gen2 will surpass Gen1 in almost every aspect in the near future, so legacy Dataflows should slowly fade away.
The problem
If only there were a single problem… Dataflows Gen2, as of today, have many issues. First of all, they are not supported in Git integration, meaning that if you connected your Fabric workspace to an Azure DevOps or GitHub repo (like you should!), the JSON code underneath the Dataflow Gen2 will not show up. They are not supported in Fabric Deployment Pipelines either. However, it was announced at FabCon this year that support should be available by the end of the year.
In addition, the user experience differs from that of other Fabric items. You can’t edit a Dataflow Gen2 unless you’re the owner of the item (which is a pain in the ***), and to exit the editing interface, you’re forced to click the little “X” in the upper-right corner. It’s the only item type that works like this so far, and I feel it makes for a poor user experience.
Finally, I have one last issue with Dataflows Gen2. It’s these DataflowsStagingWarehouse and DataflowsStagingLakehouse items that get automatically created for us. These are used by the Dataflow Gen2 engine to convert some data sources to a compatible file type before actually ingesting data into a Fabric Warehouse or Lakehouse. If you’ve worked with Azure Data Factory before, it’s the same principle as the Enable Staging option in copy activities.
There are many problems with these automatically created Warehouse and Lakehouse items. First, they show up in the Git integration even though Dataflows Gen2 do not, which is confusing. Second, sometimes they appear in the workspace UI and other times they don’t. Lastly, there are times when the automatically created items are simply broken: the data you are trying to load into your final destination, say a Fabric Warehouse, can’t be loaded into the staging area due to some internal Microsoft issue, and there is nothing you can do about it.
But anyway, we’ve got our Dataflow Gen2 in our dev Fabric workspace. The Dataflow Gen2 itself is pretty simple. It loads an Excel file from a SharePoint site to a Lakehouse. Now the million-dollar question is: how do we push that baby into our production workspace (which is empty for now)?
Note: we’re only interested in the deployment of the Dataflow Gen2 in this blog post. We’ll cover other resources in the following blog posts of the series.
First option: manual copy
The first option is pretty lame. Since Dataflows Gen2 aren’t supported in Git integration or Fabric Deployment Pipelines, let’s just try and deploy them … manually.
To do this, we can export the dev Dataflow Gen2’s Power Query Template (a .pqt file). Now if we head over to our empty prod workspace, we can create a new Dataflow Gen2 from that Power Query Template file, and just like that, the job is done.
This is sub-optimal for many reasons:
- It involves a manual operation that must be repeated every time a change is made to a Dataflow Gen2, which may lead to developers forgetting to push to production and to inconsistent results
- The user deploying to production must have access to the production workspace; in many companies, developers don’t have access to the production environment, which means the team’s Tech Lead (or whoever has production access) has to deploy every Dataflow Gen2, which can (and most certainly will) create bottlenecks
- If the Dataflow Gen2 needs to connect to different source environments depending on the workspace in which it is published (e.g., in the dev workspace it connects to a dev database, but once pushed to the production workspace it should connect to a prod database), these connection changes must be made manually every time the Dataflow Gen2 is published, which is prone to errors
Note: at this point, I recommend you export the JSON version of the Dataflow Gen2 and manually upload that JSON file to the Azure DevOps / GitHub repo. Every time an update is made to the Dataflow Gen2, update that JSON file as well. That way, you’ll at least keep track of the changes applied to your Dataflow Gen2, even though this file can’t be used for deployment. This is also true for option #2.
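If you want to take some of the tedium out of that bookkeeping, a small script can copy the exported JSON into your local clone of the repo and commit it for you. Here is a minimal Python sketch of the idea; the paths, file names and commit message are assumptions for the sake of illustration, so adapt them to your own setup.

```python
import subprocess
from pathlib import Path
from shutil import copyfile

# Hypothetical paths for illustration: adjust to your own repo clone and export location.
REPO_DIR = Path(r"C:\repos\fabric-dev-workspace")         # local clone of the Azure DevOps / GitHub repo
EXPORT_FILE = Path(r"C:\exports\DF_SalesIngestion.json")   # JSON exported from the Dataflow Gen2 editor
TARGET_DIR = REPO_DIR / "dataflows"                        # folder in the repo where Dataflow exports are kept


def commit_dataflow_export(export_file: Path, repo_dir: Path, target_dir: Path) -> None:
    """Copy the exported Dataflow Gen2 JSON into the repo, then commit and push it."""
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / export_file.name
    copyfile(export_file, destination)

    relative_path = destination.relative_to(repo_dir)
    subprocess.run(["git", "add", str(relative_path)], cwd=repo_dir, check=True)
    # Note: the commit step fails (non-zero exit code) if the export hasn't changed since the last commit.
    subprocess.run(
        ["git", "commit", "-m", f"Update Dataflow Gen2 export: {export_file.stem}"],
        cwd=repo_dir,
        check=True,
    )
    subprocess.run(["git", "push"], cwd=repo_dir, check=True)


if __name__ == "__main__":
    commit_dataflow_export(EXPORT_FILE, REPO_DIR, TARGET_DIR)
```

It won’t make the deployment itself any less manual, but at least the history of your Dataflow Gen2 lives in source control next to the rest of your workspace.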
Second option: single isolated Dataflows Gen2 workspace
The second option is sub-optimal as well, though slightly less so when certain conditions are met.
If your Dataflows Gen2 data sources are not environment-dependent (meaning that your dev Dataflow Gen2 connects to the same data source as the prod Dataflow Gen2 would), then a possible “solution” is to get rid of the concept of deploying Dataflows Gen2 altogether. What do I mean by that? Simply isolate all your Dataflows Gen2 in a separate workspace and have other resources, like Data Pipelines, always point to that same Dataflow Gen2, regardless of their environment.
In our example, the Dataflow Gen2 connects to an Excel file on a SharePoint site. There is only a single SharePoint site and a single Excel file; there is no dev file or prod file. Therefore, the Dataflow Gen2 in dev or prod would connect to the exact same file. Both Dataflows Gen2 would be identical, so we create only one, in a dedicated workspace. To avoid manual errors when deploying, we simply get rid of deployment.
This option only works if your Dataflows Gen2 don’t need to connect to different source environments, but in that case, it works well.
Note: I suggest using Dataflows Gen2 only when your data sources have a single environment, so that you can use option #2. You basically never want to choose option #1 if you can avoid it.
Note: I also suggest reserving Dataflows Gen2 for sources that are hard to handle with Data Pipelines (typically, Excel files located on a SharePoint site). If the data source can be ingested with a simple Data Pipeline, I feel Data Pipelines offer a much more robust solution (even though they have their own problems, as we’ll see in upcoming blog posts of the Fabric in Production series).
Making changes to Dataflows Gen2
Whether you’ve chosen option #1 (good luck) or option #2 (lucky you), the way you make changes to a Dataflow Gen2 is going to be the same. There are no APIs yet to create or update the content of a Dataflow Gen2 programmatically, so you’ll have to edit it manually, either in the dev workspace (option #1) or in the dedicated Dataflows Gen2 workspace (option #2).
Here comes the fun part. As we briefly mentioned before, Dataflows Gen2 can only be edited by their owner. To avoid the “jeez, the intern left the company” problem, it’s safer to create Dataflows Gen2 (and every Fabric item, for that matter) with a dedicated Service Account. It’ll be a bit painful, since your developers will have to log in with that identity every time they need to make an update, but it’s better than having to contact a tenant admin or even Microsoft support to take ownership of the item.
Conclusion
Sadly, there is no way to fully manage the deployment of Dataflows Gen2 programmatically. As of today, I’d really recommend not working with Dataflows Gen2 unless you fall into the use case for option #2.
We hope to receive a beautiful Christmas gift from Microsoft with Dataflows Gen2 Git integration support!
Thanks for reading, stay tuned for part 3!