
Azure Data Factory v2 - Batch Pipeline Project Review : Notable Implementation Details

This article is part of a series:

  1. Architecture discussion
  2. ALM and Infrastructure
  3. Notable implementation details <- you are here, comments / PRs are welcome

Architecture

As a reminder, here is the solution architecture we established earlier (see the first article of the series for more details):

[Image: Schema illustrating the architecture]

Implementation in ADFv2

We will focus on the implementation of steps 3 and 4 in Azure Data Factory v2, as they contain most of the logic of our pipeline:

[Image: Schema illustrating the architecture]

Logical flow

While the Copy Activity of ADFv2 can flatten the file structure when copying files, it auto-generates names for those files in the process. We decided instead to loop over the folder structure ourselves, gathering metadata as we go, so that we can iterate over individual files and process them as required (copy, rename, delete).

We will use 3 levels of nested loops to achieve that result:

  1. Company
  2. Device ID
  3. File name

We won’t loop over Year/Month as we’re only processing the current day of results, and we can generate these attributes using the current date.
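For instance, the Month value can be derived from the current date with a Set Variable activity and ADF expression functions. A minimal sketch, assuming a pipeline variable named CurrentMonth has been declared (the exact format string depends on the folder naming convention):

```json
{
  "name": "Set Current Month",
  "type": "SetVariable",
  "typeProperties": {
    "variableName": "CurrentMonth",
    "value": "@formatDateTime(utcnow(), 'MM')"
  }
}
```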

At the time of writing, it is not possible to nest a ForEach activity inside another ForEach activity in ADFv2. The workaround is to execute a child pipeline from within the ForEach, and to implement the next loop level in that child pipeline. Parameters are used to pass the current item value of the loop down to the nested pipeline. This is detailed below.
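In pipeline JSON, the pattern looks roughly like this (a sketch with placeholder activity, pipeline, and parameter names, not the actual ones from the project):

```json
{
  "name": "ForEach Company",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@activity('Get Company Folders').output.childItems",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "Execute Next Level",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": {
            "referenceName": "Next Level Pipeline",
            "type": "PipelineReference"
          },
          "parameters": {
            "CompanyName": "@item().name"
          },
          "waitOnCompletion": true
        }
      }
    ]
  }
}
```

The child pipeline declares CompanyName as a parameter and can run its own ForEach over the next level, which is how the 3 levels of nesting are achieved.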

Here is an illustration of the expected logical flow:

[Image: Screenshot of the pipeline]

Data sets

We will use parameters in our data sets to navigate that folder structure.

To iterate over the file structure in File Store B, we will need:

  - a folder-level data set, parameterized on the path elements (company, year, month, device ID), used by the Get Metadata activities to list child items
  - a file-level data set, additionally parameterized on the file name, used as the source of the copy and delete operations

We don’t need data sets to iterate over years and months, as we can derive those values from the current date. We will expose them as parameters to be able to process past data if need be.

On the output side, we will need a sink data set targeting the blob store.
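As a sketch, the folder-level data set could look like this (data set, linked service, and parameter names are hypothetical, and the FileShare type is an assumption that depends on the actual file store):

```json
{
  "name": "FileStoreB_Folder",
  "properties": {
    "type": "FileShare",
    "linkedServiceName": {
      "referenceName": "FileStoreB",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "CompanyName": { "type": "String" },
      "Year": { "type": "String" },
      "Month": { "type": "String" },
      "DeviceId": { "type": "String" }
    },
    "typeProperties": {
      "folderPath": {
        "value": "@concat(dataset().CompanyName, '/', dataset().Year, '/', dataset().Month, '/', dataset().DeviceId)",
        "type": "Expression"
      }
    }
  }
}
```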

Pipeline design

Here are the pipelines that will be created:

Pipeline “01 - Master”

Parameter:

This is the master pipeline for steps 3 and 4. It contains the main routine and may later host some additional administrative steps (audit, preparatory and clean-up tasks…).

[Image: Screenshot of the pipeline]

Pipeline “02 - Get Companies and IDs”

Parameter:

Variables:

This pipeline will get the list of companies from the folder metadata, generate the current Year/Month, and from that get the list of available Device IDs from the subfolder metadata.

[Image: Screenshot of the pipeline]
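Under the hood, the folder listing relies on the Get Metadata activity requesting the childItems property. A minimal sketch of the first level, with hypothetical names:

```json
{
  "name": "Get Company Folders",
  "type": "GetMetadata",
  "typeProperties": {
    "dataset": {
      "referenceName": "FileStoreB_Root",
      "type": "DatasetReference"
    },
    "fieldList": [ "childItems" ]
  }
}
```

Each entry of the resulting childItems array exposes a name and a type, which is what feeds the downstream ForEach (see the ForEach sketch above).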

Pipeline “03 - Get File Names”

Parameters:

This pipeline will get the list of file names from the final subfolder metadata.

[Image: Screenshot of the pipeline]
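The same Get Metadata pattern applies here, this time against the parameterized folder data set sketched earlier, with values forwarded from the pipeline parameters (names are still hypothetical):

```json
{
  "name": "Get File Names",
  "type": "GetMetadata",
  "typeProperties": {
    "dataset": {
      "referenceName": "FileStoreB_Folder",
      "type": "DatasetReference",
      "parameters": {
        "CompanyName": "@pipeline().parameters.CompanyName",
        "Year": "@pipeline().parameters.Year",
        "Month": "@pipeline().parameters.Month",
        "DeviceId": "@pipeline().parameters.DeviceId"
      }
    },
    "fieldList": [ "childItems" ]
  }
}
```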

Pipeline “04 - Actual Move”

Parameters:

This pipeline will copy and then delete the files.

[Image: Screenshot of the pipeline]
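As a sketch, the copy-then-delete sequence could be implemented with a Copy activity reading from a file-level data set (the folder data set extended with a FileName parameter) into the blob sink, followed by a Delete activity that only runs on success. All names are hypothetical, and the Delete activity assumes a version of ADFv2 where it is available:

```json
[
  {
    "name": "Copy File To Blob",
    "type": "Copy",
    "inputs": [
      {
        "referenceName": "FileStoreB_File",
        "type": "DatasetReference",
        "parameters": {
          "CompanyName": "@pipeline().parameters.CompanyName",
          "Year": "@pipeline().parameters.Year",
          "Month": "@pipeline().parameters.Month",
          "DeviceId": "@pipeline().parameters.DeviceId",
          "FileName": "@pipeline().parameters.FileName"
        }
      }
    ],
    "outputs": [
      {
        "referenceName": "BlobStoreSink",
        "type": "DatasetReference",
        "parameters": {
          "FileName": "@pipeline().parameters.FileName"
        }
      }
    ],
    "typeProperties": {
      "source": { "type": "FileSystemSource" },
      "sink": { "type": "BlobSink" }
    }
  },
  {
    "name": "Delete Source File",
    "type": "Delete",
    "dependsOn": [
      {
        "activity": "Copy File To Blob",
        "dependencyConditions": [ "Succeeded" ]
      }
    ],
    "typeProperties": {
      "dataset": {
        "referenceName": "FileStoreB_File",
        "type": "DatasetReference",
        "parameters": {
          "CompanyName": "@pipeline().parameters.CompanyName",
          "Year": "@pipeline().parameters.Year",
          "Month": "@pipeline().parameters.Month",
          "DeviceId": "@pipeline().parameters.DeviceId",
          "FileName": "@pipeline().parameters.FileName"
        }
      }
    }
  }
]
```

Passing the original file name through to the sink data set is what gives us control over output names, instead of the auto-generated names produced by a flattening copy.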

Actual pipeline flow

Here is an illustration of the complete flow as implemented in ADFv2 across 4 pipelines:

[Image: Screenshot of the pipeline]

Conclusion

The documentation of ADFv2 was a bit immature at the time of writing, so figuring out some of the quirks of ADFv2 was a bit challenging. But while the solution we built is far from perfect, it is a good first iteration, delivering on all the initial requirements.