
Retrieve the hierarchical directory structure from Azure ADLS Gen2 storage

In one of my earlier articles, I demonstrated how to leverage DataLakeServiceClient to create and modify files in a Lakehouse within Microsoft Fabric.

Similarly, we can use DataLakeServiceClient for operational tasks on Azure ADLS Gen2 storage. In this article, we dive deeper into the details of this approach.

Retrieving the ADLS Gen2 directory structure from within the Azure ecosystem is pretty straightforward. But when the requirement is to fetch it from outside the Azure environment (say, from a Windows or web app), the most common approach is to use Access Keys or Client Secrets. As I have reiterated multiple times, though they are convenient, using them introduces significant security concerns.

A more secure and modern approach is to rely on token-based authentication using the Microsoft Authentication Library (MSAL). Since Managed Identities cannot be used outside of Azure environments (they are tightly bound to Azure resources and cannot be impersonated externally), MSAL becomes the preferred approach.

We will follow the same approach that was used in the earlier article. However, that approach targeted Fabric lakehouses using Fabric APIs and DataLakeServiceClient; in this article we are dealing with Azure ADLS Gen2 storage.

Setup

The sample ADLS storage has the following structure:

The expectation is that the code should be capable of recursively traversing all directories within a given container. As shown in the screenshot above, the customers container contains directories nested up to three levels deep, and this structure can be dynamic.

Permissions

In the earlier article on data lakes within Fabric lakehouses, the underlying identity type (Service Principal, Managed Identity, User, or Group) required the necessary access at the workspace level. Similarly, for Azure ADLS Gen2 storage we have to grant the identity a role on the storage account to be able to access it. This is typically Storage Blob Data Owner.

You could also grant Storage Blob Data Contributor, but the Storage Blob Data Owner role additionally carries POSIX access control (ACL access), which automatically grants full (r-w-x) privileges on all the objects under the container.

For instance, in the following screenshot we can see that the owner was automatically assigned (r-w-x) access.
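For completeness, the same role assignment shown in the portal can also be granted from the Azure CLI. This is a sketch; the assignee, subscription, resource group, and storage account values are placeholders you would substitute with your own:

```shell
# Grant the service principal Storage Blob Data Owner on the storage account
# (placeholders: <app-id>, <subscription-id>, <resource-group>, <storage-account>)
az role assignment create \
  --assignee "<app-id>" \
  --role "Storage Blob Data Owner" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
```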

In my other article on Fabric lakehouse storage, the endpoint used was https://onelake.dfs.fabric.microsoft.com, but for Azure ADLS Gen2 storage the endpoint is https://{storageaccountname}.dfs.core.windows.net.

Surprisingly, the scope stays the same whether you fetch Fabric lakehouse storage details or ADLS Gen2 storage details on Azure.

The scope used in both cases is https://storage.azure.com/.default

The delegated permission assigned to the service principal is on Azure storage.

Code

Now that we have all the prerequisites in place, install the following packages in your C# console application:

dotnet add package Azure.Core
dotnet add package Azure.Storage.Files.DataLake
dotnet add package Microsoft.Identity.Client

Appsettings.json

{
    "Logging": {
        "LogLevel": {
            "Default": "Information",
            "Microsoft": "Warning",
            "Microsoft.Hosting.Lifetime": "Information"
        }
    },

    "AllowedHosts": "*",
    "ClientId": "Service Principal Client Id",
    "TenantId": "Tenant Id",
    "StorageAccount": "Storage Account Name",
    "Container": "Container Name"
}

Program.cs

using Microsoft.Extensions.Configuration;
using Azure.Core;
using Azure.Storage.Files.DataLake;
using AccesTokenCredentials;
using Microsoft.Identity.Client;
using System.Net.Http.Headers;

internal class Program
{
    static string StorageAccount = "";
    static string Container = "";
    static string ClientId = "";
    static string TenantId = "";

    static async Task Main(string[] args)
    {
        ReadConfig();

        // Interactive MSAL flow: the user signs in through the browser
        var app = PublicClientApplicationBuilder
            .Create(ClientId)
            .WithAuthority($"https://login.microsoftonline.com/{TenantId}/v2.0")
            .WithRedirectUri("http://localhost")
            .Build();

        // The same scope covers both Fabric lakehouses and ADLS Gen2
        string[] scopes = new[] { "https://storage.azure.com/.default" };
        var result = await app.AcquireTokenInteractive(scopes).ExecuteAsync();

        // Wrap the raw access token in a TokenCredential for the SDK
        TokenCredential tokenCredential = new AccessTokenCredential(result.AccessToken);
        string dfsUri = $"https://{StorageAccount}.dfs.core.windows.net";

        DataLakeServiceClient datalake_Service_Client = new DataLakeServiceClient(new Uri(dfsUri), tokenCredential);
        DataLakeFileSystemClient dataLake_FileSystem_Client = datalake_Service_Client.GetFileSystemClient(Container);

        // Start the traversal from the container root
        DataLakeDirectoryClient rootDirectory = dataLake_FileSystem_Client.GetDirectoryClient("");
        await TraverseDirectory(rootDirectory);
    }

    public static async Task TraverseDirectory(DataLakeDirectoryClient directoryClient)
    {
        // GetPathsAsync lists the immediate children of this directory;
        // recurse into each child that is itself a directory
        await foreach (var item in directoryClient.GetPathsAsync())
        {
            Console.WriteLine(item.Name);

            if (item.IsDirectory == true)
            {
                // item.Name is relative to the container root, but
                // GetSubDirectoryClient expects a path relative to the
                // current directory, so keep only the last segment
                string[] split = item.Name.Split("/");
                var subDir = directoryClient.GetSubDirectoryClient(split.Length == 1 ? split[0] : split[split.Length - 1]);
                await TraverseDirectory(subDir);
            }
        }
    }

    public static void ReadConfig()
    {
        var builder = new ConfigurationBuilder().AddJsonFile("Appsettings.json", optional: true, reloadOnChange: true);
        var config = builder.Build();
        ClientId = config["ClientId"];
        TenantId = config["TenantId"];
        StorageAccount = config["StorageAccount"];
        Container = config["Container"];
    }

}

A couple of important points I would like to highlight from the code above.

The function TraverseDirectory recursively traverses all the subdirectories. It takes a parameter directoryClient of type DataLakeDirectoryClient, obtains a client for each subdirectory through directoryClient.GetSubDirectoryClient, and recursively calls itself, so that directoryClient.GetPathsAsync() runs against each subdirectory in the following iterator

var item in directoryClient.GetPathsAsync()

public static async Task TraverseDirectory(DataLakeDirectoryClient directoryClient)

{
    await foreach (var item in directoryClient.GetPathsAsync())
    {
        Console.WriteLine(item.Name);

        if (item.IsDirectory == true)
        {
            string[] split = item.Name.Split("/");
            var subDir = directoryClient.GetSubDirectoryClient(split.Length == 1 ? split[0] : split[split.Length - 1]);
            await TraverseDirectory(subDir);
        }
    }
}
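To see the path handling in isolation: GetPathsAsync returns names relative to the container root (e.g. customers/2024/01), while GetSubDirectoryClient expects a path relative to the current directory client, which is why only the last segment is kept. A standalone sketch of that extraction, with LastSegment being a hypothetical helper pulled out purely for illustration (no Azure dependency):

```csharp
using System;

// Mirrors the split logic inside TraverseDirectory: given a path as
// returned by GetPathsAsync (relative to the container root), keep only
// the last segment so it can be passed to the parent's GetSubDirectoryClient.
static string LastSegment(string pathName)
{
    string[] split = pathName.Split('/');
    return split.Length == 1 ? split[0] : split[split.Length - 1];
}

Console.WriteLine(LastSegment("customers"));         // prints "customers"
Console.WriteLine(LastSegment("customers/2024/01")); // prints "01"
```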

Next, check these lines from the main code

string dfsUri = $"https://{StorageAccount}.dfs.core.windows.net";
datalake_Service_Client = new DataLakeServiceClient(new Uri(dfsUri), tokenCredential);

DataLakeServiceClient expects a TokenCredential to validate credentials.

But what we get through PublicClientApplicationBuilder is an access token of type string. To overcome this, I created a separate class called AccessTokenCredential that inherits from ClientSecretCredential and accepts the access token as its constructor parameter.

using Azure.Core;
using Azure.Identity;
using System.IdentityModel.Tokens.Jwt;

namespace AccesTokenCredentials
{
    public class AccessTokenCredential : ClientSecretCredential
    {
        private string AccessToken;

        public AccessTokenCredential(string accessToken)
        {
            AccessToken = accessToken;
        }

        // Parse the JWT only to recover its expiry time
        public AccessToken FetchAccessToken()
        {
            JwtSecurityToken token = new JwtSecurityToken(AccessToken);
            return new AccessToken(AccessToken, token.ValidTo);
        }

        public override ValueTask<AccessToken> GetTokenAsync(TokenRequestContext requestContext, CancellationToken cancellationToken)
        {
            return new ValueTask<AccessToken>(FetchAccessToken());
        }

        public override AccessToken GetToken(TokenRequestContext requestContext, CancellationToken cancellationToken)
        {
            return FetchAccessToken();
        }
    }
}
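As a design note, the same result can be achieved with less machinery by inheriting from the abstract TokenCredential in Azure.Core directly, since the SDK only calls the two GetToken overrides. A sketch under that assumption (StaticTokenCredential is my own name, not part of any SDK; the expiry is supplied explicitly, e.g. result.ExpiresOn from MSAL, instead of being parsed out of the JWT):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Core;

// Minimal TokenCredential wrapper around a pre-acquired bearer token.
// The expiry is passed in by the caller, so no JWT parsing is required.
public sealed class StaticTokenCredential : TokenCredential
{
    private readonly AccessToken _token;

    public StaticTokenCredential(string accessToken, DateTimeOffset expiresOn)
    {
        _token = new AccessToken(accessToken, expiresOn);
    }

    public override AccessToken GetToken(TokenRequestContext requestContext, CancellationToken cancellationToken)
        => _token;

    public override ValueTask<AccessToken> GetTokenAsync(TokenRequestContext requestContext, CancellationToken cancellationToken)
        => new ValueTask<AccessToken>(_token);
}
```

Usage would then be `TokenCredential tokenCredential = new StaticTokenCredential(result.AccessToken, result.ExpiresOn);`.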

We then pass the acquired access token as the constructor argument to the AccessTokenCredential class

TokenCredential tokenCredential = new AccessTokenCredential(result.AccessToken);

This returns an object of type TokenCredential, which is then passed to the DataLakeServiceClient

datalake_Service_Client = new DataLakeServiceClient(new Uri(dfsUri), tokenCredential);

For more details on the above topic please refer to my article on the topic here.

Conclusion

In this article, I touched on how to leverage DataLakeServiceClient for Azure Data Lake storage. We explored how it can be used to interact with containers, directories, and files programmatically, and how to efficiently navigate the entire directory structure within ADLS Gen2 storage, all through MSAL.

Thanks for reading !!!