Modern analytics architecture with Azure Databricks

Azure Data Factory
Azure Data Lake Storage
Azure Databricks
Azure Synapse Analytics
Power BI

Solution ideas

This article is a solution idea.

This solution outlines a modern data architecture with Azure Databricks at its core. Azure Databricks integrates with other services, such as Azure Data Lake Storage Gen2, Azure Data Factory, Azure Synapse Analytics, and Power BI.

Apache® and Apache Spark™ are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Architecture

Architecture diagram showing how a modern data architecture collects, processes, analyzes, and visualizes data.


Dataflow

  1. Azure Databricks ingests raw streaming data from Azure Event Hubs.

  2. Data Factory loads raw batch data into Data Lake Storage Gen2.

  3. For data storage:

    • Data Lake Storage Gen2 houses data of all types, such as structured, unstructured, and semi-structured. It also stores batch and streaming data.

    • Delta Lake forms the curated layer of the data lake. It stores the refined data in an open-source format.

    • Azure Databricks works well with a medallion architecture that organizes data into layers:

      • Bronze: Holds raw data.
      • Silver: Contains cleaned, filtered data.
      • Gold: Stores aggregated data that's useful for business analytics.
  4. The analytical platform ingests data from the disparate batch and streaming sources. Data scientists use this data for these tasks:

    • Data preparation.
    • Data exploration.
    • Model preparation.
    • Model training.

    MLflow manages parameter, metric, and model tracking in data science code runs. The coding possibilities are flexible:

    • Code can be in SQL, Python, R, and Scala.
    • Code can use popular open-source libraries and frameworks such as Koalas, pandas, and scikit-learn, which are pre-installed and optimized.
    • Practitioners can optimize for performance and cost with single-node and multi-node compute options.
  5. Machine learning models are available in several formats:

    • Azure Databricks stores information about models in the MLflow Model Registry. The registry makes models available through batch, streaming, and REST APIs.
    • The solution can also deploy models to Azure Machine Learning web services or Azure Kubernetes Service (AKS).
  6. Services that work with the data connect to a single underlying data source to ensure consistency. For instance, users can run SQL queries on the data lake with Azure Databricks SQL Analytics.

  7. Power BI generates analytical and historical reports and dashboards from the unified data platform.

  8. Users can export gold data sets out of the data lake into Azure Synapse via the optimized Synapse connector. SQL pools in Azure Synapse provide a data warehousing and compute environment.

  9. The solution uses Azure services for collaboration, performance, reliability, governance, and security:

    • Microsoft Purview provides data discovery services, sensitive data classification, and governance insights across the data estate.

    • Azure DevOps offers continuous integration and continuous deployment (CI/CD) and other integrated version control features.

    • Azure Key Vault securely manages secrets, keys, and certificates.

    • Microsoft Entra ID provides single sign-on (SSO) for Azure Databricks users. Azure Databricks supports automated user provisioning with Microsoft Entra ID for these tasks:

      • Creating new users.
      • Assigning each user an access level.
      • Removing users and denying them access.
    • Azure Monitor collects and analyzes Azure resource telemetry. By surfacing problems early, this service helps maintain performance and reliability.

    • Azure Cost Management and Billing provide financial governance services for Azure workloads.
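
As a rough illustration of the medallion layers in step 3, the following Python sketch moves a handful of records from bronze to silver to gold. It uses plain Python lists and dictionaries so that it runs anywhere. In Azure Databricks, the same steps would typically operate on Spark DataFrames stored as Delta tables. The sample records, the field names, and the functions to_silver and to_gold are hypothetical and aren't part of the reference architecture.

```python
from collections import defaultdict
from typing import Dict, List

# Bronze: raw events exactly as ingested (hypothetical sample records).
bronze: List[dict] = [
    {"device": "a", "reading": "21.5"},
    {"device": "a", "reading": "22.5"},
    {"device": "b", "reading": "bad"},    # malformed value
    {"device": "b", "reading": "19.0"},
    {"device": None, "reading": "20.0"},  # missing device key
]


def to_silver(rows: List[dict]) -> List[dict]:
    """Clean and filter: drop rows with no device or a non-numeric reading."""
    silver = []
    for row in rows:
        if not row.get("device"):
            continue
        try:
            value = float(row["reading"])
        except (TypeError, ValueError):
            continue
        silver.append({"device": row["device"], "reading": value})
    return silver


def to_gold(rows: List[dict]) -> Dict[str, float]:
    """Aggregate: average reading per device, ready for reporting."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["device"]].append(row["reading"])
    return {device: sum(vals) / len(vals) for device, vals in grouped.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'a': 22.0, 'b': 19.0}
```

Because the bronze records stay untouched, the silver and gold layers can be rebuilt whenever the cleaning or aggregation rules change.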

Components

The solution uses the following components.

Core components

  • Azure Databricks is a data analytics platform. Its fully managed Spark clusters process large streams of data from multiple sources. Azure Databricks cleans and transforms unstructured data sets. It combines the processed data with structured data from operational databases or data warehouses. Azure Databricks also trains and deploys scalable machine learning and deep learning models.

  • Event Hubs is a big data streaming platform. As a platform as a service (PaaS), this event ingestion service is fully managed.

  • Data Factory is a hybrid data integration service. You can use this fully managed, serverless solution to create, schedule, and orchestrate data transformation workflows.

  • Data Lake Storage Gen2 is a scalable and secure data lake for high-performance analytics workloads. This service can manage multiple petabytes of information while sustaining hundreds of gigabits of throughput. The data may be structured, semi-structured, or unstructured. It typically comes from multiple, heterogeneous sources like logs, files, and media.

  • Azure Databricks SQL Analytics runs queries on data lakes. This service also visualizes data in dashboards.

  • Machine Learning is a cloud-based environment that helps you build, deploy, and manage predictive analytics solutions. With these models, you can forecast behavior, outcomes, and trends.

  • AKS is a highly available, secure, and fully managed Kubernetes service. AKS makes it easy to deploy and manage containerized applications.

  • Azure Synapse is an analytics service for data warehouses and big data systems. This service integrates with Power BI, Machine Learning, and other Azure services.

  • Azure Synapse connectors provide a way to access Azure Synapse from Azure Databricks. These connectors efficiently transfer large volumes of data between Azure Databricks clusters and Azure Synapse instances.

  • SQL pools provide a data warehousing and compute environment in Azure Synapse. The pools are compatible with Azure Storage and Data Lake Storage Gen2.

  • Delta Lake is a storage layer that uses an open file format. This layer runs on top of cloud storage such as Data Lake Storage Gen2. Delta Lake supports data versioning, rollback, and transactions for updating, deleting, and merging data.

  • MLflow is an open-source platform for the machine learning lifecycle. Its components monitor machine learning models during training and running. MLflow also stores models and loads them in production.
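
The following sketch shows one way a notebook might export a gold table through the Azure Synapse connector described in the list above. It assumes that a Spark DataFrame already exists in an Azure Databricks notebook, and it assumes the connector's documented format name and option names. Confirm both against the Databricks runtime in use. The function name write_gold_to_synapse and all of its parameters are hypothetical.

```python
def write_gold_to_synapse(gold_df, jdbc_url, temp_dir, table_name):
    """Write a gold-layer Spark DataFrame to a table in an Azure Synapse SQL pool.

    gold_df    -- a Spark DataFrame (assumed to exist in the notebook)
    jdbc_url   -- JDBC connection string for the SQL pool (placeholder)
    temp_dir   -- abfss:// staging folder in Data Lake Storage Gen2 (placeholder)
    table_name -- target table, for example "dbo.gold_sales" (hypothetical)
    """
    (
        gold_df.write
        .format("com.databricks.spark.sqldw")  # Azure Synapse connector
        .option("url", jdbc_url)
        # Assumption: the cluster already holds a storage credential to forward.
        .option("forwardSparkAzureStorageCredentials", "true")
        .option("dbTable", table_name)
        .option("tempDir", temp_dir)  # data is staged here before loading
        .mode("overwrite")  # replace the target table on each export
        .save()
    )
```

The connector stages the data in the tempDir location and then loads it into the SQL pool, which is why the function takes a Data Lake Storage Gen2 path in addition to the JDBC URL.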

Reporting and governing components

  • Power BI is a collection of software services and apps. These services create and share reports that connect and visualize unrelated sources of data. Together with Azure Databricks, Power BI can provide root cause determination and raw data analysis.

  • Microsoft Purview manages on-premises, multicloud, and software as a service (SaaS) data. This governance service maintains data landscape maps. Features include automated data discovery, sensitive data classification, and data lineage.

  • Azure DevOps is a DevOps orchestration platform. This SaaS provides tools and environments for building, deploying, and collaborating on applications.

  • Azure Key Vault stores and controls access to secrets such as tokens, passwords, and API keys. Key Vault also creates and controls encryption keys and manages security certificates.

  • Microsoft Entra ID offers cloud-based identity and access management services. These features provide a way for users to sign in and access resources.

  • Azure Monitor collects and analyzes data on environments and Azure resources. This data includes app telemetry, such as performance metrics and activity logs.

  • Azure Cost Management and Billing manage cloud spending. By using budgets and recommendations, this service organizes expenses and shows how to reduce costs.

Scenario details

Modern data architectures meet these criteria:

  • Unify data, analytics, and AI workloads.
  • Run efficiently and reliably at any scale.
  • Provide insights through analytics dashboards, operational reports, or advanced analytics.

This solution outlines a modern data architecture that achieves these goals. Azure Databricks forms the core of the solution and integrates with the other services in the architecture. Together, these services provide a solution with these qualities:

  • Simple: Unified analytics, data science, and machine learning simplify the data architecture.
  • Open: The solution supports open-source code, open standards, and open frameworks. It also works with popular integrated development environments (IDEs), libraries, and programming languages. Through native connectors and APIs, the solution works with a broad range of other services, too.
  • Collaborative: Data engineers, data scientists, and analysts work together with this solution. They can use collaborative notebooks, IDEs, dashboards, and other tools to access and analyze common underlying data.

Potential use cases

The system that Swiss Re Group built for its Property & Casualty Reinsurance division inspired this solution. Besides the insurance industry, any area that works with big data or machine learning can also benefit from this solution. Examples include:

  • The energy sector
  • Retail and e-commerce
  • Banking and finance
  • Medicine and healthcare

Next steps

To learn about related solutions, see this information: