Chapter 5: External Services Integration (Dataform & Power BI)

Welcome to Chapter 5! In Chapter 4: Task Definition & Execution Workflow, we saw how our cloud-accelerator-function can manage complex recipes of operations, ensuring steps run in the correct order. We learned about tasks, dependencies, and how they form a workflow.

But what if some of those steps involve specialized tools that live outside our main application? For example, what if we need to run a very complex set of SQL transformations in BigQuery, or update a business report in a tool like Power BI? Our cloud-accelerator-function is smart, but it doesn't try to do everything itself. Instead, it acts like a good project manager who knows when to delegate specialized tasks to dedicated teams. This chapter explores how our system integrates with External Services, specifically Google Cloud's Dataform and Microsoft's Power BI.

The Project Manager and Specialist Teams: Why Integrate?

Imagine you're managing a big project, like launching a new product.

  • You need some advanced financial models built. You wouldn't build them from scratch; you'd ask the Finance Team (a specialist team) to do it.
  • You need a marketing brochure designed. You'd ask the Design Team (another specialist team) to create it.

Your job as the project manager is to:

  1. Tell the Finance Team when to start working on the models (e.g., after initial sales projections are ready).
  2. Tell the Design Team when to start the brochure (e.g., after the product features are finalized and the financial models look good).
  3. Know when each team has finished their work so the next step can begin.

Our cloud-accelerator-function often faces similar situations in data processing:

  • Complex Data Transformations: For really complicated SQL-based data shaping and modeling in BigQuery, Google Cloud offers a tool called Dataform. Dataform is like our specialist "Data Modeling Team."
  • Business Intelligence Reporting: For creating and sharing interactive reports and dashboards, many organizations use Power BI. Power BI is like our specialist "Reporting Team."

Our cloud-accelerator-function doesn't try to replicate Dataform's SQL power or Power BI's reporting features. Instead, it orchestrates them. It tells Dataform, "Hey, run this data transformation now!" and tells Power BI, "Okay, time to refresh that sales report with the latest data!"

Our Use Case: Let's say we've just loaded fresh daily sales data into a raw BigQuery table (as seen in Chapter 2: Data Ingestion & Transformation Pipeline). Now we want to:

  1. Trigger a Dataform workflow that takes this raw sales data, cleans it, aggregates it, and creates a final sales_summary table in BigQuery.
  2. Once the sales_summary table is updated by Dataform, trigger a Power BI dataset refresh so that the sales dashboard reflects the very latest figures.

This is a perfect job for our "project manager" cloud-accelerator-function and its "specialist teams" Dataform and Power BI.

Key Concepts: Our Specialist Teams

  1. Dataform (The Data Modeling Team):

    • What it is: Dataform is a service within Google Cloud that helps you develop, test, and deploy complex SQL workflows for data transformation in BigQuery. You define your data models (tables, views) and their dependencies using SQLX (an extension of SQL), and Dataform handles the execution.
    • How we use it: Our cloud-accelerator-function can trigger a predefined Dataform workflow. Think of it as telling Dataform, "Please run the set of SQL scripts labeled 'daily_sales_rollup'."
  2. Power BI (The Reporting Team):

    • What it is: Power BI is a business analytics service by Microsoft. It lets you connect to various data sources, create interactive dashboards and reports, and share them.
    • How we use it: Our cloud-accelerator-function can send a signal to Power BI to "refresh" a dataset. This tells Power BI to go fetch the latest data from its sources (like our sales_summary table in BigQuery) so that reports are up-to-date.
  3. Orchestration via Tasks:

    • How does our system "tell" these external services what to do? It uses the Task system we learned about in Chapter 4: Task Definition & Execution Workflow.
    • A task in our workflow can be defined as:
      • "Execute Dataform workflow X."
      • "Refresh Power BI dataset Y."
    • Our system manages the dependencies. For example, the "Refresh Power BI" task will only run after the "Execute Dataform" task successfully completes (sketched just after this list).
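
To make this concrete, here is a minimal sketch of what these two task definitions could look like if written out as plain Python dictionaries. The field names (task_id, task_type, params, depends_on) are illustrative assumptions, not the real schema; in the actual system the definitions live in the metadata database from Chapter 3: Metadata Persistence (Database).

# Hypothetical task definitions; field names are illustrative, not the real schema.
TASKS = [
    {
        "task_id": "DF_Sales",
        "task_type": "DATAFORM_EXECUTION",
        "params": {"workflow": "process_daily_sales"},
        "depends_on": [],  # Runs as soon as new raw sales data lands
    },
    {
        "task_id": "PBI_Refresh",
        "task_type": "POWERBI_REFRESH",
        "params": {"workspace": "Sales Reports", "dataset_id": "abc123xyz"},
        "depends_on": ["DF_Sales"],  # Only runs after DF_Sales succeeds
    },
]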

Solving the Use Case: Step-by-Step Orchestration

Let's see how our cloud-accelerator-function manages the Dataform and Power BI tasks for our sales data scenario:

  1. Task Definitions: In our system's configuration (likely in the database from Chapter 3: Metadata Persistence (Database)), we have tasks defined like this:

    • Task 1 (DF_Sales):
      • Type: DATAFORM_EXECUTION
      • Details: Run Dataform workflow named process_daily_sales in our GCP project.
    • Task 2 (PBI_Refresh):
      • Type: POWERBI_REFRESH
      • Details: Refresh Power BI dataset ID abc123xyz in workspace Sales Reports.
      • Dependency: Depends on DF_Sales task completing successfully.
  2. Workflow Trigger: The successful loading of new raw sales data might trigger DF_Sales.

  3. Executing the Dataform Task (DF_Sales):

    • The system picks up DF_Sales.
    • It makes an API call to Google Cloud Dataform, telling it to start the process_daily_sales workflow.
    • Dataform begins its complex SQL magic.
    • Important: Dataform is configured (remember run/scripts/addDataFormCompletionTrigger.sh from Chapter 1: Pub/Sub Event Handling & Routing?) to send a Pub/Sub message back to our cloud-accelerator-function when it's done.
  4. Dataform Completion and Triggering Power BI Task:

    • Our function receives the "Dataform workflow process_daily_sales completed" message. The handle_completed_dataform handler (mentioned in Chapter 1) processes this.
    • This, in turn, updates the TaskLog for DF_Sales to SUCCESS.
    • The task workflow logic from Chapter 4: Task Definition & Execution Workflow sees that DF_Sales is done and PBI_Refresh depends on it.
    • So, it triggers the PBI_Refresh task (this dependency check is sketched just after this list).
  5. Executing the Power BI Task (PBI_Refresh):

    • The system picks up PBI_Refresh.
    • It makes an API call to the Microsoft Power BI service, instructing it to refresh dataset abc123xyz.
    • Power BI starts pulling the fresh data from the sales_summary table (which Dataform just updated).
    • Once the refresh is initiated, our task might be considered complete (or it could poll for completion, but "initiation" is often enough for a basic setup). The TaskLog for PBI_Refresh is updated.
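
The dependency check in step 4 can be pictured with a short sketch. This is not the project's actual workflow engine; it just illustrates the rule "start a task once everything it depends on has succeeded", reusing the hypothetical task shape from earlier and a run_task callable that stands in for whatever actually executes a task.

# Hypothetical sketch of dependency-driven triggering (not the real workflow engine).
def on_task_success(completed_task_id, tasks, statuses, run_task):
    """Start every task whose dependencies have all reached SUCCESS.
    `statuses` maps task_id -> status; `run_task` executes a task definition."""
    statuses[completed_task_id] = "SUCCESS"
    for task in tasks:
        if completed_task_id not in task["depends_on"]:
            continue  # This task is not waiting on the task that just finished
        if all(statuses.get(dep) == "SUCCESS" for dep in task["depends_on"]):
            run_task(task)  # e.g. PBI_Refresh is started once DF_Sales succeeds

Called with completed_task_id="DF_Sales" and the TASKS list from earlier, this would launch PBI_Refresh.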

And just like that, our sales dashboard is updated with the latest data, all orchestrated by the cloud-accelerator-function without it needing to know the nitty-gritty of SQL transformations or Power BI report rendering!

Under the Hood: Talking to Dataform

How does our cloud-accelerator-function actually "talk" to Dataform? It uses Google Cloud's client libraries.

Simplified flow: our function calls the Dataform API to create a workflow invocation, Dataform runs the SQL workflow, and a Pub/Sub message signals completion back to our function.

Conceptual Code - Triggering Dataform: When a task of type DATAFORM_EXECUTION needs to run, the code might look something like this (highly simplified):

# Conceptual: Inside a task execution function
# from google.cloud import dataform_v1beta1

def trigger_dataform_workflow(project_id, region, repository_id, workflow_id):
    # client = dataform_v1beta1.DataformClient()  # Initialize the Dataform client

    # parent = f"projects/{project_id}/locations/{region}/repositories/{repository_id}"
    # workflow_invocation = {"compilation_result": f"{parent}/compilationResults/your-compilation-id"}  # Simplified

    try:
        # response = client.create_workflow_invocation(
        #     parent=parent, workflow_invocation=workflow_invocation
        # )
        # logger.info(f"Dataform workflow invocation created: {response.name}")
        print(f"SIMULATE: Called Dataform API for workflow {workflow_id}")
        # The actual work happens in Dataform; we've just kicked it off.
        # We rely on the Pub/Sub message for completion (see Chapter 1).
        return True  # Indicates initiation was successful
    except Exception as e:
        # logger.error(f"Failed to trigger Dataform: {e}")
        print(f"SIMULATE ERROR: Failed to trigger Dataform: {e}")
        return False

This Python snippet shows the idea:

  1. It would use a DataformClient from Google's libraries.
  2. It constructs a request specifying which Dataform workflow to run.
  3. It calls an API method like create_workflow_invocation. The real magic is that Dataform is set up to send a Pub/Sub message upon completion (as configured by scripts like run/scripts/addDataFormCompletionTrigger.sh). Our application listens for this message (see handle_completed_dataform in main.py from Chapter 1: Pub/Sub Event Handling & Routing) to know the Dataform task is truly finished.
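
For orientation, here is a rough sketch of what handling such a completion message could involve. The payload fields (workflow_name, status) are assumptions for illustration only; the project's real handler is handle_completed_dataform in main.py, and the actual message format depends on how the completion trigger is configured.

# Hypothetical sketch of a Pub/Sub completion handler; payload fields are assumptions.
import base64
import json

def handle_dataform_completion(pubsub_message):
    # Pub/Sub delivers the publisher's payload base64-encoded in the "data" field.
    payload = json.loads(base64.b64decode(pubsub_message["data"]).decode("utf-8"))
    workflow = payload.get("workflow_name", "unknown")  # assumed field name
    status = payload.get("status", "SUCCESS")           # assumed field name
    # In the real system this is where the TaskLog entry for the matching task
    # would be set to SUCCESS, letting the Chapter 4 workflow logic trigger
    # dependent tasks such as PBI_Refresh.
    print(f"Dataform workflow {workflow} finished with status {status}")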

Under the Hood: Talking to Power BI

How does our function "talk" to Power BI? It typically uses the Power BI REST API.

Simplified flow: our function obtains an Azure AD token, calls the Power BI REST API to request a dataset refresh, and Power BI queues and runs the refresh against the freshly updated sales_summary table.

Conceptual Code - Triggering Power BI Refresh: When a task of type POWERBI_REFRESH runs, the code (again, highly simplified) might involve making an HTTP request:

# Conceptual: Inside a task execution function
# import requests  # To make HTTP requests
# import adal      # For Azure Active Directory authentication for Power BI

def trigger_power_bi_refresh(workspace_id, dataset_id):
    # Step 1: Get an authentication token for the Power BI API
    # (This is complex; it involves app registration in Azure AD, client secrets, etc.)
    # token = get_power_bi_auth_token()  # Placeholder for auth logic
    token = "dummy_auth_token"

    headers = {"Authorization": f"Bearer {token}"}
    api_url = f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}/datasets/{dataset_id}/refreshes"

    try:
        # response = requests.post(api_url, headers=headers)
        # response.raise_for_status()  # Raises an error for bad responses (4xx or 5xx)
        # logger.info(f"Power BI refresh initiated for dataset {dataset_id}.")
        print(f"SIMULATE: Called Power BI API to refresh dataset {dataset_id}")
        return True  # Indicates initiation was successful
    except Exception as e:  # e.g. requests.exceptions.HTTPError
        # logger.error(f"Failed to trigger Power BI refresh: {e}")
        print(f"SIMULATE ERROR: Failed to trigger Power BI refresh: {e}")
        return False

This snippet shows:

  1. Authentication: Talking to the Power BI API securely requires an authentication token. Getting this token usually involves Azure Active Directory; the details (like client IDs and secrets) would be stored as part of Chapter 6: Configuration Management. A rough sketch of token acquisition follows this list.
  2. API Call: It uses a library like requests to make a POST request to the correct Power BI API endpoint.
  3. Response: Power BI typically responds with a status indicating the refresh has been accepted and queued. For many basic workflows, this initiation is enough to mark our task as successful.
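
One common way to obtain such a token today is the client credentials flow with the msal library (the snippet above mentions adal, an older library for the same purpose). The sketch below is a minimal illustration, not the project's actual auth code; the tenant ID, client ID, and client secret are placeholders that would come from configuration.

# Minimal sketch of acquiring a Power BI API token via the client credentials flow.
# Values are placeholders; in practice they come from configuration / a secrets manager.
import msal

def get_power_bi_auth_token(tenant_id, client_id, client_secret):
    app = msal.ConfidentialClientApplication(
        client_id,
        authority=f"https://login.microsoftonline.com/{tenant_id}",
        client_credential=client_secret,
    )
    # The ".default" scope requests the permissions granted to the app registration.
    result = app.acquire_token_for_client(
        scopes=["https://analysis.windows.net/powerbi/api/.default"]
    )
    if "access_token" not in result:
        raise RuntimeError(f"Token request failed: {result.get('error_description')}")
    return result["access_token"]

The returned access token is then passed in the Authorization header, as in trigger_power_bi_refresh above.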

Important Note on Security: Connecting to external services like Dataform (via GCP IAM roles) and Power BI (via Azure AD applications and secrets) requires careful security configuration. API keys, service account credentials, and other secrets must be managed securely, typically using a secrets manager and accessed via the application's Chapter 6: Configuration Management.

Conclusion

You've now seen how the cloud-accelerator-function acts as a skilled project manager, orchestrating external "specialist teams" like Dataform and Power BI. It doesn't do their specialized work but tells them when to do it and often listens for when they're done. This is achieved by:

  • Defining external operations as Tasks within its workflow system.
  • Using APIs and client libraries to communicate with Dataform and Power BI.
  • Often relying on callback mechanisms (like Pub/Sub for Dataform) or checking API responses to understand the status of the delegated work.

This ability to integrate with external services makes our cloud-accelerator-function incredibly versatile, allowing it to be the central nervous system for complex data pipelines that span multiple tools and platforms.

But how does our system know all the details it needs for these integrations – like Dataform project IDs, Power BI workspace IDs, API keys, or even which buckets to use for raw vs. archived data? All these settings need to be managed. In our next chapter, we'll explore Chapter 6: Configuration Management to see how the system keeps track of all its crucial operational parameters.