Chapter 5: Task & Dependency Engine

In the previous chapter, we learned how the Main Event Router acts as a smart dispatcher, directing single, isolated events to the correct handlers. This is great for simple, one-off jobs. But what happens when tasks depend on each other?

The Problem: You Can't Frost an Unbaked Cake

Imagine you're baking a cake. You have a recipe with several steps: mix the batter, bake the cake, let it cool, and then add the frosting. You instinctively know you cannot do these steps out of order. Frosting a hot, unbaked pile of batter would be a disaster!

Data pipelines often have similar recipes. Consider a common goal: creating a monthly profit report. To do this, you must first:

  1. Load the monthly_sales data.
  2. Load the monthly_expenses data.

Only when both of those tasks are complete can you run the final query to calculate the profit. Our system needs a "chef" who understands this recipe, checks when the prerequisite steps are done, and knows exactly when to start the final step. The Main Event Router alone doesn't know this recipe; it just knows how to handle one instruction at a time.

The Solution: A Smart Project Manager

The hsr-cloud-accelerator solves this with its Task & Dependency Engine. Think of this engine as a smart, meticulous project manager. Its job is to manage the entire workflow, not just individual tasks.

This engine understands the relationships between tasks. It knows that "Calculate Profit" is a child task that depends on two parent tasks: "Load Sales" and "Load Expenses". It uses a database to keep track of a master plan, which is structured as a Directed Acyclic Graph (DAG).

That sounds complicated, but it's just a fancy term for a flowchart that shows which tasks must be completed before others can begin. The "Acyclic" part simply means the workflow never gets stuck in an infinite loop (you can't have a task that depends on itself).

Here's our profit report workflow as a simple DAG:

    (A) Load Sales ──────┐
                         ├──> (C) Calculate Profit
    (B) Load Expenses ───┘

The engine's job is to watch tasks A and B. Only when both are marked "complete" will it give the green light for task C to start.
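
To make that readiness rule concrete, here is a minimal sketch (not the accelerator's actual code) that models the DAG as a plain Python mapping from each task to its parents. The task names are just the ones from this example.

# Illustrative sketch: the DAG as a dict mapping each task to its parents.
dag = {
    "load_sales": [],                                     # A: no prerequisites
    "load_expenses": [],                                  # B: no prerequisites
    "calculate_profit": ["load_sales", "load_expenses"],  # C depends on A and B
}

def is_ready(task, completed):
    """A task may start once every one of its parents has completed."""
    return all(parent in completed for parent in dag[task])

print(is_ready("calculate_profit", {"load_sales"}))                   # False
print(is_ready("calculate_profit", {"load_sales", "load_expenses"}))  # True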

How It Works: The Workflow in Action

Let's see how our "Project Manager" handles the monthly profit report. The whole process is event-driven, using Pub/Sub messages to communicate between tasks.

  1. First Parent Finishes: The "Load Sales" task finishes successfully. At the very end of its process, it notifies the Dependency Engine.
  2. Engine Checks Dependencies: The engine looks at its master plan (the database) and asks, "What tasks depend on 'Load Sales'?" It finds "Calculate Profit". Then it asks, "Are all parents of 'Calculate Profit' done for the March report?" The answer is no; it's still waiting for "Load Expenses". So, it does nothing and waits.
  3. Second Parent Finishes: A little later, the "Load Expenses" task finishes and sends its own success notification.
  4. Engine Re-Checks and Confirms: The engine receives this new message and repeats its check. "Are all parents of 'Calculate Profit' now complete?" This time, the answer is yes!
  5. Child Task is Triggered: Now that the conditions are met, the engine takes action. It creates a new Pub/Sub message telling the system to run the "Calculate Profit" task.
  6. Child Task Starts: The Main Event Router receives this new message and routes it to the correct handler, which starts calculating the final profit report.

This ensures the entire data pipeline executes in the correct order, automatically.
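
The notifications in steps 1, 3, and 5 are ordinary Pub/Sub messages. As a hedged sketch of what publishing one might look like, here is a small example using the google-cloud-pubsub client; the topic name ("task-events") and the message fields (task, batch_id, status) are assumptions for illustration, not the accelerator's actual payload format.

# Hypothetical sketch: publishing a task event to Pub/Sub.
# Topic name and payload fields are illustrative assumptions.
import json
from google.cloud import pubsub_v1

def publish_task_event(project_id, task_name, batch_id, status):
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, "task-events")  # assumed topic

    payload = json.dumps({
        "task": task_name,     # e.g. "load_sales"
        "batch_id": batch_id,  # groups one report's runs, e.g. "2024-03"
        "status": status,      # e.g. "SUCCESS"
    }).encode("utf-8")

    # publish() returns a future; result() blocks until delivery is confirmed.
    return publisher.publish(topic_path, data=payload).result()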

Under the Hood: The Database and the Logic

The "engine" isn't a single program but a combination of a smart database schema and logic added to our task handlers. This is a crucial concept that will be explored further in our next chapter, Metadata Database Schema.

For now, let's look at a simplified view of how it's designed.

The Database's Role (The Master Plan)

Our database contains a few key tables that act as the engine's brain:

  • Tasks: A list of all possible tasks (e.g., task_id: 101, name: 'load_sales').
  • TaskDependencies: Defines the parent-child relationships (e.g., child_task_id: 103, parent_task_id: 101). This is our DAG!
  • TaskLog: A record of every time a task is run, including its status (NEW, PROCESSING, SUCCESS, ERROR) and a batch_id to group related runs (e.g., all tasks for the "March 2024" report share the same batch_id).
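
Because the handlers in this chapter pass around a db_session, one plausible way to sketch these tables is with SQLAlchemy models. This is a simplified illustration whose column names come from the examples above; the real schema is the subject of the next chapter and may differ.

# Simplified, illustrative SQLAlchemy models for the three tables above.
from sqlalchemy import Column, Integer, String, ForeignKey
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Task(Base):
    __tablename__ = "tasks"
    task_id = Column(Integer, primary_key=True)    # e.g. 101
    name = Column(String, nullable=False)          # e.g. "load_sales"

class TaskDependency(Base):
    __tablename__ = "task_dependencies"            # this table *is* the DAG
    child_task_id = Column(Integer, ForeignKey("tasks.task_id"), primary_key=True)
    parent_task_id = Column(Integer, ForeignKey("tasks.task_id"), primary_key=True)

class TaskLog(Base):
    __tablename__ = "task_log"
    log_id = Column(Integer, primary_key=True)
    task_id = Column(Integer, ForeignKey("tasks.task_id"))
    batch_id = Column(String)                      # e.g. "2024-03"
    status = Column(String)                        # NEW / PROCESSING / SUCCESS / ERROR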

The Code's Role (The End-of-Task Checklist)

The logic for the engine lives inside the same handlers we've discussed before. After a handler finishes its main job (like loading a file), it performs a few final "dependency engine" steps.

Let's look at the conceptual logic inside the add_parquet_to_bigquery handler from the previous chapter.

# Conceptual code for the end of a task handler
def add_parquet_to_bigquery(data, db_session):
    # ... (code to load the Parquet file into BigQuery) ...
    # Assume the load was successful.

    # Look up this run's row in the TaskLog table (conceptual helper).
    task_log = get_task_log(data, db_session)

    # 1. Update our status in the database log.
    task_log.status = "SUCCESS"
    db_session.commit()

    # 2. Now, let the Dependency Engine take over.
    check_and_trigger_children(task_log, db_session)

After the main work is done, the handler looks up its own entry in the TaskLog, marks it SUCCESS, and then calls a helper function to handle the dependency logic.

Here is what that check_and_trigger_children function does conceptually:

# Conceptual helper function
def check_and_trigger_children(completed_task_log, db_session):
    # 1. Find all tasks that depend on me.
    #    (This queries the TaskDependencies table.)
    child_tasks = find_my_children_in_db(completed_task_log.task_id)

    for child in child_tasks:
        # 2. For each child, check if all its parents are done for this batch.
        #    (This queries the TaskLog table for the same batch_id.)
        if are_all_parents_complete(child.id, completed_task_log.batch_id):
            # 3. If so, send a message to start the child task!
            trigger_task_via_pubsub(child.name, completed_task_log.batch_id)

  1. Find Children: The function first queries the database to find all tasks that list the completed task as a parent.
  2. Check Parents: For each potential child task, it runs another check: "Looking at all tasks with the same batch_id, are all of this child's parents marked as SUCCESS in the TaskLog?"
  3. Trigger Next Task: If and only if the check passes, it publishes a new message to Pub/Sub. The Main Event Router will catch this message and start the child task, continuing the workflow.
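
To make step 2 concrete, here is a minimal sketch of how the are_all_parents_complete check could be written against the illustrative tables sketched earlier. It threads the db_session through explicitly, and all names follow this chapter's examples rather than the real code.

# Illustrative only: one way to ask "are all parents SUCCESS for this batch?"
def are_all_parents_complete(child_task_id, batch_id, db_session):
    # All parent task_ids for this child (from the TaskDependencies table).
    parent_ids = [
        dep.parent_task_id
        for dep in db_session.query(TaskDependency)
                             .filter_by(child_task_id=child_task_id)
    ]

    # How many of those parents have a SUCCESS entry for this batch?
    done = (
        db_session.query(TaskLog.task_id)
        .filter(
            TaskLog.task_id.in_(parent_ids),
            TaskLog.batch_id == batch_id,
            TaskLog.status == "SUCCESS",
        )
        .distinct()
        .count()
    )

    # The child is ready only when every parent has succeeded in this batch.
    return done == len(parent_ids)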

Conclusion

You've just learned about the Task & Dependency Engine, the automated project manager for hsr-cloud-accelerator.

  • It orchestrates complex workflows with multiple steps.
  • It uses a database to define and track task dependencies (the "recipe" or DAG).
  • It ensures tasks always run in the correct sequence.
  • It works by having each successful task check if it can trigger its children, creating a powerful, self-managing pipeline.

The power of this engine comes from its "brain"—the database that stores the master plan. To truly understand our system, we need to look closer at how that brain is structured.

Next up: Metadata Database Schema
