Chapter 7: Observability and Alerting

In the previous chapter, we examined the hsr-cloud-accelerator's memory—the Metadata Database Schema. This database meticulously records every action the application takes. But a diary is only useful if someone reads it. How do we, as developers, keep an eye on what the system is doing without having to stare at database tables all day?

The Problem: A Car Without a Dashboard

Imagine driving a car that has no dashboard. No speedometer, no fuel gauge, and no warning lights. You wouldn't know how fast you were going, when you were about to run out of gas, or if the engine was overheating. You'd be driving "blind," only finding out about a problem when the car grinds to a halt on the side of the road.

A complex data pipeline without a monitoring system is just like that car. It might be processing data successfully, or it might be quietly failing in a corner. Without gauges and warning lights, we have no idea about its health, status, or performance until a critical report fails to show up, and by then, it might be too late.

The Solution: Gauges and Warning Lights for Our Pipeline

The hsr-cloud-accelerator solves this with its Observability and Alerting layer. This is the system's interactive dashboard and automated communication system. It gives us two key things:

  1. Rich, Structured Logging (The Gauges): Instead of just printing simple messages, our application generates detailed, structured logs for every event. These logs are sent to Google Cloud Logging, creating a searchable, filterable history of every action. It’s like having a computer diagnostic system that tells you the exact performance of every part of your engine.

  2. Proactive Alerting (The Warning Lights): For critical events—like a major error or the successful completion of an entire workflow—the system doesn't just log it; it actively reaches out to us. It constructs a beautiful, easy-to-read "alert card" and sends it directly to a Microsoft Teams channel, keeping the development team informed in real-time.

How It Works: From an Event to an Alert

The entire process is seamlessly integrated into the application's logging framework. When a piece of code wants to record an event, it uses a standard logger, but a special custom handler intercepts the message.

  1. Event Occurs: A function in our code encounters an error. It calls the standard logger.error() function.
  2. Context is Added: Our custom logging handler catches this call. Before sending it anywhere, it enriches the log message with crucial context, like the name of the file being processed or the ID of the task that failed.
  3. Log is Stored: The enriched log message is sent to Google Cloud Logging, where it's stored permanently.
  4. Alert is Sent: Because the log level was ERROR, the handler knows this is a critical event. It triggers a separate process to build a formatted alert card and send it to our Microsoft Teams channel.

This means we get the best of both worlds: a detailed log for deep-dive analysis and an instant notification for things that need our immediate attention.
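To make this flow concrete, here is a minimal sketch of what such a custom handler could look like, using only Python's standard library. The class name, the CURRENT_CONTEXT dictionary, and the placeholder sender are assumptions for illustration; the real handler enriches records and forwards them to Google Cloud Logging rather than printing them.

# Illustrative sketch only: a context-enriching, alerting log handler.
import logging

# Hypothetical module-level context filled in by the pipeline.
CURRENT_CONTEXT = {"file_id": None, "task_id": None}

def report_alert_to_teams(record, msg, other_log_fields):
    """Placeholder for the real sender, shown later in this chapter."""
    print("TEAMS ALERT:", msg, other_log_fields)

class AlertingHandler(logging.Handler):
    """Enriches every record with pipeline context and alerts on errors."""

    def emit(self, record: logging.LogRecord) -> None:
        # Step 2: enrich the log message with crucial context.
        extra_fields = {k: v for k, v in CURRENT_CONTEXT.items() if v is not None}

        # Step 3: hand the enriched message to the log sink
        # (Google Cloud Logging in the real system; stdout in this sketch).
        print(self.format(record), extra_fields)

        # Step 4: critical events also trigger a Teams alert.
        if record.levelno >= logging.ERROR:
            report_alert_to_teams(record, record.getMessage(), extra_fields)

logger = logging.getLogger("pipeline")
logger.addHandler(AlertingHandler())
logger.error("Could not connect to Zoho API")  # enriched log entry plus Teams alert

Because the handler is attached once, every ordinary logger.error() call anywhere in the pipeline picks up the enrichment and alerting behaviour automatically.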

A Peek at a Teams Alert

Instead of a cryptic text message, our alerts are clean, informative cards.

An Example Teams Alert Card:

Fact Name          Value
-----------------  -------------------------------------
Occured at         2024-08-23 10:30:15
Running for        2 minutes ago
extraction_name    ZOHO
error_type         TRANSFER ORDERS
Link to logs       (Clickable Link to Google Cloud Logs)

Error Details:

Traceback (most recent call last):
  File "run/pubsub/handlers/Sku_Master.py", line 42, in sku_master_extractor
    raise ValueError("Could not connect to Zoho API")
ValueError: Could not connect to Zoho API

This card tells us exactly what failed, when it failed, and most importantly, gives us a one-click link to dive directly into the detailed logs in Google Cloud for deeper investigation.

Under the Hood: The Code That Builds the Alerts

The magic happens in run/pubsub/cloud_logging/log_alerts.py. Let's look at a simplified version of its key components.

1. Building the Card (to_teams)

This function takes the information from a log event and organizes it into a MessageCard structure.
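The snippet below refers to Fact and MessageCard types whose definitions are not shown in this chapter. As a rough mental model, you can think of them as simple dataclasses mirroring the shape of a Teams MessageCard; the sketch here is an assumption, not the project's actual definitions.

# Hypothetical stand-ins for the Fact and MessageCard types used below;
# the real definitions in the project may differ.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Fact:
    name: str
    value: str

@dataclass
class MessageCard:
    title: str
    text: str
    facts: List[Fact] = field(default_factory=list)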

# File: run/pubsub/cloud_logging/log_alerts.py (Simplified)
from datetime import datetime
from logging import LogRecord

def to_teams(record: LogRecord, msg: str, other_log_fields: dict):
    # Get human-readable timestamps
    occured_dt = datetime.fromtimestamp(record.created)

    # Create "Facts" - the key-value pairs for the card
    facts = [
        Fact("Occured at", occured_dt.strftime("%Y-%m-%d %H:%M:%S")),
    ]
    other_facts = [Fact(key, str(value)) for key, value in other_log_fields.items()]
    facts.extend(other_facts)

    # ... more code to assemble the card components ...

    return MessageCard(...)

This code dynamically builds a list of Fact objects, which become the neat rows you see in the alert table. It can include any custom information we add to our logs.

2. Building the Log Link (get_url)

This is one of the most powerful features: the get_url helper constructs a precise URL that opens Google Cloud Logging pre-filtered to show only the logs related to the specific event that caused the alert.

# File: run/pubsub/cloud_logging/log_alerts.py (Simplified)
import os
from logging import LogRecord
from typing import Any, Dict

def get_url(record: LogRecord, query_fields: Dict[str, Any]):
    # Base query for our service
    query = f"""resource.labels.service_name="{os.getenv('PUBSUB_SERVICE')}"
severity>=DEFAULT"""

    # Add specific filters from the log context
    for field, val in query_fields.items():
        query = f'''{query}\njsonPayload."{field}"=\"{val}\"'''

    # Build the full URL
    url = f"https://console.cloud.google.com/logs/query;query={query};..."
    return url

By adding filters for things like the file_id or a unique task_id, this link saves us from manually searching through thousands of log lines, taking us directly to the needle in the haystack.
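One practical detail the simplified snippet glosses over: the query contains quotes, spaces, and newlines, so it has to be URL-encoded before it can safely be embedded in a link. Here is a minimal sketch of that step using the standard library; the exact encoding rules in the real module may differ.

# Illustrative only: URL-encode a Cloud Logging query before embedding it in a link.
from urllib.parse import quote

query = 'resource.labels.service_name="pubsub-service"\nseverity>=DEFAULT'
encoded = quote(query, safe="")  # percent-encode quotes, newlines, spaces, etc.
url = f"https://console.cloud.google.com/logs/query;query={encoded}"
print(url)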

3. Sending the Alert (report_alert_to_teams)

Finally, this function takes the assembled card, converts it to JSON, and sends it to the Microsoft Teams webhook URL we defined in our Configuration Hub.

# File: run/pubsub/cloud_logging/log_alerts.py (Simplified)
import json
import os
import threading

def report_alert_to_teams(record, msg, other_log_fields):
    teams_webhook_url = os.getenv("TEAMS_WEBHOOK_URL")
    if not teams_webhook_url:
        return  # Don't do anything if no URL is set

    # 1. Build the card object
    teams_webhook = to_teams(record, msg, other_log_fields)

    # 2. Convert card to JSON payload
    payload = json.dumps(teams_webhook, default=lambda o: o.__dict__)

    # 3. Send it without blocking the main app
    threading.Thread(target=request_task, args=(teams_webhook_url, payload, ...)).start()

Notice the use of threading.Thread. This is a "fire-and-forget" approach. It sends the alert on a separate, parallel thread. This is crucial because it ensures that our main application doesn't have to wait for the alert to be sent. The pipeline can continue its work without delay.
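The request_task worker itself is not shown in this chapter. Most likely it simply POSTs the JSON payload to the webhook and swallows any delivery errors; the sketch below illustrates that idea with the requests library, and its name, signature, and error handling are assumptions.

# Hypothetical sketch of the worker that posts the alert payload to Teams.
import logging
from typing import Optional

import requests

def request_task(url: str, payload: str, headers: Optional[dict] = None) -> None:
    try:
        response = requests.post(
            url,
            data=payload,
            headers=headers or {"Content-Type": "application/json"},
            timeout=10,
        )
        response.raise_for_status()
    except requests.RequestException:
        # Never let a failed alert crash the pipeline; just log the problem locally.
        logging.getLogger(__name__).warning("Teams alert could not be delivered.")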

How it's Triggered

The best part is that developers don't need to learn this complex logic. They just use standard logging. For example, in our Sku_Master.py extractor, the error handling is simple:

# File: run/pubsub/handlers/Sku_Master.py
try:
    # ... code to extract data from Zoho ...
    logger.info("SKU MASTER: Item Master pushed to cloud.")
except Exception as e:
    # This helper function triggers the Teams alert!
    ut.error_team_notifications(
        extraction_name="ZOHO",
        error_type="TRANSFER ORDERS",
        error_text=traceback.format_exc()
    )
    logger.exception(e)  # This triggers the detailed Cloud Log

The developer only needs to call ut.error_team_notifications. This simple helper function handles all the complexity of building and sending the beautiful alert card we saw earlier.
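The body of ut.error_team_notifications is not shown here. Judging by how it is called, one plausible implementation is to emit an ERROR-level record carrying the card fields as structured context, letting the custom handler described earlier build and send the card. The helper below is a hedged sketch of that idea, not the project's actual implementation.

# Hypothetical sketch of a helper like ut.error_team_notifications.
import logging

logger = logging.getLogger("pipeline")

def error_team_notifications(extraction_name: str, error_type: str, error_text: str) -> None:
    # Emit an ERROR record carrying the card fields as structured context;
    # the custom alerting handler turns fields like these into card facts.
    logger.error(
        "Extraction failed",
        extra={
            "extraction_name": extraction_name,
            "error_type": error_type,
            "error_text": error_text,
        },
    )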

Conclusion

You have just learned about Observability and Alerting, the eyes, ears, and voice of the hsr-cloud-accelerator.

  • It provides deep visibility into the system's health and status.
  • Rich, structured logs are sent to Google Cloud Logging for detailed analysis.
  • Proactive alerts are sent to Microsoft Teams for critical events, ensuring the team is always informed.
  • This layer is built on custom logging handlers that automatically enrich logs and trigger alerts, making it easy for developers to use.

With this final chapter, you've completed the tour of the core concepts behind the hsr-cloud-accelerator. You now understand how the project is configured, how data is ingested from files and APIs, how events are routed, how complex workflows are managed with a dependency engine, how the system remembers its state, and finally, how it communicates its status back to you.

You are now equipped with the foundational knowledge to explore, use, and contribute to the project with confidence. Happy coding!

