Chapter 7: Structured Logging and Alerting
Welcome to Chapter 7! In Chapter 6: Configuration Management, we learned how our cloud-accelerator-function gets all its important settings, like project IDs and database URLs, making it flexible and easy to deploy. With our application configured and running, it's vital to know what it's doing, especially if problems arise. This chapter is all about how the application records its activities and tells us when something needs our attention, using Structured Logging and Alerting.
The Flight Recorder & Alarm System: Why We Need This
Imagine our cloud-accelerator-function is like a sophisticated airplane on an important journey (processing your data!).
- Flight Recorder (Logging): If the plane encounters turbulence or an engine sputters, the flight recorder logs all the technical details. Pilots and engineers can later review these logs to understand what happened and why.
- Alarm System (Alerting): If a critical system on the plane fails (like an engine catching fire!), an alarm immediately goes off in the cockpit, alerting the crew to take immediate action.
Our application needs both:
- Detailed Logs: To record what it's doing, any informational messages, warnings, or errors. This helps us understand its behavior and diagnose problems after they occur.
- Immediate Alerts: To notify us instantly if something critical goes wrong, so we can investigate and fix it quickly.
Use Case: A File Processing Failure
Let's say an important data file, critical_report_final.csv, is uploaded. Our application tries to process it (as described in Chapter 2: Data Ingestion & Transformation Pipeline), but the file is unexpectedly corrupted and the processing fails.
Without good logging and alerting:
- We might not even know it failed until someone complains that their report is missing.
- When we do find out, figuring out why it failed could be like searching for a needle in a haystack.
With structured logging and alerting:
- The application logs an "ERROR" or "CRITICAL" message with details about the corrupted file and the error.
- An alert is immediately sent to a Microsoft Teams channel, notifying the support team.
- The alert includes a link directly to the detailed logs for this specific failure. This allows for quick identification and resolution of the problem.
Key Concepts: The Parts of Our "Flight Recorder & Alarm"
- Logging (The Flight Recorder):
  - What it is: The process of recording events, information, and errors as the application runs.
  - Why "Structured" Logging? Instead of writing plain text messages like "Error processing file", structured logs are formatted as JSON (JavaScript Object Notation). Each entry is a neatly organized record with clear fields (e.g., timestamp, severity, message, filename), which makes it easy for computers (and humans using tools) to search, filter, and analyze large volumes of logs.
  - Trace Information: A single operation (like processing one file) often involves multiple steps. Every log message for that operation can carry a unique trace_id. This trace_id acts like a thread, tying together all log entries for that one operation and making it easy to follow its entire journey through the system (a short sketch after this list shows how a trace_id can be generated and attached).
- Google Cloud Logging (The Log Archive & Viewer):
  - This is Google Cloud's centralized service where all our application's structured JSON logs are sent and stored.
  - It provides powerful tools to search, view, and analyze logs, and even to create metrics from them.
- Custom Log Handlers (The Smart Log Scribe):
  - In Python, we use the standard logging library. A "handler" is a component that decides what to do with a log message (e.g., print it to the console, write it to a file).
  - Our cloud-accelerator-function uses custom handlers: special pieces of code that take the log messages from our application, format them into the structured JSON that Google Cloud Logging expects, and add useful information such as severity levels (INFO, ERROR, CRITICAL) and trace_ids.
- Alerting (The Alarm System):
  - What it is: The process of automatically notifying stakeholders (like developers or support teams) when specific, important events are logged.
  - How it works: Our system can be configured so that if a log message has a high severity (e.g., CRITICAL) or matches certain predefined rules, an alert is triggered.
- Microsoft Teams Integration (The Notification Channel):
  - When an alert is triggered, a message is sent to a designated Microsoft Teams channel.
  - This alert message is designed to be helpful, providing:
    - A summary of the problem.
    - Key context (like the trace_id or filename).
    - A direct link to the relevant logs in Google Cloud Logging for quick investigation.
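To make the trace idea concrete, here is a minimal sketch (the function and variable names are illustrative, not the project's actual code) of generating one trace_id per operation and attaching it to every related log call via the logging module's extra parameter:

# Minimal sketch (illustrative names, not the project's actual code):
# one trace_id per operation, attached to every related log entry.
import logging
import uuid

logger = logging.getLogger(__name__)

def handle_uploaded_file(file_name):
    trace_id = f"trace-{uuid.uuid4()}"  # the "thread" tying this operation's logs together
    logger.info("Received file %s", file_name, extra={"trace_id": trace_id})
    logger.info("Validation passed for %s", file_name, extra={"trace_id": trace_id})
    # ...every later log call for this file reuses the same trace_id...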
Solving Our Use Case: The Corrupted File Scenario
Let's see how this works when critical_report_final.csv (which is corrupted) arrives:
1. Log "Processing Started": When the application starts processing critical_report_final.csv, it might log an INFO message:
   INFO: Started processing file 'critical_report_final.csv'. Trace ID: trace-abc-123
   This log, formatted as JSON by our custom handler, goes to Google Cloud Logging.
2. Log "Error Occurred": During processing, the application detects the corruption and logs a CRITICAL message:
   CRITICAL: Failed to process file 'critical_report_final.csv' due to: Corrupted data format. Trace ID: trace-abc-123
   This also goes to Google Cloud Logging as structured JSON.
3. Alert Triggered: Because this log message is CRITICAL:
   - The custom handler (or a component it calls) recognizes this as an alertable event.
   - It constructs an alert message.
   - It generates a link to Google Cloud Logging, pre-filtered to show all logs with trace_id: trace-abc-123 (see the sketch after this list for how such a link can be built).
4. Teams Notification: The alert message is sent to the configured Microsoft Teams channel:
   🚨 Critical Alert: File Processing Failure! 🚨
   File: critical_report_final.csv
   Error: Corrupted data format
   Trace ID: trace-abc-123
   View Logs: Click here to see details in GCP Logging
5. Quick Investigation: A team member sees the alert in Teams, clicks the link, and is taken directly to Google Cloud Logging, where they see both the "Started processing" and "Failed to process" log messages for trace-abc-123, giving them immediate context to understand and fix the issue.
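Here is a minimal sketch of how such a pre-filtered link could be constructed. The URL shape mirrors the conceptual handler shown later in this chapter, and the filter field assumed here (jsonPayload.trace_id) depends on how the custom handler actually writes the trace:

# Minimal sketch: build a Cloud Logging link pre-filtered to one operation's trace.
# Assumption: the handler writes the trace id into the JSON payload as trace_id.
from urllib.parse import quote

def build_log_link(trace_id: str) -> str:
    log_filter = f'jsonPayload.trace_id="{trace_id}"'
    return f"https://console.cloud.google.com/logs/viewer?advancedFilter={quote(log_filter, safe='')}"

# build_log_link("trace-abc-123")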
Under the Hood: The Logging Flow
How does a simple logger.info("Hello") call in Python turn into a structured JSON log in Google Cloud?
Simplified Logging Sequence:
Python's logging Module and Our Custom Handler:
Our application uses Python's built-in logging module. We configure this module to use our custom handler. This handler is responsible for the JSON formatting.
A conceptual look at setting up and using the logger (the actual setup happens in run/pubsub/cloud_logging/__init__.py and is more involved than this):
# Conceptual: How logging might be set up and used
import logging

# Assume OurCustomGoogleCloudHandler is defined elsewhere (like in cloud_logging)
# from run.pubsub.cloud_logging import OurCustomGoogleCloudHandler  # Fictional class name

# Get the root logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # Log INFO messages and above

# ---- This setup part would be in the application's initialization ----
# if not any(isinstance(h, OurCustomGoogleCloudHandler) for h in logger.handlers):
#     gcp_handler = OurCustomGoogleCloudHandler()
#     logger.addHandler(gcp_handler)
# ---- End of setup ----

# Example usage in your application code:
def process_data(file_name, trace_id):
    logger.info(
        f"Starting to process {file_name}",
        extra={"trace_id": trace_id, "component": "data_processor"}
    )
    try:
        # ... some processing ...
        if "bad" in file_name:  # Simulate an error
            raise ValueError("Simulated bad file")
        logger.info(
            f"Successfully processed {file_name}",
            extra={"trace_id": trace_id, "status": "SUCCESS"}
        )
    except Exception as e:
        logger.error(
            f"Failed to process {file_name}: {e}",
            extra={"trace_id": trace_id, "status": "FAILURE", "error_type": type(e).__name__}
        )

# process_data("good_file.txt", "trace-001")
# process_data("bad_file.csv", "trace-002")
- We get a logger instance.
- A custom handler (e.g., OurCustomGoogleCloudHandler) would be added to this logger during application startup. This handler knows how to speak to Google Cloud Logging.
- When logger.info(), logger.error(), etc., are called:
  - The extra={...} dictionary is crucial! This is how we pass custom fields like trace_id, filename, or any other contextual information.
  - The custom handler takes the log message, the log level (INFO, ERROR), and the extra dictionary. It then crafts a JSON payload.
A simplified JSON output for the error log above might look like this in Google Cloud Logging:
{
  "message": "Failed to process bad_file.csv: Simulated bad file",
  "severity": "ERROR",
  "timestamp": "2023-11-15T10:30:00Z",
  "logging.googleapis.com/trace": "projects/your-gcp-project/traces/trace-002",
  "component": "data_processor", /* This might not be standard, depends on handler */
  "status": "FAILURE",
  "error_type": "ValueError",
  "filename": "bad_file.csv" /* Custom handler extracts this */
}
Notice how trace_id from extra can be mapped to logging.googleapis.com/trace to enable log correlation in GCP. Other extra fields can become top-level fields in the JSON.
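For readers who want to see how extra fields can end up as top-level JSON fields, here is a minimal sketch of such a formatter. This is an illustration only; the project's actual handler in run/pubsub/cloud_logging is more involved:

# Minimal sketch (illustration only, not the project's actual handler):
# turn a LogRecord plus its `extra` fields into a GCP-style JSON payload.
import json
import logging
import time

# Attribute names every LogRecord has by default; anything else came from `extra`.
_STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {"message", "asctime"}

class JsonFormatter(logging.Formatter):
    converter = time.gmtime  # emit UTC timestamps

    def format(self, record):
        entry = {
            "message": record.getMessage(),
            "severity": record.levelname,
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
        }
        # Promote custom fields passed via extra={...} to top-level JSON fields.
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                entry[key] = value
        # A real handler would also map trace_id to logging.googleapis.com/trace.
        return json.dumps(entry, default=str)

# handler = logging.StreamHandler()
# handler.setFormatter(JsonFormatter())
# logging.getLogger().addHandler(handler)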
Under the Hood: The Alerting Flow
How does a CRITICAL log turn into a Teams message?
Simplified Alerting Sequence:
Triggering the Alert:
The custom log handler, after successfully sending the log to Google Cloud Logging, can inspect the log record's level.
# Conceptual: Part of the custom log handler's logic
# (This is a highly simplified idea of what might happen inside the handler)

# class OurCustomGoogleCloudHandler(logging.Handler):
#     def emit(self, record):  # This method is called by the logger
#         # 1. Format 'record' to JSON
#         json_log = self.format_to_gcp_json(record)
#         # 2. Send 'json_log' to Google Cloud Logging
#         # send_to_gcp(json_log)
#         # 3. Check if an alert is needed
#         if record.levelno >= logging.CRITICAL:  # alert threshold; could be lowered to ERROR
#             # from ..helpers.teams_helper import send_teams_alert  # Fictional import
#             message_for_teams = self.format_for_teams(record)
#             teams_webhook_url = os.environ.get("TEAMS_WEBHOOK_CRITICAL")  # From Config
#             if teams_webhook_url:
#                 # send_teams_alert(teams_webhook_url, message_for_teams)
#                 print(f"SIMULATE: Sent alert to Teams: {message_for_teams}")
#
#     def format_for_teams(self, record):
#         # Create a nice message with trace ID, error, link to GCP logs etc.
#         trace_id = getattr(record, 'trace_id', 'N/A')
#         log_link = f"https://console.cloud.google.com/logs/viewer?advancedFilter=trace%3D%22{trace_id}%22"
#         return f"🚨 CRITICAL: {record.getMessage()}\nTrace: {trace_id}\n[View Logs]({log_link})"
- Inside the emit method of the handler (which processes each log record), after sending the log to GCP, it checks record.levelno.
- If it is logging.CRITICAL (or any other level configured for alerting), it calls a helper function (conceptually send_teams_alert) to dispatch the notification; an alternative arrangement using a dedicated alert handler is sketched below.
- The TEAMS_WEBHOOK_CRITICAL URL would be fetched from environment variables, which are set up by our Chapter 6: Configuration Management system.
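The alternative arrangement mentioned above is sketched here, under the assumption that alert delivery lives in its own handler rather than inside the GCP handler's emit method: a second handler is registered with its level set to CRITICAL, so only alert-worthy records ever reach it.

# Minimal sketch (an assumed alternative design, not the project's actual code):
# a dedicated handler that only ever receives CRITICAL records.
import logging
import os

class TeamsAlertHandler(logging.Handler):
    def emit(self, record):
        webhook_url = os.environ.get("TEAMS_WEBHOOK_CRITICAL")  # from Chapter 6's configuration
        if not webhook_url:
            return
        trace_id = getattr(record, "trace_id", "N/A")
        # send_teams_alert(webhook_url, "🚨 Critical Alert", record.getMessage(), ...)  # hypothetical helper
        print(f"SIMULATE: Teams alert for trace {trace_id}: {record.getMessage()}")

alert_handler = TeamsAlertHandler()
alert_handler.setLevel(logging.CRITICAL)  # the logger routes only CRITICAL records here
logging.getLogger().addHandler(alert_handler)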
Sending the Teams Alert (Conceptual helpers/teams_helper.py):
This helper would be responsible for actually making the HTTP request to Microsoft Teams.
# Conceptual: run/pubsub/helpers/teams_helper.py
# import requests  # Library to make HTTP requests
# import os

def send_teams_alert(webhook_url: str, title: str, message: str, gcp_log_link: str):
    """Sends a formatted alert to a Microsoft Teams channel via a webhook."""
    payload = {
        "type": "message",
        "attachments": [
            {
                "contentType": "application/vnd.microsoft.card.adaptive",
                "content": {
                    "type": "AdaptiveCard",
                    "$schema": "http://adaptivecards.io/schemas/adaptive-card.json",
                    "version": "1.2",
                    "body": [
                        {"type": "TextBlock", "text": title, "weight": "bolder", "size": "medium"},
                        {"type": "TextBlock", "text": message, "wrap": True},
                        {"type": "FactSet", "facts": [{"title": "GCP Logs:", "value": f"[View Details]({gcp_log_link})"}]}
                    ]
                }
            }
        ]
    }
    try:
        # response = requests.post(webhook_url, json=payload, timeout=10)
        # response.raise_for_status()  # Raise an exception for HTTP errors
        # logger.info("Successfully sent alert to Teams.")
        print(f"SIMULATE: Posting to Teams webhook: {title} - {message}")
        return True
    except Exception as e:  # requests.exceptions.RequestException as e:
        # logger.error(f"Failed to send Teams alert: {e}")
        print(f"SIMULATE ERROR: Failed to send Teams alert: {e}")
        return False

# Conceptual usage:
# send_teams_alert(
#     os.environ.get("TEAMS_WEBHOOK_CRITICAL"),
#     "🚨 Critical Error: File Processing Failure!",
#     "File 'data.csv' could not be processed. Error: File not found.",
#     "https://example.com/logs?trace_id=123"
# )
This function constructs a JSON payload (often using Teams' "Adaptive Cards" format for rich messages) and POSTs it to the webhook URL provided by Microsoft Teams.
Conclusion
You've now seen how the cloud-accelerator-function implements its "flight recorder" and "alarm system" through Structured Logging and Alerting. Key takeaways:
- Structured JSON logs are sent to Google Cloud Logging, making them easy to search and analyze.
- Custom log handlers format these logs and enrich them with trace information.
- For critical events, alerts are automatically sent to Microsoft Teams, providing immediate notification and context.
- This combination provides excellent visibility into the application's behavior, enables rapid troubleshooting, and allows for proactive responses to problems.
With our application well-monitored, the final piece of the puzzle is how to get it all running in the cloud and manage its operations. In our next and final chapter, we'll explore Chapter 8: Deployment and Operational Scripts.