Chapter 6: Configuration Management

Welcome to Chapter 6! In Chapter 5: External Services Integration (Dataform & Power BI), we saw how our cloud-accelerator-function can work with specialized tools like Dataform and Power BI, acting as an orchestrator. We mentioned that details like API keys, project IDs, and workspace IDs would need to be configured. But how exactly does our application know these crucial pieces of information? This chapter is all about the "control panel" of our system: Configuration Management.

The Control Panel: Why Do We Need Configuration Management?

Imagine you're assembling a complex piece of flat-pack furniture. The instruction manual might say, "Use Screw A for this part, Bolt B for that part." But how do you know which screw is "Screw A"? You'd look at the parts list that came with the furniture, which identifies each piece.

Our cloud-accelerator-function is similar. It needs to know many "operational parameters" to function correctly. For example:

  • What is your Google Cloud Project ID?
  • What's the name of the Google Cloud Storage bucket where raw data files will land?
  • How much CPU and memory should be allocated if this runs as a Cloud Run service?
  • What's the connection address for the PostgreSQL database we discussed in Chapter 3: Metadata Persistence (Database)?

These settings can change depending on where and how you deploy the application (e.g., a development environment might use different bucket names than a production environment). We can't just hardcode these values directly into our Python scripts or shell scripts! That would be like welding the screws to the furniture parts – very inflexible!

Configuration Management is the system component that defines how our application gets these vital operational parameters. It acts like a central control panel, ensuring all parts of the project use consistent and correct settings.

Key Concepts: The Parts of Our Control Panel

Our cloud-accelerator-function uses a clever system to manage its settings:

  1. The Master Settings File (main_config.yaml - Not Shown):

    • At the very heart of the configuration is a file named main_config.yaml. This file is not directly part of this tutorial's visible code snippets, but it's where a user would define all their specific settings for their deployment.
    • Think of it as the master blueprint or the primary settings page where you type in all your project-specific details.
    • For example, main_config.yaml might contain entries like:
      # Hypothetical content of main_config.yaml
      GCP_PROJECT:
        PROJECT_ID: "my-test-project-001"
        PROJECT_REGION: "us-central1"
      BUCKET:
        NAME: "my-accelerator-data-bucket"
        RAW: "incoming_files/raw"
        ARCHIVE: "processed_files/archive"
      META_DATA:
        DATABASE_URL: "postgresql://admin:secretpassword@10.0.0.5:5432/accelerator_db"
      # ... and many more settings
  2. The Settings Distributor (generate_config.py):

    • We have a Python script located at run/generate_config.py.
    • This script's job is to read the main_config.yaml file.
    • Then, it acts like a distributor, taking those master settings and creating specific configuration files tailored for different parts of our application.
  3. Tailored Configuration Files:

    • For Shell Scripts (scripts/config.sh): The generate_config.py script creates a file named config.sh inside the run/scripts/ directory. This file contains settings formatted as shell environment variables. Any shell scripts we use for deployment or operations (see Chapter 8: Deployment and Operational Scripts) can then easily use these settings.
    • For the Python Application (pubsub/.env): The script also creates a file named .env inside the run/pubsub/ directory. This is a standard way to provide environment variables to Python applications. Our main Python application (the Flask app from Chapter 1: Pub/Sub Event Handling & Routing) reads its settings from here (a minimal loading sketch follows this list).
  4. Consistency is Key:

    • This approach ensures that if you set your PROJECT_ID once in main_config.yaml, both your shell scripts and your Python application will use that exact same PROJECT_ID. This prevents a lot of headaches caused by inconsistent settings across different parts of a project.
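
To make the .env mechanism mentioned above concrete, here is a minimal sketch of how a Python application can load such a file at startup. It assumes the python-dotenv package is installed and reuses the variable names from the examples above; the real application may wire this up differently.

    # Minimal sketch: loading run/pubsub/.env at startup.
    # Assumes the python-dotenv package (pip install python-dotenv).
    import os

    from dotenv import load_dotenv

    # Read KEY=VALUE pairs from the .env file into os.environ
    # (existing environment variables are not overwritten by default).
    load_dotenv("run/pubsub/.env")

    database_url = os.environ.get("DATABASE_URL")
    gcp_project = os.environ.get("GCP_PROJECT")
    print(f"Connecting to {database_url} in project {gcp_project}")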

How It Solves the Use Case: Setting Up Our Application

Let's say you're setting up the cloud-accelerator-function for your new project, "Alpha."

  1. Edit the Master Settings: You would open the main_config.yaml file (again, this file isn't shown directly in our tutorial snippets, but imagine it exists) and fill in your details:

    • GCP_PROJECT.PROJECT_ID: "alpha-gcp-project"
    • BUCKET.NAME: "alpha-data-files"
    • META_DATA.DATABASE_URL: "postgresql://alpha_user:alpha_pass@db.example.com/alpha_db"
    • ...and so on.
  2. Run the Settings Distributor: You would then run the generate_config.py script from your terminal:

    python run/generate_config.py
  3. Generated Files: After the script runs, two new files (or updated files) will appear:

    • run/scripts/config.sh might look like this (simplified):
      #!/bin/bash

      PROJECT_ID="alpha-gcp-project"
      # ... other shell variables ...
      RAW_BUCKET="alpha-data-files/incoming_files/raw"
      # ...
    • run/pubsub/.env might look like this (simplified):
      DATABASE_URL=postgresql://alpha_user:alpha_pass@db.example.com/alpha_db
      GCP_PROJECT=alpha-gcp-project
      CA_RAW=alpha-data-files/incoming_files/raw
      # ... other environment variables for Python ...
  4. Using the Configurations:

    • Shell Scripts: When a deployment script from run/scripts/ needs the Project ID, it can "source" the config.sh file:
      # Inside a hypothetical deployment script like deploy_cloud_run.sh
      source ./config.sh # Load the variables
      echo "Deploying to project: $PROJECT_ID"
      # gcloud run deploy --project $PROJECT_ID ...
    • Python Application: Our Python code, for instance when connecting to the database, reads DATABASE_URL from its environment variables, which are populated from the .env file when the application starts.
      # From: run/pubsub/database/config.py
      import os
      # ...
      class Config:
          SQLALCHEMY_DATABASE_URI = os.environ.get('DATABASE_URL')  # Reads from .env
      # ...
      Similarly, other parts of the Python code can access their needed configurations:
      # From: run/pubsub/handlers/bq_jobs.py (conceptual)
      import os
      # ...
      def get_some_data():
          project_id = os.environ.get('GCP_PROJECT')   # Reads from .env
          region = os.environ.get('PROJECT_REGION')    # Reads from .env
          # Use project_id and region...
          print(f"Operating in project {project_id} and region {region}")

Now, every part of your "Alpha" project setup uses the same, centrally defined settings!
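
If you want to double-check that the two generated files really agree, a small throwaway script like the hypothetical one below can compare them. The file paths and variable names follow the examples above; this checker is not part of the project itself.

    # Hypothetical consistency check: confirm config.sh and .env
    # carry the same project ID. Not part of the repository.
    import re

    def read_shell_var(path, name):
        """Extract NAME="value" (or NAME=value) from a shell config file."""
        with open(path, encoding="utf-8") as fp:
            for line in fp:
                match = re.match(rf'{name}="?([^"\n]*)"?$', line.strip())
                if match:
                    return match.group(1)
        return None

    def read_env_var(path, name):
        """Extract NAME=value from a .env file."""
        with open(path, encoding="utf-8") as fp:
            for line in fp:
                if line.startswith(f"{name}="):
                    return line.rstrip("\n").split("=", 1)[1]
        return None

    shell_project = read_shell_var("run/scripts/config.sh", "PROJECT_ID")
    env_project = read_env_var("run/pubsub/.env", "GCP_PROJECT")
    assert shell_project == env_project, "config.sh and .env disagree!"
    print(f"Both files agree on the project: {shell_project}")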

Under the Hood: How generate_config.py Works

Let's peek at what happens when you run python run/generate_config.py.

A Step-by-Step View:

At a high level: running the script locates main_config.yaml next to generate_config.py, parses it into a Python dictionary, renders that dictionary into two formats (shell variable assignments and KEY=VALUE pairs), and writes the results to run/scripts/config.sh and run/pubsub/.env.

Diving into run/generate_config.py:

The script has a few key parts:

  1. Loading the Master YAML: It first opens and reads the main_config.yaml file using the yaml library.

    # File: run/generate_config.py (simplified)
    import os
    import yaml  # For reading YAML files

    if __name__ == "__main__":
        file_dir = os.path.dirname(os.path.realpath(__file__))
        main_config_path = os.path.join(file_dir, "main_config.yaml")
        # ... paths for output files config.sh and .env

        with open(main_config_path, encoding="utf-8") as stream:
            try:
                config = yaml.safe_load(stream)  # Loads all data from YAML
                # Now 'config' is a Python dictionary holding all settings
                # ...

    The config variable now holds all the settings from main_config.yaml as a Python dictionary.

  2. Generating scripts/config.sh: A function like get_shell_config takes the config dictionary and formats specific values into a string suitable for a shell script.

    # File: run/generate_config.py (simplified)

    def get_shell_config(data):  # 'data' is the 'config' dictionary
        RAW_BUCKET = data["BUCKET"]["NAME"]  # Accessing nested values
        return f"""#!/bin/bash

    PROJECT_ID="{data["GCP_PROJECT"]["PROJECT_ID"]}"
    PROJECT_REGION="{data["GCP_PROJECT"]["PROJECT_REGION"]}"
    # ... more variables ...
    RAW_BUCKET="{RAW_BUCKET}"
    NUM_OF_CPU={data["CLOUD_RUN"]["NUM_OF_CPU"]}
    MEMORY_SIZE={data["CLOUD_RUN"]["MEMORY_SIZE"]}
    """  # This string will be written to config.sh

    This function carefully constructs a multi-line string that will become the content of config.sh.

  3. Generating pubsub/.env: Similarly, a function like get_env_config formats values for the .env file.

    # File: run/generate_config.py (simplified)

    def get_env_config(data):  # 'data' is the 'config' dictionary
        RAW_BUCKET = f'{data["BUCKET"]["NAME"]}/{data["BUCKET"]["RAW"]}'.rstrip("/")
        # ... (calculating other bucket paths) ...

        mystr = f"""DATABASE_URL={data["META_DATA"]["DATABASE_URL"]}
    GCP_PROJECT={data["GCP_PROJECT"]["PROJECT_ID"]}
    CA_RAW={RAW_BUCKET}
    # ... more environment variables ...
    """
        # ... (loop for 'OTHER' custom variables if any) ...
        return mystr.strip("\n")  # This string will be written to .env

    This creates a string where each line is KEY=VALUE, the standard format for .env files.

  4. Writing the Files: Finally, the main part of the script calls these functions and writes their output to the respective files.

    # File: run/generate_config.py (simplified, continuing from __main__)
    # ... (after loading 'config' from YAML, still inside the 'try' block) ...
        shell_config_content = get_shell_config(config)
        with open(config_path, "w", encoding="utf-8") as fp:  # config_path is 'run/scripts/config.sh'
            fp.write(shell_config_content)

        env_config_content = get_env_config(config)
        with open(env_path, "w", encoding="utf-8") as fp:  # env_path is 'run/pubsub/.env'
            fp.write(env_config_content)

        print("Configuration files generated successfully!")
    except yaml.YAMLError as exc:
        print(f"Error reading main_config.yaml: {exc}")

And that's it! One master file (main_config.yaml, which you manage) and one script (generate_config.py) keep all your application's settings consistent and in the right places.
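
Putting the pieces together, here is a condensed, runnable approximation of the whole flow, assuming it lives at run/generate_config.py next to main_config.yaml. It only covers the keys shown in this chapter, so treat it as an illustration rather than a drop-in replacement for the real script.

    # Condensed sketch of the generate_config.py flow (chapter keys only).
    import os
    import yaml  # PyYAML

    def get_shell_config(data):
        """Render a few settings as shell variable assignments."""
        return (
            '#!/bin/bash\n'
            f'PROJECT_ID="{data["GCP_PROJECT"]["PROJECT_ID"]}"\n'
            f'PROJECT_REGION="{data["GCP_PROJECT"]["PROJECT_REGION"]}"\n'
            f'RAW_BUCKET="{data["BUCKET"]["NAME"]}"\n'
        )

    def get_env_config(data):
        """Render a few settings as KEY=VALUE lines for a .env file."""
        raw_bucket = f'{data["BUCKET"]["NAME"]}/{data["BUCKET"]["RAW"]}'.rstrip("/")
        return (
            f'DATABASE_URL={data["META_DATA"]["DATABASE_URL"]}\n'
            f'GCP_PROJECT={data["GCP_PROJECT"]["PROJECT_ID"]}\n'
            f'CA_RAW={raw_bucket}\n'
        )

    if __name__ == "__main__":
        file_dir = os.path.dirname(os.path.realpath(__file__))
        with open(os.path.join(file_dir, "main_config.yaml"), encoding="utf-8") as stream:
            config = yaml.safe_load(stream)  # the master settings dictionary

        with open(os.path.join(file_dir, "scripts", "config.sh"), "w", encoding="utf-8") as fp:
            fp.write(get_shell_config(config))
        with open(os.path.join(file_dir, "pubsub", ".env"), "w", encoding="utf-8") as fp:
            fp.write(get_env_config(config))
        print("Configuration files generated successfully!")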

Conclusion

You've now learned about the cloud-accelerator-function's "control panel" – its Configuration Management system. This system is vital for:

  • Providing operational parameters (like Project IDs, bucket names, database URLs) to the application.
  • Ensuring these settings are consistent across different parts of the project (shell scripts and Python code).
  • Making it easy to change settings by editing a single master YAML file (main_config.yaml - which you would create and maintain) and then running generate_config.py to distribute those settings.

This centralized approach makes the application more robust, easier to manage, and simpler to deploy in different environments.

With our application configured and running, it's important to know what it's doing, especially when things go right or, more importantly, when they go wrong. In the next chapter, we'll explore Chapter 7: Structured Logging and Alerting to see how our system keeps a detailed diary of its actions and how it can notify us of important events.