Chapter 6: Configuration Management
Welcome to Chapter 6! In Chapter 5: External Services Integration (Dataform & Power BI), we saw how our cloud-accelerator-function can work with specialized tools like Dataform and Power BI, acting as an orchestrator. We mentioned that details like API keys, project IDs, and workspace IDs would need to be configured. But how exactly does our application know these crucial pieces of information? This chapter is all about the "control panel" of our system: Configuration Management.
The Control Panel: Why Do We Need Configuration Management?
Imagine you're assembling a complex piece of flat-pack furniture. The instruction manual might say, "Use Screw A for this part, Bolt B for that part." But how do you know which screw is "Screw A"? You'd look at the parts list that came with the furniture, which identifies each piece.
Our cloud-accelerator-function is similar. It needs to know many "operational parameters" to function correctly. For example:
- What is your Google Cloud Project ID?
- What's the name of the Google Cloud Storage bucket where raw data files will land?
- How much CPU and memory should be allocated if this runs as a Cloud Run service?
- What's the connection address for the PostgreSQL database we discussed in Chapter 3: Metadata Persistence (Database)?
These settings can change depending on where and how you deploy the application (e.g., a development environment might use different bucket names than a production environment). We can't just hardcode these values directly into our Python scripts or shell scripts! That would be like welding the screws to the furniture parts – very inflexible!
Configuration Management is the system component that defines how our application gets these vital operational parameters. It acts like a central control panel, ensuring all parts of the project use consistent and correct settings.
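To make the contrast concrete, here is a tiny, hypothetical Python snippet (not from the project) showing the difference between a hardcoded value and one read from the environment. `CA_RAW` is one of the environment variables we'll see generated later in this chapter; the fallback value is made up:

```python
import os

# Inflexible: the bucket name is welded into the code and must be
# edited (and redeployed) for every environment.
bucket = "my-accelerator-data-bucket"

# Flexible: the same code works in dev and prod, because the value
# comes from whatever environment the app was configured with.
bucket = os.environ.get("CA_RAW", "fallback-bucket-for-local-dev")
```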
Key Concepts: The Parts of Our Control Panel
Our cloud-accelerator-function uses a clever system to manage its settings:
- The Master Settings File (`main_config.yaml` - Not Shown):
  - At the very heart of the configuration is a file named `main_config.yaml`. This file is not directly part of this tutorial's visible code snippets, but it's where a user would define all their specific settings for their deployment.
  - Think of it as the master blueprint or the primary settings page where you type in all your project-specific details.
  - For example, `main_config.yaml` might contain entries like:

    ```yaml
    # Hypothetical content of main_config.yaml
    GCP_PROJECT:
      PROJECT_ID: "my-test-project-001"
      PROJECT_REGION: "us-central1"
    BUCKET:
      NAME: "my-accelerator-data-bucket"
      RAW: "incoming_files/raw"
      ARCHIVE: "processed_files/archive"
    META_DATA:
      DATABASE_URL: "postgresql://admin:secretpassword@10.0.0.5:5432/accelerator_db"
    # ... and many more settings
    ```
- The Settings Distributor (`generate_config.py`):
  - We have a Python script located at `run/generate_config.py`.
  - This script's job is to read the `main_config.yaml` file.
  - It then acts like a distributor, taking those master settings and creating specific configuration files tailored for different parts of our application.
- Tailored Configuration Files:
  - For Shell Scripts (`scripts/config.sh`): The `generate_config.py` script creates a file named `config.sh` inside the `run/scripts/` directory. This file contains settings formatted as shell environment variables. Any shell scripts we use for deployment or operations (see Chapter 8: Deployment and Operational Scripts) can then easily use these settings.
  - For the Python Application (`pubsub/.env`): The script also creates a file named `.env` inside the `run/pubsub/` directory. This is a standard way to provide environment variables to Python applications. Our main Python application (the Flask app from Chapter 1: Pub/Sub Event Handling & Routing) reads its settings from here.
- Consistency is Key:
  - This approach ensures that if you set your `PROJECT_ID` once in `main_config.yaml`, both your shell scripts and your Python application will use that exact same `PROJECT_ID`. This prevents a lot of headaches caused by inconsistent settings across different parts of a project. A quick way to convince yourself of this is sketched just below this list.
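Because both files are generated from the same dictionary, you can verify this consistency directly. Below is a small, hypothetical sanity-check script (not part of the project; it assumes both generated files exist at their default paths) that compares the `PROJECT_ID` written to `config.sh` with the `GCP_PROJECT` written to `.env`:

```python
# Hypothetical consistency check, run from the repository root after
# generate_config.py has produced both files.
import re

def read_shell_var(path: str, name: str) -> str:
    """Extract NAME="value" from a generated shell config file."""
    with open(path, encoding="utf-8") as fp:
        match = re.search(rf'^{name}="?([^"\n]*)"?$', fp.read(), re.MULTILINE)
    return match.group(1) if match else ""

def read_env_var(path: str, name: str) -> str:
    """Extract NAME=value from a generated .env file."""
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            key, _, value = line.strip().partition("=")
            if key == name:
                return value
    return ""

assert read_shell_var("run/scripts/config.sh", "PROJECT_ID") == \
    read_env_var("run/pubsub/.env", "GCP_PROJECT"), "configs drifted!"
```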
How It Solves the Use Case: Setting Up Our Application
Let's say you're setting up the cloud-accelerator-function for your new project, "Alpha."
- Edit the Master Settings: You would open the `main_config.yaml` file (again, this file isn't shown directly in our tutorial snippets, but imagine it exists) and fill in your details:
  - `PROJECT_ID: "alpha-gcp-project"`
  - `BUCKET_NAME: "alpha-data-files"`
  - `DATABASE_URL: "postgresql://alpha_user:alpha_pass@db.example.com/alpha_db"`
  - ...and so on.
- Run the Settings Distributor: You would then run the `generate_config.py` script from your terminal:

  ```bash
  python run/generate_config.py
  ```

- Generated Files: After the script runs, two new (or updated) files will appear:
  - `run/scripts/config.sh` might look like this (simplified):

    ```bash
    #!/bin/bash
    PROJECT_ID="alpha-gcp-project"
    # ... other shell variables ...
    RAW_BUCKET="alpha-data-files/incoming_files/raw"
    # ...
    ```

  - `run/pubsub/.env` might look like this (simplified):

    ```
    DATABASE_URL=postgresql://alpha_user:alpha_pass@db.example.com/alpha_db
    GCP_PROJECT=alpha-gcp-project
    CA_RAW=alpha-data-files/incoming_files/raw
    # ... other environment variables for Python ...
    ```
- Using the Configurations:
  - Shell Scripts: When a deployment script from `run/scripts/` needs the Project ID, it can "source" the `config.sh` file:

    ```bash
    # Inside a hypothetical deployment script like deploy_cloud_run.sh
    source ./config.sh  # Load the variables
    echo "Deploying to project: $PROJECT_ID"
    # gcloud run deploy --project $PROJECT_ID ...
    ```

  - Python Application: Our Python code, for instance when connecting to the database, reads the `DATABASE_URL` from the environment variables, which are loaded from the `.env` file by the application's environment (how that loading might work is sketched after this list):

    ```python
    # From: run/pubsub/database/config.py
    import os
    # ...
    class Config:
        SQLALCHEMY_DATABASE_URI = os.environ.get('DATABASE_URL')  # Reads from .env
        # ...
    ```

    Similarly, other parts of the Python code can access the configuration they need:

    ```python
    # From: run/pubsub/handlers/bq_jobs.py (conceptual)
    import os
    # ...
    def get_some_data():
        project_id = os.environ.get('GCP_PROJECT')  # Reads from .env
        region = os.environ.get('PROJECT_REGION')   # Reads from .env
        # Use project_id and region...
        print(f"Operating in project {project_id} and region {region}")
    ```
Now, every part of your "Alpha" project setup uses the same, centrally defined settings!
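The snippets above read settings from `os.environ`, but something has to put the contents of `run/pubsub/.env` there first. The tutorial code doesn't show that step; one common approach, assumed here rather than confirmed by the project, is the `python-dotenv` package, called once at application startup:

```python
# Hypothetical startup code, assuming the python-dotenv package is
# installed (pip install python-dotenv); the project may load .env
# through other means (e.g. its runtime or a process manager).
import os
from dotenv import load_dotenv

# Read KEY=VALUE pairs from the generated .env into os.environ,
# without overwriting variables that are already set.
load_dotenv(os.path.join(os.path.dirname(os.path.abspath(__file__)), ".env"))

print(os.environ.get("GCP_PROJECT"))  # e.g. "alpha-gcp-project"
```

Provided this runs before `config.py` is imported, `Config.SQLALCHEMY_DATABASE_URI` picks up exactly the `DATABASE_URL` that `generate_config.py` wrote.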
Under the Hood: How generate_config.py Works
Let's peek at what happens when you run `python run/generate_config.py`.

A Step-by-Step View:
- You run `python run/generate_config.py` from your terminal.
- The script opens `main_config.yaml` and parses it into a Python dictionary.
- It formats those settings once for shell scripts and once for the Python application.
- It writes the results to `run/scripts/config.sh` and `run/pubsub/.env`.

Diving into `run/generate_config.py`:
The script has a few key parts:
- Loading the Master YAML: It first opens and reads the `main_config.yaml` file using the `yaml` library:

  ```python
  # File: run/generate_config.py (simplified)
  import os
  import yaml  # For reading YAML files

  if __name__ == "__main__":
      file_dir = os.path.dirname(os.path.realpath(__file__))
      main_config_path = os.path.join(file_dir, "main_config.yaml")
      # ... paths for output files config.sh and .env

      with open(main_config_path, encoding="utf-8") as stream:
          try:
              config = yaml.safe_load(stream)  # Loads all data from YAML
              # Now 'config' is a Python dictionary holding all settings
              # ...
  ```

  The `config` variable now holds all the settings from `main_config.yaml` as a Python dictionary (a hedged validation sketch for this dictionary follows these steps).
- Generating `scripts/config.sh`: A function like `get_shell_config` takes the `config` dictionary and formats specific values into a string suitable for a shell script:

  ```python
  # File: run/generate_config.py (simplified)
  def get_shell_config(data):  # 'data' is the 'config' dictionary
      RAW_BUCKET = data["BUCKET"]["NAME"]  # Accessing nested values
      return f"""#!/bin/bash
  PROJECT_ID="{data["GCP_PROJECT"]["PROJECT_ID"]}"
  PROJECT_REGION="{data["GCP_PROJECT"]["PROJECT_REGION"]}"
  # ... more variables ...
  RAW_BUCKET="{RAW_BUCKET}"
  NUM_OF_CPU={data["CLOUD_RUN"]["NUM_OF_CPU"]}
  MEMORY_SIZE={data["CLOUD_RUN"]["MEMORY_SIZE"]}
  """  # This string will be written to config.sh
  ```

  This function carefully constructs a multi-line string that becomes the content of `config.sh`.
- Generating `pubsub/.env`: Similarly, a function like `get_env_config` formats values for the `.env` file:

  ```python
  # File: run/generate_config.py (simplified)
  def get_env_config(data):  # 'data' is the 'config' dictionary
      RAW_BUCKET = f'{data["BUCKET"]["NAME"]}/{data["BUCKET"]["RAW"]}'.rstrip("/")
      # ... (calculating other bucket paths) ...
      mystr = f"""DATABASE_URL={data["META_DATA"]["DATABASE_URL"]}
  GCP_PROJECT={data["GCP_PROJECT"]["PROJECT_ID"]}
  CA_RAW={RAW_BUCKET}
  # ... more environment variables ...
  """
      # ... (loop for 'OTHER' custom variables, if any) ...
      return mystr.strip("\n")  # This string will be written to .env
  ```

  This creates a string where each line is `KEY=VALUE`, the standard format for `.env` files.
- Writing the Files: Finally, the main part of the script calls these functions and writes their output to the respective files:

  ```python
  # File: run/generate_config.py (simplified, continuing from __main__)
              # ... (after loading 'config' from YAML) ...
              shell_config_content = get_shell_config(config)
              with open(config_path, "w", encoding="utf-8") as fp:  # config_path is 'run/scripts/config.sh'
                  fp.write(shell_config_content)

              env_config_content = get_env_config(config)
              with open(env_path, "w", encoding="utf-8") as fp:  # env_path is 'run/pubsub/.env'
                  fp.write(env_config_content)
              print("Configuration files generated successfully!")
          except yaml.YAMLError as exc:
              print(f"Error reading main_config.yaml: {exc}")
  ```
And that's it! One master file (`main_config.yaml`, which you manage) and one script (`generate_config.py`) keep all your application's settings consistent and in the right places.
Conclusion
You've now learned about the cloud-accelerator-function's "control panel" – its Configuration Management system. This system is vital for:
- Providing operational parameters (like Project IDs, bucket names, database URLs) to the application.
- Ensuring these settings are consistent across different parts of the project (shell scripts and Python code).
- Making it easy to change settings by editing a single master YAML file (`main_config.yaml`, which you would create and maintain) and then running `generate_config.py` to distribute those settings.
This centralized approach makes the application more robust, easier to manage, and simpler to deploy in different environments.
With our application configured and running, it's important to know what it's doing, especially when things go right or, more importantly, when they go wrong. In the next chapter, we'll explore Chapter 7: Structured Logging and Alerting to see how our system keeps a detailed diary of its actions and how it can notify us of important events.