Chapter 6: Database Metadata Model

In our last chapter, we saw how the Cloud Storage Uploader ships our final, polished Parquet files to a central cloud warehouse. Our data extraction pipeline is complete! From a raw request to a clean, uploaded file, the system works.

But wait. How does the system remember things from one run to the next? For example:

  • How does the HTTP/OData-based Extraction Engine remember the special "delta link" to fetch only new data?
  • If we have dozens of different SAP connections, do we have to type them into config.yaml every time?
  • How can we define a reusable extraction job and just say "run the daily sales report"?

To handle all this, our system needs a brain. Welcome to the Database Metadata Model, the system's central memory and card catalog.

The Problem: A Library without a Card Catalog

Imagine a huge library with millions of books. The books are our data. The extraction engines are the librarians who bring books to you. But if there's no card catalog, how does anyone find anything? How would a librarian remember which books you've already checked out or where to find a specific rare collection? It would be chaos.

Our delta-extractor would face the same problem without a central place to store information. The Database Metadata Model is that card catalog. It doesn't store the books themselves (the large Parquet files). Instead, it stores the small, critical pieces of information about the extractions:

  • All available SAP connection details.
  • Definitions for reusable extraction pipelines.
  • The "delta link" for the last OData extraction.
  • The timestamp of the last successful run for a report.

This component keeps our entire system organized, consistent, and manageable.

Key Concepts: The Brain's Toolkit

To build this "brain," we use a couple of standard, powerful tools.

1. The Database: Our Digital Filing Cabinet

At its heart, the system uses a simple database to store its metadata. To get started, this is just a single file on your computer (sql_app.db) managed by SQLite, a lightweight but powerful database engine. It gives us a structured filing cabinet for all our important information, with no separate database server to install.
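
If you're curious, you can peek inside this file yourself. Here is a minimal sketch using Python's built-in sqlite3 module; the table names in the comment are just examples and depend on the models the project actually defines.

# A quick peek inside the metadata file (a sketch; the table names
# shown in the comment are illustrative).
import sqlite3

conn = sqlite3.connect("sql_app.db")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables)  # e.g. [('connections',), ('pipelines',)]
conn.close()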

2. SQLAlchemy: The Universal Translator

Our application is written in Python, but databases speak a language called SQL. How do they talk? We use a translator: SQLAlchemy.

SQLAlchemy is a fantastic Python library that lets us work with the database using familiar Python objects instead of writing raw SQL queries. We define what our data "looks like" in Python, and SQLAlchemy handles the rest—creating the tables, saving the data, and fetching it for us.

+------------------+      +--------------+      +--------------+
| Our Python App   | <--> |  SQLAlchemy  | <--> |   Database   |
| (delta-extractor)|      | (Translator) |      | (Speaks SQL) |
+------------------+      +--------------+      +--------------+
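
To give a feel for how little glue this requires, here is a minimal sketch of what such a configuration module (something like database/config.py, which later examples import from) might contain. The exact names and file layout are illustrative, not the project's definitive setup.

# A minimal sketch of the translator setup (details are illustrative).
from sqlalchemy import create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

# SQLite stores everything in a single local file: sql_app.db
engine = create_engine("sqlite:///./sql_app.db")

# Each Session is a short-lived conversation with the database.
SessionLocal = sessionmaker(bind=engine)

# Base is the common "blueprint" class all our models inherit from.
Base = declarative_base()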

3. Models: Blueprints for Information

How do we tell SQLAlchemy what our information looks like? We create a Model. A model is just a Python class that serves as a blueprint for a database table.

For example, we might create a Connection model to store SAP connection details.

# A simplified blueprint for a "Connection"
class Connection:
    name = "SAP_Production_System"
    host = "sap-prod.mycompany.com"
    user = "API_USER"
    # ... and so on

SQLAlchemy takes this blueprint and knows how to create a connections table in the database with columns for name, host, user, etc.
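
For illustration, here is a sketch of how that same blueprint could be written as a real SQLAlchemy model. The exact column set and table name are assumptions for this example.

# A sketch of the blueprint expressed as a real SQLAlchemy model
# (column set and names are illustrative).
from sqlalchemy import Column, Integer, String
from database.config import Base

class Connection(Base):
    __tablename__ = "connections"

    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True)  # e.g. "SAP_Production_System"
    host = Column(String)               # e.g. "sap-prod.mycompany.com"
    user = Column(String)               # e.g. "API_USER"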

How to Use It: The Brain at Work

For the most part, you won't interact with the database models directly. They work in the background, supporting other parts of the system. For example, when the HTTP/OData-based Extraction Engine finishes a full run, it will automatically ask the database model to save the "delta link" for that extraction.
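
To make that idea concrete, here is a purely hypothetical sketch of such a save step. The Extraction model and the save_delta_link helper are illustrative names for this example, not the project's actual API.

# Hypothetical sketch: a model holding extraction state and a helper
# that stores the delta link after a full run. Names and fields are
# illustrative only.
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import Session
from database.config import Base

class Extraction(Base):
    __tablename__ = "extractions"

    id = Column(Integer, primary_key=True)
    pipeline_name = Column(String, unique=True)
    delta_link = Column(String)  # the OData delta link for the next run

def save_delta_link(db: Session, pipeline_name: str, delta_link: str):
    # Find the existing state row for this pipeline, or create one.
    extraction = (
        db.query(Extraction)
        .filter(Extraction.pipeline_name == pipeline_name)
        .first()
    )
    if extraction is None:
        extraction = Extraction(pipeline_name=pipeline_name)
        db.add(extraction)
    extraction.delta_link = delta_link
    db.commit()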

The primary user of this metadata model is the API & Orchestration Layer, which we'll cover in the next chapter. The API will use the models to look up all the information needed to run a job.

Under the Hood: A Look Inside the Brain

Let's see how the different parts of the system would rely on the database model to perform a task. Imagine we want to run a predefined pipeline called "DailySalesReport".

In this flow, the API asks the Database Metadata Model for the pipeline's stored details, and the model fetches them from the database. The Database Metadata Model acts as the reliable middleman between the application logic (the API) and the raw storage (the database).

Let's look at the simplified code that powers this.

Step 1: The Blueprint (The Model)

First, we define a blueprint for our pipeline configurations using a Python class. This tells SQLAlchemy what a "Pipeline" looks like.

# Simplified model from files like `database/schemas.py`
from sqlalchemy import Column, Integer, String
from database.config import Base # The base blueprint from our config

class Pipeline(Base):
    __tablename__ = "pipelines"  # The name of the database table

    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True)  # e.g., "DailySalesReport"
    connection_name = Column(String)    # e.g., "SAP_Production_System"
    extraction_type = Column(String)    # e.g., "Z_SALES_DATA_EXTRACTOR"

  • __tablename__: This is the name of the table in our SQLite database.
  • Column(...): Each attribute defines a column in the table, along with its data type (String, Integer, etc.).
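
With the blueprint defined, SQLAlchemy can create the actual table for us. Here is a minimal sketch, assuming the engine lives alongside Base in our configuration module (that layout is an assumption of this example).

# A sketch of creating the tables from the blueprints, typically run
# once at application start-up.
from database.config import Base, engine

# SQLAlchemy reads every model that inherits from Base and issues the
# matching CREATE TABLE statements against sql_app.db.
Base.metadata.create_all(bind=engine)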

Step 2: Retrieving Information (The "Read" Operation)

Next, we need a function that can query the database to find a specific pipeline by its name. These are often called "CRUD" (Create, Read, Update, Delete) helpers.

# Simplified function from files like `database/crud.py`
from sqlalchemy.orm import Session

def get_pipeline_by_name(db: Session, name: str):
    # Use the SQLAlchemy session to query the Pipeline model
    # and find the first one that matches the name.
    return db.query(Pipeline).filter(Pipeline.name == name).first()

This simple function takes a database Session and a name, and it uses SQLAlchemy's expressive syntax to find the matching pipeline. It's readable, clean, and we didn't have to write a single line of SQL!

The API & Orchestration Layer can now use this function to get all the details it needs, just by knowing the name of the job it wants to run.
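
As a rough usage sketch (assuming the SessionLocal factory from the configuration sketch earlier in this chapter), a caller might do something like this:

# Hypothetical usage of the CRUD helper; SessionLocal is an assumption
# from the earlier configuration sketch.
from database.config import SessionLocal

db = SessionLocal()
try:
    pipeline = get_pipeline_by_name(db, "DailySalesReport")
    if pipeline:
        print(pipeline.connection_name, pipeline.extraction_type)
finally:
    db.close()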

Conclusion

You've now explored the brain of the delta-extractor. The Database Metadata Model is our system's card catalog, using a simple database and the power of SQLAlchemy to store and manage all the critical configuration and state information. It keeps the system organized and enables us to build more advanced, reusable workflows. It doesn't hold the big data, but it holds the small metadata that makes everything else possible.

We now have all the individual components of a professional data pipeline:

  1. Powerful extraction engines.
  2. An automated transformation pipeline.
  3. A reliable cloud uploader.
  4. A smart "brain" to manage it all.

But how do we tie all this together? How do we move beyond running single jobs from the command line and start building a real, automated data service? For that, we need a conductor to lead the orchestra.

Let's move on to the final chapter and meet the conductor: the API & Orchestration Layer.


Generated by AI Codebase Knowledge Builder