
Chapter 2: File System Monitor

In our last chapter, Application GUI and Configuration, we learned how to use the toolkit's "cockpit" to give it instructions. We filled out the forms, clicked "Save," and our settings were stored in the config.json file.

But instructions are useless if no one is there to follow them. Now, we'll meet the first automated worker in our toolkit: the File System Monitor.

What's the Big Idea?

Imagine you have a special inbox on your desk. Your only job is to wait for a specific type of letter to arrive—say, one in a blue envelope. As soon as a blue envelope appears, you grab it, open it, and hand it off to the next person in line to handle the contents. You ignore all other mail.

The File System Monitor does exactly this, but for files on your computer. The problem it solves is: "How does the application know when a new data file is ready for processing without me having to tell it every single time?"

This component is the application's automated lookout. It constantly watches the "source" folder you specified in the configuration. When a new file appears that matches a specific pattern (like a .csv file or a special .txt "trigger" file), it automatically kicks off the entire data processing workflow.

The Watcher in the Folder

Think of the File System Monitor as a high-tech security guard watching a single, important doorway—your source folder.

  1. The Guard's Post: The monitor's "post" is the folder_to_watch you set in Chapter 1. It never looks anywhere else.
  2. The Person of Interest: The guard isn't looking for just anyone. They are looking for files with specific names or extensions, like .csv or a special .txt file that acts as a signal.
  3. The Alert: When the guard spots a file they've been waiting for, they don't handle it themselves. Instead, they pick up their radio and alert the main system, telling it, "A new file has arrived at this location. It's time to start the process."

This "guard" is powered by a Python library called watchdog. Where the operating system supports it, watchdog relies on the system's own change notifications instead of repeatedly scanning the folder, so it uses very few computer resources while it waits.

How It Works: A Step-by-Step Example

Let's say you've configured the toolkit to watch the folder C:/Data/Input.

  1. You click "Start" in the GUI. The File System Monitor reads the config.json file and learns it needs to watch C:/Data/Input.
  2. The monitor begins its watch. It's now silently running in the background.
  3. You (or another automated process) save a new file named raw_salesdata_20241005.txt into the C:/Data/Input folder.
  4. The watchdog library detects the new file and reports a file-creation event, usually within moments.
  5. The File System Monitor checks the filename. It sees the .txt extension and recognizes it as a "trigger file." A trigger file is a special, often empty, file whose only purpose is to signal that a set of data is ready.
  6. The monitor knows that a .txt trigger file means other related data files should also be present. It immediately starts the next part of the process, passing the filename raw_salesdata_20241005.txt to the CSV-to-Parquet Conversion Engine.
  7. After the file is successfully processed and uploaded, the monitor can also be configured to clean up by deleting the original files to keep the source folder tidy.
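
Step 1 above can be sketched as follows. This is a minimal illustration, not the toolkit's actual loader: the function name load_watch_folder is hypothetical, and it assumes config.json is a flat JSON object with a folder_to_watch key, which may not match the real file layout.

```python
import json
from pathlib import Path


def load_watch_folder(config_path: str) -> Path:
    """Return the folder to watch, as read from a JSON config file.

    Assumption: the config is a flat JSON object with a
    "folder_to_watch" key. The real config.json may differ.
    """
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)

    folder = Path(config["folder_to_watch"])
    # Fail early with a clear message instead of watching a missing folder.
    if not folder.is_dir():
        raise NotADirectoryError(f"Watch folder does not exist: {folder}")
    return folder
```

Checking that the folder exists before the watch starts means a typo in the configuration shows up right away, rather than as a monitor that silently never fires.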

Under the Hood: A Look at the Internals

Let's peek behind the curtain to see how this magic happens.

First, here’s a diagram showing the flow of events.

This diagram shows that the Monitor acts as the "first responder." It gets a notification from the computer's Operating System, decides if the new file is important, and then triggers the next stage of the pipeline.

Conceptual Code Examples

The real power comes from the watchdog library. Setting it up involves two main parts: an Observer that does the watching, and a Handler that decides what to do when something happens.

1. Setting Up the Watcher (The Observer)

This is like hiring the security guard and telling them which door to watch.

# Note: This is a simplified example
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

# ... handler class defined below ...

def start_watching(folder_path):
    """Starts the file system monitor and returns the running observer."""
    event_handler = MyEventHandler()
    observer = Observer()
    # recursive=False: watch only this folder, not its subfolders
    observer.schedule(event_handler, folder_path, recursive=False)
    observer.start()  # Starts watching in a background thread
    return observer  # Keep this so the caller can stop() and join() it later

This code creates an Observer object and tells it to watch our folder_path. Anything that happens in that folder will now be reported to our MyEventHandler.
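
Because the observer runs in a background thread, something has to keep the program alive and shut the observer down cleanly at the end. The sketch below shows one way this might look, assuming the caller holds a reference to an observer that has already been started. The helper name stop_when_requested and the use of a threading.Event as the stop signal are illustrative choices, not part of the toolkit. The helper only relies on the stop() and join() methods, which watchdog's Observer provides.

```python
import threading


def stop_when_requested(observer, stop_event: threading.Event,
                        poll_seconds: float = 1.0) -> None:
    """Block until `stop_event` is set, then shut the observer down.

    `observer` is any already-started object with stop() and join()
    methods, such as a watchdog Observer.
    """
    try:
        # wait() returns True once the event is set, so the loop ends
        # at most `poll_seconds` after a stop request.
        while not stop_event.wait(poll_seconds):
            pass
    except KeyboardInterrupt:
        pass  # Ctrl+C in a console session also ends the watch
    finally:
        observer.stop()  # ask the watcher thread to finish
        observer.join()  # wait until it has actually exited
```

A GUI "Stop" button could then simply call stop_event.set() from its click handler.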

2. Deciding What to Do (The Handler)

This is the guard's instruction manual. The most important instruction is what to do when a new file is created (on_created).

# Note: This is a simplified example
class MyEventHandler(FileSystemEventHandler):
    def on_created(self, event):
        """Called when a file or directory is created."""
        print(f"New file detected: {event.src_path}")

        # Check if it's a file we care about (e.g., a .txt trigger)
        if event.is_directory or not event.src_path.endswith(".txt"):
            return  # Ignore folders and non-txt files

        # It's a trigger file! Let's start the workflow.
        process_new_file(event.src_path)

This class has a special method, on_created. The watchdog library automatically calls this method whenever a new file or folder appears in the watched location, passing in an event object that contains information like the item's path (event.src_path) and whether it is a directory (event.is_directory). Our code then checks if it's the right kind of file and, if so, calls another function to begin processing.
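
To see the filter in action without touching the file system, the same decision can be run against simulated events. In this sketch, FakeEvent and handle_created are hypothetical stand-ins written for illustration; FakeEvent carries only the two attributes the handler reads.

```python
from dataclasses import dataclass


@dataclass
class FakeEvent:
    """Stand-in for a watchdog event; only the two attributes used above."""
    src_path: str
    is_directory: bool = False


def handle_created(event, process) -> bool:
    """Mirror the on_created filter: call `process` only for .txt files."""
    if event.is_directory or not event.src_path.endswith(".txt"):
        return False
    process(event.src_path)
    return True


processed = []
events = [
    FakeEvent("C:/Data/Input/raw_salesdata_20241005.txt"),       # trigger file
    FakeEvent("C:/Data/Input/raw_salesdata_20241005.csv"),       # wrong extension
    FakeEvent("C:/Data/Input/archive.txt", is_directory=True),   # a folder
]
results = [handle_created(e, processed.append) for e in events]
print(results)    # [True, False, False]
print(processed)  # ['C:/Data/Input/raw_salesdata_20241005.txt']
```

Note that endswith(".txt") is case-sensitive, so with this check a file named REPORT.TXT would be ignored.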

3. Finding Related Information

Our toolkit is smart. It knows a .txt file is just a signal. The real data is often in another file, and there might even be a file containing metadata, like the number of rows.

Our code uses helper functions to figure this out. For example, given a trigger file like raw_zmmr002_bukrs_20241004.txt, a function can deduce the name of the file that contains the row count: raw_zmmr002_bukrs_count_20241004.csv.

# From: count_of_rows.py
import re
import os

def get_count_file_path(filepath: str):
    """Get the count file path given the original txt file."""
    folder = os.path.dirname(filepath)
    file = os.path.basename(filepath)

    # Use pattern matching to find parts of the filename
    match = re.match(r"^(.*?)_(.*)_(\d{8,})\.txt$", file)
    if match:
        # Reconstruct to build the new filename
        new_filename = f"{match.group(1)}_{match.group(2)}_count_{match.group(3)}.csv"
        return os.path.join(folder, new_filename)
    return None

This function from count_of_rows.py is a perfect example of the toolkit's built-in logic. It takes the trigger file's path, breaks its name into pieces, and cleverly reconstructs the name of the associated "count" file. The File System Monitor uses this logic to gather all necessary information before starting the main workflow.
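
The pattern is easier to follow with a worked example. The sketch below reuses the same regular expression on bare filenames; the helper name count_file_name is hypothetical and exists only to keep the example free of folder paths.

```python
import re
from typing import Optional

# Same pattern as in get_count_file_path:
#   group 1: text before the first underscore (matched lazily)
#   group 2: everything between the first and the last underscore
#   group 3: a run of at least 8 digits right before ".txt"
TRIGGER_PATTERN = re.compile(r"^(.*?)_(.*)_(\d{8,})\.txt$")


def count_file_name(trigger_name: str) -> Optional[str]:
    """Return the count file's name for a trigger filename, or None."""
    match = TRIGGER_PATTERN.match(trigger_name)
    if not match:
        return None
    prefix, middle, date = match.groups()
    return f"{prefix}_{middle}_count_{date}.csv"


match = TRIGGER_PATTERN.match("raw_zmmr002_bukrs_20241004.txt")
print(match.groups())
# ('raw', 'zmmr002_bukrs', '20241004')

print(count_file_name("raw_zmmr002_bukrs_20241004.txt"))
# raw_zmmr002_bukrs_count_20241004.csv

print(count_file_name("raw_20241004.txt"))   # None: needs two underscores
print(count_file_name("raw_sales_2024.txt")) # None: fewer than 8 digits
```

Names that do not fit the pattern produce None, so a caller should check the result before trying to open the count file.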

One Guard at a Time

What would happen if you accidentally started the data-toolkit twice? You'd have two "guards" watching the same door, and when a file arrived, they would both try to process it at the same time, leading to errors and confusion.

To prevent this, our toolkit includes a special mechanism to ensure only one monitor is running at a time. It's handled by a function that checks for and shuts down older instances of the application before starting a new one.

# From: delete_monitoring.py
def remove_previous_monitoring():
    """Finds and terminates other running data-toolkit processes."""

    # This loop finds other processes with the same name and closes them.
    # The 'including_this=False' flag ensures it doesn't close itself!
    while delete_data_loaders(including_this=False):
        print("Closing an older instance of the application...")

This is a critical safety feature. Before the File System Monitor even starts its watch, this logic ensures it's the only one on duty. We'll explore this more deeply in a later chapter on the Singleton Process Manager (Windows).
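
The loop's behaviour can be illustrated with a stand-in for delete_data_loaders, whose real implementation is not shown in this chapter. The sketch assumes that each call closes at most one older instance and returns a truthy value when it did, and a falsy value once none remain. The fake function, the PIDs, and the returned count are all invented for illustration.

```python
def make_fake_delete_data_loaders(older_pids):
    """Build a stand-in for delete_data_loaders (hypothetical behaviour).

    Each call "terminates" one older instance and returns True, or
    returns False once none are left.
    """
    remaining = list(older_pids)

    def delete_data_loaders(including_this=False):
        if not remaining:
            return False
        pid = remaining.pop()
        print(f"Terminating older instance with PID {pid}")
        return True

    return delete_data_loaders


def remove_previous_monitoring(delete_data_loaders) -> int:
    """Call delete_data_loaders until nothing is left; return the count."""
    closed = 0
    while delete_data_loaders(including_this=False):
        closed += 1
    return closed


closed = remove_previous_monitoring(make_fake_delete_data_loaders([4312, 5120]))
print(closed)  # 2
```

Under that assumption, the loop runs once per older instance and ends as soon as a call finds nothing left to close.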

Conclusion

You now understand the role of the File System Monitor. It is the automated, ever-watchful "eyes and ears" of our data-toolkit.

  • It watches a specific source folder for new files.
  • It uses patterns to identify important trigger files (like .txt files).
  • When it finds one, it kicks off the data processing workflow.
  • It includes safety features to prevent multiple instances from running simultaneously.

Now that our monitor has detected a new file, what happens next? How is that raw data file actually handled? Let's find out in the next chapter, as we dive into the heart of our application.

Next up: CSV-to-Parquet Conversion Engine.


Generated by AI Codebase Knowledge Builder