
Chapter 3: External Data Extractors

In the previous chapter, we saw how the Data Ingestion Pipeline acts as a smart assembly line for files that are dropped into our system. That's perfect for data we already have. But what about data that lives somewhere else, on another company's servers?

The Problem: Speaking Many Languages

Imagine your business uses several online services:

  • Zoho for your accounting and inventory.
  • Swiggy and Zomato for your daily sales orders.
  • PetPooja for managing your restaurant's point-of-sale system.

All of these services hold incredibly valuable data. But to get it, you have to talk to their Application Programming Interfaces (APIs). The problem is, each API is like a different language. Zoho's API has its own rules for logging in and asking for data, which are completely different from Swiggy's rules, which are different from Zomato's.

Trying to write code that speaks all these different "languages" in one place would be a tangled mess. It would be difficult to update, fix, and manage.

The Solution: A Team of Specialist Translators

The hsr-cloud-accelerator solves this by using External Data Extractors.

Think of these extractors as a team of specialist translators. We have one expert who only speaks "Zoho," another who is fluent in "Swiggy," and so on. When we need data, we don't worry about the foreign language; we just tell our specialist what we want.

Each extractor is a dedicated module responsible for three things:

  1. Authentication: Securely logging into the third-party service (e.g., managing API keys and access tokens).
  2. Request Logic: Knowing exactly how to ask for the data (e.g., which web address to call and what questions to ask).
  3. Data Parsing: Taking the messy, uniquely formatted response from the API and translating it into a clean, standard format that our system understands (like a simple CSV file).

The extractor's only job is to go out, get the data, and drop it off at the front door of our system. And what's our system's front door? The Data Ingestion Pipeline we learned about in the last chapter!
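
Before we follow a mission end to end, it helps to see that three-part contract as code. The sketch below is purely illustrative: the BaseExtractor class and its method names are hypothetical, and the project's real extractors are ordinary functions, as the examples later in this chapter show.

# Conceptual sketch (not actual project code): the common shape every extractor follows.
import pandas as pd

class BaseExtractor:
    def authenticate(self) -> str:
        """Log in to the third-party service and return an access token."""
        raise NotImplementedError

    def fetch(self, token: str) -> dict:
        """Call the service's API and return its raw response."""
        raise NotImplementedError

    def parse(self, raw: dict) -> pd.DataFrame:
        """Convert the raw response into a clean, standard table."""
        raise NotImplementedError

Whatever form a concrete extractor takes, these are the three responsibilities it must cover for its one service.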

How It Works: The Data-Gathering Mission

Let's follow our "Zoho translator" (the Zoho Extractor) on a mission to get the latest monthly bills. This mission is usually triggered on a schedule, for example, at the beginning of every month.

  1. The Trigger: A scheduler kicks off our Zoho Extractor function.
  2. Authentication: The extractor uses a secret API key (stored securely, thanks to our Configuration Hub) to get a temporary access token from Zoho.
  3. Request: It makes a specific web request to the Zoho API's endpoint for bills.
  4. Response: The Zoho API sends back the data, typically in a format called JSON.
  5. Parse: The extractor cleans up this JSON data and organizes it into a neat table.
  6. Deliver: Finally, it saves this table as a zoho_bills.csv file and uploads it to the "inbox" folder in Google Cloud Storage.
  7. Handoff: The moment the CSV file lands in Cloud Storage, the Data Ingestion Pipeline automatically wakes up and takes over, just as we saw in Chapter 2!

This design is powerful because it keeps responsibilities separate. The extractor's job ends the moment the file is delivered.
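
How does step 1, the trigger, actually reach our code? That depends on deployment details this chapter doesn't cover. The sketch below is an assumption, not the project's actual wiring: it imagines a Cloud Scheduler job publishing to a Pub/Sub topic whose push subscription points at a small Cloud Run service, which then calls the coordinator we examine in the next section.

# Conceptual sketch (assumed wiring): a Cloud Run service receiving the
# Pub/Sub push message published by a scheduler.
import base64

from flask import Flask, request

from run.pubsub.handlers.monthly_refresh import monthly_refresh

app = Flask(__name__)

@app.route("/", methods=["POST"])
def handle_scheduled_trigger():
    envelope = request.get_json()
    # Pub/Sub push requests wrap the published payload in a base64 "data" field.
    message = base64.b64decode(envelope["message"]["data"]).decode("utf-8")
    print(f"Received scheduled trigger: {message}")

    # Hand the mission over to the coordinator we look at below.
    monthly_refresh()
    return ("", 204)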

Under the Hood: A Peek at an Extractor

Let's look at a simplified version of what an extractor's code might look like. The file run/pubsub/handlers/monthly_refresh.py acts as a coordinator that calls several Zoho extractors.

# File: run/pubsub/handlers/monthly_refresh.py (Simplified)
from ZohoData.Monthly import monthly_bills, monthly_expenses
from CloudStoragePush.CSDataPush import DataPush

def monthly_refresh():
    # Set up a connection to our Cloud Storage
    gcs = DataPush()
    gcs.initial_connect()

    # 1. Call the specialist for bills
    monthly_bills(gcs=gcs)

    # 2. Call the specialist for expenses
    monthly_expenses(gcs=gcs)

This main function is simple. It just calls each specialist function one by one. Now, let's imagine what the monthly_bills function inside ZohoData/Monthly.py might do.
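
One detail the simplified coordinator glosses over is failure isolation: if the bills extractor crashes, should the expenses extractor still run? A common pattern, shown here as a conceptual variation rather than the project's actual behavior, is to wrap each specialist call so one failure doesn't abort the whole mission.

# Conceptual variation (not the project's actual code): isolate each
# specialist so one failure doesn't stop the whole monthly refresh.
import logging

from ZohoData.Monthly import monthly_bills, monthly_expenses
from CloudStoragePush.CSDataPush import DataPush

def monthly_refresh_resilient():
    gcs = DataPush()
    gcs.initial_connect()

    for specialist in (monthly_bills, monthly_expenses):
        try:
            specialist(gcs=gcs)
        except Exception:
            # Log and move on; the next extractor still gets its chance.
            logging.exception("Extractor %s failed", specialist.__name__)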

Step 1: Authentication

First, the extractor needs to get a login token. This complex logic is hidden away in its own function.

# Conceptual code for ZohoData/Monthly.py
import os

def get_zoho_api_token():
    # This function handles the complex logic of getting a
    # valid access token from Zoho using secrets.
    api_key = os.environ.get("ZOHO_API_KEY")
    # ... logic to request and return a token ...
    return "a_very_secret_token_123"

By keeping this separate, the rest of our code stays clean and simple.
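
For context, a production version of get_zoho_api_token would typically exchange a long-lived refresh token for a short-lived access token. The sketch below follows Zoho's standard OAuth refresh-token flow; the environment variable names are assumptions, and in this project the secrets would come from the Configuration Hub.

# Conceptual sketch of the refresh-token exchange, based on Zoho's OAuth flow.
# The environment variable names here are assumptions.
import os

import requests

def get_zoho_api_token():
    response = requests.post(
        "https://accounts.zoho.com/oauth/v2/token",
        params={
            "refresh_token": os.environ["ZOHO_REFRESH_TOKEN"],
            "client_id": os.environ["ZOHO_CLIENT_ID"],
            "client_secret": os.environ["ZOHO_CLIENT_SECRET"],
            "grant_type": "refresh_token",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["access_token"]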

Step 2: Fetching the Data

Next, it uses this token to make a web request.

# ... continuing in the conceptual extractor
import requests # A library to make web requests

def fetch_bill_data(auth_token):
    headers = {'Authorization': f'Zoho-oauthtoken {auth_token}'}
    api_url = "https://www.zohoapis.com/inventory/v1/bills"

    response = requests.get(api_url, headers=headers)
    return response.json()  # Returns data as a Python dictionary

This function knows the exact "address" for Zoho's bills and how to present the token correctly.
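
In practice, a bills endpoint rarely returns everything in a single response. The sketch below is an assumption about pagination rather than code from the project: it keeps requesting pages until the API indicates there are none left (the page_context field follows Zoho's documented response shape).

# Conceptual extension (assumed, not shown in the project code):
# walk through paged results until the API says there are no more.
import requests

def fetch_all_bills(auth_token):
    headers = {'Authorization': f'Zoho-oauthtoken {auth_token}'}
    api_url = "https://www.zohoapis.com/inventory/v1/bills"

    bills, page = [], 1
    while True:
        response = requests.get(
            api_url,
            headers=headers,
            params={"page": page, "per_page": 200},
            timeout=30,
        )
        response.raise_for_status()
        data = response.json()
        bills.extend(data.get("bills", []))

        # Zoho-style responses include a page_context block indicating
        # whether another page is available.
        if not data.get("page_context", {}).get("has_more_page"):
            return bills
        page += 1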

Step 3: Parsing and Saving

Finally, we put it all together. The main function calls the helpers, cleans the data, and uploads it.

# ... continuing in the conceptual extractor
import pandas as pd # A library for working with tables

def monthly_bills(gcs):
    # 1. Authenticate
    token = get_zoho_api_token()

    # 2. Fetch raw data
    raw_data = fetch_bill_data(token)

    # 3. Parse and convert to a standard table (DataFrame)
    df = pd.DataFrame(raw_data['bills'])

    # 4. Save this table as a CSV to Cloud Storage
    gcs.upload_dataframe_as_csv(df, "raw/zoho_bills.csv")

And that's it! The last line, gcs.upload_dataframe_as_csv, is the final handoff. The extractor's mission is complete. The CSV file is now in the cloud, and the Data Ingestion Pipeline will handle it from here.
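
We haven't looked inside the DataPush helper, but conceptually upload_dataframe_as_csv just serializes the table and writes it to a bucket. A hypothetical implementation using the google-cloud-storage client might look like this; the bucket name and method bodies are assumptions, and the real CloudStoragePush.CSDataPush module may differ.

# Hypothetical sketch of the DataPush helper; the real implementation may differ.
import pandas as pd
from google.cloud import storage

class DataPush:
    def initial_connect(self, bucket_name="hsr-ingestion-bucket"):  # bucket name assumed
        self.client = storage.Client()
        self.bucket = self.client.bucket(bucket_name)

    def upload_dataframe_as_csv(self, df: pd.DataFrame, destination_path: str):
        # Serialize the DataFrame to CSV in memory and write it as a single blob.
        blob = self.bucket.blob(destination_path)
        blob.upload_from_string(df.to_csv(index=False), content_type="text/csv")

Writing the CSV in one upload keeps the handoff atomic: the Data Ingestion Pipeline is only notified once the complete file exists in the bucket.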

Conclusion

You've just learned about External Data Extractors, our project's team of specialist data-gathering agents.

  • They are specialized modules that translate third-party APIs.
  • Each extractor handles the authentication, request logic, and data parsing for a single source.
  • Their job is to fetch external data and deliver it as a standardized file to our Cloud Storage.
  • This cleanly separates the messy work of external data gathering from the clean, internal processing of our system.

We now have two ways to get data into our system: receiving files directly (Chapter 2) and fetching them from APIs (Chapter 3). But once the data arrives, how does the system decide what to do next? What if loading one table depends on another?

Next up: Main Event Router

