
Chapter 3: HTTP/OData-based Extraction Engine

In the previous chapter, we explored the RFC-based Extraction Engine, our "armored vehicle" for high-speed, direct data extraction from SAP. But what happens when you can't build a private tunnel? What if your SAP system lives in the cloud and communicates just like any other modern web service?

For this, we need a different approach. Welcome to the HTTP/OData-based Extraction Engine—the component that talks to SAP using the universal language of the web.

The Problem: Browsing a Massive Online Catalog

Imagine your SAP data isn't in a private vault but is presented as a massive, secure online catalog. You can't use a special key (RFC) to get in. Instead, you have to use a web browser to navigate to the right department, find the products you want, and download their details one page at a time.

This is exactly what our HTTP/OData engine is designed for. It's perfect for modern, cloud-based SAP systems that expose their data through standard web APIs. It's a more universal, web-friendly way to access data.

Key Concepts: Your Web Browsing Toolkit

This engine uses a few standard web concepts to get its job done.

1. The Language: HTTP (Hypertext Transfer Protocol)

HTTP is the fundamental protocol of the World Wide Web. It's the language your browser uses to request a web page from a server. Our engine uses HTTP to send requests to the SAP system's web address (URL) and receive data back, just like your browser does when you visit a website.

2. The Rulebook: OData (Open Data Protocol)

While HTTP is the language, OData is the grammar: a set of rules that makes the conversation structured and predictable. It's an industry standard for building web APIs. Think of it as a universal instruction manual for online data catalogs. Thanks to OData, our engine knows exactly how to ask for each of the following (example URLs appear right after the list):

  • A list of all available datasets (or "product categories").
  • All the items within a specific dataset.
  • Just the new or updated items since the last time we checked.
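
For illustration, here is roughly what those requests look like as plain URLs, using the service and entity names from this chapter. This is only a sketch: the host is a placeholder, and $top is a standard OData query option rather than anything specific to our engine.

# Illustrative OData URLs (the host is a placeholder; names match this chapter)
base = "https://your-sap-odata-server.com/sap/opu/odata/sap/API_BUSINESS_PARTNER"

service_document = f"{base}/"                           # lists the available entity sets (the "product categories")
all_items        = f"{base}/A_BusinessPartner"          # every record in one entity set
first_page_only  = f"{base}/A_BusinessPartner?$top=50"  # only the first 50 records
# Requests for "only what changed" use a delta link, covered later in this chapter.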

3. The Page-Turning: Pagination

If you search for "laptops" on an online store, you don't get all 10,000 results on one page. You get the first 20, with a "Next Page" button at the bottom.

OData services work the same way. When our engine asks for data, the SAP server sends back the first "page" of records. Along with the data, it includes a special link in its response, often called __next. This link is the URL for the next page of data. Our engine simply follows these links, one by one, until there are no more pages left. This process is called pagination.

+--------------+                                +---------------------+
|  HTTP Engine |                                |  SAP OData Service  |
+--------------+                                +---------------------+
       |                                                   |
       |--- "Give me business partners" ----------------->|
       |                                                   |
       |<--- "Here is Page 1 + a `__next` link" -----------|
       |                                                   |
       |--- "OK, I'll use that link for the next page" --->|
       |                                                   |
       |<--- "Here is Page 2 + another `__next` link" -----|
       |                                                   |
       |   (And so on, until the last page is received)
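
To make the __next link concrete, here is roughly the shape of one page of an OData V2 response once parsed into a Python dictionary. The field values are made up; the important parts are the d.results list holding the records and the d.__next URL pointing at the following page (the last page simply omits it).

# Shape of one parsed OData V2 response page (values are illustrative)
page = {
    "d": {
        "results": [
            {"BusinessPartner": "1000001", "BusinessPartnerCategory": "1"},
            {"BusinessPartner": "1000002", "BusinessPartnerCategory": "2"},
            # ... more records on this page ...
        ],
        "__next": "https://your-sap-odata-server.com/sap/opu/odata/sap/"
                  "API_BUSINESS_PARTNER/A_BusinessPartner?$skiptoken=100",
    }
}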

How to Use It: Configuring the Web-Based Engine

To use this engine, you simply change the protocol in your config.yaml and provide the necessary web-related details.

# config.yaml

# This tells the entrypoint to use the HTTP Engine
protocol: "HTTP"

# --- Settings specifically for the HTTP/OData Engine ---
http:
  base_url: "https://your-sap-odata-server.com/sap/opu/odata/sap/"
  auth_user: "YOUR_API_USERNAME"
  # The password is often stored securely, not in this file

extraction_settings:
  service_name: "API_BUSINESS_PARTNER"  # The name of the data "catalog"
  entity_name: "A_BusinessPartner"      # The specific data "aisle" you want

This configuration tells the engine:

  1. base_url: The main web address for the OData service.
  2. service_name / entity_name: Which specific dataset to download from the catalog.
  3. auth_user: The credentials needed to log in to the web service.
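
As a rough sketch of how these settings might be read at runtime, the snippet below loads the YAML file and pulls the password from an environment variable, in line with the comment in the config. The real engine's loading code may differ, and the SAP_ODATA_PASSWORD variable name is an assumption.

import os
import yaml  # PyYAML

# Minimal sketch: load the settings shown above (the real engine may differ)
with open("config.yaml") as f:
    config = yaml.safe_load(f)

http_config = config["http"]
settings = config["extraction_settings"]

# The password stays out of the file; here we assume it comes from an
# environment variable (the variable name is hypothetical).
password = os.environ["SAP_ODATA_PASSWORD"]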

Under the Hood: A Step-by-Step Browsing Session

When the Extraction Entrypoint kicks off the HTTP engine, it simulates a very methodical web browsing session.

Let's look at some simplified code that shows how this works.

Step 1: Making the First Request

The engine starts by building the initial URL and using a standard Python library like requests to fetch the first page.

# Simplified code from the HTTP engine
import requests

def get_first_page(http_config, settings):
    # Build the full URL, e.g., "https://.../API_BUSINESS_PARTNER/A_BusinessPartner"
    url = f"{http_config['base_url'].rstrip('/')}/{settings['service_name']}/{settings['entity_name']}"

    # Send the web request with username and password,
    # asking for JSON instead of the default XML response
    response = requests.get(
        url,
        auth=(http_config['auth_user'], "..."),
        headers={"Accept": "application/json"},
    )

    return response.json()  # Return the data from the response
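
Assuming http_config and settings are the dictionaries loaded from config.yaml (as sketched in the previous section), calling it might look like this:

first_page = get_first_page(http_config, settings)
records = first_page["d"]["results"]  # OData V2 wraps the payload in d.results
print(f"Fetched {len(records)} records on the first page")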

Step 2: The Pagination Loop

Now, the engine enters a loop. It processes the data from the current page and then checks if the response contains a __next link to continue.

# Simplified representation of the pagination loop

# Start with the URL for the first page
next_page_url = "https://.../A_BusinessPartner"

while next_page_url:
    response = requests.get(next_page_url, auth=("user", "pass")).json()

    # Process the data from the current page
    process_data(response['d']['results'])

    # Check for the next page link. If it doesn't exist, this will be None.
    next_page_url = response['d'].get('__next')

print("All pages have been downloaded!")

This simple while loop is the core of the engine's fetching mechanism. It continues until the SAP server stops providing a __next link, which means we've reached the end of the data.

What about fetching only new data?

This is where OData shines. After fetching all the data for the first time, the final response from SAP can include a special delta link. This link is a bookmark. The next time you run an extraction, you can use this delta link instead of the original URL. When you do, SAP will only send you the records that have been created or changed since your last run.

Our engine automatically saves this delta link (for example, in the delta_link field in our Database Metadata Model) and uses it for subsequent runs, making future extractions incredibly fast and efficient.
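
As a rough sketch of how that could look in code, the pagination loop from earlier can be extended to capture and reuse the delta link. This builds on the simplified snippets above (requests, process_data), and the load_delta_link / save_delta_link helpers are hypothetical stand-ins for however the Database Metadata Model is actually read and written; in SAP's OData V2 responses, the delta link typically shows up as __delta on the final page.

# Sketch only: builds on the simplified loop above.
# load_delta_link / save_delta_link are hypothetical helpers standing in
# for the real Database Metadata Model persistence.

def extract(initial_url, auth):
    # Start from the saved delta link if one exists, otherwise do a full load
    next_page_url = load_delta_link() or initial_url
    last_page = None

    while next_page_url:
        last_page = requests.get(next_page_url, auth=auth,
                                 headers={"Accept": "application/json"}).json()
        process_data(last_page["d"]["results"])
        next_page_url = last_page["d"].get("__next")

    # The final page may carry a delta link ("__delta"); save it so the
    # next run only fetches records created or changed since this one.
    if last_page:
        delta_link = last_page["d"].get("__delta")
        if delta_link:
            save_delta_link(delta_link)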

Conclusion

You've now learned about our modern, web-savvy extraction tool: the HTTP/OData-based Extraction Engine. It communicates with SAP systems over standard HTTP, follows the OData rulebook to navigate the data catalog, and cleverly handles pagination with __next links to fetch everything. It's the perfect engine for cloud environments or any system that exposes its data via a web API.

So far, we've covered two ways to get raw data out of SAP. But raw data is messy. It needs to be cleaned, structured, and stored in an efficient format for analysis. What happens after the data is extracted?

Let's find out in the next chapter, where we'll explore the Data Transformation to Parquet.


Generated by AI Codebase Knowledge Builder