Chapter 4: Data Transformation to Parquet
In our previous chapters, we learned how the RFC-based Extraction Engine and the HTTP/OData-based Extraction Engine act as powerful tools to pull raw data out of SAP systems. But what happens after the data is extracted? We're left with a pile of raw, unprocessed information.
This is where our project's assembly line kicks in. Welcome to the Data Transformation to Parquet pipeline, the component that turns raw materials into a finished, polished product.
The Problem: From Farm to Supermarket Shelf
Imagine you're running a food delivery service. The extraction engines are your trucks that go to different farms (SAP systems) and bring back raw ingredients. They might bring back crates of carrots covered in dirt, sacks of potatoes of all sizes, and bunches of herbs. You can't just send these directly to your customers.
You need a kitchen—an assembly line—to:
- Wash and clean the vegetables (clean up data).
- Sort and chop them into a standard size (enforce a data structure).
- Package them into convenient, sealed meal kits (convert to an efficient file format).
- Check for quality and remove any bad items (remove duplicates).
Our transformation pipeline is this kitchen. It takes the raw data from the extraction engines and prepares it for an analytics "supermarket" like a data lake or data warehouse.
Key Concepts: The Assembly Line Stations
Our data assembly line has a few key stations that every piece of data passes through.
1. The Format: Parquet (The Perfect Package)
The first thing to understand is our "packaging" of choice: Parquet.
Most people are familiar with file formats like CSV or TXT. These are row-based: reading them is like reading a book one line at a time, from top to bottom. That works fine if you want the whole book, but what if you only wanted every mention of a single character's name? You'd still have to scan every single line.
Parquet is different. It's a columnar format. Imagine instead of a book, you have an address book organized by columns: Last Name, First Name, Phone Number. If you want to find all the people with the last name "Smith," you only need to look at the "Last Name" column, which is incredibly fast.
Parquet files are:
- Columnar: Great for analytical queries that often look at specific columns (e.g., SUM(SALES)).
- Compressed: They take up significantly less storage space, saving costs.
- Schema-Aware: They store the "recipe" (data types like text, number, date) within the file itself.
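To make the address-book analogy concrete, here is a minimal sketch of a columnar read with pandas. The file path and column names are illustrative, not taken from the real pipeline; the point is that a Parquet reader only touches the columns you ask for.
# Illustrative example: read just two columns from a Parquet file
import pandas as pd
# Only the requested columns are read from disk; everything else is skipped
sales = pd.read_parquet(
    "data/clean/sales_data.parquet",
    columns=["SALES_ORDER_ID", "AMOUNT"],
)
print(sales["AMOUNT"].sum())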
2. The Recipe: Schema Enforcement
Before we can package the data, we need to make sure every piece is in the right format. The raw data from SAP often arrives as plain text: a date might come through as the string "2023-10-27" and an amount as the string "100.50". Our assembly line uses a schema, a set of rules, to convert these into proper data types. This ensures that a DATE column is always a real date and a REVENUE column is always a real number.
3. The Cleanup Crew: Merge and Deduplicate
Our extraction engines often fetch data in small batches or "packages" to be efficient. This can result in dozens of small files.
- Merge: Imagine having 50 small grocery bags. It's much easier to carry one or two large ones. The merge step combines all the small data files from a single extraction run into one larger, more manageable file.
- Deduplicate: Sometimes, especially with delta loads, we might get the same record more than once (e.g., a sales order was created, then immediately updated). The deduplication step looks at a unique key (like SALES_ORDER_ID) and ensures that we only keep the most recent version of each record.
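To see the deduplication idea on concrete data, here is a tiny, self-contained illustration. The order ID and amounts below are made up, not from the real pipeline.
# Illustrative example: the same sales order appears twice because it was
# created and then immediately updated; we keep only the latest version
import pandas as pd
changes = pd.DataFrame({
    'SALES_ORDER_ID': ['1001', '1001'],
    'AMOUNT': [100.50, 120.00]   # the second row is the update
})
latest = changes.drop_duplicates(subset=['SALES_ORDER_ID'], keep='last')
print(latest)   # only the updated row (AMOUNT 120.00) remains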
How to Use It: It's Automatic!
The best part about this component is that you don't have to do anything to run it! It's automatically triggered right after an extraction engine successfully finishes its job. The entire flow, from extraction to transformation, happens as part of the single command you already learned:
python main.py
The transformation pipeline is a loyal worker that waits for the extraction trucks to arrive and immediately gets to work on the raw materials they drop off.
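To picture that hand-off, here is a minimal, self-contained sketch of the "extract, then transform" flow. The function names below (run_extraction, transform_to_parquet) are illustrative stand-ins, not the project's actual API.
# Hypothetical sketch of the orchestration, not the project's real code
from typing import List

def run_extraction() -> List[str]:
    # Stand-in for an extraction engine: in reality this would call SAP
    # and write raw package files, returning their paths
    return ["data/raw/package_1.json", "data/raw/package_2.json"]

def transform_to_parquet(raw_files: List[str]) -> None:
    # Stand-in for the transformation pipeline described in this chapter
    print(f"Transforming {len(raw_files)} raw files into Parquet...")

if __name__ == "__main__":
    raw_files = run_extraction()        # the "trucks" deliver raw data
    if raw_files:                       # transformation kicks in automatically
        transform_to_parquet(raw_files)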
Under the Hood: A Look at the Assembly Line
So what happens behind the scenes after the extraction is complete? Let's follow a batch of data as it moves through the pipeline.
Let's walk through this with some simplified code examples, using the popular pandas library, which is well suited to this kind of work.
Step 1: Reading and Merging the Raw Data
The pipeline first finds all the small JSON or text files that the extraction engine created and merges them into a single data table (called a DataFrame).
# Simplified code from the transformation pipeline
import pandas as pd
# Let's say the extractor created these files
raw_file_paths = ["data/raw/package_1.json", "data/raw/package_2.json"]
# Read each JSON file into a table and combine them
list_of_tables = [pd.read_json(path) for path in raw_file_paths]
merged_table = pd.concat(list_of_tables, ignore_index=True)
# Now we have one big table with all the raw data
This simple code takes a list of file paths and produces one unified table, ready for the next station.
Step 2: Cleaning the Data (Schema Enforcement)
Next, the pipeline applies the correct data types. This is like making sure all prices are numbers and all dates are actual dates.
# Continuing from above...
# Define what the columns SHOULD be
correct_schema = {
    'SALES_ORDER_ID': 'string',
    'ORDER_DATE': 'datetime64[ns]',
    'AMOUNT': 'float64'
}
# Apply the schema to the table
clean_table = merged_table.astype(correct_schema)
After this step, our data is structured and clean. The AMOUNT column can now be used for math operations, and the ORDER_DATE can be used for time-based analysis.
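As a quick, optional illustration of what the enforced types unlock (continuing with the clean_table from above; these checks are not part of the pipeline itself):
# Numeric math works because AMOUNT is now a float
total_revenue = clean_table['AMOUNT'].sum()
# Date logic works because ORDER_DATE is now a real datetime
orders_per_month = clean_table['ORDER_DATE'].dt.to_period('M').value_counts()
print(total_revenue)
print(orders_per_month)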
Step 3: Deduplicating and Saving to Parquet
Finally, the pipeline removes any duplicate entries and saves the result in the super-efficient Parquet format.
# Continuing from above...
primary_key = "SALES_ORDER_ID"
output_path = "data/clean/sales_data.parquet"
# Remove duplicates, keeping only the last-seen record for each ID
final_table = clean_table.drop_duplicates(subset=[primary_key], keep='last')
# Save the final, clean table to a compressed Parquet file
final_table.to_parquet(output_path, compression='snappy')
print(f"Transformation complete! Clean data at: {output_path}")
And that's it! Our raw, messy data has been turned into a single, clean, compressed, and analysis-ready Parquet file.
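If you want to verify the result yourself, reading the file back is a quick optional check (illustrative, not part of the pipeline):
# Read the Parquet file back and inspect what was stored
import pandas as pd
check = pd.read_parquet("data/clean/sales_data.parquet")
print(check.dtypes)   # the data types stored in the file
print(len(check))     # one row per unique SALES_ORDER_ID after deduplication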
Conclusion
You've just toured the delta-extractor's data kitchen. The Data Transformation to Parquet pipeline is a crucial, automated step that takes raw extracted data and puts it through an assembly line to:
- Enforce a clean data schema.
- Merge many small files into one.
- Remove duplicate records.
- Save the output in the highly efficient Parquet format.
This ensures that the data we produce is not just extracted, but is ready for immediate use in professional analytics tools.
But having a perfect Parquet file sitting on your local computer isn't very useful for a team or a large-scale project. How do we get this file to a central, shared location in the cloud?
Let's find out in our next chapter as we explore the Cloud Storage Uploader.
Generated by AI Codebase Knowledge Builder