Incremental Load in Databricks, Part 1
Use cases
This approach is suitable when the pipeline runs infrequently. Assume a scenario where new files land in the same directory every day and you do not maintain separate "processed" and "yet to process" folders.
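For concreteness, on day 3 the landing folder might look like this (file names are hypothetical, for illustration only):

emp_data/
├── emp_day1.csv   <- picked up on day 1
├── emp_day2.csv   <- picked up on day 2
└── emp_day3.csv   <- new today, not yet processed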
Steps to implement
- List the current files in the directory
-- Create the database that will hold the bookkeeping tables
CREATE DATABASE IF NOT EXISTS inc;

# List the files currently sitting in the landing directory
files_present_df = spark.createDataFrame(
    dbutils.fs.ls("abfss://demo@demorz123.dfs.core.windows.net/Incremental_load_test/emp_data/")
).select("name", "path")
# Persist the listing to a Delta table
files_present_df.write.format("delta").mode("overwrite").saveAsTable("inc.files_present")
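As a quick sanity check, you can display what was just captured (a minimal usage example):

display(spark.table("inc.files_present"))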
- Create a Delta table to track processed files
if not spark.catalog.tableExists("inc.processed_files"):
    # Create an empty DataFrame with the same schema as the source table
    files_present_df = spark.table("inc.files_present")  # Load the source table
    empty_df = spark.createDataFrame([], files_present_df.schema)
    # Save the empty DataFrame as a new Delta table
    empty_df.write.format("delta").saveAsTable("inc.processed_files")
    print("Table 'inc.processed_files' created successfully without data!")
else:
    print("Table 'inc.processed_files' already exists.")
- Compare the processed_files and files_present tables and extract the new (unprocessed) files
# Read the names of files that have already been processed
processed_files_df = spark.table("inc.processed_files").select("name")

# A left anti join keeps only the rows of files_present with no match in processed_files
anti_join_df = files_present_df.join(processed_files_df, on="name", how="left_anti")
files_yet_to_process = [row['path'] for row in anti_join_df.select("path").collect()]
display(anti_join_df)
- Pass the list of unprocessed files to the reader (apply any transformations you need); a guard for the empty-list case is sketched after the code:
emp_df_new = spark.read.option("header", "true").csv(files_yet_to_process)
emp_df_new.display()
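One caveat: spark.read.csv typically fails on an empty path list because it cannot infer a schema from zero files, so the job would error on days with no new arrivals. A minimal guard (a sketch reusing the variables above):

if files_yet_to_process:
    # Only read when there is at least one new file
    emp_df_new = spark.read.option("header", "true").csv(files_yet_to_process)
    emp_df_new.display()
else:
    print("No new files to process today.")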
- Record the newly processed files in the processed_files table after the transformation (a Python equivalent follows the SQL):
MERGE INTO inc.processed_files AS target
USING inc.files_present AS source
ON target.name = source.name
WHEN NOT MATCHED THEN
INSERT (`name`, path)
VALUES (source.name, source.path);
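If you would rather stay in Python, the Delta Lake API exposes the same merge (a minimal sketch, assuming the delta-spark package that ships with Databricks runtimes):

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "inc.processed_files")
(
    target.alias("target")
    .merge(files_present_df.alias("source"), "target.name = source.name")
    .whenNotMatchedInsertAll()  # insert only files not seen before
    .execute()
)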
Congrats, we've completed Part 1! That's it for now; see you in the next part.