Incremental Load in Databricks, Part 1
Use cases
This approach is suitable when the pipeline runs infrequently. Assume a scenario where new files land in the same directory every day and you do not maintain separate "processed" and "yet to process" folders.
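For concreteness, on day 3 the landing folder might look like this (file names are hypothetical, for illustration only):

emp_data/
├── emp_day1.csv   <- picked up on day 1
├── emp_day2.csv   <- picked up on day 2
└── emp_day3.csv   <- new today, not yet processed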
Steps to implement
- List the current files in the directory
-- Create the database that will hold the bookkeeping tables
CREATE DATABASE IF NOT EXISTS inc;

# List the files currently sitting in the landing directory
files_present_df = spark.createDataFrame(
    dbutils.fs.ls("abfss://demo@demorz123.dfs.core.windows.net/Incremental_load_test/emp_data/")
).select("name", "path")
# Persist the listing to a Delta table
files_present_df.write.format("delta").mode("overwrite").saveAsTable("inc.files_present")
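As a quick sanity check, you can display what was just captured (a minimal usage example):

display(spark.table("inc.files_present"))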
- Create a Delta table to track processed files
if not spark.catalog.tableExists("inc.processed_files"):
    # Create an empty DataFrame with the same schema as the source table
    files_present_df = spark.table("inc.files_present")  # Load the source table
    empty_df = spark.createDataFrame([], files_present_df.schema)
    # Save the empty DataFrame as a new Delta table
    empty_df.write.format("delta").saveAsTable("inc.processed_files")
    print("Table 'inc.processed_files' created successfully without data!")
else:
    print("Table 'inc.processed_files' already exists.")
- Compare the processed_files and files_present tables and extract the new (unprocessed) files
# Read the names of files that have already been processed
processed_files_df = spark.table("inc.processed_files").select("name")

# A left anti join keeps only the rows of files_present with no match in processed_files
anti_join_df = files_present_df.join(processed_files_df, on="name", how="left_anti")
files_yet_to_process = [row['path'] for row in anti_join_df.select("path").collect()]
display(anti_join_df)
- Pass the list of unprocessed files to the reader (apply any transformations you need); a guard for the empty-list case is sketched after the code:
emp_df_new = spark.read.option("header", "true").csv(files_yet_to_process)
emp_df_new.display()
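One caveat: spark.read.csv typically fails on an empty path list because it cannot infer a schema from zero files, so the job would error on days with no new arrivals. A minimal guard (a sketch reusing the variables above):

if files_yet_to_process:
    # Only read when there is at least one new file
    emp_df_new = spark.read.option("header", "true").csv(files_yet_to_process)
    emp_df_new.display()
else:
    print("No new files to process today.")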
- Record the newly processed files in the processed_files table after the transformation (a Python equivalent follows the SQL):
MERGE INTO inc.processed_files AS target
USING inc.files_present AS source
ON target.name = source.name
WHEN NOT MATCHED THEN
INSERT (`name`, path)
VALUES (source.name, source.path);
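If you would rather stay in Python, the Delta Lake API exposes the same merge (a minimal sketch, assuming the delta-spark package that ships with Databricks runtimes):

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "inc.processed_files")
(
    target.alias("target")
    .merge(files_present_df.alias("source"), "target.name = source.name")
    .whenNotMatchedInsertAll()  # insert only files not seen before
    .execute()
)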
Congrats, we've completed Part 1! That's it for now; see you in the next part.