Vida online : octubre 2022

Merge does requires only row per key to work. Imagine the following scenario:

You are loading at intervals deltas or incremental from an SAP ERP system (ECC; S/4, etc.) related to purchase orders.

In your target table (in delta lake format for example), you already have a this rows:

The "PrimaryKey" field is self-explanatory. Then the other fields are just example fields, but the last field, "RowNumber", is something the extraction tool usually provides; it can be just an integer number that defines a sequence or a timestamp.

Let's understand that field or column.

The following table depicts a classic delta or incremental load.

It has two kinds of records, the light blue ones, which are new records (new primary keys), and the white ones which are updates to existing records.

K2 and K9 appear twice. One reason for this could be because of the following:

The user changed purchase order K2; for example, he changed one condition of the payment Field1 = "NEWVALUE2".

Then he made a second change; he altered how many units he wanted to order Vfield3 = 22.

Now we can see that we have two records for K2, and we need to pick the last one. In this case, we identify the last one using the PrimaryKey and RowNumber.

If we try to merge this delta without telling the merge how to pick the last and only one record per primary key, it will fail with the following error:

(This screenshot is from a Synapse Notebook)

We can use a Window Function like row_number to filter the last record for each primary key; in this way, we guarantee that the Merge statement runs successfully.

We can create a view:

%%sql
create temp view T_DeltaPurchaseOrders as
select PrimaryKey, Field1, Field2, Vfield3, row_number() over (PARTITION 
 BY PrimaryKey ORDER BY deltapurchaseorders.RowNumber desc) AS frank FROM DeltaPurchaseOrders

This view will return the following result:

Then we can add a condition "where frank = 1"

We will get the following result:

As we can see, now we have only one row per primary key, we can merge that into our table and it won't fail.

Heads-up: Before trying this technique, verify with someone who knows the source system; this is how you should handle "duplicate" records in incremental or delta loads. There may be other cases.

References:

https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge

A merge operation can fail if multiple rows of the source dataset match and the merge attempts to update the same rows of the target Delta table. According to the SQL semantics of merge, such an update operation is ambiguous as it is unclear which source row should be used to update the matched target row. You can preprocess the source table to eliminate the possibility of multiple matches. See the change data capture example—it shows how to preprocess the change dataset (that is, the source dataset) to retain only the latest change for each key before applying that change into the target Delta table.

Vida online

martes, 11 de octubre de 2022

Merge in Synapse Spark Pool using delta lake tables (For example with deltas from SAP)