Databricks Analytics Engineering: Module 4: Data Transformation Patterns

Source: DEV Community
## Learning Objectives

By the end of this module, you will be able to:

- Apply common transformation patterns: deduplication, pivot, unpivot, gap-fill, and sessionization
- Write complex joins, including semi-joins, anti-joins, and inequality joins
- Parse and transform semi-structured data (JSON, XML) with Databricks SQL
- Use array and map operations for nested data transformations
- Implement higher-order functions and SQL UDFs for custom logic

## 4.1 Deduplication Patterns

Duplicate records are one of the most common data quality issues. The right approach depends on whether the duplicates are exact or fuzzy.

### Exact Deduplication with ROW_NUMBER

```sql
-- Keep the latest record per customer based on updated_at
WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY updated_at DESC, _ingested_at DESC
    ) AS row_num
  FROM silver.stg_customers
)
SELECT * EXCEPT(row_num)
FROM ranked
WHERE row_num = 1;
```

### Deduplication with QUALIFY (Databricks SQL)

```sql
-- Cleaner syntax: QUALIFY filters on the window function directly,
-- with no CTE or subquery needed
SELECT *
FROM silver.stg_customers
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id
  ORDER BY updated_at DESC, _ingested_at DESC
) = 1;
```
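The keep-latest-record-per-key logic behind the ROW_NUMBER pattern can also be sketched outside SQL. A minimal Python illustration (the `customer_id`/`updated_at` records here are hypothetical sample data, and tie-breaking on a secondary column is omitted for brevity):

```python
from operator import itemgetter

def dedupe_latest(rows, key="customer_id", order="updated_at"):
    """Keep the most recent row per key, mirroring
    ROW_NUMBER() ... PARTITION BY key ORDER BY order DESC, row_num = 1."""
    latest = {}
    for row in rows:
        k = row[key]
        # Replace the stored row only if this one is strictly newer
        if k not in latest or row[order] > latest[k][order]:
            latest[k] = row
    return sorted(latest.values(), key=itemgetter(key))

rows = [
    {"customer_id": 1, "updated_at": "2024-01-01", "name": "Ann"},
    {"customer_id": 1, "updated_at": "2024-03-01", "name": "Anne"},
    {"customer_id": 2, "updated_at": "2024-02-01", "name": "Bob"},
]
print(dedupe_latest(rows))
# customer 1 keeps the 2024-03-01 row ("Anne"); customer 2 keeps "Bob"
```

Note the strict `>` comparison: like `ORDER BY updated_at DESC` with `row_num = 1`, it keeps exactly one row per key even when timestamps tie, which is why the SQL version also adds `_ingested_at DESC` as a deterministic tie-breaker.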