Databricks



General

  • Meant to be an environment for
    1. Data Engineering - ingest raw data; normalize, filter, and extract a good representation
    2. Data Science - explore the data, tune models
    3. Productionization - once you've selected the best model, get it out to production
  • Built-in data visualization using Plotly
  • Register data as a Delta table in the metastore


Features

  • Unity Catalog
  • Data lineage
  • Grant privileges through the UI or programmatically (see the sketch after this list)
  • AI auto-tagging of data
  • Usage stats generated through Unity Catalog from audit logs (e.g., frequent queries), over a 30-day or 90-day window *IF* it's a managed table
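
A minimal sketch of granting a privilege programmatically with Unity Catalog SQL from a notebook; the table main.sales.orders and the analysts group are hypothetical names:

  # Grant read access on a table to a group (requires Unity Catalog)
  spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")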


Links


Training

Components

dbutils

Delta Lake

Description

  • Optimized storage layer
  • Open-source software that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling
  • Used for both batch and streaming operations
  • All tables created on Databricks are Delta tables by default, whether written via Spark DataFrames or SQL
  • Just save data to the lakehouse with default settings and you get a Delta table
  • A Delta table is stored as a directory: Parquet data files plus a _delta_log transaction log (see the sketch below)
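
A minimal sketch of inspecting that directory from a Databricks notebook, assuming a hypothetical table named events; dbutils is only available inside notebooks:

  # Look up the table's storage location from its metadata
  location = spark.sql("DESCRIBE DETAIL events").first()["location"]

  # Parquet data files sit at the top level of the directory
  for f in dbutils.fs.ls(location):
      print(f.path)

  # JSON commit files under _delta_log form the transaction log
  for f in dbutils.fs.ls(location + "/_delta_log"):
      print(f.path)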


Operations

  • Create a new managed table: df.write.saveAsTable(table_name)
  • Find its storage location using %sql DESCRIBE DETAIL table_name
  • "UPSERT" using the MERGE INTO table_name syntax to update existing rows and insert new ones in a single statement (see the sketch after this list)


Delta lake best practices

  • Columns that are queried often and have high cardinality are good candidates for the ZORDER BY clause.
  • Compact a Delta table that is written to often using the OPTIMIZE command; use VACUUM to remove old, unreferenced files (see the sketch below)
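
A minimal sketch of both commands via spark.sql; events and user_id are hypothetical names:

  # Compact small files and co-locate rows on a high-cardinality,
  # frequently filtered column
  spark.sql("OPTIMIZE events ZORDER BY (user_id)")

  # Delete data files no longer referenced by the transaction log
  # (only those older than the default 7-day retention window)
  spark.sql("VACUUM events")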

Delta Share


Lakehouse

  • Organizes data stored with Delta Lake in cloud object storage
  • Combines benefits of enterprise data warehouse and data lake
  • Object hierarchy: Metastore -> Catalog -> Schema (synonym for database in Databricks) -> table -> view -> function (see the sketch after this list)
    • view = saved query
    • function = saved logic that returns scalar or set of rows
  • metastore - centralized access control, auditing, lineage, and data discovery capabilities
    • Unity catalog - new!
    • Built-in Hive metastore (old/legacy)
  • catalog - highest abstraction in relational model
  • database - a collection of tables, views and functions. Optionally has a location
  • table - structured data stored as a directory of files on cloud object storage
    • managed table - no location specified at write time
      • (df.write.saveAsTable("table_name"))
    • unmanaged table - specify location at write time
      • df.write.option("path", "/path/to/empty/directory").saveAsTable("table_name")
  • Introducing Lakehouse Federation Capabilities in Unity Catalog - serves external databases read-only
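
A minimal sketch of the object hierarchy using Unity Catalog SQL from a notebook; the catalog/schema/object names and the amount column are hypothetical, and CREATE CATALOG requires a Unity Catalog-enabled workspace:

  # catalog and schema: the top two levels under the metastore
  spark.sql("CREATE CATALOG IF NOT EXISTS demo")
  spark.sql("CREATE SCHEMA IF NOT EXISTS demo.sales")

  # table - managed here, since no path is specified
  df.write.saveAsTable("demo.sales.orders")

  # view - a saved query
  spark.sql("""CREATE VIEW demo.sales.big_orders AS
               SELECT * FROM demo.sales.orders WHERE amount > 100""")

  # function - saved logic that returns a scalar
  spark.sql("""CREATE FUNCTION demo.sales.usd(cents INT)
               RETURNS DOUBLE RETURN cents / 100.0""")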

Hive table

  • Apache Hive glossary entry
  • Hive is data warehouse software
  • Designed for large datasets extracted from the Apache Hadoop Distributed File System (HDFS)
  • SQL-like query language (HiveQL), plus support for other languages
  • provides centralized metadata store of all the datasets in the organization in a relational database system
  • Available data types: int, datetime/interval, string, struct/union, boolean
  • Write once, read many times
  • Does not do data validation or schema checking

DataFrame.createOrReplaceTempView()

  • Creates a view into the data that can be queried using SQL, e.g. via the %sql magic in Databricks
  • Used when you want a session-scoped table for use within your notebook
  • Temp views are dropped when the session ends, unless you persist the data as a Hive table
  • Use saveAsTable() to persist data beyond the session (see the sketch after this list)
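
A minimal sketch, assuming a hypothetical existing DataFrame sensor_df:

  # Register a session-scoped temp view over a DataFrame
  sensor_df.createOrReplaceTempView("sensors")

  # Query it with SQL (equivalently, use the %sql magic in a notebook cell)
  spark.sql("SELECT COUNT(*) AS n FROM sensors").show()

  # The view is dropped when the session ends; saveAsTable() persists the data
  sensor_df.write.saveAsTable("sensors_persisted")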


GIS capabilities

  • Different roles - Creating, consuming, cleaning, ingesting and optimizing, advanced analytics

Workflows

Unity Catalog