Databricks



General

  • Meant to be an environment for:
    1. Data Engineering - ingest raw data, then normalize and filter it to extract a good representation
    2. Data Science - explore the data, tune models
    3. Productionization - once you've selected the best model, get it out to production
  • Built-in data visualization using Plotly
  • Register data as a Delta table in the metastore (see the sketch after this list)
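A minimal PySpark sketch of that flow, assuming the spark session a Databricks notebook provides; the file path, column names, and table name here are hypothetical:

    # Data engineering: ingest hypothetical raw data.
    df = spark.read.csv("/tmp/raw_events.csv", header=True, inferSchema=True)
    # Normalize/filter down to a good representation.
    clean = df.filter(df["amount"].isNotNull())
    # Register in the metastore; on Databricks this is a Delta table by default.
    clean.write.saveAsTable("events_clean")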

Links

Components

Delta Lake

Description

  • Optimized storage layer
  • Open-source software that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling
  • Used for both batch and streaming operations
  • All tables on Databricks are Delta tables by default, whether created from Spark DataFrames or from SQL
  • Just save data to the lakehouse with default settings
  • Each Delta table is a directory of Parquet data files plus a _delta_log/ transaction log (illustrated below)
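To make that concrete, a sketch of both a metastore write and a path-based write; the table name and directory are made up, and spark is the notebook's SparkSession:

    # With default settings, saveAsTable produces a Delta table.
    df.write.saveAsTable("my_table")
    # Path-based Delta write; the target directory ends up holding Parquet
    # data files plus a _delta_log/ subdirectory with the transaction log.
    df.write.format("delta").mode("overwrite").save("/tmp/my_delta_dir")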

Operations

  • create a new managed table: df.write.saveAsTable(table_name)
  • find a table's storage location using %sql DESCRIBE DETAIL table_name
  • "UPSERT" using MERGE INTO table_name syntax to insert new rows and update existing rows in a single operation (see the sketch below)


Delta lake best practices

  • Use the ZORDER BY clause with columns that are queried often and have high cardinality.
  • Compact a Delta table that is written to often using the OPTIMIZE command; use VACUUM to remove the old, unreferenced files (example below).
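Roughly, with placeholder table and column names:

    # Compact small files and co-locate rows on a frequently filtered,
    # high-cardinality column.
    spark.sql("OPTIMIZE events_clean ZORDER BY (user_id)")
    # Remove data files no longer referenced by the transaction log,
    # respecting the default 7-day retention window.
    spark.sql("VACUUM events_clean")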

Lakehouse

  • Organizes data stored with Delta Lake in cloud object storage
  • Combines benefits of enterprise data warehouse and data lake
  • Object hierarchy: Metastore -> Catalog -> Schema (synonym for database in Databricks), which contains tables, views, and functions
    • view = saved query
    • function = saved logic that returns a scalar or a set of rows
  • metastore - centralized access control, auditing, lineage, and data discovery capabilities
    • Unity Catalog - new!
    • Built-in Hive metastore (old/legacy)
  • catalog - the highest level of abstraction in the relational model
  • database - a collection of tables, views and functions. Optionally has a location
  • table - structured data stored as a directory of files on cloud object storage
    • managed table - no location specified at write time
      • (df.write.saveAsTable("table_name"))
    • unmanaged table - specify a location at write time (both patterns are sketched below)
      • df.write.option("path", "/path/to/empty/directory").saveAsTable("table_name")
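The hierarchy shows up directly in the three-level namespace, catalog.schema.table. A sketch with hypothetical names, assuming a Unity Catalog catalog called main already exists:

    spark.sql("CREATE SCHEMA IF NOT EXISTS main.analytics")
    # Managed: no path given, so the metastore controls the storage.
    df.write.saveAsTable("main.analytics.orders")
    # Unmanaged (external): the storage location is supplied at write time.
    df.write.option("path", "/mnt/extern/orders").saveAsTable("main.analytics.orders_ext")
    # DESCRIBE EXTENDED reports the table type (MANAGED vs EXTERNAL).
    spark.sql("DESCRIBE EXTENDED main.analytics.orders").show(truncate=False)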

Hive table

  • Apache Hive glossary entry
  • Hive is data warehouse software
  • designed for large datasets extracted from the Apache Hadoop Distributed File System (HDFS)
  • SQL-like query language HiveQL, plus the ability to use other languages
  • provides a centralized metadata store of all the datasets in the organization, kept in a relational database system
  • Available data types: int, datetime/interval, string, struct/union, boolean (see the sketch after this list)
  • Write once, read many times
  • Does not do data validation or schema checking at write time
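A rough sketch of a table declaration using several of those types, issued from Python via spark.sql; the table and column names are invented:

    spark.sql("""
        CREATE TABLE IF NOT EXISTS user_events (
            user_id   INT,
            happened  TIMESTAMP,
            label     STRING,
            context   STRUCT<page: STRING, ref: STRING>,
            is_active BOOLEAN
        )
    """)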

DataFrame.createOrReplaceTempView()

  • creates a view into the data that can be queried using SQL via the %sql magic in Databricks
  • used when you want a session-scoped table for use within your notebook
  • views are dropped when the session ends; to persist beyond the session, write the data out with saveAsTable() instead (sketch below)
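A short sketch; the view and table names are placeholders:

    # Session-scoped: queryable via SQL (or the %sql magic) until the
    # session ends; nothing is written to storage.
    df.createOrReplaceTempView("events_view")
    spark.sql("SELECT COUNT(*) FROM events_view").show()
    # To persist beyond the session, write a real table instead.
    df.write.saveAsTable("events_table")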


Workflows