Databricks
General
- Meant to be an environment for:
- Data engineering - ingest raw data, normalize and filter it, extract a good representation
- Data science - explore the data, tune models
- Productionization - once you've selected the best model, deploy it to production
- Built-in data visualization using Plotly
- Register data as a Delta table in the metastore
Links
- Introducing MLflow for end-to-end Machine Learning on Databricks
- How to work with files on Databricks
Components
Delta Lake
Description
- Optimized storage layer
- Open-source software (OSS) that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling
- Used for both batch and streaming operations
- All tables on Databricks are Delta tables by default, whether created from Spark DataFrames or SQL
- Just save data to the lakehouse with default settings
- Each Delta table is a directory of Parquet data files plus a _delta_log transaction log
Operations
- create a new managed table:
df.write.saveAsTable(table_name)
- find out location using
%sql DESCRIBE DETAIL table_name
- "UPSERT" using the
MERGE INTO table_name
syntax to update existing rows and insert new ones in a single statement
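A minimal sketch of such an upsert (the target table `customers`, staging table `updates`, and their columns are hypothetical names for illustration):

```sql
-- Upsert: update matching rows, insert the rest in one atomic statement
MERGE INTO customers AS t
USING updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET t.email = s.email
WHEN NOT MATCHED THEN
  INSERT (customer_id, email) VALUES (s.customer_id, s.email);
```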
Delta lake best practices
- Use the
Z-ORDER BY
clause on columns that are queried often and have high cardinality
- Compact a Delta table that is written to often using the
OPTIMIZE
command
- Use
VACUUM
to remove old files that are no longer referenced
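The practices above can be sketched in a %sql cell (the table name `events` and column `user_id` are hypothetical):

```sql
-- Compact small files and co-locate data by a high-cardinality column
OPTIMIZE events ZORDER BY (user_id);

-- Remove data files no longer referenced by the table's transaction log
-- (files must be older than the retention threshold, 7 days by default)
VACUUM events;
```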
Lakehouse
- Organizes data stored with Delta Lake in cloud object storage
- Combines benefits of enterprise data warehouse and data lake
- Object hierarchy: Metastore -> Catalog -> Schema (synonym for database in Databricks) -> table -> view -> function
- view = saved query
- function = saved logic that returns scalar or set of rows
- metastore - centralized access control, auditing, lineage, and data discovery capabilities
- Unity Catalog - the new, recommended metastore
- Built-in Hive metastore (old/legacy)
- catalog - highest abstraction in relational model
- database - a collection of tables, views and functions. Optionally has a location
- table - structured data stored as a directory of files on cloud object storage
- managed table - no location specified at write time
df.write.saveAsTable("table_name")
- unmanaged table - location specified at write time
df.write.option("path", "/path/to/empty/directory").saveAsTable("table_name")
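The object hierarchy can also be walked in SQL. A sketch, assuming a Unity Catalog metastore (CREATE CATALOG is not supported by the legacy Hive metastore); the names `demo_catalog`, `demo_schema`, and `events` are hypothetical:

```sql
-- catalog -> schema -> table, addressed by a three-level namespace
CREATE CATALOG IF NOT EXISTS demo_catalog;
CREATE SCHEMA IF NOT EXISTS demo_catalog.demo_schema;
CREATE TABLE IF NOT EXISTS demo_catalog.demo_schema.events (id BIGINT, ts TIMESTAMP);
SELECT * FROM demo_catalog.demo_schema.events;
```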
Hive table
- Apache Hive glossary entry
- Hive is data warehouse software
- designed for large datasets stored in the Apache Hadoop Distributed File System (HDFS)
- queried with the SQL-like language HiveQL, with support for other languages via user-defined functions
- provides centralized metadata store of all the datasets in the organization in a relational database system
- Available data types: int, datetime/interval, string, struct/union, boolean
- Write once, read many times
- Does not perform data validation or schema checking on write (schema-on-read)
DataFrame.createOrReplaceTempView()
- Creates a view into the data that can be queried using SQL via the %sql magic in Databricks
- Used when you want a session-scoped table for use within your notebook
- Temp views are dropped when the session ends unless you persist the data, e.g. as a Hive table
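After calling df.createOrReplaceTempView("recent_events") in Python, the view can be queried from a %sql cell; the SQL-only equivalent is sketched below (the view name `recent_events` and source table `events` are hypothetical):

```sql
-- Session-scoped view: dropped automatically when the session ends
CREATE OR REPLACE TEMP VIEW recent_events AS
SELECT * FROM events
WHERE ts >= current_date() - INTERVAL 7 DAYS;

SELECT count(*) FROM recent_events;
```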
saveAsTable()