Databricks
General
- Meant to be an environment for:
- Data engineering - ingest raw data, normalize and filter it, extract a good representation
- Data science - explore the data, tune models
- Productionization - once you've selected the best model, deploy it to production
- Built-in data visualization using Plotly
- Register data as a Delta table in the metastore
Links
- Introducing MLflow for end-to-end Machine Learning on Databricks
- How to work with files on Databricks
Components
Delta Lake
Description
- Optimized storage layer
- Open-source software (OSS) that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling
- Used for both batch and streaming operations
- All tables on Databricks are Delta tables by default, whether created from Spark DataFrames or SQL
- Just save data to the lakehouse with default settings
- Each Delta table is a directory of Parquet data files plus a _delta_log transaction log
Operations
- create a new managed table:
df.write.saveAsTable(table_name)
- find out location using
%sql DESCRIBE DETAIL table_name
- "UPSERT" using the
MERGE INTO table_name
syntax to update existing rows and insert new ones in a single statement
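A minimal sketch of such an upsert (the target table `customers`, staging table `updates`, and their columns are hypothetical names for illustration):

```sql
-- Upsert: update matching rows, insert the rest in one atomic statement
MERGE INTO customers AS t
USING updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET t.email = s.email
WHEN NOT MATCHED THEN
  INSERT (customer_id, email) VALUES (s.customer_id, s.email);
```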
Delta lake best practices
- Use the
Z-ORDER BY
clause on columns that are queried often and have high cardinality
- Compact a Delta table that is written to often using the
OPTIMIZE
command
- Use
VACUUM
to remove old files that are no longer referenced
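The practices above can be sketched in a %sql cell (the table name `events` and column `user_id` are hypothetical):

```sql
-- Compact small files and co-locate data by a high-cardinality column
OPTIMIZE events ZORDER BY (user_id);

-- Remove data files no longer referenced by the table's transaction log
-- (files must be older than the retention threshold, 7 days by default)
VACUUM events;
```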
Lakehouse
- Organizes data stored with Delta Lake in cloud object storage
- Combines benefits of enterprise data warehouse and data lake
- Object hierarchy: Metastore -> Catalog -> Schema (synonym for database in Databricks) -> table -> view -> function
- view = saved query
- function = saved logic that returns scalar or set of rows
- metastore - centralized access control, auditing, lineage, and data discovery capabilities
- Unity Catalog - the new, recommended metastore
- Built-in Hive metastore (old/legacy)
- catalog - highest abstraction in relational model
- database - a collection of tables, views and functions. Optionally has a location
- table - structured data stored as a directory of files on cloud object storage
- managed table - no location specified at write time
df.write.saveAsTable("table_name")
- unmanaged table - location specified at write time
df.write.option("path", "/path/to/empty/directory").saveAsTable("table_name")
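The object hierarchy can also be walked in SQL. A sketch, assuming a Unity Catalog metastore (CREATE CATALOG is not supported by the legacy Hive metastore); the names `demo_catalog`, `demo_schema`, and `events` are hypothetical:

```sql
-- catalog -> schema -> table, addressed by a three-level namespace
CREATE CATALOG IF NOT EXISTS demo_catalog;
CREATE SCHEMA IF NOT EXISTS demo_catalog.demo_schema;
CREATE TABLE IF NOT EXISTS demo_catalog.demo_schema.events (id BIGINT, ts TIMESTAMP);
SELECT * FROM demo_catalog.demo_schema.events;
```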
Hive table
- Apache Hive glossary entry
- Hive is data warehouse software
- designed for large datasets stored in the Apache Hadoop Distributed File System (HDFS)
- queried with the SQL-like language HiveQL, with support for other languages via user-defined functions
- provides centralized metadata store of all the datasets in the organization in a relational database system
- Available data types: int, datetime/interval, string, struct/union, boolean
- Write once, read many times
- Does not perform data validation or schema checking on write (schema-on-read)
DataFrame.createOrReplaceTempView()
- Creates a view into the data that can be queried using SQL via the %sql magic in Databricks
- Used when you want a session-scoped table for use within your notebook
- Temp views are dropped when the session ends unless you persist the data, e.g. as a Hive table
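After calling df.createOrReplaceTempView("recent_events") in Python, the view can be queried from a %sql cell; the SQL-only equivalent is sketched below (the view name `recent_events` and source table `events` are hypothetical):

```sql
-- Session-scoped view: dropped automatically when the session ends
CREATE OR REPLACE TEMP VIEW recent_events AS
SELECT * FROM events
WHERE ts >= current_date() - INTERVAL 7 DAYS;

SELECT count(*) FROM recent_events;
```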
saveAsTable()