Notebooks are where you do your actual work in Azure Databricks. They are interactive documents that combine live code, visualizations, and narrative text in a single place. Whether you are exploring data, building pipelines, training models, or documenting your analysis, notebooks are the primary interface you will use every day. This guide covers everything you need to know to work effectively with Databricks notebooks, including magic commands and the powerful Databricks Utilities (dbutils) framework.
## What Is a Databricks Notebook?
A Databricks notebook is a web-based interface made up of individual cells. Each cell can contain code, SQL, markdown text, or shell commands. You run cells one at a time or all at once, and the output appears directly below each cell. This makes notebooks ideal for iterative development — you write a bit of code, run it, see the result, and build on it.
Every notebook has a default language that you set when you create it — Python, SQL, Scala, or R. However, you are not locked into that language. Using magic commands, you can switch between languages within the same notebook on a cell-by-cell basis.
Notebooks attach to a cluster to execute code. Without a running cluster, you can edit cells and write markdown, but you cannot run anything. Once attached, the cluster maintains state across cells, meaning variables and DataFrames you create in one cell are available in subsequent cells.
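The shared-state behavior can be sketched as two cells (shown here as one block with comments marking the cell boundaries; the variable names are illustrative):

```python
# Cell 1: define a variable; it lives in the cluster's interpreter state
quarterly_totals = {"Q1": 120_000, "Q2": 135_000, "Q3": 150_000}

# Cell 2 (run any time later in the same session): the variable is still available
q3 = quarterly_totals["Q3"]
print(f"Q3 revenue: {q3}")
```

The same holds for Spark DataFrames: a DataFrame defined in one cell can be transformed or displayed in any later cell, as long as the cluster stays attached and is not restarted.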
## Magic Commands
Magic commands are special directives that you place at the very beginning of a cell to change its behavior. They start with a % symbol and must be the first line in the cell — nothing can come before them, not even a comment.
### %md — Markdown
The %md magic command converts a cell into rendered markdown. This is how you add documentation, headings, explanations, and formatted text to your notebook. It is essential for making notebooks readable and shareable.
```
%md
# Sales Data Analysis
This notebook explores Q3 sales data across all regions.
## Objectives
- Identify top-performing regions
- Analyse month-over-month trends
- Flag anomalies in returns data
```

When you run this cell, it renders as formatted text with headings, bullet points, and any other markdown syntax you use. Get in the habit of starting every notebook with a %md cell that explains what the notebook does, who maintains it, and when it was last updated. This small effort makes a big difference when you or a teammate revisits the notebook weeks later.
### %sql — SQL
The %sql magic command lets you write SQL statements directly, regardless of the notebook’s default language. This is one of the most frequently used magic commands because SQL is often the fastest way to explore and query data.
```
%sql
SELECT region, SUM(revenue) AS total_revenue
FROM sales_data
WHERE quarter = 'Q3'
GROUP BY region
ORDER BY total_revenue DESC
```

Results from %sql cells display as formatted tables with built-in visualization options. You can click the chart icon below the results to instantly create bar charts, line charts, pie charts, and other visualizations without writing any additional code.
### %scala — Scala
If your notebook’s default language is Python but you need to run a Scala snippet, the %scala magic command switches that cell to Scala.
```
%scala
val data = spark.read.format("delta").load("/mnt/data/transactions")
data.printSchema()
```

This is useful when you need to leverage a Scala-specific library or when adapting code examples written in Scala.
### %python — Python
The reverse of %scala. If your notebook defaults to Scala or SQL, you can run Python in a specific cell.
```
%python
import pandas as pd
df = spark.sql("SELECT * FROM sales_data").toPandas()
print(df.describe())
```

### %r — R
For statistical analysis or when working with R-specific packages, you can switch to R within any notebook.
```
%r
library(ggplot2)
data <- as.data.frame(SparkR::collect(SparkR::sql("SELECT * FROM sales_data")))
summary(data)
```

### %sh — Shell Commands
The %sh magic command executes shell commands on the driver node of your cluster. This is useful for checking system information, running OS-level commands, or inspecting the local environment.
```
%sh
ps aux | head -20
```

This shows running processes on the driver node. Other common uses include:

```
%sh
whoami
```

```
%sh
cat /etc/os-release
```

```
%sh
ls /databricks/driver/
```

Keep in mind that %sh commands only run on the driver node, not on worker nodes. They are helpful for debugging and environment inspection but are not the right tool for distributed data operations.
### %fs — File System Commands
The %fs magic command provides shorthand access to the Databricks file system (DBFS). It is essentially a convenient wrapper around dbutils.fs that lets you interact with files and directories without writing full code.
List the root directory:
```
%fs
ls /
```

List contents of the built-in sample datasets:

```
%fs
ls dbfs:/databricks-datasets/
```

Databricks ships with a collection of public datasets at this path that are perfect for learning and experimentation.
Explore a specific dataset:
```
%fs
ls dbfs:/databricks-datasets/COVID/
```

Preview the first few lines of a CSV file:

```
%fs
head dbfs:/databricks-datasets/COVID/coronavirusdataset/PatientInfo.csv
```

The head command is invaluable for quickly checking what a file looks like before you load it into a DataFrame — you can see the column headers, delimiters, and a few rows of data without writing any read logic.
Other useful file system commands:
```
%fs
mkdirs /mnt/output/reports
```

```
%fs
cp dbfs:/source/file.csv dbfs:/destination/file.csv
```

```
%fs
rm dbfs:/temp/old_file.csv
```

### %pip — Install Python Packages
The %pip magic command installs Python packages on the cluster directly from your notebook. This is the recommended way to install libraries in Databricks.
```
%pip install pandas
```

```
%pip install requests beautifulsoup4 lxml
```

After running a %pip command, Databricks automatically restarts the Python interpreter to make the new packages available. Any variables or DataFrames you had in memory will be lost, so it is best practice to put all your %pip install commands at the very top of your notebook before any other code runs.
You can also install specific versions:
```
%pip install pandas==2.1.0
```

### %run — Execute Another Notebook
The %run magic command executes another notebook inline, as if its code were part of the current notebook. This is useful for running setup scripts, loading shared utility functions, or executing configuration notebooks.
```
%run ./setup_notebook
```

```
%run /Workspace/Shared/utilities/common_functions
```

All variables, functions, and imports defined in the executed notebook become available in the calling notebook. The path can be relative (using ./) or absolute. This is different from dbutils.notebook.run(), which we will cover shortly — %run shares the execution context, while dbutils.notebook.run() executes the notebook in an isolated context.
## Databricks Utilities (dbutils)
Databricks Utilities, accessed through the dbutils object, are a collection of helper functions that make it easier to work with files, notebooks, widgets, secrets, and more. They are available by default in Python, Scala, and R. While dbutils cannot be called directly from SQL cells, you can access many of the same capabilities through SQL-specific syntax (such as widget references) or by using a Python or Scala cell alongside your SQL workflow.
To see everything dbutils offers, run:
```python
dbutils.help()
```

This prints a summary of all available utility modules.
### File System Utilities (dbutils.fs)
File system utilities let you interact with DBFS and mounted storage programmatically. They are the code-based equivalent of the %fs magic command but with the full power of your programming language for loops, conditionals, and string manipulation.
View available file system commands:
```python
dbutils.fs.help()
```

List files and folders:
```python
dbutils.fs.ls('/')
```

This returns a list of FileInfo objects. To display them more readably:
```python
for folder in dbutils.fs.ls('/'):
    print(folder)
```

Each FileInfo object contains the path, name, size, and modification time.
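Because these are ordinary Python objects, you can filter and aggregate a listing with regular code. A small sketch (the helper name and the local stand-in objects are illustrative; on a cluster you would pass the result of dbutils.fs.ls() directly):

```python
from collections import namedtuple

def total_csv_bytes(entries):
    """Sum the sizes of all .csv entries in a directory listing.

    Works on any objects with .name and .size attributes, such as the
    FileInfo objects returned by dbutils.fs.ls().
    """
    return sum(e.size for e in entries if e.name.endswith(".csv"))

# Local stand-in for dbutils.fs.ls() output, for illustration only
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])
listing = [
    FileInfo("dbfs:/data/a.csv", "a.csv", 1024),
    FileInfo("dbfs:/data/b.csv", "b.csv", 2048),
    FileInfo("dbfs:/data/notes.txt", "notes.txt", 10),
]
print(total_csv_bytes(listing))  # 3072
```

This is the kind of task where dbutils.fs beats %fs: the magic command can only list, while code can loop, filter, and make decisions.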
Working with mounts:
Mounts let you map external storage locations — like Azure Data Lake Storage or Azure Blob Storage — to a path in DBFS so they feel like local directories. This simplifies access across notebooks and jobs.
To see help specific to mounting:
```python
dbutils.fs.help("mount")
```

Common mount operations include:
```python
# Mount an Azure Blob Storage container (wasbs:// is the Blob Storage endpoint)
dbutils.fs.mount(
    source = "wasbs://container@storageaccount.blob.core.windows.net",
    mount_point = "/mnt/data",
    extra_configs = {"fs.azure.account.key.storageaccount.blob.core.windows.net": "your-access-key"}
)
```

```python
# List all current mounts
dbutils.fs.mounts()
```

```python
# Unmount when no longer needed
dbutils.fs.unmount("/mnt/data")
```

Other essential file system operations:
```python
# Create a directory
dbutils.fs.mkdirs("/mnt/output/reports")

# Copy a file
dbutils.fs.cp("/mnt/data/input.csv", "/mnt/data/backup/input.csv")

# Move a file
dbutils.fs.mv("/mnt/data/temp.csv", "/mnt/data/archive/temp.csv")

# Delete a file
dbutils.fs.rm("/mnt/data/old_file.csv")

# Delete a directory and all its contents
dbutils.fs.rm("/mnt/data/temp/", recurse=True)

# Read the first portion of a file
dbutils.fs.head("/mnt/data/sample.csv")
```

Using file system utilities in Scala:
```scala
dbutils.fs.ls("/")
dbutils.fs.head("/mnt/data/sample.csv")
```

The syntax is nearly identical. The main difference is that Scala returns typed collections rather than Python lists.
### Notebook Workflow Utilities (dbutils.notebook)
Notebook workflow utilities let you build multi-notebook workflows where one “parent” notebook orchestrates the execution of “child” notebooks. This is the foundation for modular, maintainable pipelines.
View available notebook commands:
```python
dbutils.notebook.help()
```

Run a child notebook:
```python
result = dbutils.notebook.run("./child_notebook", 60, {"input": "called from main notebook"})
print(result)
```

This command takes three arguments: the path to the child notebook, a timeout in seconds (60 in this example means the call will fail if the child notebook takes longer than 60 seconds), and an optional dictionary of parameters to pass to the child notebook.
The child notebook runs in its own isolated context — it does not share variables with the parent. Communication between parent and child happens through the input parameters and the exit value.
Exit a notebook with a return value:
Inside the child notebook, you use exit to return a value to the calling parent:
```python
dbutils.notebook.exit("Processing complete: 1500 rows loaded")
```

The value passed to exit is always a string. If you need to return structured data, serialize it as JSON:
```python
import json

result = {"status": "success", "rows_processed": 1500}
dbutils.notebook.exit(json.dumps(result))
```

The parent notebook can then parse this returned string to make decisions about what to do next. This pattern enables sophisticated orchestration — a parent notebook can run multiple child notebooks in sequence, pass results from one to the next, and handle errors at each stage.
### Widget Utilities (dbutils.widgets)
Widgets add interactive input controls to the top of your notebook. They are useful for parameterizing notebooks so the same notebook can run with different inputs — different dates, regions, file paths, or any other variable — without editing the code.
View available widget commands:
```python
dbutils.widgets.help()
```

Create a text input widget:
```python
dbutils.widgets.text("input", "", "Send the parameter value")
```

This creates a text box at the top of the notebook with the label “Send the parameter value” and an empty default value. The first argument is the widget name, the second is the default value, and the third is the display label.
Retrieve the widget value:
```python
input_parameter = dbutils.widgets.get("input")
print(f"Received parameter: {input_parameter}")
```

Other widget types:
```python
# Dropdown widget with predefined options
dbutils.widgets.dropdown("environment", "dev", ["dev", "staging", "production"], "Select environment")

# Combobox (dropdown with free text entry)
dbutils.widgets.combobox("region", "UK", ["UK", "US", "EU", "APAC"], "Select or type region")

# Multiselect widget
dbutils.widgets.multiselect("metrics", "revenue", ["revenue", "orders", "returns", "customers"], "Choose metrics")
```

Using widget values in SQL:
While you cannot call dbutils directly from SQL, you can reference widget values using the $ syntax:
```sql
%sql
SELECT * FROM sales_data WHERE region = '$region'
```

Remove widgets when no longer needed:
```python
# Remove a specific widget
dbutils.widgets.remove("input")

# Remove all widgets
dbutils.widgets.removeAll()
```

Widgets are especially powerful when combined with job scheduling. When you schedule a notebook as a job, you can pass parameter values that populate the widgets automatically, making a single notebook serve multiple use cases.
### Secrets Utilities (dbutils.secrets)
Secrets utilities let you access sensitive values — API keys, database passwords, storage account keys — stored securely in Azure Key Vault or the Databricks-backed secret scope, without exposing them in your notebook code.
View available secret commands:
```python
dbutils.secrets.help()
```

List available secret scopes:
```python
dbutils.secrets.listScopes()
```

List secrets within a scope:
```python
dbutils.secrets.list("my-scope")
```

This lists the secret keys but not their values — Databricks never displays secret values in notebook output for security.
Retrieve a secret value:
```python
storage_key = dbutils.secrets.get(scope="my-scope", key="storage-account-key")
```

The returned value is redacted in cell output. Even if you try to print it, Databricks will display [REDACTED] instead of the actual value. This prevents accidental exposure of credentials in shared notebooks or logs.
A common use case is mounting storage using secrets instead of hardcoded keys:
```python
dbutils.fs.mount(
    source = "wasbs://container@storageaccount.blob.core.windows.net",
    mount_point = "/mnt/secure-data",
    extra_configs = {
        "fs.azure.account.key.storageaccount.blob.core.windows.net": dbutils.secrets.get("my-scope", "storage-key")
    }
)
```

### Library Utilities (dbutils.library)
Library utilities help manage the Python environment on your cluster. The most relevant command restarts the Python interpreter, which is useful after installing packages.
```python
dbutils.library.restartPython()
```
This clears all Python state (variables, imports, cached data) and gives you a fresh interpreter. It is useful when you have installed packages with `%pip` partway through a notebook and need a clean state, or when you want to clear memory after working with large datasets.
## Putting It All Together
A well-structured Databricks notebook typically follows a pattern. It starts with a markdown cell documenting the purpose, author, and last modified date. Next come any `%pip install` commands for additional packages. Then a setup cell creates widgets for parameters. The main body contains the data loading, transformation, and analysis logic, mixing code cells with markdown explanations. Finally, cleanup cells remove temporary data or widgets.
Here is a simplified example of how these pieces work together in practice:
```
%md
# Daily Sales Report
Generates aggregated sales metrics by region.
Maintained by the data engineering team.
```

```
%pip install great-expectations
```

```python
dbutils.widgets.dropdown("environment", "dev", ["dev", "staging", "production"], "Environment")
dbutils.widgets.text("report_date", "2025-01-01", "Report date (YYYY-MM-DD)")
```

```python
env = dbutils.widgets.get("environment")
report_date = dbutils.widgets.get("report_date")
storage_key = dbutils.secrets.get("my-scope", f"{env}-storage-key")
```

```python
for f in dbutils.fs.ls(f"/mnt/{env}/sales/"):
    print(f.name, f.size)
```

```sql
%sql
SELECT region, SUM(revenue) FROM sales_data
WHERE sale_date = '$report_date'
GROUP BY region
```

```python
result = dbutils.notebook.run("./generate_report", 120, {
    "date": report_date,
    "env": env
})
print(f"Report generation returned: {result}")
```

## Wrapping Up
Notebooks are the central workspace in Databricks, and mastering magic commands and dbutils will make you dramatically more productive. Magic commands let you switch between languages and interact with the file system without leaving the notebook. Databricks Utilities give you programmatic control over files, notebook orchestration, parameterization, secrets management, and library handling. Together they transform a simple code editor into a powerful development environment that scales from quick exploration to production pipeline orchestration.