Categories
Snowflake

20 Snowflake Questions Every Data Engineer Should Know

Snowflake has become one of the dominant cloud data platforms, and if you’re working in data engineering, or moving into it, as I am, there’s a good chance it’ll show up in a job description, an interview, or a project handed to you on day one. I’ve been deepening my own Snowflake knowledge alongside my Databricks learning journey, and one of the most useful things I’ve done is work through the questions that come up again and again: in interviews, in documentation, and in real conversations with other engineers.

This post is a collection of 20 of those questions. They cover the fundamentals you really need to understand: architecture, performance, data loading, security, cost, and the features that make Snowflake distinctive. Some are the kind of questions you’d expect in a technical interview. Others are the ones you’ll quietly ask yourself the first time you’re staring at a slow query or a credit bill that looks higher than expected.

If you’re preparing for an interview, use this as a self-assessment: read each question, try to answer it out loud before looking at the answer, and notice where you hesitate. If you’re learning Snowflake from scratch, treat it as a structured tour of the concepts that matter most. And if you’ve been working with Snowflake for years, hopefully there’s still something here worth a second look.

Architecture and fundamentals

1. What is Snowflake and how is it different from a traditional data warehouse?

Snowflake is a cloud-native data warehouse, built from the ground up for the cloud rather than ported from an on-prem product. The defining difference is the separation of storage and compute into independent layers, which lets you scale each on its own and run multiple workloads against the same data without contention. Traditional warehouses (Teradata, on-prem SQL Server, Oracle Exadata) couple storage and compute, so scaling one means scaling both.

2. Explain Snowflake’s three-layer architecture.

Three layers: (1) Storage layer — data sits in cloud object storage (S3, Azure Blob, GCS) in Snowflake’s proprietary compressed columnar micro-partition format. (2) Compute layer — independent virtual warehouses (clusters of compute nodes) that execute queries. Multiple warehouses can read the same data simultaneously without interference. (3) Cloud services layer — the brain: handles authentication, metadata, query parsing and optimization, transaction management, and security. It’s also what makes features like zero-copy cloning and Time Travel possible.

3. What is a virtual warehouse?

A virtual warehouse is a cluster of compute resources that executes queries and DML operations. Sized from X-Small to 6X-Large, each size step roughly doubles compute and cost. Warehouses can be started, stopped, suspended, and resized independently, and they bill per second (with a 60-second minimum) only when running. You typically have multiple warehouses for different workloads — e.g., one for ETL, one for BI, one for ad-hoc analytics — so they don’t fight each other.
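As a sketch, the multi-warehouse pattern might look like this (warehouse names, sizes, and settings are illustrative, not prescriptive):

```sql
-- One warehouse per workload so ETL and BI don't contend for compute.
CREATE WAREHOUSE etl_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND   = 60        -- suspend after 60 seconds idle
  AUTO_RESUME    = TRUE
  INITIALLY_SUSPENDED = TRUE;

CREATE WAREHOUSE bi_wh
  WAREHOUSE_SIZE = 'SMALL'
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;

-- Point a session at a warehouse; resize independently at any time.
USE WAREHOUSE bi_wh;
ALTER WAREHOUSE etl_wh SET WAREHOUSE_SIZE = 'LARGE';
```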

4. What’s the difference between a virtual warehouse and a database in Snowflake?

A database is a logical container for schemas and tables (the data). A virtual warehouse is the compute that processes queries against that data. They’re completely decoupled — any warehouse can query any database it has permissions for. This is the core of “separation of storage and compute.”

5. What are micro-partitions and how does Snowflake use them?

Snowflake automatically divides table data into contiguous units of storage called micro-partitions, typically 50–500 MB of uncompressed data each. They’re columnar, compressed, and immutable. Snowflake stores metadata about each micro-partition (min/max values, distinct counts, etc.) which enables partition pruning — at query time, Snowflake skips micro-partitions that can’t possibly contain matching rows. This is automatic and is the main reason Snowflake is fast without manual indexing.

Performance and optimization

6. What is clustering in Snowflake and when should you use it?

Snowflake automatically organizes data into micro-partitions based on insertion order, which works well for most tables. For very large tables (typically multi-terabyte) where queries filter on specific columns, you can define a clustering key to tell Snowflake to physically organize data by those columns, improving partition pruning. Snowflake's Automatic Clustering service then maintains that ordering in the background as data changes, billing serverless credits for the work. Don't cluster small tables — it costs credits and provides no benefit.
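A minimal sketch, assuming a large hypothetical events table that queries usually filter by date and customer:

```sql
-- Define a clustering key on the columns queries filter on most.
ALTER TABLE events CLUSTER BY (event_date, customer_id);

-- Inspect how well-clustered the table currently is for those columns.
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_date, customer_id)');
```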

7. How do you optimize a slow query in Snowflake?

Start with the Query Profile in the UI to see where time is spent. Common fixes:

  • Ensure partition pruning is working (compare “partitions scanned” to “partitions total”).
  • Add or revise clustering keys on huge tables.
  • Right-size the warehouse — bigger isn’t always better; it’s faster but more expensive.
  • Avoid SELECT * and push filters down.
  • Materialize expensive subqueries and use the result cache where possible.
  • Watch for spilling to local or remote storage — a sign the warehouse is undersized for the workload.
  • Check for skewed joins.

8. How does caching work in Snowflake?

Three levels of caching: (1) Result cache — query results are cached for 24 hours at the cloud services layer; an identical query returns instantly without using a warehouse at all, provided the underlying data hasn’t changed. (2) Local disk cache — each warehouse caches the data it has read from storage on its local SSDs, so a warm warehouse answers repeat queries faster. (3) Metadata cache — table statistics and metadata cached in the services layer for query planning. The local disk cache is lost when a warehouse is suspended.
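One practical consequence: the result cache can make a query look faster than it really is when you’re benchmarking. You can switch it off for a session:

```sql
-- Disable the result cache so repeated runs measure real compute time.
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
```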

Data loading and integration

9. What’s the difference between bulk loading and continuous loading?

Bulk loading uses the COPY INTO command to load files from a stage (internal or external like S3/ADLS) into a table — typically run in batches via scheduled jobs. Continuous loading uses Snowpipe, which auto-ingests files as they arrive in cloud storage via event notifications. Snowpipe is for near-real-time ingestion (latency in seconds to a minute). Bulk is cheaper for large scheduled loads; Snowpipe is for streaming-ish use cases.
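Side by side, the two paths share the same COPY statement (stage and table names here are hypothetical):

```sql
-- Bulk load: run on a schedule, billed against your warehouse.
COPY INTO raw_orders
  FROM @my_s3_stage/orders/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Continuous load: a pipe wraps the same COPY and auto-ingests
-- new files as cloud storage event notifications arrive.
CREATE PIPE orders_pipe AUTO_INGEST = TRUE AS
  COPY INTO raw_orders
  FROM @my_s3_stage/orders/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```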

10. What is a Snowflake stage?

A stage is a pointer to a location where data files are stored, used for loading and unloading. Internal stages are managed by Snowflake (named, table, or user stages). External stages point to your own S3 bucket, Azure container, or GCS bucket. You typically COPY INTO from a stage, or use Snowpipe to auto-ingest from one.
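For example, an external stage over a hypothetical S3 bucket (the storage integration is assumed to exist already):

```sql
-- External stage: Snowflake reads files from your bucket via the integration.
CREATE STAGE my_s3_stage
  URL = 's3://my-bucket/landing/'
  STORAGE_INTEGRATION = my_s3_int
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- The @ prefix references a stage; LIST shows the files it can see.
LIST @my_s3_stage;
```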

11. What file formats does Snowflake support?

Structured: CSV, TSV. Semi-structured: JSON, Avro, Parquet, ORC, XML. Snowflake stores semi-structured data in the VARIANT data type, which is queryable directly with colon/dot path notation (e.g., raw_col:customer.id), and Snowflake automatically extracts and optimizes the underlying structure. Parquet is the most efficient for large bulk loads.

12. How would you handle CDC (change data capture) in Snowflake?

Use Streams. A Stream is a Snowflake object that records DML changes (inserts, updates, deletes) made to a source table since the Stream was last consumed. You query it like a table to get the changed rows, then process them — typically via a Task that runs SQL on a schedule or in response to a Stream having data. Streams + Tasks together give you native in-warehouse CDC and orchestration without external tools.
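A simplified sketch of the pattern (object names are illustrative; note that in a standard stream an update appears as a DELETE/INSERT pair, so real pipelines often also check METADATA$ISUPDATE):

```sql
-- Record DML changes on the source table.
CREATE STREAM orders_stream ON TABLE raw_orders;

-- A task that wakes every 5 minutes but only runs when the stream has rows.
CREATE TASK merge_orders
  WAREHOUSE = etl_wh
  SCHEDULE  = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  MERGE INTO orders_curated t
  USING orders_stream s ON t.order_id = s.order_id
  WHEN MATCHED AND s.METADATA$ACTION = 'INSERT'
    THEN UPDATE SET t.status = s.status
  WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT'
    THEN INSERT (order_id, status) VALUES (s.order_id, s.status);

ALTER TASK merge_orders RESUME;  -- tasks are created suspended
```

Consuming the stream in a DML statement advances its offset, so the next run sees only new changes.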

Features and capabilities

13. What is Time Travel and what is it used for?

Time Travel lets you query, clone, or restore data as it existed at a previous point in time, up to 90 days back (Enterprise edition; 1 day on Standard). Use cases: recovering from accidental deletes or updates (UNDROP TABLE, SELECT ... AT(TIMESTAMP => ...)), auditing changes, comparing historical states, debugging pipelines. After the Time Travel window expires, data moves into Fail-safe (7 days, recoverable only by Snowflake support).
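The syntax is compact — a few illustrative forms (the query ID placeholder is whatever statement you want to rewind past):

```sql
-- The table as it was an hour ago.
SELECT * FROM orders AT(OFFSET => -3600);

-- The table as it was just before a specific (bad) statement ran.
SELECT * FROM orders BEFORE(STATEMENT => '<query_id>');

-- Restore a dropped table, within the retention window.
UNDROP TABLE orders;
```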

14. What is zero-copy cloning?

You can clone a table, schema, or entire database instantly with CREATE TABLE foo CLONE bar — and the clone takes up no additional storage initially. Snowflake just creates new metadata pointing to the same underlying micro-partitions. Storage is only consumed when the clone diverges from the original. This is incredibly useful for creating dev/test environments from production, or for snapshotting before a risky change.
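Cloning also composes with Time Travel, which is handy for the “snapshot before a risky change” case (names are illustrative):

```sql
-- Instant dev environment from production metadata.
CREATE DATABASE analytics_dev CLONE analytics;

-- Snapshot a table as it existed an hour ago.
CREATE TABLE orders_backup CLONE orders AT(OFFSET => -3600);
```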

15. What is Snowflake’s data sharing feature?

Secure Data Sharing lets you share live data with other Snowflake accounts without copying or moving it. The consumer queries your data directly with their own compute and credits — they see live data, you keep ownership. The Snowflake Marketplace is built on this. It’s a defining Snowflake feature that’s hard to replicate elsewhere.

Security and governance

16. How does Snowflake handle security and access control?

Snowflake uses role-based access control (RBAC). Privileges are granted to roles, not users; users are granted roles. Roles can inherit from other roles to form hierarchies. There are system-defined roles (ACCOUNTADMIN, SECURITYADMIN, SYSADMIN, USERADMIN, PUBLIC) and you create custom roles for your organization. Best practice: use the principle of least privilege, never use ACCOUNTADMIN for daily work, and create functional roles aligned to job functions rather than individuals.
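A sketch of the functional-role pattern (role, object, and user names are illustrative):

```sql
-- Privileges go to the role; users get the role.
CREATE ROLE analyst;
GRANT USAGE  ON DATABASE analytics               TO ROLE analyst;
GRANT USAGE  ON SCHEMA   analytics.reporting     TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.reporting TO ROLE analyst;
GRANT USAGE  ON WAREHOUSE bi_wh                  TO ROLE analyst;
GRANT ROLE analyst TO USER jane_doe;
```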

17. What governance features does Snowflake provide?

Column-level security via masking policies (dynamic data masking based on the role querying), row-level security via row access policies, object tagging for classification, Access History and Object Dependencies views for lineage and audit, always-on encryption at rest and in transit, network policies for IP allowlisting, and Snowflake Horizon as the umbrella governance product. For regulated industries like healthcare (relevant to Cygnet), masking PII columns based on role is a common pattern.
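That masking pattern looks roughly like this (policy, role, and column names are illustrative):

```sql
-- Unmask the column only for a privileged role; everyone else sees a redaction.
CREATE MASKING POLICY mask_ssn AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val
    ELSE '***-**-****'
  END;

ALTER TABLE patients MODIFY COLUMN ssn SET MASKING POLICY mask_ssn;
```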

Cost and operations

18. How does Snowflake pricing work?

Two main components: (1) Compute — billed in credits based on warehouse size and runtime, per second with a 60-second minimum after each start. Credits convert to dollars at a rate that depends on your edition (Standard, Enterprise, Business Critical) and cloud region. (2) Storage — flat per-TB-per-month rate for compressed data. There are also smaller charges for serverless features (Snowpipe, Automatic Clustering, Materialized Views) and data transfer across regions or clouds.

19. How do you manage and reduce Snowflake costs?

Right-size warehouses (don’t default to Large), set aggressive auto-suspend times (60 seconds is fine for most warehouses), enable auto-resume so aggressive suspension doesn’t disrupt users, separate workloads onto dedicated warehouses so one team’s runaway query doesn’t inflate a shared bill, monitor with WAREHOUSE_METERING_HISTORY and QUERY_HISTORY, set resource monitors with credit quotas and alerts, lean on result caching, and avoid unnecessary clustering. The biggest single cost killer is warehouses left running idle — auto-suspend fixes most of it.
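A resource monitor, for instance, can cap spend outright (quota and thresholds here are illustrative):

```sql
-- Notify at 75% of the monthly quota, suspend warehouses at 100%.
CREATE RESOURCE MONITOR monthly_cap
  WITH CREDIT_QUOTA = 500
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS
    ON 75  PERCENT DO NOTIFY
    ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE etl_wh SET RESOURCE_MONITOR = monthly_cap;
```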

20. How do you implement CI/CD for Snowflake?

Treat Snowflake objects as code. Common patterns: store DDL and transformation logic in Git, use dbt for transformations (which gives you version control, testing, documentation, and environments for free), use schemachange or Flyway for managing object migrations, leverage zero-copy cloning to create fast dev/test environments from prod, use separate databases or accounts for dev/test/prod, and run automated tests in a CI pipeline before merging. The dbt + Git + GitHub Actions combo is the most common modern setup.

A few extra ones that come up frequently and are worth knowing:

  • Streams vs Tasks — Streams track changes; Tasks schedule SQL. They’re often used together for in-warehouse pipelines.
  • Materialized views vs regular views — Materialized views store precomputed results and auto-refresh; they cost credits and storage but speed up repeat queries on slow-changing data.
  • Snowpark — Snowflake’s framework for writing data transformations in Python, Java, or Scala that execute inside Snowflake’s compute, similar in spirit to PySpark on Databricks.
  • External tables — query data in cloud storage without loading it into Snowflake; useful for data lake patterns.
  • Iceberg tables — Snowflake now supports Apache Iceberg as an open table format, letting you keep data in your own storage in an open format while still querying it with Snowflake. This is Snowflake’s response to the lakehouse pressure from Databricks.