Snowflake Architecture

Snowflake
November 05, 2024

Snowflake follows a hybrid architecture that combines traditional shared-disk and shared-nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the platform. Similar to shared-nothing architectures, however, Snowflake processes queries using MPP (massively parallel processing) compute clusters in which each node stores a portion of the entire data set locally. This approach offers the data management simplicity of a shared-disk architecture together with the performance and scale-out benefits of a shared-nothing architecture. The layers are physically separated but logically integrated, so each layer can scale up and down independently, which makes Snowflake more elastic and responsive. This architecture lets data engineers, data analysts, and data scientists maximize productivity without the performance, scale, or concurrency limitations found in architectures that couple storage and compute.

Snowflake has a multi-cluster, shared-data architecture that consists of three key layers, namely:

1. Database Storage layer:
The Storage layer in the Snowflake architecture is responsible for storing and managing data efficiently. Snowflake keeps its data in cloud-based object storage and can load data from files held in Amazon S3, Microsoft Azure, and Google Cloud. Typically, a user uploads a file (for example, a .csv or .txt file; spreadsheet formats such as .xlsx usually need converting to a supported format first) to cloud storage and then creates a stage in Snowflake, which acts as the connection to that location, to bring the data in. Because storage is separate from compute, organizations can scale storage independently of compute resources, which keeps storage cost-effective and lets Snowflake handle varying data volumes without affecting query performance. The data objects stored by Snowflake are neither directly visible nor accessible to customers; they are accessible only through SQL query operations run in Snowflake. Snowflake takes responsibility for all aspects of data management: how the data is organized and structured, how it is split into many compressed micro-partitions, how it is clustered automatically, and how metadata and statistics are maintained. Snowflake’s Zero-Copy Cloning feature allows users to create a copy of a dataset instantly without duplicating the underlying data, saving both time and storage costs.
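To make micro-partitions and zero-copy cloning more concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the classes `MicroPartition`, `Storage`, and `Table` are hypothetical names invented for illustration and say nothing about how Snowflake is implemented internally. The idea it shows is that a table can be treated as a list of references to immutable partitions, so a clone only copies the references, and that per-partition min/max metadata lets a lookup skip partitions that cannot contain the value.

```python
from dataclasses import dataclass
from itertools import count

# Toy model only: all names here are hypothetical, not Snowflake internals.


@dataclass(frozen=True)
class MicroPartition:
    """An immutable chunk of rows plus min/max metadata for the id column."""
    pid: int
    rows: tuple      # tuple of (order_id, amount) rows
    min_id: int
    max_id: int


class Storage:
    """Stand-in for the shared object store; each partition is stored once."""

    def __init__(self):
        self.partitions = {}
        self._ids = count(1)

    def write(self, rows):
        ids = [row[0] for row in rows]
        part = MicroPartition(next(self._ids), tuple(rows), min(ids), max(ids))
        self.partitions[part.pid] = part
        return part.pid


class Table:
    """A table is only metadata: an ordered list of partition ids."""

    def __init__(self, storage, partition_ids=None):
        self.storage = storage
        self.partition_ids = list(partition_ids or [])

    def insert(self, rows):
        # New data lands in a new partition; existing ones are never modified.
        self.partition_ids.append(self.storage.write(rows))

    def clone(self):
        # Zero-copy: copy the list of ids, not the partitions themselves.
        return Table(self.storage, self.partition_ids)

    def lookup(self, order_id):
        """Return matching rows and the number of partitions actually scanned."""
        scanned, hits = 0, []
        for pid in self.partition_ids:
            part = self.storage.partitions[pid]
            if part.min_id <= order_id <= part.max_id:  # prune using metadata
                scanned += 1
                hits.extend(row for row in part.rows if row[0] == order_id)
        return hits, scanned


storage = Storage()
orders = Table(storage)
orders.insert([(1, 10.0), (2, 25.5)])
orders.insert([(3, 7.25), (4, 99.0)])

dev_orders = orders.clone()               # no new partitions are written
partitions_after_clone = len(storage.partitions)

dev_orders.insert([(5, 1.0)])             # only the clone sees this partition
hits, scanned = dev_orders.lookup(4)
print(partitions_after_clone, len(storage.partitions), hits, scanned)
```

In this sketch the clone costs no extra storage until it diverges from the original, and the lookup reads one partition out of three because the other two are ruled out by their min/max ranges.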

2. Query Processing layer:
The query processing layer is separated from the storage layer in the Snowflake architecture. Snowflake processes queries using virtual warehouses. Each virtual warehouse is an MPP (massively parallel processing) compute cluster composed of multiple compute nodes allocated by Snowflake from a cloud provider. Each virtual warehouse is an independent compute cluster that does not share compute resources with other virtual warehouses, so the load on one warehouse does not affect the performance of the others. Virtual warehouses can be suspended and resumed automatically, are easy to resize, and can scale out automatically by adding clusters. Snowflake optimizes SQL queries automatically, choosing execution plans based on the underlying data distribution and query complexity to ensure efficient processing. Snowflake uses its multi-cluster architecture to achieve high concurrency and faster results for complex analytical workloads. Because compute resources are allocated on demand, workloads get capacity when they need it and release it when they are idle, which balances performance against cost.
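The auto-suspend, auto-resume, and auto-scaling behaviour described above is configured per warehouse. As a rough illustration, the Python helper below builds a `CREATE WAREHOUSE` statement as a string; nothing is sent to Snowflake. The function name `create_warehouse_sql`, the warehouse name `reporting_wh`, and the chosen values are assumptions made up for this example, and the size list is only a subset. The multi-cluster options typically require Enterprise Edition or higher, so treat them as optional.

```python
# Subset of warehouse sizes, listed here only to validate the example input.
KNOWN_SIZES = {"XSMALL", "SMALL", "MEDIUM", "LARGE", "XLARGE"}


def create_warehouse_sql(name, size="XSMALL", auto_suspend_s=60,
                         min_clusters=1, max_clusters=1):
    """Build a CREATE WAREHOUSE statement as text (hypothetical helper)."""
    if not name.isidentifier():
        raise ValueError(f"unsafe warehouse name: {name!r}")
    if size not in KNOWN_SIZES:
        raise ValueError(f"unknown size: {size!r}")
    if not 1 <= min_clusters <= max_clusters:
        raise ValueError("need 1 <= min_clusters <= max_clusters")

    options = [
        f"WAREHOUSE_SIZE = '{size}'",
        f"AUTO_SUSPEND = {auto_suspend_s}",  # idle seconds before suspending
        "AUTO_RESUME = TRUE",                # resume when a query arrives
        "INITIALLY_SUSPENDED = TRUE",        # do not start running on creation
    ]
    if max_clusters > 1:
        # Multi-cluster scale-out: clusters are added between min and max.
        options.append(f"MIN_CLUSTER_COUNT = {min_clusters}")
        options.append(f"MAX_CLUSTER_COUNT = {max_clusters}")

    body = "\n  ".join(options)
    return f"CREATE WAREHOUSE IF NOT EXISTS {name}\n  {body};"


reporting_sql = create_warehouse_sql(
    "reporting_wh", size="MEDIUM", auto_suspend_s=120, max_clusters=3
)
print(reporting_sql)
```

Separate warehouses for separate workloads (for example, one for loading and one for reporting) is a common way to use the isolation between warehouses, since each statement like this one defines its own compute cluster.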

3. Cloud Services layer:
The Cloud Services layer serves as the control plane, managing metadata, security, and user access. It is a collection of stateless compute resources that run across multiple availability zones and rely on a highly available metadata store. It provides a centralized platform for administration, authentication, and the coordination of activity across the data warehouse. Snowflake prioritizes security with end-to-end encryption, role-based access control, and features such as data masking to protect sensitive data within the platform. The Cloud Services layer also centrally manages storage and coordinates how the compute environments work with it.
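The role-based access control and data masking mentioned above can be pictured with a small toy model. The sketch below is an assumption-laden simplification: the class `AccessControl`, the function `mask`, and the role names are hypothetical, and Snowflake itself expresses these rules through SQL grants and masking policies rather than Python objects. The model also unions all of a user's roles, which is simpler than how an active role works in a real session. What it does show is the core idea: privileges are granted to roles, roles can be granted to other roles, and users receive access only through the roles they hold.

```python
class AccessControl:
    """Toy role-based access control (hypothetical, for illustration only)."""

    def __init__(self):
        self.privileges = {}  # role -> set of (privilege, object) pairs
        self.inherits = {}    # role -> roles whose privileges it inherits
        self.user_roles = {}  # user -> roles granted directly to the user

    def grant_privilege(self, privilege, obj, role):
        self.privileges.setdefault(role, set()).add((privilege, obj))

    def grant_role_to_role(self, role, to_role):
        # to_role inherits everything that role is allowed to do.
        self.inherits.setdefault(to_role, set()).add(role)

    def grant_role_to_user(self, role, user):
        self.user_roles.setdefault(user, set()).add(role)

    def effective_roles(self, user):
        """Walk the role hierarchy starting from the user's direct roles."""
        seen, stack = set(), list(self.user_roles.get(user, ()))
        while stack:
            role = stack.pop()
            if role not in seen:
                seen.add(role)
                stack.extend(self.inherits.get(role, ()))
        return seen

    def is_allowed(self, user, privilege, obj):
        return any(
            (privilege, obj) in self.privileges.get(role, ())
            for role in self.effective_roles(user)
        )


def mask(value, roles, unmasked_role="pii_reader"):
    """Return the value only to holders of unmasked_role; otherwise hide it."""
    return value if unmasked_role in roles else "*" * len(value)


acl = AccessControl()
acl.grant_privilege("SELECT", "sales.orders", "analyst")
acl.grant_privilege("INSERT", "sales.orders", "data_admin")
acl.grant_role_to_role("analyst", "data_admin")  # data_admin inherits analyst
acl.grant_role_to_user("analyst", "ana")
acl.grant_role_to_user("data_admin", "bo")

print(acl.is_allowed("ana", "SELECT", "sales.orders"))   # granted via analyst
print(acl.is_allowed("ana", "INSERT", "sales.orders"))   # not granted
print(acl.is_allowed("bo", "SELECT", "sales.orders"))    # inherited privilege
print(mask("ana@example.com", acl.effective_roles("ana")))
```

Keeping privileges on roles rather than on individual users means access can be changed in one place, which fits the centralized administration role of this layer.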