Getting started with Polaris Catalog™

Overview

Polaris Catalog™ is an open catalog for Apache Iceberg™. It is available as a SaaS service managed on Snowflake (signing up as a Snowflake customer is not required), and as open source code that you can build and deploy yourself. Polaris Catalog provides an implementation of the Apache Iceberg™ REST catalog API with cross-engine security via role-based access control (RBAC).

In this tutorial, you will learn how to get started with Polaris Catalog managed on Snowflake.

What you’ll learn

  • How to create a new Polaris Catalog account.

  • How to create a new Iceberg catalog in the Polaris Catalog account and secure it using RBAC.

  • How to use Apache Spark™ to create tables in the catalog and run queries.

  • How to use Snowflake to run queries on tables in the catalog.

  • How to mirror or publish managed Iceberg tables in Snowflake to Polaris Catalog.

What you’ll need

  • ORGADMIN privileges in your Snowflake organization (to create a new Polaris Catalog account).

  • ACCOUNTADMIN privileges in your Snowflake account (to connect to the Polaris Catalog account). This Snowflake account does not have to be the same as the Snowflake organization account.

What you’ll do

This tutorial covers two user scenarios:

  • Create a catalog in Polaris Catalog, create a table using Apache Spark™, and query the table using both Apache Spark™ and the Snowflake engine.

    Image 1: Diagram

  • Create an Apache Iceberg™ table in your Snowflake account using the Snowflake engine, and publish it to Polaris Catalog so that Apache Spark™ can run queries on it.

    Image 2: Diagram

Set up environment

Install Conda, Spark, Jupyter on your laptop

In this tutorial, you use Conda to easily create a development environment and download the necessary packages. This setup is only needed for the parts of the tutorial that use Apache Spark™; it is not required to create or use Iceberg tables with Snowflake alone.

To install Conda, follow the installation instructions for your operating system.

Create a file named environment.yml with the following contents:

name: iceberg-lab
channels:
  - conda-forge
dependencies:
  - findspark=2.0.1
  - jupyter=1.0.0
  - pyspark=3.5.0
  - openjdk=11.0.13

To create the environment needed, run the following in your shell:

conda env create -f environment.yml
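
Optionally, you can check that the environment was created correctly. The following is a minimal sketch; the exact version output on your machine may differ slightly:

conda activate iceberg-lab
python -c "import pyspark; print(pyspark.__version__)"   # expect 3.5.0
java -version                                            # expect OpenJDK 11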

Create Polaris Catalog account

A Polaris Catalog account can be created by ORGADMIN. In Snowsight, select the navigation menu Admin >> Accounts:

Image 3: Screenshot

Expand the + Account drop-down menu and select Create Polaris account.

Image 4: Screenshot

Log in to the Polaris Catalog web interface

  1. Click the account URL that you received after creating the account, or go to https://app.snowflake.com

  2. Click “Sign in to a different account” and log in with the Polaris Catalog account you created earlier.

Use case: Create table using Apache Spark™

Create a new catalog

Now that you have a new Polaris Catalog account, let’s log in to the account and create a new catalog that can host Iceberg tables.

To create a new catalog, click Catalogs in the left-hand pane, and then click the +Catalog button at the top right. In the dialog box, enter the details. You will need to provide storage account details when creating the catalog.

Default base location: The location where the table data will be stored.

Additional Locations (Optional): A comma-separated list of additional storage locations. It is mainly used if you need to import tables from different locations into this catalog. Let’s leave it blank for now.

S3 role ARN: The ARN of the AWS role that has read-write access to the storage locations.

External ID (Optional): A secret that you provide when creating the trust relationship between the catalog user and the storage account. If you skip it, one is auto-generated. Use a simple string such as abc123 for now.

You can also refer to this document for detailed information on storage for Apache Iceberg™ tables (external volumes).

The following screenshot shows a sample S3 storage configuration:

Image 5: Screenshot

Click the Create button to create the catalog in Polaris Catalog.

Now that you’ve created the catalog, you need to set up a trust relationship so that the IAM user specified in the configuration above can read and write data in the storage location. Once the catalog is created, click your catalog in the list. Note that you will need the S3 storage IAM user ARN and the External ID for this task.

Please follow these instructions to create the trust relationship. Note that only Step 5 is needed.

In the JSON trust policy shown in those instructions (an illustrative sketch follows the list below):

  • For <snowflake_user_arn>, use the value under IAM user arn in the Polaris Catalog UI.

  • For <snowflake_external_id>, use the value under External ID in the Polaris Catalog UI.
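
For reference, the trust policy in those instructions generally has the following shape. This is only an illustrative sketch; use the exact policy from the linked instructions and replace the placeholders with the values from the Polaris Catalog UI:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<snowflake_user_arn>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<snowflake_external_id>"
        }
      }
    }
  ]
}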

Create a new connection for Apache Spark™

Create a new connection (a client_id / client_secret pair) for Apache Spark™ to run queries against the catalog that you just created. To create a connection, click the Connections tab in the left nav pane, and then click the +Connection button at the top right.

When creating the connection, create a new principal role or choose one of the available roles (for example, service_admin). Below is a screenshot where a new role, my_spark_admin_role, is created. You will grant this role privileges to access the catalog that you created above.

Image 6: Screenshot

Copy the client_id and client_secret and keep them in a safe place.

Important

You won’t be able to retrieve these from the Polaris Catalog service again.
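
To avoid hard-coding the secret in your notebook later, one option is to export the values as environment variables before starting Jupyter and read them in Python. This is only a sketch; the variable names POLARIS_CLIENT_ID and POLARIS_CLIENT_SECRET are arbitrary:

import os

# Assumes you ran, for example, `export POLARIS_CLIENT_ID=...` and
# `export POLARIS_CLIENT_SECRET=...` in your shell before launching Jupyter.
client_id = os.environ['POLARIS_CLIENT_ID']
client_secret = os.environ['POLARIS_CLIENT_SECRET']

# Later, pass f"{client_id}:{client_secret}" to the
# spark.sql.catalog.polaris.credential setting instead of a literal value.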

Set up catalog privileges for connection

Now that a service connection is created, the next step is to give it privileges so that it can access the catalog. Without access privileges, the service connection can’t run any queries against the catalog. To do so, click Catalogs in the left nav pane, and then click your catalog in the list. Let’s create a new catalog role, spark_catalog_role, and give it privileges to create, read, and write tables. In the Catalog roles section, click the +Catalog role button and add the CATALOG_MANAGE_CONTENT privilege from the drop-down list.

Image 7: Screenshot

Now, in the Principal roles section, click Grant to principal role. Here you will grant the catalog role (and its privileges) to the principal role that you created in the previous step.

Image 8: Screenshot

Once done, the catalog privileges should look like the following. The spark_catalog_role role is granted to my_spark_admin_role, which gives admin privileges to the Spark connection that you created in the previous step.

Image 9: Screenshot

Set up Spark

From your terminal, run the following commands to activate the virtual environment you created during setup and start Jupyter Notebook.

conda activate iceberg-lab
jupyter notebook

Configure Spark

To configure Spark, run these commands in a Jupyter notebook. For more information, including parameter descriptions, see Configure a service connection in Spark.

import os
os.environ['SPARK_HOME'] = '/Users/<username>/opt/anaconda3/envs/iceberg-lab/lib/python3.12/site-packages/pyspark'

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('iceberg_lab') \
.config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.1,software.amazon.awssdk:bundle:2.20.160,software.amazon.awssdk:url-connection-client:2.20.160') \
.config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
.config('spark.sql.defaultCatalog', 'polaris') \
.config('spark.sql.catalog.polaris', 'org.apache.iceberg.spark.SparkCatalog') \
.config('spark.sql.catalog.polaris.type', 'rest') \
.config('spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation','vended-credentials') \
.config('spark.sql.catalog.polaris.uri','https://<polaris_catalog_account_identifier>.snowflakecomputing.com/polaris/api/catalog') \
.config('spark.sql.catalog.polaris.credential','<client_id>:<client_secret>') \
.config('spark.sql.catalog.polaris.warehouse','<catalog_name>') \
.config('spark.sql.catalog.polaris.scope','PRINCIPAL_ROLE:<principal_role_name>') \
.getOrCreate()

# Show namespaces
spark.sql("show namespaces").show()

# Create a namespace
spark.sql("create namespace spark_demo")

# Use the namespace
spark.sql("use namespace spark_demo")

# Show tables; this will show no tables since it is a new namespace
spark.sql("show tables").show()

# Create a test table
spark.sql("create table test_table (col1 int) using iceberg")

# Insert a record into the table
spark.sql("insert into test_table values (1)")

# Query the table
spark.sql("select * from test_table").show()

S3 cross region

When your storage account is located in a different region than your Spark client, you must provide an additional Spark configuration setting:

.config('spark.sql.catalog.polaris.client.region','<region_code>') \

The region code can be found in the AWS documentation at https://docs.aws.amazon.com/general/latest/gr/rande.html#regional-endpoints

The following code shows the earlier sample modified to include the S3 region:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('iceberg_lab') \
.config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.1,software.amazon.awssdk:bundle:2.20.160,software.amazon.awssdk:url-connection-client:2.20.160') \
.config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
.config('spark.sql.defaultCatalog', 'polaris') \
.config('spark.sql.catalog.polaris', 'org.apache.iceberg.spark.SparkCatalog') \
.config('spark.sql.catalog.polaris.type', 'rest') \
.config('spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation','vended-credentials') \
.config('spark.sql.catalog.polaris.uri','https://<account>.snowflakecomputing.com/polaris/api/catalog') \
.config('spark.sql.catalog.polaris.credential','<client_id>:<secret>') \
.config('spark.sql.catalog.polaris.warehouse','<catalog_name>') \
.config('spark.sql.catalog.polaris.scope','PRINCIPAL_ROLE:ALL') \
.config('spark.sql.catalog.polaris.client.region','<region_code>') \
.getOrCreate()

Query the tables using Snowflake

You can create a catalog integration object in Snowflake and then create an Apache Iceberg™ table in Snowflake that represents a table in Polaris Catalog. In the following example, you create an Iceberg table in Snowflake that represents the Iceberg table that Spark just created in the internal catalog in Polaris Catalog.

You can use the same Spark connection credentials or you can create a new Snowflake connection. If you create a new connection, you have to set up roles and privileges accordingly.

CREATE OR REPLACE CATALOG INTEGRATION demo_polaris_int
  CATALOG_SOURCE=POLARIS
  TABLE_FORMAT=ICEBERG
  CATALOG_NAMESPACE='<catalog_namespace>'
  REST_CONFIG = (
    CATALOG_URI ='https://<account>.snowflakecomputing.com/polaris/api/catalog'
    WAREHOUSE = '<catalog_name>'
  )
  REST_AUTHENTICATION = (
    TYPE=OAUTH
    OAUTH_CLIENT_ID='<client_id>'
    OAUTH_CLIENT_SECRET='<secret>'
    OAUTH_ALLOWED_SCOPES=('PRINCIPAL_ROLE:ALL')
  )
  ENABLED=true;

-- The <catalog_namespace> created in the previous step is spark_demo.
-- The <catalog_name> created in the previous step is demo_catalog.

Next, create the table representation in Snowflake using the catalog integration created above and query the table:

create or replace iceberg table test_table
  catalog = 'demo_polaris_int'
  external_volume = '<external_volume>'
  catalog_table_name = 'test_table';

select * from test_table;

Use case: Sync Apache Iceberg™ tables from Snowflake to Polaris Catalog

If you have Iceberg tables in Snowflake, you can sync them to Polaris Catalog so other engines can query those tables.

Create an external catalog in Polaris Catalog

First, you need to create an external catalog in your Polaris Catalog account to which the Iceberg tables from Snowflake can be synchronized. The external catalog in your Polaris Catalog account can be created using the same instructions as in the previous section for creating a catalog. The only difference is that you toggle the External slider on.

Note

You must use a different storage location. To ensure that the access privileges defined for a catalog are enforced correctly, two different catalogs can’t have overlapping locations.

Let’s assume that you created a catalog called demo_catalog_ext.

Create a connection for Snowflake

You will also need to create a connection in your Polaris Catalog account for Snowflake. You can follow the same instructions as in the previous section for creating an Apache Spark™ connection.

Set up catalog privileges

Lastly, you have to set up privileges on the external catalog so that the Snowflake connection has the right privileges on it. You can follow the same instructions as in the previous section for setting up catalog privileges.

Create a catalog integration object in Snowflake

Now you can create a catalog integration object in Snowflake and sync your Iceberg tables to Polaris Catalog. In the example below, any time a managed Iceberg table in the Snowflake schema polaris_demo.iceberg is modified, it is synchronized to Polaris Catalog so that other engines can query it.

CREATE OR REPLACE CATALOG INTEGRATION demo_polaris_ext 
  CATALOG_SOURCE=POLARIS 
  TABLE_FORMAT=ICEBERG 
  CATALOG_NAMESPACE='default' 
  REST_CONFIG = (
    CATALOG_URI ='https://<account>.snowflakecomputing.com/polaris/api/catalog' 
    WAREHOUSE = '<catalog_name>'
  )
  REST_AUTHENTICATION = (
    TYPE=OAUTH 
    OAUTH_CLIENT_ID='<client_id>' 
    OAUTH_CLIENT_SECRET='<secret>' 
    OAUTH_ALLOWED_SCOPES=('PRINCIPAL_ROLE:ALL') 
  ) 
  ENABLED=true;

-- The <catalog_name> created in the previous step is demo_catalog_ext.

Now, you can create a managed Iceberg table and sync it from Snowflake to Polaris Catalog:

-- Create the demo database and schema if they don't already exist.
create database if not exists polaris_demo;
create schema if not exists polaris_demo.iceberg;

use database polaris_demo;
use schema iceberg;

-- Note: the storage location for this external volume must be different from
-- the storage location of the external volume used in use case 1.

CREATE OR REPLACE EXTERNAL VOLUME snowflake_demo_ext
  STORAGE_LOCATIONS =
      (
        (
            NAME = '<storage_location_name>'
            STORAGE_PROVIDER = 'S3'
            STORAGE_BASE_URL = 's3://<s3_location>'
            STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<aws_acct>:role/<rolename>'
            STORAGE_AWS_EXTERNAL_ID = '<external_id>'
        )
      );

CREATE OR REPLACE ICEBERG TABLE test_table_managed (col1 int)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'snowflake_demo_ext'
  BASE_LOCATION = 'test_table_managed'
  CATALOG_SYNC = 'demo_polaris_ext'; 

The table will then be synced to Polaris Catalog and will be available for other engines to query.
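
To query the synced table from Spark, you can reuse the Spark session setup from the first use case, pointing the warehouse at the external catalog (demo_catalog_ext in this example) and using a connection that has privileges on it. The following is only a minimal sketch; the namespace under which the table appears depends on your catalog integration, so list the namespaces first and adjust the query accordingly:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('iceberg_lab_ext') \
.config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.1,software.amazon.awssdk:bundle:2.20.160,software.amazon.awssdk:url-connection-client:2.20.160') \
.config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
.config('spark.sql.defaultCatalog', 'polaris') \
.config('spark.sql.catalog.polaris', 'org.apache.iceberg.spark.SparkCatalog') \
.config('spark.sql.catalog.polaris.type', 'rest') \
.config('spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation','vended-credentials') \
.config('spark.sql.catalog.polaris.uri','https://<account>.snowflakecomputing.com/polaris/api/catalog') \
.config('spark.sql.catalog.polaris.credential','<client_id>:<client_secret>') \
.config('spark.sql.catalog.polaris.warehouse','demo_catalog_ext') \
.config('spark.sql.catalog.polaris.scope','PRINCIPAL_ROLE:ALL') \
.getOrCreate()

# List namespaces to find where the synced table was published.
spark.sql("show namespaces").show()

# Query the synced table; replace <namespace> with the namespace listed above.
spark.sql("select * from <namespace>.test_table_managed").show()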

Note

If the table fails to sync to Polaris Catalog, you can run the SYSTEM$SEND_NOTIFICATIONS_TO_CATALOG system function to diagnose the reason for the sync failure. For more information, see SYSTEM$SEND_NOTIFICATIONS_TO_CATALOG.

Conclusion

You can use an internal catalog in your Polaris Catalog account to create tables, query them, and run DML against the tables using Apache Spark™ or other query engines.

In Snowflake, you can create a catalog integration for Polaris Catalog to do the following:

  • Run queries on Polaris Catalog managed tables.

  • Sync Snowflake tables to an external catalog in your Polaris Catalog account.

What you learned

  • Create a Polaris Catalog account.

  • Create an internal catalog in your Polaris Catalog account.

  • Use Spark to create tables on the internal catalog.

  • Use Snowflake to create a catalog integration for Polaris Catalog and run queries on it.

  • Create an external catalog in your Polaris Catalog account.

  • Create a managed Apache Iceberg™ table in Snowflake and sync it to the external catalog in your Polaris Catalog account.

  • Query the synced table using Spark.