Snowflake Data Clean Rooms: Secure Snowpark Procedures¶

This topic describes the provider and consumer flows needed to programmatically set up a clean room, share it with a consumer, and run analyses on it in a way that uses secure Snowpark procedures loaded into the clean room from the provider’s account. In this flow, a provider loads a secure Snowpark procedure into the clean room using an API that keeps the underlying Python code completely confidential from the consumer.

The Snowpark procedure in this flow is carrying out a linear regression of Reach on Impression count to estimate the slope. It takes as input a table with Impression IDs, User IDs and Timestamps on the provider’s account, and optionally a table of users from a consumer. The Snowpark procedure dynamically creates SQL to join the impressions data onto the consumer’s users data if it is supplied, and create an intermediary table in the clean room that contains the count of impressions and reach by day.

Next, this data from the intermediary table is processed inside the Snowpark procedure, and a regression is carried out to estimate the intercept, slope, and a number of other parameters. This data is then written to a results table inside the clean room, and the ID of this table is given to the consumer as an output. Finally, the consumer can use a get_results template with this ID to get the data back from the clean room. Before the Snowpark procedure finishes, it cleans up all the intermediary tables created in the clean room.

Note: all intermediary tables are created inside the clean room, and so aren’t accessible to anyone but the Snowpark procedure itself.

The key aspects of this flow other than the ones mentioned above are:

Provider:

a. Securely add a Snowpark procedure into the clean room.

b. Add a custom template that runs the Snowpark procedure, and another that retrieves the results.

c. Share the clean room with a consumer.
Consumer:

a. Run the template that carries out the regression.

b. Retrieve the results of the analysis.

Prerequisites¶

You need two separate Snowflake accounts to complete this flow. Use the first account to execute the provider’s commands, then switch to the second account to execute the consumer’s commands.

Provider¶

Note

The following commands should be run in a Snowflake worksheet in the provider account.

Set up the environment¶

Execute the following commands to set up the Snowflake environment before using developer APIs to work with a Snowflake Data Clean Room. If you don’t have the SAMOOHA_APP_ROLE role, contact your account administrator.

use role samooha_app_role;
use warehouse app_wh;

Copy

Create the clean room¶

Create a name for the clean room. Enter a new clean room name to avoid colliding with existing clean room names. Note that clean room names can only be alphanumeric. clean room names cannot contain special characters other than spaces and underscores.

set cleanroom_name = 'Snowpark Demo clean room';

Copy

You can create a new clean room with the clean room name set above. If the clean room name set above already exists as an existing clean room, this process will fail.

This procedure takes approximately 45 seconds to run.

The second argument to provider.cleanroom_init is the distribution of the clean room. This can either be INTERNAL or EXTERNAL. For testing purposes, if you are sharing the clean room to an account in the same organization, you can use INTERNAL to bypass the automated security scan which must take place before an application package is released to collaborators. However, if you are sharing this clean room to an account in a different organization, you must use an EXTERNAL clean room distribution.

call samooha_by_snowflake_local_db.provider.cleanroom_init($cleanroom_name, 'INTERNAL');

Copy

In order to view the status of the security scan, use:

call samooha_by_snowflake_local_db.provider.view_cleanroom_scan_status($cleanroom_name);

Copy

Once you have created your clean room, you must set its release directive before it can be shared with any collaborator. However, if your distribution was set to EXTERNAL, you must first wait for the security scan to complete before setting the release directive. You can continue running the remainder of the steps and return here before the provider.create_cleanroom_listing step while the scan runs.

In order to set the release directive, call:

call samooha_by_snowflake_local_db.provider.set_default_release_directive($cleanroom_name, 'V1_0', '0');

Copy

Link the dataset, and set the join policy for the dataset¶

Link Snowflake tables into the clean room. These datasets will be made available to you on your account through the latest patch. Browse through the list of tables in your Snowflake account and enter the fully qualified table names (Database.Schema.Table) as an array. The procedure automatically makes the table accessible to the clean room by creating a secure view of the table from within the clean room, thereby avoiding any need to make a copy of your table.

call samooha_by_snowflake_local_db.provider.link_datasets($cleanroom_name, ['<IMPRESSIONS_TABLE>']);

Copy

If you’d like to link a view to the clean room instead that has downstream dependencies, use provider.link_datasets_advanced instead:

call samooha_by_snowflake_local_db.provider.link_datasets_advanced($cleanroom_name, ['<VIEW_NAME>'], ['<SOURCE_DB_NAMES>']);

Copy

Note

If this step doesn’t work even though your table exists, it is likely the SAMOOHA_APP_ROLE role has not yet been given access to it. If so, switch to the ACCOUNTADMIN role, call the below procedure on the database, and then switch back for the rest of the flow:

use role accountadmin;
call samooha_by_snowflake_local_db.provider.register_db('<DATABASE_NAME>');
use role samooha_app_role;

Copy

You can see the datasets linked to the clean room using the following procedure:

call samooha_by_snowflake_local_db.provider.view_provider_datasets($cleanroom_name);

Copy

In order to figure out which columns to use as the join policy, you can peek into your dataset to determine the PII columns. To see the top 10 rows, use this query:

select * from <IMPRESSIONS_TABLE> limit 10;

Copy

Confidentially load the Snowpark procedure into the clean room¶

This section shows you how to load the Snowpark procedure into the clean room. The procedure carries out the following steps:

Preprocesses impressions data: dynamic SQL is created that joins the provider impressions data onto the consumer users data, if the consumer’s table is supplied, and calculates the distinct count of impressions and reach by date, and stores them in an intermediary table inside the clean room. If the consumer table is not supplied, then it uses the entirity of the provider’s impressions table.
Load intermediary table: the intermediary table is loaded into the Snowpark procedure as a pandas dataframe.
Carry out regression: the regression is calculated using the statsmodels library and the results returned as a pandas dataframe.
Write results to Snowflake table: the results are written to a results table inside the clean room, and the ID suffix of the table is returned to the consumer.

a. Since the Snowpark procedure is running inside the clean room, it has a limited ability to write directly to the consumer tenant. Instead, to keep the results more secure, it is written to a table inside the clean room and give consumers the ability to read from the table.
Drop intermediary tables: intermediary tables created during the calculation inside the clean room that are no longer needed are dropped before the Snowpark procedure finishes.

The following API allows you to define your Python functions directly as inline functions into the clean room. Alternatively you can load Python from staged files you’ve uploaded into the clean room stage. See the API reference guide for an example.

call samooha_by_snowflake_local_db.provider.load_python_into_cleanroom(
    $cleanroom_name,
    'reach_impression_regression',
    ['source_table string', 'my_table string'],
    ['snowflake-snowpark-python', 'pandas', 'statsmodels', 'numpy'],
    'string',
    'main',
    $$
import traceback
import pandas as pd
import numpy as np

import statsmodels.formula.api as sm


def drop_tables(session, table_names):
    """
    Drop the tables passed in
    """
    for tbl in table_names:
        session.sql(f'drop table {tbl}').collect()

def preprocess_regression_data(session, source_table, my_table, suffix):
    """
    Preprocess the impressions and customer data into an intermediary table for regression
    """
    table_name = f'cleanroom.intermediary_{suffix}'

    my_table_statement = f'inner join {my_table} c on p.hem = c.hem' if my_table != 'NONE' else ''
    session.sql(f"""
    create or replace table {table_name} as (
        with joint_data as (
            select
                date,
                p.hem as hem,
                impression_id
            from {source_table} p
            {my_table_statement}
        )
        select
            date,
            count(distinct hem) as reach,
            count(distinct impression_id) as num_impressions
        from joint_data
        group by date
        order by date
    );
    """).collect()

    return table_name

def calculate_regression(df):
    """
    Calculate the regression data from the dataframe we put together
    """
    result = sm.ols('REACH ~ 1 + NUM_IMPRESSIONS', data=df).fit()
    retval = pd.concat([
        result.params,
        result.tvalues,
        result.pvalues
    ], keys=['params', 't-stat', 'p-value'], names=['STATISTIC', 'PARAMETER']).rename('VALUE').reset_index()
    return retval

def main(session, source_table, my_table):
    """
    First compute the regression data from an overlap between customer and provider data, and counting
    the number of impressions and reach per day. Next regress these locally and compute the regression
    statistics. Finally write it to a results table which can be queried to get the output.
    """
    suffix = f'regression_results_{abs(hash((source_table, my_table))) % 10000}'

    try:
        # Preprocess impressions and customer data into an intermediary form to use for regression
        intermediary_table_name = preprocess_regression_data(session, source_table, my_table, suffix)

        # Load the data into Python locally
        df = session.table(intermediary_table_name).to_pandas()

        # Carry out the regression and get statistics as an output
        regression_output = calculate_regression(df)

        # Write the statistics to an output table
        # The table and the schema names should be in upper case to quoted identifier related issues.
        table = f'results_{suffix}'.upper()
        retval_df = session.write_pandas(regression_output, table,  schema = 'CLEANROOM', auto_create_table = True)

        # Drop any intermediary tables
        drop_tables(session, [intermediary_table_name])

        # Tell the user the name of the table the results have been written to
        return f'Done, results have been written to the following suffix: {suffix}'
    except:
        return traceback.format_exc()
$$
);

Copy

Note

Loading Python into the clean room creates a new patch for the clean room. If your clean room distribution is set to EXTERNAL, you need to wait for the security scan to complete, then update the default release directive using:

-- See the versions available inside the clean room
show versions in application package samooha_cleanroom_Snowpark_Demo_clean_room;

-- Once the security scan is approved, update the release directive to the latest version
call samooha_by_snowflake_local_db.provider.set_default_release_directive($cleanroom_name, 'V1_0', '1');

Copy

Add a custom template using the UDFs¶

To add a custom analysis template to the clean room you need a placeholder for table names on both the provider and consumer sides, along with join columns from the provider side. In SQL Jinja templates, these placeholders must always be:

source_table: an array of table names from the provider
my_table: an array of table names from the consumer

Table names can be made dynamic through using these variables, but they can also be hardcoded into the template if desired using the name of the view linked to the cleanroom. Column names can either be hardcoded into the template, if desired, or set dynamically through parameters. If they are set through parameters, remember that you need to call the parameters dimensions or measure_column, which need to be arrays, in order for them to be checked against the column policy. You add these as SQL Jinja parameters in the template that will be passed in later by the consumer when querying. The join policies ensure that the consumer cannot join on columns other than the authorized ones.

Alternatively, any argument in a custom SQL Jinja template can be checked for compliance with the join and column policies using the following filters:

join_policy: checks if a string value or filter clause is compliant with the join policy
column_policy: checks if a string value or filter clause is compliant with the column policy
join_and_column_policy: checks if columns used for a join in a filter clause are compliant with the join policy, and that columns used as a filter are compliant with the column policy

For example, in the clause {{ provider_id | sqlsafe | join_policy }}, an input of p.HEM will be parsed to check if p.HEM is in the join policy. Note: Only use the sqlsafe filter with caution as it allows collaborators to put pure SQL into the template.

Note

All provider/consumer tables must be referenced using these arguments since the name of the secure view actually linked to the clean room will be different to the table name. Critically, provider table aliases MUST be p (or p1), p2, p3, p4, etc. and consumer table aliases must be c (or c1), c2, c3, etc. This is required in order to enforce security policies in the clean room.

Note that this function overrides any existing template with the same name. If you want to update any existing template, you can simply call this function again with the updated template.

call samooha_by_snowflake_local_db.provider.add_custom_sql_template(
        $cleanroom_name,
        'prod_calculate_regression',
        $$
call cleanroom.reach_impression_regression({{ source_table[0] }}, {{ my_table[0] | default('NONE') }});
$$
);

Copy

Finally, a custom template is added that will allow the consumer to retrieve the results of their analysis given the results suffix ID returned to them from the calculate_regression template.

call samooha_by_snowflake_local_db.provider.add_custom_sql_template(
        $cleanroom_name,
        'prod_get_results',
        $$
select * from cleanroom.results_{{ results_suffix | sqlsafe }};
$$
);

Copy

If you want to view the templates that are currently active in the clean room, call the following procedure.

call samooha_by_snowflake_local_db.provider.view_added_templates($cleanroom_name);

Copy

Consumer¶

Note

The following commands should be run in a Snowflake worksheet in the consumer account

Set up the environment¶

use role samooha_app_role;
use warehouse app_wh;

Copy

Install the clean room¶

Once a clean room share has been installed, the list of clean rooms available can be viewed using the below command.

call samooha_by_snowflake_local_db.consumer.view_cleanrooms();

Copy

Assign a name for the clean room that the provider has shared with you.

set cleanroom_name = 'Snowpark Demo clean room';

Copy

The following command installs the clean room on the consumer account with the associated provider and selected clean room.

This procedure takes approximately 45 seconds to run.

call samooha_by_snowflake_local_db.consumer.install_cleanroom($cleanroom_name, '<PROVIDER_ACCOUNT_LOCATOR>');

Copy

Once the clean room has been installed, the provider has to finish setting up the clean room on their side before it is enabled for use. The below function allows you to check the status of the clean room. Once it has been enabled, you should be able to run the Run Analysis command below. It typically takes about 1 minute for the clean room to be enabled.

Make sure the install_cleanroom function has finished before running this function.

call samooha_by_snowflake_local_db.consumer.is_enabled($cleanroom_name);

Copy

Link the dataset¶

Link datasets into the clean room to carry out secure computation with the provider’s data. These datasets will be made available to you on your account through the latest patch.

call samooha_by_snowflake_local_db.consumer.link_datasets($cleanroom_name, ['<USERS_TABLE>']);

Copy

Note

use role accountadmin;
call samooha_by_snowflake_local_db.consumer.register_db('<DATABASE_NAME>');
use role samooha_app_role;

Copy

To run the analysis, you will need to pass in the consumer table. If you want to view the datasets that you have added to the clean room, call the following procedure.

call samooha_by_snowflake_local_db.consumer.view_consumer_datasets($cleanroom_name);

Copy

Run the analysis¶

Now that the clean room is installed, you can run the analysis template given to the clean room by the provider using a “run_analysis” command. You can see how each field is determined in the sections below.

The number of datasets passable is constrained by the template the provider has implemented. Some templates require a specific number of tables. The template creator can implement the requirements they wish to support.

Note

Before running the analysis, you can alter the warehouse size, or use a new, bigger, warehouse size if your tables are large.

call samooha_by_snowflake_local_db.consumer.run_analysis(
  $cleanroom_name,               -- cleanroom
  'prod_calculate_regression',    -- template name

  ['<USERS_TABLE>'],    -- consumer tables

  ['<IMPRESSSIONS_TABLE>'],     -- provider tables

  object_construct()     -- Rest of the custom arguments needed for the template
);

Copy

The output of this analysis will be an ID that can be used to retrieve the results of the regression using the following template:

set result_suffix = 'regression_results_<ID>';

call samooha_by_snowflake_local_db.consumer.run_analysis(
    $cleanroom_name,        -- cleanroom
    'prod_get_results',     -- template name
    [],                     -- consumer tables
    [],                     -- provider tables
    object_construct(
        'results_suffix', $result_suffix  -- The suffix with the results
    )
);

Copy

How to determine the inputs to run_analysis¶

To run the analysis, you need to pass in some parameters to the run_analysis function. This section will show you how to determine what parameters to pass in.

Template names

First, you can see the supported analysis templates by calling the following procedure.

call samooha_by_snowflake_local_db.consumer.view_added_templates($cleanroom_name);

Copy

To run the analysis, you need to pass in some parameters to the run_analysis function. This section shows you how to determine what parameters to pass in.

call samooha_by_snowflake_local_db.consumer.view_template_definition($cleanroom_name, 'prod_calculate_regression');

Copy

This can often also contain a large number of different SQL Jinja parameters. The following functionality parses the SQL Jinja template and extracts the arguments that need to be specified in run_analysis into a list.

call samooha_by_snowflake_local_db.consumer.get_arguments_from_template($cleanroom_name, 'prod_calculate_regression');

Copy

Dataset names

If you want to view the dataset names that have been added to the clean room by the provider, call the following procedure. Note that you cannot view the data present in the datasets that have been added to the clean room by the provider due to the security properties of the clean room.

call samooha_by_snowflake_local_db.consumer.view_provider_datasets($cleanroom_name);

Copy

You can also see the tables you’ve linked to the clean room by using the following call:

call samooha_by_snowflake_local_db.consumer.view_consumer_datasets($cleanroom_name);

Copy

Recommendations¶

Keep all significant data pre-processing happening through dynamic SQL where possible, storing data in intermediary tables using the cleanroom schema. It is much faster and more efficient. For example:
```
session.sql("create or replace table cleanroom.intermediary as ...")
```
Copy
Create UDFs, UDTFs and Procedures by executing SQL via session.sql in the cleanroom schema rather than using the Snowpark decorators. For example:
```
session.sql("create or replace function cleanroom.udf(...")
```
Copy
When you need to load data that is too big to fit in memory, use .to_pandas_batches() and iterate over it. For example:
```
df_iter = session.table(intermediary_table_name).to_pandas_batches()
for df in df_iter:
    ...
```
Copy

Snowflake Data Clean Rooms: Secure Snowpark Procedures¶

Prerequisites¶

Provider¶

Set up the environment¶

Create the clean room¶

Cross-region sharing¶

Link the dataset, and set the join policy for the dataset¶

Confidentially load the Snowpark procedure into the clean room¶

Add a custom template using the UDFs¶

Share with a consumer¶

Consumer¶

Set up the environment¶

Install the clean room¶

Link the dataset¶

Run the analysis¶

How to determine the inputs to run_analysis¶

Recommendations¶