Snowflake Data Clean Rooms: Secure Python-Based Templates

This topic describes the provider and consumer flows needed to programmatically set up a clean room, share it with a consumer, and run analyses that use secure Python UDFs loaded into the clean room. In this flow, the provider loads secure Python code into the clean room using an API that keeps the underlying Python code completely confidential from the consumer.

This flow loads two Python functions into the clean room to perform custom data processing and aggregation. These Python UDFs are then called inside a custom SQL Jinja template that acts as the glue for the data flow. The template calculates an aggregation over a custom grouping created by the Python UDFs.

The key steps in this flow are:

  1. Provider:

    a. Securely load 2 confidential Python UDFs into a new clean room.

    b. Create a custom SQL Jinja analysis template using the Python UDFs.

    c. Share it with a consumer.

  2. Consumer:

    a. Examine the template provided within the clean room.

    b. Run an analysis within the clean room using the template.

Prerequisites

You need two separate Snowflake accounts to complete this flow. Use the first account to execute the provider’s commands, then switch to the second account to execute the consumer’s commands.

Provider

Note

The following commands should be run in a Snowflake worksheet in the provider account.

Set up the environment

Execute the following commands to set up the Snowflake environment before using developer APIs to work with a Snowflake Data Clean Room. If you don’t have the SAMOOHA_APP_ROLE role, contact your account administrator.

use role samooha_app_role;
use warehouse app_wh;

Create the clean room

Choose a name for the clean room. Use a new name to avoid colliding with existing clean room names. Clean room names can contain only alphanumeric characters, spaces, and underscores; no other special characters are allowed.

set cleanroom_name = 'Custom Secure Python UDF Demo clean room';

Create a new clean room with the clean room name set above. If a clean room with that name already exists, this procedure fails.

This procedure takes approximately 45 seconds to run.

The second argument to provider.cleanroom_init is the distribution of the clean room: either INTERNAL or EXTERNAL. For testing purposes, if you are sharing the clean room with an account in the same organization, you can use INTERNAL to bypass the automated security scan that must take place before an application package is released to collaborators. However, if you are sharing this clean room with an account in a different organization, you must use an EXTERNAL clean room distribution.

call samooha_by_snowflake_local_db.provider.cleanroom_init($cleanroom_name, 'INTERNAL');
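
If you are instead sharing the clean room with an account in a different organization, the same call uses an EXTERNAL distribution:

call samooha_by_snowflake_local_db.provider.cleanroom_init($cleanroom_name, 'EXTERNAL');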

In order to view the status of the security scan, use:

call samooha_by_snowflake_local_db.provider.view_cleanroom_scan_status($cleanroom_name);

Once you have created your clean room, you must set its release directive before it can be shared with any collaborator. However, if your distribution was set to EXTERNAL, you must first wait for the security scan to complete before setting the release directive. You can continue running the remainder of the steps while the scan runs and return here before the provider.create_cleanroom_listing step.

In order to set the release directive, call:

call samooha_by_snowflake_local_db.provider.set_default_release_directive($cleanroom_name, 'V1_0', '0');

Cross-region sharing

In order to share a clean room with a Snowflake customer whose account is in a different region than your account, you must enable Cross-Cloud Auto-Fulfillment. For information about the additional costs associated with collaborating with consumers in other regions, see Cross-Cloud Auto-Fulfillment costs.

When using developer APIs, enabling cross-region sharing is a two-step process:

  1. A Snowflake administrator with the ACCOUNTADMIN role enables Cross-Cloud Auto-Fulfillment for your Snowflake account. For instructions, see Collaborate with accounts in different regions.

  2. You execute the provider.enable_laf_for_cleanroom command to enable Cross-Cloud Auto-Fulfillment for the clean room. For example:

    call samooha_by_snowflake_local_db.provider.enable_laf_for_cleanroom($cleanroom_name);
    

After you have enabled Cross-Cloud Auto-Fulfillment for the clean room, you can add consumers to your listing as usual using the provider.create_cleanroom_listing command. The listing is automatically replicated to remote clouds and regions as needed.
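
For example, a hedged sketch of such a listing call (this assumes create_cleanroom_listing accepts the clean room name and a consumer account name of the form <ORGANIZATION>.<ACCOUNT_NAME>; check the API reference for the exact signature):

call samooha_by_snowflake_local_db.provider.create_cleanroom_listing($cleanroom_name, '<CONSUMER_ACCOUNT_NAME>');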

Confidentially load custom Python code as UDFs into the clean room

This section shows you how to load the following Python functions into the clean room.

  • assign_group -> a UDF that goes row by row and assigns a group ID.

  • group_agg -> a UDF that groups by the ID and aggregates an aspect of the data.

The following API allows you to define your Python functions directly inline in the clean room. Alternatively, you can load Python from files you have uploaded to the clean room stage. See the API reference guide for an example.

The following code defines and loads the assign_group UDF that goes row by row and assigns a group ID:

call samooha_by_snowflake_local_db.provider.load_python_into_cleanroom(
    $cleanroom_name, 
    'assign_group',                      -- Name of the UDF
    ['data variant', 'index integer'],   -- Arguments of the UDF, specified as (variable name, variable type)
    ['numpy', 'pandas'],                 -- Packages UDF will use
    'integer',                           -- Return type of UDF
    'main',                              -- Handler
    $$
import numpy as np
import pandas as pd

def main(data, index):
    _ = pd.DataFrame(data)  # Just as an example of what we could do
    np.random.seed(0)
    
    # First let's combine the data row and the additional index into a string
    data.append(index)
    data_string = ",".join(str(d) for d in data)

    # Hash it 
    encoded_data_string = data_string.encode()
    hashed_string = hash(encoded_data_string)

    # Return the hashed string
    return hashed_string
    $$
);

The following code defines and loads the group_agg UDF that groups by the ID and aggregates an aspect of the data:

call samooha_by_snowflake_local_db.provider.load_python_into_cleanroom(
    $cleanroom_name, 
    'group_agg',              -- Name of the UDF
    ['data variant'],         -- Arguments of the UDF, specified as (variable name, variable type)
    ['pandas'],               -- Packages UDF will use
    'integer',                -- Return type of UDF
    'main',                   -- Handler
    $$
import pandas as pd

def main(s):
    s = pd.Series(s)
    return (s == 'SILVER').sum()
    $$
);

Note

Loading Python into the clean room creates a new patch for the clean room. If your clean room distribution is set to EXTERNAL, you need to wait for the security scan to complete, then update the default release directive using:

-- See the versions available inside the clean room
show versions in application package samooha_cleanroom_Custom_Secure_Python_UDF_Demo_clean_room;

-- Once the security scan is approved, update the release directive to the latest version
call samooha_by_snowflake_local_db.provider.set_default_release_directive($cleanroom_name, 'V1_0', '2');

Load Python code from Python files in a stage

Note

This section is an optional alternative to the load_python_into_cleanroom calls above, which define the Python inline. It instead loads the UDFs from .py files uploaded to the clean room stage.

As an alternative to defining your functions inline, you can load Python from .py files in a stage. In order to do this, you must upload your code to the clean room code stage. Importantly, only the files in the clean room code stage are available to the clean room for use, so your files cannot be located elsewhere. The files must be in the following stage:

ls @samooha_cleanroom_Custom_Secure_Python_UDF_Demo_clean_room.app.code;

In order to define the assign_group and group_agg UDFs in this way, you can upload the following scripts into the clean room stage:

Create a file in your home directory called ~/assign_group.py and paste in the following code:

import numpy as np
import pandas as pd


def main(data, index):
    _ = pd.DataFrame(data)  # Just as an example of what we could do
    np.random.seed(0)

    # First let's combine the data row and the additional index into a string
    data.append(index)
    data_string = ",".join(str(d) for d in data)

    # Hash it
    encoded_data_string = data_string.encode()
    hashed_string = hash(encoded_data_string)

    # Return the hashed string
    return hashed_string

Now you need to upload the code to the clean room stage, by adding it to the folder containing the currently published version of the clean room application files. In order to get the necessary folder, use the following procedure:

call samooha_by_snowflake_local_db.provider.get_stage_for_python_files($cleanroom_name);

This gives you the stage to upload the file to. You can upload the file to the stage using the following command from SnowSQL:

PUT file://~/assign_group.py @samooha_cleanroom_Custom_Secure_Python_UDF_Demo_clean_room.app.code/V1_0P1/test_folder overwrite=True auto_compress=False;

Finally, you can load Python into the clean room using the following command:

call samooha_by_snowflake_local_db.provider.load_python_into_cleanroom(
    $cleanroom_name,
    'assign_group',                      -- Name of the UDF
    ['data variant', 'index integer'],   -- Arguments of the UDF, specified as (variable name, variable type)
    ['numpy', 'pandas'],                 -- Packages UDF will use
    ['/test_folder/assign_group.py'],    -- Name of Python file to import, relative to stage folder uploaded to
    'integer',                           -- Return type of UDF
    'assign_group.main'                  -- Handler, now scoped to the file
);

In a similar manner, you can create a file called ~/group_agg.py with the following code:

import pandas as pd


def main(s):
    s = pd.Series(s)
    return (s == "SILVER").sum()

The folder to which this file needs to be uploaded will have changed, because the last call to load_python_into_cleanroom added a patch to the clean room. You can get the new folder by rerunning the following command:

call samooha_by_snowflake_local_db.provider.get_stage_for_python_files($cleanroom_name);

The file can then be uploaded to the appropriate folder:

PUT file://~/group_agg.py @samooha_cleanroom_Custom_Secure_Python_UDF_Demo_clean_room.app.code/V1_0P2 overwrite=True auto_compress=False;

Once uploaded, the Python UDF can be created from this file using the following command:

call samooha_by_snowflake_local_db.provider.load_python_into_cleanroom(
    $cleanroom_name,
    'group_agg',                         -- Name of the UDF
    ['data variant'],                    -- Arguments of the UDF, specified as (variable name, variable type)
    ['pandas'],                          -- Packages UDF will use
    ['/group_agg.py'],                   -- Name of Python file to import, relative to stage folder uploaded to
    'integer',                           -- Return type of UDF
    'group_agg.main'                     -- Handler, now scoped to the file
);

Add a custom template using the UDFs

To add a custom analysis template to the clean room, you need placeholders for table names on both the provider and consumer sides, along with join columns from the provider side. In SQL Jinja templates, these placeholders must always be:

  • source_table: an array of table names from the provider

  • my_table: an array of table names from the consumer

Table names can be made dynamic by using these variables, or they can be hardcoded into the template using the name of the view linked to the clean room. Column names can either be hardcoded into the template or set dynamically through parameters. If they are set through parameters, the parameters must be named dimensions or measure_column, and they must be arrays, in order for them to be checked against the column policy. You add these as SQL Jinja parameters in the template; the consumer passes them in later when querying. The join policies ensure that the consumer cannot join on columns other than the authorized ones.
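
As an illustration, a minimal template fragment using these placeholders might look like the following (illustrative only; the full template used in this flow appears below):

select
    identifier({{ dimensions[0] | column_policy }}) as dim,
    count(*) as cnt
from identifier({{ source_table[0] }}) p
inner join identifier({{ my_table[0] }}) c
    on p.hem = c.hem
group by dim;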

Alternatively, any argument in a custom SQL Jinja template can be checked for compliance with the join and column policies using the following filters:

  • join_policy: checks if a string value or filter clause is compliant with the join policy

  • column_policy: checks if a string value or filter clause is compliant with the column policy

  • join_and_column_policy: checks if columns used for a join in a filter clause are compliant with the join policy, and that columns used as a filter are compliant with the column policy

For example, in the clause {{ provider_id | sqlsafe | join_policy }}, an input of p.HEM is parsed to check whether p.HEM is in the join policy. Note: Use the sqlsafe filter with caution, as it allows collaborators to put raw SQL into the template.

Note

All provider/consumer tables must be referenced using these arguments, since the name of the secure view actually linked to the clean room differs from the table name. Critically, provider table aliases MUST be p (or p1), p2, p3, p4, etc., and consumer table aliases MUST be c (or c1), c2, c3, etc. This is required in order to enforce security policies in the clean room.

Note that this function overwrites any existing template with the same name. If you want to update an existing template, simply call this function again with the updated template.

This template first enriches the provider data with a hash of a series of columns from the provider's table. The enriched data is then inner-joined with the consumer's dataset on hashed email, with an optional filter clause passed in. Finally, the custom group_agg Python UDF calculates an aggregation as a function of the hashed columns produced by the first UDF.

call samooha_by_snowflake_local_db.provider.add_custom_sql_template(
    $cleanroom_name, 
    'prod_custom_udf_template', 
    $$
with enriched_provider_data as (
    select 
        cleanroom.assign_group(array_construct(identifier({{ filter_column | column_policy }}), identifier({{ dimensions[0] | column_policy }})), identifier({{ measure_column[0] | column_policy }})) as groupid,
        {{ filter_column | sqlsafe }},
        hem
    from identifier({{ source_table[0] }}) p
), filtered_data as (
    select 
        groupid,
        identifier({{ filter_column | column_policy }})
    from enriched_provider_data p
    inner join identifier({{ my_table[0] }}) c
    on p.hem = c.hem
    {% if where_clause %}
    where {{ where_clause | sqlsafe }}
    {% endif %}
)
select groupid, cleanroom.group_agg(array_agg({{ filter_column | sqlsafe }})) as count from filtered_data p
group by groupid;
    $$
);

Note

You can pass a Differential Privacy sensitivity to the samooha_by_snowflake_local_db.provider.add_custom_sql_template procedure call above as the last parameter (if you do not pass it, it defaults to 1).
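
For example, a sketch only ($$ ... $$ stands for the full template body shown above):

call samooha_by_snowflake_local_db.provider.add_custom_sql_template(
    $cleanroom_name,
    'prod_custom_udf_template',
    $$ ... $$,   -- full SQL Jinja template body, as above
    2            -- Differential Privacy sensitivity
);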

If you want to view the templates that are currently active in the clean room, call the following procedure.

call samooha_by_snowflake_local_db.provider.view_added_templates($cleanroom_name);

Set the column policy on each table

Display the linked data to see the columns present in the table. To view the top 10 rows, execute the following query.

select * from SAMOOHA_SAMPLE_DATABASE_NAV2.DEMO.CUSTOMERS limit 10;

Set the columns the consumer can group by, aggregate (e.g., SUM/AVG), and generally use in an analysis, for every table and template combination. This gives flexibility, so the same table can allow different column selections depending on the underlying template. The column policy should only be set after adding the template.

Note that the column policy is replace-only: if the function is called again, the previously set column policy is completely replaced by the new one.

The column policy should not be set on identity columns like email, HEM, or RampID, since you don't want the consumer to be able to group by these columns. In the production environment, the system intelligently infers PII columns and blocks this operation, but this feature is not available in the sandbox environment. The policy should only be set on columns that you want the consumer to aggregate and group by, such as Status, Age Band, Region Code, or Days Active.

Note that for the column_policy and join_policy checks to be carried out on consumer analysis requests, all column names MUST be referred to as dimensions or measure_column in the SQL Jinja template. Make sure you use these tags to refer to the columns you want checked in custom SQL Jinja templates.

call samooha_by_snowflake_local_db.provider.set_column_policy($cleanroom_name, [
    'prod_custom_udf_template:SAMOOHA_SAMPLE_DATABASE_NAV2.DEMO.CUSTOMERS:STATUS', 
    'prod_custom_udf_template:SAMOOHA_SAMPLE_DATABASE_NAV2.DEMO.CUSTOMERS:REGION_CODE',
    'prod_custom_udf_template:SAMOOHA_SAMPLE_DATABASE_NAV2.DEMO.CUSTOMERS:AGE_BAND',
    'prod_custom_udf_template:SAMOOHA_SAMPLE_DATABASE_NAV2.DEMO.CUSTOMERS:DAYS_ACTIVE']);

If you want to view the column policy that has been added to the clean room, call the following procedure.

call samooha_by_snowflake_local_db.provider.view_column_policy($cleanroom_name);

Share with a consumer

Finally, add a data consumer to the clean room by adding their Snowflake account locator and account name as shown below. The Snowflake account name must be of the form <ORGANIZATION>.<ACCOUNT_NAME>.

Note

In order to call the following procedures, make sure you have first set the release directive using provider.set_default_release_directive. You can see the latest available version and patches using:

show versions in application package samooha_cleanroom_Custom_Secure_Python_UDF_Demo_clean_room;

Note

This call takes about 60 seconds to complete because it sets up a number of tasks to listen for and log requests from the consumer.

call samooha_by_snowflake_local_db.provider.add_consumers($cleanroom_name, '<CONSUMER_ACCOUNT_LOCATOR>');

Multiple consumer account locators can be passed to the provider.add_consumers function as a comma-separated string, or as separate calls to provider.add_consumers.
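
For example, with placeholder locators:

call samooha_by_snowflake_local_db.provider.add_consumers($cleanroom_name, '<CONSUMER_1_LOCATOR>,<CONSUMER_2_LOCATOR>');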

If you want to view the consumers who have been added to this clean room, call the following procedure.

call samooha_by_snowflake_local_db.provider.view_consumers($cleanroom_name);

View the clean rooms that have been recently created via the following procedure:

call samooha_by_snowflake_local_db.provider.view_cleanrooms();

View more details about the recently created clean room via the following procedure.

call samooha_by_snowflake_local_db.provider.describe_cleanroom($cleanroom_name);

Any clean room that has been created can also be deleted. The following command drops the clean room entirely, so any consumers who previously had access to it will no longer be able to use it. If a clean room with the same name is desired in the future, it must be reinitialized using the flow above.

call samooha_by_snowflake_local_db.provider.drop_cleanroom($cleanroom_name);

Note

The provider flow is now finished. Switch to the consumer account to continue with the consumer flow.

Consumer

Note

The following commands should be run in a Snowflake worksheet in the consumer account.

Set up the environment

Execute the following commands to set up the Snowflake environment before using developer APIs to work with a Snowflake Data Clean Room. If you don’t have the SAMOOHA_APP_ROLE role, contact your account administrator.

use role samooha_app_role;
use warehouse app_wh;

Install the clean room

Once the provider has shared a clean room with your account, you can view the list of available clean rooms using the command below.

call samooha_by_snowflake_local_db.consumer.view_cleanrooms();

Set the clean room name to the name of the clean room that the provider shared with you.

set cleanroom_name = 'Custom Secure Python UDF Demo clean room';

The following command installs the clean room in the consumer account, associating it with the corresponding provider and selected clean room.

This procedure takes approximately 45 seconds to run.

call samooha_by_snowflake_local_db.consumer.install_cleanroom($cleanroom_name, '<PROVIDER_ACCOUNT_LOCATOR>');

Once the clean room has been installed, the provider has to finish setting up the clean room on their side before it is enabled for use. The following function allows you to check the status of the clean room. Once it has been enabled, you should be able to run the Run Analysis command below. It typically takes about one minute for the clean room to be enabled.

call samooha_by_snowflake_local_db.consumer.is_enabled($cleanroom_name);

Run the analysis

Now that the clean room is installed, you can run the analysis template the provider added to the clean room using the run_analysis command. The sections below show how each field is determined.

The number of datasets you can pass is constrained by the template the provider has implemented. Some templates require a specific number of tables; the template creator can enforce whatever requirements they want to support.

Note

Before running the analysis, you can alter the warehouse size, or use a new, larger warehouse if your tables are large.
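
For example, to scale up the app_wh warehouse used in the setup step (assuming your role has the privilege to alter it):

alter warehouse app_wh set warehouse_size = 'LARGE';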

call samooha_by_snowflake_local_db.consumer.run_analysis(
  $cleanroom_name,                                   -- clean room
  'prod_custom_udf_template',                        -- template name
  ['SAMOOHA_SAMPLE_DATABASE_NAV2.DEMO.CUSTOMERS'],   -- consumer tables
  ['SAMOOHA_SAMPLE_DATABASE_NAV2.DEMO.CUSTOMERS'],   -- provider tables
  object_construct(                                  -- rest of the custom arguments needed for the template
    'filter_column', 'p.status',                     -- one of the SQL Jinja arguments; the column the UDF filters on
    'dimensions', ['p.DAYS_ACTIVE'],
    'measure_column', ['p.AGE_BAND'],
    'where_clause', 'c.status = $$GOLD$$'            -- a boolean filter applied to the data
  )
);

For each of the columns referred to in the where_clause dataset filter, or in dimensions or measure_column, use p. to refer to fields in provider tables and c. to refer to fields in consumer tables. Use p2, p3, etc. for more than one provider table and c2, c3, etc. for more than one consumer table.
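
For example, a where_clause that filters on both sides might look like the following (illustrative; the columns you can reference are constrained by the provider's policies):

'where_clause', 'p.status = $$SILVER$$ and c.status = $$GOLD$$'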

How to determine the inputs to run_analysis

To run the analysis, you need to pass in some parameters to the run_analysis function. This section will show you how to determine what parameters to pass in.

Template names

First, you can see the supported analysis templates by calling the following procedure.

call samooha_by_snowflake_local_db.consumer.view_added_templates($cleanroom_name);

Before running an analysis with a template, you need to know what arguments to specify and what types are expected. For custom templates, you can execute the following.

call samooha_by_snowflake_local_db.consumer.view_template_definition($cleanroom_name, 'prod_custom_udf_template');

A template definition can contain a large number of different SQL Jinja parameters. The following procedure parses the SQL Jinja template and extracts the arguments that need to be specified in run_analysis into a convenient list.

call samooha_by_snowflake_local_db.consumer.get_arguments_from_template($cleanroom_name, 'prod_custom_udf_template');

Dataset names

If you want to view the dataset names that the provider has added to the clean room, call the following procedure. Note that due to the security properties of the clean room, you cannot view the data in those datasets.

call samooha_by_snowflake_local_db.consumer.view_provider_datasets($cleanroom_name);

You can also see the tables you’ve linked to the clean room by using the following call:

call samooha_by_snowflake_local_db.consumer.view_consumer_datasets($cleanroom_name);

Dimension and measure columns

While running the analysis, you might want to filter, group by and aggregate on certain columns. If you want to view the column policy that has been added to the clean room by the provider, call the following procedure.

call samooha_by_snowflake_local_db.consumer.view_provider_column_policy($cleanroom_name);

Common errors

If run analysis returns a Not approved: unauthorized columns used error, you may want to view the join policy and column policy set by the provider:

call samooha_by_snowflake_local_db.consumer.view_provider_join_policy($cleanroom_name);
call samooha_by_snowflake_local_db.consumer.view_provider_column_policy($cleanroom_name);

It is also possible that you have exhausted your privacy budget, which prevents you from executing more queries. You can view your remaining privacy budget using the command below. The budget resets daily, or the clean room provider can reset it manually.

call samooha_by_snowflake_local_db.consumer.view_remaining_privacy_budget($cleanroom_name);

You can check if Differential Privacy has been enabled for your clean room using the following API:

call samooha_by_snowflake_local_db.consumer.is_dp_enabled($cleanroom_name);