Snowflake Data Clean Rooms: Register a developer edition clean room in the web app

This topic describes the provider and consumer flows needed to programmatically set up a clean room, share it with a consumer, and run analyses through advanced machine learning algorithms in it. It will closely follow the Machine Learning flow, but provides consumers with a UI interface to an otherwise complex machine learning workflow. By registering the clean room into the UI, the consumer does not need to use Snowsight or the developer edition of a Snowflake Data Clean Room. This enables non-technical consumers to interface completely with customized and complex clean rooms.

This flow covers the following:

  1. Provider:

    a. Add a custom template running a Lookalike Modeling analysis.

    b. Securely add Machine Learning python code-based templates leveraging XGBoost.

    c. Call the machine learning UDFs inside the clean room using the custom template.

    d. Register the clean room into the UI with a customized UI form.

  2. Consumer:

    a. Use the web app of a Snowflake Data Clean Room to install the clean room and run analyses on it.


Lookalike Modeling is a type of analysis where a consumer tries to find “high-value” customers from a provider’s data by training a statistical model on their high-value customers. This model uses consumer-specified flags to indicate high-value users, such as those with expenditures above a certain threshold, in the consumer’s dataset. The trained model is then used to infer which customers in the provider’s data could potentially be “high value” to the consumer.


You need two separate Snowflake accounts to complete this flow. Use the first account to execute the provider’s commands, then switch to the second account to execute the consumer’s commands.



The following commands should be run in a Snowflake worksheet in the provider account

Set up the environment

The following commands are required for running the developer edition of a Snowflake Data Clean Room. If you don’t have the following role, contact your account administrator.

use role samooha_app_role;
use warehouse app_wh;

Create the clean room

Create a name for the clean room. Enter a new clean room name to avoid colliding with existing clean room names. Note that clean room names can only be alphanumeric. Clean room names cannot contain special characters other than spaces and underscores.

set cleanroom_name = 'UI Registration ML Clean room';

You can create a new cleanroom with the clean room name set above. If the clean room name set above already exists as an existing clean room, this process will fail.

This procedure may take a little longer to run, typically about half a minute.

The second argument to provider.cleanroom_init is the distribution of the clean room. This can either be INTERNAL or EXTERNAL. For testing purposes, if you are sharing the clean room to an account in the same organization, you can use INTERNAL to bypass the automated security scan which must take place before an application package is released to collaborators. However, if you are sharing this clean room to an account in a different organization, you must use an EXTERNAL clean room distribution.

call samooha_by_snowflake_local_db.provider.cleanroom_init($cleanroom_name, 'INTERNAL');

In order to view the status of the security scan, use:

call samooha_by_snowflake_local_db.provider.view_cleanroom_scan_status($cleanroom_name);

Once you have created your clean room, you must set its release directive before it can be shared with any collaborator. However, if your distribution was set to EXTERNAL, you must first wait for the security scan to complete before setting the release directive. You can continue running the remainder of the steps and return here before the provider.create_cleanroom_listing step while the scan runs.

In order to set the release directive, call:

call samooha_by_snowflake_local_db.provider.set_default_release_directive($cleanroom_name, 'V1_0', '0');

Cross-region sharing

In order to share a clean room with a Snowflake customer whose account is in a different region than your account, you must enable Cross-Cloud Auto-Fulfillment. For information about the additional costs associated with collaborating with consumers in other regions, see Cross-Cloud Auto-Fulfillment costs.

In developer edition, enabling cross-region sharing is a two-step process:

  1. A Snowflake administrator with the ACCOUNTADMIN role enables Cross-Cloud Auto-Fulfillment for your Snowflake account. For instructions, see Collaborate with accounts in different regions.

  2. You execute the provider.enable_laf_for_cleanroom command to enable Cross-Cloud Auto-Fulfillment for the clean room. For example:

    call samooha_by_snowflake_local_db.provider.enable_laf_for_cleanroom($cleanroom_name);

After you have enabled Cross-Cloud Auto-Fulfillment for the clean room, you can add consumers to your listing as usual using the provider.create_cleanroom_listing command. The listing is automatically replicated to remote clouds and regions as needed.

Add confidential Machine Learning Python code to the clean room

Load some python functions into the clean room for the lookalike ML work. All python functions installed in the clean room remain completely confidential. They cannot be seen by the consumer.

The following API allows you to define your Python functions directly as inline functions into the clean room. Alternatively you can load Python from staged files you’ve uploaded into the clean room stage. See the API reference guide for an example.


Note that this implementation is limited by the total Snowflake size constraint on the amount of data that can be aggregated by ARRAY_AGG (i.e. 16MB). An implementation leveraging batching and streaming models that can scale to arbitrary-sized datasets using batching is available upon request.

call samooha_by_snowflake_local_db.provider.load_python_into_cleanroom(
    ['input_data variant', 'labels variant'],
    ['pandas', 'numpy', 'xgboost'],
import numpy as np
import pandas as pd
import xgboost
from sklearn import preprocessing
import sys
import os
import pickle
import codecs
import threading

class TrainXGBoostClassifier(object):
    def __init__(self):
        self.model = None
        self._params = {
            "objective": "binary:logistic",
            "max_depth": 3,
            "nthread": 1,
            "eval_metric": "auc",
        self.num_boosting_rounds = 10

    def get_params(self):
        if self.model is not None and "updater" not in self._params:
                {"process_type": "update", "updater": "refresh", "refresh_leaf": True}
        return self._params

    def train(self, X, y):
        # Train the model in a threadsafe way

        # pick only the categorical attributes
        categorical = X.select_dtypes(include=[object])

        # fit a one-hot-encoder to convert categorical features to binary features (required by XGBoost)
        ohe = preprocessing.OneHotEncoder()
        categorical_ohe = ohe.fit_transform(categorical)
        self.ohe = ohe

        # get the rest of the features and add them to the binary features
        non_categorical = X.select_dtypes(exclude=[object])
        train_x = np.concatenate((categorical_ohe.toarray(), non_categorical.to_numpy()), axis=1)

        xg_train = xgboost.DMatrix(train_x, label=y)

        params = self.get_params()
        params["eval_metric"] = "auc"
        evallist = [(xg_train, "train")]
        evals_result = {}

        self.model = xgboost.train(
            params, xg_train, self.num_boosting_rounds, evallist, evals_result=evals_result

        self.evals_result = evals_result

    def __dump_model(self, model):
        # Save down the model as a json string to load up for scoring/inference

        pickle_jar = codecs.encode(pickle.dumps([model, self.ohe]), "base64").decode()
        return pickle_jar

    def dump_model(self):
        # Save down the model as a json string to load up for scoring/inference
        if self.model is not None:
            return self.__dump_model(self.model)
            raise ValueError("Model needs to be trained first")

def train(d1, l1):
    # get take training features and put them in a pandas dataframe
    X = pd.DataFrame(d1)

    # get the labels into a Numpy array
    y = np.array(l1)

    trainer = TrainXGBoostClassifier()
    trainer.train(X, y)

    # return training stats, accuracy, and the pickled model and pickled one-hot-encoder
    return {
        "total_rows": len(d1),
        "total_bytes_in": sys.getsizeof(d1),
        "model": trainer.dump_model(),
        "iteration": trainer.num_boosting_rounds,
        "auc": np.max(trainer.evals_result["train"]["auc"]),
        "error": 1 - np.max(trainer.evals_result["train"]["auc"])

Now install a scoring function into the clean room:

call samooha_by_snowflake_local_db.provider.load_python_into_cleanroom(
    ['pickle_jar variant', 'emails variant', 'features variant'],
    ['pandas', 'numpy', 'xgboost'],
import numpy as np
import pandas as pd
import xgboost as xgb
import pickle
import codecs
import json

def score(model, emails, features):
    # load model
    model = model[0] if not isinstance(model, str) else model
    model = pickle.loads(codecs.decode(model.encode(), "base64"))

    # retrieve the XGBoost trainer from the pickle jar
    bst = model[0]

    # retrieve the fitted one-hot-encoder from the pickle jar
    ohe2 = model[1]

    # create pandas dataframe from the inference features
    Y = pd.DataFrame(features)

    # select the categorical attributes and one-hot-encode them
    Y1 = Y.select_dtypes(include=[object])
    Y2 = ohe2.transform(Y1)

    # select the non-categorical attributes
    Y3 = Y.select_dtypes(exclude=[object])

    # join the results of the one-hot encoding to the rest of the attributes
    Y_pred = np.concatenate((Y2.toarray(), Y3.to_numpy()), axis=1)

    # inference
    dscore = xgb.DMatrix(Y_pred)
    pred = bst.predict(dscore)

    retval = list(zip(np.array(emails), list(map(str, pred))))
    retval = [{"email": r[0], "score": r[1]} for r in retval]
    return json.dumps(retval)  


Loading Python into the clean room creates a new patch for the clean room. If your clean room distribution is set to EXTERNAL, you need to wait for the security scan to complete, then update the default release directive using:

-- See the versions available inside the cleanroom
show versions in application package samooha_cleanroom_UI_Registration_ML_clean_room;

-- Once the security scan is approved, update the release directive to the latest version
call samooha_by_snowflake_local_db.provider.set_default_release_directive($cleanroom_name, 'V1_0', '2');

Add a Custom Lookalike Modeling template

To add a custom analysis template to the clean room you need a placeholder for table names on both the provider and consumer sides, along with join columns from the provider side. In SQL Jinja templates, these placeholders must always be:

  • source_table: an array of table names from the provider

  • my_table: an array of table names from the consumer

Table names can be made dynamic through using these variables, but they can also be hardcoded into the template if desired using the name of the view linked to the clean room. Column names can either be hardcoded into the template, if desired, or set dynamically through parameters. If they are set through parameters, remember that you need to call the parameters dimensions or measure_column, which need to be arrays, in order for them to be checked against the column policy. You add these as SQL Jinja parameters in the template that will be passed in later by the consumer when querying. The join policies ensure that the consumer cannot join on columns other than the authorized ones.

Alternatively, any argument in a custom SQL Jinja template can be checked for compliance with the join and column policies using the following filters:

  • join_policy: checks if a string value or filter clause is compliant with the join policy

  • column_policy: checks if a string value or filter clause is compliant with the column policy

  • join_and_column_policy: checks if columns used for a join in a filter clause are compliant with the join policy, and that columns used as a filter are compliant with the column policy

For example, in the clause {{ provider_id | sqlsafe | join_policy }}, an input of p.HEM will be parsed to check if p.HEM is in the join policy. Note: Only use the sqlsafe filter with caution as it allows collaborators to put pure SQL into the template.


All provider/consumer tables must be referenced using these arguments since the name of the secure view actually linked to the cleanroom will be different to the table name. Critically, provider table aliases MUST be p (or p1), p2, p3, p4, etc. and consumer table aliases must be c (or c1), c2, c3, etc. This is required in order to enforce security policies in the cleanroom.

Note that this function overrides any existing template with the same name. If you want to update any existing template, you can simply call this function again with the updated template.

A set of features is selected from the provider dataset, and a set of labels is selected from the consumer dataset, along with a “high value” flag (called label_value). These 2 tables are then inner-joined on email and passed to the Random Forest training algorithm. Lastly, the output of the model training step is passed to an inference function, which uses the trained model to “infer” which of the provider customers NOT in the consumer datasets could be “high value”. The count of such individuals is then returned, along with the model error.

The threshold for determining the score beyond which a customer is “likely high value” is manually set in the template as 0.5. This can be easily changed when adding the template to the clean room.

call samooha_by_snowflake_local_db.provider.add_custom_sql_template($cleanroom_name, 'prod_custom_lookalike_template',
features AS (
        identifier({{ provider_join_col | join_policy }}) as joincol,
        array_construct({{ dimensions[0] | sqlsafe }} {% for feat in dimensions[1:] %} , {{ feat | sqlsafe }} {% endfor %}) as features
        identifier({{ source_table[0] }}) as p
labels AS (
        c.{{ consumer_join_col | sqlsafe }} as joincol,
        c.{{ filter_column | default('SALES_DLR') | sqlsafe }} {{ operator | default('>=') | sqlsafe }} {{ filter_value | default(2000) | sqlsafe }} as label_value
        identifier({{ my_table[0] }}) as c
trained_model AS (
        train_out:model::varchar as model,
        train_out:error::float as error
    FROM (
        cleanroom.{{ lookalike_train_function | default('lookalike_train') | sqlsafe }}(array_agg(f.features), array_agg(l.label_value)) as train_out
      FROM features f, labels l
      WHERE f.joincol = l.joincol
inference_output AS (
        MOD(seq4(), 100) as batch,
        cleanroom.{{ lookalike_score_function | default('lookalike_score') | sqlsafe }}(
            array_agg(distinct t.model), 
            array_agg(identifier({{ provider_join_col | join_policy }})), 
            array_agg(array_construct( identifier({{ dimensions[0] }}) {% for feat in dimensions[1:] %} , identifier({{ feat }}) {% endfor %}) )
        ) as scores
    FROM trained_model t, identifier({{ source_table[0] }}) p
    WHERE identifier({{ provider_join_col | join_policy }}) NOT IN (SELECT c.{{ consumer_join_col | sqlsafe }} FROM identifier({{ my_table[0] }}) c)
    GROUP BY batch
processed_output AS (
    SELECT value:email::string as id, value:score::float as score FROM (select scores from inference_output), lateral flatten(input => parse_json(scores))
), train_results as (
    SELECT {{ num_boosting_rounds | sqlsafe }} as num_boosting_rounds, {{ trim_extremes | sqlsafe }} as trim_extremes, p.audience_size as audience_size, t.error as error from (SELECT count(distinct id) as audience_size FROM processed_output WHERE score > 0.5) p, trained_model t
), seed_size as (
    select count(*) as seed_audience_size from features f, labels l where f.joincol = l.joincol
select s.seed_audience_size, t.audience_size as num_lookalikes_found, t.error from train_results t, seed_size s

Note You can add Differential Privacy sensitivity to samooha_by_snowflake_local_db.provider.add_custom_sql_template procedure call above as the last parameter (if you do not add it, it will default to 1)

If you want to view the templates that are currently active in the clean room, call the following procedure. You can make the modifications to enable Differential Privacy guarantees on your analysis. A similar pattern can be incorporated into any custom template that you choose to write.

call samooha_by_snowflake_local_db.provider.view_added_templates($cleanroom_name);

Set the column policy on each table

Display the data linked to see the columns present inside the table. To view the top 10 rows, call the following procedure.

select * from <PROVIDER_TABLE> limit 10;

Set the columns on which you want to group, aggregate (e.g. SUM/AVG) and generally use in an analysis for every table and template combination. This gives flexibility so that the same table can allow different column selections depending on the underlying template. This should be called only after adding the template.

Note that the column policy is replace only, so if the function is called again, then the previously set column policy is completely replaced by the new one.

Column policy should not be used on identity columns like email, HEM, RampID, etc. since you don’t want the consumer to be able to group by these columns. In the production environment, the system will intelligently infer PII columns and block this operation, but this feature is not available in the sandbox environment. It should only be used on columns that you want the consumer to be able to aggregate and group by, like Status, Age Band, Region Code, Days Active, etc.

Note that for the “column_policy” and “join_policy” to carry out checks on the consumer analysis requests, all column names MUST be referred to as dimensions or measure_columns in the SQL Jinja template. Make sure you use these tags to refer to columns you want to be checked in custom SQL Jinja templates.

call samooha_by_snowflake_local_db.provider.set_column_policy($cleanroom_name, [

If you want to view the column policy that has been added to the clean room, call the following procedure.

call samooha_by_snowflake_local_db.provider.view_column_policy($cleanroom_name);

Share with a consumer

Finally, add a data consumer to the clean room by adding their Snowflake account locator and account names as shown below. The Snowflake account name must be of the form <ORGANIZATION>.<ACCOUNT_NAME>.


In order to call the following procedures, make sure you have first set the release directive using provider.set_default_release_directive. You can see the latest available version and patches using:

show versions in application package samooha_cleanroom_UI_Registration_ML_clean_room;
call samooha_by_snowflake_local_db.provider.add_consumers($cleanroom_name, '<CONSUMER_ACCOUNT_LOCATOR>', '<CONSUMER_ACCOUNT_NAME>');
call samooha_By_snowflake_local_db.provider.create_cleanroom_listing($cleanroom_name, '<CONSUMER_ACCOUNT_NAME>');

Register cleanroom with the UI

Now that the clean room has been shared with the consumer, you can register it to be available for the consumer to use through the web app of a Snowflake Data Clean Room. This means the consumer can run custom analyses, including those backed by complex machine learning code, through the UI without the need to use the developer edition of a Snowflake Data Clean Room.

First, add some customizations for the UI form for the various parameters available through the template. This improves the consumer experience and makes it more likely for them to supply a valid analysis request.

You can optionally use the warehouse_hints customization to control which kind of warehouse the consumer account will use to run analyses. For example, for most machine learning use cases, Snowflake recommends specifying a Snowpark-optimized warehouse. For valid values for warehouse_size, see CREATE WAREHOUSE.

call samooha_by_snowflake_local_db.provider.add_ui_form_customizations(
        'display_name': 'Custom Lookalike Template',
        'description': 'Use our customized ML techniques to find lookalike audiences.',
        'methodology': 'Specify your own seed audience, while matching against our users. Then customize the lookalike model across number of boosting rounds and removing outliers.',
        'warehouse_hints': {
            'warehouse_size': 'medium',
            'snowpark_optimized': TRUE,
    'num_boosting_rounds': {
        'display_name': 'Number of Boosting Rounds',
        'type': 'integer',
        'default': 10,
        'order': 7,
        'description': 'How many rounds of boosting should the model do?',
        'size': 'M',
        'group': 'Training & Inference Configuration'
    'trim_extremes': {
        'display_name': 'Trim Extremes',
        'type': 'boolean',
        'default': False,
        'order': 8,
        'description': 'Remove outliers by default?',
        'size': 'M',
        'group': 'Training & Inference Configuration'
    'lookalike_train_function': {
        'display_name': 'Training Function',
        'choices': ['lookalike_train'],
        'type': 'dropdown',
        'order': 9,
        'description': 'Which function do you want to use for training?',
        'size': 'M',
        'group': 'Training & Inference Configuration'
    'lookalike_score_function': {
        'display_name': 'Scoring Function',
        'choices': ['lookalike_score'],
        'type': 'dropdown',
        'order': 10,
        'description': 'Which function do you want to use for scoring?',
        'size': 'M',
        'group': 'Training & Inference Configuration'
    'provider_join_col': {
        'display_name': 'Provider Join Column',
        'choices': ['p.HASHED_EMAIL'],
        'type': 'dropdown',
        'order': 4,
        'description': 'Select the provider column to join users on.',
        'infoMessage': 'We recommend using HASHED_EMAIL.',
        'size': 'M',
        'group': 'Enable Provider Features'
    'consumer_join_col': {
        'display_name': 'Consumer Join Column',
        'description': 'Specify column in your table that matches the providers (i.e. HASHED_EMAIL).',
        'size': 'M',
        'default': 'HASHED_EMAIL',
        'infoMessage': 'We recommend using HASHED_EMAIL.',
        'order': 5,
        'group': 'Enable Provider Features'
    'dimensions': {
        'display_name': 'Feature Selection',
        'choices': ['p.STATUS', 'p.AGE', 'p.REGION_CODE', 'p.DAYS_ACTIVE'],
        'type': 'multiselect',
        'order': 6,
        'description': 'What features do you want to train on?',
        'infoMessage': 'We recommend selecting all features for maximum signal.',
        'size': 'M',
        'group': 'Enable Provider Features'
    'filter_column': {
        'display_name': 'Filter Column',
        'type': 'any',
        'order': 1,
        'description': 'Specify column in your table to filter for high value users (i.e. SALES_DLR)',
        'size': 'S',
        'default': 'SALES_DLR',
        'infoMessage': 'We recommend you input SALES_DLR over here.',
        'group': 'Seed Audience Selection'
    'operator': {
        'display_name': 'Operator',
        'choices': ['>=', '=', '<='],
        'type': 'dropdown',
        'order': 2,
        'description': 'What is the operator your want to use for the filter?',
        'size': 'S',
        'group': 'Seed Audience Selection'
    'filter_value': {
        'display_name': 'Filter Value',
        'order': 3,
        'description': 'What value do you want to filter to?',
        'default': 2000,        
        'size': 'S',
        'group': 'Seed Audience Selection'
}, {
    'measure_columns': ['seed_audience_size', 'audience_size', 'num_lookalikes_found', 'error'],
    'default_output_type': 'BAR'

Finally, use the provider.register_cleanroom_in_ui command to register the template for the consumer you just added to the clean room. You can specify which consumer you are enabling this custom analysis for by registering the clean room into the UI. In order to invoke this API, the user needs to specify the Snowflake registered username of the person using this API for verification before a clean room can be registered. This user must have signed up to the web app of a Snowflake Data Clean Room and must supply their email.


The provider.register_cleanroom_in_ui command, unlike the provider.add_consumers and provider.create_cleanroom_listing commands, needs the full account identifier of the consumer account, specified in the format LOCATOR.REGION.CLOUD. For information about finding this account identifier, see Using an account locator as an identifier.

Examples include:




call samooha_by_snowflake_local_db.provider.register_cleanroom_in_ui($cleanroom_name, 'prod_custom_lookalike_template', '<CONSUMER_ACCOUNT_IDENTIFIER>', '<USER_EMAIL>');

This command invokes a request to register the clean room into the UI. Previous requests can be seen using the following command:

call samooha_by_snowflake_local_db.provider.view_ui_registration_request_log();

Once this request has been made, you need to log in to your provider account in the web app to trigger the registration of this cleanroom. Until you log into the web app using this account, the request remains in a “PENDING” state. To log in to the web app, see Snowflake Data Clean Room: Web app.

Helper methods

Multiple consumer account locators can be passed into the provider.add_consumers function as a comma separated string, or as separate calls to provider.add_consumers.

If you want to view the consumers who have been added to this clean room, call the following procedure.

call samooha_by_snowflake_local_db.provider.view_consumers($cleanroom_name);

If you want to view the clean rooms that have been created recently, use the following procedure.

call samooha_by_snowflake_local_db.provider.view_cleanrooms();

If you want to get more insights about the clean room that you have created, use the following procedure.

call samooha_by_snowflake_local_db.provider.describe_cleanroom($cleanroom_name);

Any clean room created can also be deleted. The following command drops the clean room entirely, so any consumers who previously had access to the clean room will no longer be able to use it. If a clean room with the same name is desired in the future, it must be re-initialized using the above flow.

call samooha_by_snowflake_local_db.provider.drop_cleanroom($cleanroom_name);

Consumer Flow

As part of the provider flow, you completed the following:

  1. Created a new clean room

  2. Loaded in a custom template, backed by complex Python machine learning code

  3. Created a customized form that your consumers can use to interact with the clean room

  4. Added a description and methodology that you consumers can use

Now you can act as the consumer to see what this clean room looks like in the web app. You can also make modifications on this sheet, and re-register the clean room and see the changes coming up on the consumer UI in real time.

Access the consumer account on the Web App

To sign in to the clean room as the consumer:

  1. Sign in to the web app of the Snowflake Data Clean Room. For details, see Snowflake Data Clean Room: Web app.

  2. Enter your email address, and select Continue.

  3. Enter your password.

  4. Select the Snowflake account associated with the consumer account.

On the Cleanrooms page in the web app, you should see this clean room as a tile as per below. Install the clean room by clicking on the Install button and following the flow:

Next, in the Analyses section, click on New Analysis to pull up the run page for your clean room.

On the next page, you should be able to see the custom template you just added as a tile on the page. Click on that and hit “Next”.

Finally, you should now see your clean room; selecting it and pressing next will allow you to enter the screen where your custom form will be rendered for the consumer to fill.