Databricks - Unity Catalog
Databricks is a cloud-based data analytics platform that integrates data engineering, data science, and machine learning tools to process, analyze, and explore large datasets. It offers services such as Databricks Delta Lake and Databricks Runtime. Unity Catalog is a unified governance solution built on top of Databricks.
The connector is available here and supports:
- Synchronizing Databricks users to an identity store in Raito Cloud.
- Synchronizing Databricks Unity Catalog metadata (data structure, known permissions, …) to a data source in Raito Cloud.
- Synchronizing Databricks Unity Catalog grants from and to Raito Cloud.
- Synchronizing data usage information to Raito Cloud.
Prerequisites
Unity Catalog
Databricks Unity Catalog should be enabled on the account and workspaces, as this is essential to the Raito Databricks plugin.
Authentication
We support the following authentication methods:
- OAuth
- Personal Access Token
- Azure managed identities
- GCP ID authentication
The associated account should be an admin of the Databricks account and of all workspaces. No permissions are required within Unity Catalog itself.
OAuth (Azure, AWS, GCP)
Authentication using OAuth requires a valid client_id and client_secret.
The client_id and client_secret should be provided in the databricks-client-id and databricks-client-secret parameters respectively.
More information can be found on the following pages: azure, aws, gcp.
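As a minimal sketch, the OAuth parameters could be set as in the fragment below. All values are placeholders, and the fragment assumes the parameters are plain keys in the Raito CLI configuration; the platform value is spelled as in the parameter table in this document.

```yaml
# Hypothetical fragment: every value is a placeholder, not a real credential.
databricks-account-id: "YOUR_DATABRICKS_ACCOUNT_ID"
databricks-platform: "AWS"   # one of AWS, GCP, Azure (see the parameter table)
databricks-client-id: "YOUR_OAUTH_CLIENT_ID"
databricks-client-secret: "YOUR_OAUTH_CLIENT_SECRET"
```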
Personal Access Token (Azure, AWS, GCP)
To authenticate using a personal access token, the token should be provided in the databricks-token parameter.
More information can be found on the following pages: azure, aws, gcp.
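A personal access token setup might look like the following sketch. The values are placeholders, and the fragment assumes the parameters are plain keys in the Raito CLI configuration.

```yaml
# Hypothetical fragment: every value is a placeholder, not a real credential.
databricks-account-id: "YOUR_DATABRICKS_ACCOUNT_ID"
databricks-platform: "GCP"   # one of AWS, GCP, Azure (see the parameter table)
databricks-token: "YOUR_PERSONAL_ACCESS_TOKEN"
```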
Azure managed identities
A Microsoft Entra ID service principal can be used to authenticate against the Databricks account.
To use this authentication method, databricks-azure-client-id, databricks-azure-client-secret and databricks-azure-tenant-id should be provided.
More information can be found here.
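For illustration, a Microsoft Entra ID service principal could be configured roughly as follows. The values are placeholders, and the fragment assumes the parameters are plain keys in the Raito CLI configuration.

```yaml
# Hypothetical fragment: every value is a placeholder, not a real credential.
databricks-account-id: "YOUR_DATABRICKS_ACCOUNT_ID"
databricks-platform: "Azure"   # spelled as in the parameter table
databricks-azure-client-id: "YOUR_SERVICE_PRINCIPAL_APPLICATION_ID"
databricks-azure-client-secret: "YOUR_SERVICE_PRINCIPAL_CLIENT_SECRET"
databricks-azure-tenant-id: "YOUR_TENANT_ID"
```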
GCP ID authentication
To authenticate using GCP ID authentication, the GCP Service Account Credentials JSON or the location of these credentials on the local filesystem should be provided in the databricks-google-credentials parameter.
Additionally, a GCP service account e-mail should be provided in the databricks-google-service-account parameter.
More information can be found here.
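A GCP ID authentication setup might be sketched as below, here using a file location rather than inline JSON. The path and service account e-mail are hypothetical, and the fragment assumes the parameters are plain keys in the Raito CLI configuration.

```yaml
# Hypothetical fragment: path and e-mail are illustrative placeholders.
databricks-account-id: "YOUR_DATABRICKS_ACCOUNT_ID"
databricks-platform: "GCP"   # spelled as in the parameter table
# Either the credentials JSON itself or a path to it on the local filesystem.
databricks-google-credentials: "/path/to/service-account-credentials.json"
databricks-google-service-account: "YOUR_SERVICE_ACCOUNT_EMAIL"
```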
Basic Authentication
To authenticate by email and password, provide them in the databricks-user and databricks-password parameters respectively.
We recommend using basic authentication for testing purposes only.
Databricks-specific CLI parameters
To see all parameters, run the following command in a terminal window:
$> raito info raito-io/cli-plugin-databricks
Currently, the following configuration parameters are available:
Configuration name | Description | Mandatory | Default value |
---|---|---|---|
databricks-account-id | The Databricks account to connect to. | True | |
databricks-platform | The Databricks platform to connect to (AWS/GCP/Azure). | True | |
databricks-client-id | The (OAuth) client ID to use when authenticating against the Databricks account. | False | |
databricks-client-secret | The (OAuth) client secret to use when authenticating against the Databricks account. | False | |
databricks-token | The Databricks personal access token (PAT) (AWS, Azure, and GCP) or Azure Active Directory (Azure AD) token (Azure). | False | |
databricks-azure-use-msi | true to use Azure Managed Service Identity passwordless authentication flow for service principals. Requires AzureResourceID to be set. | False | false |
databricks-azure-client-id | The Azure AD service principal’s application ID. | False | |
databricks-azure-client-secret | The Azure AD service principal’s client secret. | False | |
databricks-azure-tenant-id | The Azure AD service principal’s tenant ID. | False | |
databricks-azure-environment | The Azure environment type (such as Public, UsGov, China, and Germany) for a specific set of API endpoints. | False | PUBLIC |
databricks-google-credentials | GCP Service Account Credentials JSON or the location of these credentials on the local filesystem. | False | |
databricks-google-service-account | The Google Cloud Platform (GCP) service account e-mail used for impersonation in the Default Application Credentials Flow that does not require a password. | False | |
databricks-data-usage-window | The maximum number of days of usage data to retrieve. Maximum is 90 days. | False | 90 |
databricks-sql-warehouses | A mapping of workspace deployment IDs to SQL warehouse IDs, required to support data object tags, row-level filtering and column masking (see SQL Warehouses). | False | |
SQL Warehouses
To enable row filtering and column masking, the plugin needs access to a SQL warehouse to manage those filters and masks.
The configuration file should be updated so that the databricks-sql-warehouses parameter is a list in which each entry defines a workspace deployment ID and a warehouse ID. Note that duplicate workspace IDs are not allowed.
For example:
databricks-sql-warehouses:
  - workspace-id: abc-12345678-fedc
    warehouse-id: 1234567891234567
  - workspace-id: 123-12345678-fedc
    warehouse-id: 9234567891234567
Here, abc-12345678-fedc and 123-12345678-fedc are the workspace IDs, and 1234567891234567 and 9234567891234567 are the corresponding SQL warehouse IDs.