Create and Implement a Databricks Source

Components of Databricks Source

Obtain the following components from your Databricks account before creating a Databricks source in Zeotap CDP:

Databricks Host
Catalog Name
Schema Name
Table Name
JDBC
- HTTP Path
- Databricks Client ID and Client Secret
Job-based (Recommended)
- Access Token
- Cluster ID
- Warehouse ID
Note: If you are ingesting more than 1 million records, we recommend using the Job-based approach, as JDBC may encounter issues with large data volumes.

Databricks Host

Databricks Host is the unique URL assigned to your Databricks workspace. You can find the host in the URL of your Databricks workspace.
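As a minimal sketch, the host can be pulled out of a workspace URL copied from the browser. The function name `databricks_host` and the example URL are hypothetical, and this page does not state whether the Databricks Host field expects the scheme (`https://`), so confirm the expected form in the app.

```python
from urllib.parse import urlparse


def databricks_host(workspace_url: str) -> str:
    """Return the host part of a Databricks workspace URL."""
    # Accept URLs pasted with or without a scheme.
    if "://" not in workspace_url:
        workspace_url = "https://" + workspace_url
    host = urlparse(workspace_url).hostname
    if not host:
        raise ValueError(f"Could not find a host in {workspace_url!r}")
    return host


# Hypothetical workspace URL, used only for illustration.
print(databricks_host("https://dbc-a1b2c3d4-e5f6.cloud.databricks.com/?o=1234567890"))
```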

Catalog Name

In Databricks, a catalog acts like a folder system. It organises schemas (databases) into a hierarchy, allowing users to group tables and views logically.

Schema Name

In Databricks, a schema is essentially a database that contains tables. It serves as a container for organising and managing related tables, providing a structured way to store and retrieve data.

Table Name

This refers to the name of the table within the schema.
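Together, the catalog, schema, and table names identify one table in Databricks as `catalog.schema.table`. The sketch below shows how the three values combine into a fully qualified name. The helper names and the example values (`main`, `crm`, `customers`) are hypothetical, and Zeotap CDP asks for the three values separately, so this is only a way to check which table they point at.

```python
def quote_identifier(name: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def qualified_table_name(catalog: str, schema: str, table: str) -> str:
    """Join catalog, schema, and table into a three-level name."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


# Hypothetical names, used only for illustration.
print(qualified_table_name("main", "crm", "customers"))
```

Running `SELECT * FROM <qualified name> LIMIT 1` in a Databricks SQL editor is a quick way to confirm the three values before entering them.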

Prerequisites for JDBC Approach

Before proceeding with this integration, you must complete the following prerequisite steps described in the Databricks documentation:

1. Create a Service Principal

2. Assign Workspace-Level Permissions to the Service Principal

3. Create an OAuth Secret for the Service Principal

Databricks Client ID and Client Secret

Before you can use OAuth to authenticate to Databricks, you must first create an OAuth Client Secret, which can be used to generate OAuth access tokens. Note that a service principal can have up to five OAuth secrets. Account admins and workspace admins can create an OAuth secret for a service principal. For detailed steps about how to generate Client ID and Secret, refer to the Databricks documentation. This information is also outlined in the Prerequisites section above.
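To illustrate how a Client ID and Client Secret are exchanged for an OAuth access token, the sketch below builds (but does not send) a client-credentials token request. It assumes the workspace-level token endpoint `/oidc/v1/token` and the `all-apis` scope described in the Databricks OAuth documentation. The function name and the placeholder credentials are hypothetical, and this is not a description of how Zeotap CDP itself authenticates.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(host: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build an OAuth client-credentials token request for a workspace."""
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode("ascii")
    return urllib.request.Request(
        url=f"https://{host}/oidc/v1/token",  # assumed workspace token endpoint
        data=body,
        headers={
            "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


# Placeholder values; replace them with your own host and service principal details.
request = build_token_request("example.cloud.databricks.com", "my-client-id", "my-secret")
print(request.get_method(), request.full_url)

# To send the request and read the token (requires network access):
#   import json
#   with urllib.request.urlopen(request) as response:
#       token = json.load(response)["access_token"]
```

A successful token response is a simple way to confirm that the service principal and its secret are valid before you enter them in Zeotap CDP.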

HTTP Path

To locate the HTTP Path on Databricks, perform the following steps:

Log in to your Databricks instance and, under SQL, click SQL Warehouses.

Click the desired SQL warehouse. If an SQL warehouse does not exist, then create one by clicking Create SQL Warehouse.

On the SQL Warehouse summary page, go to the Connection details tab.

Copy the HTTP Path from the displayed information.
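For a SQL warehouse, the HTTP Path usually has the form `/sql/1.0/warehouses/<warehouse-id>`. The sketch below checks a copied value against that form and extracts the warehouse ID. The pattern is an assumption based on the common SQL warehouse path format, and the function name and example ID are hypothetical.

```python
import re

# Assumed shape of a SQL warehouse HTTP Path.
_WAREHOUSE_PATH = re.compile(r"^/sql/1\.0/warehouses/([0-9A-Za-z]+)$")


def warehouse_id_from_http_path(http_path: str) -> str:
    """Validate a SQL warehouse HTTP Path and return its warehouse ID."""
    match = _WAREHOUSE_PATH.match(http_path.strip())
    if match is None:
        raise ValueError(f"Not a SQL warehouse HTTP Path: {http_path!r}")
    return match.group(1)


# Hypothetical HTTP Path, used only for illustration.
print(warehouse_id_from_http_path("/sql/1.0/warehouses/1234567890abcdef"))
```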

Partition Column

A partition column is a key used to organize large tables into smaller, logical groups. For example, data can be partitioned by date or region, so queries targeting a specific time range or location run much faster. If no Partition Column exists, use any Unique Column instead.
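To show why a partition column helps, the sketch below splits a date range into smaller filters, each of which reads only one slice of the table. It is an illustration under stated assumptions, not Zeotap CDP's actual read logic. The function name, the column name `event_date`, and the chunk size are hypothetical.

```python
from datetime import date, timedelta


def date_partition_predicates(column: str, start: date, end: date, days_per_chunk: int) -> list:
    """Return half-open range filters that together cover start..end inclusive."""
    if days_per_chunk < 1:
        raise ValueError("days_per_chunk must be at least 1")
    predicates = []
    lower = start
    stop = end + timedelta(days=1)  # exclusive upper bound of the whole range
    while lower < stop:
        upper = min(lower + timedelta(days=days_per_chunk), stop)
        predicates.append(
            f"{column} >= DATE '{lower.isoformat()}' AND {column} < DATE '{upper.isoformat()}'"
        )
        lower = upper
    return predicates


# Hypothetical column and range: ten days read in chunks of four days.
for predicate in date_partition_predicates("event_date", date(2024, 1, 1), date(2024, 1, 10), 4):
    print(predicate)
```

Each filter can be appended to a `WHERE` clause, so a query touches only the partitions inside its range instead of scanning the whole table.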

Unique Column Fields

A unique column is a column in the table for which every row has a different value. If a partition column does not exist, we perform the following operations:

We first create a temporary view and run queries over it.
We drop the view once all the data has been fetched.
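The create, query, and drop sequence above could look roughly like the following. This page does not describe what the temporary view contains, so numbering the rows by the unique column with `ROW_NUMBER()` is an assumption made for illustration. The function name, the view name `tmp_ingest_view`, and the batch size are hypothetical.

```python
def temp_view_statements(table: str, unique_column: str, view: str,
                         total_rows: int, batch_size: int) -> list:
    """Return SQL that creates a temporary view, reads it in batches, and drops it."""
    statements = [
        # Assumption: the view numbers rows by the unique column so they can be batched.
        f"CREATE OR REPLACE TEMPORARY VIEW {view} AS "
        f"SELECT *, ROW_NUMBER() OVER (ORDER BY {unique_column}) AS rn FROM {table}"
    ]
    for first in range(1, total_rows + 1, batch_size):
        last = min(first + batch_size - 1, total_rows)
        statements.append(f"SELECT * FROM {view} WHERE rn BETWEEN {first} AND {last}")
    # Clean up once every batch has been fetched.
    statements.append(f"DROP VIEW IF EXISTS {view}")
    return statements


# Hypothetical table, column, and view names.
for statement in temp_view_statements("main.crm.customers", "customer_id", "tmp_ingest_view", 5, 2):
    print(statement)
```

The ordering only gives stable batches if the chosen column is truly unique, which is why every row must have a different value.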

Prerequisites for Job Based Approach

Access token

You can generate an access token by navigating to Settings → User setting → Developer → Generate new token.

Cluster ID

You can find the Cluster ID by navigating to Compute → cluster name → automatically added tags.

Warehouse ID

You can find the Warehouse ID by navigating to SQL Warehouses → select the warehouse → Warehouse ID.
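As an alternative to the UI, the Cluster ID and Warehouse ID can also be looked up through the Databricks REST API using the access token. The sketch below builds (but does not send) the two list requests. It assumes the `/api/2.0/clusters/list` and `/api/2.0/sql/warehouses` endpoints, and the function name and placeholder values are hypothetical.

```python
import urllib.request


def build_api_request(host: str, access_token: str, path: str) -> urllib.request.Request:
    """Build an authenticated GET request against the Databricks REST API."""
    return urllib.request.Request(
        url=f"https://{host}{path}",
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )


# Placeholder host and token; replace them with your own values.
host, token = "example.cloud.databricks.com", "my-access-token"
clusters_request = build_api_request(host, token, "/api/2.0/clusters/list")
warehouses_request = build_api_request(host, token, "/api/2.0/sql/warehouses")
print(clusters_request.full_url)
print(warehouses_request.full_url)

# To send a request (requires network access):
#   import json
#   with urllib.request.urlopen(clusters_request) as response:
#       payload = json.load(response)
```

In the responses, clusters are expected under a `clusters` key with a `cluster_id` field, and warehouses under a `warehouses` key with an `id` field. Verify this against the API reference for your workspace.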

Create a Databricks Source

Once you have obtained the above details from your Databricks account, perform the following steps to create a Databricks Source in the Zeotap CDP App:

Navigate to the Sources application under Integrate, in the Zeotap CDP App.

Click CREATE SOURCE.

Choose Data Warehouse as the Category.

Select Databricks as the Data Source.

Enter a short and descriptive name for the Source.

Choose the Region of upload.

Choose the Refresh Frequency from the drop-down menu. The first data sync takes place once you create the source. However, the subsequent syncs take place based on the refresh frequency that you select. Currently, we support the following sync frequencies:
a. Sync once
b. Every hour
c. Every 3 hours
d. Every 6 hours
e. Every 12 hours
f. Daily
g. Weekly
h. Monthly
i. Sync Time - When you choose Daily, Weekly, or Monthly as the sync frequency, you can specify the exact time for the sync to occur.
ii. Sync Period - Indicates whether the selected Sync Time is in the AM or PM.
iii. Monthly Sync date - If Monthly sync is selected, you can specify the day of the month on which the sync should run.

Enter the Databricks workspace URL in the Databricks Host field.

Provide the Catalog Name, Schema Name, and Table Name obtained as mentioned in the Components of Databricks Source section.

Under Data Entity, depending on the type of data you want to ingest, choose either Customer Data or Non Customer Data and proceed with the Source creation. To know more about Customer Data and Non Customer Data, refer here.

Under Delta Queries Selection, you can decide whether or not you want to consider deltas (data additions/changes) in a table for a specific duration based on the timestamp column. Based on your requirement, select either true or false.

a. If you select true, only new and updated values based on the timestamp in the delta column are fetched from the table. In this case, Delta Column Name and Delta Column Data Type fields become active, and you need to provide the following information:
i. Under Delta Column Name, provide the name of the column that you want to fetch data from.
ii. Under Delta Column Data Type, choose the time increment for fetching the data from the selected column.

b. If you select false, then you are not required to make any additional selections. Zeotap CDP fetches all existing data from the table during each run.

Select the mechanism you want to use to pull the data. The currently supported options are JDBC and Job-based.

If you selected JDBC in the previous step, enter the values for the fields below. Refer to the Prerequisites for JDBC Approach section for details on how to obtain these values.
a. Client ID
b. Client Secret
c. HTTP Path
d. Partition Table Details
i. Partition Column Name
ii. Partition Column Type
e. Unique Column Name

If you selected Job-based in the previous step, enter the values for the fields below. Refer to the Prerequisites for Job Based Approach section for details on how to obtain these values.
a. Access Token
b. Cluster ID
c. Warehouse ID

Click Next to proceed to fields selection.

In the window that appears, a list of fields is displayed. You can select the desired fields using the check boxes. Use Select All to select all the fields available in your Databricks account. If you know the field names, you can select them after searching in the search box.

Click CREATE SOURCE. Upon successfully creating the source, you can view all the relevant information about the created source under the IMPLEMENTATION DETAILS tab.
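The difference between the true and false options in the Delta Queries Selection step can be sketched as two query shapes: a filtered read of rows changed since the last sync, and a full read of the table. This is an illustration only. It assumes a timestamp-typed delta column and a strict `>` comparison, neither of which is stated on this page, and the function, table, and column names are hypothetical.

```python
from datetime import datetime
from typing import Optional


def build_delta_query(table: str, delta_column: str, last_sync: Optional[datetime]) -> str:
    """Return a full-table query, or a delta query when a last sync time is known."""
    query = f"SELECT * FROM {table}"
    if last_sync is not None:
        # Assumption: the delta column holds timestamps and newer rows are wanted.
        stamp = last_sync.isoformat(sep=" ", timespec="seconds")
        query += f" WHERE {delta_column} > TIMESTAMP '{stamp}'"
    return query


# Delta selection set to false: every run reads the whole table.
print(build_delta_query("main.crm.customers", "updated_at", None))
# Delta selection set to true: only rows changed after the previous sync are read.
print(build_delta_query("main.crm.customers", "updated_at", datetime(2024, 5, 1, 12, 30)))
```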

Note: The initial data transfer from Databricks to Zeotap CDP may take time, depending on the data volume. For assistance with Databricks source setup or other questions, reach out to the Zeotap support team at support@zeotap.com.
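Before starting the steps above, it can help to check that you have collected every value the chosen mechanism needs. The sketch below is a hypothetical checklist helper. The field keys are illustrative names, not identifiers used by Zeotap CDP, and requiring either a partition column or a unique column for JDBC is an assumption drawn from the Partition Column section.

```python
COMMON_FIELDS = ("host", "catalog_name", "schema_name", "table_name")
MECHANISM_FIELDS = {
    "JDBC": ("client_id", "client_secret", "http_path"),
    "Job-based": ("access_token", "cluster_id", "warehouse_id"),
}


def missing_fields(mechanism: str, values: dict) -> list:
    """Return the names of required values that are absent or empty."""
    if mechanism not in MECHANISM_FIELDS:
        raise ValueError(f"Unsupported mechanism: {mechanism!r}")
    required = COMMON_FIELDS + MECHANISM_FIELDS[mechanism]
    missing = [name for name in required if not values.get(name)]
    # Assumption: JDBC needs a partition column or, failing that, a unique column.
    if mechanism == "JDBC" and not (values.get("partition_column") or values.get("unique_column")):
        missing.append("partition_column or unique_column")
    return missing


# Hypothetical values with the Warehouse ID not yet collected.
job_values = {
    "host": "example.cloud.databricks.com",
    "catalog_name": "main",
    "schema_name": "crm",
    "table_name": "customers",
    "access_token": "my-access-token",
    "cluster_id": "my-cluster-id",
}
print(missing_fields("Job-based", job_values))  # ['warehouse_id']
```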
