
Today, we continue our round-up of new features in Databricks for the first quarter of 2023. In this second guide, we’ll look at updates to workflows, notebooks, governance and runtime.
Our goal? To showcase several of the most important and useful changes to Databricks we saw in Q1 of 2023 and explain how they can save you time, make your team’s life easier, and improve your data management across the board.
If you haven’t already, don’t forget to read part 1 – New in Databricks 2023, featuring a look into the Serverless Warehouse, Model Serving and the Databricks Extension for VS Code.
In this article, you’ll read about:
Databricks Workflows offer a comprehensive range of orchestration capabilities directly within the Lakehouse platform, eliminating the need for third-party tools. This simplifies your modern data stack, reducing both complexity and costs.
In the last few months, Databricks released two new job triggers, file arrival and continuous, and an interesting feature that lets you run your Python code from a remote Git repository as a Workflows task. Let’s explore them below.
Availability: Generally Available on Azure, AWS, GCP
Triggering a job when new data lands in cloud storage is a common requirement, especially in event-based processing systems.
With this feature you can trigger a run of your Databricks job when a new file arrives in an external location, such as Amazon S3 or Azure storage. This is very useful when a scheduled job is inefficient because new data arrives irregularly. File arrival triggers check for new files every minute, and you pay only the cloud provider cost associated with listing files in the storage location.
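As a rough illustration, a file arrival trigger can be attached to a job definition through the Jobs 2.1 REST API. The workspace URL, token, cluster, notebook path, and storage URL below are placeholders, and the exact payload may differ slightly from this sketch.

```python
import requests

# Placeholder values -- replace with your own workspace, token, task and storage path.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "process-new-files",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/team/ingest"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
    # File arrival trigger: run the job when new files land in the
    # monitored storage location (checked roughly every minute).
    "trigger": {
        "pause_status": "UNPAUSED",
        "file_arrival": {"url": "s3://my-bucket/landing/"},
    },
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```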
DO YOU WANT TO OPTIMIZE YOUR DATABRICKS SETUP AND MAKE THE MOST OF ITS NEW FEATURES?
SPEAK TO US TODAY.
Availability: Generally Available on Azure, AWS, GCP
Continuous jobs allow customers to maintain an uninterrupted workflow by automatically triggering a new job run as soon as the previous one finishes, regardless of whether it succeeded or failed.
With continuous jobs, you can reuse the compute instances from the previous job run, so this feature drastically reduces provisioning time for Databricks clusters.
You can also leverage the continuous job trigger type on Spark Structured Streaming jobs and batch jobs.
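A minimal sketch of defining a continuous job through the same Jobs 2.1 REST API, assuming the `continuous` field; all identifiers below are placeholders.

```python
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "streaming-pipeline",
    "tasks": [
        {
            "task_key": "stream",
            "notebook_task": {"notebook_path": "/Repos/team/streaming_job"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
    # Continuous trigger: a new run starts as soon as the previous one
    # finishes, whether it succeeded or failed.
    "continuous": {"pause_status": "UNPAUSED"},
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
```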
Availability: Generally Available – Azure, AWS, GCP
Previously, you could run jobs using notebooks stored in a remote Git repository or a Databricks Repo. Now, you can also use Python code located in a remote Git repository or a Databricks Repo to create and manage production jobs, providing an additional option for automating continuous deployment.
This feature allows for a single source of truth for the job definition process in the remote repository and ensures that each job run is linked to a specific commit hash.
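A hedged sketch of what such a job definition might look like via the Jobs 2.1 REST API: the `git_source` block points at the remote repository, and the task pulls `jobs/etl.py` (an illustrative path) straight from Git.

```python
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "etl-from-git",
    # The job's source code lives in the remote repository; each run is
    # pinned to the commit checked out for that run.
    "git_source": {
        "git_url": "https://github.com/<org>/<repo>",
        "git_provider": "gitHub",
        "git_branch": "main",
    },
    "tasks": [
        {
            "task_key": "run_etl",
            "spark_python_task": {
                "python_file": "jobs/etl.py",  # illustrative path inside the repository
                "source": "GIT",
            },
            "existing_cluster_id": "<cluster-id>",
        }
    ],
}

requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
).raise_for_status()
```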
Databricks Runtime 12.1 introduced the Variable Explorer feature, which makes debugging more efficient. Additionally, the ability to execute SQL Cells in Parallel with other running commands in a notebook was added, which enhances performance and efficiency when working with SQL in Databricks. Support for .ipynb files in Repos is now also available, making it easier to work with and share notebooks in a collaborative environment.
Let’s take a closer look.
Availability: Generally Available on Azure, AWS, GCP
Variable Explorer lets you view the current Python variables directly in the notebook interface. It shows the value and data type of each variable defined in the notebook, along with its shape.
This makes debugging your code easier and more efficient, much like working in your preferred IDE. It's also particularly helpful when you're working across multiple cells but not running all of them.
It can even display variables that were initialized but no longer appear in the notebook because their cells were deleted, since these variables may still be stored on the cluster (in the Python interpreter memory on the driver).
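For example, after running a cell like the one below, the Variable Explorer panel lists each of these names together with its type and value, plus a shape for the array and DataFrame (the variable names are arbitrary illustrations):

```python
import numpy as np
import pandas as pd

# After this cell runs, Variable Explorer shows each variable's name,
# type, value and (where applicable) shape.
threshold = 0.75                                    # float
labels = ["bronze", "silver", "gold"]               # list of length 3
weights = np.random.rand(100, 4)                    # ndarray, shape (100, 4)
df = pd.DataFrame(weights, columns=list("abcd"))    # DataFrame, shape (100, 4)
```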
Availability: Generally Available on Azure, AWS, GCP
The parallel SQL cell execution feature allows the execution of SQL cells concurrently with other running commands in a Notebook on interactive clusters, and it’s compatible with any DBR version.
However, this feature doesn’t support temporary views, UDFs, or the implicit Python DataFrame (_sqldf) for cells executed in parallel, since those cells run in a new session.
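For context, the implicit Python DataFrame mentioned above is the `_sqldf` variable that Databricks notebooks populate with the result of the most recent SQL cell. A sketch of typical (non-parallel) usage, with an illustrative table name:

```python
# In a SQL cell (not run in parallel), for example:
#   %sql
#   SELECT country, SUM(amount) AS total FROM sales GROUP BY country
#
# ...the result is exposed to subsequent Python cells as `_sqldf`,
# an ordinary PySpark DataFrame:
top_countries = _sqldf.orderBy("total", ascending=False).limit(10)
display(top_countries)
```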
Availability: Public Preview on Azure, AWS, GCP
Support for Jupyter notebooks (.ipynb files) is available in Repos. You can clone repositories with .ipynb notebooks, edit them in the Databricks UI, and then commit and push them back as .ipynb notebooks.
Metadata such as the notebook dashboard is preserved. Administrators can control whether outputs can be committed or not.
You can also:
DO YOU HAVE A DATABRICKS CHALLENGE?
WE CAN HELP. GET IN TOUCH TODAY!
Data governance involves the adoption of policies and practices to securely manage an organization’s data assets and ensure they add value and support the business strategy. As the volume and complexity of data continue to increase, data governance is becoming a critical factor in achieving core business outcomes.
With its Unity Catalog and Delta Sharing features, Databricks offers a centralized governance solution for both data and AI. In 2023, Databricks introduced new features such as user-assigned managed identities for accessing storage in Unity Catalog on Azure Databricks, Account SCIM, viewing frequent queries and users of a table in the Unity Catalog Data Explorer, and viewing lineage information for a job.
Availability: Public Preview on Azure
The use of managed identities in Azure resources eliminates the need to manage credentials in code and allows applications to access resources that support Azure AD authentication. Azure manages the identity, so there is no need for users to do so.
A system-assigned managed identity has its lifecycle tied to the resource that created it. It is a special type of service principal created in Azure AD for the identity; when the resource is deleted, Azure automatically deletes that service principal.
A user-assigned managed identity is created by a user and can be assigned to one or more Azure resources. It is also a special type of service principal created in Azure AD for the identity, but it is managed separately from the resources that use it.
In Databricks, you can use managed identities to access storage containers on behalf of Unity Catalog users.
The feature has two primary use cases:
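As a rough sketch of how this surfaces to Unity Catalog users: once an administrator has created a storage credential backed by the user-assigned managed identity (via an Azure Databricks access connector), it can back an external location that users query without ever handling storage keys. The names and paths below are placeholders, and the snippet assumes a Databricks notebook where `spark` is available.

```python
# Placeholder names; assumes a storage credential backed by a
# user-assigned managed identity has already been created by an admin.
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS landing_zone
  URL 'abfss://landing@mystorageaccount.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL my_managed_identity_credential)
""")

# Grant Unity Catalog users read access through the managed identity,
# without distributing any storage account keys.
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION landing_zone TO `data_engineers`")
```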
Availability: Generally Available on Azure, AWS, GCP
The Account SCIM API lets you use an identity provider (IdP) to create users in a Databricks account and give them the proper level of access, and to remove that access (deprovision users) when they leave your organization or no longer need Databricks.
You can either configure one SCIM (System for Cross-Domain Identity Management) provisioning connector from your identity provider to your Databricks account using account-level SCIM provisioning, or configure separate SCIM provisioning connectors for each workspace using workspace-level SCIM provisioning.
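As an illustration, account-level SCIM requests target the account console rather than an individual workspace. A minimal sketch of provisioning a user with the Account SCIM API on AWS; the account ID, token, and user details are placeholders, and the account console host differs on Azure and GCP.

```python
import requests

# Placeholder values -- account-level SCIM provisioning targets the
# accounts console, not a workspace URL.
ACCOUNT_ID = "<databricks-account-id>"
TOKEN = "<account-admin-token>"
BASE = f"https://accounts.cloud.databricks.com/api/2.0/accounts/{ACCOUNT_ID}/scim/v2"

new_user = {
    "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
    "userName": "ada.lovelace@example.com",
    "displayName": "Ada Lovelace",
}

resp = requests.post(
    f"{BASE}/Users",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=new_user,
)
resp.raise_for_status()
print("Provisioned user:", resp.json()["id"])
```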
Availability: Generally Available on Azure, AWS, GCP
In Data Explorer, the Insights tab lets you review the most frequent recent queries and users of any table registered in Unity Catalog, covering query and user activity from the last 30 days.
This information can help answer questions such as
Note that only Databricks SQL queries are included in the queries and users displayed on the Insights tab.
Availability: Generally available on Azure, AWS
If you have Unity Catalog enabled in your Databricks workspace, you can now access lineage information for any Unity Catalog tables used in your workflow.
You will see a link with the count of upstream and downstream tables in the Job details panel for your job, Job run details panel for a job run, or Task run details panel for a task run, indicating the availability of lineage information for your workflow.
This information is captured down to the column level and is supported for all languages. It includes all notebooks, workflows, and dashboards related to the query, and can be visualized in Data Explorer in near-real-time.
Lineage data can be retrieved with the Databricks REST API. Moreover, lineage is aggregated across all workspaces that share the same Unity Catalog metastore, so lineage captured in one workspace is visible in any other workspace sharing that metastore.
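A minimal sketch of pulling table lineage over REST, assuming the lineage-tracking endpoint and an illustrative three-level table name; workspace URL and token are placeholders.

```python
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Request upstream/downstream lineage for a Unity Catalog table
# (three-level name: catalog.schema.table).
resp = requests.get(
    f"{WORKSPACE_URL}/api/2.0/lineage-tracking/table-lineage",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"table_name": "main.sales.orders", "include_entity_lineage": True},
)
resp.raise_for_status()
lineage = resp.json()
print("Upstreams:", lineage.get("upstreams", []))
print("Downstreams:", lineage.get("downstreams", []))
```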
Availability: Generally available on Azure, AWS, GCP
The latest Long-Term Support (LTS) Databricks Runtime is DBR 12.2. Databricks recommends running production jobs on an LTS version. If you need features that are only available in a later runtime release, you can use that release for your production job clusters, but you should switch back to an LTS version as soon as one becomes available that contains all the features you need.
Databricks Runtime 12.2 LTS, powered by Apache Spark 3.3.2, contains various new features and improvements, including support for Delta Lake schema evolution, Structured Streaming workloads on clusters with shared access mode, new Predictive I/O features, implicit lateral column aliasing, a new foreachBatch feature, and standardized connection options for query federation.
This update also includes bug fixes and upgraded Python and R libraries. There are some behavior changes with the new lateral column alias feature, and users are now required to have SELECT and MODIFY privileges.
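One of those additions, implicit lateral column aliasing, lets an alias defined earlier in a SELECT list be reused later in the same list, with no subquery or CTE needed. A small sketch, with illustrative table and column names, assuming a Databricks notebook where `spark` is available:

```python
# Lateral column aliasing (DBR 12.2+): `net_price` is defined and then
# referenced again within the same SELECT list.
df = spark.sql("""
    SELECT
        price * (1 - discount) AS net_price,
        net_price * quantity   AS line_total
    FROM orders
""")
df.show()
```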
Databricks continues to evolve and update, offering new features and conveniences. Keeping up to date with this new functionality ensures that you can manage your data better – and help your teams to do so, too.
Looking ahead to Q2 and beyond, Databricks continues to evolve and develop. Newer 2023 features include Dolly, a cost-effective large language model (LLM); Databricks Connect V2, which lets developers work with Databricks directly from their IDE; and the generally available DBR 13.0, built on Apache Spark 3.4.0, which includes features like cluster metrics and the ability to skip Delta table modifications with Structured Streaming, among many others.
We’ll explore all of these in an upcoming feature. Stay tuned!