Achieving Data Excellence with Databricks Unity Catalog and C5i’s Intelligent Data Management Framework in CPG

Today, organizations rely on data more than ever. Efficient data management enables better decision-making, improves operations, and drives innovation. However, managing large amounts of data isn’t easy. Issues such as inconsistent datasets, poor data quality, data silos, and data complexity must be tackled before organizations can leverage data for a competitive edge. C5i’s Intelligent Data Management Framework (iDMF) addresses the need for robust enterprise data management by combining industry experience with automation driven by ML models.

Databricks emerges as a game-changer, with its Lakehouse as an integrated solution designed to unify traditional data warehouses and scalable data lakes. Using tools like Delta Lake to ensure reliability and Delta Live Tables for streamlined pipelines, Databricks enables companies to manage data efficiently by maintaining data quality, overcoming data fragmentation, and ensuring efficient data processing.

Recently, a leading CPG company decided to take a transformative step towards data management. Our client wanted to set up a powerful data management solution by implementing Databricks Unity Catalog (UC) for its Databricks workspaces. This not only unlocks data excellence but also marks the beginning of a new era in data governance and innovation.

Revolutionizing the Data Management Process

To help our client manage data more efficiently for generating key insights and future innovations, we set out to enable Databricks Unity Catalog as a unified governance layer for their people and organizations workspaces. The objectives of this data management process were to:

1. Enhance data security and governance
Establish a unified framework to manage access to data, permissions, and compliance, including the encryption of Personally Identifiable Information (PII).

2. Streamline data lineage and auditing
Enable comprehensive tracking of data transformations and their usage to improve transparency and support regulatory compliance.

3. Centralize metadata and resource management
Simplify administration, improve data discoverability, and foster collaboration across teams.

Databricks Unity Catalog: The Context Behind Using This Approach

The organization had been focusing on enhancing its data governance and streamlining data accessibility to meet growing analytics and compliance demands. A decentralized approach to managing data assets had often resulted in silos, inconsistencies, and challenges with compliance. To address these, adopting Databricks Unity Catalog (UC) as a unified governance solution across its data landscape became the best feasible option.

Unity Catalog offers a centralized metadata layer that ensures consistency and access control across various data assets. Its integration with Databricks’ Lakehouse platform allowed the organization to establish secure and scalable data practices, while effectively meeting all regulatory requirements.

This function is the heart of this large organization, and maintaining and accessing its data posed several challenges:

  • Engineers could access sensitive PII data unnecessarily, posing security risks.
  • Anonymizing data required resource-heavy manual efforts.
  • Complying with the company’s records management policy on data retention and deletion was cumbersome.
  • Data access reviews were labor-intensive and prone to errors, which diverted valuable bandwidth from value-driven activities.
  • Existing access management relied on service principals and was decentralized, which led to inefficiencies and governance gaps.

With Unity Catalog, the company was able to address these pain points, and move towards a robust, scalable, and secure data ecosystem.

The Transformation: C5i’s Intelligent Data Management Framework (iDMF) Solution

Our customized iDMF solution for automating the migration to Databricks Unity Catalog brought about a paradigm shift in the data management practices of this function. Below is a snapshot of our client’s transformed data management landscape:

| Prior | Now |
| --- | --- |
| Easy access to associate data for engineers | Restricted access through fine-grained access control |
| Manual, time-consuming process to provide the data required to review and audit access | Access review and governance are now quicker and more visible through Databricks |
| Granting access to Databricks users takes time | Faster turnaround in granting access |
| Access is granted to containers and files in Azure Data Lake storage | Access is granted to catalogs, schemas (databases), tables, and views |
| No ability to grant access to specific columns (fields) of data within files | Fine-grained access control at the column (field) level |
| Data engineers cannot find or explore data in a secure manner | Faster and easier for data engineers to find the data they need |
| Complex method for data masking | Easy column masking |
| No ability to add metadata and tags, or categorize the data | Enables capture of metadata, tags, and categories |
| Not able to leverage advanced Databricks features dependent on Unity Catalog | Ability to use advanced Databricks features through Unity Catalog, e.g. GenAI, Lakehouse Monitoring, etc. |
| Difficulty sharing data with internal stakeholders in different regions | Easily share data with users in different regions and those who are not using Databricks |
| ML models are not registered in a centralized repository | Centralized model management enables secure access with fine-grained controls, and ensures transparency through lineage and versioning |
| Custom pipelines to measure data quality and model drift | Out-of-the-box dashboard to monitor data quality and ML model drift |
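
To make the access-control rows above more concrete, here is a minimal sketch of how table grants, a column mask, and a column tag could be expressed as Unity Catalog SQL statements. It only builds the statement strings; running them (for example through spark.sql in a notebook) is left out. The catalog, schema, table, column, function, and group names are hypothetical and do not come from the client's environment, and the masking rule is a simplified assumption.

```python
def quote(identifier: str) -> str:
    """Backtick-quote one identifier part, doubling any embedded backticks."""
    return "`" + identifier.replace("`", "``") + "`"


def full_name(*parts: str) -> str:
    """Join catalog, schema, and object names into one quoted dotted name."""
    return ".".join(quote(part) for part in parts)


def read_grants(catalog: str, schema: str, table: str, group: str) -> list[str]:
    """Statements a group needs to read one table: catalog, schema, then table."""
    return [
        f"GRANT USE CATALOG ON CATALOG {quote(catalog)} TO {quote(group)}",
        f"GRANT USE SCHEMA ON SCHEMA {full_name(catalog, schema)} TO {quote(group)}",
        f"GRANT SELECT ON TABLE {full_name(catalog, schema, table)} TO {quote(group)}",
    ]


def mask_function_ddl(function_fqn: str, privileged_group: str) -> str:
    """SQL UDF that returns the real value only to members of one group."""
    if "'" in privileged_group:
        # Assumption: group names are plain; no string escaping is attempted.
        raise ValueError("group name must not contain a single quote")
    return (
        f"CREATE OR REPLACE FUNCTION {function_fqn}(value STRING) "
        f"RETURN CASE WHEN is_account_group_member('{privileged_group}') "
        f"THEN value ELSE '***' END"
    )


def apply_mask(table_fqn: str, column: str, function_fqn: str) -> str:
    """Attach the mask function to one column of a table."""
    return f"ALTER TABLE {table_fqn} ALTER COLUMN {quote(column)} SET MASK {function_fqn}"


def tag_column(table_fqn: str, column: str, tags: dict[str, str]) -> str:
    """Record classification tags on one column."""
    pairs = ", ".join(f"'{key}' = '{value}'" for key, value in tags.items())
    return f"ALTER TABLE {table_fqn} ALTER COLUMN {quote(column)} SET TAGS ({pairs})"


# Hypothetical objects used only for illustration.
table = full_name("hr_dev", "people", "associates")
mask_fn = full_name("hr_dev", "governance", "mask_text")
statements = [
    *read_grants("hr_dev", "people", "associates", "data_engineers"),
    mask_function_ddl(mask_fn, "hr_admins"),
    apply_mask(table, "email", mask_fn),
    tag_column(table, "email", {"classification": "pii"}),
]
for statement in statements:
    print(statement)
```

Keeping the statements as generated strings makes them easy to review before anything is executed, which suits an access model where grants are made to groups rather than individual users.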

 

Implementation Strategy: Minimizing Risks, Maximizing Efficiency

There were numerous workstreams and hundreds of data pipelines that were migrated to Unity Catalog. Given below are the key considerations we took into account for the migration:

1. Creating New Workspaces: Dedicated development and production workspaces were set up with Unity Catalog enabled.

2. Incremental Migration: Pipelines and data were migrated in phases to minimize disruptions.

3. Reduction of operational risks: Existing pipelines were not impacted, which minimized risks to the company’s operations.

4. Migration optimization: Automation of UC table creation and validation streamlined the migration process.
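
As an illustration of the phased approach in consideration 2, the sketch below groups pipelines into migration waves so that a pipeline moves only after the pipelines it reads from. Ordering waves by dependency is an assumption made for this example, not a description of how the client's phases were actually chosen, and the pipeline names are hypothetical.

```python
from graphlib import TopologicalSorter


def plan_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group pipelines into waves; each wave depends only on earlier waves.

    `dependencies` maps a pipeline name to the set of pipelines it reads from.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable, reviewable plan
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical pipelines and their upstream dependencies.
pipelines = {
    "ingest_sales": set(),
    "ingest_products": set(),
    "clean_sales": {"ingest_sales"},
    "sales_by_product": {"clean_sales", "ingest_products"},
}

waves = plan_waves(pipelines)
for number, wave in enumerate(waves, start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

With hundreds of pipelines, a plan like this would usually be capped at a maximum wave size and reviewed by the owning teams before each phase starts.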

To execute the end-to-end migration to Unity Catalog, the following five-phase approach was followed:

Discover -> Design -> Configure -> Upgrade -> Clean up and Deploy

Discovery

  • Review the existing data lakehouse architecture and data sources
  • Use UCX (a Databricks Labs utility) to evaluate existing ML and analytics workflows, assess the impact of Unity Catalog on existing pipelines and workloads, and create an inventory of data pipelines that require the Unity Catalog upgrade
  • Determine security and compliance requirements
  • Identify sensitive datasets that need fine-grained access control
  • Gather details on existing data assets: Hive metastore tables, Parquet files, and other files
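
The inventory step above can be pictured with a small sketch that sorts discovered tables into upgrade categories and flags tables stored under mount points. This is not UCX output or the C5i framework; the record fields, the category names, and the classification rules are simplified assumptions for illustration, and the sample tables are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRecord:
    database: str
    name: str
    table_type: str   # "MANAGED" or "EXTERNAL"
    data_format: str  # e.g. "delta", "parquet", "csv"
    location: str


def upgrade_path(record: TableRecord) -> str:
    """Pick an illustrative upgrade category for one Hive metastore table."""
    data_format = record.data_format.lower()
    if data_format == "delta":
        return "sync" if record.table_type.upper() == "EXTERNAL" else "clone"
    if data_format in {"parquet", "csv"}:
        return "convert_to_delta"
    return "manual_review"


def build_inventory(records: list[TableRecord]) -> dict[str, list[str]]:
    """Group table names by upgrade category."""
    inventory: dict[str, list[str]] = {}
    for record in records:
        inventory.setdefault(upgrade_path(record), []).append(
            f"{record.database}.{record.name}"
        )
    return inventory


def tables_on_mounts(records: list[TableRecord]) -> list[str]:
    """List tables whose data sits under a DBFS mount point."""
    return [
        f"{record.database}.{record.name}"
        for record in records
        if record.location.startswith("dbfs:/mnt/")
    ]


# Hypothetical discovery results.
records = [
    TableRecord("people", "associates", "MANAGED", "delta",
                "dbfs:/user/hive/warehouse/people.db/associates"),
    TableRecord("people", "org_units", "EXTERNAL", "delta", "dbfs:/mnt/raw/org_units"),
    TableRecord("people", "legacy_events", "EXTERNAL", "parquet",
                "dbfs:/mnt/raw/legacy_events"),
    TableRecord("people", "notes", "EXTERNAL", "json", "dbfs:/mnt/raw/notes"),
]

inventory = build_inventory(records)
mounted = tables_on_mounts(records)
print(inventory)
print(mounted)
```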

Unity Catalog Design

  • Create a detailed design for UC structure, including access connector, storage credentials, external location, metastore, catalog, schema, and naming standards
  • Define security groups that need to be created
  • Identify design automation requirements such as Parquet/CSV to Delta conversion, testing, and sync
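
One way to picture the design deliverables above is a small generator that enforces a naming standard and emits the DDL for an external location, a catalog, and its schemas. The naming rule (lowercase letters, digits, and underscores, with catalogs named domain_environment), the storage URL, and every object name below are hypothetical choices for this sketch rather than the standards used in the engagement.

```python
import re

# Hypothetical naming standard: lowercase, starts with a letter, snake_case.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")


def check_name(name: str) -> str:
    """Return the name unchanged if it follows the standard, else raise."""
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"name does not follow the naming standard: {name!r}")
    return name


def catalog_name(domain: str, environment: str) -> str:
    """Build a catalog name such as people_dev from domain and environment."""
    return check_name(f"{domain}_{environment}")


def design_ddl(
    domain: str,
    environment: str,
    schemas: list[str],
    location_name: str,
    storage_url: str,
    credential_name: str,
) -> list[str]:
    """Emit DDL for one external location, one catalog, and its schemas."""
    catalog = catalog_name(domain, environment)
    statements = [
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {check_name(location_name)} "
        f"URL '{storage_url}' "
        f"WITH (STORAGE CREDENTIAL {check_name(credential_name)})",
        f"CREATE CATALOG IF NOT EXISTS {catalog} MANAGED LOCATION '{storage_url}'",
    ]
    statements += [
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{check_name(schema)}"
        for schema in schemas
    ]
    return statements


# Hypothetical design inputs; the storage URL is a placeholder.
ddl = design_ddl(
    domain="people",
    environment="dev",
    schemas=["bronze", "silver", "gold"],
    location_name="people_dev_location",
    storage_url="abfss://container@account.dfs.core.windows.net/people",
    credential_name="people_dev_credential",
)
for statement in ddl:
    print(statement)
```

Validating names before generating DDL means a naming violation is caught at design time, before any object is created in the metastore.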

Initial Configuration

  • Create new workspaces, configure access, configure metastores, and enable UC for workspaces
  • Provide access to the development team

Upgrade Workload

  • Use UCX and the C5i UC migration framework to convert existing Hive metastore tables and files to Unity Catalog tables
  • Modify existing Databricks notebooks to use Unity Catalog tables as source/target
  • Fix column anomalies by addressing any discrepancies in column names or schemas
  • Automate testing to validate the changes
  • Review and update notebook code to optimize performance and to leverage UC features like fine-grained access controls and lineage tracking
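
The upgrade and validation steps above can be sketched as two helpers: one that picks a SQL statement for moving a Hive metastore table into Unity Catalog, and one that compares source and target schemas to surface column anomalies. This is an assumed, simplified stand-in for what UCX and the C5i migration framework automate; the path labels, catalog name, and table names are hypothetical.

```python
def upgrade_statement(path: str, database: str, table: str, catalog: str) -> str:
    """Build the SQL for one table, keeping the database name as the UC schema."""
    source = f"hive_metastore.{database}.{table}"
    target = f"{catalog}.{database}.{table}"
    if path == "sync":   # external table: register the same data in UC
        return f"SYNC TABLE {target} FROM {source}"
    if path == "clone":  # managed Delta table: copy data and metadata
        return f"CREATE TABLE IF NOT EXISTS {target} DEEP CLONE {source}"
    if path == "ctas":   # other formats: rewrite as a new table
        return f"CREATE TABLE IF NOT EXISTS {target} AS SELECT * FROM {source}"
    raise ValueError(f"unknown upgrade path: {path!r}")


def schema_diff(source: dict[str, str], target: dict[str, str]) -> dict[str, list[str]]:
    """Compare column-name-to-type mappings and report discrepancies."""
    shared = source.keys() & target.keys()
    return {
        "missing_in_target": sorted(source.keys() - target.keys()),
        "unexpected_in_target": sorted(target.keys() - source.keys()),
        "type_mismatch": sorted(
            column for column in shared
            if source[column].lower() != target[column].lower()
        ),
    }


# Hypothetical table and schemas.
statement = upgrade_statement("clone", "people", "associates", "people_dev")
print(statement)

source_schema = {"id": "bigint", "email": "string", "Hire Date": "date"}
target_schema = {"id": "int", "email": "STRING", "hire_date": "date"}
diff = schema_diff(source_schema, target_schema)
print(diff)
```

In this example the renamed column shows up as both a missing and an unexpected column, which is the kind of discrepancy an automated check would flag for a human decision rather than fix silently.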

Clean Up and Deploy

  • Decommission old workspaces, temporary objects, files, clusters, and mount points
  • Deploy to the production environment and validate the changes
  • Train data engineers, data analysts, data scientists, and business users on using UC

Key Outcomes and Additional Benefits

  • Improved Collaboration: Unified governance ensures that all teams work with consistent and trusted data, fostering collaboration across departments.
  • Enhanced Security: Sensitive data is now encrypted, and access is tightly controlled.
  • Operational Efficiency: Reduced manual efforts in governance, freeing up bandwidth for strategic initiatives.
  • Improved Productivity: Teams can now troubleshoot and explore data faster, accelerating time-to-insight.
  • Enterprise Alignment: It is now easier to stay in sync with sibling platforms within the organization, ensuring cross-team compatibility.
  • Future-Ready Platform: The platform can now adopt Databricks’ latest innovations that depend on Unity Catalog.
  • Cost Optimization: Eliminating redundant data pipelines and unnecessary data duplication reduces infrastructure costs.

A Glimpse at the Future

The Unity Catalog implementation is more than just a technical innovation; it’s a strategic leap forward. By centralizing governance, enhancing security, and fostering collaboration, our client is now well poised to scale its data initiatives and embrace the full potential of modern analytics and AI. As customers and organizations evolve, Unity Catalog will continue to serve as the backbone of secure and efficient data operations. This enables large CPG companies to drive innovation and maintain a competitive edge in today’s data-driven world.

Key Takeaways

  • Collaboration is Key: Cross-team alignment ensured a smooth rollout.
  • Start Small, Scale Smart: Phased implementation minimized risks and disruptions.
  • Think Beyond Compliance: Unified governance is not just about security; it’s about boosting productivity and driving innovation.
  • Leverage UCX and C5i iDMF: Using these frameworks ensures a swift and smooth transition to Unity Catalog for enhanced data management.

As the world of modern data continues to evolve, the implementation of Unity Catalog for our client’s data management stands as a hallmark of excellence, a testament to how technology and strategy, when aligned, can redefine possibilities.