Your legal team has signed off on the DPDP compliance policy. Now someone has to make the data platform actually comply. That someone is the data engineer.
India’s Digital Personal Data Protection Rules 2025 were notified on November 13, 2025. The substantive compliance obligations (covering erasure, Data Principal rights, and breach notification) come into force 18 months from notification, landing on May 13, 2027. The gap between today and that date is not breathing room. It is the engineering runway for building deletion workflows, evidence archives, and retention clocks that will stand up to a live investigation by the Data Protection Board of India.
This article is the blueprint. It maps every DPDP erasure obligation directly to Databricks architecture, exposes the compliance traps that no other guide addresses (including the deletion vector purge failure that leaves PII physically intact after a logical delete), and builds a structured, audit-supporting evidence package from first principles.
Whether you run a fintech lakehouse on AWS, an e-commerce platform on Azure, or a SaaS product on GCP, if your Databricks workspaces hold personal data of Indian residents, the obligations below apply to you.
Section 1: The Stakes
Why DPDP Erasure Is Now a Data Engineering Problem
DPDP is not GDPR with an Indian flag. Several obligations are structurally different and technically harder to automate.
The first is the 48-hour pre-erasure notification. For covered platforms in the Third Schedule (e-commerce, online gaming, and social media entities above specified user thresholds), before scheduled automated erasure due to user inactivity, a Data Fiduciary must notify the affected Data Principal at least 48 hours in advance. For a Databricks shop running automated VACUUM pipelines on large platforms, this means the erasure pipeline cannot be a simple cron deletion. It must be a two-phase orchestrated workflow: notify, wait, then physically purge.
The second difference is the inactivity-triggered erasure deadline defined in the Third Schedule. For e-commerce entities with over two crore users, online gaming intermediaries with over fifty lakh users, and social media intermediaries with over two crore users, the purpose of processing is deemed to be no longer served if the Data Principal has not engaged for three years. These entities must then erase the data. This is an erasure deadline, not a minimum retention guarantee.
The third is the mandatory one-year log retention under Rule 8(3). All Data Fiduciaries must retain personal data, associated traffic data, and processing logs for a minimum of one year from the date of processing, for the purposes specified in the Seventh Schedule. These logs are legal evidence in a DPB investigation.
Under the Schedule to the DPDP Act 2023, failure to implement reasonable security safeguards can attract penalties up to Rs 250 crore. Failure to notify the Board or Data Principals of a personal data breach carries up to Rs 200 crore.
Section 2: The Legal Foundation
What DPDP Actually Requires You to Do
2.1 The Retention Obligation (Section 8(7) and Rule 8)
Personal data must be erased as soon as the purpose for which it was collected is no longer being served. Three triggers fire this obligation:
- Consent withdrawal: The Data Principal revokes consent and no overriding legal basis exists.
- Purpose completion: The specified processing purpose has been fulfilled and data is no longer needed.
- Inactivity threshold: For Third Schedule entities, the purpose is deemed no longer served when the Data Principal has not engaged within the prescribed period.
Absent a valid legal basis, explicit consent, or overriding sectoral and statutory retention requirements (such as legal holds), a purpose-end trigger mandates immediate erasure.
2.2 The Third Schedule: Inactivity-Triggered Erasure Deadlines
| Sector | User Volume Threshold | Deemed Purpose-End (Erasure Deadline) |
|---|---|---|
| E-Commerce | 2 crore+ users | 3 years from last transaction or login |
| Online Gaming | 50 lakh+ users | 3 years from last login |
| Social Media | 2 crore+ users | 3 years from last login |
| General (all other fiduciaries) | Any size | When actual purpose ends or consent withdrawn |
These periods define when erasure must occur, not how long data must be kept. Sectoral laws (RBI, SEBI, GST) can mandate longer retention and override the DPDP erasure trigger for regulated records.
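The table above can be sketched as a small lookup that returns the deemed purpose-end date for a Data Principal. This is a minimal illustration, not a legal determination: the sector keys, the function names, and the choice to map 29 February to 28 February are assumptions made for the example, and sectoral retention overrides are not modeled.

```python
from datetime import date

# Thresholds and the three-year period mirror the Third Schedule table above.
# Sector keys and function names are hypothetical.
CRORE = 10_000_000
LAKH = 100_000
THIRD_SCHEDULE_THRESHOLDS = {
    "ecommerce": 2 * CRORE,
    "online_gaming": 50 * LAKH,
    "social_media": 2 * CRORE,
}
INACTIVITY_YEARS = 3


def add_years(d, years):
    """Shift a date by whole years, mapping 29 Feb to 28 Feb when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def inactivity_erasure_deadline(sector, registered_users, last_activity):
    """Return the deemed purpose-end date, or None if no inactivity deadline applies."""
    threshold = THIRD_SCHEDULE_THRESHOLDS.get(sector)
    if threshold is None or registered_users < threshold:
        return None
    return add_years(last_activity, INACTIVITY_YEARS)
```

A return value of None only means the inactivity deadline does not apply; the general rule (erase when the actual purpose ends or consent is withdrawn) still does.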
2.3 The 48-Hour Pre-Erasure Notification (Rule 8)
For Third Schedule entities triggering inactivity-based erasure, a Data Fiduciary must notify the Data Principal at least 48 hours in advance. Automated erasure jobs must operate in two phases: Phase 1 identifies records at threshold, inserts into the erasure queue, and dispatches notification. Phase 2 executes only after 48 hours elapse without re-engagement. Both phases must be timestamped and logged.
2.4 The Log Retention Obligation (Rule 8(3))
Rule 8(3) requires all Data Fiduciaries to retain personal data, associated traffic data, and processing logs for a minimum of one year from the date of processing, for the purposes specified in the Seventh Schedule. After this one-year period, the data and logs must themselves be erased unless a longer retention period is required by law. The Data Protection Board can request these logs under Rule 23 during an investigation.
2.5 Data Processor Liability and the 90-Day DSAR Window
Data Fiduciaries must ensure their Data Processors also delete personal data within DPDP timelines and provide documented proof of deletion. Data Principal erasure requests submitted under Section 12 must be addressed within 90 days (Rule 14).
2.6 The Erasure Documentation Requirement
Deletion dates and proof of erasure must be documented. In practice this means verifiable, datable, queryable evidence that the data was deleted, when it was deleted, and that the underlying storage was physically cleared.
Section 3: The Technical Landscape
How Databricks Stores and Erases Data
3.1 The Soft Delete Problem
When you run a standard DELETE FROM statement on a Delta Lake table without deletion vectors enabled, Delta rewrites the affected data file groups entirely, producing new Parquet files that exclude the deleted rows. The old Parquet files remain physically intact on object storage. Delta Lake’s default data file retention threshold (delta.deletedFileRetentionDuration) is 7 days. The transaction log retention (delta.logRetentionDuration) defaults to 30 days. Until VACUUM runs and physically removes those old files, deleted row data remains on storage and is accessible via time travel:
```sql
-- PII in old Parquet versions is still readable via Time Travel
SELECT * FROM my_table TIMESTAMP AS OF '2025-06-01'
WHERE customer_id = 'dp-00142';
```
3.2 The Deletion Vector Compliance Trap
When deletion vectors are enabled, a DELETE writes a small sidecar bitmap file marking specific row positions as logically absent. The original Parquet file is completely untouched. After DELETE on a deletion-vector-enabled table, both the original Parquet files and the sidecar files remain actively referenced by the current state of the table, making them ineligible for standard VACUUM removal. REORG TABLE APPLY (PURGE) must be executed first to physically rewrite the data blocks and sever these active file references, producing new compacted files that exclude the deleted rows. Only after REORG are the older files unreferenced and eligible for removal by a subsequent zero-hour VACUUM.
Step 1: Execute the logical delete.
```sql
DELETE FROM catalog.schema.customer_profiles
WHERE customer_id = 'dp-00142';
```
Step 2: Run REORG TABLE APPLY (PURGE) to physically rewrite data files and sever active references.
```sql
REORG TABLE catalog.schema.customer_profiles
APPLY (PURGE);
```
Step 3: Run VACUUM to remove all unreferenced files.
```sql
-- Disable the safety check to allow immediate purge (use with caution in production)
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM catalog.schema.customer_profiles RETAIN 0 HOURS;
-- Re-enable after purge
SET spark.databricks.delta.retentionDurationCheck.enabled = true;
```
After this three-step sequence, no readable form of the deleted PII remains within the Delta table storage layer, provided that cloud object-store bucket versioning is disabled, shallow or deep table clones are addressed, and the Databricks disk cache is cleared on cluster restart. Object-store soft-delete features, automated cloud backups, raw ingestion environments (Kafka topics, S3 landing zones), and downstream exports are separate persistence vectors that must be addressed independently.
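Because the REORG step is required only when deletion vectors are enabled, an erasure job has to choose the statement sequence per table. The sketch below builds that ordered list as plain strings; the function name and its arguments are hypothetical, and executing the statements (for example through spark.sql) is left to the caller. The predicate is interpolated directly, so it is assumed to come from a validated, trusted source such as the erasure registry, never from free-form user input.

```python
def erasure_statements(table, predicate, deletion_vectors_enabled):
    """Return the ordered SQL statements for a physical purge of one table."""
    statements = [f"DELETE FROM {table} WHERE {predicate}"]
    if deletion_vectors_enabled:
        # REORG rewrites files that are still referenced through deletion vectors,
        # so that the originals become unreferenced and eligible for VACUUM.
        statements.append(f"REORG TABLE {table} APPLY (PURGE)")
    statements += [
        "SET spark.databricks.delta.retentionDurationCheck.enabled = false",
        f"VACUUM {table} RETAIN 0 HOURS",
        "SET spark.databricks.delta.retentionDurationCheck.enabled = true",
    ]
    return statements


plan = erasure_statements(
    "catalog.schema.customer_profiles", "customer_id = 'dp-00142'", True
)
for statement in plan:
    print(statement)
```

Keeping the plan as data also makes it easy to store the exact statements alongside the evidence for each request.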
3.3 Standard Table Erasure Sequence (No Deletion Vectors)
```sql
-- Step 1: Logical delete (Delta rewrites affected file groups; old files remain on storage)
DELETE FROM catalog.schema.orders_bronze
WHERE customer_id = 'dp-00142';

-- Step 2: Configure retention to allow immediate physical purge
ALTER TABLE catalog.schema.orders_bronze
SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 1 year',
  'delta.deletedFileRetentionDuration' = 'interval 0 hours'
);

-- Step 3: Physical purge
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM catalog.schema.orders_bronze RETAIN 0 HOURS;
SET spark.databricks.delta.retentionDurationCheck.enabled = true;
```
3.4 The Erasure Registry Pattern
This table is modeled as a standard mutable Delta table secured via Unity Catalog access grants (not append-only, as the workflow updates status columns through the lifecycle):
```sql
CREATE TABLE IF NOT EXISTS compliance.dpdp.erasure_requests (
  request_id STRING NOT NULL,
  data_principal_id STRING NOT NULL,
  request_source STRING,
  purpose_id STRING,
  tables_affected ARRAY<STRING>,
  request_timestamp TIMESTAMP NOT NULL,
  notification_sent TIMESTAMP,
  erasure_executed TIMESTAMP,
  status STRING,
  evidence_path STRING,
  legal_hold BOOLEAN DEFAULT FALSE,
  legal_hold_basis STRING
)
USING DELTA;

GRANT SELECT ON TABLE compliance.dpdp.erasure_requests
  TO `compliance-investigators`;
GRANT MODIFY ON TABLE compliance.dpdp.erasure_requests
  TO `compliance-service-principal@company.com`;

-- Validate DENY privilege syntax in your workspace before deploying.
-- DENY comes from legacy table access control and may be rejected for
-- Unity Catalog securables; in that case, simply never grant MODIFY to this group.
DENY MODIFY ON TABLE compliance.dpdp.erasure_requests
  TO `data_engineers`;
```
Unity Catalog grants prevent unauthorized modifications while allowing the compliance workflow to update status, notification_sent, and erasure_executed as the lifecycle progresses.
3.5 Medallion Propagation: Cascading Deletes Across Bronze, Silver, and Gold
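```python
import dlt
from pyspark.sql.functions import col, expr

# The target streaming table must exist before apply_changes can write to it
dlt.create_streaming_table("silver_customer_profiles")

dlt.apply_changes(
    target = "silver_customer_profiles",
    # Name of a streaming source in this pipeline, assumed to expose the
    # change data feed columns _change_type and _commit_timestamp
    source = "bronze_customer_raw",
    keys = ["customer_id"],
    sequence_by = col("_commit_timestamp"),
    apply_as_deletes = expr("_change_type = 'delete'")
)
```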
Delta Lake erasure covers only Databricks-managed storage. Personal data may also exist in Kafka topics, raw S3 or ADLS landing zones, and upstream operational databases. Those sources must be addressed separately. For the complete VACUUM configuration reference and streaming delete propagation patterns, read Managing Delta Lake VACUUM and Time Travel for DPDP Right to Erasure Compliance.
Section 4: Retention Policy Architecture
Managing the Clock
4.1 Aligning Delta Properties to DPDP Obligations
The default delta.deletedFileRetentionDuration is 7 days. For DPDP PII tables requiring on-demand forced erasure, this must be overridden to zero hours. Do not apply global VACUUM schedules to compliance-sensitive tables. Manage retention at the table level, driven by the erasure registry.
4.2 Purpose-Based Retention Engine
```sql
ALTER TABLE catalog.schema.customer_profiles
SET TAGS (
  'dpdp_purpose' = 'account_management',
  'dpdp_retention_years' = '3',
  'dpdp_retention_start_event' = 'last_login',
  'dpdp_sector' = 'ecommerce',
  'pii_class' = 'direct_identifier'
);
```
4.3 Handling Legal Hold Overrides
```sql
ALTER TABLE catalog.schema.transaction_records
SET TAGS (
  'legal_hold' = 'true',
  'legal_hold_basis' = 'RBI_Master_Direction_KYC_2016',
  'legal_hold_expiry' = '2030-03-31'
);
```
The erasure workflow must check for legal_hold = 'true' before executing any deletion. Tables carrying this tag are skipped, and the request is set to status 'held'.
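The hold check can be sketched as a pure function over tag dictionaries, which keeps it testable outside a cluster. The tag names match the ALTER TABLE example above; the function names are hypothetical, and treating a hold as lapsed once legal_hold_expiry has passed is an assumption of this sketch that should be confirmed with counsel. How the tags are read (for example from the Unity Catalog information schema) is left out.

```python
from datetime import date


def hold_is_active(tags, today):
    """True when the table tags describe a legal hold that has not expired."""
    if tags.get("legal_hold", "false").lower() != "true":
        return False
    expiry = tags.get("legal_hold_expiry")
    # Assumption: a hold without an expiry date is treated as open-ended.
    return expiry is None or today <= date.fromisoformat(expiry)


def split_by_hold(table_tags, today):
    """Partition table names into (erasable, held) lists based on their tags."""
    erasable, held = [], []
    for table, tags in table_tags.items():
        (held if hold_is_active(tags, today) else erasable).append(table)
    return erasable, held
```

If the held list is non-empty, the workflow would mark the request 'held' and record the legal_hold_basis rather than proceeding to deletion on those tables.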
4.4 The 48-Hour Notification Workflow
Phase 1: A daily job queries erasure_requests for 'pending' rows, dispatches notifications, and updates the notification_sent timestamp.
Phase 2: After the 48-hour offset elapses, the job executes DELETE, REORG, VACUUM, captures DESCRIBE HISTORY, and updates erasure_executed and status = 'completed'.
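The decision the daily job makes for each registry row can be expressed as one small function. The sketch below assumes rows are dictionaries carrying the notification_sent and legal_hold columns from the registry table; the action labels and the last_engagement argument (used to honor the "without re-engagement" condition from Section 2.3) are hypothetical names introduced for illustration.

```python
from datetime import datetime, timedelta, timezone

NOTICE_WINDOW = timedelta(hours=48)


def next_action(row, now, last_engagement=None):
    """Decide what the workflow should do with one erasure_requests row."""
    if row.get("legal_hold"):
        return "hold"      # skip deletion and set status 'held'
    sent = row.get("notification_sent")
    if sent is None:
        return "notify"    # Phase 1: dispatch notice, stamp notification_sent
    if last_engagement is not None and last_engagement > sent:
        return "cancel"    # the Data Principal re-engaged after the notice
    if now - sent < NOTICE_WINDOW:
        return "wait"      # 48-hour buffer has not fully elapsed
    return "erase"         # Phase 2: DELETE, REORG, VACUUM, capture history


sent = datetime(2025, 6, 7, 15, 2, 17, tzinfo=timezone.utc)
now = datetime(2025, 6, 10, 3, 15, 18, tzinfo=timezone.utc)
print(next_action({"notification_sent": sent, "legal_hold": False}, now))  # erase
```

Comparing timezone-aware timestamps, as here, avoids a silent off-by-hours error when the cluster and the notification service run in different zones.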
Section 5: Generating Audit Evidence
Building a Structured, Defensible Evidence Package
5.1 What the Regulator Will Ask For
- Was the Data Principal’s personal data held, and for what stated purpose?
- What event triggered the erasure obligation?
- Was a 48-hour notification sent before deletion, and when?
- When was the DELETE operation executed and by which identity?
- Was the deletion physically complete, not merely logical?
- Are processing logs preserved for the mandatory one-year retention period?
5.2 Evidence Artifact 1: The Erasure Registry Query
```sql
SELECT
  request_id,
  SHA2(data_principal_id, 256) AS principal_hash,
  request_source, purpose_id, tables_affected,
  request_timestamp, notification_sent,
  TIMESTAMPDIFF(HOUR, request_timestamp, notification_sent)
    AS hours_to_notification,
  erasure_executed,
  TIMESTAMPDIFF(HOUR, notification_sent, erasure_executed)
    AS hours_from_notification_to_erasure,
  status, evidence_path
FROM compliance.dpdp.erasure_requests
WHERE data_principal_id = 'dp-00142'
  AND status = 'completed'
ORDER BY erasure_executed DESC;
```
5.3 Evidence Artifact 2: Delta DESCRIBE HISTORY
Delta Lake’s versioned, append-oriented transaction history records every operation with timestamps and operation parameters. DESCRIBE HISTORY exposes this as a queryable table:
```sql
DESCRIBE HISTORY catalog.schema.customer_profiles;
```
| version | timestamp | operation | operationParameters |
|---|---|---|---|
| 47 | 2025-06-10 03:12:44 | DELETE | predicates: ["customer_id = 'dp-00142'"] |
| 48 | 2025-06-10 03:14:02 | REORG | applyPurge: true |
| 49 | 2025-06-10 03:15:18 | VACUUM END | numDeletedFiles: 3, numVacuumedDirectories: 1 |
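A history snapshot is only useful as evidence if the operations appear in the right order. The sketch below checks that DELETE, then REORG (when required), then VACUUM END occur in ascending version order. It assumes the DESCRIBE HISTORY rows have already been collected into dictionaries with version and operation keys; the function name is hypothetical.

```python
def erasure_sequence_complete(history, require_reorg=True):
    """Check that DELETE, (REORG,) VACUUM END appear in ascending version order."""
    expected = ["DELETE"]
    if require_reorg:
        expected.append("REORG")
    expected.append("VACUUM END")

    ordered_ops = [
        row["operation"] for row in sorted(history, key=lambda r: r["version"])
    ]
    position = 0
    for op in ordered_ops:
        # Advance only when the next expected operation is seen; other
        # operations (writes, OPTIMIZE, VACUUM START) are ignored.
        if position < len(expected) and op == expected[position]:
            position += 1
    return position == len(expected)


snapshot = [
    {"version": 47, "operation": "DELETE"},
    {"version": 48, "operation": "REORG"},
    {"version": 49, "operation": "VACUUM END"},
]
print(erasure_sequence_complete(snapshot))  # True
```

This confirms ordering only. It does not prove that the DELETE predicate targeted the right Data Principal, so the operationParameters column should be archived with the snapshot.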
5.4 Evidence Artifact 3: Unity Catalog Audit Logs
The system.access.audit table (currently in Public Preview) captures who ran what operation, on which resource, from which IP. Action name values should be validated in your workspace before building production pipelines:
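```sql
-- user_identity and response are structs and request_params is a map,
-- so fields are read with dot and bracket access. The request_params key
-- varies by action; confirm 'tableName' in your workspace.
SELECT event_time, user_identity.email AS executed_by,
       service_name, action_name,
       request_params['tableName'] AS table_name,
       source_ip_address, response.status_code AS status_code
FROM system.access.audit
WHERE event_time BETWEEN '2025-06-10T03:00:00' AND '2025-06-10T04:00:00'
  AND request_params['tableName'] = 'catalog.schema.customer_profiles'
ORDER BY event_time ASC;
```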
For the full query library, daily export pipeline, and immutable archive architecture, read How to Use Unity Catalog Audit Logs for DPDP Deletion and Audit Evidence.
5.5 Evidence Artifact 4: Storage-Level Verification
```python
table_location = spark.sql(
    "DESCRIBE DETAIL catalog.schema.customer_profiles"
).select("location").collect()[0][0]

# List remaining files using native Databricks API
# (top level only; recurse into subdirectories for partitioned tables)
files = dbutils.fs.ls(table_location)
print(f"Storage verification at {table_location}:")
for f in files:
    print(f.path, f.size)
print(f"Verification timestamp: {spark.sql('SELECT current_timestamp()').collect()[0][0]}")
```
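Printed output is hard to archive as evidence, so the listing is better captured as a manifest with a content digest. The sketch below assumes each listed file has been converted to a dictionary with path and size keys (for example from the objects returned by dbutils.fs.ls); the function name and manifest fields are hypothetical.

```python
import hashlib
import json


def build_manifest(files, captured_at):
    """Summarize a file listing as a manifest with an order-independent digest."""
    # Sorting makes the digest independent of the order the listing came back in.
    entries = sorted((f["path"], f["size"]) for f in files)
    digest = hashlib.sha256(json.dumps(entries).encode("utf-8")).hexdigest()
    return {
        "captured_at": captured_at,
        "file_count": len(entries),
        "files": entries,
        "sha256": digest,
    }


manifest = build_manifest(
    [
        {"path": "s3://bucket/tbl/part-0001.parquet", "size": 1024},
        {"path": "s3://bucket/tbl/_delta_log/", "size": 0},
    ],
    captured_at="2025-06-10T03:16:00Z",
)
print(manifest["file_count"], manifest["sha256"][:12])
```

The manifest shows which files remained at capture time. It does not by itself prove that a given row is absent from them, so it complements the DESCRIBE HISTORY snapshot rather than replacing it.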
5.6 Assembling the DPDP Erasure Evidence Package
```text
DPDP Erasure Evidence Package
==============================
Request ID        : ER-2025-00142
Data Principal    : dp-00142 [SHA-256 hash stored in registry]
Request Source    : Data Principal DSAR (Section 12)
Purpose ID        : account_management
Request Received  : 2025-06-07 14:30:00 UTC
Notification Sent : 2025-06-07 15:02:17 UTC (via email, logged)
Erasure Executed  : 2025-06-10 03:15:18 UTC
Hours Elapsed     : 60.2 hours (48-hour window: HONORED)
Tables Affected:
  - catalog.bronze.customer_raw
  - catalog.silver.customer_profiles
  - catalog.gold.customer_segments (rows recomputed)
Evidence Artifacts:
  [1] Erasure Registry Row   : compliance.dpdp.erasure_requests
  [2] Delta History Snapshot : s3://compliance-archive/history/ER-2025-00142/
  [3] UC Audit Log Export    : s3://compliance-archive/audit/ER-2025-00142/
  [4] Storage Verification   : s3://compliance-archive/verify/ER-2025-00142/
  [5] Notification Log       : s3://compliance-archive/notify/ER-2025-00142/
Log Retention Expiry : 2026-06-08 (1-year from date of processing)
Retained By          : compliance-automation-service-principal
```
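The "Hours Elapsed" line is worth computing rather than typing, since it is the figure an investigator is most likely to recheck by hand. A minimal sketch, with a hypothetical function name and illustrative timestamps:

```python
from datetime import datetime


def notice_window_line(notification_sent, erasure_executed, window_hours=48):
    """Render the elapsed-hours line and state whether the notice window held."""
    elapsed = (erasure_executed - notification_sent).total_seconds() / 3600
    verdict = "HONORED" if elapsed >= window_hours else "VIOLATED"
    return f"Hours Elapsed : {elapsed:.1f} hours ({window_hours}-hour window: {verdict})"


print(notice_window_line(
    datetime(2025, 6, 7, 15, 2, 17),
    datetime(2025, 6, 10, 3, 15, 18),
))
# Hours Elapsed : 60.2 hours (48-hour window: HONORED)
```

Generating the line from the registry timestamps means a window that was not honored is reported as such instead of being papered over in the package.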
5.7 One-Year Immutable Log Architecture
Unity Catalog system tables have a 365-day native retention window per event. To satisfy Rule 8(3) and ensure logs are never at risk from schema changes or preview-feature gaps, an explicit archival pipeline is required:
- Export system.access.audit to a dedicated Delta table (compliance.logs.uc_audit_archive) on a daily schedule.
- Use Unity Catalog DENY MODIFY grants to comprehensively block unauthorized INSERT, UPDATE, DELETE, and MERGE operations by non-service-principal identities. Validate privilege syntax in your workspace before deploying, as behavior can vary by securable and metastore setup.
- Back the archive to AWS S3 Object Lock (Compliance mode), Azure Immutable Blob Storage, or GCS Bucket Lock with 366-day minimum retention.
Section 6: Pseudonymization
Does It Satisfy DPDP Erasure?
DPDP Section 12 grants Data Principals the explicit right to request erasure. DPDP does not codify pseudonymization as an equivalent or acceptable substitute for erasure. GDPR offers no shortcut here either: pseudonymized data remains personal data under that regime. For any explicit Data Principal erasure request, complete physical deletion is the only unambiguous path to compliance under the current DPDP framework.
Section 7: Common Mistakes and How to Avoid Them
Mistake 1: Treating Delta DELETE as physical erasure. A DELETE rewrites affected file groups and marks old files as removed, but those old Parquet files remain on object storage until VACUUM runs.
Mistake 2: Missing the REORG step on deletion-vector tables. On deletion-vector tables, both the original Parquet files and the sidecar remain actively referenced. REORG must run first to sever those references before VACUUM can remove the old files.
Mistake 3: Forgetting raw files in upstream ingestion layers. Kafka topics, raw S3 landing zones, and upstream databases are not cleared by Delta VACUUM. Address these sources independently.
Mistake 4: Leaving derived personal data unmonitored in Gold tables. Use Unity Catalog column-level lineage to trace where every PII field propagates.
Mistake 5: Relying on manual processes for the 48-hour notification. Automate within Databricks Workflows with timestamped logging of every dispatched notification.
Mistake 6: Trusting ephemeral system tables without archiving them. system.access.audit is in Public Preview. Export and archive audit logs daily to an immutable external store.
Mistake 7: Applying a single global VACUUM schedule to all tables. Retention must be managed at the table level, driven by metadata.
Mistake 8: No documented deletion evidence from downstream Data Processor workspaces. Obtain proof of deletion from all Data Processors via contractual DPA clauses and API-level deletion confirmation.
Section 8: Implementation Checklist
DPDP Erasure Readiness on Databricks
- Tag all personal data in Unity Catalog with purpose, PII class, sector, and inactivity period.
- Document the processing purpose for every dataset with a defined start event.
- Identify your Third Schedule sector and configure inactivity-triggered erasure deadlines in table-level tags.
- Audit all tables for deletion vector status using DESCRIBE DETAIL.
- Create the erasure_requests registry table as a mutable, access-controlled Delta table secured via DENY grants.
- Build the two-phase erasure Workflow: Phase 1 sends 48-hour notifications; Phase 2 executes physical deletion.
- Configure delta.deletedFileRetentionDuration = 'interval 0 hours' on erasure-completed tables before VACUUM.
- Set up the daily audit log export pipeline from system.access.audit to an immutable archive backed by object-locked storage.
- Archive DESCRIBE HISTORY snapshots for each table at the time of erasure.
- Implement delete propagation from Bronze through Silver to Gold using CDF or Lakeflow apply_changes.
- Address upstream source deletion for Kafka topics, raw S3 or ADLS landing zones, and upstream databases.
- Apply legal hold tags (legal_hold=true) to all datasets subject to RBI, SEBI, or other sectoral retention obligations.
- Assemble the DPDP Erasure Evidence Package for every completed request using the five-artifact structure in Section 5.6.
- Test the evidence package against the inquiry question framework in Section 5.1 before enforcement goes live.
Section 9: Conclusion
Compliance Is an Engineering Deliverable, Not a Policy Document
DPDP erasure compliance on Databricks is not satisfied by a policy PDF or a legal sign-off. It is satisfied by an automated, auditable, evidence-generating data engineering system that operates continuously, scales with your Data Principal volume, and produces a structured evidence package before anyone ever asks for one.
Disclaimer: This article provides technical architecture and implementation guidance only and does not constitute formal legal advice. Organizations should consult qualified legal counsel to assess their specific compliance obligations under the DPDP Act 2023 and applicable sectoral regulations.
Frequently Asked Questions
DPDP Rule 8 requires Data Fiduciaries to erase personal data as soon as the processing purpose is no longer served, whether due to consent withdrawal, purpose completion, or the inactivity-triggered deadline defined in the Third Schedule. For Third Schedule platform classes, the fiduciary must notify the Data Principal at least 48 hours before scheduled inactivity-based erasure. All Data Fiduciaries must also retain personal data, traffic data, and processing logs for a minimum of one year from the date of processing.
A Delta DELETE on its own does not physically erase personal data. A standard DELETE rewrites affected data file groups and marks old files as removed, but those old files remain on object storage until VACUUM physically removes them. For deletion-vector-enabled tables, only a sidecar bitmap is written, leaving the original Parquet data entirely intact. REORG TABLE APPLY (PURGE) followed by VACUUM is required for those tables.
The Third Schedule defines inactivity-triggered erasure deadlines for specified fiduciary classes. E-commerce, online gaming, and social media entities above the specified thresholds must erase personal data when a Data Principal has been inactive for three years. This is an erasure deadline, not a minimum retention guarantee. Sectoral laws (RBI, SEBI, GST) can mandate longer retention for specific regulated data categories.
A structured DPDP Erasure Evidence Package includes five artifacts: the erasure registry row; a DESCRIBE HISTORY snapshot showing DELETE, REORG, and VACUUM operations in sequence; a Unity Catalog audit log export; a storage-level verification; and a timestamped notification log. This package is designed to support a DPB inquiry and should be reviewed by qualified legal counsel before reliance in a formal regulatory context.
Rule 8 of the DPDP Rules 2025 requires covered Data Fiduciaries to notify the Data Principal at least 48 hours before their personal data is scheduled for inactivity-based erasure. Erasure workflows must operate in two phases: a notification phase that dispatches the alert and records the timestamp, and a deletion phase that only executes after the 48-hour buffer has fully elapsed.
Rule 8(3) requires all Data Fiduciaries to retain personal data, associated traffic data, and processing logs for a minimum of one year from the date of processing. After this period, the data and logs must themselves be erased unless a longer retention period is required by law. In Databricks, this means daily export of Unity Catalog system table events to an immutable archive backed by object-locked cloud storage with a minimum 366-day retention period.
A defensible DPDP compliance record requires: the Data Principal’s identity and erasure trigger; proof of 48-hour pre-erasure notification where applicable; the DELETE operation record from the Delta versioned transaction history; the REORG and VACUUM operation records confirming physical file removal; and an audit log record corroborating the identity and timestamp of the deletion command. These artifacts, retained for one year from the date of processing, constitute the complete technical audit trail. This architecture guidance does not substitute for legal review.