Solving PrimeInsurance Data Challenges at the Databricks Hackathon

Our team recently joined the Databricks Hackathon to fix a fragmented data architecture for PrimeInsurance that made basic business reporting impossible. 

The Problem: Legacy Complexity

PrimeInsurance faced a situation common to many large enterprises that grow through acquisitions. After absorbing multiple regional players, they were left with a technical environment that was impossible to manage effectively. The primary issues included:

  • 14 Disconnected Data Sources: Information was trapped in separate databases that could not communicate with each other.
  • 6 Different Regions: Each region operated under its own set of rules, formats, and data standards.
  • Data Redundancy: Without a central system, the same customer often appeared multiple times across different regional databases, leading to inaccurate reporting.

Questions Leadership Could Not Answer

Because the systems were not integrated, the management team lacked the visibility needed to answer basic operational questions:

  • What is our actual customer count? (Inflated numbers due to duplicate records made it impossible to identify unique clients).
  • Why is the claims cycle taking 18 days? (A lack of tracking across systems hid the specific points where delays were occurring).
  • Which assets need immediate attention? (Disconnected inventory data meant that some cars remained unsold for months without being flagged).

The Technical Challenge

The difficulty was not the volume of data, but the total lack of consistency. Fields had different meanings in different regions, formats were mismatched, and critical information was often missing or incorrectly entered. We recognized that jumping straight to AI or advanced dashboards would be a mistake; we first had to fix the underlying data.

In this article, we break down our strategy for unifying this data. We will cover how we cleaned the thousands of broken records, handled the “unseen work” of data engineering, and eventually built an intelligent layer that allows users to get instant answers in plain English.

Phase 1: Making Sense of the Mess

The first step was to identify why the data was so unreliable. When we looked at the 14 different sources, we found that “Customer ID” in one region did not match the format of “Customer ID” in another. Even the field names for simple things like “Claim Date” or “Vehicle Status” were inconsistent.
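The field-name mismatch described above can be illustrated with a small mapping table. This is only a sketch: the region names ("north", "south") and the raw column names are hypothetical, since the actual source schemas are not reproduced here.

```python
# Hypothetical per-region mappings from local column names to one canonical schema.
# Region names and raw column names are illustrative assumptions, not the real sources.
FIELD_MAPS = {
    "north": {"CustID": "customer_id", "ClaimDt": "claim_date", "VehStatus": "vehicle_status"},
    "south": {"customer_no": "customer_id", "date_of_claim": "claim_date", "status": "vehicle_status"},
}


def to_canonical(region, record):
    """Rename a raw record's keys to canonical names.

    Keys with no known mapping are kept under 'extra' so nothing is silently dropped.
    """
    mapping = FIELD_MAPS[region]
    canonical = {"source_region": region, "extra": {}}
    for key, value in record.items():
        if key in mapping:
            canonical[mapping[key]] = value
        else:
            canonical["extra"][key] = value
    return canonical


north_row = to_canonical("north", {"CustID": "C-001", "ClaimDt": "2024-01-05", "agent": "x"})
south_row = to_canonical("south", {"customer_no": "17", "status": "unsold"})
```

Once both rows pass through the mapping, they expose the same `customer_id` key regardless of which region they came from, which is the precondition for any later deduplication.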

The most significant impact of this mess was on the customer list. On paper, PrimeInsurance appeared to have 3,605 customers. After running a unification and deduplication process, we found that the real number was only 1,604, a reduction of 2,001 records, or more than half of the original list.

This reduction was a major win. It provided the business with a “defensible” number, a figure that leadership could finally trust for audits, marketing, and financial planning.
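A minimal sketch of how a deduplication pass like this can work is shown below. The matching rule (normalized name plus lowercased email) and the field names are assumptions for illustration; the matching logic actually used at the hackathon is not described in this article.

```python
def match_key(record):
    """Build a comparison key from a customer record.

    Assumed rule: two records are the same customer when the whitespace-normalized,
    lowercased name and the lowercased email both match.
    """
    name = " ".join(record["name"].lower().split())
    email = record["email"].strip().lower()
    return (name, email)


def deduplicate(records):
    """Keep the first record seen for each match key, preserving input order."""
    unique = {}
    for record in records:
        unique.setdefault(match_key(record), record)
    return list(unique.values())


raw_customers = [
    {"name": "Asha  Rao", "email": "asha@example.com", "region": "north"},
    {"name": "asha rao", "email": " ASHA@example.com", "region": "south"},
    {"name": "Ben Cole", "email": "ben@example.com", "region": "south"},
]
unique_customers = deduplicate(raw_customers)
print(len(raw_customers), "raw records ->", len(unique_customers), "unique customers")
```

Keeping the first record per key is the simplest survivorship rule. A production pipeline would more likely merge fields from all matching records and keep a link back to each source row for auditability.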

Phase 2: Doing the Unseen Work

Most organizations want to talk about AI and predictive modeling, but at the hackathon, we spent the majority of our time on data engineering. This is the “unseen work” that makes or breaks a project. Our focus was on:

  1. Standardizing Formats: Ensuring that dates, currencies, and addresses were consistent across all 14 sources.
  2. Handling Missing Values: Creating logic to deal with records that were incomplete or had corrupted fields.
  3. Catching Bad Records: Building filters to prevent “junk” data from entering the final consolidated table.

We followed a simple principle: if the foundation is shaky, every decision made on top of it will be wrong. We built a system that could ingest raw data from different sources and output a clean, unified record for every customer, claim, and vehicle.
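The three focus areas above can be sketched in a few lines of plain Python. The accepted date formats, the field names, and the "UNKNOWN" default are assumptions chosen for the example, not the rules we actually shipped.

```python
from datetime import datetime

# Assumed set of date formats seen across sources; a real list would come from profiling.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")


def parse_date(value):
    """Standardize a date string to a date object, or return None if it cannot be parsed."""
    if value is None or not str(value).strip():
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(str(value).strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean_claims(raw_claims):
    """Split claims into clean rows and rejected rows.

    A claim is rejected when it has no claim_id or no parseable claim_date.
    A missing region is filled with the placeholder 'UNKNOWN' instead of being rejected.
    """
    clean, rejected = [], []
    for claim in raw_claims:
        claim_date = parse_date(claim.get("claim_date"))
        if not claim.get("claim_id") or claim_date is None:
            rejected.append(claim)
            continue
        clean.append({**claim, "claim_date": claim_date, "region": claim.get("region") or "UNKNOWN"})
    return clean, rejected


clean_rows, rejected_rows = clean_claims([
    {"claim_id": "A1", "claim_date": "2024-01-05", "region": None},
    {"claim_id": "A2", "claim_date": "05/01/2024", "region": "east"},
    {"claim_id": "", "claim_date": "2024-01-05", "region": "east"},
    {"claim_id": "A4", "claim_date": "not a date", "region": "west"},
])
```

Rejected rows are returned rather than discarded, so they can be written to a quarantine table and reviewed instead of disappearing from the record.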

Phase 3: Architecture for Action

Our architecture was designed as an end-to-end flow, moving from raw data to a business decision without manual intervention.

  • The Ingestion Layer: We pulled data from the 14 disparate sources into a central location on the Databricks platform.
  • The Cleaning Layer: This is where we handled the deduplication and format correction mentioned above.
  • The Gold Layer: This contained the final, high-quality data that was ready for analysis.

By structuring the data this way, we ensured that the final outputs were not just numbers on a screen, but actionable insights that the business could use immediately.
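The layered flow can be sketched as three small functions chained together. This is a plain-Python illustration of the idea only; the function names, field names, and the choice of "customers per region" as the gold output are assumptions, and the real pipeline ran on the Databricks platform rather than on in-memory lists.

```python
from collections import Counter


def ingest(sources):
    """Ingestion layer: collect rows from every source and tag each with its origin."""
    return [{**row, "_source": name} for name, rows in sources.items() for row in rows]


def clean(rows):
    """Cleaning layer: drop rows without a customer_id and keep one row per customer."""
    seen, result = set(), []
    for row in rows:
        customer_id = row.get("customer_id")
        if customer_id and customer_id not in seen:
            seen.add(customer_id)
            result.append(row)
    return result


def gold(rows):
    """Gold layer: an analysis-ready summary, here the number of customers per region."""
    return dict(Counter(row["region"] for row in rows))


def run_pipeline(sources):
    """Run all three layers in order with no manual step in between."""
    return gold(clean(ingest(sources)))


summary = run_pipeline({
    "system_a": [{"customer_id": "1", "region": "north"}, {"customer_id": None, "region": "north"}],
    "system_b": [{"customer_id": "1", "region": "north"}, {"customer_id": "2", "region": "south"}],
})
```

Each layer only reads the output of the layer before it, so a problem in the final numbers can be traced back one step at a time.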

Phase 4: Adding Intelligence After Integrity

Once we knew the data was accurate, we added a layer of intelligence to make the information usable for non-technical employees. We integrated Gen AI to bridge the gap between complex databases and business users.

Instead of requiring a manager to write SQL or wait for a data analyst to build a report, we enabled “Natural Language Queries.” This meant a user could simply ask:

  • “Which claims are currently overdue?”
  • “Show me the cars that haven’t moved in 60 days.”

The system would then translate these questions into queries, find the answers in our clean “Gold Layer,” and provide a narrative response. We also added a flagging system that automatically highlighted 276 problematic claims and 128 high-risk files that required urgent human review.
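The two example questions map to simple rules over the clean data. The sketch below shows only those rules, not the natural-language translation step. The field names are hypothetical, the 60-day threshold comes from the example question, and the 18-day default for "overdue" is an assumption borrowed from the claims cycle time mentioned earlier.

```python
from datetime import date


def idle_vehicles(vehicles, today, min_idle_days=60):
    """Return unsold vehicles that have not moved for at least min_idle_days."""
    return [
        v for v in vehicles
        if v["status"] == "unsold" and (today - v["last_moved"]).days >= min_idle_days
    ]


def overdue_claims(claims, today, max_open_days=18):
    """Return claims that are still open after more than max_open_days (assumed threshold)."""
    return [
        c for c in claims
        if c["closed_on"] is None and (today - c["opened_on"]).days > max_open_days
    ]


today = date(2024, 3, 31)
vehicles = [
    {"vehicle_id": "V1", "status": "unsold", "last_moved": date(2024, 1, 1)},
    {"vehicle_id": "V2", "status": "unsold", "last_moved": date(2024, 3, 1)},
    {"vehicle_id": "V3", "status": "sold", "last_moved": date(2023, 12, 1)},
]
claims = [
    {"claim_id": "C1", "opened_on": date(2024, 3, 1), "closed_on": None},
    {"claim_id": "C2", "opened_on": date(2024, 3, 20), "closed_on": None},
    {"claim_id": "C3", "opened_on": date(2024, 1, 10), "closed_on": date(2024, 1, 25)},
]
flagged_vehicles = idle_vehicles(vehicles, today)
flagged_claims = overdue_claims(claims, today)
```

Keeping the thresholds as parameters means the same rule can answer "60 days" or "90 days" without any change to the logic.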

Phase 5: Turning Numbers Into Narratives

One of the biggest hurdles in data management is making the data relatable. A spreadsheet with thousands of rows is difficult to digest. We focused on turning those numbers into a story.

By the end of the hackathon, the three original questions finally had clear, instant answers:

  1. 1,604 Customers: A clear, deduplicated list that was auditable.
  2. Clear Claim Tracking: We identified exactly why the 18-day delay was happening, highlighting 276 claims that were stuck in the system.
  3. 176 Targeted Assets: A specific list of cars that were costing the company money by sitting idle.

This shift changed how the business interacted with its own information. Instead of cross-referencing multiple systems, teams now worked from a single, unified view where answers were available in seconds.

The Real Value: Removing Friction

Our main takeaway from this project is that data is only valuable when people can use it without friction. Many organizations have plenty of data, but they have a “decision problem” because that data is too hard to access or trust.

We didn’t just build a technical pipeline; we removed the confusion that was preventing PrimeInsurance from operating efficiently. By focusing on the quality of the data first and the intelligence second, we created a system that provided real, measurable value.

Conclusion: Data vs. Decisions

The Databricks Hackathon confirmed a core belief we hold at Sinki: most data problems are actually clarity problems. If your team is struggling to answer basic questions about your operations, the issue likely isn’t a lack of tools, but a lack of a solid data foundation.

We demonstrated that by doing the hard work of cleaning, unifying, and standardizing data, you can transform a fragmented organization into a data-driven one. Technology should serve the business, and the best way to do that is to make data clear, accessible, and, most importantly, actionable.

If you are looking at your own systems today, ask yourself: is the challenge getting the data, or is it making the data usable? The answer to that question will determine how you should build your next data platform.

Written by Paras Dhyani

Paras Dhyani is a Databricks Certified Data Engineer Professional specializing in scalable data architecture and analytics. He focuses on transforming complex data challenges into streamlined, production-ready engineering solutions. Through his writing, Paras provides practical insights into building and optimizing high-performance systems on the Databricks platform.
