What Enterprises Get Wrong About Data Complexity and How to Fix It

In today’s rapidly evolving world of data, the challenge for enterprises is no longer just managing complexity—it’s unlocking the untapped potential hidden within it. 

At the recent DES 2025 summit, Shivani Bennur, onboarding engineer at Starburst, stepped into the spotlight to share a vision of how breaking free from traditional limitations can pave the way for a unified, intelligent data future.

Data Drill 

Bennur described the all-too-familiar “data drill” that plagues most organisations. “We all have the enterprise data siloed and fragmented across different data sources, making it extremely difficult to access this data.”

The traditional solution, she explained, has been to use extract, transform, load (ETL) processes to churn through and centralise data in warehouses. But as Bennur pointed out, this creates its own set of problems.

“It creates a single source of truth, but with this, the data is often siloed, outdated, and fragmented. So we are not even ready for…the growing needs of data science and AI, which actually need to process a lot of unstructured data,” she said. 

The result is multiple copies of data, added complexity, inconsistencies, and a governance maze that few can navigate.

Unlocking the Potential

Bennur declared that the future lies in an open, hybrid data lakehouse: one that delivers warehouse-grade performance without the high cost and gives users the flexibility to access data wherever it lives, without the headaches of data movement. 

“What we are aiming at is not just having a lakehouse, which you already have, but actually having an open hybrid lakehouse,” she explained. 

The vision is based on Trino, an open-source, parallel, distributed SQL query engine originally developed at Facebook as Presto. 

“Trino is known for querying data across multiple different sources—that is called query federation—eliminating the need to move complex data around and write scripts,” Bennur said, highlighting its ability to separate compute from storage for independent scaling and cost efficiency.
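Query federation can be pictured as a join performed across systems in place, with no intermediate copy. Below is a minimal Python sketch of that idea, using hypothetical in-memory stand-ins for a Postgres table and a Hive table; in Trino itself this would be a single SQL statement over `catalog.schema.table` names rather than Python code.

```python
# Conceptual sketch of query federation: joining rows that live in two
# separate "sources" without first copying them into one warehouse.
# In Trino this is one SQL query, e.g.
#   SELECT ... FROM postgres.crm.customers c JOIN hive.web.clicks k ...
# where the catalog prefix names the underlying system.

# Hypothetical data standing in for two different systems.
postgres_customers = [
    {"id": 1, "name": "Asha"},
    {"id": 2, "name": "Ravi"},
]
hive_clicks = [
    {"customer_id": 1, "page": "/pricing"},
    {"customer_id": 1, "page": "/docs"},
    {"customer_id": 2, "page": "/home"},
]

def federated_join(customers, clicks):
    """Join records from the two sources in place, with no ETL copy step."""
    by_id = {c["id"]: c["name"] for c in customers}
    return [
        {"name": by_id[row["customer_id"]], "page": row["page"]}
        for row in clicks
        if row["customer_id"] in by_id
    ]

print(federated_join(postgres_customers, hive_clicks))
```

The point of the sketch is the shape of the operation: neither source is exported or reloaded, the engine simply reads from both at query time.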

She noted that Trino’s open architecture is why leading lakehouse table formats like Apache Iceberg pair so naturally with it, a combination Starburst calls an Icehouse architecture. The Icehouse is a modern, open lakehouse architecture that combines Trino, the open-source distributed SQL engine, with Apache Iceberg, an open table format, to deliver a full data warehouse experience on data lakes. It supports transactional queries (insert, update, delete), scalable analytics, and AI workloads, without vendor lock-in.

‘Starburst Speaks Your Language’

“This brings us to Starburst,” Bennur explained. “The first thing I want you to know is that Starburst is a Trino company co-founded by the creators of Trino. We bring the goodness of Trino to the enterprise.”

Starburst is essentially an analytics engine that fits into any environment without requiring data movement or centralisation, combining Trino’s power with enterprise-grade features. 

These include data products for querying and governance, fault-tolerant execution for ETL workflows, and advanced query optimisation. 

“Starburst autoscales from gigabytes to petabytes in no time, so that you can save on your resources,” she said, adding that smart indexing and caching can significantly improve query times. Metrics and logging features provide a comprehensive view of both data and platform performance.

As a central hub, Starburst allows organisations to plug in and query any data source, creating what Bennur called a “data mesh”. 

This breaks down silos and delivers a trustworthy single point of access. Integration with popular business intelligence (BI) tools is seamless, requiring no changes to existing workflows.

For data engineers, Bennur painted a compelling picture. “Imagine a place where you don’t have to write heavy ETL scripts and move the data in and out, and have to spend hours just waiting for the jobs to complete. What if you could do all of that in a single place and get direct access to your data? That’s what Starburst does—Starburst speaks your language.” 

The platform supports open table formats like Apache Hudi, Iceberg, and Delta Lake, allowing users to choose the best tool for the job. Its fault-tolerant execution mode means queries can resume from failures without losing progress or wasting resources. 
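The fault-tolerant idea, resuming from completed work instead of restarting, can be sketched as a checkpointed multi-stage job. This is a conceptual illustration with hypothetical stage names, not Starburst's actual mechanism, which spools intermediate exchange data between workers.

```python
# Sketch of fault-tolerant execution: completed stages are checkpointed,
# so after a failure the job resumes where it left off rather than
# redoing finished work.

def run_job(stages, checkpoint):
    """Run stages in order, recording each completed result in `checkpoint`."""
    for name, fn in stages:
        if name in checkpoint:   # already finished before the failure
            continue
        checkpoint[name] = fn()
    return checkpoint

flaky = {"tries": 0}

def stage_b():
    """Fails on its first attempt, simulating a lost worker."""
    flaky["tries"] += 1
    if flaky["tries"] == 1:
        raise RuntimeError("worker lost")
    return "b-done"

stages = [("a", lambda: "a-done"), ("b", stage_b)]
ckpt = {}
try:
    run_job(stages, ckpt)        # first run fails partway through
except RuntimeError:
    pass                         # stage "a" survived in the checkpoint
result = run_job(stages, ckpt)   # resumes at stage "b" only
```

Without the checkpoint, the retry would recompute stage "a" as well, which is exactly the wasted work Bennur described.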

Moreover, for Python users, PyStarburst enables pipeline development within Starburst using familiar APIs, she noted. Autoscaling allows rapid analysis of data volumes from gigabytes to petabytes, and integration with tools like dbt, Airflow, Jupyter, and Spark is effortless. 

She stressed that whether customers want a fully managed cloud solution or an on-premises deployment, Starburst offers both, with robust security and governance features built in. 

Challenges Exist

Yet, as Bennur acknowledged, connecting the right data to LLMs is the next big challenge for the data industry, and even the most advanced models struggle with data access in the AI era. According to her, general-purpose AI often hallucinates, giving generic answers because the enterprise data it needs is limited, outdated, and siloed, leaving the model unable to access or learn from it. Every organisation adopting AI now faces a data access challenge. 

She pointed out that data collaboration is another major hurdle. AI initiatives often stall because IT and business teams struggle to architect and govern data together, leading to models that lack relevance, trust, and real-world applicability.

“Data governance is fractured because AI use cases involve moving sensitive data across different technologies like vector embeddings, LLMs, and vector DBs, resulting in increased security and governance risks,” she said. Inconsistent policies across data sources only add to the security gaps and operational overhead.

Starburst to the Rescue

Starburst supports both batch and streaming data processing. For effective data collaboration, Starburst offers intuitive data discovery tools, simple SQL access, and the ability to curate governed data products for sharing across teams. 

“Data governance is a really critical part of today’s hybrid and multi-cloud environments. Starburst provides robust governance features, including fine-grained access control, AI-powered data generation, AI-powered data classification, and data and network security, ensuring that users’ data is secure, compliant, and accessible only to those who need it,” she said.

Enhancing AI Outcomes

Bennur explained how Starburst boosts AI performance using retrieval augmented generation (RAG), which she called a “superpower memory boost”. Unlike traditional LLMs that rely only on training data, RAG dynamically pulls up-to-date enterprise information to generate context-rich responses.

She compared RAG with prompt stuffing, another method in which enterprise data is embedded directly into the prompt, pointing out that both help LLMs produce more relevant, timely answers.

RAG works by retrieving current data, such as research papers or support tickets, combining it with the user query, and passing it to the LLM. This ensures the output is accurate and aligned with the latest information.
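The retrieve-combine-generate loop described above can be sketched in a few lines of Python. The keyword retriever and the LLM here are toy stand-ins (the real pipeline would use vector search and an actual model), so this is a shape-of-the-flow illustration, not an implementation.

```python
# Sketch of the RAG loop: retrieve relevant documents, combine them with
# the user question, and hand the augmented prompt to the model.

docs = [
    "Ticket 101: login fails after password reset",
    "Ticket 102: dashboard times out on large queries",
]

def retrieve(query, corpus, k=1):
    """Toy keyword retriever; production systems use vector search."""
    scored = [(sum(w in d.lower() for w in query.lower().split()), d)
              for d in corpus]
    scored.sort(reverse=True)
    return [d for score, d in scored[:k] if score > 0]

def fake_llm(prompt):
    """Stand-in for a real model call."""
    return f"ANSWER based on {prompt.count('Ticket')} retrieved ticket(s)"

def rag_answer(query):
    context = retrieve(query, docs)
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"
    return fake_llm(prompt)

print(rag_answer("why does login fail"))
```

The key step is that the prompt carries current enterprise context alongside the question, which is what keeps the model's answer grounded in up-to-date data.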

Bennur also showcased how Starburst simplifies complex AI workflows. Traditionally requiring coordination with engineers and multiple tools, tasks like joining Oracle clinical trial data with Delta Lake sources can now be done using simple SQL queries. LLM functions such as classify, extract, and summarise can be invoked directly.

Starburst supports integration with multiple LLMs, including OpenAI and Amazon Bedrock. In one example, it used vector and full-text search to analyse support tickets and then passed summaries to the LLM for insights.

RAG workflows in Starburst involve gathering data, chunking, embedding it into Iceberg tables, and performing AI search—all governed and compliant. Bennur concluded by saying Starburst unifies data access, enhances AI outcomes, and ensures security and control, helping organisations fully realise the potential of AI.
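The chunking step in that pipeline, splitting documents into overlapping pieces before embedding them, might look like the minimal sketch below. The window and overlap sizes are illustrative choices, not Starburst defaults.

```python
# Sketch of document chunking for a RAG pipeline: split text into
# overlapping word windows so that no sentence is cut off at a hard
# boundary without context on either side.

def chunk(text, size=50, overlap=10):
    """Split `text` into chunks of `size` words, overlapping by `overlap` words."""
    words = text.split()
    step = size - overlap
    return [
        " ".join(words[i:i + size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```

Each chunk would then be embedded and written to an Iceberg table, where the AI search described above runs under the same governance controls as any other query.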

The post What Enterprises Get Wrong About Data Complexity and How to Fix It appeared first on Analytics India Magazine.
