Microsoft, Databricks double act tries to sew up the data platform market

Tan KW
Publish date: Mon, 27 Nov 2023, 09:28 PM

Analysis At Microsoft's Ignite conference, CEO Satya Nadella called Fabric perhaps the company's biggest data product launch since SQL Server, which ranks as the third most popular database in the world.

Going GA earlier this month, Microsoft Fabric promises data engineering, data lakes, data warehousing, machine learning, and AI all in a single platform.

Fabric leans heavily on open source technology from Databricks, which extensively partners with Microsoft and integrates its products tightly into the Azure cloud platform.

Users would, however, be wise to keep an eye on data egress costs, and the scale-out approach may fall short of the performance customers need for enterprise business intelligence (BI) and data warehousing workloads, analysts have told The Register.

Among the Fabric GA news, Microsoft announced Mirroring, which it claims will improve analytics performance by creating a copy of external data sources within its own data lakes.

While these kinds of features might put Microsoft's nose in front of rivals such as Snowflake and Google, the advantage is unlikely to last long, said Ventana Research analyst Matthew Aslett. "Everyone's pushing in the same direction and, as announcements come at different times, the others catch up: it's a pretty close race at this point between all the big players," he said.

A group of other vendors in the data engineering, warehousing, and analytics market have made announcements to tie in with the Fabric launch - including SAS, Teradata, Qlik, Fivetran, and Informatica - betting that Microsoft will become the platform of choice for many users.

It's a play to ensure that should users pick Microsoft Fabric as their main data platform, they are still in the game. "A lot of organizations are looking to reduce the number of data and analytics providers they have," Aslett said. "They're obviously trying to balance that with not getting locked in. There's an interesting balance: you want to reduce the number of vendors without being knocked down to one. But most enterprises they're dealing with have a whole range of different data platforms."

With Mirroring, Microsoft replicates a snapshot of the external database to OneLake in Delta Parquet tables and keeps the replica synced in "near real time." Users can then create shortcuts that let other Fabric workloads - connectors, data engineering, building AI models, data warehousing - use the data without moving it again. Microsoft promised Azure Cosmos DB and Azure SQL DB customers would be able to use Mirroring to access their data in OneLake, while Snowflake and MongoDB customers can do the same.
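
For a sense of the mechanics: the shortcut surfaces the replica as an ordinary Delta table to Fabric's Spark engine. The following is a minimal illustrative sketch, not Microsoft's documented API; it assumes a Fabric notebook where a mirrored table is already exposed through a hypothetical shortcut named mirrored_orders.

```python
# Minimal illustrative sketch, not Microsoft's documented API: query a
# mirrored table through a hypothetical OneLake shortcut from Fabric Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # pre-created in Fabric notebooks

# The shortcut surfaces the replica as an ordinary Delta table, so the query
# never federates back to the source system (e.g. Snowflake).
top_customers = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM mirrored_orders  -- hypothetical shortcut to the mirrored data
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""")
top_customers.show()
```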

Microsoft admitted that mirroring data into Fabric creates an additional copy of the data, but claimed the duplication is offset by performance advantages. Because the Fabric copy is stored natively in Apache Parquet and Delta Lake, OneLake can load the data into memory when queries come in, avoiding the need to send SQL queries back to Snowflake, for example.

But users would need to account for egress costs when moving data out of remote systems as they weigh up the advantages of the Mirroring features, Aslett said.

"That's certainly something I would expect an enterprise would want to evaluate before they committed to using that kind of functionality," he said. "It will depend on the source and various other things, but that definitely should be a consideration."

Meanwhile, Snowflake has made its own pitch to be the platform that does everything by supporting both data lakes and warehouses, while querying external sources using the Apache Iceberg table format, a technology also supported by Cloudera and Google. It said it believes in eliminating copies of data to simplify governance and achieve greater efficiencies.
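
For comparison, the Iceberg approach needs no mirrored copy, only a catalog pointed at a shared warehouse. Here is a minimal sketch from Spark, assuming the iceberg-spark-runtime package is on the classpath; the catalog name, bucket, and table are hypothetical.

```python
# Minimal sketch of the Iceberg alternative: one shared copy of the data,
# queryable by any engine pointed at the same catalog. Assumes the
# iceberg-spark-runtime package; catalog, bucket, and table are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

# Iceberg keeps table metadata with the data files, so Spark, Snowflake, or
# any other supporting engine reads the same single copy; no mirroring needed.
spark.sql("SELECT COUNT(*) FROM lake.sales.orders").show()
```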

At the same time as the Fabric news was announced in mid-November, Databricks confirmed a complete overhaul with a so-called data intelligence layer dubbed DatabricksIQ, which "fuels all parts" of its lakehouse platform, designed to accommodate both unstructured data lakes and structured BI and analytics data warehouse workloads.

Databricks' new platform plan is designed to exploit the technology gained in its $1.3 billion acquisition of MosaicML, a generative AI startup. Databricks claimed it would introduce end-to-end retrieval augmented generation (RAG) to help create "high quality conversational agents on your custom data," but has yet to announce any product details.
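
Databricks has not published how its RAG pipeline will work, but the retrieval half of the pattern is simple enough to sketch: rank stored passages against the question, then prepend the best matches to the model prompt. The toy below uses bag-of-words cosine similarity in place of the learned embeddings a production system would use.

```python
# Toy sketch of the retrieval step in RAG: rank stored passages against a
# question, then prepend the best matches to the model prompt. Bag-of-words
# cosine similarity stands in for the learned embeddings a real system uses.
import math
import re
from collections import Counter

docs = [
    "Fabric mirrors external databases into OneLake as Delta tables.",
    "DatabricksIQ is the data intelligence layer of the lakehouse platform.",
    "Exasol provides an in-memory MPP acceleration layer for BI dashboards.",
]

def tokens(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 2) -> list[str]:
    q = tokens(question)
    return sorted(docs, key=lambda d: cosine(q, tokens(d)), reverse=True)[:k]

question = "What is the DatabricksIQ layer?"
context = "\n".join(retrieve(question))
# The assembled prompt would then be sent to a generative model.
print(f"Answer using only this context:\n{context}\n\nQ: {question}\nA:")
```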

Performance on data lakes is one thing. On warehouses it's quite another. In BI environments, hundreds, even thousands of users might be hitting the database at the same time. It's a problem an older generation of vendors addressed with query optimization and specialist hardware. While modern, cloud-based data warehouses might cope by adding nodes, users will face a commensurate cost.

In 2021, Gartner pointed out that cloud-based data lakes might struggle with SQL queries from more than 10 concurrent users. Databricks disputed the claim but said it was aware of the challenges. To support more users, a customer could spin up more end points in the cloud, the company said.
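
The concurrency ceiling is straightforward to probe empirically. The harness below is a generic sketch, not any vendor's tooling: run_query is a placeholder for a real DB-API call to a SQL endpoint, and while the sleep stand-in keeps latencies flat, a saturating warehouse would see the averages climb as the user count grows.

```python
# Generic sketch of a concurrency probe: fire N simultaneous queries and
# compare average latency as N grows. run_query is a placeholder for a real
# DB-API call to a SQL endpoint; nothing here is vendor tooling.
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(user_id: int) -> float:
    start = time.perf_counter()
    time.sleep(0.1)  # stand-in for cursor.execute("SELECT ..."); a real
                     # warehouse slows down here once it saturates
    return time.perf_counter() - start

for users in (1, 10, 100):
    with ThreadPoolExecutor(max_workers=users) as pool:
        latencies = list(pool.map(run_query, range(users)))
    print(f"{users:>3} concurrent users: "
          f"avg latency {sum(latencies) / len(latencies):.3f}s")
```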

Aslett said that more organizations are becoming aware of the difficulties as they attempt to scale data lakes and support enterprise BI workloads.

"We see examples where organizations have done some small-scale testing of a cloud environment that can deliver performance on a small scale and then when they take it into production, and have a higher level of concurrent users, a high level of concurrent queries, they can then run into problems and issues in terms of performance. It is something we've seen organizations have become more cognizant of with high-performance workloads and that's one of the reasons we see some workloads remaining on premises."

For example, Adidas has built a data platform around Databricks. The environment supports the global sportswear maker's development of machine learning models. It also supports BI workloads and the company has created an acceleration layer with the in-memory database Exasol.

Exasol CTO Mathias Golombek told The Register that the company was often brought in on projects where customers find their data platform is not supporting certain workloads with sufficient performance. "Customers like Adidas can have more than 10,000 BI users looking at dashboards which are constantly updated and consumed," he said. "You need a powerful acceleration layer and that's what we provide."

According to Exasol's market research, nearly 30 percent of customers suffer performance issues with their BI tools. "That means not enough people can access the BI dashboards or they are too slow or there are limits on the complexity of questions users can ask because of the underlying data system," Golombek said. Exasol's Espresso product serves as a BI accelerator built on the company's in-memory columnar database, with a massively parallel processing (MPP) architecture and auto-tuning capabilities.
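
The acceleration-layer pattern Golombek describes boils down to copying a hot, structured table into Exasol and pointing the dashboards there. Here is a minimal sketch using Exasol's pyexasol driver; the host, credentials, and table names are hypothetical stand-ins.

```python
# Sketch of the acceleration-layer pattern: copy a hot, structured table
# into Exasol and point BI dashboards at the in-memory MPP copy. Uses
# Exasol's pyexasol driver; host, credentials, and names are hypothetical.
import pandas as pd
import pyexasol

# In practice this frame would come from a lakehouse query (e.g. exported
# from Databricks); a toy frame keeps the sketch self-contained.
hot_table = pd.DataFrame({"SKU": ["A1", "B2"], "UNITS_SOLD": [120, 87]})

conn = pyexasol.connect(dsn="exasol.example.com:8563",
                        user="bi_loader", password="***")
conn.execute("CREATE SCHEMA IF NOT EXISTS BI")
conn.execute("CREATE TABLE IF NOT EXISTS BI.DAILY_SALES "
             "(SKU VARCHAR(20), UNITS_SOLD INT)")
conn.import_from_pandas(hot_table, ("BI", "DAILY_SALES"))

# Dashboards then query the accelerated copy instead of the data lake.
print(conn.export_to_pandas(
    "SELECT * FROM BI.DAILY_SALES ORDER BY UNITS_SOLD DESC"))
```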

Hyoun Park, CEO of Amalgam Insights, said that by renaming its platform and integrating GenAI features, Databricks was claiming to offer the same semantic context across all of a user's data while maintaining governance of intellectual property across the AI lifecycle. "This new product positioning indicates that it is no longer enough to simply put all of your data in one place and to conduct analytics on that data," he said.

Databricks, which came up with the lakehouse concept back in 2020, has sizable funding: a Series I VC round scooped up another $500 million in September at a nominal valuation of $43 billion. The cash pile could help the company define a "next generation term for where they see the next few years of development," Park said.

Nonetheless, Park argued, the complexity of managing multi-node Spark clusters means a third-party technology layer can be needed to boost performance.

"Exasol has long been known for its speed in supporting analytics, based on in-memory MPP and auto-tuning," Park said. "High-performance analytics for structured data becomes increasingly challenging to support as data volumes increase and we are reaching an inflection point where data is starting to either outgrow or strain the complexity of managing multi-node Spark clusters.

"Although there are strategies to prioritize memory such as caching frequently used data, Exasol can be used as a tool to replicate structured Databricks data once there are no additional tactics to support faster queries without using up Spark cluster resources and administrative skills."

While Databricks and Microsoft are competing and collaborating to define a market for one-stop shop data platforms supporting BI, analytics, and machine learning in a single environment, organizations that require acceptable performance across thousands of impatient end users might need to shop elsewhere to get what they need. ®


https://www.theregister.com/2023/11/27/microsoft_databricks_data_platform/
