Future Tech

Microsoft touts mirroring over moving in data warehouse gambit

Tan KW
Publish date: Thu, 16 Nov 2023, 08:09 AM

Ignite: Microsoft is advising customers using its Fabric platform to copy data from other data warehouses and analytics systems in a move against the prevailing industry trend.

Fabric - which encompasses data warehouse, data lake, analytics, BI, and machine learning - was launched earlier this year, promising to address "every aspect of an organization's analytics needs."

At the Redmond software giant's Ignite conference this week, Microsoft announced its general availability, as well as a few new features.

Among them is Mirroring, a way to add and manage existing cloud data warehouses and databases in Fabric's Synapse Data Warehouse system. Microsoft said Mirroring replicates a snapshot of the external database to OneLake in Delta Parquet tables and keeps the replica synced in "near real time."

From there, users can create shortcuts to allow other Fabric workloads - connectors, data engineering, building AI models, data warehousing - to use the data without moving it again. Microsoft promised Azure Cosmos DB and Azure SQL DB would be able to use Mirroring to access data in OneLake, while cloud-based data platform provider Snowflake and NoSQL database MongoDB customers will be able to do the same.

The move goes some way toward following a trend seen in the data warehouse and analytics space over the last couple of years. Because OneLake supports the Delta table format, other compatible analytics engines will be able to access and use the data in OneLake without moving it.

Delta is supported by application giant SAP and Databricks.

But others have adopted a different table format - Apache Iceberg - for a similar objective. They include Snowflake, Cloudera, and Google's BigLake.

Iceberg and Delta are effectively metadata layers on the Apache Parquet data storage format.

Although both formats - as well as Apache Hudi - are designed to help bring analytics engines to the data, avoiding the cost of moving it, Microsoft argues that copying data from other sources is necessary to get better performance.

Speaking to The Register, Arun Ulag, corporate vice president of Azure Data, said the idea behind Mirroring was to allow customers who have data sitting in proprietary databases and data warehouses, such as Snowflake, to create and maintain a replica in OneLake.

Although it might require storing the data in two places, Ulag argued there would be performance advantages.

"The majority of the Snowflake data is not sitting in Iceberg," he said, "but in their own proprietary database. Like other data in a proprietary data format, the only way to touch the data is to go through the SQL interface, which drives up costs for customers. It also means that there's another tier of execution which slows down performance."

Once the data has been copied to Fabric, Power BI, for example, doesn't even have to send SQL queries to Snowflake because the data is sitting in Apache Parquet and Delta Lake, OneLake's native format. "It will simply go to OneLake and paste it into memory when queries come in," Ulag said. "It gives you a significant performance acceleration because you know you're eliminating that whole SQL execution."

Snowflake has been offered the opportunity to comment on the performance advantage of copying data from its environment for analytics.

One industry expert said Microsoft would need to copy the data to get better query performance until it natively supports Iceberg, which it said it would in the future. It's also possible that Microsoft believes it can manage the data better than Snowflake to get better query performance via the way it controls clustering, they said.

Hyoun Park, CEO and chief analyst with Amalgam Insights, said: "Microsoft would be glad to take any Parquet files and put them into a Microsoft data lake and would be glad to take any Snowflake data that it can get in the process."

But behind the scenes, there may be reasons Microsoft is focusing on Delta rather than Iceberg for the time being.

"We know that there is only one major company that has focused on the Delta Lake format so far, and that is the powerhouse startup Databricks," Park said. "There is an Azure Databricks product as well, and it has been doing very well. In fact, it may be the most successful product on Microsoft Azure. Our data shows it is currently a multibillion-dollar business when considering the data lake and associated analytic and machine learning workloads.

"Microsoft has made no secret of the fact that it is staking a lot of its near-term growth on AI. This means that Microsoft wants to be able to support a Delta Lake format, and do as much of the work themselves on their own infrastructure and resources."

Park said Microsoft also has a lot of Azure cloud business that is directly reliant on Databricks and would want to make sure it does everything possible not to lose that business. "Although Iceberg is a more prevalent data lake standard, when looking across the IT vendor landscape, Databricks has been very successful in providing machine learning infrastructure at the data level," he said.

However, he said Microsoft would eventually be a significant Iceberg contributor as well.

At Ignite, Microsoft said it would extend its Copilot chatbot to Fabric. Now in public preview, the integration promises to allow data scientists to use natural language to create dataflows and pipelines, write SQL statements, build reports, and develop machine learning models. ®

https://www.theregister.com//2023/11/15/microsoft_fabric_mirroring/
