Future Tech

Voltron Data revs up hyper-speed analytics, leaves Snowflake in the dust

Tan KW
Publish date: Tue, 19 Mar 2024, 06:40 PM

Analytics software startup Voltron Data claims to have completed the full TPC-H 100 terabyte scale factor on unsorted Parquet files directly from storage in less than an hour with only 6 TB of GPU memory.

The fledgling company only launched its first product, called Theseus, in December last year. It has built the distributed execution engine around Apache Arrow, an open source framework for analytics applications that process columnar data, aiming to solve problems beyond the capabilities of CPU-based systems like Apache Spark.

Although Voltron Data has built proprietary software around Arrow, it does so with open standards and APIs, allowing users to port code from one project to another, according to CEO and co-founder Josh Patterson.

"For those types of workloads of 30 terabytes to 100 terabyte scale queries, we've been dramatically faster, cheaper, with fewer servers [than the alternatives]," he told The Register.

The idea is to complement existing systems - such as analytics engine Apache Spark or in-process OLAP database DuckDB - allowing users to port projects from one system to another according to the scale of the problem, without having to pick a major data platform and stick with it.

"For two terabyte and below queries? Yeah, use DuckDB. For two to 30 terabytes use Spark, Trino, or Presto. But when you get above 30 terabytes, you should think about using these accelerated native GPU-based systems like Theseus. The power of open standards means you can write code once and you can target all these different systems," Patterson said.
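Patterson's rule of thumb can be written down as a small lookup. The sketch below is only an illustration of the thresholds he quotes: the function name `suggest_engines` is hypothetical, and treating exactly 2 TB and exactly 30 TB as belonging to the lower tier is an assumption about how to read "two terabyte and below" and "two to 30 terabytes". It does not call any of the engines named.

```python
def suggest_engines(query_size_tb: float) -> list[str]:
    """Map a query's data size (in terabytes) to the engines Patterson names.

    Hypothetical helper: the thresholds come from the quote above, and the
    handling of the exact boundaries (2 TB, 30 TB) is an assumption.
    """
    if query_size_tb < 0:
        raise ValueError("query size cannot be negative")
    if query_size_tb <= 2:
        # "For two terabyte and below queries? Yeah, use DuckDB."
        return ["DuckDB"]
    if query_size_tb <= 30:
        # "For two to 30 terabytes use Spark, Trino, or Presto."
        return ["Spark", "Trino", "Presto"]
    # "When you get above 30 terabytes", GPU-based systems like Theseus.
    return ["Theseus"]


if __name__ == "__main__":
    for size_tb in (0.5, 10, 100):
        print(f"{size_tb} TB -> {', '.join(suggest_engines(size_tb))}")
```

In practice the choice would also depend on hardware, cost, and query shape, none of which this lookup attempts to model.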

Hyoun Park, CEO and chief analyst with Amalgam Insights, reckons the result Voltron Data announced for the 100 terabyte TPC-H benchmark was an order of magnitude faster than what Snowflake supports out of the box. Snowflake is known as a data warehouse that is supposed to provide high-performance analytics capabilities.

"From a practical perspective, what I liked about the Voltron Data announcement is that they are able to get relatively high performance off of Parquet files, which have typically been used as a standard for the data lake, rather than as a performance analytic database," he said.

He said that by supporting open standards, the combination of Theseus, Spark, DuckDB, and others was creating an alternative to the big data platform providers such as Snowflake, Databricks, and Microsoft Fabric, where users can move their projects between analytics engines.

"Databricks does a great job of being a one stop shop for the data lakehouse concept. And there still is and will always be a need for single vendors that can support the entire stack of technologies needed to manage analytics on a data lake," Park told us.

"But with Databricks' shift to promoting the data intelligence platform, they can see that there is now significant competition on the data lakehouse front and it is not enough to simply promote that concept. Voltron Data is helping to open up the data lakehouse concept, and provide more plug-and-play capabilities for IT departments."

HPE is the first vendor to embed Theseus as its accelerated data processing engine as part of HPE Ezmeral Unified Analytics Software.

Voltron Data has accrued $110 million in seed and Series A funding since it launched. ®

 

https://www.theregister.com//2024/03/19/voltron_data_arrow/
