How Apache Spark lit up the tech world and outshone its big data brethren

Tan KW

Publish date: Fri, 14 Jun 2024, 07:49 PM

Interview: Big data is no longer hailed as the "new oil." It has gone out of fashion, both in terms of hype and because its foundational technology - Apache Hadoop - was surpassed by cloud-based blob storage such as AWS S3. However, a sister project born in the big data era has become more influential in the modern world of LLMs and internet-scale data systems.

It's been roughly 10 years since Apache Spark 1.0 was released, and The Register caught up with the original author, Matei Zaharia. He wrote the code as part of his UC Berkeley PhD thesis before the project was donated to the Apache Software Foundation.

"I didn't imagine it would be this popular and widely used back then," he said. "The use of it is still increasing today, from everything we can see: Developers, downloads, and meetup groups and so on."

Spark started as an academic project in 2010 when Zaharia - a Romanian-Canadian national - saw the need to improve on a nascent technology called MapReduce, a Java-based programming model to manage big data across clusters on the Hadoop Distributed File System.

"There were all these data-intensive computing things happening in the web companies at the time - basically Google, Yahoo, and Microsoft," he said. "There were also these distributed computing frameworks like MapReduce, and I was really interested in learning more about those and bringing that kind of parallel computing to a lot more users."

Although MapReduce was a way of cracking big data problems on commodity hardware, it was really targeted at software engineers, Zaharia explained.

"Their approach is very different from someone doing just interactive analysis, for example. Imagine if you want to use something like Microsoft Excel, you can very quickly get some results for what you want. You're not building something that's gonna run your business, but these early frameworks started for the engineers and they didn't do the distribution, fault recovery, and scheduling for you. You had to write it all by hand."
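To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps applied to a word count. It is an illustration only: the function names are hypothetical, nothing here is Hadoop's actual API, and a real framework would spread each phase across many machines.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group values by key, the step that sits between map and reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run the three phases in order on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data big clusters", "data on clusters"])
```

Even in this toy form, a simple question ("how often does each word appear?") has to be recast as three separate functions, which hints at why the model suited engineers better than analysts.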

Zaharia took inspiration from the researchers using big data for machine learning or discovering new viruses, for example. "These are really interesting use cases where they won't sit down and learn Java and spend many weeks building an application. We wanted to make it as easy as possible for them to do their stuff," he said.

Part of the plan to broaden the appeal was to introduce new programming languages. As well as Java, users can work in Scala, statistical language R, C#, and Python, a high-level general-purpose language that has achieved widespread popularity in machine learning. The de facto database language standard SQL was added in 2014.

Zaharia's plan worked. Spark became open source in 2010 under a BSD license, was donated to the Apache Software Foundation in 2013, and became a top-level project in 2014.

In the meantime, Zaharia went on to co-found Databricks, which was initially based around providing Apache Spark as a service in the cloud. In the ten years since it was founded, the company has branched out to provide a so-called data lakehouse platform, which combines exploratory data lakes with the more ordered SQL queries of a data warehouse. Earlier this year, it launched its own large language model (DBRX), an open source data catalog (Unity), and tools to build, deploy, and monitor AI and ML solutions in Mosaic AI.

While Spark remains part of Databricks' core offering, its strength lies in its independence from the vendor. All the main cloud vendors - AWS, Google, Azure, Oracle, and IBM - offer Spark as a service, while independent vendors such as Qubole, which provides an open source data lake, also offer it.

Prasad Pore, Gartner senior director and analyst for data and analytics, said Spark was widely used for data processing and preparation, as well as analytics. "When it comes to processing large amounts of data, Apache Spark is a very proven, robust and fault-tolerant technology. Because of that, Spark has pretty good adoption in the market, either via vendors, as a managed offering, or via an open source implementation."

The secret to its success lies in its ability to process data in-memory - where MapReduce had written to disk - and ensure the distributed processing does not fall over.

"In-memory processing provided a tremendous performance improvement over batch jobs," Pore told The Register. "That was the main value proposition. Fault tolerance is also a very critical element when it comes to large amounts of data processing. Imagine if you are processing 10 TB of data and somehow the batch fails. You need to know that there is a fault-tolerant mechanism. If a node fails, it should be able to recover automatically. That is the robust architecture Spark has."
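One way to picture the combination of in-memory processing and automatic recovery is the toy sketch below. It is loosely modelled on the idea that a dataset can remember the recipe (its lineage) used to build it, so a lost in-memory copy can be recomputed rather than restored from disk. `ToyDataset` and its methods are hypothetical names for illustration, not Spark's API, and the "failure" is simulated in a single process.

```python
from __future__ import annotations

from typing import Callable, Optional


class ToyDataset:
    """One chunk of data plus the recipe (lineage) that rebuilds it."""

    def __init__(self, compute: Callable[[], list[int]]) -> None:
        self._compute = compute                   # lineage: how to rebuild
        self._cached: Optional[list[int]] = None  # in-memory copy, if any
        self.recomputations = 0                   # how often we rebuilt

    def map(self, fn: Callable[[int], int]) -> ToyDataset:
        # The child stores only a reference to its parent and the function.
        return ToyDataset(lambda: [fn(x) for x in self.collect()])

    def collect(self) -> list[int]:
        if self._cached is None:  # first use, or lost after a failure
            self._cached = self._compute()
            self.recomputations += 1
        return self._cached

    def simulate_node_failure(self) -> None:
        """Drop the in-memory copy, as if the machine holding it died."""
        self._cached = None


source = ToyDataset(lambda: [1, 2, 3, 4])
squared = source.map(lambda x: x * x)

first = squared.collect()        # computed once, then kept in memory
second = squared.collect()       # served from memory, no recomputation
squared.simulate_node_failure()
recovered = squared.collect()    # rebuilt from the recorded lineage
```

The second `collect()` call never touches the source again, which is the in-memory speed-up in miniature, and the call after the simulated failure returns the same result without any manual intervention.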

While Zaharia no longer contributes code directly to the project, he helps manage and advise the Databricks team that works on Spark.

He said that by making the project open source in its early years, it has encouraged "a big ecosystem of libraries" that has helped the platform "get better" for everyone. "Making it easy to extend libraries was good," he said.

But one thing he wishes he had introduced from the start is a form of backward compatibility between applications and Spark. He said the community was working on Spark Connect, which would let client applications written against Spark become independent of the version of the server and cluster behind them.

"We're working on that now. It's kind of cool, but I wish we'd done it at the beginning if we had thought about it," he said.

He promised that after Spark 4.0, expected to be released in June, no one would have to update their apps to a new version of Spark. "Of course, they can write new apps to take advantage of the new features," he said.

While Spark may have been born in the era of big data and Hadoop, it is vital to the latest trend in computing. Zaharia said the largest LLMs in the world all prepared their data using Spark. "It's one of the use cases we care about. It's been really interesting to see new things people are doing with it." ®

https://www.theregister.com/2024/06/14/ten_years_apache_spark/
