Apache Arrow, making Spark even faster [3AE]
Nowadays, as part of my daily job, I have to ask more of “Why is this working?” and less of “What is this?” or “How did you do that?”. This shift in questions often leads me down rabbit-holes of different tools and tech, and I eventually end up writing Google search queries which look something like:

The thing about engineering is that there’s way too much to learn and too little time to learn it. The secret (or not so secret) hack is to identify the fundamentals and understand them properly. This leads to you connecting dots between things which might not feel related on the first go. After working with some popular data engineering tools, I realized that there’s one package present in every requirements.txt:

Sorry, this one:

The Setup
There’s a high chance you’ve never worked with Apache Arrow directly, but it’s almost certainly there in every project of yours which deals with big data. Following are some of the tools which use Apache Arrow extensively:
Apache Spark
Dremio
Pandas
Ray
HuggingFace datasets
You can check the extensive list here. The point being, Apache Arrow is like that cool kid who gets invited to every party on campus. So I figured, why not try to understand what makes it so popular? Unfortunately, not many people have written about it, so this is me making an attempt to do the same.
In simplest words possible:
Apache Arrow is a way to efficiently represent and work with data in computer memory, regardless of the programming language being used.
The top 3 concepts involved in the definition above would be:
1. Storing the data in a standardized columnar format
2. Keeping the data in memory for faster processing
3. Having interfaces in multiple languages for cross-platform development
Before diving deep into how Arrow works in action, let’s go through some of its core ideas which make it so relevant in today’s data landscape.
How is data stored with Apache Arrow?
Arrow stores data as a table-like structure. It defines an in-memory format, so it’s less bothered about how the data is stored on disk and more interested in how it’s represented once loaded into memory. That, my friend, is what you call and know as the columnar format, which is incredibly helpful when you’re interested in only a subset of the data.
The diagram below shows how some data from a CSV file looks on disk, and how fancy it becomes when loaded into memory (useful when you want to search for a dotted line).

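If you want to see this columnar idea hands-on, here’s a minimal sketch using pyarrow (the Python library for Arrow); the table contents are made up purely for illustration:
import pyarrow as pa

# Build an Arrow table from plain Python lists; each column ends up
# as its own contiguous buffer in memory.
table = pa.table({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
})

# Pulling out a single column only touches that column's buffers,
# no row-by-row scanning involved.
print(table.column('age'))
print(table.schema)  # name: string, age: int64
Reading a CSV with pyarrow.csv.read_csv gives you the exact same kind of table, just populated from disk.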
How does reading data work with Apache Arrow?
A side mission for you is to understand what serialization and deserialization are. And tbh, I might even write about them some time, as they come up every now and then in pretty much any discussion about data.
But long story short, every data transfer between 2 systems involves a bit of serialization overhead. As the data grows, the overhead grows along with it. Arrow solves this problem and enables zero-copy reads by bringing in standardization. This prevents copying data unnecessarily: 2 systems, one working in Python and another in R, can use the same Arrow object to read data. Instead of exchanging serialized copies, both of them use memory buffers with pointer-based access to the data they’re interested in.
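To make the zero-copy part a little more concrete, here’s a rough sketch using pyarrow’s IPC file format and a memory-mapped file; the file name users.arrow is just an example:
import pyarrow as pa
import pyarrow.ipc as ipc

# Write a small table once, in Arrow's IPC file format.
table = pa.table({'name': ['Alice', 'Bob'], 'age': [25, 30]})
with pa.OSFile('users.arrow', 'wb') as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Memory-map the file and read it back: the loaded table's buffers point
# straight into the mapped file, so the column data is not copied or
# deserialized on the way in.
with pa.memory_map('users.arrow', 'r') as source:
    loaded = ipc.open_file(source).read_all()

print(loaded.column('age'))
Any other Arrow implementation (R, Java, C++, you name it) can map the same file and see the same buffers, which is the standardization doing its job.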
How does writing data with Apache Arrow work?
Writing data in Arrow means using its in-memory format, which organizes memory buffers into columns and batches. The reason this is better than existing formats like CSV or JSON is because:
a) data is stored in a columnar structure, so all values of one column reside together.
b) this enables vectorised processing, which in simple words means performing operations on entire columns efficiently.
c) additionally, metadata and schema information are carried along, which enables better schema evolution and type-safe data access.
The diagram below is a great demonstration of what it means to work with an Arrow object, and the sketch right after it shows the same idea in code:

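Here’s a small sketch of what that looks like in practice with pyarrow: an explicit schema plus data organized into record batches. The column names and the output path are illustrative only:
import pyarrow as pa
import pyarrow.ipc as ipc

# The schema travels with the data, which is what enables type-safe
# access and sane schema evolution downstream.
schema = pa.schema([
    ('name', pa.string()),
    ('age', pa.int64()),
])

# Data is written as record batches: column-oriented chunks of rows.
batch = pa.RecordBatch.from_pydict(
    {'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]},
    schema=schema,
)

with pa.OSFile('people.arrow', 'wb') as sink:
    with ipc.new_file(sink, schema) as writer:
        writer.write_batch(batch)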
The Build
One way to see how Arrow is expanding its footprint in today’s world of data is by going through some already popular tech and identifying the parts of it which are now completely managed by Arrow. One popular example is the toPandas() functionality of Spark. This page discusses the same in detail. We can test it ourselves to see whether it actually works.
Let’s start by creating some dummy data:
import pandas as pd

# Generate 10 million rows of repetitive dummy data and dump it to a CSV file.
num_records = 10000000
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'] * (num_records // 5),
    'age': [25, 30, 35, 40, 45] * (num_records // 5),
}
df = pd.DataFrame(data)
df.to_csv('large_dataset.csv', index=False)
Now, let’s write a simple program which uses Arrow for the conversion from one format to the other:
from pyspark.sql import SparkSession
import time

spark = SparkSession.builder.appName("myApp").getOrCreate()
df = spark.read.csv("large_dataset.csv", header=True, inferSchema=True)

# Note: on Spark 3.0+ the config key is spark.sql.execution.arrow.pyspark.enabled;
# the older spark.sql.execution.arrow.enabled still works but is deprecated.

def to_pandas_without_arrow(df):
    spark.conf.set("spark.sql.execution.arrow.enabled", "false")
    start_time = time.time()
    pandas_df = df.toPandas()
    end_time = time.time()
    execution_time = end_time - start_time
    return pandas_df, execution_time

def to_pandas_with_arrow(df):
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    start_time = time.time()
    pandas_df = df.toPandas()
    end_time = time.time()
    execution_time = end_time - start_time
    return pandas_df, execution_time

pandas_df_without_arrow, execution_time_without_arrow = to_pandas_without_arrow(df)
pandas_df_with_arrow, execution_time_with_arrow = to_pandas_with_arrow(df)
spark.stop()

print("Execution Time without Arrow:", execution_time_without_arrow, "seconds")
print("Execution Time with Arrow:", execution_time_with_arrow, "seconds")
While we’re at it, let’s do it the other way as well: convert a pandas dataframe into a Spark one, to see whether the format standardization really does work.
import pandas as pd
import time
from pyspark.sql import SparkSession

# pandas 2.0 removed DataFrame.iteritems, which older Spark versions still call
# internally when converting pandas data; alias it back to DataFrame.items as a shim.
pd.DataFrame.iteritems = pd.DataFrame.items

spark = SparkSession.builder.appName("myApp").getOrCreate()
pandas_df = pd.read_csv("large_dataset.csv")

def pandas_to_spark_without_arrow(pandas_df):
    spark.conf.set("spark.sql.execution.arrow.enabled", "false")
    start_time = time.time()
    spark_df = spark.createDataFrame(pandas_df)
    end_time = time.time()
    execution_time = end_time - start_time
    return spark_df, execution_time

def pandas_to_spark_with_arrow(pandas_df):
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    start_time = time.time()
    spark_df = spark.createDataFrame(pandas_df)
    end_time = time.time()
    execution_time = end_time - start_time
    return spark_df, execution_time

spark_df_without_arrow, execution_time_without_arrow = pandas_to_spark_without_arrow(pandas_df)
spark_df_with_arrow, execution_time_with_arrow = pandas_to_spark_with_arrow(pandas_df)
spark.stop()

print("Execution Time without Arrow:", execution_time_without_arrow, "seconds")
print("Execution Time with Arrow:", execution_time_with_arrow, "seconds")
And voila! Our archer is ready to shoot down all the enemies in the next section.

The Payoff
Alright, it’s the big showtime we came here for.
For a file large_dataset.csv, which has 10,000,000 records.
The first program, which does PySpark ➜ Pandas, would have statistics looking something like this:

The second program, which does Pandas ➜ PySpark, would have the following statistics:

From 265 seconds to a mere 2.5? That’s more than 99% of the time saved because of just one flag. I know for a fact that results will vary under different circumstances, but even if they do, even half of this gain would be hard to believe. This is only possible because of everything we discussed in the previous sections.
And that my friend is the power of the arrow.

For the next one (assuming there is enough engagement and anyone actually wants to make it happen), what I have in mind is a full-fledged web app which:
Uses a Python web service to fetch RSS feed data
Uses a NodeJS backend server to get this data from the Python service
Uses a React frontend to display the same in real time
All of the above should use Arrow, to draw comparisons 🧙🏽‍♂️ !!!
In pop culture, I finished binging 2 long running shows, The Walking Dead and The Sinner.
Dune: Part Two was everything every Denis fan needed it to be. So glad I got to experience it in a theatre. It might easily win a few Oscars next year for its visuals and sound effects.

The Beekeeper was a great no-brainer action flick, just another John Wick, but this time he likes bees instead of a dog.
Finally, a lot of Seedhe Maut in my work playlists, which is how it sometimes goes when you’re swamped in Hadoop errors for 5+ hours and end up realizing it’s a release version issue.
Before going back to living under a rock, a picture from a recent trip:

fin!!!