CSV vs. Parquet vs. Arrow: Storage Formats Explained for Normal People


Alright, let's talk about something that sounds boring but actually matters a lot. File formats. I know, I know—your eyes are already glazing over. But hear me out. If you work with data, even just a little bit, the way you save your files is probably slowing you down. Or costing you money. Or both.

You've got your old reliable CSV. Then there's this newer thing called Parquet that everyone keeps mentioning. And now Arrow is crashing the party. What are they? Which one should you use? And why do data engineers get so excited about columnar formats?

I'm gonna break this down in plain English. No complicated jargon, no assuming you already know stuff. Just real talk about how these three formats work, when to use each one, and why picking the wrong one can make your life miserable. Let's dig in.

CSV - The Old Faithful (And Why It's Letting You Down)

CSV stands for Comma Separated Values. It's been around basically since the dawn of personal computing. You probably use it every day without thinking about it. Export from Excel? That's CSV. Download a report from some dashboard? CSV. Share data with a colleague who uses a different tool? CSV again.

How it works: CSV is dead simple. It's plain text. Each row is a line. Each column is separated by a comma. Sometimes a semicolon if you're in Europe. That's it. You can open it in Notepad and read it. A human can understand it without any special software. That's its superpower.

The good parts: Everyone supports it. I mean everyone. Excel, Google Sheets, Python, R, SQL databases, that weird accounting software from 1998—they all speak CSV. It's the universal language of data. Also, it's human readable. You can see the data right there. No decoding needed.

The bad parts (and there are many): CSV is SLOW. Like, painfully slow. Every time you read a CSV file, the computer has to parse every single character, figure out where the commas are, convert "123" from text to a number, and do this row by row. If you have a million rows? That's a million conversions.

Also, CSV is dumb about data types. Everything is text. That column that says "2024-01-15"? CSV doesn't know it's a date. That column that looks like numbers? CSV doesn't know they're integers. Every tool has to guess. And tools guess wrong all the time. Ever opened a CSV and had leading zeros disappear from ZIP codes? Yeah, that's CSV.
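
Here's a quick sketch in Python of what that looks like in practice. The file and column names are made up, but the behavior is real: pandas guesses types unless you tell it otherwise.

    import pandas as pd

    # Left to guess, read_csv turns "02115" into the integer 2115 and leaves dates as plain strings.
    df = pd.read_csv("customers.csv")

    # Spelling out the types keeps ZIP codes as text and parses dates properly.
    df = pd.read_csv("customers.csv", dtype={"zip_code": str}, parse_dates=["signup_date"])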

Compression is another problem. CSV is verbose. Every column name repeats in the header. Every comma takes space. Numbers are written out character by character. A big CSV can be gigabytes when a smarter format would be megabytes. Your storage costs, your network transfer costs—you're paying for all that wasted space.

When to use CSV: Honestly? Only when you have to. When you're sharing data with someone who isn't technical and needs to open it in Excel. When you're doing a quick one-off export. When interoperability with legacy systems is non-negotiable. Otherwise, there are better options.

Parquet - The Columnar Wonder

Parquet showed up around 2013, created by Twitter and Cloudera. It was designed for big data—like, really big. We're talking terabytes and petabytes. The kind of data that would make CSV just give up and cry.

How it works (the simple version): CSV stores data row by row. Row 1, then Row 2, then Row 3. Parquet stores data column by column. All of Column A, then all of Column B, then all of Column C. This seems like a small change but it's actually revolutionary.

Why does columnar storage matter? Imagine you have a table with 100 columns but you only need 3 of them for your analysis. With CSV, the computer has to read every single row AND every single column, then throw away the 97 columns you don't need. That's a huge waste of I/O and memory. With Parquet, the computer reads only the 3 columns you actually want. It skips the rest entirely. This is massively faster and uses way less resources.
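
If you're curious what that looks like in code, here's a minimal pandas sketch (the file and column names are just placeholders):

    import pandas as pd

    # Only these three columns are read from disk; the other 97 are skipped entirely.
    df = pd.read_parquet("events.parquet", columns=["user_id", "event_date", "amount"])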

Compression is insane: Because Parquet stores similar data together—all the dates in one block, all the ZIP codes in another—it can compress incredibly well. Numbers that are close to each other compress better than random scattered values. Parquet files are often 75-90% smaller than the same data in CSV. That's not a typo. Ninety percent smaller.

Schema and types: Parquet stores the schema inside the file. It knows that column 3 is an integer, column 4 is a date, column 5 is a string. When you read a Parquet file, there's no guessing. The data types are preserved. No more missing leading zeros. No more "is this a number or text?" ambiguity. It just works.

Predicate pushdown (fancy term, simple concept): This is huge. Parquet files store metadata about each block of data—min values, max values, etc. So if you query "WHERE date > '2024-01-01'", Parquet can look at the metadata and say "this block's max date is 2023-12-31, so I don't even need to read it." It skips entire chunks of data without ever touching them. This makes queries lightning fast on big datasets.
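
Here's roughly what that looks like with PyArrow. The file and column names are assumptions; the point is the filters argument, which lets the reader skip row groups whose min/max metadata rules them out:

    import datetime
    import pyarrow.parquet as pq

    # Row groups whose metadata says "max event_date is 2023-12-31 or earlier" are never read.
    # Assumes event_date is stored as a date column in this hypothetical file.
    table = pq.read_table(
        "events.parquet",
        filters=[("event_date", ">", datetime.date(2024, 1, 1))],
    )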

The downsides: Parquet is not human readable. You can't open it in Notepad. You need special tools or libraries. Also, it's not great for tiny files. The overhead of the schema and metadata means that if you have a 10KB CSV, converting it to Parquet might actually make it bigger. Parquet shines at scale—hundreds of MBs and up.

When to use Parquet: Anytime you're working with large datasets. Data lakes. Analytical workloads. Data warehousing. If you're storing data in cloud storage like S3 or ADLS and querying it with tools like Spark, Presto, or even Power BI, Parquet is probably your best friend.

Arrow - The In-Memory Rocket

Arrow is the new kid. It's not exactly a storage format like CSV and Parquet. It's more of an in-memory format specification. But it's important and it's changing how data moves around.

What is Arrow actually? Apache Arrow is a columnar memory format. It defines how data should be arranged in RAM for maximum performance. The key insight is that the bottleneck in data processing is almost never the CPU—it's moving data around. Reading from disk. Serializing and deserializing. Copying data from one format to another. Arrow eliminates most of that.

Zero-copy: This is Arrow's magic trick. When data is in Arrow format, multiple tools can access it without copying or converting it. A Python library can read it. An R script can read it. A SQL engine can read it. They all just look at the same chunk of memory. No serialization. No parsing. No "save to CSV then load into the next tool." It's already there, ready to use.
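
A small sketch of what that feels like in Python, with made-up data. Polars and DuckDB can both work off the same Arrow table sitting in memory, with little or no copying:

    import pyarrow as pa
    import polars as pl
    import duckdb

    # One Arrow table in RAM...
    sales_tbl = pa.table({"city": ["Boston", "Austin"], "sales": [120, 95]})

    # ...viewed by Polars (wraps the same buffers)...
    pl_df = pl.from_arrow(sales_tbl)

    # ...and queried by DuckDB in place, picked up by variable name.
    total = duckdb.sql("SELECT sum(sales) FROM sales_tbl").fetchall()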

Columnar, like Parquet: Arrow is also columnar. So it gets all the benefits of columnar processing—compression, vectorized operations, skipping unnecessary columns. But it's designed for RAM, not disk. It's optimized for the CPU cache, for SIMD instructions, for modern hardware.

Interoperability: Arrow has become the bridge between data tools. Pandas can output Arrow. Spark can read Arrow. DuckDB, R, TensorFlow, PyTorch—they all speak Arrow now. It's becoming the universal translator for data.

Feather and IPC: Arrow has a couple of on-disk formats too. Feather is Arrow saved to disk. It's fast—like, really fast—to read and write. But the files are bigger than Parquet because Feather prioritizes speed over compression. It's great for intermediate data, not archival storage.
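
A tiny example of the round trip, assuming pandas with pyarrow installed (the file name is arbitrary):

    import pandas as pd

    # Feather is Arrow written to disk: very fast to write and read back, but bigger than Parquet.
    df = pd.DataFrame({"a": range(1_000_000), "b": ["x"] * 1_000_000})
    df.to_feather("intermediate.feather")
    df2 = pd.read_feather("intermediate.feather")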

The downsides: Arrow is not a good long-term storage format. The files are big. The format is still evolving. And if you're not doing high-performance analytics, you might not need it. It's a solution to a problem you might not have yet.

When to use Arrow: When you need speed. When you're passing data between different tools in a pipeline. When you're doing interactive analytics and waiting even a few seconds is annoying. When you're using DuckDB or Polars or modern data tools that leverage Arrow natively.

The Head-to-Head Comparison

Let's put them side by side and see how they actually stack up.

Read Speed: Arrow is fastest, by far. It's literally designed for this. Parquet is slower to read than Arrow but much faster than CSV, especially if you're only reading a subset of columns. CSV is slowest, especially as files get bigger.

Write Speed: CSV is actually pretty fast to write. It's just spitting out text. Arrow (Feather) is also very fast. Parquet is slower to write because it has to do all that compression and encoding. It's worth the tradeoff for storage savings, but it's not instant.

File Size: Parquet wins, no contest. CSV is huge. Arrow/Feather is medium—bigger than Parquet, smaller than CSV usually. If you're paying for cloud storage, Parquet saves you money.

Schema Enforcement: Parquet and Arrow both have strong, explicit schemas. CSV has no schema—it's just text and hope.

Human Readable: CSV only. Parquet and Arrow are binary formats. You can't eyeball them.

Ecosystem Support: CSV is everywhere. Parquet is supported by every major data tool (Spark, Hive, Presto, BigQuery, Redshift, Snowflake, etc.). Arrow is growing fast but not as universal yet.

Compression: Parquet is king. CSV can be compressed with gzip or zip, but then you lose human readability and random access. Arrow Feather only applies lightweight compression, so its files stay noticeably bigger than Parquet.

Cost Effectiveness: For storage, Parquet. For compute, Arrow (less CPU time). For development time, CSV (but you pay for it later).

Real World Scenarios - What Should You Actually Do?

Okay, enough theory. Here's practical advice for real situations.

Scenario 1: You're building a data lake on cloud storage.

Use Parquet. Full stop. Partition it by date or category. Compress with Snappy or Zstandard. You'll save 80% on storage costs and your queries will be 10x faster. This is not controversial—it's what everyone does.
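
Here's a minimal sketch with pandas (PyArrow under the hood). The frame, path, and column names are placeholders, and writing straight to s3:// additionally needs the s3fs package:

    import pandas as pd

    df = pd.DataFrame({
        "event_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
        "region": ["us", "eu", "us"],
        "amount": [10.0, 12.5, 7.0],
    })

    # One subfolder per event_date, zstd-compressed Parquet files inside.
    df.to_parquet("datalake/events/", partition_cols=["event_date"], compression="zstd")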

Scenario 2: You're passing data between steps in a pipeline.

Use Arrow Feather. Or just keep it in memory in Arrow format if your tools support it. Don't write CSV to disk between transformations. You're just slowing yourself down and wearing out your SSD.

Scenario 3: You're sharing data with a non-technical stakeholder.

CSV. Or better yet, Excel. Don't send them a Parquet file. They'll have no idea what to do with it. CSV is ugly but it works.

Scenario 4: You're doing interactive data analysis in Python or R.

Use Parquet for your base datasets. Read them once. Then use Arrow-backed tools like Polars or DuckDB for the actual analysis. Pandas 2.0+ also supports Arrow-backed DataFrames, which are much faster than traditional Pandas.
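
For example, DuckDB will happily query Parquet in place, reading only the columns and row groups it needs (the path and column names here are just illustrative):

    import duckdb

    result = duckdb.sql("""
        SELECT region, sum(amount) AS total
        FROM 'datalake/events/*/*.parquet'
        GROUP BY region
    """).df()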

Scenario 5: You're archiving data for compliance or legal reasons.

Parquet, but maybe keep a CSV copy too. Parquet is open source and will be readable forever, but some auditors get nervous about binary formats. It's silly, but sometimes you gotta play the game.

Scenario 6: You're working with streaming data.

Arrow is great here. Its columnar format allows for efficient vectorized processing of batches. Many streaming engines are adopting Arrow for exactly this reason.

Common Mistakes People Make

I've seen these over and over. Learn from other people's pain.

Mistake 1: Using CSV for everything.

I get it. It's comfortable. It's familiar. But your 2GB CSV file that takes 5 minutes to load? That could be a 200MB Parquet file that loads in 30 seconds. You're burning time and money.

Mistake 2: Not partitioning Parquet files.

A single giant Parquet file is better than a giant CSV, but it's not optimal. Partition by columns you filter on frequently—date, region, category. This enables predicate pushdown and makes queries even faster.

Mistake 3: Using Parquet for tiny files.

If you have thousands of tiny Parquet files, you're actually making things worse. The overhead of opening each file, reading metadata, etc. adds up. Parquet wants files that are tens to hundreds of MB. Consolidate your small files.
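
One way to do that, sketched with PyArrow (directory and file names are placeholders): read the folder of small files as one dataset, then write it back out as a single file.

    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    # Treat thousands of tiny files as one logical table, then rewrite it once.
    table = ds.dataset("small_files/", format="parquet").to_table()
    pq.write_table(table, "consolidated.parquet", compression="zstd")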

Mistake 4: Ignoring Arrow for cross-language work.

If you're building a system where Python talks to R talks to Java, and you're passing CSV strings between them, you're doing it wrong. Arrow can pass data with zero serialization cost. It's a game changer.

Mistake 5: Assuming all CSV is the same.

CSV has like fifty variations. Quotes or no quotes? Escape characters? Multi-line fields? Encoding? Delimiter? Every time you touch CSV, something breaks. It's the gift that keeps on giving.
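
Every one of those variations shows up as another knob you have to get right. A typical defensive read in pandas looks something like this (the file name and settings are illustrative):

    import pandas as pd

    df = pd.read_csv(
        "export.csv",
        sep=";",              # European-style delimiter
        quotechar='"',        # fields wrapped in quotes
        encoding="latin-1",   # because it wasn't UTF-8, of course
    )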

The Future - Where Things Are Going

CSV isn't going anywhere. It's too entrenched. But for serious data work, it's becoming a legacy format—something you use to import/export, not something you use to actually work with data.

Parquet is the standard for analytical storage. It's mature, widely adopted, and not going away. If you're building a modern data stack, Parquet is in your future.

Arrow is the wildcard. It's gaining momentum incredibly fast. DuckDB, Polars, Ibis, newer versions of Pandas—they're all Arrow-native. Microsoft is integrating Arrow into Fabric. The major cloud providers are adding Arrow support. In five years, I think Arrow will be as common as Parquet, but for different use cases—Parquet for storage, Arrow for processing and interchange.

The holy grail? Tools that can read Parquet directly into Arrow memory with zero copying, then operate on that data with SIMD-optimized kernels. That's already happening. The line between storage format and memory format is blurring.

The Bottom Line

Stop using CSV for everything. I mean it. You're making your life harder than it needs to be.

Use Parquet for your data lakes, your analytical workloads, your "I need to keep this forever" datasets. Use Arrow for data that's actively being processed, moving between systems, or needs to be accessed at interactive speeds. Use CSV when you absolutely have to—when the person on the other end doesn't know what Parquet is and just wants to open the file in Excel.

None of this is hard. The tools are all free, open source, and well documented. You don't need permission. Just start saving your files differently. Your future self will thank you when queries take seconds instead of minutes and your cloud storage bill isn't ridiculous.

FAQs: 

1. Can I convert CSV to Parquet without losing data?

Yes, absolutely. Tools like Pandas, Spark, DuckDB, and even command-line utilities can do this. The data stays the same, it's just stored differently. The only caveat is that you need to define the schema correctly—Parquet cares about data types, so make sure dates are dates, numbers are numbers, etc.
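
A minimal conversion in pandas, with placeholder names, looks like this. The dtype and date hints are the "define the schema correctly" part:

    import pandas as pd

    df = pd.read_csv("report.csv", dtype={"zip_code": str}, parse_dates=["order_date"])
    df.to_parquet("report.parquet")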

2. Is Parquet always smaller than CSV?

Usually, yes, often dramatically smaller. But for very small files or data that's already highly compressed (like JPEGs inside a CSV—don't do that), Parquet might not help. For typical tabular data, expect 75-90% size reduction.

3. Can I open Parquet files in Excel?

Not directly, no. Excel doesn't natively read Parquet. You can use Power Query in newer versions of Excel to import Parquet, or use a third-party add-in. But honestly, if you're sending Parquet to someone who only uses Excel, you're probably using the wrong format.

4. What's the difference between Parquet and Arrow?

Parquet is for disk storage. It's optimized for space efficiency and long-term archival. Arrow is for memory and fast processing. It's optimized for speed and zero-copy sharing. They're complementary, not competitors. Many modern systems read Parquet from disk into Arrow in memory.

5. Is Arrow faster than Parquet for querying?

For data already in memory, yes, Arrow is much faster. But Parquet is on disk. You have to read it first. The question is whether the time to read Parquet + query is faster than reading CSV + query. For most analytical queries, Parquet wins because you read less data. Arrow wins when data is already loaded or when you're doing many operations on the same dataset.

6. What compression should I use with Parquet?

Snappy is the default and a good balance of speed and size. Zstandard (zstd) often gives better compression with similar speed. Gzip gives the smallest files but is slower to read/write. For most use cases, Snappy or zstd are fine. Don't overthink it.
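
Switching codecs is a one-argument change in most tools; for example, in pandas (with made-up data):

    import pandas as pd

    df = pd.DataFrame({"x": range(1000)})
    df.to_parquet("data_snappy.parquet", compression="snappy")  # the default
    df.to_parquet("data_zstd.parquet", compression="zstd")      # usually a bit smaller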

7. Can I use Parquet for streaming data?

You can, but it's not ideal. Parquet files need to be completed before they're readable. You can't append to a Parquet file like you can with CSV. For streaming, look at Arrow (and Arrow Flight) for data in flight, or at a message system like Apache Kafka, and write completed Parquet files out in batches afterward.

8. Does Power BI support Parquet?

Yes, Power BI can read Parquet files directly, especially through Power Query. With Microsoft Fabric, Parquet is actually the native storage format for OneLake. So yeah, big support there.

9. Is CSV completely useless now?

No, that's not fair. CSV is like a screwdriver—it's not fancy, it's not fast, but sometimes it's exactly what you need. For quick one-off exports, for sharing with non-technical users, for simple data interchange between disparate systems, CSV still has a place. Just don't build your entire infrastructure around it.

10. What's the learning curve for switching from CSV to Parquet/Arrow?

Honestly? Almost zero. In Python, it's df.to_parquet('file.parquet') instead of df.to_csv('file.csv'). In R, it's write_parquet(). In Spark, it's df.write.parquet(). The APIs are the same. You already know how to use them. You're just changing the file extension and getting superpowers. There's no excuse.


P.S. I wrote this because I got tired of watching people wait five minutes for CSVs to load when Parquet would take thirty seconds. Don't be that person. Your time is worth more than a file format. 

P.P.S. If someone asks you "should I use CSV or Parquet?" and they're working with more than 100MB of data, the answer is Parquet. It's not even a debate anymore. Spread the word.
