Keep It Simple: Exploring a CSV Entirely with Bash

I was scrolling through my inbox the other day when a random email caught my eye. Some small online store I vaguely remember talking to months ago had attached their full sales CSV: 50k rows, no warning, no context. Just “here’s the data, thought you might find it interesting.” (This is all made up, obviously. Nobody actually sends raw data like that out of the blue. It’s just a setup for the story.)

Normally my reflex would be to open Python, spin up a quick script, load pandas, start slicing. But I glanced at the clock, saw I had about twenty minutes free, and decided to test something. The file is plain text. My terminal is already open. Why not see how much I can learn using only Bash and classic Unix tools?

File name: sales.csv. I begin where everyone begins.

head -n 10 sales.csv

The columns show up clearly: order_id,date,customer_id,product,category,units,price_eur,status. Dates in YYYY-MM-DD format, commas separating everything, header on the first row. No messy quotes or escaped characters. Perfect.
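
A quick sanity check I like before hard-coding field numbers: print the header one column per line, numbered, so the numbers line up with awk's $1 through $8.

head -n 1 sales.csv | tr ',' '\n' | cat -n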

First, how much data am I even dealing with?

wc -l sales.csv

50,012 lines total. That means 50,011 actual orders plus the header. Solid dataset for a small shop.
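
If you'd rather skip the mental subtraction, dropping the header first counts the data rows directly:

tail -n +2 sales.csv | wc -l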

Curiosity kicks in: how many unique products are in here?

I extract column 4 (product), skip the header, keep only unique values, and count them.

cut -d',' -f4 sales.csv | tail -n +2 | sort -u | wc -l

183 different products. More variety than I expected.
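
While I'm in there, swapping sort -u for sort | uniq -c shows which products appear on the most order lines, not just how many exist:

cut -d',' -f4 sales.csv | tail -n +2 | sort | uniq -c | sort -nr | head -5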

Category Breakdown

Next, let’s see which categories move the most volume.

awk -F',' 'NR>1 {units[$5] += $6} END {for (c in units) print units[c], c}' sales.csv | sort -nr | head -10

This sums column 6 (units) for each unique value in column 5 (category), then sorts descending and shows the top 10. Electronics dominates, followed by clothing and home goods. Typical pattern.
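
The same pattern answers the revenue-per-category version of the question; just sum units times price (column 7) instead. Note this counts every row regardless of status:

awk -F',' 'NR>1 {rev[$5] += $6*$7} END {for (c in rev) print rev[c], c}' sales.csv | sort -nr | head -10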

Revenue is the obvious big question.

awk -F',' 'NR>1 && $8 ~ /completed/ {total += $6 * $7} END {print "Completed revenue:", total "€"}' sales.csv

I filter for completed orders only (column 8) and multiply units by price. Comes out around 850k euros. Healthy number.
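
One caveat: ~ /completed/ is a substring regex match, so it would also catch a hypothetical status like “completed_refund”. If you're not sure the status values are clean, an exact comparison is safer:

awk -F',' 'NR>1 && $8 == "completed" {total += $6 * $7} END {print "Completed revenue:", total "€"}' sales.csv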

Average order value follows naturally.

awk -F',' 'NR>1 && $8 ~ /completed/ {value += $6*$7; count++} END {print "Avg order:", value/count "€"}' sales.csv

Roughly 72 euros per completed order. Fits the product mix.
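
Averages hide skew, so a rough median is a useful companion. A minimal sketch (for an even row count this takes the lower of the two middle values rather than interpolating):

awk -F',' 'NR>1 && $8 ~ /completed/ {print $6*$7}' sales.csv | sort -n | awk '{v[NR]=$1} END {print "Median order:", v[int((NR+1)/2)] "€"}'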

Monthly Trends

Now I want monthly trends.

awk -F',' 'NR>1 && $8 ~ /completed/ {split($2, d, "-"); month=d[1]"-"d[2]; rev[month] += $6*$7} END {for (m in rev) print m, rev[m]}' sales.csv | sort

This splits the date column, groups by year-month, sums revenue for completed orders, and sorts chronologically. Classic December holiday spike, January dip, then steady growth through the year.
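
If you want a visual without leaving the terminal, the same aggregation can print a crude bar chart. One # per 5,000€ is an arbitrary scale I picked; adjust to taste:

awk -F',' 'NR>1 && $8 ~ /completed/ {split($2, d, "-"); rev[d[1]"-"d[2]] += $6*$7} END {for (m in rev) {printf "%s ", m; for (i=0; i<rev[m]/5000; i++) printf "#"; print ""}}' sales.csv | sort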

Cancellation rate feels worth checking.

awk -F',' 'NR>1 {status[$8]++} END {print "Canceled:", status["canceled"]/(NR-1) * 100 "%"}' sales.csv

Around 8%. Reasonable.
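
For the full picture across every status value, the sort | uniq -c trick from earlier works here too:

cut -d',' -f8 sales.csv | tail -n +2 | sort | uniq -c | sort -nr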

Quick Summary Block

At this point I have enough for a clean overview, so I build a quick summary block.

echo "=== Quick Sales Overview ==="
echo "Total orders: $(tail -n +2 sales.csv | wc -l)"
echo "Unique products: $(cut -d',' -f4 sales.csv | tail -n +2 | sort -u | wc -l)"
echo "Top category by units: $(awk -F',' 'NR>1 {units[$5]+=$6} END {for (c in units) if (max<units[c]) {max=units[c]; best=c}} END {print best " (" max ")"}' sales.csv)"
echo "Total completed revenue: $(awk -F',' 'NR>1 && $8 ~ /completed/ {sum += $6*$7} END {printf "%.0f€\n", sum}' sales.csv)"
echo "Cancellation rate: $(awk -F',' 'NR>1 {if ($8 ~ /canceled/) canc++} END {printf "%.1f%%\n", canc/(NR-1)*100}' sales.csv)"

Run it once, copy the output. Instant readable report.
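
If I needed this report more than once, I'd drop those lines into a small script and pass the file as an argument. A sketch (sales_report.sh is just a name I made up):

#!/usr/bin/env bash
# sales_report.sh (hypothetical) - usage: ./sales_report.sh sales.csv
f="${1:?usage: $0 file.csv}"
echo "=== Quick Sales Overview ==="
echo "Total orders: $(tail -n +2 "$f" | wc -l)"
echo "Unique products: $(cut -d',' -f4 "$f" | tail -n +2 | sort -u | wc -l)"
# ...plus the remaining echo lines from above, with sales.csv replaced by "$f"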

For fun, I check order size distribution.

cut -d',' -f6 sales.csv | tail -n +2 | awk '{if ($1==1) bin="1 unit"; else if ($1<=5) bin="2-5"; else if ($1<=10) bin="6-10"; else bin="11+"; count[bin]++} END {for (b in count) print b ": " count[b]}'

Mostly small orders, with a small tail of bulk purchases.
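
One nit with that last command: awk's for (b in count) iterates in arbitrary order, so the bins can print shuffled. Hard-coding the output order fixes it:

cut -d',' -f6 sales.csv | tail -n +2 | awk '{if ($1==1) a++; else if ($1<=5) b++; else if ($1<=10) c++; else d++} END {print "1 unit: " a+0; print "2-5: " b+0; print "6-10: " c+0; print "11+: " d+0}'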

And then, before I even realized it, twenty minutes had passed. Yet there I was with all the key numbers, trends, risks, and strengths clearly laid out.

I could have fired up Python and gotten the same answers, maybe with prettier plots. But for pure exploration on a single clean CSV under a few hundred MB, Bash delivered faster than any notebook ever could.

Sometimes keeping it simple really is the smartest move.
