Polars: A High-Performance DataFrame Library for Rust

I know I’m late, but let me talk about Polars! I want to cover Polars in Rust because it greatly enhances Rust’s capabilities for data science, which is our main focus here.

This blog is still pretty new, which is the only reason this post is a bit late. Nothing else stopped me!

Polars is a blazing-fast DataFrame library built in Rust, designed for efficient processing of large datasets. You can think of it as pandas on steroids, much faster and more memory-efficient.

What Sets Polars Apart?

If you’re already used to Pandas in Python, you might naturally wonder, “Why should I switch?” Well, it’s not just for fun: there are many good reasons to try Polars. You don’t have to switch completely, but it’s definitely worth exploring!

Polars stands out from other DataFrame libraries mainly because it’s powered by Rust, a language known for its speed and strong memory safety. This foundation lets Polars handle large datasets much faster and more efficiently than many traditional tools.

One of its coolest features is lazy evaluation, which means Polars builds an optimized query plan before running any operations. This cuts down on unnecessary work and speeds up complex data workflows.

On top of that, Polars automatically uses all the cores of your CPU by running tasks in parallel, no extra setup required. It also leverages the Apache Arrow columnar memory format, improving how your computer accesses and caches data, which matters a lot for high-performance processing.

Finally, Polars works smoothly with common data formats like CSV, Parquet, and JSON, so you can easily plug it into your existing data pipelines. All these features make Polars a uniquely fast, flexible, and reliable choice for data science with Rust.

Getting Started with Polars

For this article, I’ve included Rust examples since Polars works great here, but it’s also available for Python, Node.js, R, and SQL if you want to explore later.

First, add Polars to your Cargo.toml:

[dependencies]
polars = { version = "0.38", features = ["lazy", "csv"] }

Then, in main.rs:

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let df = CsvReader::from_path("data.csv")?
        .infer_schema(None)
        .has_header(true)
        .finish()?;

    let filtered = df.lazy()
        .filter(col("age").gt(lit(30)))
        .collect()?;

    println!("{:?}", filtered);

    Ok(())
}

Now we’re working with Polars, and here’s a simple example. Most of the time, everything begins with a CSV file, so in this example we load a CSV and apply a basic filter.

I’ll briefly explain the methods we used so you won’t have to guess why they’re there.

.infer_schema(None) lets CsvReader guess the column types automatically. .has_header(true) tells it the first row has column names. And .finish()? actually reads the file and gives us the DataFrame. Simple as that!

Later, we switch to the lazy API with df.lazy(), which lets Polars optimize things before running them. Then we use .filter(col("age").gt(lit(30))) to keep only the rows where the age is greater than 30. Finally, .collect()? runs the whole thing and gives us the filtered DataFrame.

Actually, you don’t even need the filter: you can simply skip it and call .collect() to materialize the DataFrame directly.
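The lazy-until-collect idea has a close analogue in plain Rust iterators, so here’s a tiny std-only sketch (not Polars itself, and with made-up numbers) of why deferring work until .collect() is useful:

```rust
fn main() {
    let ages = vec![42, 35, 34, 39, 28];

    // Like a Polars LazyFrame, iterator adapters build a pipeline
    // but compute nothing yet: `filter` and `map` are lazy.
    let pipeline = ages.iter().filter(|&&a| a > 30).map(|&a| a * 2);

    // Only `collect` actually drives the pipeline, in a single pass.
    let doubled: Vec<i64> = pipeline.collect();
    println!("{:?}", doubled); // [84, 70, 68, 78]
}
```

Because nothing runs until the end, the whole chain can be fused into one pass over the data, which is the same trick Polars’ query optimizer plays on a much bigger scale.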

┌───────┬─────┐
│ name  │ age │
│ ---   │ --- │
│ str   │ i64 │
╞═══════╪═════╡
│ Alice │ 42  │
│ Jack  │ 35  │
│ Mark  │ 34  │
│ Kira  │ 39  │
└───────┴─────┘

And you can expect output like this.

Group by and Aggregate

We can play around more, and that’s exactly why we’re here! So let’s dive in: this time we’ll load the same dataset, group the data by age, and calculate the average score per group. (The datasets are made up, so feel free to create your own!)

use polars::lazy::prelude::*;
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let lf = LazyCsvReader::new("data.csv").has_header(true).finish()?;

    let result = lf
        .group_by([col("age")])
        .agg([col("score").mean().alias("average_score")])
        .collect()?;

    println!("{:?}", result);

    Ok(())
}

┌─────┬───────────────┐
│ age │ average_score │
│ --- │ ---           │
│ i64 │ f64           │
╞═════╪═══════════════╡
│ 28  │ 89.0          │
│ 34  │ 80.0          │
└─────┴───────────────┘

It’s really easy to work with Polars, and on top of that you get so many advantages, as if easy weren’t already enough!

Now let’s talk about .group_by and .agg, two key methods that do the real magic here. With .group_by, we tell Polars how to split the data, like grouping everything by age.

Then .agg jumps in to say what we want to calculate for each group, like taking the average score.

Sorting and Selecting Columns

Let’s continue with other common operations: sorting and selecting columns. Actually, it’s mostly the same, just a few small changes.

use polars::lazy::prelude::*;
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let lf = LazyCsvReader::new("data.csv").has_header(true).finish()?;

    let result = lf
        .select([col("name"), col("score")])
        .sort("score", Default::default())
        .collect()?;

    println!("{:?}", result);

    Ok(())
}

┌─────────┬───────┐
│ name    │ score │
│ ---     │ ---   │
│ str     │ i64   │
╞═════════╪═══════╡
│ Charlie │ 75    │
│ Alice   │ 85    │
│ Dave    │ 88    │
│ Bob     │ 90    │
└─────────┴───────┘

By default, sorting is ascending, so here we just select the name and score columns and sort by score, smallest first.
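The select-then-sort combination maps nicely onto plain Rust, too. Here’s a std-only sketch (hypothetical rows mirroring the example data, not Polars code):

```rust
fn main() {
    // Hypothetical (name, score, age) rows matching the output above.
    let rows = vec![
        ("Alice", 85_i64, 30),
        ("Bob", 90, 25),
        ("Charlie", 75, 41),
        ("Dave", 88, 28),
    ];

    // "Select" name and score by mapping each row to a narrower tuple.
    let mut selected: Vec<(&str, i64)> = rows.iter().map(|&(n, s, _)| (n, s)).collect();

    // Sort ascending by score, matching Polars' default direction.
    selected.sort_by_key(|&(_, score)| score);

    println!("{:?}", selected);
    // [("Charlie", 75), ("Alice", 85), ("Dave", 88), ("Bob", 90)]
}
```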

Renaming and Dropping Columns

use polars::lazy::prelude::*;
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let lf = LazyCsvReader::new("data.csv").has_header(true).finish()?;

    let result = lf
        .select([col("name"), col("score").alias("final_score"), col("age")])
        .drop(["age"])
        .collect()?;

    println!("{:?}", result);

    Ok(())
}

┌─────────┬─────────────┐
│ name    │ final_score │
│ ---     │ ---         │
│ str     │ i64         │
╞═════════╪═════════════╡
│ Alice   │ 85          │
│ Bob     │ 90          │
│ Charlie │ 75          │
│ Dave    │ 88          │
└─────────┴─────────────┘

Here, we pick the columns we want with .select(), renaming "score" to "final_score" on the fly using .alias(). Then, we clean things up by dropping the "age" column with .drop().

It’s like saying: “Keep what I need, rename what’s clearer, and toss out the rest.” Simple, smooth, and exactly what you want when handling data.

Ah, and one more thing: if you want to drop rows rather than columns, you do it with filtering.

use polars::lazy::prelude::*;
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let lf = LazyCsvReader::new("data.csv").has_header(true).finish()?;

    // Drop rows where name == "Alice"
    let result = lf.filter(col("name").neq(lit("Alice"))).collect()?;

    println!("{:?}", result);

    Ok(())
}
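The same rows-are-dropped-by-filtering idea exists in plain std Rust as Vec::retain, which may make the mental model click (a std-only sketch with made-up names, not Polars code):

```rust
fn main() {
    let mut names = vec!["Alice", "Jack", "Mark", "Kira"];

    // Dropping rows really is just filtering: keep everything that
    // is not "Alice", like col("name").neq(lit("Alice")) above.
    names.retain(|&n| n != "Alice");

    println!("{:?}", names); // ["Jack", "Mark", "Kira"]
}
```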

Conclusion

This article will also be updated over time as I get feedback, but I think it’s a solid start for anyone who wants to look into Polars in Rust.

Essential topics we covered today: dropping, selecting, renaming, filtering, aggregating, and sorting, so you’re ready to process your next dataset in Rust with Polars.

Give it a shot, because I think this crate has huge potential for data science, and I believe it’s really good news for both Rust and Python programmers.

Pull up, and I rip it up like ballet.
