Okay, we’re on a streak with Rust. This is my third Rust article, and this time it’s a practical guide to complement my previous, more theoretical ETL article.
I assume you already know what an ETL pipeline is, or at least have read my previous article on the topic, so I won’t go into it in detail. I just want to briefly explain why Rust is actually a great choice for ETL pipelines.
I have to admit that I’ve developed a good relationship with Rust lately, which is the first reason I’m using it instead of Python these days. Beyond that, though, Rust is a genuinely solid choice over Python for data engineering.
It allows you to build safe and fast applications by default, and performance is crucial for ETL pipelines, especially when you’re working with millions of rows for extraction and transformation.
That’s why I prepared this article for people who also want to take advantage of Rust for ETL pipelines.
Setup
$ cargo new rust-etl
$ cd rust-etl
[dependencies]
csv = "1.3"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
Be sure to add these dependencies to your Cargo.toml file before you start.
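Alternatively, on Rust 1.62 or newer, you can let Cargo edit the manifest for you from the command line:

$ cargo add csv serde_json
$ cargo add serde --features derive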
Sample Dataset
name,age,email
Yuki,21,Yuki@example.com
April,13,April@example.com
Alex,34,Alex@example.com
We will build this pipeline to extract data from a CSV file. Later, we will remove users under 18 and save the eligible users for the campaign into a JSON file.
[E]xtracting
In Part [E], we will extract data from the CSV file, which should be located in your data folder.
use std::error::Error;
use std::fs::File;
use serde::{Deserialize, Serialize};
use csv::ReaderBuilder;
Import the necessary libraries; they are required for error handling, opening files, deserializing CSV records into structs, and working with the csv crate.
#[derive(Debug, Deserialize, Serialize)]
struct User {
    name: String,
    age: u8,
    email: String,
}
We define a struct called User to represent each row in the CSV file; every record will be mapped to one of these.
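As a quick side note: if a CSV header ever differs from your field name (imagine a hypothetical file whose header says email_address instead of email), serde can bridge the gap with a rename attribute:

#[derive(Debug, Deserialize, Serialize)]
struct User {
    name: String,
    age: u8,
    // Hypothetical header: maps the CSV column "email_address" to this field.
    #[serde(rename = "email_address")]
    email: String,
}

With the struct in place, the extraction function comes next.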
fn extract_data() -> Result<Vec<User>, Box<dyn Error>> {
    let file = File::open("../data/users.csv")?;
    // Treat the first row as headers and split fields on commas.
    let mut rdr = ReaderBuilder::new()
        .has_headers(true)
        .delimiter(b',')
        .from_reader(file);
    // Deserialize each record into a User and collect them.
    let mut users: Vec<User> = Vec::new();
    for result in rdr.deserialize() {
        let user: User = result?;
        users.push(user);
    }
    Ok(users)
}
We kick things off with the extract_data function, which is responsible for reading the CSV file and turning it into a list of User structs.
This process is pretty simple. First, we open the CSV file. Then, we create a reader with the has_headers method set to true, so the first row is treated as headers instead of data. We also specify the delimiter that separates the fields, and, of course, we hand the reader the file we just opened.
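As an aside, the same builder adapts to messier inputs. Here’s a rough sketch, assuming a hypothetical semicolon-delimited variant of the file with stray whitespace around fields:

use std::error::Error;
use std::fs::File;
use csv::{ReaderBuilder, Trim};

fn open_messy_reader() -> Result<csv::Reader<File>, Box<dyn Error>> {
    let rdr = ReaderBuilder::new()
        .has_headers(true)
        .delimiter(b';')  // fields separated by semicolons instead of commas
        .trim(Trim::All)  // strip whitespace around headers and fields
        .from_path("../data/users_eu.csv")?; // hypothetical file
    Ok(rdr)
}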
After that, we create a vector to hold the User objects and push each row into it so we can work with the data later. Now, it’s time for the transformation step.
[T]ransforming
fn transform_data(users: Vec<User>) -> Vec<User> {
    users.into_iter().filter(|user| user.age >= 18).collect()
}
The transformation step is quite simple here because there isn’t much to do: we just filter out the underage users, keeping only the adults for the campaign analysis.
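If the transform ever needs to do more than filter, the same iterator chain extends naturally. Here’s a sketch (my own extension, not part of the pipeline above) that also normalizes emails to lowercase:

fn transform_data(users: Vec<User>) -> Vec<User> {
    users
        .into_iter()
        .filter(|user| user.age >= 18) // keep adults only
        .map(|mut user| {
            user.email = user.email.to_lowercase(); // normalize for deduplication
            user
        })
        .collect()
}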
[L]oading
fn load_data(adults: Vec<User>) -> Result<(), Box<dyn Error>> {
    // Create (or overwrite) the output file and write pretty-printed JSON.
    let out_file = File::create("../data/adults.json")?;
    serde_json::to_writer_pretty(out_file, &adults)?;
    Ok(())
}
Thanks to serde_json, we can easily write our vector of users into a JSON file in a pretty-printed format.
The program will create the file if it doesn’t already exist, so there’s no need to create it manually.
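If you’d rather verify the load step programmatically than by eye, serde_json can read the file straight back into the same struct. A minimal sketch (verify_load is my own helper name):

use std::error::Error;
use std::fs::File;

fn verify_load() -> Result<(), Box<dyn Error>> {
    let file = File::open("../data/adults.json")?;
    // Deserialize the JSON array back into the same Vec<User> we wrote out.
    let adults: Vec<User> = serde_json::from_reader(file)?;
    println!("Read back {} adults from disk", adults.len());
    Ok(())
}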
Wrapping It Up
fn run_etl() -> Result<(), Box<dyn Error>> {
    let users = extract_data()?;
    let adults = transform_data(users);
    load_data(adults)?;
    println!("ETL complete!");
    Ok(())
}
Now we’ve created another function that ties the E, T, and L steps together. We’re ready to write the main function and test it out.
Testing
fn main() {
    if let Err(err) = run_etl() {
        eprintln!("ETL failed: {}", err);
    }
}
// ETL complete!
You should see “ETL complete!” if your files are located correctly. Otherwise, you’ll see an error in the format “ETL failed: reason”.
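One small refinement worth considering: if the pipeline runs under a scheduler like cron, you may want the process to exit with a non-zero status on failure so the scheduler can detect it. A minimal sketch:

use std::process;

fn main() {
    if let Err(err) = run_etl() {
        eprintln!("ETL failed: {}", err);
        process::exit(1); // signal failure to the calling process
    }
}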
Now that we have a JSON file, let’s check it out.
[
  {
    "name": "Yuki",
    "age": 21,
    "email": "Yuki@example.com"
  },
  {
    "name": "Alex",
    "age": 34,
    "email": "Alex@example.com"
  }
]
The output is well-formed, pretty-printed JSON without any issues. April was removed due to her age, and everything works exactly as expected.
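Since transform_data is a pure function, it’s also easy to unit test with Rust’s built-in test harness. A minimal sketch, runnable with cargo test:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn filters_out_minors() {
        let users = vec![
            User { name: "Yuki".into(), age: 21, email: "Yuki@example.com".into() },
            User { name: "April".into(), age: 13, email: "April@example.com".into() },
        ];
        let adults = transform_data(users);
        assert_eq!(adults.len(), 1);
        assert_eq!(adults[0].name, "Yuki");
    }
}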
Conclusion
Working with Rust is really easy once you get into it, just like any other language. Personally, I find Rust easier to handle than C, and it offers real advantages: performance very close to C’s, but with safety by default.
So, if you need blazingly fast data pipelines, Rust can be a solid choice.
If you’re already developing an application in Rust (like a backend) and want to apply machine learning or data processing, it’s definitely worth using Rust over Python.
If you like having full control over your code, with the ability to dive into details and customize everything, Rust is probably a good choice.
So light your fire and do it now!