Loading, Shuffling, and Splitting Datasets in Rust

In the previous article, I created a simple linear regression model, but I skipped the dataset splitting part. One of the readers pointed out that it’s crucial, and I shouldn’t have skipped it, even for that example.

So now, I’m preparing this support article to address that request. Later, I’ll connect this article to the previous one to cover all the crucial steps of machine learning modeling in Rust.

In this article, we will focus on three main steps: loading a dataset (again, I’ll create one myself, as I’m not using a real-life dataset here either), shuffling, and splitting the data into training and testing datasets.

Setting Up The Project

$ cargo new split
$ cd split
$ nvim Cargo.toml

[dependencies]
serde = { version = "1.0", features = ["derive"] }
rand = "0.8"
csv = "1.1"

$ mkdir data
$ mv ~/Downloads/data.csv data/

Create a new project, then edit your Cargo.toml file to add the necessary dependencies. Next, create a data folder and move your data.csv into that folder. And there you go, your project structure is ready.
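
Your project should now look like this:

split/
├── Cargo.toml
├── data/
│   └── data.csv
└── src/
    └── main.rs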

Define the Record Struct

We need a struct to represent the rows in our dataset. Each field in the struct will match a column in the CSV file.

For example, let’s say we’re tracking workers’ hours, so our struct might look like this:

use serde::Deserialize;

#[derive(Debug, Deserialize, Clone)]
struct Record {
    worker_name: String,
    how_many_hours_worked: f64,
}

As you may have noticed, I derived Debug, Deserialize, and Clone on the struct. These derive macros just make it easier to work with: Debug lets us print records, Deserialize lets serde parse them out of the CSV rows, and Clone lets us duplicate records when needed.
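
To see what those derives actually buy us, here's a throwaway sketch (the demo function is purely illustrative, not part of the project, and assumes the Record struct above):

fn demo() {
    let r = Record {
        worker_name: "Alice".to_string(),
        how_many_hours_worked: 40.5,
    };
    let copy = r.clone();                 // Clone: duplicate a record
    println!("{:?} and {:?}", r, copy);   // Debug: print records with {:?}
    // Deserialize has no visible effect here; csv exercises it later,
    // when it parses each CSV row into a Record.
}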

I’ve created a sample dataset with random values, which you can check below:

worker_name,how_many_hours_worked
Alice,40.5
Bob,35.0
Charlie,42.0
Diana,38.5
Eve,45.0
Frank,37.5
Grace,41.0
Hannah,34.0
Ian,39.0
Jack,43.5

Load the CSV Data

Now, let’s write a function to load the CSV file. This will give us a vector of Record structs, which we can work with.

use std::error::Error;
use csv::ReaderBuilder;
use serde::de::DeserializeOwned;

fn load_csv<T: DeserializeOwned>(file_path: &str) -> Result<Vec<T>, Box<dyn Error>> {
    let mut rdr = ReaderBuilder::new().has_headers(true).from_path(file_path)?;
    let mut records = Vec::new();
    
    for result in rdr.deserialize() {
        let record: T = result?;
        records.push(record);
    }
    
    println!("Loaded {} records.", records.len());  
    Ok(records)
}

This function reads the CSV file and returns the rows as a vector of whatever type the caller asks for; in our case, Record. The rdr.deserialize() call takes care of converting each CSV row into that type via serde.
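
Since load_csv is generic over T, the caller decides what the row type is, either with a type annotation or with the turbofish syntax. A hypothetical snippet (inside any function that returns a Result, so ? works):

// Type inferred from the annotation:
let records: Vec<Record> = load_csv("data/data.csv")?;

// Or spelled out explicitly with the turbofish:
let records = load_csv::<Record>("data/data.csv")?;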

Shuffle and Split the Dataset

Next, let’s create a function that shuffles the data and splits it into training and testing sets. We’ll use a 70-30 split, with 70% of the data going to training and 30% to testing.

But stop! Why are we doing this? As I mentioned in the overfitting article, a model can memorize its training data and then score suspiciously well on data it has already seen. That's why it's important to test on unseen data: it shows whether your model is actually learning or just memorizing.

And for that, you don't need another survey to collect more data. You can simply shuffle your existing dataset and split it, using one part for training and the other for testing.

use rand::seq::SliceRandom;

fn split_dataset(mut data: Vec<Record>, ratio: f64) -> (Vec<Record>, Vec<Record>) {
    let mut rng = rand::thread_rng();

    // Shuffle in place; we own `data`, so there's no need to clone it first.
    data.shuffle(&mut rng);

    let split_index = (data.len() as f64 * ratio) as usize;
    let (train_slice, test_slice) = data.split_at(split_index);

    (train_slice.to_vec(), test_slice.to_vec())
}

So, we shuffle the data with rand::seq::SliceRandom::shuffle. That’s the part where we randomize the order of the dataset, making sure the model doesn’t get used to any specific order of the data.

Once the data is shuffled, we split it according to the ratio we’ve set. For example, we use 70% for training and 30% for testing. It’s not rocket science, just basic splitting to make sure the model gets to learn from some data and test its performance on unseen data.

After splitting, split_at hands us two borrowed slices: one for training and one for testing. Since we want to return owned values, we convert those slices (train_slice and test_slice) into Vec<Record> with to_vec(). That's it!
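
One practical note: thread_rng gives a different shuffle on every run. If you want reproducible splits, say, to compare two models on exactly the same test set, you can swap in a seeded RNG. A minimal sketch, assuming the same Record type and the rand 0.8 crate from our Cargo.toml:

use rand::rngs::StdRng;
use rand::seq::SliceRandom;
use rand::SeedableRng;

fn split_dataset_seeded(mut data: Vec<Record>, ratio: f64, seed: u64) -> (Vec<Record>, Vec<Record>) {
    // The same seed always produces the same shuffle,
    // so train/test membership is stable across runs.
    let mut rng = StdRng::seed_from_u64(seed);
    data.shuffle(&mut rng);

    let split_index = (data.len() as f64 * ratio) as usize;
    let (train_slice, test_slice) = data.split_at(split_index);

    (train_slice.to_vec(), test_slice.to_vec())
}

The logic is identical to split_dataset; only the source of randomness changes.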

Main Function

Now, let’s gather all the functions in main and test them.

fn main() -> Result<(), Box<dyn Error>> {
    let data: Vec<Record> = load_csv("data/data.csv")?;
    let (train_set, test_set) = split_dataset(data, 0.7);
    
    println!("Training set size: {}", train_set.len());
    println!("Testing set size: {}", test_set.len());
    
    Ok(())
}
$ cargo run
Loaded 10 records.
Training set size: 7
Testing set size: 3

Easy peasy! Loading, shuffling, and splitting are done. Now you can use your single dataset for both training and testing.

Conclusion

You’ve now got a basic setup for loading a CSV, shuffling the data, and splitting it into training and testing sets using Rust.

It's not as quick to write as the pandas equivalent in Python, but it's definitely doable, and you get the performance and control that Rust brings to the table.

Give Rust a chance! This is my second time writing an article about Rust, and I’m still having way more fun than with Python. It actually gives you control over everything.

