Writing SQL For Data Engineering

SQL is still the most widely used query language. Many people use it for ad hoc analysis tasks such as looking up a specific user in a database, exporting rows to Excel, or grouping data by category.

However, in data engineering, SQL is not just a tool for data analysis; it plays a key role in building automated data pipelines and developing the underlying infrastructure.

If you’re working with a machine learning model, the data that feeds it arrives through a pipeline that a data engineer built. Of course, there are many ways to develop data pipelines, and SQL isn’t the only option. In this article, though, we’ll focus on SQL because it’s such a widely adopted approach in the industry.

Query Once vs. Run Daily

There are two ways to use SQL. The most common is to write a query, get the results you need, and move on.

However, if you’re building data engineering solutions, your SQL queries need to run on a schedule, day after day, without breaking.

That means your SQL shouldn’t create duplicates when it’s re-run; it should be readable (for easy maintenance); and it should be simple to test and resilient enough to run smoothly on a daily or hourly basis.
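Re-run safety (idempotency) is worth illustrating. One common pattern is to delete the target day’s rows before inserting them, so running the job twice produces the same result. Here’s a minimal sketch, assuming a hypothetical daily_revenue summary table loaded from an orders table:

-- Hypothetical idempotent daily load: safe to re-run for the same day.
-- First remove any rows already loaded for yesterday...
DELETE FROM daily_revenue
WHERE revenue_date = CURRENT_DATE - INTERVAL '1 day';

-- ...then insert a fresh copy of yesterday's aggregates.
INSERT INTO daily_revenue (revenue_date, total_revenue)
SELECT
    order_date AS revenue_date,
    SUM(order_amount) AS total_revenue
FROM orders
WHERE order_date = CURRENT_DATE - INTERVAL '1 day'
GROUP BY order_date;

If the job fails halfway and is retried, or is accidentally triggered twice, the table ends up in the same state either way.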

Using CTEs for Structure

For one-off queries, it’s absolutely fine to cram everything into a single statement; once we have the results, we’ll throw the query away anyway.

But when working with SQL queries that need to run regularly, using CTEs to break the query into steps is helpful for maintaining the code and improving readability.

WITH MonthlySales AS (
    -- Total sales per category over the past month
    SELECT category_id, SUM(sale_amount) AS total_sales
    FROM sales
    WHERE sale_date >= CURRENT_DATE - INTERVAL '1 month'
    GROUP BY category_id
)
SELECT
    c.category_id,
    c.category_name,
    m.total_sales
FROM MonthlySales m
JOIN categories c ON m.category_id = c.category_id
ORDER BY m.total_sales DESC;

This SQL query calculates the total sales for each product category in the last month. It first creates a temporary result (CTE) called MonthlySales that sums the sales amount for each category within the past month.

Then, it joins this data with the categories table to get the category names. Finally, the results are ordered by total sales, from highest to lowest, to show which categories performed best.

This approach makes the query easier to read and maintain by breaking it into clear steps with a CTE.

It also makes it easy to adjust the logic for different time frames or categories. Plus, you can reference the CTE multiple times in the same query, which keeps things neat and avoids repetition.
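To see what reusing a CTE looks like, here’s a minimal sketch that references the MonthlySales CTE from above twice: once for the per-category totals and once to compute the average across all categories:

-- Reuse one CTE in two places: per-category totals and the
-- overall average, without repeating the aggregation logic.
WITH MonthlySales AS (
    SELECT category_id, SUM(sale_amount) AS total_sales
    FROM sales
    WHERE sale_date >= CURRENT_DATE - INTERVAL '1 month'
    GROUP BY category_id
)
SELECT
    m.category_id,
    m.total_sales,
    m.total_sales - a.avg_total AS diff_from_avg
FROM MonthlySales m
CROSS JOIN (
    SELECT AVG(total_sales) AS avg_total
    FROM MonthlySales
) AS a;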

Window Functions

Window functions allow you to perform calculations across sets of rows that are related to the current row, without collapsing the dataset.

For data engineers, they’re essential tools for tasks like ranking, deduplication, change tracking, and time-based comparisons.

Functions like ROW_NUMBER, RANK, and DENSE_RANK are commonly used to sort and rank rows within partitions, which makes them great for picking top records or removing duplicates. LAG and LEAD are your go-to tools for comparing values between rows, such as checking how a value has changed since the previous event.

WITH session_analysis AS (
  SELECT
    user_id,
    session_id,
    session_start,
    session_duration,

    -- Assign row numbers by session recency
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY session_start DESC
    ) AS row_num,

    -- Rank sessions by duration
    RANK() OVER (
      PARTITION BY user_id
      ORDER BY session_duration DESC
    ) AS duration_rank,

    -- Dense rank to avoid gaps in ranking
    DENSE_RANK() OVER (
      PARTITION BY user_id
      ORDER BY session_duration DESC
    ) AS duration_dense_rank,

    -- Compare duration to previous session
    LAG(session_duration) OVER (
      PARTITION BY user_id
      ORDER BY session_start
    ) AS previous_duration,

    -- Compare duration to next session
    LEAD(session_duration) OVER (
      PARTITION BY user_id
      ORDER BY session_start
    ) AS next_duration

  FROM user_sessions
)
SELECT *
FROM session_analysis
WHERE row_num = 1; 

This query looks at user sessions and helps you see important details about each user’s activity.

It picks out the most recent session for every user, shows how each session ranks based on how long it lasted, and lets you compare the current session’s length to the previous and next ones.

In the end, it only shows the latest session per user, so you get a clear picture of their most recent behavior along with helpful info about their other sessions.

Window functions are essential for data engineers because they let you analyze and compare rows within groups without losing detail.

They simplify complex queries, making your data pipelines more efficient and easier to maintain.
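Deduplication deserves its own example, since it comes up in almost every ingestion pipeline. Here’s a minimal sketch, assuming a hypothetical raw_events table where the same event_id can occasionally be ingested twice:

-- Keep only the most recently ingested copy of each event_id.
WITH ranked_events AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY event_id
      ORDER BY ingested_at DESC
    ) AS copy_num
  FROM raw_events
)
SELECT *
FROM ranked_events
WHERE copy_num = 1;

It’s the same ROW_NUMBER pattern as in the session example, applied to a different key: a messy duplicate problem becomes one readable query.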

Building Efficient and Scalable Data Pipelines

When your tables get too large to process all at once, it’s important to build your pipelines to handle new data incrementally.

Instead of reprocessing everything every time, design your logic to automatically detect and process only new or changed rows, making your workflows faster, more efficient, and easier to maintain as data keeps growing.

SELECT *
FROM orders
WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE)
  AND order_date <= CURRENT_DATE;

This query selects only the orders from the beginning of the current month up to today. By focusing on recent data, you avoid reprocessing old records and keep your data pipeline efficient as your data grows.
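A fixed calendar window works well for time-partitioned jobs, but you can also detect new rows explicitly by tracking a high-water mark. Here’s a minimal sketch, assuming a hypothetical orders_processed target table and an updated_at column on orders:

-- High-water-mark pattern: load only rows newer than anything
-- already present in the target table.
INSERT INTO orders_processed
SELECT *
FROM orders
WHERE updated_at > (
    SELECT COALESCE(MAX(updated_at), TIMESTAMP '1970-01-01')
    FROM orders_processed
);

The COALESCE fallback handles the very first run, when the target table is still empty.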

Modular and Maintainable SQL

Tools like dbt have totally changed the game when it comes to writing and managing SQL.

With dbt, you can version-control your SQL models in Git, run automated tests with assertions, generate documentation straight from your project, and deploy your transformations just like software code.
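A dbt model is just a SELECT statement saved in its own file, with references to other models resolved by dbt at build time. Here’s a minimal sketch, assuming hypothetical upstream models named stg_sales and stg_categories:

-- models/monthly_category_sales.sql (hypothetical dbt model)
-- dbt resolves ref() to the actual schema and table, and uses it
-- to build the dependency graph between models.
SELECT
    c.category_id,
    c.category_name,
    SUM(s.sale_amount) AS total_sales
FROM {{ ref('stg_sales') }} AS s
JOIN {{ ref('stg_categories') }} AS c
    ON s.category_id = c.category_id
WHERE s.sale_date >= CURRENT_DATE - INTERVAL '1 month'
GROUP BY c.category_id, c.category_name

Running dbt run builds the model as a table or view, and dbt test executes any assertions you’ve defined against it.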

Conclusion

SQL remains a foundational skill for data engineers. Beyond simple analysis, it’s the backbone of automated, reliable data pipelines.

By writing clean, modular queries using CTEs and window functions, and by processing data incrementally, you can build scalable pipelines that support advanced analytics and machine learning.

Mastering these SQL best practices will make your data workflows more efficient, maintainable, and future-proof.
