find missing dates with DuckDB

code
duckdb
Published

July 26, 2024

Background

Recently, the business feedback that there is data missing, and it is necessary to locate the missing dates in order to supplement historical data.

After research, it was decided to use the Gantt chart to display the daily execution of all tasks.

DuckDB

Considering that the task data already exists in Postgres, and visualization is needed, DuckDB has a Postgres plugin that can directly obtain data Therefore, choose to use DuckDB with Jupyter Notebook to quickly verify the logic.

Actual operation

The part of DuckDB operating Postgres can refer to PostgreSQL Extension

Omit the irrelevant code, the code logic is divided into several steps

1. Get data

Here use mock data

Code
import duckdb

conn = duckdb.connect()

mock_data_sql = """
SELECT * FROM (
    VALUES 
      (1, DATE '2024-07-01'),
      (1, DATE '2024-07-02'),
      (1, DATE '2024-07-03'),
      (1, DATE '2024-07-05'),
      (1, DATE '2024-07-06'),
      (2, DATE '2024-07-01'),
      (2, DATE '2024-07-02'),
      (2, DATE '2024-07-03'),
      (3, DATE '2024-07-01'),
      (3, DATE '2024-07-02'),
      (3, DATE '2024-07-03'),
      (3, DATE '2024-07-04'),
      (3, DATE '2024-07-05'),
      (3, DATE '2024-07-06'),
      (3, DATE '2024-07-07'),
      (3, DATE '2024-07-08'),
      (3, DATE '2024-07-09'),
      (4, DATE '2024-07-05'),
      (4, DATE '2024-07-06'),
      (4, DATE '2024-07-07'),
      (4, DATE '2024-07-08'),
      (4, DATE '2024-07-09'),
  ) AS t(source_id, end_date)
"""
df = conn.execute(mock_data_sql).df()
df
source_id end_date
0 1 2024-07-01
1 1 2024-07-02
2 1 2024-07-03
3 1 2024-07-05
4 1 2024-07-06
5 2 2024-07-01
6 2 2024-07-02
7 2 2024-07-03
8 3 2024-07-01
9 3 2024-07-02
10 3 2024-07-03
11 3 2024-07-04
12 3 2024-07-05
13 3 2024-07-06
14 3 2024-07-07
15 3 2024-07-08
16 3 2024-07-09
17 4 2024-07-05
18 4 2024-07-06
19 4 2024-07-07
20 4 2024-07-08
21 4 2024-07-09

2. Group by date

The main difficulty in finding missing dates is to group by time continuity, continuous time is placed in the same group, so if a source_id has multiple time periods, it means that there is a time gap.

Use window functions to process the time of the current row, and get the grouping time of the current row based on the time difference. The code is as follows:

group_date_sql = """
 SELECT 
    source_id,
    end_date,
    end_date - INTERVAL (ROW_NUMBER() OVER (PARTITION BY source_id ORDER BY end_date) - 1) DAY AS group_date
  FROM df
  order by source_id, end_date
"""
grouped_date_df = conn.execute(group_date_sql).df()
grouped_date_df
source_id end_date group_date
0 1 2024-07-01 2024-07-01
1 1 2024-07-02 2024-07-01
2 1 2024-07-03 2024-07-01
3 1 2024-07-05 2024-07-02
4 1 2024-07-06 2024-07-02
5 2 2024-07-01 2024-07-01
6 2 2024-07-02 2024-07-01
7 2 2024-07-03 2024-07-01
8 3 2024-07-01 2024-07-01
9 3 2024-07-02 2024-07-01
10 3 2024-07-03 2024-07-01
11 3 2024-07-04 2024-07-01
12 3 2024-07-05 2024-07-01
13 3 2024-07-06 2024-07-01
14 3 2024-07-07 2024-07-01
15 3 2024-07-08 2024-07-01
16 3 2024-07-09 2024-07-01
17 4 2024-07-05 2024-07-05
18 4 2024-07-06 2024-07-05
19 4 2024-07-07 2024-07-05
20 4 2024-07-08 2024-07-05
21 4 2024-07-09 2024-07-05

3. Group by source_id and group date

group_sql = """
SELECT 
    source_id,
    MIN(end_date) AS start_date,
    MAX(end_date) AS end_date,
  FROM grouped_date_df
  GROUP BY source_id, group_date
  ORDER BY source_id, group_date
"""
grouped_df = conn.execute(group_sql).df()
grouped_df
source_id start_date end_date
0 1 2024-07-01 2024-07-03
1 1 2024-07-05 2024-07-06
2 2 2024-07-01 2024-07-03
3 3 2024-07-01 2024-07-09
4 4 2024-07-05 2024-07-09

4. Visualization

Code
import plotly.express as px
fig = px.timeline(grouped_df, x_start='start_date', x_end='end_date', y='source_id')
fig.update_yaxes(autorange="reversed")
fig.show()

5. Full code

sql = """
with raw_data as (
  SELECT * FROM (
    VALUES 
      (1, DATE '2024-07-01'),
      (1, DATE '2024-07-02'),
      (1, DATE '2024-07-03'),
      (1, DATE '2024-07-05'),
      (1, DATE '2024-07-06'),
      (2, DATE '2024-07-01'),
      (2, DATE '2024-07-02'),
      (2, DATE '2024-07-03'),
      (3, DATE '2024-07-01'),
      (3, DATE '2024-07-02'),
      (3, DATE '2024-07-03'),
      (3, DATE '2024-07-04'),
      (3, DATE '2024-07-05'),
      (3, DATE '2024-07-06'),
      (3, DATE '2024-07-07'),
      (3, DATE '2024-07-08'),
      (3, DATE '2024-07-09'),
      (4, DATE '2024-07-05'),
      (4, DATE '2024-07-06'),
      (4, DATE '2024-07-07'),
      (4, DATE '2024-07-08'),
      (4, DATE '2024-07-09'),
  ) AS t(source_id, end_date)
), group_date as (
  SELECT 
    source_id,
    end_date,
    end_date - INTERVAL (ROW_NUMBER() OVER (PARTITION BY source_id ORDER BY end_date) - 1) DAY AS group_date
  FROM raw_data
  order by source_id, end_date
), final as (
  SELECT 
    source_id,
    MIN(end_date) AS start_date,
    MAX(end_date) AS end_date,
  FROM grouped_date_df
  GROUP BY source_id, group_date
  ORDER BY source_id, group_date
)
from final
"""
date_df = conn.execute(sql).df()
gap_fig = px.timeline(date_df, x_start='start_date', x_end='end_date', y='source_id')
gap_fig.update_yaxes(autorange="reversed")
gap_fig.show()

Summary

This article mainly introduces how to use DuckDB to find missing dates. Through window functions and grouping functions, missing dates can be quickly located.

PS: Recently, the frequency of using DuckDB in work is getting higher and higher. The reason is that DuckDB is lightweight, fast, and the efficiency is surprisingly high when combined with Jupyter Notebook.