最近一个读者问到了这样的一个问题:
好久没写R代码了,这里写一写吧。
library(tidyverse)
sample_data = tibble('mytime'=c(
lubridate::as_datetime("2022-01-01 12:30:00"),
lubridate::as_datetime("2022-01-01 13:30:00"),
lubridate::as_datetime("2022-01-01 14:00:00"),
lubridate::as_datetime("2022-01-01 15:30:00")
),
'value' = c(1,4,5,8),
'value2' = c(3,4,5,6),
'value3' = c(20, 20, 30, 40))
sample_data
# # A tibble: 4 × 4
# mytime value value2 value3
#
# 1 2022-01-01 12:30:00 1 3 20
# 2 2022-01-01 13:30:00 4 4 20
# 3 2022-01-01 14:00:00 5 5 30
# 4 2022-01-01 15:30:00 8 6 40
可以看出来上面的数据中:
2022-01-01 12:30
到2022-01-01 15:30
的。13:00
、14:30
、15:00
时刻的数据。思路很简单:
create_time_seq <- function(start_date, end_date, by){
n = lubridate::interval(start = start_date, end = end_date) / by
res <- start_date + by * c(0:n)
return(res)
}
time_seq <- create_time_seq(
start_date = sample_data %>% pull(mytime) %>% min(),
end_date = sample_data %>% pull(mytime) %>% max(),
by = lubridate::minutes(30)
)
time_seq
# [1] "2022-01-01 12:30:00 UTC" "2022-01-01 13:00:00 UTC" "2022-01-01 13:30:00 UTC"
# [4] "2022-01-01 14:00:00 UTC" "2022-01-01 14:30:00 UTC" "2022-01-01 15:00:00 UTC"
# [7] "2022-01-01 15:30:00 UTC"
这里的难点主要是在create_time_seq
函数里面,这个函数的主要功能有:
开始时间
到结束时间
,按照步长为by
的时间节点的个数。最后time_seq
就是按照样本数据的时间范围,和要求的步长,生成了一串符合要求的时间序列列表。
tibble(
'true_time' = time_seq
) %>% left_join(
y=sample_data,
by=c('true_time' = 'mytime')
)
# # A tibble: 7 × 4
# true_time value value2 value3
#
# 1 2022-01-01 12:30:00 1 3 20
# 2 2022-01-01 13:00:00 NA NA NA
# 3 2022-01-01 13:30:00 4 4 20
# 4 2022-01-01 14:00:00 5 5 30
# 5 2022-01-01 14:30:00 NA NA NA
# 6 2022-01-01 15:00:00 NA NA NA
# 7 2022-01-01 15:30:00 8 6 40
把列表放到数据框里面,然后巧妙的使用left_join函数,对两个表进行关联,从而对缺失的节点直接进行填充。
lubridate
包来处理。我就知道你不信,我给你写过代码就知道了
# part1
from datetime import datetime,timedelta
import pandas as pd
# part 2
sample_date = pd.DataFrame({'mytime':[datetime.strptime(i,'%Y-%m-%d %X') for i in ['2022-01-01 12:30:00', '2022-01-01 13:30:00', '2022-01-01 14:00:00', '2022-01-01 15:30:00']],
'value1':[1,4,5,8], 'value2':[3,4,5,6], 'value3':[20, 20, 30, 40]})
sample_date
# mytime value1 value2 value3
# 0 2022-01-01 12:30:00 1 3 20
# 1 2022-01-01 13:30:00 4 4 20
# 2 2022-01-01 14:00:00 5 5 30
# 3 2022-01-01 15:30:00 8 6 40
# part 3
def create_time_seq(start_date, end_date, by):
n = (end_date - start_date ) / by
res = [start_date + by * i for i in range(int(n+1))]
return res
time_seq = create_time_seq(start_date=sample_date['mytime'].min(),
end_date=sample_date['mytime'].max(),
by=timedelta(minutes=30))
time_seq
# [Timestamp('2022-01-01 12:30:00'),
# Timestamp('2022-01-01 13:00:00'),
# Timestamp('2022-01-01 13:30:00'),
# Timestamp('2022-01-01 14:00:00'),
# Timestamp('2022-01-01 14:30:00'),
# Timestamp('2022-01-01 15:00:00'),
# Timestamp('2022-01-01 15:30:00')]
# part 4
pd.DataFrame({'true_time':time_seq}).pipe(
lambda x: x.merge(
right=sample_date,
how='left',
left_on=['true_time'],
right_on=['mytime']
)
).pipe(
lambda x: x.drop(columns=['mytime'])
)
# true_time value1 value2 value3
# 0 2022-01-01 12:30:00 1.0 3.0 20.0
# 1 2022-01-01 13:00:00 NaN NaN NaN
# 2 2022-01-01 13:30:00 4.0 4.0 20.0
# 3 2022-01-01 14:00:00 5.0 5.0 30.0
# 4 2022-01-01 14:30:00 NaN NaN NaN
# 5 2022-01-01 15:00:00 NaN NaN NaN
# 6 2022-01-01 15:30:00 8.0 6.0 40.0
list