pandas
可以说是 python
中数据处理的中流砥柱,不会点 pandas
,你都不敢说自己了解 python
。pandas
是数据处理神器,时间数据处理自然也是不在话下,今天咱们就来聊一聊 pandas
处理时间数据的应用。
我们可以从两个维度来描述时间,一种是时间点或者说时间时刻,一种是时间长度。而时间长度又包括时间差和时间段。
时间点数据处理
时间点就是指某一时间,比如说当前时间,当前时间戳,今天。时间点相关的问题场景经常是:今天日期是什么?现在的时间是多少?今天是周几?今天的本年第几天?等
pd.Timestamp
系列函数专门用于处理时间点数据。
01:将各种时间格式转换为标准时间格式 yyyy-MM-dd HH:mm:ss
代码语言:javascript复制import pandas as pd
print(pd.Timestamp(year=2022, month=9, day=16, hour=9, minute=21, second=35))
print(pd.Timestamp('2022-09-16 9:21:35'))
print(pd.Timestamp('Sep 16, 2022 9:21:35'))
# output:
2022-09-16 09:21:35
2022-09-16 09:21:35
2022-09-16 09:21:35
需要说明的是 pd.Timestamp
是纳秒级别的,如果将一个整数或者浮点数传给 pd.Timestamp
,将返回自 unix
元年(1970-01-01)之后多少纳秒的时间点。
print(pd.Timestamp(9000000000))
print(pd.Timestamp(9000009000))
# output:
1970-01-01 00:00:09
1970-01-01 00:00:09.000009
02:获取当前时间
代码语言:javascript复制import pandas as pd
print(pd.Timestamp.now())
# output:
2022-09-18 08:11:10.797284
03:获取当前日期
代码语言:javascript复制import pandas as pd
print(pd.Timestamp.now().date())
# output:
2022-09-18
04:通过日期元素获取标准日期
代码语言:javascript复制import pandas as pd
print(pd.Timestamp(2022, 9, 18).date())
# output:
2022-09-18
05:通过日期、时间元素获取标准日期时间
代码语言:javascript复制import pandas as pd
print(pd.Timestamp(2022, 9, 18, 12, 30, 46))
print(pd.Timestamp(year=2022, month=9, day=18, hour=12, minute=30, second=46))
# output:
2022-09-18 12:30:46
2022-09-18 12:30:46
06:通过日期时间元素获取标准时间
代码语言:javascript复制import pandas as pd
print(pd.Timestamp(2022, 9, 18, 12, 30, 46).time())
print(pd.Timestamp(year=2022, month=9, day=18, hour=12, minute=30, second=46).time())
# output:
12:30:46
12:30:46
07:通过时间戳获取本地时区时间
代码语言:javascript复制import pandas as pd
print(pd.Timestamp.fromtimestamp(1663340762))
# output:
# 北京时间,东八区
2022-09-16 23:06:02
08:通过时间戳获取UTC时间
代码语言:javascript复制import pandas as pd
print(pd.Timestamp.utcfromtimestamp(1663340762))
# output:
2022-09-16 15:06:02
09:通过时间戳获取本地时区日期
代码语言:javascript复制import pandas as pd
print(pd.Timestamp.fromtimestamp(1663340762).date())
# output:
2022-09-16
10:通过 pd.Timestamp 属性获取日期时间元素
代码语言:javascript复制import pandas as pd
now = pd.Timestamp.now()
print("Current year: ", now.year)
print("Current quarter: ", now.quarter)
print("Current month: ", now.month)
print("Current week num: ", now.week)
print("Current week num: ", now.weekofyear)
print("Current day: ", now.day)
print("Current day_of_year: ", now.day_of_year)
print("Current day of year: ", now.dayofyear)
print("Current day_of_week: ", now.day_of_week) # [0-6]
print("Current day of week: ", now.dayofweek) # [0-6]
print("Current hour: ", now.hour)
print("Current minute: ", now.minute)
print("Current second: ", now.second)
print("Current microsecond: ", now.microsecond)
print("Current days_in_month: ", now.days_in_month) # [28-31]
print("Current days in month: ", now.daysinmonth) # [28-31]
print("is leap year: ", now.is_leap_year) # [True|False]
# output:
Current year: 2022
Current quarter: 3
Current month: 9
Current week num: 37
Current week num: 37
Current day: 18
Current day_of_year: 261
Current day of year: 261
Current day_of_week: 6
Current day of week: 6
Current hour: 22
Current minute: 17
Current second: 39
Current microsecond: 37986
Current days_in_month: 30
Current days in month: 30
is leap year: False
11:通过 pd.Timestamp 方法获取日期时间元素
代码语言:javascript复制import pandas as pd
now = pd.Timestamp.now()
print("Current month_name: ", now.month_name())
print("Current weekday: ", now.weekday()) # [0-6]
print("Current week_day_name: ", now.day_name())
print("Current date: ", now.date())
print("Current ctime: ", now.ctime())
print("Current time: ", now.time())
print("Current timestamp: ", now.timestamp())
print("Current timetuple: ", now.timetuple())
print("Current time: ", now.timetz())
print("Current time zone: ", now.tzname()) # 不设置就没有
# output:
Current month_name: September
Current weekday: 6
Current week_day_name: Sunday
Current date: 2022-09-18
Current ctime: Sun Sep 18 23:04:28 2022
Current time: 23:04:28.491830
Current timestamp: 1663542268.49183
Current timetuple: time.struct_time(tm_year=2022, tm_mon=9, tm_mday=18, tm_hour=23, tm_min=4, tm_sec=28, tm_wday=6, tm_yday=261, tm_isdst=-1)
Current time: 23:04:28.491830
Current time zone: None
12:日期时间转换为字符串格式 strftime( )
代码语言:javascript复制import pandas as pd
now = pd.Timestamp.now()
t = now.strftime("%H:%M:%S")
print("time:", t)
s1 = now.strftime("%Y/%m/%d, %H:%M:%S")
# Y/mm/dd H:M:S format
print("s1:", s1)
s2 = now.strftime("%d/%m/%Y, %H:%M:%S")
# dd/mm/YY H:M:S format
print("s2:", s2)
# output:
time: 23:07:47
s1: 2022/09/18, 23:07:47
s2: 18/09/2022, 23:07:47
上面用到的 %Y、%m、%d、%H
等是格式代码。strftime()
方法接收一个或多个格式代码,并根据它返回一个格式化的字符串。
常见日期时间格式代码如下(datetime format codes
):
%Y - full year [0001, 0002, ..., 2018, 2019,..., 9999]
%y - short year [01, 02, ..., 18, 19,..., 99]
%-y - decimal short year [1, 2, ..., 18, 19,..., 99]
%m - zero-padded decimal month [01, 02, ..., 11, 12]
%-m - decimal month [1, 2, ..., 11, 12]
%d - zero-padded decimal day [01, 02, ..., 30, 31]
%-d - decimal day [1, 2, ..., 30, 31]
%H - 24-hour [00, 01, ..., 22, 23]
%-H - decimal 24-hour [0, 1, ..., 22, 23]
%I - 12-hour [01, 02, ..., 11, 12]
%-I - decimal 12-hour [1, 2, ..., 11, 12]
%M - minute [00, 01, ..., 58, 59]
%-M - decimal minute [0, 1, ..., 58, 59]
%S - second [00, 01, ..., 58, 59]
%-S - decimal second [0, 1, ..., 58, 59]
%j - day number [001, 002, ..., 365, 366]
%-j - decimal day number [1, 2, ..., 365, 366]
%f - Microsecond [000000, 000001, ..., 999998, 999999]
%B - full month name [January, February, ..., November, December]
%b - short month name [Jan, Feb, ..., Nov, Dec]
%A - full weekday name [Monday, Tuesday, ..., Saturday, Sunday]
%a - short weekday name [Mon, Tue, ..., Sat, Sun]
%W - year week number [00, 01, ..., 52, 53](星期一为星期的开始)
%U - year week number [00, 01, ..., 52, 53](星期天为星期的开始)
%w - week number [0, 0, ..., 5, 6](星期天为星期的开始)
%c - Locale’s appropriate date and time representation [Sat Sep 17 14:43:06 2022]
%X - local time display as %H:%M:%S
%x - local date display as %m/%d/%y
%Z - zone [CST, UST, GMT](默认没有时区,设置之后才可以展示)
%z - UTC offset [ 0800, 1200, ..., -0800, -1200](默认没有时区,设置之后才可以展示)
%p - AM OR PM [AM, PM]
%% - just % character
13:字符串格式转换为日期时间 to_datetime( )
代码语言:javascript复制import pandas as pd
# strptime( ) 未实现
date_string = "18 September, 2022"
print("date_string =", date_string)
date_object = pd.to_datetime(date_string, format="%d %B, %Y")
print("date_object =", date_object)
# output:
date_string = 18 September, 2022
date_object = 2022-09-18 00:00:00
14:给时间设置时区并展示
代码语言:javascript复制import pandas as pd
import pytz
local = pd.Timestamp.now()
print("Local:", local.strftime("%m/%d/%Y, %H:%M:%S"))
tz_NY = pytz.timezone('America/New_York')
datetime_NY = pd.Timestamp.now(tz_NY)
print("datetime_NY:", datetime_NY.strftime("%m/%d/%Y, %H:%M:%S, timezone:%Z"))
tz_London = pytz.timezone('Europe/London')
datetime_London = pd.Timestamp.now(tz_London)
print("datetime_London:", datetime_London.strftime("%m/%d/%Y, %H:%M:%S, timezone:%Z"))
tz_Shanghai = pytz.timezone('Asia/Shanghai')
datetime_Shanghai = pd.Timestamp.now(tz_Shanghai)
print("datetime_Shanghai:", datetime_Shanghai.strftime("%m/%d/%Y, %H:%M:%S, timezone:%Z"))
# output:
Local: 09/18/2022, 23:22:57
datetime_NY: 09/18/2022, 11:22:57, timezone:EDT
datetime_London: 09/18/2022, 16:22:57, timezone:BST
datetime_Shanghai: 09/18/2022, 23:22:57, timezone:CST
时间差数据处理
时间差是指两个时间点在时间上的差异,是两个时间比较之后的结果。可以用不同的时间单位表示,例如,天、小时、分钟、秒。它们既可以是正数,也可以是负数。
pd.Timedelta
系列函数专门用于处理时间差数据。
15、通过字符串获取时间差
代码语言:javascript复制import pandas as pd
print(pd.Timedelta('1 days 1 hours 18 minutes 27 seconds'))
# output:
1 days 01:18:27
16、通过整数获取时间差
代码语言:javascript复制import pandas as pd
print(pd.Timedelta(8,unit='h')) # 8 hours
# output:
0 days 08:00:00
17、通过数据偏移量构造时间差
代码语言:javascript复制import pandas as pd
print(pd.Timedelta(days=3)) # 3 days
# output:
3 days 00:00:00
18、获取时间差的各项元素值
代码语言:javascript复制import pandas as pd
# given datetimes
now = pd.Timestamp.now()
date_time = pd.Timestamp(year=2022, month=8, day=31, hour=12, minute=30)
# Calculate the difference between two datetimes
timedelta = now - date_time
# get a components namedtuple-like object
print(timedelta.components)
# get timedelta elements
print("timedelta days: ",timedelta.components.days)
print("timedelta hours: ",timedelta.components.hours)
print("timedelta minutes: ",timedelta.components.minutes)
print("timedelta seconds: ",timedelta.components.seconds)
print("timedelta milliseconds:",timedelta.components.milliseconds)
print("timedelta microseconds:",timedelta.components.microseconds)
print("timedelta nanoseconds :",timedelta.components.nanoseconds)
# output:
Components(days=19, hours=10, minutes=13, seconds=18, milliseconds=798, microseconds=228, nanoseconds=0)
timedelta days: 19
timedelta hours: 10
timedelta minutes: 13
timedelta seconds: 18
timedelta milliseconds: 798
timedelta microseconds: 228
timedelta nanoseconds : 0
19、获取两个日期的时间差,单位是天
代码语言:javascript复制import pandas as pd
ts1 = pd.Timestamp(2022, 9, 11)
ts2 = pd.Timestamp(2022, 9, 18)
diff_days = (ts2 - ts1).days
print(diff_days)
# output:
7
20、计算两个时间的时间差,单位是秒
代码语言:javascript复制import pandas as pd
ts1 = pd.Timestamp(2022, 9, 18, 12, 30, 46)
ts2 = pd.Timestamp(2022, 9, 18, 13, 31, 46)
diff_seconds = (ts2 - ts1).seconds
print(diff_seconds)
# output:
3660
21、计算未来时间
代码语言:javascript复制import pandas as pd
now = pd.Timestamp.now()
print('Given Time:', now)
# Three weeks later
future_time = now pd.Timedelta(3, unit='w')
print('Future Time:', future_time)
# output:
Given Time: 2022-09-19 23:12:06.468015
Future Time: 2022-10-10 23:12:06.468015
22、比较两个时间段大小
代码语言:javascript复制import pandas as pd
# Creating two timedelta objects
d1 = pd.Timedelta(hours=57)
d2 = pd.Timedelta(hours=25, seconds=2)
# not equal check
print(d2 != d1) # True
# check if timedelta are same
print(d2 == 25) # False
# Compare two TimeDelta
print(d1 > d2)
# output:
True
False
True
23、将 pd.Timedelta 转换为字符串格式
代码语言:javascript复制import pandas as pd
ts = pd.Timedelta(days=34, minutes=7, seconds=64)
# str() constructor
print(str(ts))
# __str__() self
print(str(ts.__str__()))
# output:
34 days 00:08:04
34 days 00:08:04
24、计算日期所在月份的第一天
pandas
尚未提供 计算日期所在月份的第一天
的内置方法,因此,需要我们自己实现。既然是第一天,那就是把日期中的 day
元素调整为 1
就好了,具体来说,有如下三种实现方法:
方法一、将 day 元素替换为 1
代码语言:javascript复制import pandas as pd
given_date = pd.Timestamp.today().date()
first_day_of_month = given_date.replace(day=1)
print("First day of month: ", first_day_of_month)
# output:
First day of month: 2022-09-01
方法二、将当前日期减去 (day 元素值 -1)
比如今天是 7 号,减去 6 天,就能得到 1 号了。
代码语言:javascript复制import pandas as pd
given_date = pd.Timestamp.today().date()
first_day_of_month = given_date - pd.Timedelta(days = given_date.day-1)
print("First day of month: ", first_day_of_month)
# output:
First day of month: 2022-09-01
方法三、使用 strftime( ) 方法获取
代码语言:javascript复制import pandas as pd
given_date = pd.Timestamp.today().date()
first_day_of_month = given_date.strftime("%Y-%m-01")
print("First day of month: ", first_day_of_month)
# output:
First day of month: 2022-09-01
25、计算日期所在月份的最后一天
很不幸,pandas
也尚未提供计算日期所在月份的最后一天
的内置方法,月末计算比月初计算还要复杂,因为每个月天数是不一样的,有 28、29、30、31
四种可能,这该怎么办呢?
有一种快捷的方式是:先计算出次月的第一天,然后,用这个日期减去 1 天,这样,就得到了当月的最后一天,那次月的第一天又怎么算呢?当月第一天在加上一个月就可以了。跟示例 24 对应,这里也有三种方法。
但是,pd.Timedelta
不支持月粒度时间处理,这里只能借助其他模块实现
方法一、使用 calendar 实现
代码语言:javascript复制import pandas as pd
import calendar
given_date = pd.Timestamp.today().date()
last_day_of_month = given_date.replace(day = calendar.monthrange(given_date.year, given_date.month)[1])
print("Last day of month: ", last_day_of_month)
# output:
Last day of month: 2022-09-30
方法二、使用 dateutil.relativedelta() 实现
代码语言:javascript复制import pandas as pd
from dateutil.relativedelta import relativedelta
given_date = pd.Timestamp.today().date()
last_day_of_month = (given_date relativedelta(day=31))
print("Last day of month: ", last_day_of_month)
# output:
Last day of month: 2022-09-30
时间段数据处理
时间段表示一段时间持续的长度,比如一年、一个月、一天,与时间差类似,但又存在区别。因为它不是两个事件比较之后的结果。
pd.Period
系列函数专门用于处理时间段数据。
先看一个示例
代码语言:javascript复制import pandas as pd
period = pd.Period('2022-9-24', freq='W')
这里,pd.Period
返回的是 2022-09-24
所在时间周期 周
的这个时间段,也就是 2022-09-19 ~ 2022-09-25
,它表示的是一个时间范围,W
就是周的别名,时间周期 freq
别名释义表如下:
Alias | Description |
---|---|
B | business day frequency |
C | custom business day frequency |
D | calendar day frequency |
W | weekly frequency |
M | month end frequency |
SM | semi-month end frequency (15th and end of month) |
BM | business month end frequency |
CBM | custom business month end frequency |
MS | month start frequency |
SMS | semi-month start frequency (1st and 15th) |
BMS | business month start frequency |
CBMS | custom business month start frequency |
Q | quarter end frequency |
BQ | business quarter end frequency |
QS | quarter start frequency |
BQS | business quarter start frequency |
A, Y | year end frequency |
BA, BY | business year end frequency |
AS, YS | year start frequency |
BAS, BYS | business year start frequency |
BH | business hour frequency |
H | hourly frequency |
T, min | minutely frequency |
S | secondly frequency |
L, ms | milliseconds |
U, us | microseconds |
N | nanoseconds |
26、获取时间段各项时间元素值
代码语言:javascript复制import pandas as pd
period = pd.Period('2022-9-24 12:12:12', freq='M')
print("period.year: ", period.year)
print("period.quarter: ", period.quarter)
print("period.month: ", period.month)
print("period.week: ", period.week)
print("period.day: ", period.day)
print("period.hour: ", period.hour)
print("period.minute: ", period.minute)
print("period.second: ", period.second)
print("period.is_leap_year: ", period.is_leap_year)
print("period.day_of_year: ", period.day_of_year)
print("period.dayofyear: ", period.dayofyear)
print("period.days_in_month:", period.days_in_month)
print("period.daysinmonth: ", period.daysinmonth)
print("period.day_of_week: ", period.day_of_week) # [0-6]
print("period.dayofweek: ", period.dayofweek) # [0-6]
print("period.weekday: ", period.weekday) # [0-6]
print("period.weekofyear: ", period.weekofyear)
print("period.qyear: ", period.qyear)
print("period.start_time: ", period.start_time)
print("period.end_time: ", period.end_time)
print("period.freqstr: ", period.freqstr)
print("period.freq: ", period.freq)
# output:
period.year: 2022
period.quarter: 3
period.month: 9
period.week: 38
period.day: 25
period.hour: 0
period.minute: 0
period.second: 0
period.is_leap_year: False
period.day_of_year: 268
period.dayofyear: 268
period.days_in_month: 30
period.daysinmonth: 30
period.day_of_week: 6
period.dayofweek: 6
period.weekday: 6
period.weekofyear: 38
period.qyear: 2022
period.start_time: 2022-09-19 00:00:00
period.end_time: 2022-09-25 23:59:59.999999999
period.freqstr: W-SUN # 周粒度,周日结束
period.freq: <Week: weekday=6>
27、使用 asfreq 改变时间段的周期
pd.Period.asfreq( )
方法有两个参数,freq
是所需时间周期,how
表示使用转换后时间段的开始还是结束
import pandas as pd
period = pd.Period('2022-9-24 12:12:12', freq='W')
print("period.start_time: ", period.start_time)
print("period.end_time: ", period.end_time)
period_start = period.asfreq(freq='D', how='start') # how='S'
print("period_start:", period_start)
period_end = period.asfreq(freq='D', how='end') # how='E', default 'end'
print("period_end: ", period_end)
# output:
period.start_time: 2022-09-19 00:00:00
period.end_time: 2022-09-25 23:59:59.999999999
period_start: 2022-09-19
period_end: 2022-09-25
28、获取当前时间的所在时间段
代码语言:javascript复制import pandas as pd
period = pd.Period.now(freq='M')
print("period: ", period)
# output:
period: 2022-09
29、格式化时间段字符串
时间格式编码含义请参考示例 12
代码语言:javascript复制import pandas as pd
period1 = pd.Period(freq='Q-JUL', year=2006, quarter=1)
print(period1.strftime('%F-Q%q'))
# Output the last month in the quarter of this date
print(period1.strftime('%b-%Y'))
period2 = pd.Period(freq='D', year=2001, month=1, day=1)
print(period2.strftime('%d-%b-%Y'))
print(period2.strftime('%b. %d, %Y was a %A'))
# output:
2006-Q1
Oct-2005
01-Jan-2001
Jan. 01, 2001 was a Monday
30、获取时间周期某时间戳的标准时间格式字符串
pd.Period.to_timestamp( )
方法有两个参数,freq
是所需时间周期(如果原时间周期大于等于周,则为 D
否则为 S
),how
表示使用转换后时间段的开始还是结束。
import pandas as pd
prd1 = pd.Period('2022-9-24 12:12:12', freq='W')
print(prd1)
prd2 = prd1.to_timestamp(freq='D', how='start')
print(prd2)
prd3 = pd.Period('2022-9-24 12:00:00', freq='4H')
print(prd3)
prd4 = prd3.to_timestamp(freq='S', how='end')
print(prd4)
# output:
2022-09-19/2022-09-25
2022-09-19 00:00:00
2022-09-24 12:00
2022-09-24 15:59:59.999999999
批量数据转换
上边介绍的都是单个数据转换,下面通过一个示例介绍一下如何进行批量数据的转换。
假设,有如下样式的数据集:
strdate |
---|
2022/9/1 |
2022/9/2 |
2022/9/12 |
2022/9/22 |
2022/12/1 |
2022/12/23 |
计划调整为如下格式:(获取日期所在月份的第一天)
strdate |
---|
2022/9/1 |
2022/9/1 |
2022/9/1 |
2022/9/1 |
2022/12/1 |
2022/12/1 |
利用 pandas
对某列数据进行调整,经常使用 apply
方法,需要首先定义一个函数,用于编写单个数据的处理逻辑,根据示例 24 中介绍的方法,可以使用如下自定义函数实现。
import pandas as pd
def get_firstday(time):
return time.replace(day = 1).strftime('%Y/%m/%d').replace('/0','/')
df=pd.read_csv('./str2date.txt')
print("df1:n", df)
# convert column strdate to datetime
df['strdate'] = pd.to_datetime(df['strdate'],format = '%Y/%m/%d')
print("df2:n", df)
# apply get_firstday function to column strdate
df['strdate'] = df['strdate'].apply(get_firstday)
print("df3:n", df)
# output:
df1:
strdate
0 2022/9/1
1 2022/9/2
2 2022/9/12
3 2022/9/22
4 2022/12/1
5 2022/12/23
df2:
strdate
0 2022-09-01
1 2022-09-02
2 2022-09-12
3 2022-09-22
4 2022-12-01
5 2022-12-23
df3:
strdate
0 2022/9/1
1 2022/9/1
2 2022/9/1
3 2022/9/1
4 2022/12/1
5 2022/12/1
如上仅为示例,其他类型的时间数据转换,只需修改自定义函数即可。