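I have a CSV file of customer purchases, in no particular order, that I read into a Pandas DataFrame. I want to add a column for each purchase showing how much time has passed since that customer's previous purchase. I'm not sure where the differences are coming from, but they are far too large (even if they are in seconds).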
CSV:
Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015
Python:
import pandas as pd
import time
start = time.time()
data = pd.read_csv('data.csv', low_memory=False)
data = data.sort_values(by=['Customer Id', 'Purchase Date'])
data['Purchase Date'] = pd.to_datetime(data['Purchase Date'])
data['Purchase Difference'] = (data.groupby(['Customer Id'])['Purchase Date']
                               .diff()
                               .fillna('-')
                               )
print(data)
Output:
Customer Id Purchase Date Purchase Difference
3 2322 2015-01-01 -
5 2322 2015-02-01 2678400000000000
4 2322 2015-03-01 2419200000000000
0 4543 2015-01-01 -
1 4543 2015-02-05 3024000000000000
2 4543 2015-03-15 3283200000000000
Desired output:
Customer Id Purchase Date Purchase Difference
3 2322 2015-01-01 -
5 2322 2015-02-01 31 days
4 2322 2015-03-01 28 days
0 4543 2015-01-01 -
1 4543 2015-02-05 35 days
2 4543 2015-03-15 38 days
Solution:
Once the Purchase Date column has been converted to timestamps, you can apply diff to it within each customer group.
# Parse the dates first so that sorting and diff() work on timestamps, not strings.
df['Purchase Date'] = pd.to_datetime(df['Purchase Date'])
df.sort_values(['Customer Id', 'Purchase Date'], inplace=True)
# diff() gives NaT for each customer's first purchase; NaT and gaps of one day or less become ''.
df['Purchase Difference'] = [
    str(n.days) + ' days' if n > pd.Timedelta(days=1) else ''
    for n in df.groupby('Customer Id', sort=False)['Purchase Date'].diff()]
>>> df
Customer Id Purchase Date Purchase Difference
3 2322 2015-01-01
5 2322 2015-02-01 31 days
4 2322 2015-03-01 28 days
0 4543 2015-01-01
1 4543 2015-02-05 35 days
2 4543 2015-03-15 38 days
6 4543 2015-03-15
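The large numbers in the question look like the same gaps expressed in nanoseconds (31 days is 2,678,400 seconds, or 2678400000000000 ns), presumably because fillna('-') pushes the column out of its timedelta dtype. As a rough sketch of an alternative that avoids string formatting altogether, the gap can be kept numeric with `.dt.days`. The column name `Days Since Last Purchase` and the inlined CSV text are assumptions made for illustration; they are not part of the original answer.

```python
import io

import pandas as pd

# Same rows as the CSV in the question, inlined so the sketch is self-contained.
csv_text = """Customer Id,Purchase Date
4543,1/1/2015
4543,2/5/2015
4543,3/15/2015
2322,1/1/2015
2322,3/1/2015
2322,2/1/2015
"""

df = pd.read_csv(io.StringIO(csv_text))

# Parse before sorting; sorting the raw strings would order them as text.
df['Purchase Date'] = pd.to_datetime(df['Purchase Date'], format='%m/%d/%Y')
df = df.sort_values(['Customer Id', 'Purchase Date'])

# Per-customer gap as a timedelta column; each customer's first purchase is NaT.
gap = df.groupby('Customer Id')['Purchase Date'].diff()

# Whole days as numbers (NaN where there is no previous purchase).
# 'Days Since Last Purchase' is a hypothetical column name.
df['Days Since Last Purchase'] = gap.dt.days

print(df)
```

Leaving the missing values as NaN instead of '-' keeps the column numeric, so it can still be summed, averaged, or filtered; the display string can be produced at print time if needed.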