I have the following pandas transaction dataset for a retail store:
print(df)
product Date Assistant_name
product_1 2017-01-02 11:45:00 John
product_2 2017-01-02 11:45:00 John
product_3 2017-01-02 11:55:00 Mark
...
I want to create the following dataset for Market Basket Analysis:
product Date Assistant_name Invoice_number
product_1 2017-01-02 11:45:00 John 1
product_2 2017-01-02 11:45:00 John 1
product_3 2017-01-02 11:55:00 Mark 2
...
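In short, I assume that rows sharing the same Assistant_name and Date belong to one invoice, and that each new combination of the two starts a new Invoice_number.
Solution:
The simplest approach is to call factorize on the two columns joined together: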
df['Invoice'] = pd.factorize(df['Date'].astype(str) + df['Assistant_name'])[0] + 1
print (df)
product Date Assistant_name Invoice
0 product_1 2017-01-02 11:45:00 John 1
1 product_2 2017-01-02 11:45:00 John 1
2 product_3 2017-01-02 11:55:00 Mark 2
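For readers who want to reproduce the result, here is a minimal self-contained sketch of the same factorize idea. The three-row frame is rebuilt from the sample shown in the question, and the "|" separator is an assumption added here (it is not in the original one-liner) so that the joined key stays unambiguous.

```python
import pandas as pd

# Sample rows rebuilt from the question's example data.
df = pd.DataFrame({
    "product": ["product_1", "product_2", "product_3"],
    "Date": pd.to_datetime([
        "2017-01-02 11:45:00",
        "2017-01-02 11:45:00",
        "2017-01-02 11:55:00",
    ]),
    "Assistant_name": ["John", "John", "Mark"],
})

# Build one string key per row; the "|" separator is an added assumption.
key = df["Date"].astype(str) + "|" + df["Assistant_name"]

# factorize returns integer codes in order of first appearance (starting at 0)
# plus the array of unique keys; adding 1 makes invoice numbers start at 1.
codes, uniques = pd.factorize(key)
df["Invoice"] = codes + 1

print(df)
```

Because factorize numbers keys in order of first appearance, the invoice numbers follow the row order of the frame rather than the sorted order of the keys.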
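If performance matters, use pd.lib.fast_zip (pd.lib is exposed in older pandas releases; newer releases dropped that alias and keep the function only in the private module pandas._libs.lib):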
df['Invoice']=pd.factorize(pd.lib.fast_zip([df.Date.values, df.Assistant_name.values]))[0]+1
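As an alternative that stays on the public API, the same numbering can be sketched with groupby and ngroup. This is not the method used in the answer above, and it is not included in the timings, so no speed claim is made for it; the four-row frame is illustrative data invented for this sketch.

```python
import pandas as pd

# Illustrative rows (invented for this sketch), deliberately not sorted by Date.
df = pd.DataFrame({
    "product": ["product_3", "product_1", "product_2", "product_4"],
    "Date": pd.to_datetime([
        "2017-01-02 11:55:00",
        "2017-01-02 11:45:00",
        "2017-01-02 11:45:00",
        "2017-01-02 11:55:00",
    ]),
    "Assistant_name": ["Mark", "John", "John", "Mark"],
})

# sort=False numbers the groups in order of first appearance,
# which matches what pd.factorize does; + 1 starts the numbering at 1.
df["Invoice"] = (
    df.groupby(["Date", "Assistant_name"], sort=False).ngroup() + 1
)

print(df)
```

With the default sort=True the groups would instead be numbered by sorted key, so the invoice numbers could differ from the factorize result on unsorted data.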
Timings:
#[30000 rows x 3 columns]
df = pd.concat([df] * 10000, ignore_index=True)
In [178]: %%timeit
...: df['Invoice'] = list(zip(df['Date'], df['Assistant_name']))
...: df['Invoice'] = df['Invoice'].astype('category').cat.codes + 1
...:
9.16 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [179]: %%timeit
...: df['Invoice'] = pd.factorize(df['Date'].astype(str) + df['Assistant_name'])[0] + 1
...:
11.2 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [180]: %%timeit
...: df['Invoice'] = pd.factorize(pd.lib.fast_zip([df.Date.values, df.Assistant_name.values]))[0] + 1
...:
6.27 ms ± 93.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)