InteractorA InteractorB
AGAP028204 AGAP005846
AGAP028204 AGAP003428
AGAP028200 AGAP011124
AGAP028200 AGAP004335
AGAP028200 AGAP011356
AGAP028194 AGAP008414
我正在使用Pandas,我想要删除两次出现的行,但是就像下面这样……
InteractorA InteractorB
AGAP002741 AGAP008026
AGAP008026 AGAP002741
对此……
InteractorA InteractorB
AGAP002741 AGAP008026
因为它们对于所有意图和目的都是一样的.
是否有内置的方法来处理这个?
解决方法:
我最终创建了一个hacky脚本,它遍历行和必要的数据片段,并检查是否出现连接或是否出现反向,并根据需要删除行索引.
import pandas as pd
checklist = []
indexes_to_drop = []
interactions = pd.read_csv('original_interactions.txt', delimiter = '\t')
for index, row in interactions.iterrows():
check_string = row['InteractorA'] + row['InteractorB']
check_string_rev = row['InteractorB'] + row['InteractorA']
if (check_string or check_string_rev) in checklist:
indexes_to_drop.append(index)
else:
pass
checklist.append(check_string)
checklist.append(check_string_rev)
no_dups = interactions.drop(interactions.index[indexes_to_drop])
print no_dups.shape
no_dups.to_csv('no_duplicates.txt',sep='\t',index = False)
2017年编辑:几年后,有了更多的经验,对于寻找类似东西的人来说,这是一个更优雅的解决方案:
In [8]: df
Out[8]:
InteractorA InteractorB
0 AGAP028204 AGAP005846
1 AGAP028204 AGAP003428
2 AGAP028200 AGAP011124
3 AGAP028200 AGAP004335
4 AGAP028200 AGAP011356
5 AGAP028194 AGAP008414
6 AGAP002741 AGAP008026
7 AGAP008026 AGAP002741
In [18]: df['check_string'] = df.apply(lambda row: ''.join(sorted([row['InteractorA'], row['InteractorB']])), axis=1)
In [19]: df
Out[19]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
7 AGAP008026 AGAP002741 AGAP002741AGAP008026
In [20]: df.drop_duplicates('check_string')
Out[20]:
InteractorA InteractorB check_string
0 AGAP028204 AGAP005846 AGAP005846AGAP028204
1 AGAP028204 AGAP003428 AGAP003428AGAP028204
2 AGAP028200 AGAP011124 AGAP011124AGAP028200
3 AGAP028200 AGAP004335 AGAP004335AGAP028200
4 AGAP028200 AGAP011356 AGAP011356AGAP028200
5 AGAP028194 AGAP008414 AGAP008414AGAP028194
6 AGAP002741 AGAP008026 AGAP002741AGAP008026
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 [email protected] 举报,一经查实,本站将立刻删除。