
pyspark: RDD: groupByKey(), reduceByKey()

1. Creating an RDD with parallelize():

words = sc.parallelize([("hadoop", 1), ("is", 1), ("good", 1),
                        ("spark", 1), ("is", 1), ("fast", 1),
                        ("spark", 1), ("is", 1), ("better", 1)])
wordsres1 = words.groupByKey()
wordsres1.collect()

2. Result of groupByKey():

[('hadoop', <pyspark.resultiterable.ResultIterable at 0x7effad99feb8>),
 ('is', <pyspark.resultiterable.ResultIterable at 0x7effad99f828>),
 ('good', <pyspark.resultiterable.ResultIterable at 0x7effad99fcf8>),
 ('spark', <pyspark.resultiterable.ResultIterable at 0x7effad99fda0>),
 ('fast', <pyspark.resultiterable.ResultIterable at 0x7effad99fbe0>),
 ('better', <pyspark.resultiterable.ResultIterable at 0x7effad99fd68>)]

groupByKey() groups the original RDD data by key; conceptually the result is as follows (see the sketch after this listing for how to display the grouped values):

("hadoop",1)

("is",(1,1,1))

("spark",(1,1)

("good",1)

("fast",1)

("better",1)

3. Result of reduceByKey():

wordsres2 = words.reduceByKey(lambda a,b:a+b)
wordsres2.collect()


# Result:
[('hadoop', 1),
 ('is', 3),
 ('good', 1),
 ('spark', 2),
 ('fast', 1),
 ('better', 1)]

reduceByKey() aggregates the values of the original RDD by key (here, summing the counts for each word); the result is as follows:

("hadoop",1)("is",3)
("spark",2)("good",1)
("fast",1)("better",1)
