
Spark execution optimization: uploading dependencies to HDFS (using spark.yarn.jars and spark.yarn.archive)

1. Overview

When a Spark application is submitted in YARN mode without spark.yarn.archive or spark.yarn.jars configured, the client logs the warning "Neither spark.yarn.jars nor spark.yarn.archive is set" and then uploads the local Spark jars to HDFS one by one, as shown below. This step can be very time-consuming. Setting spark.yarn.archive or spark.yarn.jars in spark-defaults.conf shortens the application's startup time considerably.

 Will allocate AM container, with 896 MB memory including 384 MB overhead
2020-12-01 11:16:11 INFO  Client:54 - Setting up container launch context for our AM
2020-12-01 11:16:11 INFO  Client:54 - Setting up the launch environment for our AM container
2020-12-01 11:16:11 INFO  Client:54 - Preparing resources for our AM container
2020-12-01 11:16:12 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2020-12-01 11:16:14 INFO  Client:54 - Uploading resource file:/tmp/spark-897c6291-e0bd-47e6-8d42-7f67225c4819/__spark_libs__5294834939010995385.zip -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606792499194_0001/__spark_libs__5294834939010995385.zip
2020-12-01 11:16:18 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/wordcount.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606792499194_0001/wordcount.jar
2020-12-01 11:16:18 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/zookeeper-3.4.6.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606792499194_0001/zookeeper-3.4.6.jar
2020-12-01 11:16:18 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xz-1.0.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606792499194_0001/xz-1.0.jar
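For context, a submission along the following lines produces a log like the one above; the jar path matches the logs, while the main class name is only a hypothetical stand-in. Without either setting configured, the client re-zips and re-uploads the contents of $SPARK_HOME/jars on every submit.

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.WordCount \
  /home/workspace/wordcount/wordcount.jar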

2. What the official Spark documentation says about these two settings

spark.yarn.jars — List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.

spark.yarn.archive — An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.

3. Using spark.yarn.jars

3.1 Upload all the jars under the Spark home's jars directory to HDFS

hadoop fs -mkdir -p /spark-yarn/jars
hadoop fs -put /opt/module/spark-2.3.2-bin-hadoop2.7/jars/* /spark-yarn/jars/
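To confirm the jars landed where the configuration in the next step expects them, list the directory:

hadoop fs -ls /spark-yarn/jars | head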

3.2 Modify spark-defaults.conf

spark.yarn.jars hdfs://hadoop122:9000/spark-yarn/jars/*.jar
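The same setting can also be supplied per job on the command line rather than globally; for example (class name and jar path are illustrative, and the glob is quoted so the shell does not expand it):

spark-submit \
  --master yarn \
  --conf "spark.yarn.jars=hdfs://hadoop122:9000/spark-yarn/jars/*.jar" \
  --class com.example.WordCount \
  /home/workspace/wordcount/wordcount.jar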

3.3 Result

2020-12-01 13:53:52 INFO  Client:54 - Setting up container launch context for our AM
2020-12-01 13:53:52 INFO  Client:54 - Setting up the launch environment for our AM container
2020-12-01 13:53:52 INFO  Client:54 - Preparing resources for our AM container
2020-12-01 13:53:53 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/JavaEWAH-0.3.2.jar
2020-12-01 13:53:53 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/jars/RoaringBitmap-0.5.11.jar

3.4 Possible errors

ERROR client.TransportClient: Failed to send RPC

Caused by: java.io.IOException: Failed to send RPC 5353749227723805834 to /192.168.10.122:58244: java.nio.channels.ClosedChannelException
	at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
	at io.netty.util.concurrent.DefaultPromise.access$000(DefaultPromise.java:34)
	at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:431)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

This closed-channel exception looks like a timeout at first glance, but since the workaround below disables YARN's memory checks, the channel is typically closed because the NodeManager killed the container for exceeding its physical or virtual memory limit. The same error can also appear when running spark-shell --master yarn-client. Adding the following to yarn-site.xml (and restarting the NodeManagers) resolves it:

<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
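Disabling these checks is cluster-wide and can hide genuine memory problems. If the container really is being killed for exceeding its memory limit, an alternative worth trying is to give the executor JVMs more off-heap headroom instead (property name as of Spark 2.3; the value is in MiB; class name and jar path are illustrative):

spark-submit \
  --master yarn \
  --conf spark.executor.memoryOverhead=1024 \
  --class com.example.WordCount \
  /home/workspace/wordcount/wordcount.jar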

4. Using spark.yarn.archive

4.1 Zip all the jars under the Spark home's jars directory and upload the archive to HDFS

When packaging, make sure that all the jars sit at the root of the zip archive; you can verify this with the listing shown after the commands below.

cd /opt/module/spark-2.3.2-bin-hadoop2.7/jars/
zip -q -r spark_jars_2.3.2.zip *
hadoop fs -mkdir /spark-yarn/zip
hadoop fs -put spark_jars_2.3.2.zip /spark-yarn/zip/
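You can double-check that everything sits at the archive root by listing the zip's contents; entries should be bare jar names with no directory prefix (output abridged and illustrative):

unzip -l spark_jars_2.3.2.zip
# entries should look like:
#   spark-core_2.11-2.3.2.jar
#   spark-yarn_2.11-2.3.2.jar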

4.2 Modify spark-defaults.conf

spark.yarn.archive hdfs://hadoop122:9000/spark-yarn/zip/spark_jars_2.3.2.zip

4.3 Result

2020-12-01 14:41:53 INFO  Client:54 - Setting up container launch context for our AM
2020-12-01 14:41:53 INFO  Client:54 - Setting up the launch environment for our AM container
2020-12-01 14:41:53 INFO  Client:54 - Preparing resources for our AM container
2020-12-01 14:41:54 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/zip/spark_jars_2.3.2.zip
2020-12-01 14:41:54 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/wordcount.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/wordcount.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/zstd-jni-1.3.2-2.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/zstd-jni-1.3.2-2.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/zookeeper-3.4.6.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/zookeeper-3.4.6.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xz-1.0.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/xz-1.0.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xmlenc-0.52.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/xmlenc-0.52.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xml-apis-1.3.04.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/xml-apis-1.3.04.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xercesImpl-2.9.1.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/xercesImpl-2.9.1.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/xbean-asm5-shaded-4.4.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/xbean-asm5-shaded-4.4.jar
2020-12-01 14:41:55 INFO  Client:54 - Uploading resource file:/home/workspace/wordcount/lib/spark-core_2.11-2.3.2.jar -> hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/spark-core_2.11-2.3.2.jar

4.4 Possible errors

The application's driver log shows:

Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster

If the archive is packaged as follows instead, the directory hierarchy is preserved inside the zip, and the error above is the result, as the listing below illustrates:

zip -q -r spark_jars_2.3.2.zip /opt/module/spark-2.3.2-bin-hadoop2.7/jars/*

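Listing such an archive makes the problem visible: every entry keeps the full directory prefix instead of sitting at the root, so YARN never finds the Spark jars (output abridged and illustrative):

unzip -l spark_jars_2.3.2.zip
# entries look like:
#   opt/module/spark-2.3.2-bin-hadoop2.7/jars/spark-core_2.11-2.3.2.jar
#   opt/module/spark-2.3.2-bin-hadoop2.7/jars/spark-yarn_2.11-2.3.2.jar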

5. Comparison

According to the official documentation, if both parameters are configured, the application uses the path set by spark.yarn.archive; in other words, spark.yarn.archive takes precedence over spark.yarn.jars.

With either setting, submitting an application prints the message

Source and destination file systems are the same. Not copying ...

The difference is that with spark.yarn.jars, every jar already on HDFS gets its own

Source and destination file systems are the same. Not copying

line, whereas with spark.yarn.archive there is only the single line

Source and destination file systems are the same. Not copying hdfs://hadoop122:9000/spark-yarn/zip/spark_jars_2.3.2.zip

and the jars that follow are still uploaded from the local machine; see the log in section 4.3.
So how does spark.yarn.archive speed up file distribution?
Or should it be understood as follows:
both approaches speed up the distribution of dependencies, and spark.yarn.jars additionally skips the local upload for jars that are already on HDFS?
Note, though, that the jars still being uploaded in 4.3 all come from /home/workspace/wordcount/lib/, i.e. they are the application's own dependencies rather than Spark's; neither setting covers those, and the Spark framework jars are all inside the single cached archive.
Readers who know the details are welcome to discuss.
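One way to check this empirically is to inspect an application's staging directory while it is running; whatever the client actually uploaded for that submission is visible there (application ID taken from the log in 4.3; the directory is cleaned up when the application finishes):

hadoop fs -ls hdfs://hadoop122:9000/user/root/.sparkStaging/application_1606801972366_0009/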


