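Big Data Study Notes, Lesson 3: Spark Real-Time Computing on Yarn
1. Overview
This lesson builds on the Hadoop cluster set up in the previous two lessons. If you are not familiar with that environment, work through those exercises first.
- "Big Data Study Notes, Lesson 1: Hadoop Fundamentals and Cluster Setup"
- "Big Data Study Notes, Lesson 2: Zookeeper & Kafka Cluster Setup"
- "Big Data Study Notes, Lesson 2 (continued): Collecting nginx Access Logs into a Kafka Cluster with filebeat"
The test program used here is the official Hadoop examples jar, located at the path below (the structure of MapReduce programs and the example source code are outside the scope of this lesson):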
/program/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar
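2. Running a MapReduce program on a single Hadoop node
When a MapReduce job processes only a small amount of data, there is no need to run it on the cluster: starting a cluster job carries extra resource overhead, so the computation ends up slower.
1. First open a terminal on hadoop01 and change to the following directory: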
cd /program/hadoop-3.3.0/bin
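2. Use the following command to list the main functions of the official Hadoop example program, as shown in the figure below:
3. Run the wordcount MapReduce program with the following command to count the words in the Hadoop configuration files and write the result to the output directory: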
./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount file:///program/hadoop-3.3.0/etc/hadoop/* output
Once the job finishes, the figure below shows that it counted the occurrences of every word in every file under file:///program/hadoop-3.3.0/etc/hadoop/.
./hdfs dfs -chmod -R 777 /
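As a rough illustration of what the wordcount job computes, the following Python sketch mimics its map and reduce phases on a few in-memory lines. It is not the Hadoop implementation: the function names `map_words` and `reduce_counts` and the sample lines are made up for this example, and splitting on whitespace is assumed to approximate the tokenization of the official example.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def map_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (word, 1) for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce phase: sum the counts per word and return them sorted by word."""
    totals: Counter = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(sorted(totals.items()))


# Hypothetical sample input standing in for lines of the configuration files.
sample_lines = ["yarn hadoop yarn", "hadoop mapreduce"]
word_counts = reduce_counts(map_words(sample_lines))
for word, count in word_counts.items():
    print(f"{word}\t{count}")
```

In the real job the map and reduce phases run as separate tasks and the framework groups the pairs by word between them; here both phases run in a single process.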
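3. Configuring the Yarn cluster
When the data volume is large, running the computation on the cluster pays off, typically because the computation time far exceeds the time spent on cluster initialization and on other resource allocation and management. To enable the Yarn cluster, configure it with the following steps.
1. First open a terminal on hadoop01 and change to the following directory: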
cd /program/hadoop-3.3.0/bin
2. Edit …/etc/hadoop/mapred-site.xml with vim to enable computation on the Yarn cluster, with the following content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<!-- The default is local: when this is not configured, jobs run locally with multiple threads; yarn enables cluster computation -->
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/program/hadoop-3.3.0</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/program/hadoop-3.3.0</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/program/hadoop-3.3.0</value>
</property>
</configuration>
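Before restarting any services, it can help to confirm that an edited file is still well-formed XML and carries the intended values. The sketch below parses a Hadoop-style configuration document with Python's standard library and returns its properties as a dictionary. The helper name `read_hadoop_properties` is hypothetical, and the XML is a trimmed inline copy of the settings above so that the example runs without a Hadoop installation.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def read_hadoop_properties(xml_text: str) -> Dict[str, str]:
    """Return {name: value} for every <property> in a Hadoop *-site.xml document."""
    # ET.fromstring raises ParseError if the document is not well-formed.
    root = ET.fromstring(xml_text.strip())
    properties = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            properties[name.strip()] = value.strip()
    return properties


# Trimmed inline copy of the mapred-site.xml settings (an assumption for this demo).
MAPRED_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/program/hadoop-3.3.0</value>
  </property>
</configuration>
"""

props = read_hadoop_properties(MAPRED_SITE)
print(props["mapreduce.framework.name"])
```

To check a real file, read its text first and pass it to the same function.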
3. Edit …/etc/hadoop/yarn-site.xml with vim to configure the resourcemanager, with the following content:
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop01</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
The settings above make hadoop01 the Yarn resourcemanager.
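The value of yarn.nodemanager.aux-services is easy to mistype. Hadoop's documentation for this property states that a service name may contain only letters, digits and underscores and must not start with a digit, so the name is spelled mapreduce_shuffle with an underscore. The sketch below applies that rule to a comma-separated value; the function name `invalid_aux_services` and the regular expression are this example's own rendering of the rule, not code taken from Hadoop.

```python
import re
from typing import List

# The documented rule: letters, digits and underscores, not starting with a digit.
AUX_SERVICE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def invalid_aux_services(value: str) -> List[str]:
    """Return the comma-separated entries of `value` that break the naming rule."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    return [name for name in names if not AUX_SERVICE_NAME.fullmatch(name)]


# An underscore passes the check; a hyphen is reported as invalid.
print(invalid_aux_services("mapreduce_shuffle"))
print(invalid_aux_services("mapreduce-shuffle"))
```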
4. Use scp to copy …/etc/hadoop/yarn-site.xml to the hadoop02 and hadoop03 nodes, as shown in the figure below:
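Since the copy step is only shown as a screenshot, here is a hedged sketch of how the scp commands could be assembled. It assumes that Hadoop is installed under /program/hadoop-3.3.0 on every node, that the copy is done as root, and that the file goes to the same path on each target; `build_scp_commands` is a hypothetical helper. The commands are printed rather than executed.

```python
import shlex
from typing import List

# Assumed to be the same absolute path on every node.
CONFIG_FILE = "/program/hadoop-3.3.0/etc/hadoop/yarn-site.xml"
WORKER_HOSTS = ["hadoop02", "hadoop03"]


def build_scp_commands(path: str, hosts: List[str], user: str = "root") -> List[List[str]]:
    """Build one scp argument list per host, copying `path` to the same remote path."""
    return [["scp", path, f"{user}@{host}:{path}"] for host in hosts]


commands = build_scp_commands(CONFIG_FILE, WORKER_HOSTS)
for command in commands:
    # Printed only; pass each list to subprocess.run(command, check=True) to copy.
    print(shlex.join(command))
```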
5. Before starting the Yarn cluster, edit …/sbin/start-yarn.sh and …/sbin/stop-yarn.sh with vim and add the following lines at the top of each file:
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
Otherwise, startup fails with the following errors:
ERROR: Attempting to operate on yarn resourcemanager as root
ERROR: but there is no YARN_RESOURCEMANAGER_USER defined. Aborting operation.
Starting nodemanagers
ERROR: Attempting to operate on yarn nodemanager as root
ERROR: but there is no YARN_NODEMANAGER_USER defined. Aborting operation.
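6. Start the Yarn cluster with …/sbin/start-yarn.sh and check the running processes with jps, as shown in the figures below:
[Figures: jps output on hadoop02 and hadoop03]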
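7. Inspecting the cluster in the Yarn web UI
The graphical management interface of the Yarn cluster is available at http://hadoop01:8088, as shown in the figure below:
The figure above shows that the Yarn cluster has 24 GB of memory and 24 cores in total.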
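The same totals can also be read without the browser, because the ResourceManager exposes a REST endpoint at /ws/v1/cluster/metrics on the same port. The sketch below assumes that the endpoint is reachable at http://hadoop01:8088 and that the response carries the `totalMB` and `totalVirtualCores` fields under `clusterMetrics`. The function names are hypothetical, and the demo runs on an inline sample payload instead of calling the cluster.

```python
import json
import urllib.request
from typing import Tuple


def fetch_cluster_metrics(rm_url: str = "http://hadoop01:8088") -> dict:
    """GET the ResourceManager cluster metrics (needs a running cluster; not called below)."""
    request = urllib.request.Request(
        f"{rm_url}/ws/v1/cluster/metrics",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def summarize_capacity(payload: dict) -> Tuple[float, int]:
    """Return (total memory in GB, total virtual cores) from a metrics payload."""
    metrics = payload["clusterMetrics"]
    return metrics["totalMB"] / 1024, metrics["totalVirtualCores"]


# Hypothetical payload, reduced to the two fields used here and sized to match the totals above.
sample_payload = {"clusterMetrics": {"totalMB": 24576, "totalVirtualCores": 24}}
memory_gb, vcores = summarize_capacity(sample_payload)
print(f"{memory_gb:.0f} GB, {vcores} vcores")
```

On a live cluster, `summarize_capacity(fetch_cluster_metrics())` would report the current values.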
4. Running a MapReduce program on the Hadoop Yarn cluster
5. Configuring Spark
6. Running a MapReduce program on Spark