
Spark Advanced 1: Test Environment Setup

Environment:

  1. OS: Ubuntu 16
  2. JDK: 1.8.0_261-b12
  3. Hadoop: 3.2.2
  4. Spark: 3.1.2

I. Hadoop standalone mode

  1. Download and install

The installation package can be downloaded from the official Apache site: https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz


Then extract it to a directory of your choice; my install directory is:

/home/ffzs/softwares/hadoop-3.2.2

  2. Set up passwordless SSH login

Generate an SSH key pair; skip this step if you already have one (for example, set up earlier for Git):

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Append your public key to the SSH authorized_keys file so that logging in to the local machine over SSH no longer requires a password:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
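
To confirm that passwordless login works, SSH into the local machine; it should open a shell without prompting for a password:

ssh localhost
exit
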
  3. Edit the configuration

First set the Java path: open the hadoop-env.sh file in the hadoop-3.2.2/etc/hadoop directory and set JAVA_HOME:

export JAVA_HOME=/home/ffzs/softwares/jdk1.8.0_261

In the same directory, edit core-site.xml and add the following to set the default HDFS address and the temporary data directory:

<configuration>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>file:/home/ffzs/hadoop/tmp</value>
        </property>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://localhost:9000</value>
        </property>
</configuration>

Edit hdfs-site.xml:

<configuration>
        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:/home/ffzs/hadoop/tmp/dfs/name</value>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:/home/ffzs/hadoop/tmp/dfs/data</value>
        </property>
</configuration>

Edit mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Edit yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

Now initialize HDFS by running the hdfs command in hadoop-3.2.2/bin:

./hdfs namenode -format

If the output reports that the storage directory has been successfully formatted, HDFS has been initialized.


  4. Start the services

Start HDFS:

(base) [~/softwares/hadoop-3.2.2]$ ./sbin/start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [ffzs-ub]

Start YARN:

(base) [~/softwares/hadoop-3.2.2]$ ./sbin/start-yarn.sh 
Starting resourcemanager
Starting nodemanagers

Check the running processes with jps: you should see NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.

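As an optional sanity check (the directory name below is just an example), create a directory in HDFS and list the root; the NameNode web UI should also be reachable at http://localhost:9870 and the YARN ResourceManager UI at http://localhost:8088:

./bin/hdfs dfs -mkdir -p /user/ffzs
./bin/hdfs dfs -ls /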

II. Spark standalone mode

  1. Download and install

Download Spark from the official site: https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz

Extract it to the appropriate directory:

/home/ffzs/softwares/spark-3.1.2-bin-hadoop3.2

  2. Configuration

In the spark-3.1.2-bin-hadoop3.2/conf directory:

cp spark-env.sh.template spark-env.sh

Then add JAVA_HOME and HADOOP_HOME to spark-env.sh:

export JAVA_HOME=/home/ffzs/softwares/jdk1.8.0_261
export HADOOP_HOME=/home/ffzs/softwares/hadoop-3.2.2

  3. Run

Start Spark with start-all.sh in the sbin directory:

(base) [~/softwares/spark-3.1.2-bin-hadoop3.2]$ ./sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /home/ffzs/softwares/spark-3.1.2-bin-hadoop3.2/logs/spark-ffzs-org.apache.spark.deploy.master.Master-1-ffzs-ub.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /home/ffzs/softwares/spark-3.1.2-bin-hadoop3.2/logs/spark-ffzs-org.apache.spark.deploy.worker.Worker-1-ffzs-ub.out

jps now shows the Master and Worker processes.


The Spark master web UI is available at http://localhost:8080/.


You can also run a quick test by launching spark-shell, as sketched below.

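A minimal smoke test (the master URL uses the hostname ffzs-ub shown in the startup log above; adjust it to your own machine) is to connect spark-shell to the standalone master and run a trivial job:

./bin/spark-shell --master spark://ffzs-ub:7077
scala> spark.range(1000).selectExpr("sum(id)").show()   // should print 499500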

III. Hive configuration

MySQL is used as the metastore database; here a MySQL instance is started via Docker, with username root and password 123zxc:

    mysql:
        image: mysql:8
        container_name: mysql
        networks:
            - spring
        restart: always
        ports:
            - 33060:33060
            - 3306:3306
        volumes:
            - ./mysql/db:/var/lib/mysql
            - ./mysql/conf.d:/etc/mysql/conf.d
        environment:
            - MYSQL_ROOT_PASSWORD=123zxc
        command: --default-authentication-plugin=mysql_native_password
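
This service definition is only a fragment of a larger docker-compose.yml (the spring network, for example, must be declared elsewhere in that file). Assuming that file sits in the current directory, the database can be brought up with:

docker-compose up -d mysql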

Then create a hive-site.xml file in the spark-3.1.2-bin-hadoop3.2/conf directory with the content below. Note that the driver class and connection URL differ between MySQL 8 and MySQL 5; I am using MySQL 8:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
        <!-- MySQL 5
        <property>
                <name>javax.jdo.option.ConnectionDriverName</name>
                <value>com.mysql.jdbc.Driver</value>
        </property>
        -->
        <property>
                <name>javax.jdo.option.ConnectionDriverName</name>
                <value>com.mysql.cj.jdbc.Driver</value>
        </property>
        <!-- MySQL 5
        <property>
                <name>javax.jdo.option.ConnectionURL</name>
                <value>jdbc:mysql://ffzs-ub:3306/hive_db?createDatabaseIfNotExist=true</value>
        </property>
        -->
        <property>
                <name>javax.jdo.option.ConnectionURL</name>
                <value>jdbc:mysql://ffzs-ub:3306/hive_db?createDatabaseIfNotExist=true&amp;useSSL=false&amp;serverTimezone=GMT&amp;allowPublicKeyRetrieval=true</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionUserName</name>
                <value>root</value>
        </property>
        <property>
                <name>javax.jdo.option.ConnectionPassword</name>
                <value>123zxc</value>
        </property>
        <property>
                <name>datanucleus.schema.autoCreateAll</name>
                <value>true</value>
        </property>
</configuration>
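
The spark-sql command below expects the MySQL Connector/J jar to be available locally; if you do not already have it, it can be downloaded from Maven Central (the standard repository path for the mysql:mysql-connector-java:8.0.26 coordinates):

wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.26/mysql-connector-java-8.0.26.jar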

Launch the Spark SQL shell, passing the MySQL JDBC driver on the driver classpath and pointing it at the standalone master:

spark-sql --driver-class-path mysql-connector-java-8.0.26.jar --master spark://ffzs-ub:7077

Because MySQL is used as the metastore database, some errors appear when the metastore tables are auto-created, so I create part of the tables manually:

use hive_db;
CREATE TABLE `TBLS`
(
    `TBL_ID` BIGINT NOT NULL,
    `CREATE_TIME` INTEGER NOT NULL,
    `DB_ID` BIGINT NULL,
    `LAST_ACCESS_TIME` INTEGER NOT NULL,
    `OWNER` VARCHAR(767) BINARY NULL,
    `RETENTION` INTEGER NOT NULL,
    `IS_REWRITE_ENABLED` BIT NOT NULL,
    `SD_ID` BIGINT NULL,
    `TBL_NAME` VARCHAR(256) BINARY NULL,
    `TBL_TYPE` VARCHAR(128) BINARY NULL,
    `VIEW_EXPANDED_TEXT` TEXT NULL,
    `VIEW_ORIGINAL_TEXT` TEXT NULL,
    CONSTRAINT `TBLS_PK` PRIMARY KEY (`TBL_ID`)
) ENGINE=INNODB;
CREATE TABLE `COLUMNS_V2`
(
    `CD_ID` BIGINT NOT NULL,
    `COMMENT` VARCHAR(256)  NULL,
    `COLUMN_NAME` VARCHAR(766)  NOT NULL,
    `TYPE_NAME` TEXT  NOT NULL,
    `INTEGER_IDX` INTEGER NOT NULL,
    CONSTRAINT `COLUMNS_PK` PRIMARY KEY (`CD_ID`,`COLUMN_NAME`)
) ENGINE=INNODB;
CREATE TABLE `SERDE_PARAMS`
(
    `SERDE_ID` BIGINT NOT NULL,
    `PARAM_KEY` VARCHAR(256) BINARY NOT NULL,
    `PARAM_VALUE` TEXT BINARY NULL,
    CONSTRAINT `SERDE_PARAMS_PK` PRIMARY KEY (`SERDE_ID`,`PARAM_KEY`)
) ENGINE=INNODB;
CREATE TABLE `TABLE_PARAMS`
(
    `TBL_ID` BIGINT NOT NULL,
    `PARAM_KEY` VARCHAR(256) BINARY NOT NULL,
    `PARAM_VALUE` TEXT BINARY NULL,
    CONSTRAINT `TABLE_PARAMS_PK` PRIMARY KEY (`TBL_ID`,`PARAM_KEY`)
) ENGINE=INNODB;
CREATE TABLE `SD_PARAMS`
(
    `SD_ID` BIGINT NOT NULL,
    `PARAM_KEY` VARCHAR(256) BINARY NOT NULL,
    `PARAM_VALUE` TEXT BINARY NULL,
    CONSTRAINT `SD_PARAMS_PK` PRIMARY KEY (`SD_ID`,`PARAM_KEY`)
) ENGINE=INNODB;
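
With the metastore tables in place, a minimal end-to-end check (the table and column names here are only examples) is to create a table, insert a row and query it back from the spark-sql prompt:

spark-sql> CREATE TABLE test_tb (id INT, name STRING);
spark-sql> INSERT INTO test_tb VALUES (1, 'ffzs');
spark-sql> SELECT * FROM test_tb;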

Once it is running, the corresponding Spark SQL application appears at http://localhost:8080/.

