Building on the Hadoop cluster, we continue by setting up a multi-node Hive.
Make sure all four virtual machines can reach each other.
| caroly01 | caroly02 | caroly03 | caroly04 |
| --- | --- | --- | --- |
| MySQL metadata service | standalone Hive node | metastore server | metastore client |
Start the Hadoop cluster and check that everything is up with the jps command.
Install the relational database (MySQL) on caroly01:
yum install mysql-server -y   // install MySQL via yum
service mysqld start   // start the MySQL service
mysql   // enter MySQL without a password
use mysql;   // switch to the mysql system database
select host,user,password from user;   // list the existing users
grant all privileges on *.* to 'root'@'%' identified by '123' with grant option;   // set root's password to 123 and allow remote access
delete from user where host!='%';   // remove the non-% user entries
flush privileges;   // reload the privilege tables
quit;   // exit
mysql -u root -p123   // log back in to MySQL with the password
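As a quick sanity check (optional, and assuming the mysql client is also available on another node such as caroly02), you can verify that the remote grant works by connecting from that node:
mysql -h caroly01 -u root -p123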
On caroly02, extract the Hive archive into the target directory:
cd ~/software
tar xf apache-hive-1.2.1-bin.tar.gz
mv apache-hive-1.2.1-bin hive
cp -r hive /opt/caroly/
Configure the environment variables (the trailing + below opens the file at its last line):
vi /etc/profile +
export HIVE_HOME=/opt/caroly/hive
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$ZOOKEEPER_HOME/bin:$HIVE_HOME/bin
. /etc/profile
Edit the Hive configuration file:
cd /opt/caroly/hive/conf/
mv hive-default.xml.template hive-site.xml
vi hive-site.xml
Position the cursor on the <configuration> line and run:
:.,$-1d
This clears out everything between the <configuration> tags. Add the following properties in their place, then save and quit (:wq):
<property> <!-- default warehouse location for managed tables -->
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive_remote/warehouse</value>
</property>
<property> <!-- JDBC connection string for the metadata store -->
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://caroly01/hive_remote?createDatabaseIfNotExist=true</value>
</property>
<property> <!-- JDBC driver class name for the metadata store -->
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property> <!-- username -->
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property> <!-- password -->
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123</value>
</property>
Run Hive:
hive
If you get the error: Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
put the MySQL JDBC driver jar into Hive's lib directory.
If you get the error: The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH.
likewise, put the MySQL JDBC driver jar into Hive's lib directory.
If you get the error: java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
cd /opt/caroly/hive/lib
Here you will find a jline-2.12.jar.
cd /opt/caroly/hadoop-2.9.2/share/hadoop/yarn/lib
Hadoop also ships a jline jar here, but an older version; delete the older one and copy Hive's over:
rm -rf jline-*.jar
cp /opt/caroly/hive/lib/jline-2.12.jar /opt/caroly/hadoop-2.9.2/share/hadoop/yarn/lib
In the Hive CLI (on caroly02), create a table:
create table tbl(id int, age int);
show tables;
insert into tbl values(1,1);
select * from tbl;
Open a local browser and visit:
caroly03:8088
You can see that a SQL statement has been executed (the MapReduce job triggered by the insert).
Also visit:
caroly01:50070
Click: Utilities --> Browse the file system
and go to this directory: /user/hive_remote/warehouse
You can see that the data has been uploaded successfully.
The same can be seen in MySQL:
mysql -u root -p123
show databases;
use hive_remote;
show tables;
select * from TBLS;
select * from COLUMNS_V2;
Distribute the hive directory to caroly03 and caroly04:
scp -r /opt/caroly/hive/ caroly03:/opt/caroly/
scp -r /opt/caroly/hive/ caroly04:/opt/caroly/
Configure the environment variables on both caroly03 and caroly04:
vi /etc/profile +
export HIVE_HOME=/opt/caroly/hive
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$ZOOKEEPER_HOME/bin:$HIVE_HOME/bin
. /etc/profile
On caroly03, modify the configuration as follows:
cd /opt/caroly/hive/conf
vi hive-site.xml
Change the first two properties to:
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://caroly01/hive?createDatabaseIfNotExist=true</value>
</property>
On caroly04, delete all of the existing properties and keep only the following two:
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://caroly03:9083</value>
</property>
On caroly03, run:
hive --service metastore
Note: this is a blocking foreground process; it is normal for the terminal to appear stuck.
Open another terminal on caroly03 and run:
ss -nal
You should see a service listening on port 9083, which indicates the metastore has started.
On caroly04, run:
hive
create table psn1(id int);
insert into psn1 values(1);
In MySQL on caroly01 you can now see that a new database named hive has appeared.
Likewise, at caroly01:50070 you can see that the directory /user/hive/warehouse now contains psn1.
On caroly02, run:
hdfs dfs -cat /user/hive_remote/warehouse/tbl/*
The output is: 11
To inspect the delimiter:
hdfs dfs -get /user/hive_remote/warehouse/tbl/*
cat -A 000000_0
The output is: 1^A1$
So the field delimiter is ^A (the control character \001), which is Hive's default.
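In DDL this character can be written as '\001'. Purely for illustration (tbl2 is a hypothetical table name, not part of this walkthrough), the following is equivalent to relying on the default delimiter:
hive> create table tbl2(id int, age int) row format delimited fields terminated by '\001';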
For example, suppose a table contains the following data:
1,张三1,dota-book-movie,beijng:wudaokou-shanghai:pudong
2,张三2,dota-book-movie,beijng:wudaokou-shanghai:pudong
3,张三3,dota-movie,beijng:wudaokou-shanghai:pudong-shenzhen:nanshan
4,张三4,dota-book,beijng:wudaokou-shanghai:pudong
5,张三5,dota-book-movie,shanghai:pudong
6,张三6,dota-book,beijng:wudaokou-shanghai:pudong
Here the first column is of type int, the second string, the third array, and the fourth map.
The CREATE TABLE statement is:
hive> create table psn
(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
Take a look at the structure of the table we just created:
hive> desc formatted psn;
OK
# col_name data_type comment
id int
name string
likes array<string>
address map<string,string>
# Detailed Table Information
Database: default
Owner: root
CreateTime: Sun May 17 21:54:23 CST 2020
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://mycluster/user/hive/warehouse/psn
Table Type: MANAGED_TABLE
Table Parameters:
transient_lastDdlTime 1589723663
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
colelction.delim -
field.delim ,
mapkey.delim :
serialization.format ,
Time taken: 0.168 seconds, Fetched: 32 row(s)
In Storage Desc Params you can see the delimiters we just specified, and Table Type shows MANAGED_TABLE, i.e. a managed (internal) table.
Load some data:
mkdir data
cd data/
vi data
Copy the example data above into the data file.
hive> load data local inpath '/root/data/data' into table psn;
hive> select * from psn;
If you want column names shown in query results, set the following:
hive> set hive.cli.print.header=true;
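Note that set only affects the current session. To make it permanent (an optional tweak, not part of the original setup), the same property can be added to hive-site.xml:
<property>
  <name>hive.cli.print.header</name>
  <value>true</value>
</property>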
Create a /usr directory on HDFS and upload a copy of the data file:
hdfs dfs -mkdir /usr
hdfs dfs -put data /usr/
Create an external table:
create external table psn2
(
id int,
name string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/usr/';
Query the table; you will find it already contains data:
select * from psn2;
Look at the detailed table structure; Table Type is now EXTERNAL_TABLE.
hive> desc formatted psn2;
OK
... ...
Table Type: EXTERNAL_TABLE
... ...
Time taken: 0.117 seconds, Fetched: 38 row(s)
Difference between managed and external tables: dropping a managed (internal) table deletes both the metadata and the data in HDFS, while dropping an external table deletes only the metadata and leaves the data files in place.
Note: Hive checks the schema on read (decoupling data from schema and making data loading fast); relational databases check the schema on write.
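A quick way to see the difference (shown for illustration only; do not run it at this point, since psn is used again further below):
hive> drop table psn;   -- managed: the warehouse directory /user/hive/warehouse/psn is deleted along with the metadata
hive> drop table psn2;  -- external: only the metadata is removed; the files under /usr/ remain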
hive> create table psn3
(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by(age int)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
Look at the detailed table structure:
hive> desc formatted psn3;
OK
# col_name data_type comment
id int
name string
likes array<string>
address map<string,string>
# Partition Information
# col_name data_type comment
age int
... ...
Load data into the partition (loading from the local filesystem copies the file; loading from HDFS moves it):
load data local inpath '/root/data/data' into table psn3 partition(age=10);
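To confirm the load (standard commands, added as an optional check; note that the partition column behaves like an ordinary column in queries):
hive> show partitions psn3;
hive> select id, name from psn3 where age=10;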
hive> create table psn4
(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by(age int,sex string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
Load partition data:
hive> load data local inpath '/root/data/data' into table psn4 partition(age=20,sex='man');
Add a partition directory (the order of the partition columns is not fixed; they are matched by name):
hive> alter table psn4 add partition(age=30,sex='boy');
Drop a partition directory (you can specify just one of the partition columns):
hive> alter table psn4 drop partition(sex='man');
Repair partitions (sync partition directories on HDFS back into the metastore):
hive> msck repair table psn4;
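What msck repair actually does is register partition directories that exist in HDFS but are missing from the metastore. A small demonstration (the age=40/sex=girl values are made up purely for illustration):
hdfs dfs -mkdir -p /user/hive/warehouse/psn4/age=40/sex=girl
hive> msck repair table psn4;
hive> show partitions psn4;   -- the new partition now shows up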
Truncate the table:
hive> truncate table psn4;
Inserting data, approach two (insert ... select from another table):
hive> create table psn10
(
id int,
name string
)
row format delimited
fields terminated by ',';
hive> create table psn11
(
id int,
likes array<string>
)
row format delimited
fields terminated by ','
collection items terminated by '-';
hive> FROM psn
INSERT OVERWRITE TABLE psn10
SELECT id,name
insert into psn11
select id,likes;
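This multi-insert form scans psn once and writes both target tables in a single job. To verify (ordinary queries, added as an optional check):
hive> select * from psn10 limit 3;
hive> select * from psn11 limit 3;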
On caroly03, run:
hiveserver2
Note: this is also a blocking foreground process; it is normal for it to appear stuck.
On caroly04, run:
beeline
beeline> !connect jdbc:hive2://caroly03:10000/default root xxx
beeline> show tables;
The trailing root and xxx can be filled in with anything, but must not be empty.
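Equivalently (an alternative invocation pointing at the same HiveServer2 address), the connection can be passed on the beeline command line:
beeline -u jdbc:hive2://caroly03:10000/default -n root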
Write the following code in eclipse:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveJdbcClient {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws SQLException {
        // load the HiveServer2 JDBC driver
        try {
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        }
        // connect to HiveServer2 on caroly03 (user "root", empty password)
        Connection conn = DriverManager.getConnection("jdbc:hive2://caroly03:10000/default", "root", "");
        Statement stmt = conn.createStatement();
        String sql = "select * from psn limit 5";
        ResultSet rs = stmt.executeQuery(sql);
        while (rs.next()) {
            System.out.println(rs.getString(1) + "_" + rs.getString("name"));
        }
    }
}
The data can now be operated on through code.
Enable dynamic partitioning:
# set to true to enable dynamic partitioning
hive> set hive.exec.dynamic.partition=true;
# strict (the default) requires at least one static partition column; nonstrict allows all partition columns to be dynamic
hive> set hive.exec.dynamic.partition.mode=nonstrict;
# maximum number of dynamic partitions each MR node may create (default 100)
hive> set hive.exec.max.dynamic.partitions.pernode;
# maximum number of dynamic partitions that may be created across all MR nodes (default 1000)
hive> set hive.exec.max.dynamic.partitions;
# maximum number of files that all MR jobs may create (default 100000)
hive> set hive.exec.max.created.files;
Create two tables:
hive> create table psn21
(
id int,
name string,
age int,
gender string,
likes array<string>,
address map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
hive> create table psn22
(
id int,
name string,
likes array<string>,
address map<string,string>
)
partitioned by(age int, gender string)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
The data file is data4:
1,张三1,12,max,lol-book-movie,beijng:wudaokou-shanghai:pudong
2,张三2,11,man,lol-book-movie,beijng:wudaokou-shanghai:pudong
3,张三3,12,man,lol-movie,beijng:wudaokou-shanghai:pudong-shenzhen:nanshan
4,张三4,11,min,lol-book,beijng:wudaokou-shanghai:pudong
5,张三5,11,max,lol-book-movie,shanghai:pudong
6,张三6,12,man,lol-book,beijng:wudaokou-shanghai:pudong
Load it into the staging table:
hive> load data local inpath '/root/data/data4' into table psn21;
Then run the following SQL:
hive> from psn21
insert into psn22 partition(age,gender)
select id,name,likes,address,age,gender;
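Once this runs, the dynamically created partitions can be listed directly (show partitions is a standard Hive command, added here as an optional check; expect one entry per (age, gender) combination present in data4):
hive> show partitions psn22;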
A plain select does not make the dynamic partitioning obvious. To see the partition directories, visit caroly01:50070 and go to: Utilities -> Browse the file system -> user -> hive -> warehouse -> psn22. You will find multiple partition directories there.
Enable bucketing:
# when set to true, MR automatically sets the number of reduce tasks to match the number of buckets
# note: one job produces as many bucket files as there are reduce tasks
hive> set hive.enforce.bucketing=true;
Create a table:
hive> create table psn31( id int, name string, age int)
row format delimited fields terminated by ',';
The test data is as follows:
1,tom,11
2,cat,22
3,dog,33
4,hive,44
5,hbase,55
6,mr,66
7,alice,77
8,scala,88
Load the data into the table:
hive> load data local inpath '/root/data/bucket' into table psn31;
Create the bucketed table:
hive> create table psn_pucket( id int, name string, age int)
clustered by(age) into 4 buckets
row format delimited fields terminated by ',';
Insert data into it:
hive> insert into psn_pucket select id,name,age from psn31;
Inspect the data in each bucket:
hdfs dfs -cat /user/hive/warehouse/psn_pucket/000000_0
hdfs dfs -cat /user/hive/warehouse/psn_pucket/000001_0
hdfs dfs -cat /user/hive/warehouse/psn_pucket/000002_0
hdfs dfs -cat /user/hive/warehouse/psn_pucket/000003_0
Or, using table sampling:
select * from psn_pucket tablesample(bucket 1 out of 4);
select * from psn_pucket tablesample(bucket 2 out of 4);
select * from psn_pucket tablesample(bucket 3 out of 4);
select * from psn_pucket tablesample(bucket 4 out of 4);
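For reference, tablesample(bucket x out of y) requires y to be a factor or multiple of the bucket count and reads bucket-count/y buckets starting at bucket x. For example (an optional extra query, assuming the 4-bucket table above):
select * from psn_pucket tablesample(bucket 2 out of 2);
-- with 4 buckets and y=2 this reads 2 buckets: bucket 2 and bucket 4 (files 000001_0 and 000003_0)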