1. Installing CentOS 7 in a virtual machine and configuring a shared folder
2. Complete walkthrough of a pseudo-distributed Hadoop setup on CentOS 7
3. Operating HDFS from the local machine with Python: setup and common problems
4. MapReduce setup
5. Mapper/reducer programming setup
6. Hive data warehouse installation
cd /home/huangqifa/software/
touch mapper.py
Edit the file contents:
sudo gedit mapper.py
Paste in the following content:
#!/usr/bin/env python
import sys

# input comes from standard input
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # write the results to STDOUT as tab-separated "word<TAB>1" pairs
    for word in words:
        print('%s\t%s' % (word, 1))
touch reducer.py
sudo gedit reducer.py
Paste in the following:
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input comes from standard input (mapper output, sorted by key)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the tab-separated output from mapper.py
    word, count = line.split('\t', 1)
    # count must be an integer; skip malformed lines
    try:
        count = int(count)
    except ValueError:
        continue
    # the sorted input guarantees that all counts for the same word
    # arrive consecutively
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# output the count for the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
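As an optional aside (not part of the required steps), the same grouping logic can be written more compactly with itertools.groupby. This is only a sketch for comparison; it assumes the same tab-separated "word<TAB>count" input, already sorted by key, which sort -k1,1 or Hadoop's shuffle phase provides.
#!/usr/bin/env python
# Optional sketch: a more compact reducer using itertools.groupby.
# Assumes tab-separated "word<TAB>count" lines on stdin, already sorted by key.
import sys
from itertools import groupby
from operator import itemgetter

def read_pairs(stream):
    # yield (word, count_string) pairs from tab-separated lines
    for line in stream:
        word, sep, count = line.strip().partition('\t')
        if sep:  # skip lines without a tab separator
            yield word, count

for word, group in groupby(read_pairs(sys.stdin), key=itemgetter(0)):
    total = sum(int(count) for _, count in group)
    print('%s\t%d' % (word, total))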
Make the scripts executable:
sudo chmod +x mapper.py
sudo chmod +x reducer.py
touch test00.txt
sudo gedit test00.txt
Paste in the following line:
foo foo quux labs foo bar quux
Test mapper.py:
echo "foo foo quux labs foo bar quux" | ./mapper.py
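With the mapper above, this command should print one tab-separated "word 1" pair per input word, in input order:
foo	1
foo	1
quux	1
labs	1
foo	1
bar	1
quux	1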
Test reducer.py:
echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
# Here sort -k1,1 sorts the mapper output by key: -k, --key=POS1[,POS2].
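If both scripts work, the pipeline should print the aggregated counts, sorted by word:
bar	1
foo	3
labs	1
quux	2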
hdfs dfs -mkdir -p /user/input
Upload test00.txt to the /user/input directory in HDFS:
hdfs dfs -put /home/huangqifa/software/test00.txt /user/input
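Optionally, confirm the upload by listing the directory:
hdfs dfs -ls /user/input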
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.7.jar -files /home/huangqifa/software/mapper.py,/home/huangqifa/software/reducer.py -mapper "mapper.py" -reducer "reducer.py" -input /user/input/test00.txt -output /user/output
Note: change the mapper.py and reducer.py paths to your own; the hadoop-streaming jar filename also depends on your Hadoop version (2.7.7 here).
If /user/output already exists, the job will fail with an error; remove it first:
hdfs dfs -rm -r /user/output
View the output files:
hdfs dfs -cat /user/output/*
hadoop fs -ls /user/output/
hadoop fs -get /user/output/part-00000
Or download it through the NameNode web UI in a browser.
References
https://blog.csdn.net/andy_wcl/article/details/104610931
https://blog.csdn.net/qq_39315740/article/details/98108912