Big Data Technology: Running the Hadoop wordcount Example - 职坐标


沉沙, 2018-09-25

Abstract: This tutorial walks through running the wordcount example on Hadoop, with the aim of giving readers a deeper understanding of big data technology.


1. Check the Hadoop version

[hadoop@ltt1 sbin]$ hadoop version
Hadoop 2.6.0-cdh5.12.0
Subversion http://github.com/cloudera/hadoop -r dba647c5a8bc5e09b572d76a8d29481c78d1a0dd
Compiled by jenkins on 2017-06-29T11:33Z
Compiled with protoc 2.5.0
From source with checksum 7c45ae7a4592ce5af86bc4598c5b4
This command was run using /home/hadoop/hadoop260/share/hadoop/common/hadoop-common-2.6.0-cdh5.12.0.jar

2. The examples jar that ships with Hadoop makes it easy to test some basic functionality.

Run hadoop-mapreduce-examples-2.6.0-cdh5.12.0.jar without arguments to list the MapReduce programs it supports:

[hadoop@ltt1 sbin]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.12.0.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
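Any program name in the list above can be passed as the first argument to the jar. The sketch below wraps that call pattern in two shell functions. The function names `examples_jar` and `run_example` and the `HADOOP_CMD` variable are hypothetical and not part of Hadoop; the jar path layout is the one used in this tutorial and may differ on other installations.

```shell
# Build the examples jar path from HADOOP_HOME and a version string,
# following the path layout used in this tutorial.
examples_jar() {
    local version="$1"
    printf '%s/share/hadoop/mapreduce/hadoop-mapreduce-examples-%s.jar\n' \
        "$HADOOP_HOME" "$version"
}

# Hypothetical wrapper: run one example program by name.
# HADOOP_CMD defaults to the real CLI; HADOOP_CMD=echo only prints the arguments.
run_example() {
    local version="$1" program="$2"
    shift 2
    "${HADOOP_CMD:-hadoop}" jar "$(examples_jar "$version")" "$program" "$@"
}
```

For instance, `run_example 2.6.0-cdh5.12.0 wordcount /input /output` would issue the same command as step 7, and prefixing the call with `HADOOP_CMD=echo` shows the arguments without submitting a job.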

3. Create a directory on HDFS

hadoop fs -mkdir /input

4. List the HDFS root directory

[hadoop@ltt1 ~]$ hadoop fs -ls /
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2017-09-17 08:11 /input
drwx------   - hadoop supergroup          0 2017-09-17 08:07 /tmp

5. Upload local files to HDFS

hadoop fs -put $HADOOP_HOME/*.txt /input
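The mkdir and put steps can be combined into one repeatable helper. This is a minimal sketch, not the author's procedure: `stage_input` and `HADOOP_CMD` are hypothetical names, and the `-p` (do not fail if the directory exists) and `-f` (overwrite existing files) flags are additions that the original commands do not use.

```shell
# Hypothetical helper: create the HDFS directory, then upload the local *.txt files.
# HADOOP_CMD defaults to the real CLI; HADOOP_CMD=echo only prints the arguments.
stage_input() {
    local src_dir="$1" dest="$2"
    local hadoop_cmd="${HADOOP_CMD:-hadoop}"
    # -p: succeed even if the directory already exists
    "$hadoop_cmd" fs -mkdir -p "$dest" || return 1
    # -f: overwrite files that are already present at the destination
    "$hadoop_cmd" fs -put -f "$src_dir"/*.txt "$dest"
}
```

Called as `stage_input "$HADOOP_HOME" /input`, it should reproduce steps 3 and 5 and remain safe to run a second time.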

6. List the files under /input on HDFS

[hadoop@ltt1 ~]$ hadoop fs -ls /input
Found 3 items
-rw-r--r--   2 hadoop supergroup      85063 2017-09-17 08:15 /input/LICENSE.txt
-rw-r--r--   2 hadoop supergroup      14978 2017-09-17 08:15 /input/NOTICE.txt
-rw-r--r--   2 hadoop supergroup       1366 2017-09-17 08:15 /input/README.txt

7. Run a simple wordcount test

[hadoop@ltt1 ~]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.12.0.jar wordcount /input /output
17/09/17 08:19:12 INFO input.FileInputFormat: Total input paths to process : 3
17/09/17 08:19:13 INFO mapreduce.JobSubmitter: number of splits:3
17/09/17 08:19:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1505605169997_0002
17/09/17 08:19:14 INFO impl.YarnClientImpl: Submitted application application_1505605169997_0002
17/09/17 08:19:14 INFO mapreduce.Job: The url to track the job: http://ltt1.bg.cn:9180/proxy/application_1505605169997_0002/
17/09/17 08:19:14 INFO mapreduce.Job: Running job: job_1505605169997_0002
17/09/17 08:19:27 INFO mapreduce.Job: Job job_1505605169997_0002 running in uber mode : false
17/09/17 08:19:27 INFO mapreduce.Job:  map 0% reduce 0%
17/09/17 08:19:39 INFO mapreduce.Job:  map 33% reduce 0%
17/09/17 08:19:48 INFO mapreduce.Job:  map 100% reduce 0%
17/09/17 08:19:50 INFO mapreduce.Job:  map 100% reduce 100%
17/09/17 08:19:50 INFO mapreduce.Job: Job job_1505605169997_0002 completed successfully
17/09/17 08:19:50 INFO mapreduce.Job: Counters: 50
    File System Counters
        FILE: Number of bytes read=42705
        FILE: Number of bytes written=588235
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=101699
        HDFS: Number of bytes written=30167
        HDFS: Number of read operations=12
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=3
        Launched reduce tasks=1
        Data-local map tasks=2
        Rack-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=47617
        Total time spent by all reduces in occupied slots (ms)=8244
        Total time spent by all map tasks (ms)=47617
        Total time spent by all reduce tasks (ms)=8244
        Total vcore-milliseconds taken by all map tasks=47617
        Total vcore-milliseconds taken by all reduce tasks=8244
        Total megabyte-milliseconds taken by all map tasks=48759808
        Total megabyte-milliseconds taken by all reduce tasks=8441856
    Map-Reduce Framework
        Map input records=2035
        Map output records=14239
        Map output bytes=155828
        Map output materialized bytes=42717
        Input split bytes=292
        Combine input records=14239
        Combine output records=2653
        Reduce input groups=2402
        Reduce shuffle bytes=42717
        Reduce input records=2653
        Reduce output records=2402
        Spilled Records=5306
        Shuffled Maps =3
        Failed Shuffles=0
        Merged Map outputs=3
        GC time elapsed (ms)=881
        CPU time spent (ms)=22320
        Physical memory (bytes) snapshot=690192384
        Virtual memory (bytes) snapshot=10862809088
        Total committed heap usage (bytes)=380243968
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=101407
    File Output Format Counters
        Bytes Written=30167
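The counters in the log above can be cross-checked with a little arithmetic. The sketch below copies a few values from the log and derives two figures: the share of map output records that the combiner removed before the shuffle, and the memory per map container implied by dividing megabyte-milliseconds by task milliseconds. The variable and function names are illustrative, and the container-size reading assumes that all map containers were allocated the same amount of memory.

```shell
# Counter values copied from the job log above.
map_output_records=14239
combine_output_records=2653
map_task_ms=47617
map_mb_ms=48759808

# Percentage of map output records removed by the combiner (integer division).
combiner_savings_pct() {
    echo $(( (map_output_records - combine_output_records) * 100 / map_output_records ))
}

# Memory per map container implied by megabyte-ms divided by task-ms.
map_container_mb() {
    echo $(( map_mb_ms / map_task_ms ))
}

echo "combiner removed about $(combiner_savings_pct)% of map output records"
echo "implied map container size: $(map_container_mb) MB"
```

The same division for the reduce side (8441856 / 8244) also gives 1024. Reduce input records (2653) equals Combine output records, and Reduce output records (2402) equals Reduce input groups, which is the number of distinct tokens across the three files.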

8. View the wordcount results (the output is long, so only part of it is shown)

[hadoop@ltt1 ~]$ hadoop fs -cat /output/*
worldwide,    4
would    1
writing    2
writing,    4
written    19
xmlenc    1
year    1
you    12
your    5
zlib    1
252.227-7014(a)(1))    1
§    1
“AS    1
“Contributor    1
“Contributor”    1
“Covered    1
“Executable”    1
“Initial    1
“Larger    1
“Licensable”    1
“License”    1
“Modifications”    1
“Original    1
“Participant”)    1
“Patent    1
“Source    1
“Your”)    1
“You”    2
“commercial    3
“control”    1
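In the output above, `writing` and `writing,` are counted as different words because the example splits its input on whitespace only and leaves punctuation attached. The following is a local approximation of that behaviour using awk and sort, useful for sanity-checking a small file without a cluster. It is a sketch rather than the Hadoop implementation, and `local_wordcount` is a hypothetical name.

```shell
# Local approximation of wordcount: split each line on whitespace,
# count identical tokens, and print "token<TAB>count" sorted by byte order.
local_wordcount() {
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) print w "\t" count[w] }' \
        | LC_ALL=C sort
}
```

For example, `local_wordcount < README.txt` prints one line per distinct token. Results for a single file will differ from the job above, which counted all three files together.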

>>  //www.cnblogs.com/tijun/  <<
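That completes this small wordcount example, which briefly practiced creating a directory on HDFS, uploading files, listing directories, and running the wordcount job.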