기록/그 외 프로젝트 기록

[Hadoop] Python으로 wordcount하기(Hadoop Streaming)

5월._. 2022. 9. 22.

728x90

코드는 딱히 특별하지 않아서 설명하지 않는다. (내가 이 글로 말하고 싶은 부분은 3,4번에 있다.)

1. mapper

#!/usr/bin/env phthon3
# -*-coding:utf-8 -*

import sys

for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print('{}\t{}'.format(word, 1))

2. reducer

#!/usr/bin/env python3
# -*-coding:utf-8 -*

import sys

def print_output(word, count):
        print('{}\t{}'.format(word, count))


word, count = None, 0

for line in sys.stdin:
    fields = line.strip().split('\t')

    if fields[0] != word:
        if word is not None:
            print_output(word, count)

        word, count = fields[0], 0

    count += 1

print_output(word, count)

3. 실행문

hadoop-streaming.jar을 실행하면서 mapper, reducer로 python 파일을 주는 방식이다. (이 방식인지 잘 몰라서 엄청나게 헤맸다. 다른 사람들은 그러지 않기를 바란다..)

따라서 file 명령어로 꼭 mapper와 reducer 파일 위치를 명시해줘야한다.

input, output은 hdfs 경로로 접근한다.

hdfs dfs -rm -r output | \
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
-mapper "python3 mapper.py" \
-file mapper.py \
-reducer "python3 reducer.py" \
-file reducer.py \
-input data \
-output output

4. 장단점

파이썬 코드를 그대로 mapreduce에 활용할 수 있다는 점은 좋지만, hadoop-streaming.jar가 실행되면서 내부적으로 파이썬코드를 이용하는 것이기 때문에 로그를 제대로 활용할 수 없다. 몇 번째 줄에서 오류가 나는지, 어떤 오류인지 파악할 수 없다. 로그확인하면 이 메세지만 나올것이다..

'기록 > 그 외 프로젝트 기록' 카테고리의 다른 글

[Hadoop] Map Reduce를 위한 maven project 만들기 (0)	2022.10.11
[NLP] Open Korean Text 자바로 구현 (0)	2022.10.10
[Hadoop] Map Reduce - timed out after 600secs (0)	2022.09.28