Summary

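Chapter 2 explains how to install Spark on each OS and covers the launch scripts for each language.
It covers both on-premise and cloud deployments, including launching Spark on AWS EC2 and EMR instances.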

Keywords & Terms

EMR (Elastic MapReduce)
YARN (Yet Another Resource Negotiator)
RDD (Resilient Distributed Dataset)

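Spark deployment modes overview

Spark is generally deployed in one of the following modes:

Local
Standalone
YARN (Hadoop)
Mesos (Apache)

These differ only in how resources are managed. Instead of an external scheduler such as YARN or Mesos,
you can run in local mode or use the Spark Standalone scheduler,
in which case Spark reportedly has no external dependencies.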
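As a rough illustration of how one of these modes gets selected, the sketch below maps a `--master` value to the cluster manager it implies. The function name `cluster_manager` is hypothetical and not part of Spark; the URL forms are the ones used in the spark-submit examples in this post (which also include Kubernetes).

```python
def cluster_manager(master: str) -> str:
    """Return the cluster manager implied by a --master value (illustrative only)."""
    if master == "local" or master.startswith("local["):
        return "local"
    if master.startswith("spark://"):
        return "standalone"
    if master == "yarn":
        return "yarn"
    if master.startswith("mesos://"):
        return "mesos"
    if master.startswith("k8s://"):
        return "kubernetes"
    raise ValueError(f"unrecognized master URL: {master!r}")


# Print the manager for a few of the master values used in this post.
for url in ["local[8]", "spark://207.184.161.138:7077", "yarn"]:
    print(url, "->", cluster_manager(url))
```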

Basically \

The explanation here is better.

Launch scripts by language

# Executables in the bin folder
-rwxr-xr-x 1 root root 1099 Nov 16  2016 beeline
-rw-r--r-- 1 root root 2143 Nov 16  2016 load-spark-env.sh
-rwxr-xr-x 1 root root 3265 Nov 16  2016 pyspark       // Python shell
-rwxr-xr-x 1 root root 1040 Nov 16  2016 run-example
-rwxr-xr-x 1 root root 3126 Nov 16  2016 spark-class
-rwxr-xr-x 1 root root 1049 Nov 16  2016 sparkR        // R shell
-rwxr-xr-x 1 root root 3026 Nov 16  2016 spark-shell   // Scala REPL
-rwxr-xr-x 1 root root 1075 Nov 16  2016 spark-sql     // Spark SQL CLI
-rwxr-xr-x 1 root root 1050 Nov 16  2016 spark-submit  // submits applications (JAR, Python, R)

spark-submit syntax

# spark-submit options
$ ./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <app jar | python file | R file> \
  [application-arguments]

# Running with YARN as the cluster manager
# Run a JAR file
$ spark-submit --master yarn \
  --queue spark_queue \
  --class sdk.spark.SparkWordCount \
  --conf spark.shuffle.service.enabled=true \
  ./Spark-Example.jar

# Run a Python file
$ spark-submit --master yarn \
  --queue spark_queue \
  ./py01.py
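The template above can also be assembled programmatically. The following is a minimal sketch, assuming `spark-submit` is on the PATH when the command is eventually run; `build_spark_submit` is a hypothetical helper written for this post, not a Spark API, and it covers only the options shown in the template.

```python
import shlex


def build_spark_submit(app, master, deploy_mode=None, main_class=None,
                       conf=None, app_args=()):
    """Assemble a spark-submit argument list following the template above."""
    cmd = ["spark-submit", "--master", master]
    if deploy_mode:
        cmd += ["--deploy-mode", deploy_mode]
    if main_class:  # only needed for JVM (JAR) applications
        cmd += ["--class", main_class]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)  # application JAR, Python file, or R file
    cmd += [str(arg) for arg in app_args]
    return cmd


# Rebuild the JAR example from above (without the --queue option).
cmd = build_spark_submit(
    "./Spark-Example.jar",
    master="yarn",
    main_class="sdk.spark.SparkWordCount",
    conf={"spark.shuffle.service.enabled": "true"},
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string means it can be handed to `subprocess.run(cmd, check=True)` without shell quoting concerns.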

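Option	Description
--master	Sets the cluster manager
--deploy-mode	Sets the deploy mode of the driver
--class	Specifies the class that contains the main function
--name	Sets the application name, which is shown in the Spark web UI
--jars	Libraries required to run the application, as a comma-separated list
--files	Files required to run the application
--queue	Sets the name of the YARN queue to run in
--executor-memory	Amount of memory for each executor. Values such as 512m or 1g are accepted
--driver-memory	Amount of memory for the driver process. Values such as 512m or 1g are accepted
--num-executors	Sets the number of executors
--executor-cores	Sets the number of cores per executor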
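To make the size strings used by the memory options concrete, here is a rough sketch of how a value such as `512m` or `1g` translates to bytes. It assumes binary units (1k = 1024 bytes) and handles only the k, m, g, and t suffixes; `parse_memory` is a hypothetical helper and is not Spark's own parser.

```python
import re

# Assumed binary multipliers for each suffix.
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}


def parse_memory(size: str) -> int:
    """Convert a size string such as '512m' or '1g' to bytes (illustrative only)."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", size.strip().lower())
    if match is None:
        raise ValueError(f"expected a number with a k/m/g/t suffix, got {size!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


print(parse_memory("512m"))
print(parse_memory("1g"))
```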

Examples

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster in cluster deploy mode
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000

# Run on a Kubernetes cluster in cluster deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://xx.yy.zz.ww:443 \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  http://path/to/examples.jar \
  1000

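In local mode and standalone mode:

# Local mode
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local \
$SPARK_HOME/examples/jars/spark-examples*.jar 10

# Standalone mode
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://mysparkmaster:7077 \
$SPARK_HOME/examples/jars/spark-examples*.jar 10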

YARN

# Cluster deploy mode; a client deploy mode is also available
$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
$SPARK_HOME/examples/jars/spark-examples*.jar 10
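With `--master yarn` no host is given on the command line; the cluster is located through the Hadoop client configuration, which is why the earlier YARN example exports HADOOP_CONF_DIR first. A small pre-flight check might look like the sketch below. `yarn_conf_dir` is a hypothetical helper, and accepting YARN_CONF_DIR as an alternative is an assumption based on the Spark on YARN documentation.

```python
import os


def yarn_conf_dir(env=None):
    """Return the Hadoop/YARN config directory from the environment, or None."""
    env = os.environ if env is None else env
    for name in ("HADOOP_CONF_DIR", "YARN_CONF_DIR"):
        path = env.get(name)
        # Only accept the variable if it points at an existing directory.
        if path and os.path.isdir(path):
            return path
    return None


if yarn_conf_dir() is None:
    print("Set HADOOP_CONF_DIR (or YARN_CONF_DIR) before using --master yarn")
```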

mesos

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000

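Installing Spark

Spark runs on Linux, Windows, and macOS; 8 GB or more of memory and 8 or more cores are recommended.
Spark is written in Scala, which is compiled to run on the JVM, so a JDK must be installed.
Installation instructions are available at https://spark.apache.org/downloads.html

Installing Spark on Mac | Linux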

Install Java

$ sudo apt-get install openjdk-8-jdk-headless   # Linux
$ brew cask install java                        # macOS

Download and extract Spark

$ wget https://dlcdn.apache.org/spark/spark-3.4.2/spark-3.4.2-bin-hadoop3.tgz
$ tar zxvf spark-3.4.2-bin-hadoop3.tgz
$ rm -rf spark-3.4.2-bin-hadoop3.tgz
$ sudo mv spark-3.4.2-bin-hadoop3 /opt/spark

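Set environment variables

# Either add the two export lines below to ~/.zshrc and reload it
$ vi ~/.zshrc
$ source ~/.zshrc
# or run them directly in the current shell
$ export SPARK_HOME=/opt/spark
$ export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin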
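Before launching anything, it can help to confirm that these variables took effect. The following is a minimal sketch of such a check; `check_spark_env` is a hypothetical helper, and it assumes the layout used above, where `spark-submit` lives under `$SPARK_HOME/bin`.

```python
import os


def check_spark_env(env=None):
    """Return a list of problems found with SPARK_HOME and PATH (empty if none)."""
    env = os.environ if env is None else env
    problems = []
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems
    submit = os.path.join(spark_home, "bin", "spark-submit")
    if not os.path.isfile(submit):
        problems.append(f"{submit} does not exist")
    path_dirs = env.get("PATH", "").split(os.pathsep)
    if os.path.join(spark_home, "bin") not in path_dirs:
        problems.append("$SPARK_HOME/bin is not on PATH")
    return problems


# Report anything that looks wrong in the current environment.
for problem in check_spark_env():
    print("warning:", problem)
```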

Verify

$ spark-shell
or
$ spark-submit --class org.apache.spark.examples.SparkPi \
--master local \
$SPARK_HOME/examples/jars/spark-examples*.jar 1000
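
Spark in the cloud

EC2 and EMR (Elastic MapReduce) are the typical options. EMR is the recommended choice because it provisions
a Hadoop cluster that comes with ecosystem projects such as Hive, Pig, Presto, and Zeppelin.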

[PySpark_#2] Spark Programming with Python (Chapter 2 of 8)

Spark deployment
