Get the Needed Prerequistes Software
git (for building Spark)
http://git-scm.com/download/win
Install and verify git was added to your path
Python 3.5.1
https://www.python.org/ftp/python/3.5.1/python-3.5.1-amd64.exe
Or even better, try the Anaconda version
Define Java
Get Java 8 (Oracle JDK):
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Prepare SBT
Get SBT 0.13.11 (for building spark compilation)
https://dl.bintray.com/sbt/native-packages/sbt/0.13.11.2/sbt-0.13.11.2.msi
"%_JAVACMD%" %_JAVA_OPTS% %SBT_OPTS% -Xmx2048m -Xms2048m -cp "%SBT_HOME%sbt-launch.jar" xsbt.boot.Boot %*
Or better, change these settings at C:\Program Files (x86)\sbt\conf\sbtconfig.txt
Buid Spark
Download Spark 1.6 (No need to intall now, see details below)
http://www.apache.org/dyn/closer.lua/spark/spark-1.6.0/spark-1.6.0.tgz
- Extract the source using 7zip (or anything else that can use tgz files) for example to C:\Spark
- Build spark:
- cd C:\Spark
- "c:\Program Files (x86)\sbt\bin\sbt.bat" package
- "c:\Program Files (x86)\sbt\bin\sbt.bat" assembly
Create Hadoop folder (for example c:\hadoop)
Define HADOOP_HOME environment variable (c:\hadoop)
Download to c:\hadoop\bin the winutils.exe file
Run It Like a Pro
Create a test file c:\spark\verify.txt and enter to it the following text: this is a trial
Run pyspark:
> c:\spark\bin\pyspark
In pyspark enter the following code:
text_file = sc.textFile("file://c:/spark/verify.txt")
text_file.collect()
The file should be printed
Keep Performing,
Moshe Kaplan