Running Spark applications on Windows is, in general, no different from running them on other operating systems like Linux or macOS.
Note: A Spark application could be spark-shell or your own custom Spark application.
What makes the difference between the operating systems is Hadoop, which Spark uses internally for file system access. You may run into a few minor issues on Windows because Hadoop does not play well with Windows' POSIX-incompatible NTFS filesystem.
Note: You do not have to install Apache Hadoop to work with Spark or run Spark applications.
Tip: Read the Apache Hadoop project's Problems running Hadoop on Windows.
Among the issues is the infamous java.io.IOException when running spark-shell (below is a stacktrace from Spark 2.0.2 on Windows 10, so the line numbers may be different in your case).
16/12/26 21:34:11 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
  at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
  at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
  at org.apache.hadoop.hive.conf.HiveConf$ConfVars.findHadoopBinary(HiveConf.java:2327)
  at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:365)
  at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.spark.util.Utils$.classForName(Utils.scala:228)
  at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:963)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:91)
You need to have Administrator rights on your laptop. All the following commands must be executed in a command-line window (cmd) started using the Run as administrator option.
Tip: Read the official Microsoft TechNet document Start a Command Prompt as an Administrator.
Download the winutils.exe binary from the https://github.com/steveloughran/winutils repository. Select the binaries built for the version of Hadoop your Spark distribution was compiled with.
Save the winutils.exe binary to a bin subdirectory of a directory of your choice.
Set HADOOP_HOME to that directory, i.e. the one containing bin, not bin itself.
Add %HADOOP_HOME%\bin to the PATH environment variable.
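The environment setup above can be sketched in a cmd session as follows. The c:\hadoop location is only an assumed example; use whatever directory you actually saved winutils.exe to:

```
rem Assumption: winutils.exe was saved to c:\hadoop\bin\winutils.exe
set HADOOP_HOME=c:\hadoop
rem Make winutils.exe resolvable via the PATH
set PATH=%HADOOP_HOME%\bin;%PATH%
```

Note that set only affects the current cmd session.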
You can also define these environment variables permanently in the Windows system settings so they apply to every new session.
Create the C:\tmp\hive directory, then execute the following command in a cmd window that you started using the Run as administrator option:
winutils.exe chmod -R 777 C:\tmp\hive
Check the permissions (that is one of the commands that are executed under the covers):
winutils.exe ls -F C:\tmp\hive
The listing should show drwxrwxrwx for C:\tmp\hive, i.e. the 777 permissions you just set.
Run spark-shell and observe the output (perhaps with a few WARN messages that you can simply disregard).
As a verification step, execute the following line to display the content of a simple one-row Dataset:
scala> spark.range(1).withColumn("status", lit("All seems fine. Congratulations!")).show(false)
+---+--------------------------------+
|id |status                          |
+---+--------------------------------+
|0  |All seems fine. Congratulations!|
+---+--------------------------------+
Note: Disregard any WARN messages when you start spark-shell; they are harmless.
If you see the above output, you're done. You should now be able to run Spark applications on Windows. Congrats!
Optionally, you can change the default Hive scratch directory by creating a hive-site.xml file with the following content:
<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/mydir</value>
    <description>Scratch space for Hive jobs</description>
  </property>
</configuration>
Start a Spark application, e.g. spark-shell, with the HADOOP_CONF_DIR environment variable set to the directory containing hive-site.xml.
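As a sketch, assuming the hive-site.xml above was saved to c:\conf (a hypothetical example directory), the steps could look like this in cmd:

```
rem Point Hadoop/Hive at the directory containing hive-site.xml
set HADOOP_CONF_DIR=c:\conf
rem Start a Spark application, e.g. spark-shell
spark-shell
```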