Categories: Big Data

Hadoop Native Libraries for Apache Spark

The post on Diagnosing Hadoop Native Library Load Failures focused on Hadoop running as a standalone application. However, Hadoop can also be one component among many in an application with broader scope, such as Apache Spark. Having not used Spark before, I found the Quick Start – Spark 3.4.0 Documentation (apache.org) informative. It suggested downloading the packaged release of Spark from the Spark website, but I went with the CDN at https://dlcdn.apache.org/spark/spark-3.4.0/ since it was the same one I had downloaded my Hadoop build from.

Setting Up and Launching Spark

Download and extract Spark using these commands:

cd ~/java/binaries
mkdir spark
cd spark
curl -Lo spark-3.4.0-bin-hadoop3.tgz https://dlcdn.apache.org/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz

tar xzf spark-3.4.0-bin-hadoop3.tgz
cd spark-3.4.0-bin-hadoop3

Spark needs JAVA_HOME to be set (otherwise the first message displayed will be ERROR: JAVA_HOME is not set and could not be found).

export JAVA_HOME=~/java/binaries/jdk/x64/jdk-11.0.19+7

Next, I started the Spark shell by running this command as per the Quick Start docs:

./bin/spark-shell

Notice that the same Hadoop warning from Diagnosing Hadoop Native Library Load Failures showed up again! However, we have already seen that the Hadoop logging level can be customized. The key question now is how to enable DEBUG logging in Spark.

saint@ubuntuvm:~/java/binaries/spark/spark-3.4.0-bin-hadoop3$ ./bin/spark-shell
23/06/01 10:31:38 WARN Utils: Your hostname, ubuntuvm resolves to a loopback address: 127.0.1.1; using 172.18.28.45 instead (on interface eth0)
23/06/01 10:31:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/06/01 10:31:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://ubuntuvm.mshome.net:4040
Spark context available as 'sc' (master = local[*], app id = local-1685637105440).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/
         
Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 17.0.6)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Command Line Customization of Spark Logging Level

The Overview – Spark 3.4.0 Documentation (apache.org) page states that …

Users can also download a “Hadoop free” binary and run Spark with any Hadoop version by augmenting Spark’s classpath.

Spark 3.4.0 Documentation (apache.org)
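For context, that classpath augmentation happens through conf/spark-env.sh. The Hadoop-free build instructions show Spark being pointed at an existing Hadoop installation with a line roughly like the following (the Hadoop path here is the one from my setup and is only illustrative):

export SPARK_DIST_CLASSPATH=$(/home/saint/java/binaries/hadoop/x64/hadoop-3.3.5/bin/hadoop classpath)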

The classpath augmentation doc (Using Spark’s “Hadoop Free” Build) is what informed me that the way Spark uses Hadoop can be customized by entries in conf/spark-env.sh. Unfortunately, there are no log level settings in the spark-env.sh.template file in that directory. After a bit of a winding journey, I discovered that the way to customize the logging level is to first create a conf/log4j2.properties file by running:

cp conf/log4j2.properties.template conf/log4j2.properties

Next, change the logging level by updating this line. The template sets it to warn; changing warn to debug enables the DEBUG output we are after:

logger.repl.level = debug
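For context, that setting sits next to a logger.repl.name entry in the template, which ties the level to the shell's org.apache.spark.repl.Main class. That is why (as noted in the appendix) changing rootLogger.level alone had no effect. A rough sketch of the relevant section after the edit, assuming the standard Spark 3.4.0 template layout:

# Default spark-shell log level; overrides rootLogger.level when the shell starts
logger.repl.name = org.apache.spark.repl.Main
logger.repl.level = debug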

Launching the Spark shell now displays a much more informative error message. It is evident that the paths being searched for native libraries do not include the path we need.

23/06/01 11:16:31 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...
23/06/01 11:16:31 DEBUG NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path: [/usr/java/packages/lib, /usr/lib64, /lib64, /lib, /usr/lib]
23/06/01 11:16:31 DEBUG NativeCodeLoader: java.library.path=/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
23/06/01 11:16:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
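Before pointing Spark at the Hadoop native libraries, it is worth confirming they exist where we expect them (the path below is the one from my Hadoop setup); the listing should include libhadoop.so:

ls /home/saint/java/binaries/hadoop/x64/hadoop-3.3.5/lib/native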

Fixing the Spark Hadoop Native Libraries

I searched for how to pass Spark extra Java options. The Tuning – Spark 3.4.0 Documentation mentioned the spark.executor.defaultJavaOptions and spark.executor.extraJavaOptions arguments, which I found documented at Configuration – Spark 3.4.0 Documentation. I tried passing these options (and the corresponding extraLibraryPath settings) to the Spark shell to load the Hadoop native library, without success.
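The attempts were along these lines (illustrative invocations, not my exact command history; they use the same native library path as the working command below):

./bin/spark-shell --conf spark.driver.extraLibraryPath=/home/saint/java/binaries/hadoop/x64/hadoop-3.3.5/lib/native --conf spark.executor.extraLibraryPath=/home/saint/java/binaries/hadoop/x64/hadoop-3.3.5/lib/native

./bin/spark-shell --conf "spark.driver.extraJavaOptions=-Djava.library.path=/home/saint/java/binaries/hadoop/x64/hadoop-3.3.5/lib/native"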

The required flag is --driver-library-path. It sounds like the extraLibraryPath options didn't work because the JVM has already started by the time they are processed.

./bin/spark-shell --driver-library-path=/home/saint/java/binaries/hadoop/x64/hadoop-3.3.5/lib/native

The --driver-library-path flag allows Spark to successfully load the Hadoop native libraries. The logging messages confirm this:

...
23/06/01 11:57:06 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...
23/06/01 11:57:06 DEBUG NativeCodeLoader: Loaded the native-hadoop library
...
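As an extra sanity check (not something the Spark docs call for), Hadoop's NativeCodeLoader exposes a flag that can be inspected directly from the shell. If the library loaded, an expression like this should evaluate to true:

scala> org.apache.hadoop.util.NativeCodeLoader.isNativeCodeLoaded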

Appendix: Resources Reviewed for Spark Logging Level Changes

It was the SPARK-7261 pull request that led me to look for the log4j2.properties file. Changing rootLogger.level did not have any effect, but scrolling through the template revealed the key line setting logger.repl.level.
