Building a Custom Docker Image for K8s Spark Operator to Fix Vulnerabilities

Tags: api
DATE POSTED: October 11, 2024

There is a requirement to use the Spark Operator in a K8s cluster to run a Spark job. The official image contains many vulnerabilities, including ones introduced by the Hadoop libraries. Let's build our own Spark Operator image.

To build our image, we'll need a Spark image as the base image and a Golang image to build the Spark Operator itself.

Spark image

Build a Spark image without Hadoop, using a specific version of Spark:

RUN curl -L https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-without-hadoop.tgz -o spark-3.5.1-bin-without-hadoop.tgz \
    && tar -xvzf spark-3.5.1-bin-without-hadoop.tgz \
    && mv spark-3.5.1-bin-without-hadoop /opt/spark \
    && rm spark-3.5.1-bin-without-hadoop.tgz
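For context, here is a minimal sketch of how this step could sit in a complete base-image Dockerfile. The eclipse-temurin base image and the SPARK_HOME wiring are assumptions for illustration, not part of the original build:

# Hypothetical Dockerfile for the Spark base image.
# The eclipse-temurin:11-jre base is an assumption; any JRE 11 image works.
FROM eclipse-temurin:11-jre

RUN apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/*

RUN curl -L https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-without-hadoop.tgz -o spark-3.5.1-bin-without-hadoop.tgz \
    && tar -xvzf spark-3.5.1-bin-without-hadoop.tgz \
    && mv spark-3.5.1-bin-without-hadoop /opt/spark \
    && rm spark-3.5.1-bin-without-hadoop.tgz

ENV SPARK_HOME=/opt/spark
ENV PATH=$PATH:$SPARK_HOME/bin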

Spark Operator image

Next we build the Spark Operator image itself; we will need several Hadoop libraries in it to run spark-submit commands.

As an example, a FIPS-compliant build is shown; the differences are in the build and run parameters.

For the Go build, the GOEXPERIMENT=boringcrypto parameter is used.

For running spark-submit, the Java parameter -Djavax.net.ssl.trustStorePassword=password is used for Bouncy Castle.

You can also build the image without the FIPS changes.
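To check that the FIPS build took effect, one option is to look for BoringCrypto symbols in the compiled operator binary. This is only a quick sanity check; the binary path matches the build stage in the Dockerfile below:

# List the binary's symbols and look for BoringCrypto ones.
# No output means the binary was built without GOEXPERIMENT=boringcrypto.
go tool nm /app/spark-operator/spark-operator | grep goboringcrypto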

To run spark-submit, we will add the following Hadoop libraries during the build process (a quick smoke test follows the list):

  • hadoop-client-runtime
  • hadoop-client-api
  • slf4j-api
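Once the image is built, one way to confirm that these jars are enough for spark-submit is a simple version check; the spark-operator:custom tag here is an assumption for illustration:

# spark-submit fails on startup if the Hadoop client jars
# are missing from /opt/spark/jars.
docker run --rm --entrypoint /opt/spark/bin/spark-submit spark-operator:custom --version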

entrypoint.sh is taken from the official Kubeflow repository: https://github.com/kubeflow/spark-operator/blob/master/entrypoint.sh

An example Dockerfile for building the Spark Operator:

ARG SPARK_IMAGE=spark-3.5.1-bin-without-hadoop
ARG GOLANG_IMAGE=golang-1.21
ARG SPARK_OPERATOR_VERSION=1.3.1
ARG HADOOP_VERSION_DEFAULT=3.4.0
ARG HADOOP_TMP_HOME="/opt/hadoop"
ARG TARGETARCH=amd64
# Registry that holds the base Spark image; declared here because it is
# referenced in the second FROM below.
ARG ECR_URL

# Prepare spark-operator build
FROM ${GOLANG_IMAGE} as builder

WORKDIR /app/spark-operator

ARG SPARK_OPERATOR_VERSION
RUN curl -Ls https://github.com/kubeflow/spark-operator/archive/refs/tags/spark-operator-chart-${SPARK_OPERATOR_VERSION}.tar.gz \
    | tar -xz --strip-components 1 -C /app/spark-operator
RUN GOTOOLCHAIN=go1.22.3 go mod download

# Build
ARG TARGETARCH
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} GO111MODULE=on GOTOOLCHAIN=go1.22.3 GOEXPERIMENT=boringcrypto \
    go build -a -o /app/spark-operator/spark-operator main.go

# Install Hadoop jars
ARG HADOOP_VERSION_DEFAULT
ARG HADOOP_TMP_HOME
RUN mkdir -p ${HADOOP_TMP_HOME}
RUN curl -Ls https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION_DEFAULT}/hadoop-${HADOOP_VERSION_DEFAULT}.tar.gz \
    | tar -xz --strip-components 1 -C ${HADOOP_TMP_HOME}

# Prepare spark-operator image
FROM ${ECR_URL}:${SPARK_IMAGE}

WORKDIR /opt/spark-operator

USER root

ENV SPARK_HOME="/opt/spark"
ENV JAVA_HOME="/opt/jdk-11.0.21"
ENV SPARK_SUBMIT_OPTS="${SPARK_SUBMIT_OPTS} -Djavax.net.ssl.trustStorePassword=password"
ENV PATH=${PATH}:${SPARK_HOME}/bin:${JAVA_HOME}/bin

RUN yum update -y && \
    yum install --setopt=tsflags=nodocs -y openssl && \
    yum clean all

ARG HADOOP_TMP_HOME
COPY --from=builder ${HADOOP_TMP_HOME}/share/hadoop/client/hadoop-client-runtime-*.jar \
     ${HADOOP_TMP_HOME}/share/hadoop/client/hadoop-client-api-*.jar \
     ${HADOOP_TMP_HOME}/share/hadoop/common/lib/slf4j-api-*.jar \
     /opt/spark/jars/

COPY --from=builder /app/spark-operator/spark-operator /opt/spark-operator/
COPY --from=builder /app/spark-operator/hack/gencerts.sh /usr/bin/

COPY entrypoint.sh /opt/spark-operator/
RUN chmod a+x /opt/spark-operator/entrypoint.sh

ENTRYPOINT ["/opt/spark-operator/entrypoint.sh"]
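The image can then be built with a command along these lines; the registry URL and the spark-operator:custom tag are assumptions for illustration:

# ECR_URL must point at the registry/repository that holds the
# spark-3.5.1-bin-without-hadoop base image built earlier.
docker build \
    --build-arg ECR_URL=123456789012.dkr.ecr.us-east-1.amazonaws.com/spark \
    --build-arg TARGETARCH=amd64 \
    -t spark-operator:custom .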

Conclusion

After the build, we are still left with several vulnerabilities in the Hadoop library hadoop-client-runtime:

  • org.apache.avro:avro (hadoop-client-runtime-3.4.0.jar) – CVE-2023-39410
  • org.apache.commons:commons-compress – CVE-2024-25710, CVE-2024-26308

We cannot run spark-submit without this library, but the vast majority of the vulnerabilities are gone along with the main Hadoop libraries.
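To see exactly which CVEs remain, the final image can be scanned, for example with Trivy; the spark-operator:custom tag is again an assumption:

# Report only high and critical findings in the final image.
trivy image --severity HIGH,CRITICAL spark-operator:custom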
