arm64 and x86_64 linux: TF java full native builds are failing to find the native headers #544

Open
snadampal opened this issue Jun 13, 2024 · 8 comments


@snadampal
Contributor

snadampal commented Jun 13, 2024

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04 x86_64): Linux Ubuntu 22.04, aarch64
  • TensorFlow installed from (source or binary): Source
  • TensorFlow version: TF version is v2.16.1, TF java version is tag: v1.0.0-rc.1
  • Java version (i.e., the output of java -version): openjdk 11.0.23 2024-04-16
  • Java command line flags (e.g., GC parameters): mvn install -P native-build -Dbazel.build.flags='--verbose_failures -s --config=mkl_aarch64_threadpool' -X
  • Installed from Maven Central?: No
  • Bazel version (if compiling from source): bazel 6.5.0
  • GCC/Compiler version (if compiling from source): gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  • CUDA/cuDNN version: Not applicable
  • GPU model and memory: Not applicable

Describe the problem
TensorFlow Java source builds are failing on an aarch64 Linux system because of missing native headers. Please let me know how it is built for the x86_64 Linux platform.

Based on my debugging so far, the dependency appears to come from this commit, which added a C API extension for custom gradient functions. It introduced the headers and .cc files listed below, which require several third_party libraries from native TensorFlow, but none of those Bazel workspaces are cloned.


tensorflow-core/tensorflow-core-native/src/main/native/org/tensorflow/internal/c_api$ 
tfj_gradients.h  tfj_gradients_impl.cc  tfj_graph.h  tfj_graph_impl.cc  tfj_scope.h  tfj_scope_impl.cc

I tried manually cloning the missing workspaces into the Bazel cache, but the cycle never ends: it is missing tsl, eigen, ml_dtypes, absl, protobuf, and now the compiled headers for protobuf...

Provide the exact sequence of commands / steps that you executed before running into the problem

sudo apt-get install pkg-config ccache clang ant python3-pip swig git file wget unzip tar bzip2 gzip patch autoconf-archive autogen automake make cmake libtool bison flex perl nasm curl gfortran libasound2-dev freeglut3-dev libgtk2.0-dev libusb-dev zlib1g libffi-dev libbz2-dev zlib1g-dev

sudo apt install maven default-jdk

cd $HOME
mkdir bazel
cd bazel
wget https://github.com/bazelbuild/bazel/releases/download/6.5.0/bazel-6.5.0-linux-arm64
mv bazel-6.5.0-linux-arm64 bazel
chmod a+x bazel
export PATH=/home/ubuntu/bazel/:$PATH

# Build and install javacpp-presets.
# Clone the following forked repo to exclude the libraries that are not supported and not required
git clone https://github.com/snadampal/javacpp-presets.git
cd javacpp-presets
git checkout tfjava_aarch64
mvn install -Djavacpp.platform=linux-arm64 -Dmaven.javadoc.skip=true -X -T 16

# Build and install tensorflow java bindings
git clone https://github.com/tensorflow/java.git
cd java
git checkout v1.0.0-rc.1
mvn install -P native-build -Dbazel.build.flags='--verbose_failures -s --config=mkl_aarch64_threadpool' -X

Any other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

@snadampal changed the title from "aarch64 linux: TF java source builds are failing to find the native headers" to "aarch64 and x86_64 linux: TF java source builds are failing to find the native headers" Jun 14, 2024
@snadampal changed the title from "aarch64 and x86_64 linux: TF java source builds are failing to find the native headers" to "arm64 and x86_64 linux: TF java source builds are failing to find the native headers" Jun 14, 2024
@snadampal
Contributor Author

The issue is not specific to arm64. I see the same missing-headers failure on other platforms; I have reproduced it on linux-x86_64 as well, with Ubuntu 22.04. From the code, it looks like it happens on every platform.
I have root-caused the issue: the dist_download step is skipped for the native build, but dist_download is the step that sets up all the native headers required by the JavaCPP build. The non-native build works because the dist_download step runs there.


            <!--
              Download TensorFlow native libraries
                This will download the official Python distribution for the active platform, and extract the `tensorflow_cc` library
                from it so that we can generate the JavaCPP API bindings and distribute it as a JAR. This will be executed only
                when not building a full native build.
            -->
            <id>dist-download</id>
            <phase>initialize</phase>
            <goals>
              <goal>exec</goal>
            </goals>
            <configuration>
              <skip>${dist.download.skip}</skip> <!-- skipped when full native build is enabled -->
              <executable>bash</executable>
              <arguments>
                <argument>scripts/dist_download.sh</argument>
                <argument>${dist.download.folder}</argument>
              </arguments>
              <environmentVariables>
                <PLATFORM>${native.classifier}</PLATFORM>
              </environmentVariables>
              <workingDirectory>${project.basedir}</workingDirectory>
            </configuration>
          </execution>
        </executions>
      </plugin>
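One way to confirm from a captured build log whether the dist-download execution ran or was skipped is sketched below. The function name `dist_download_status` is hypothetical, and the log patterns are assumptions: it expects Maven's usual `--- plugin:goal (execution-id) @ module ---` banner and a "skipping execute" message from the exec plugin when `<skip>` is true, both of which may differ between plugin versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify the dist-download execution in a Maven log.
# Prints "skipped", "ran", or "absent".
# Assumption: the exec plugin logs a line containing "skipping execute"
# when <skip> evaluates to true; adjust the pattern if your version differs.
dist_download_status() {
  awk '
    /\(dist-download\)/      { seen = 1; next }
    seen && /^\[INFO\] ---/  { exit }   # banner of the next plugin execution
    seen && /skipping execute/ { skipped = 1; exit }
    END {
      if (!seen)        print "absent"
      else if (skipped) print "skipped"
      else              print "ran"
    }
  ' "$1"
}
```

For example, after saving the output of the `mvn install -P native-build ...` run to `build.log`, calling `dist_download_status build.log` should report which case applies, provided the log format matches the assumptions above.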

The backtrace:

[INFO] g++ -I/home/ubuntu/java/tensorflow-core/tensorflow-core-native/src/main/native/org/tensorflow/internal/c_api -I/home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/external/org_tensorflow -I/home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/external/org_tensorflow/third_party/xla/third_party/tsl -I/home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/execroot/tensorflow_java/bazel-out/k8-opt/bin/external/org_tensorflow -I/home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/external/com_google_protobuf/src -I/usr/lib/jvm/java-11-openjdk-amd64/include -I/usr/lib/jvm/java-11-openjdk-amd64/include/linux /home/ubuntu/java/tensorflow-core/tensorflow-core-native/target/native/org/tensorflow/internal/c_api/linux-x86_64/jnitensorflow.cpp /home/ubuntu/java/tensorflow-core/tensorflow-core-native/target/native/org/tensorflow/internal/c_api/linux-x86_64/jnijavacpp.cpp -march=x86-64 -m64 -O3 -s -std=c++17 -Wl,-rpath,$ORIGIN/ -Wl,-z,noexecstack -Wl,-Bsymbolic -Wall -fPIC -pthread -shared -o libjnitensorflow.so -L/home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/execroot/tensorflow_java/bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow -Wl,-rpath,/home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/execroot/tensorflow_java/bazel-out/k8-opt/bin/external/org_tensorflow/tensorflow -ltensorflow_framework -ltensorflow_cc 
In file included from /home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/external/org_tensorflow/third_party/xla/third_party/tsl/tsl/c/tsl_status_internal.h:19,
                 from /home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/external/org_tensorflow/tensorflow/c/tf_status_internal.h:19,
                 from /home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/external/org_tensorflow/tensorflow/c/c_api_internal.h:32,
                 from /home/ubuntu/java/tensorflow-core/tensorflow-core-native/src/main/native/org/tensorflow/internal/c_api/tfj_graph_impl.cc:18,
                 from /home/ubuntu/java/tensorflow-core/tensorflow-core-native/src/main/native/org/tensorflow/internal/c_api/tfj_graph.h:31,
                 from /home/ubuntu/java/tensorflow-core/tensorflow-core-native/target/native/org/tensorflow/internal/c_api/linux-x86_64/jnitensorflow.cpp:115:
/home/ubuntu/.cache/bazel/_bazel_ubuntu/255b14aaecc232d3c121b5bd17b6e1a3/external/org_tensorflow/third_party/xla/third_party/tsl/tsl/platform/status.h:28:10: fatal error: absl/base/attributes.h: No such file or directory
   28 | #include "absl/base/attributes.h"

@snadampal changed the title from "arm64 and x86_64 linux: TF java source builds are failing to find the native headers" to "arm64 and x86_64 linux: TF java full native builds are failing to find the native headers" Jun 14, 2024
@Craigacp
Collaborator

We modified where it's looking for the headers just before the rc1 release to fix this kind of issue. I tested it on macOS, and I thought I had tested it on a few Linuxes as well. I'll rerun the Linux build to see what's going on.

@Craigacp
Collaborator

So it looks like the problem is that we used to get the absl headers from Bazel, but something has changed in the TF build process so it's not putting the absl repo in the bazel-tensorflow-core-native folder like it used to. We'd missed this because the clean is inconsistent between bazel & non-bazel builds.

@snadampal
Contributor Author

Hi @Craigacp, it's not just absl; several other packages are missing too, such as Eigen, ml_dtypes, and protobuf.
They exist in the repo, but the workspaces are not cloned.

@Craigacp
Collaborator

I can replicate this, but we couldn't replicate it on Karl's machine, even after a clean of bazel. Both machines are running macOS 14.5 with the latest Xcode and the same version of bazel, so I'm pretty confused as to what's causing the issue.

@snadampal
Contributor Author

I'm curious where the working case is getting all the absl/Eigen/ml_dtypes headers from. Checking the include paths used to compile libjnitensorflow might give a clue.
By the way, it fails consistently on Linux.
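A minimal sketch of that include-path check follows. Both function names are hypothetical helpers, not part of the TF Java build. The idea is to pull the `-I` directories out of the logged g++ command and then ask which of them, if any, provides a header such as `absl/base/attributes.h`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first include directory that provides a header.
# Usage: find_header <header-relative-path> <include-dir>...
# Returns non-zero and reports on stderr when no directory provides it.
find_header() {
  local header="$1"
  shift
  local dir
  for dir in "$@"; do
    if [ -f "$dir/$header" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  printf 'MISSING: %s\n' "$header" >&2
  return 1
}

# Hypothetical helper: extract the -I directories from a compiler command line,
# one per line. Assumes no include path contains whitespace.
include_dirs_from_command() {
  printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^-I//p'
}
```

Running `find_header absl/base/attributes.h $(include_dirs_from_command "$cmd")` with `cmd` set to the g++ line from a working build and then from a failing build should show whether the two builds differ in which directory supplies the header.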

@Craigacp
Collaborator

Craigacp commented Jun 14, 2024

No, in some cases the external folder in bazel-tensorflow-core-native contains extra folders linking to the dependencies we need headers for, and we add those to the include path in the pom. I'm not sure why Bazel only puts them there some of the time. We haven't yet ruled out some leftover state on the machine where it works.
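To compare a working machine with a failing one, the presence of those dependency folders could be checked with something like the sketch below. The function name `missing_externals` is hypothetical. The repository folder names in the usage note are also assumptions: `org_tensorflow` and `com_google_protobuf` appear in the compiler log in this thread, while `com_google_absl`, `eigen_archive`, and `ml_dtypes` are guesses at the Bazel workspace names and should be checked against the actual build.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list expected repository folders that are absent
# from a Bazel "external" directory. Symlinks are followed, so a dangling
# link counts as missing.
# Usage: missing_externals <external-dir> <repo-name>...
# Returns 0 when everything is present, 1 when at least one is missing.
missing_externals() {
  local root="$1"
  shift
  local repo
  local missing=0
  for repo in "$@"; do
    if [ ! -d "$root/$repo" ]; then
      printf '%s\n' "$repo"
      missing=1
    fi
  done
  return "$missing"
}
```

A call such as `missing_externals bazel-tensorflow-core-native/external org_tensorflow com_google_protobuf com_google_absl eigen_archive ml_dtypes`, run from the module directory after the Bazel step, would print the folders that Bazel did not link on that machine.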

@crutcher

I see the same problem on Ubuntu 24.04 with rc2:

mvn -T 8 clean install -Djavacpp.platform.extension=-gpu -Dnative.build
