Apache Arrow Installation Guide

Based on the Apache Arrow 9.0 documentation.

C++

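Building Arrow C++

System setup

Arrow uses CMake as its build configuration tool. The official documentation recommends out-of-source builds.

  • In-source build: cmake is invoked directly from the cpp directory. This is inflexible when you want to maintain several builds (for example, one for debugging and one for release).
  • Out-of-source build: cmake is invoked from another directory, which creates a separate build environment decoupled from any other. For example, create cpp/build-debug and invoke cmake $CMAKE_ARGS … from that directory.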
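
As a rough sketch of the out-of-source layout described above, the snippet below prints the commands for keeping one build directory per configuration under cpp/. The helper name plan_builds and the build-<type> directory naming are illustrative assumptions; only cpp/build-debug appears in the text above.

```shell
# plan_builds TYPE...: print (not run) one out-of-source configure command
# per build type. Hypothetical helper, not part of Arrow or CMake.
plan_builds() {
  for build_type in "$@"; do
    # Directory name is the lower-cased build type, e.g. build-debug.
    dir="build-$(printf '%s' "$build_type" | tr 'A-Z' 'a-z')"
    printf 'mkdir -p %s && (cd %s && cmake -DCMAKE_BUILD_TYPE=%s ..)\n' \
      "$dir" "$dir" "$build_type"
  done
}

# Intended to be run from inside the cpp subdirectory.
plan_builds Debug Release
```

Because each configuration lives in its own directory, a debug build and a release build can coexist without overwriting each other's caches or objects.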

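Build requirements

  • A C++11-capable compiler. On Linux, GCC 4.8 or later; on Windows, Visual Studio 2017 or later.
  • CMake 3.5 or later
  • On Linux and macOS, either the make or ninja build tool
  • At least 1 GB of RAM for a minimal build, at least 4 GB for a debug build, and at least 8 GB when building with Docker.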
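
The version requirements above can be checked before starting a build. The following is a minimal sketch, not an official Arrow tool: version_ge and check_tool are hypothetical helpers, and the parsing assumes that the first line printed by `<tool> --version` contains a dotted version number.

```shell
# version_ge A B: succeed when dotted numeric version A is >= version B.
version_ge() {
  va=$1; vb=$2
  while [ -n "$va" ] || [ -n "$vb" ]; do
    fa=${va%%.*}; fb=${vb%%.*}          # leading field of each version
    if [ "$va" = "$fa" ]; then va=""; else va=${va#*.}; fi
    if [ "$vb" = "$fb" ]; then vb=""; else vb=${vb#*.}; fi
    fa=${fa:-0}; fb=${fb:-0}            # missing fields count as 0
    if [ "$fa" -gt "$fb" ]; then return 0; fi
    if [ "$fa" -lt "$fb" ]; then return 1; fi
  done
  return 0
}

# check_tool NAME MIN: report whether NAME is installed and at least version MIN.
check_tool() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "$1: not found (need >= $2)"; return 1
  fi
  found=$("$1" --version 2>/dev/null | head -n 1 \
    | grep -Eo '[0-9]+(\.[0-9]+)+' | head -n 1)
  if [ -n "$found" ] && version_ge "$found" "$2"; then
    echo "$1 $found: ok (need >= $2)"
  else
    echo "$1 ${found:-unknown}: too old or unparsable (need >= $2)"; return 1
  fi
}

check_tool cmake 3.5 || true
check_tool gcc 4.8 || true
```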

On Ubuntu/Debian, these can be installed with the following command:

sudo apt-get install \
     build-essential \
     cmake

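Building

Clone the latest code from the repository and change into the cpp subdirectory.

Note: the checkout path must not contain Chinese characters, otherwise the build may fail. libssl-dev also needs to be installed, otherwise the build may fail to find the library.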

git clone https://github.com/apache/arrow.git
cd arrow/cpp

CMake 3.21 and above

With CMake 3.21.0 or later, presets for various build configurations are available. Use cmake --list-presets to list them.

cmake --list-presets   # from inside the `cpp` subdirectory

Use cmake -N --preset <preset name> to inspect the specific options that a given preset enables.

cmake -N --preset ninja-debug-minimal
Preset CMake variables:

  ARROW_BUILD_INTEGRATION="OFF"
  ARROW_BUILD_STATIC="OFF"
  ARROW_BUILD_TESTS="OFF"
  ARROW_EXTRA_ERROR_CONTEXT="ON"
  ARROW_WITH_RE2="OFF"
  ARROW_WITH_UTF8PROC="OFF"
  CMAKE_BUILD_TYPE="Debug"
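
When only one option matters, the listing above can be filtered rather than read by eye. The sketch below assumes the output keeps the NAME="VALUE" layout shown above; preset_value is a hypothetical helper, not a CMake feature.

```shell
# preset_value NAME: read `cmake -N --preset <preset>` output on stdin and
# print the value assigned to NAME (prints nothing if NAME is absent).
preset_value() {
  sed -n 's/^[[:space:]]*'"$1"'="\(.*\)"[[:space:]]*$/\1/p'
}

# Demonstration on a copy of the sample output; in practice pipe in
#   cmake -N --preset ninja-debug-minimal
printf '%s\n' \
  'Preset CMake variables:' \
  '' \
  '  ARROW_BUILD_TESTS="OFF"' \
  '  CMAKE_BUILD_TYPE="Debug"' \
  | preset_value CMAKE_BUILD_TYPE
```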

You can then configure the build with one of the supported presets:

mkdir build   # from inside the `cpp` subdirectory
cd build
cmake .. --preset ninja-debug-minimal
   Preset CMake variables:

     ARROW_BUILD_INTEGRATION="OFF"
     ARROW_BUILD_STATIC="OFF"
     ARROW_BUILD_TESTS="OFF"
     ARROW_EXTRA_ERROR_CONTEXT="ON"
     ARROW_WITH_RE2="OFF"
     ARROW_WITH_UTF8PROC="OFF"
     CMAKE_BUILD_TYPE="Debug"

   -- Building using CMake version: 3.21.3
   [ etc. ]

Then build:

cmake --build .
[142/142] Creating library symlink debug/libarrow.so.700 debug/libarrow.so

tree debug/
debug/
├── libarrow.so -> libarrow.so.700
├── libarrow.so.700 -> libarrow.so.700.0.0
└── libarrow.so.700.0.0

0 directories, 3 files

CMake below 3.21

Without presets to help, the options have to be written out by hand. The build defaults to CMAKE_BUILD_TYPE=release.

Minimal release build

mkdir build-release
cd build-release
cmake ..
make -j8       # if you have 8 CPU cores, otherwise adjust
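
Rather than hard-coding -j8, the job count can be derived from the machine. This is a small sketch under the assumption that either nproc (GNU coreutils) or getconf _NPROCESSORS_ONLN is available; cpu_count is a hypothetical helper name.

```shell
# cpu_count: print the number of online CPU cores, falling back to 1.
cpu_count() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  elif count=$(getconf _NPROCESSORS_ONLN 2>/dev/null) && [ -n "$count" ]; then
    echo "$count"
  else
    echo 1
  fi
}

# Print the make invocation that matches this machine.
echo "make -j$(cpu_count)"
```

Given the memory requirements listed earlier, a lower job count may be preferable on machines with many cores but little RAM.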

Minimal debug build (with unit tests)

git submodule update --init --recursive
export ARROW_TEST_DATA=$PWD/../testing/data
mkdir build-debug
cd build-debug
cmake -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON ..
make -j8       # if you have 8 CPU cores, otherwise adjust
make unittest  # to run the tests

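Unit tests are not built by default, so they have to be enabled explicitly with -DARROW_BUILD_TESTS=ON. After building, the tests can also be run with the ctest tool that ships with CMake.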
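
As an illustration of running the tests through ctest, the sketch below wraps the call in a helper that first checks whether the build directory was configured with tests. run_tests is a hypothetical name, and the check relies on the assumption that CMake writes a CTestTestfile.cmake into the build directory when tests are enabled.

```shell
# run_tests BUILD_DIR [REGEX]: run ctest in BUILD_DIR, optionally limited to
# tests whose names match REGEX. Returns 2 if the directory has no tests.
run_tests() {
  build_dir=$1
  pattern=${2:-}
  if [ ! -f "$build_dir/CTestTestfile.cmake" ]; then
    echo "no CTest metadata in $build_dir; configure with -DARROW_BUILD_TESTS=ON" >&2
    return 2
  fi
  (
    cd "$build_dir" || exit 1
    if [ -n "$pattern" ]; then
      ctest --output-on-failure -R "$pattern"
    else
      ctest --output-on-failure
    fi
  )
}

# Intended to be run from inside the cpp subdirectory.
run_tests build-debug || echo "tests did not run or reported failures"
```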

On some Linux systems, language or encoding settings can cause locale-related errors. Setting the following environment variable may resolve them.

export LC_ALL="en_US.UTF-8"

Optional components

The default build is minimal, so additional Arrow features have to be enabled explicitly through configuration options:

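  • -DARROW_BUILD_UTILITIES=ON: Build the Arrow command-line utilities
  • -DARROW_COMPUTE=ON: Computational kernel functions and other support
  • -DARROW_CSV=ON: CSV reader module
  • -DARROW_CUDA=ON: CUDA support; the CUDA toolchain is located through the $CUDA_HOME environment variable
  • -DARROW_DATASET=ON: Dataset API; enabling it also enables the Filesystem API
  • -DARROW_FILESYSTEM=ON: Filesystem API for local and remote filesystems
  • -DARROW_FLIGHT=ON: Arrow Flight RPC support, based on gRPC
  • -DARROW_GANDIVA=ON: Gandiva expression compiler, based on LLVM, Protocol Buffers, and re2
  • -DARROW_GANDIVA_JAVA=ON: JNI bindings for Gandiva Java
  • -DARROW_GCS=ON: Google Cloud Storage (GCS) support
  • -DARROW_HDFS=ON: Support for the Hadoop filesystem (HDFS)
  • -DARROW_JEMALLOC=ON: Build the Arrow jemalloc-based allocator, on by default
  • -DARROW_JSON=ON: JSON reader module
  • -DARROW_MIMALLOC=ON: Build the Arrow mimalloc-based allocator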
  • -DARROW_ORC=ON: Arrow integration with Apache ORC
  • -DARROW_PARQUET=ON: Apache Parquet libraries and Arrow integration
  • -DPARQUET_REQUIRE_ENCRYPTION=ON: Parquet Modular Encryption
  • -DARROW_PLASMA=ON: Plasma Shared Memory Object Store
  • -DARROW_PLASMA_JAVA_CLIENT=ON: Build Java client for Plasma
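  • -DARROW_PYTHON=ON: Python support (required when building pyarrow). The library must be built against the same Python version as the pyarrow you intend to build. NumPy must be installed. Enabling this option also enables ARROW_COMPUTE, ARROW_CSV, ARROW_DATASET, ARROW_FILESYSTEM, ARROW_HDFS, and ARROW_JSON.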
  • -DARROW_S3=ON: Support for Amazon S3-compatible filesystems
  • -DARROW_WITH_RE2=ON: Build with support for regular expressions using the re2 library, on by default and used when ARROW_COMPUTE or ARROW_GANDIVA is ON
  • -DARROW_WITH_UTF8PROC=ON: Build with support for Unicode properties using the utf8proc library, on by default and used when ARROW_COMPUTE or ARROW_GANDIVA is ON
  • -DARROW_TENSORFLOW=ON: Build Arrow with TensorFlow support enabled
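
To show how several of these options combine on one command line, here is a small sketch that turns component names into -DARROW_<NAME>=ON flags. arrow_flags is a hypothetical helper; it only covers options that follow the ARROW_ prefix pattern, so PARQUET_REQUIRE_ENCRYPTION would have to be passed separately.

```shell
# arrow_flags COMPONENT...: print one -DARROW_<COMPONENT>=ON flag per argument.
arrow_flags() {
  flags=""
  for component in "$@"; do
    name=$(printf '%s' "$component" | tr 'a-z' 'A-Z')
    flags="${flags}${flags:+ }-DARROW_${name}=ON"
  done
  printf '%s\n' "$flags"
}

# Print a configure command for a build with CSV, Parquet, and Dataset support.
echo "cmake $(arrow_flags csv parquet dataset) .."
```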

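The full list is too long to reproduce here; see https://arrow.apache.org/docs/developers/cpp/building.html#optional-components for details. Only the options I considered potentially useful are covered above.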

Installation

sudo make install

To remove the installed files again, run:

xargs rm < install_manifest.txt

After installing, the following command can be used to verify the installation:

pkg-config --cflags --libs arrow
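
A slightly stronger check is to compile and run a small program against the installed library. The sketch below is one possible way to do that, not an official procedure: verify_arrow is a hypothetical helper, the C++17 default is an assumption (override it with CXX_STD), and the program relies on arrow::GetBuildInfo() from arrow/config.h.

```shell
# verify_arrow: compile and run a tiny program against the installed Arrow.
# Returns 1 if pkg-config cannot find Arrow or the program fails to build or run.
verify_arrow() {
  if ! command -v pkg-config >/dev/null 2>&1 || ! pkg-config --exists arrow; then
    echo "arrow not found via pkg-config" >&2
    return 1
  fi
  workdir=$(mktemp -d) || return 1
  cat > "$workdir/check.cc" <<'EOF'
#include <arrow/config.h>
#include <iostream>

int main() {
  // Print the version of the Arrow library that the program linked against.
  std::cout << "Arrow " << arrow::GetBuildInfo().version_string << std::endl;
  return 0;
}
EOF
  # The pkg-config output is left unquoted on purpose so the flags split into words.
  "${CXX:-c++}" -std="${CXX_STD:-c++17}" "$workdir/check.cc" \
    $(pkg-config --cflags --libs arrow) -o "$workdir/check" \
    && "$workdir/check"
  status=$?
  rm -rf "$workdir"
  [ "$status" -eq 0 ] || return 1
}

verify_arrow || echo "Arrow installation could not be verified"
```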

Installing with a package manager

Debian GNU/Linux and Ubuntu

sudo apt update
sudo apt install -y -V ca-certificates lsb-release wget
wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt update
sudo apt install -y -V \
     libarrow-dev libarrow-glib-dev \
     libarrow-dataset-dev libarrow-dataset-glib-dev \
     libarrow-flight-dev libarrow-flight-glib-dev \
     libplasma-dev libplasma-glib-dev \
     libgandiva-dev libgandiva-glib-dev \
     libparquet-dev libparquet-glib-dev
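
The wget line above builds its URL from the distribution ID and codename reported by lsb_release. The sketch below makes that composition explicit; apt_source_url is a hypothetical helper and jammy is only an example codename. It does not check that a package exists for a given release.

```shell
# apt_source_url DISTRO_ID CODENAME: print the apt-source package URL that the
# wget command above would request for that distribution.
apt_source_url() {
  distro=$(printf '%s' "$1" | tr 'A-Z' 'a-z')
  printf 'https://apache.jfrog.io/artifactory/arrow/%s/apache-arrow-apt-source-latest-%s.deb\n' \
    "$distro" "$2"
}

apt_source_url Ubuntu jammy
```

Printing the URL first can help when diagnosing a failed download on a release that has no published package.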

AlmaLinux 8 and Red Hat Enterprise Linux 8

sudo dnf install -y epel-release || sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-$(cut -d: -f5 /etc/system-release-cpe | cut -d. -f1).noarch.rpm
sudo dnf install -y https://apache.jfrog.io/artifactory/arrow/almalinux/$(cut -d: -f5 /etc/system-release-cpe | cut -d. -f1)/apache-arrow-release-latest.rpm
sudo dnf config-manager --set-enabled epel || :
sudo dnf config-manager --set-enabled powertools || :
sudo dnf config-manager --set-enabled codeready-builder-for-rhel-$(cut -d: -f5 /etc/system-release-cpe | cut -d. -f1)-rhui-rpms || :
sudo subscription-manager repos --enable codeready-builder-for-rhel-$(cut -d: -f5 /etc/system-release-cpe | cut -d. -f1)-$(arch)-rpms || :
sudo dnf install -y arrow-devel # For C++
sudo dnf install -y arrow-glib-devel # For GLib (C)
sudo dnf install -y arrow-dataset-devel # For Apache Arrow Dataset C++
sudo dnf install -y arrow-dataset-glib-devel # For Apache Arrow Dataset GLib (C)
sudo dnf install -y arrow-flight-devel # For Apache Arrow Flight C++
sudo dnf install -y arrow-flight-glib-devel # For Apache Arrow Flight GLib (C)
sudo dnf install -y gandiva-devel # For Apache Gandiva C++
sudo dnf install -y gandiva-glib-devel # For Apache Gandiva GLib (C)
sudo dnf install -y parquet-devel # For Apache Parquet C++
sudo dnf install -y parquet-glib-devel # For Apache Parquet GLib (C)
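
Several commands above use `cut -d: -f5 /etc/system-release-cpe | cut -d. -f1` to obtain the major OS version. The sketch below applies the same pipeline to a string so its effect is easier to see; major_from_cpe is a hypothetical helper and the CPE strings are illustrative samples, not copied from a real system.

```shell
# major_from_cpe CPE: take the fifth colon-separated field of a CPE string
# and keep only the part before the first dot (the major version).
major_from_cpe() {
  printf '%s\n' "$1" | cut -d: -f5 | cut -d. -f1
}

major_from_cpe 'cpe:/o:almalinux:almalinux:8.6:GA'
major_from_cpe 'cpe:/o:centos:centos:7'
```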

CentOS 8

sudo dnf install -y epel-release
sudo dnf install -y https://apache.jfrog.io/artifactory/arrow/centos/$(cut -d: -f5 /etc/system-release-cpe | cut -d. -f1)-stream/apache-arrow-release-latest.rpm
sudo dnf config-manager --set-enabled epel
sudo dnf config-manager --set-enabled powertools
sudo dnf install -y arrow-devel # For C++
sudo dnf install -y arrow-glib-devel # For GLib (C)
sudo dnf install -y arrow-dataset-devel # For Apache Arrow Dataset C++
sudo dnf install -y arrow-dataset-glib-devel # For Apache Arrow Dataset GLib (C)
sudo dnf install -y arrow-flight-devel # For Apache Arrow Flight C++
sudo dnf install -y arrow-flight-glib-devel # For Apache Arrow Flight GLib (C)
sudo dnf install -y gandiva-devel # For Apache Gandiva C++
sudo dnf install -y gandiva-glib-devel # For Apache Gandiva GLib (C)
sudo dnf install -y parquet-devel # For Apache Parquet C++
sudo dnf install -y parquet-glib-devel # For Apache Parquet GLib (C)

CentOS 7

sudo yum install -y epel-release || sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-$(cut -d: -f5 /etc/system-release-cpe | cut -d. -f1).noarch.rpm
sudo yum install -y https://apache.jfrog.io/artifactory/arrow/centos/$(cut -d: -f5 /etc/system-release-cpe | cut -d. -f1)/apache-arrow-release-latest.rpm
sudo yum install -y --enablerepo=epel arrow-devel # For C++
sudo yum install -y --enablerepo=epel arrow-glib-devel # For GLib (C)
sudo yum install -y --enablerepo=epel arrow-dataset-devel # For Apache Arrow Dataset C++
sudo yum install -y --enablerepo=epel arrow-dataset-glib-devel # For Apache Arrow Dataset GLib (C)
sudo yum install -y --enablerepo=epel parquet-devel # For Apache Parquet C++
sudo yum install -y --enablerepo=epel parquet-glib-devel # For Apache Parquet GLib (C)

Amazon Linux 2

sudo amazon-linux-extras install -y epel
sudo yum install -y https://apache.jfrog.io/artifactory/arrow/amazon-linux/2/apache-arrow-release-latest.rpm
sudo yum install -y --enablerepo=epel arrow-devel # For C++
sudo yum install -y --enablerepo=epel arrow-glib-devel # For GLib (C)
sudo yum install -y --enablerepo=epel arrow-dataset-devel # For Apache Arrow Dataset C++
sudo yum install -y --enablerepo=epel arrow-dataset-glib-devel # For Apache Arrow Dataset GLib (C)
sudo yum install -y --enablerepo=epel parquet-devel # For Apache Parquet C++
sudo yum install -y --enablerepo=epel parquet-glib-devel # For Apache Parquet GLib (C)

Python

To be added.