Speeding Up Machine Learning with JAX-like functorch – part 1

I tried out functorch, a feature added as a beta in PyTorch 1.11. An experimental version had shipped inside PyTorch itself since around PyTorch 1.9, but it now appears to be split out and distributed as a separate module named functorch.

pytorch/functorch: functorch is JAX-like composable function transforms for PyTorch.

What is functorch?

The official PyTorch site describes it as follows.

functorch is a library that adds composable function transforms to PyTorch. It aims to provide composable vmap (vectorization) and autodiff transforms that work with PyTorch modules and PyTorch autograd with good eager-mode performance.

computing per-sample-gradients (or other per-sample quantities)
running ensembles of models on a single machine
efficiently batching together tasks in the inner-loop of MAML
efficiently computing Jacobians and Hessians
efficiently computing batched Jacobians and Hessians

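It lets you efficiently run computations that were cumbersome to write in plain PyTorch, such as per-sample gradients and model ensembles. Automatic differentiation and vectorization appear to be the headline features.

functorch is inspired by Google JAX, and its API looks much the same as JAX's. Of course, data is handled as torch.Tensor rather than JAX's (jax.numpy) DeviceArray, so it is tailored to PyTorch.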

google / jax – GitHub

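I will start by running the samples from the official site and then write some more applied code, but a single entry cannot cover every functorch feature, so I will try things out little by little as far as I understand them. This entry is the preparatory installment (part 1) covering the basic features. Non-functional aspects such as processing speed will be covered in later entries.

Installation

If you already have a reasonably complete Python development environment, installation is easy. It also runs on the free tier of Google Colab, so use Colab if you cannot set up an environment.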

# For CUDA 10.2
# >>> pip install torch==1.11.0+cu102 torchvision==0.12.0+cu102 torchaudio==0.11.0+cu102 -f https://download.pytorch.org/whl/cu102/torch_stable.html
# For CUDA 11.3
# >>> pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
# For CPU-only build
# >>> pip install torch==1.11.0+cpu torchvision==0.12.0+cpu torchaudio==0.11.0+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html

# Here we install the CPU builds of PyTorch and related modules on the free tier of Colab
pip install torch==1.11.0+cpu torchvision==0.12.0+cpu torchaudio==0.11.0+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html

functorch is a separate module, so install it separately with pip.

pip install functorch

Verify that the installation works.

import torch
from functorch import vmap, grad

x = torch.randn(3)
y = vmap(torch.sin)(x)
assert torch.allclose(y, x.sin())

x = torch.randn([])
y = grad(torch.sin)(x)
assert torch.allclose(y, x.cos())

# OK if no assertion error is raised

The sample code above is explained later; for now it is enough that vmap and grad can be imported.

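Usage

The functorch documentation frequently uses the phrase "composable function transforms"; in code, this means creating and working with higher-order functions. From here on, let's switch to a functional mindset.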
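To make the idea of a function transform concrete before touching functorch, here is a minimal sketch in plain Python. The helper numerical_grad is a hypothetical name of my own, not a functorch API; it takes a function and returns a new function that approximates the derivative with a central difference.

```python
def numerical_grad(f, eps=1e-6):
    """Hypothetical helper (not a functorch API): return a new function
    that approximates df/dx with a central difference."""
    def df(x):
        return (f(x + eps) - f(x - eps)) / (2 * eps)
    return df


def cube(x):
    return x ** 3


# A transform takes a function and returns a function ...
d_cube = numerical_grad(cube)
# ... so transforms compose: applying it twice approximates the second derivative.
d2_cube = numerical_grad(numerical_grad(cube, eps=1e-4), eps=1e-4)

print(d_cube(2.0))   # close to 3 * 2**2 = 12
print(d2_cube(2.0))  # close to 6 * 2 = 12
```

functorch's grad and vmap have the same shape (function in, function out), except that they work exactly on torch.Tensor computations rather than by numerical approximation.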

grad (gradient computation)

functorch has auto-differentiation transforms (grad(f) returns a function that computes the gradient of f)

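functorch.grad is the easier one to understand, so let's try it first. grad is an API that automatically differentiates a function. As the official description at the beginning says, it can be used to compute per-sample gradients, and combining it with functorch.vmap (described later) can improve the performance of machine learning workloads.

As a refresher, when differentiating automatically with plain PyTorch, you obtain the derivative by reading the grad attribute of a torch.Tensor after calling backward().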

def func(x):
    ''' f(x) = x^3 - 3x^2 + 9x + 2'''
    return x**3 - 3 * x**2 + 9*x + 2  # derivative: f'(x) = 3x^2 - 6x + 9

def dx_func(x):
    x = torch.tensor(x, requires_grad=True)
    y = func(x)
    y.backward()
    return x.grad

x = 2.0
print("f(x) =", func(x))  # f(x) = 16.0
print("df(x)/dx =", dx_func(x))  # df(x)/dx = tensor(9.)

The dx_func function above can be written with grad as follows. grad creates and returns a new function, so you can apply it immediately.

# Define and apply dx_func with functorch.grad
grad(func)(torch.tensor(x))  # tensor(9.)

# Of course, the two steps can also be written separately
grad_func = grad(func)
grad_func(torch.tensor(x))

# Using a lambda to fold everything into a single expression
grad(lambda x: x**3 - 3 * x**2 + 9*x + 2)(torch.tensor(x))

The example above was just playing around, but this IIFE (Immediately Invoked Function Expression) style was common in older JavaScript. You may not see it often in Python, but let's get used to this functional-language feel.
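Because grad returns an ordinary function, it should also be possible to apply the transform to its own result. The following is a small sketch under that assumption, reusing the same polynomial to get the second derivative. The fallback import from torch.func is my own addition for newer PyTorch versions and is not something the setup above requires.

```python
import torch

try:
    from functorch import grad
except ImportError:  # assumption: newer PyTorch ships the same transform as torch.func.grad
    from torch.func import grad


def func(x):
    """f(x) = x^3 - 3x^2 + 9x + 2"""
    return x**3 - 3 * x**2 + 9 * x + 2


x = torch.tensor(2.0)
first = grad(func)(x)         # f'(x)  = 3x^2 - 6x + 9
second = grad(grad(func))(x)  # f''(x) = 6x - 6

print(first, second)
```

At x = 2 the analytic values are f'(2) = 9 and f''(2) = 6, which gives a quick way to sanity-check the output.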

Now, looking again at the sanity-check sample code from the installation section, it should be easy to understand.

x = torch.randn([])  # a random scalar (0-dim torch.Tensor)
y = grad(torch.sin)(x)  # build the derivative of sin (i.e. cos) and apply it immediately
assert torch.allclose(y, x.cos())

Incidentally, torch.allclose checks whether its first and second arguments are approximately equal; NumPy has a function of the same name.
* torch.allclose — PyTorch 1.11.0 documentation
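As a quick illustration of what "approximately equal" means here, the sketch below compares tensors that differ by small and large amounts. The tensor values are arbitrary choices of mine; the rtol and atol keyword arguments are part of torch.allclose.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = a + 1e-6  # differs only slightly
c = a + 1e-3  # differs noticeably

# allclose checks |input - other| <= atol + rtol * |other| element-wise
# (defaults: rtol=1e-05, atol=1e-08)
close_ab = torch.allclose(a, b)
close_ac = torch.allclose(a, c)
close_ac_loose = torch.allclose(a, c, atol=1e-2)  # loosen the absolute tolerance

print(close_ab, close_ac, close_ac_loose)
```

This is why the assertions in this entry use allclose instead of exact equality: floating-point results computed along different paths rarely match bit for bit.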

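Multivariable functions can of course be differentiated as well. The argnums parameter of grad specifies which argument to differentiate with respect to.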

# A function of three variables x, y, z
g = lambda x, y, z: torch.sqrt(x**2 + 2*y**2 + 3*z**2)
dgdx = grad(g, argnums=0)  # argnums selects x, i.e. the partial derivative dg/dx
x = y = z = torch.tensor(1.)
dgdx(x, y, z)  # x / sqrt(x**2 + 2*y**2 + 3*z**2) = tensor(0.4082)
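argnums also accepts a tuple of indices, in which case the returned function yields one gradient per selected argument. Here is a minimal sketch with the same function g; the fallback import from torch.func is my own assumption for newer PyTorch versions.

```python
import torch

try:
    from functorch import grad
except ImportError:  # assumption: newer PyTorch ships the same transform as torch.func.grad
    from torch.func import grad

g = lambda x, y, z: torch.sqrt(x**2 + 2 * y**2 + 3 * z**2)

x = torch.tensor(1.0)
y = torch.tensor(1.0)
z = torch.tensor(1.0)

# Differentiate with respect to all three arguments at once
dgdx, dgdy, dgdz = grad(g, argnums=(0, 1, 2))(x, y, z)

# Analytic partials: x / g, 2y / g, 3z / g with g = sqrt(6) at (1, 1, 1)
print(dgdx, dgdy, dgdz)
```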

vmap (auto-vectorization)

a vectorization/batching transform (vmap(f) returns a function that computes f over batches of inputs), and others

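functorch.vmap automatically vectorizes (auto-vectorization) a function without large code changes so that it processes a batch in parallel, which improves performance. In machine learning it is especially useful for batched training and inference. Internally, the core is written in C++, and it appears to make heavy use of ATen (the C++ tensor-operation module), which I have covered several times in past entries.

Incidentally, you can check how ATen parallelizes work in a given environment with the following function.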

import torch
print(torch.__config__.parallel_info()) 

ATen/Parallel:
	at::get_num_threads() : 1
	at::get_num_interop_threads() : 1
OpenMP 201511 (a.k.a. OpenMP 4.5)
	omp_get_max_threads() : 1
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
	mkl_get_max_threads() : 1
Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
std::thread::hardware_concurrency() : 2
Environment variables:
	OMP_NUM_THREADS : [not set]
	MKL_NUM_THREADS : [not set]
ATen parallel backend: OpenMP

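Since I am using the free tier of Google Colab here, I got the information above. OpenMP is the parallel backend. The machine reports 2 hardware threads (std::thread::hardware_concurrency()), although at::get_num_threads() is 1 in this output.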

Let's look at the following official sample code.

import torch
from functorch import vmap

batch_size, feature_size = 3, 5
weights = torch.randn(feature_size, requires_grad=True)

def model(feature_vec):
    # Very simple linear model with activation
    assert feature_vec.dim() == 1  # torch.Tensor.dot takes 1-D tensors (vectors)
    return feature_vec.dot(weights).relu()

examples = torch.randn(batch_size, feature_size)
result = vmap(model)(examples)

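It defines a simple linear model and uses vmap to apply the model to each sample. Deliberately writing this without vmap introduces a for loop, as shown below, and the code no longer feels functional. A list comprehension looks somewhat simpler, but it does not make the processing any more efficient.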

# A plain loop
outputs = []
for example in examples:
    outputs.append(model(example))
result = torch.stack(outputs)

# Using a list comprehension
result = torch.stack([model(example) for example in examples], dim=0)
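To convince myself that the vmap version and the loop version compute the same thing, a check along the following lines seems reasonable. It is a sketch, not part of the official sample: the seed, the tolerance, and the hand-batched comparison are my own additions, as is the fallback import from torch.func for newer PyTorch versions.

```python
import torch

try:
    from functorch import vmap
except ImportError:  # assumption: newer PyTorch ships the same transform as torch.func.vmap
    from torch.func import vmap

torch.manual_seed(0)
batch_size, feature_size = 3, 5
weights = torch.randn(feature_size)


def model(feature_vec):
    # Same simple linear model with activation; expects a single 1-D sample
    assert feature_vec.dim() == 1
    return feature_vec.dot(weights).relu()


examples = torch.randn(batch_size, feature_size)

batched = vmap(model)(examples)                                  # vectorized
looped = torch.stack([model(example) for example in examples])   # explicit loop
manual = (examples @ weights).relu()                             # hand-written batched form

print(batched.shape)  # one output per sample
```

The point of vmap is that model stays written for one sample, and the hand-written batched form does not have to be derived manually.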

The benefit may be hard to see so far, but combining vmap with grad as below lets you write common machine learning computations neatly, while vmap processes them efficiently in parallel under the hood.

import torch
from functorch import grad, vmap

batch_size, feature_size = 3, 5

def model(weights, feature_vec):
    # Very simple linear model with activation
    assert feature_vec.dim() == 1
    return feature_vec.dot(weights).relu()

def compute_loss(weights, example, target):
    y = model(weights, example)
    return ((y - target) ** 2).mean()  # MSELoss

weights = torch.randn(feature_size, requires_grad=True)
examples = torch.randn(batch_size, feature_size)
targets = torch.randn(batch_size)
inputs = (weights, examples, targets)
grad_weight_per_example = vmap(grad(compute_loss), in_dims=(None, 0, 0))(*inputs)

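compute_loss applies the model and computes the MSE loss. Passing it to grad produces function (1), which returns the gradient with respect to the first argument, weights. vmap then produces function (2), which applies function (1) to each sample, and function (2) is applied immediately to inputs. The in_dims parameter specifies which dimension to map over, so here it is given as a 3-element tuple matching the inputs. For an argument that should not be mapped, such as weights, specify None. In JAX the in_dims parameter has a different name (in_axes), but the functionality should be the same. functorch advertises a JAX-like API, but PyTorch commonly uses the word dim, so the name was presumably chosen to match.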
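One way to check what vmap(grad(...)) returns is to compare it with per-sample gradients computed one at a time with torch.autograd.grad. The following is a sketch under my own assumptions: per_sample_grads_loop is a hypothetical helper name, the seed and tolerance are arbitrary, and the fallback import from torch.func is for newer PyTorch versions.

```python
import torch

try:
    from functorch import grad, vmap
except ImportError:  # assumption: newer PyTorch ships the same transforms in torch.func
    from torch.func import grad, vmap

torch.manual_seed(0)
batch_size, feature_size = 3, 5


def model(weights, feature_vec):
    return feature_vec.dot(weights).relu()


def compute_loss(weights, example, target):
    y = model(weights, example)
    return ((y - target) ** 2).mean()


def per_sample_grads_loop(weights, examples, targets):
    """Hypothetical reference helper: one autograd call per sample."""
    grads = []
    for example, target in zip(examples, targets):
        w = weights.detach().clone().requires_grad_(True)
        loss = compute_loss(w, example, target)
        (g,) = torch.autograd.grad(loss, w)
        grads.append(g)
    return torch.stack(grads)


weights = torch.randn(feature_size)
examples = torch.randn(batch_size, feature_size)
targets = torch.randn(batch_size)

# weights is shared (None); examples and targets are mapped over dim 0
vectorized = vmap(grad(compute_loss), in_dims=(None, 0, 0))(weights, examples, targets)
reference = per_sample_grads_loop(weights, examples, targets)

print(vectorized.shape)  # one gradient of length feature_size per sample
```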

Where there is in_dims, there is also out_dims. Picture how the dimensions are mapped when you specify them.

x = torch.randn(2, 3, 5)

# When in_dims and out_dims are the same
result = vmap(lambda x: x*2, in_dims=1, out_dims=1)(x)
assert torch.allclose(result, x*2)

# When in_dims and out_dims differ
result = vmap(lambda x: x*2, in_dims=2, out_dims=1)(x)
assert result.shape == (2, 5, 3)
assert torch.allclose(result, (x * 2).transpose(1, 2))
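in_dims can also name a different dimension for each input. As a small sketch (the shapes and variable names are my own), the batch below is stored along rows in one tensor and along columns in the other, and vmap pairs them up for torch.dot. The fallback import from torch.func is an assumption for newer PyTorch versions.

```python
import torch

try:
    from functorch import vmap
except ImportError:  # assumption: newer PyTorch ships the same transform as torch.func.vmap
    from torch.func import vmap

torch.manual_seed(0)
a = torch.randn(4, 3)  # four 3-element vectors stored as rows
b = torch.randn(3, 4)  # four 3-element vectors stored as columns

# Map over dim 0 of a and dim 1 of b: dot(a[i, :], b[:, i]) for each i
out = vmap(torch.dot, in_dims=(0, 1))(a, b)

# Equivalent hand-written form
expected = (a * b.t()).sum(dim=1)
print(out.shape)
```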

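As the behavior of vmap makes obvious, pay attention to the shapes of the input tensors. Also, although you probably won't do it routinely, vmap can be nested as long as the shapes line up (that is, as long as they can be mapped).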

x = torch.randn(2)
y = torch.randn(3)
try:
    vmap(torch.mul)(x, y)  # mapped dimension sizes differ (2 vs 3)
except ValueError as e:
    print("ValueError:", e)
# ValueError: vmap: Expected all tensors to have the same size in the mapped dimension, got sizes [2, 3] for the mapped dimension

# Nested vmap
x = torch.randn(2, 3, 5)
y = torch.randn(2, 3, 5)
output = vmap(vmap(torch.mul))(x, y)
assert torch.allclose(output, x * y)

# Three levels of nesting
output = vmap(vmap(vmap(torch.mul)))(x, y)
assert torch.allclose(output, x * y)

# vmap inside a lambda expression
x = torch.randn(3)
y = torch.randn(5)
output = vmap(lambda x: vmap(lambda y: x)(y))(x)
assert torch.allclose(output, x.view(3, 1).expand(3, 5))
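A slightly more practical use of nesting, as far as I can tell, is to combine it with in_dims so that each level maps over a different input. The sketch below computes all pairwise dot products between two sets of vectors; the shapes and names are my own, and the fallback import from torch.func is an assumption for newer PyTorch versions.

```python
import torch

try:
    from functorch import vmap
except ImportError:  # assumption: newer PyTorch ships the same transform as torch.func.vmap
    from torch.func import vmap

torch.manual_seed(0)
A = torch.randn(3, 4)  # three 4-element vectors
B = torch.randn(5, 4)  # five 4-element vectors

# Inner vmap: hold the first argument fixed and map over the rows of B
dot_with_all_of_B = vmap(torch.dot, in_dims=(None, 0))
# Outer vmap: map over the rows of A and pass B through unmapped
pairwise = vmap(dot_with_all_of_B, in_dims=(0, None))(A, B)

print(pairwise.shape)  # (rows of A, rows of B)
```

The result matches the matrix product A @ B.t(), which is a handy reference when checking that the in_dims at each level are the intended ones.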

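The examples above are somewhat exaggerated, but a function defined with def just to be passed to vmap or grad stays in scope afterwards, so it seems better to get used to creating anonymous functions and applying them immediately.

Finally, let's recap using the parameters introduced so far and some simple input data.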

x = torch.linspace(0, 1, 3)  # tensor([0.0000, 0.5000, 1.0000])
y = torch.linspace(1, 2, 3)  # tensor([1.0000, 1.5000, 2.0000])
print(torch.dot(x,y))  # 0.0*1.0 + 0.5*1.5 + 1.0*2.0 = tensor(2.7500)
print(grad(torch.dot, argnums=0)(x, y))  # tensor([1.0000, 1.5000, 2.0000])

To process multiple input vectors in parallel with vmap, write the following:

X = torch.linspace(0, 1, 15).reshape(5, 3)  # five 3-dimensional vectors
print(X)
# tensor([[0.0000, 0.0714, 0.1429],
#         [0.2143, 0.2857, 0.3571],
#         [0.4286, 0.5000, 0.5714],
#         [0.6429, 0.7143, 0.7857],
#         [0.8571, 0.9286, 1.0000]])
# Process the five 3-dimensional vectors in parallel with vmap
result = vmap(grad(torch.dot, argnums=0), in_dims=(0, None), out_dims=0)(X, y)
print(result)
# tensor([[1.0000, 1.5000, 2.0000],
#         [1.0000, 1.5000, 2.0000],
#         [1.0000, 1.5000, 2.0000],
#         [1.0000, 1.5000, 2.0000],
#         [1.0000, 1.5000, 2.0000]])

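If the example above makes sense, you should now have a reasonable grasp of how to use grad and vmap. Once you have switched to a functional mindset it is surprisingly easy to follow, so I think it is simply a matter of getting used to it. For other fine-grained caveats and limitations when using grad and vmap, see the official page below. If I have time, I may pick out a few of them in the next entry.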

UX Limitations — functorch 0.1.1 documentation

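Conclusion

I had been interested in JAX for a while but had hardly used it; the arrival of functorch has raised my motivation to use it. functorch may also feel awkward at first, but once you get used to it you can write it fluently, and it is fun. This entry was a preparatory installment that mainly introduced grad and vmap, the basic features of functorch. The real value of functorch lies in speeding up machine learning workloads within PyTorch, so next time I would like to try more practical examples and other APIs.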

Also, functorch itself is still in beta, and its API may change in response to user feedback and the like. Please understand that the usage shown in this entry may stop working in the future.

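Incidentally, there is a discussion on Quora about why Google builds multiple frameworks such as Flax (JAX) and Trax even though it already has TensorFlow, and it is an interesting read. The explanation is simply that separate organizations within Google (including DeepMind) build frameworks bottom-up, which I think is common in huge organizations. Google has certainly built plenty of chat apps too, and will presumably invest in whichever ones survive.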

Why is Google developing multiple neural network libraries such as Flax, Trax, and Objax even though they already have TensorFlow? – Quora

References

Tutorial: Writing JAX-like code in PyTorch with functorch – Simone Scardapane
FUNCTORCH | RICHARD ZOU & HORACE HE – YouTube
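Scalable Machine Learning with JAX – ZOZO TECH BLOG
Comparing jax's autograd with pytorch's autograd, up to simple linear regression (speed comparison added) – HELLO CYBERNETICS
Having Fun with JAX and NumPyro through Machine Learning v0.1.0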
