Unravelling `tf.einsum`
Origin Story#
Recently, I was trying to dissect the original DCNN paper, which uses a diffusion kernel to exploit implicit graph structure in common tasks such as node, edge, and graph classification. However, an existing implementation I found had a curious piece of notation which led me down the rabbit hole of tensor calculus.
Coordinates are maps used to solve a given problem. A coordinate transform maps from one frame of reference to another (say, converting from a map of your high school, to the location of your high school within the city, to its position on a country-wide map).
To say a number is a scalar means that its value does not change when transformed from one coordinate system to another (e.g. the distance between two points on a flat plane is the same regardless of where true north is).
A vector is directional, and can be expressed in terms of the reference set of coordinates. For example, a vector between your home and the nearest fire station can be broken down into a sum of north- and east-facing vectors.
Tensors#
A tensor describes the superset of transformations which include scalars and vectors:
- $0$-tensors are constant functions, which we identify as scalars
- $1$-tensors are linear functions, which we call vectors
- $2$-tensors are bilinear functions, which we call matrices
A tensor describes any general transformation, independent of basis, between sets of algebraic objects related to a vector space.
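As a quick sanity check on the rank hierarchy above (a minimal sketch of my own; the variable names are just for illustration), TensorFlow can report the rank of each kind of object directly:

import tensorflow as tf

scalar = tf.constant(3.0)              # 0-tensor: a single value
vector = tf.constant([1.0, 2.0])       # 1-tensor: a directional quantity
matrix = tf.constant([[1.0, 0.0],
                      [0.0, 1.0]])     # 2-tensor: a bilinear map

print(tf.rank(scalar).numpy(), tf.rank(vector).numpy(), tf.rank(matrix).numpy())
# 0 1 2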
Back to the paper, there was a particular function which claimed to do batch matrix multiplication:
tf.einsum('ijk,kl->ijl', A, B)
where $A$ was the diffusion kernel and $B$ was a feature vector (and `tf` was TensorFlow). So $A$ would have dimensions `(batch_size, m, n)` and $B$ would have dimensions `(n, k)`, where:
- batch_size: number of nodes to process in a given batch (for model training)
- n: number of features
- m: number of nodes
- k: number of “hops”
Ignoring the technicalities of the paper and the actual definitions above, I wanted to know what the heck this strange `einsum` function was actually trying to do.
Einstein Summation#
Enter Einstein summation. In Einstein summation, a repeated index defines what we sum over, and the expression must have a repeated index. So $$ \sum_{i=1}^n a_ix_i = a_1x_1 + a_2x_2 + \dots + a_nx_n \equiv a_ix_i $$ is valid. But $a_{ij}x_k$ is not, whilst $a_{ij}x_j$ is: $$ a_{ij}x_j \equiv a_{i1}x_1 + a_{i2}x_2 + \dots + a_{in}x_n $$
Double sums are handled in the same way, for example summation over both $i$ and $j$: $$ a_{ij}x_iy_j \equiv \sum_i\sum_j a_{ij}x_iy_j $$
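To make the notation concrete, here's a small sketch of my own (not from the paper) of both sums written with `tf.einsum`; an index that is repeated on the input side and absent from the output side gets summed over:

import tensorflow as tf

a = tf.constant([1., 2., 3.])
x = tf.constant([4., 5., 6.])

# a_i x_i: the repeated index i is summed over -- a plain dot product
tf.einsum('i,i->', a, x)            # 1*4 + 2*5 + 3*6 = 32.0

# a_ij x_i y_j: both i and j are repeated, so both are summed over
a_ij = tf.constant([[1., 2.],
                    [3., 4.]])
x2 = tf.constant([5., 6.])
y = tf.constant([7., 8.])
tf.einsum('ij,i,j->', a_ij, x2, y)  # = 433.0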
In the `einsum` function, the first argument `'ijk,kl->ijl'` signifies summation over the $k^{th}$ dimension.
Now that I understood what the notation meant, I wanted a better grasp of the actual mechanics behind the function, so I set up some synthetic tensors as follows:
import tensorflow as tf

k = 2
batch_size, m, n = None, 4, 2

# random integer kernel A, with an explicit batch dimension added in front
init = tf.random.uniform(shape=(m, n), minval=0, maxval=16, dtype=tf.int32)
A = tf.Variable(init)
A = tf.expand_dims(A, 0)
A
<tf.Tensor: shape=(1, 4, 2), dtype=int32, numpy=
array([[[14, 4],
[ 4, 12],
[ 9, 13],
[ 0, 13]]], dtype=int32)>
# random feature matrix B of shape (n, k)
init = tf.random.uniform(shape=(n, k), minval=0, maxval=16, dtype=tf.int32)
B = tf.Variable(init)
B
<tf.Variable 'Variable:0' shape=(2, 2) dtype=int32, numpy=
array([[3, 9],
[5, 1]], dtype=int32)>
`tf.matmul`#
Here is where I used the two previously defined tensors to see what would happen. It was also at this point I realised that TensorFlow 2 includes a function, `matmul`, which essentially does the same thing!
C = tf.einsum('ijk,kl->ijl', A, B)
C
<tf.Tensor: shape=(1, 4, 2), dtype=int32, numpy=
array([[[ 62, 130],
[ 72, 48],
[ 92, 94],
[ 65, 13]]], dtype=int32)>
tf.matmul(A, B)
<tf.Tensor: shape=(1, 4, 2), dtype=int32, numpy=
array([[[ 62, 130],
[ 72, 48],
[ 92, 94],
[ 65, 13]]], dtype=int32)>
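To convince myself the two calls really agree element for element (a quick check of my own, not part of the original implementation):

# element-wise check that the einsum and matmul results agree
tf.reduce_all(tf.equal(C, tf.matmul(A, B)))
# <tf.Tensor: shape=(), dtype=bool, numpy=True>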
Minimum-Viable Example#
Okay, now simplifying even further: first by creating a rank-2 tensor (i.e. a matrix) using numpy, then finding the matrix product.
import numpy as np
A = np.matrix('''
1 4;
2 3
''')
B = np.matrix('''
5 7;
6 8
''')
C = A @ B
C
matrix([[29, 39],
[28, 38]])
Every element in $C$, $C_{ik}$ is: $$ C_{ik} = \sum_jA_{ij}B_{jk} $$
$C_{01} = 39$, so
$$ C_{01} = \sum_j A_{0j} B_{j1} = (1\times 7)_{j=0} + (4\times 8)_{j=1} = 39 $$
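That single element can be checked by hand with an explicit loop over $j$ (just a sanity check I wrote, using the numpy matrices defined above):

# C[0, 1] as an explicit sum over the repeated index j
sum(A[0, j] * B[j, 1] for j in range(A.shape[1]))
# 39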
Next, I converted the above matrices to TensorFlow objects and repeated the operation to test that I'd grasped the notation:
A = tf.convert_to_tensor(A)
B = tf.convert_to_tensor(B)
A, B
(<tf.Tensor: shape=(2, 2), dtype=int64, numpy=
array([[1, 4],
[2, 3]])>,
<tf.Tensor: shape=(2, 2), dtype=int64, numpy=
array([[5, 7],
[6, 8]])>)
It worked! The output of `einsum` below is consistent with the matrix product above:
# equivalent to A @ B or tf.matmul(A, B)
tf.einsum('ij,jk->ik', A, B)
<tf.Tensor: shape=(2, 2), dtype=int64, numpy=
array([[29, 39],
[28, 38]])>
Slightly-Less Minimum Example#
Now on to a slightly more complex example: I created a rank-3 tensor and a rank-2 tensor to multiply together.
# applying to batch case
A = tf.Variable([
[[1,2],
[3,4]],
[[3, 5],
[2, 9]]
])
B = tf.Variable(
[[2], [1]]
)
A.shape, B.shape
(TensorShape([2, 2, 2]), TensorShape([2, 1]))
tf.matmul(A, B)
<tf.Tensor: shape=(2, 2, 1), dtype=int32, numpy=
array([[[ 4],
[10]],
[[11],
[13]]], dtype=int32)>
For the $ijl^{th}$ element in $C$, sum across the $k^{th}$ dimension in $A$ and $B$:
output[i,j,l] = sum_k A[i,j,k] * B[k, l]
# for the ijl-th element in C,
C = tf.einsum('ijk,kl->ijl', A, B)
C
<tf.Tensor: shape=(2, 2, 1), dtype=int32, numpy=
array([[[ 4],
[10]],
[[11],
[13]]], dtype=int32)>
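To tie the formula back to the output, a brute-force version with explicit Python loops (my own sketch, purely illustrative and not how you'd write it in practice) produces the same numbers:

import numpy as np

A_np, B_np = A.numpy(), B.numpy()      # shapes (2, 2, 2) and (2, 1)
I, J, K = A_np.shape
L = B_np.shape[1]

# output[i, j, l] = sum_k A[i, j, k] * B[k, l], written out as explicit loops
C_loop = np.zeros((I, J, L), dtype=A_np.dtype)
for i in range(I):
    for j in range(J):
        for l in range(L):
            for k in range(K):
                C_loop[i, j, l] += A_np[i, j, k] * B_np[k, l]

C_loop   # matches the einsum / matmul result above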
And success! I think I now have a fair grasp of how `einsum` and Einstein summation work, and how and why it can sometimes be simpler to just use the built-in `matmul` function, but also where batch dimensions may trip up the built-in functions and we'd want to define the contraction in finer detail.