- github: https://github.com/glathoud/flatorize
- [1] http://glat.info
- [2] http://asmjs.org

by Guillaume Lathoud [1], April 2013, October 2014

`flatorize`

lets you write well-encapsulated math code - i.e. many small functions - and automatically generates flat, **very fast** JavaScript code.

Speedups exceed +1000% in many cases, including matrix multiplication and the Discrete Fourier Transform (DFT).

Additionally:

- A plugin for
`asm.js`

generates even faster JavaScript using`TypedArray`

. - Plugins for the C and D languages are also available.
- On the DFT task:
`flatorize_d`

faster than state-of-the-art FFTW.

- On the DFT task:

Tests on the command line:

- Unit tests for
`asm.js`

, C and D (source on GitHub). - Speed tests using JavaScript,
`asm.js`

, C, Dâ€¦ (GitHub).

External links:

Think like this:

Matrix multiplication a * b: Each row of a with each column of b

...write code like you think, without caring about performance...

# Python c = a * b c = [[ sum(x*y for x,y in zip(ra,cb)) for cb in zip(*b) ] for ra in a ]

// JavaScript function matmulrows_zip( a, b ) { return a.map( function (ra) { return zip.apply( null, b ).map( function (cb) { return sum( zip( ra, cb ) .map( function (xy) { return xy[ 0 ] * xy[ 1 ]; } ) ); } ); } ); } // --> Well encapsulated code, but slow due to function call overhead.

...and let `flatorize`

transform the above into **very fast** JavaScript code:

// Generated "flatorized" code == factorized + flattened // // - factorized: avoid repeating computations. // - flattened: no function call.

The two core ideas are to factorize and to flatten the implementation of mathematical expressions, hence the name `flatorize`

.

Consider the function `f`

of complex numbers `a,b,c`

:

function f(a,b,c) { return csub( csub(a,cadd(b,c)), cadd(b,c) ); };

To spare CPU, one can modify `f`

by evaluating `add(b,c)`

only once:

```
function f(a,b,c) {
var bc = add(b,c);
return sub( sub(a,bc), bc );
}
```

Small "building block" functions like `add`

or `sub`

are useful to *write* code, but *at runtime* the
CPU spends more time in function overhead than just doing the simple operation(s) in them.

In more complex cases like recursive expressions, the function overhead becomes even bigger.

One can eliminate function overhead by inlining each small function, that is replacing each small function with its implementation:

```
function f(a,b,c) {
var bc = b+c;
return (a - bc) - bc;
}
```

The CPU performance should be pretty good at this point.

One could try to further optimize,
e.g. writing `a-2*bc`

, which
amounts to mathematical expression simplification. This can be quite
complicated, and it is not sure how much CPU speed can be gained this
way - especially compared to function overhead elimination -
so if we ever attempt simplification, we'll keep it very simple.

`expr`

, `part`

and `flatorize`

#
`flatorize('a,b,c', function f(a,b,c) { ... })`

returns a new function, obtained by:

- calling
`f('a','b','c')`

, which returns an expression, - factorizing and flattening the expression,
- generating the new code for the expression.

`f(a,b,c)`

can use 3 ways to build and return an expression:

`flatorize.expr(a,'+',b)`

defines the code expression`a+b`

.`flatorize.part(arr,3)`

or`flatorize.part(obj,'prop')`

describe the extraction of a property value from an object (array or not).`[ 'a', anExpr ]`

is an array expression composed of 2 sub-expressions. Similary an object`{ a : anExpr }`

can be returned.

We now work with complex numbers, which we represent as arrays of two numbers `[re,im]`

. Consider the function `f`

of complex numbers `a,b,c`

:

`f = ;`

To produce an optimized implementation `f2`

of the function `f`

,
we use flatorize and write the following code:

```
FZ = flatorize;
cadd = FZ('a,b', function (a,b) {
return cplx( FZ.expr( creal(a), '+', creal(b) ),
FZ.expr( cimag(a), '+', cimag(b) )
);
});
csub = FZ('a,b', function (a,b) {
return cplx( FZ.expr( creal(a), '-', creal(b) ),
FZ.expr( cimag(a), '-', cimag(b) )
);
});
creal = FZ('a', function (a) { return FZ.part( a, 0 ); });
cimag = FZ('a', function (a) { return FZ.part( a, 1 ); });
cplx = FZ('re,im', function (re,im) { return [ re, im ]; });
f =
f2 = FZ('a,b,c',f);
```

All functions returned by `flatorize`

can be used to build further expressions: in the above example, `creal`

is used to express `cadd`

, which itself is used to expressed `f`

. Definition order does not matter.

All functions returned by `flatorize`

can also be called with values, for computation. For example, calling `f2(a,b,c)`

executes
the flat, factorized:

Consider matrices expressed as 2-dimensional arrays of rows. Here is an implementation of matrix multiplication using 3 nested loops:

// LOOPS

Guessing the intent of `matmulrows_loops`

from its implementation alone can be hard. Concepts are *not* encapsulated into separate functions, which slows down understanding, debugging and maintenance.

Can we express the concept of matrix multiplication in a more concise and better encapsulated way? For example, in Python:

```
# Python
a = [ [1,2,3,4,], [5,6,7,8,], [9,10,11,12,], ]
b = [ [13,14,], [15,16,], [17,18,], [19,20,]]
# Matrix multiplication: c = a * b
# Each number in c is the dot-product
# of a row ra with a column cb
c = [[ sum(x*y for x,y in zip(ra,cb)) for cb in zip(*b) ] for ra in a ]
# zip(*b) implements the "transposed matrix"
# Result:
# >>> c
# [[170, 180], [426, 452], [682, 724]]
```

The expression is concise, shorter than the nested loops used above, and the various concepts are well encapsulated: `map`

, `sum`

and `zip`

.

The reference to rows of `a`

and columns of `b`

is *explicit* through variables `ra`

and `cb`

, respectively. This can facilitate understanding.

A similar implementation in JavaScript requires to define `zip`

and `sum`

:

// ZIP (JavaScript)

Great! But given the number of nested function calls involved, `matmulrows_zip`

may well run much slower than `matmulrows_loops`

:

(Feel free to do it multiple times.)

(The speed measurement can last long in some browsers.)

Example of result:

ZIP matrix multip.: 3.56e+4 matmulrows_zip(a,b) calls per second LOOPS matrix multip.: 1.76e+6 matmulrows_loops(a,b) calls per second -> speedup relative to ZIP: +4855%

Can we keep the nice encapsulations in `matmulrows_zip`

*and* speed it up? We now write similar code using `flatorize`

: instead of working on numbers, we work on symbols:

The core remains very similar. Indeed, compare the number version:

// ZIP

with the new symbolic version:

// ZIPFLAT

All we had to do was to replace:

`sum`

with`symbol_sum`

`xy[ 0 ] * xy[ 1 ]`

with`FZ.expr( xy[ 0 ], '*', xy[ 1 ] )`

Here is the generated implementation:

// ZIPFLAT // Generated code (function matmulrows_zip_342)

(This code, as a few others below, was produced and inserted as you loaded the page.)

Now a look at speed:

(Feel free to do it multiple times.)

(The speed measurement can last long in some browsers.)

Example of result:

ZIP matrix multip.: 3.53e+4 matmulrows_zip(a,b) calls per second LOOPS matrix multip.: 1.50e+6 matmulrows_loops(a,b) calls per second -> speedup relative to ZIP: +4147% ZIPFLAT matrix multip.: 1.69e+6 matmulrows_zip_342(a,b) calls per second -> speedup relative to ZIP: +4698%

Depending on the JavaScript engine, LOOPS written by hand may run faster or slower than the generated ZIPFLAT code. However, both speeds have similar orders of magnitude.

More importantly:

`We write well-encapsulated code, ``flatorize`

it automatically, and the generated code runs orders of magnitude faster (ZIPFLAT).

Trade-off: the matrix dimensions must be known to generate the `flatorize`

'd code.

We now implement a DFT using the
Cooley-Tukey algorithm for both
real & complex inputs (option). The implementation `dft_ditfft2`

shows that we can express recursive
mathematical expressions in a "natural" way, that is very close to the
original pseudocode.

```
dftreal16flat = flatorize('arr', dft_exprgenF( 4, { real: true } ));
```

Let us now test it on a "small" scale (16 point). First, let us generate the code:

So the code generation of a "flatorized" 16-point DFT takes very little time. This has to be done only one time.

Now that we have a "flatorized" implementation, let us check whether it runs faster than the original Cooley-Tukey:

(Feel free to do it multiple times.)

(The speed measurement can last long in some browsers.)

On a linux Ubuntu 12.04, I obtained with Firefox 21 a speedup of about +2000%, and with Chrome 27 about +1250%.

`dftreal1024flat = flatorize('arr', dft_exprgenF( 10, { real: true } ));`

Since the recursion depth \(log_2{1024}=10\) starts to be quite significant, one can wonder how long the call to `flatorize` itself will take.

(This one can last a few seconds.)

Now, let us measure how fast the flatorized DFT1024 implementation runs, as compared with the original Cooley-Tukey:

(Feel free to do it multiple times.)

(The speed measurement can last long in some browsers.)

In Chrome 27 on a linux Ubuntu 12.04, I got a small speedup +50% by the first measurement, and afterwards a consistent speedup of about +150%. In Firefox 21 I got a consistent speedup of about +350%.

Speedup brought by `flatorize`

:

- System: linux Ubuntu 12.04 on a Thinkpad T420s.
- (1st): first measurement.
- (2+): second measurement and following measurements.

Speedup on: | `matmul342` | `DFT16` | `DFT32` | `DFT64` | `DFT128` | `DFT256` | `DFT512` | `DFT1024` |
---|---|---|---|---|---|---|---|---|

Firefox 26 (1st) | +615% | +2532% | +1755% | +1011% | +638% | +602% | +597% | +624% |

Firefox 26 (2nd) | +399% | +2067% | +1309% | +830% | +649% | +654% | +702% | +301% |

Firefox 26 (3rd) | +434% | +1945% | +2076% | +936% | +713% | +743% | +685% | +383% |

Chrome 32 (1st) | +87% | +1388% | +1072% | +1421% | +577% | +264% | -3% | -67% |

Chrome 32 (2nd) | +102% | +1044% | +1284% | +1557% | +1409% | +1600% | +1440% | +1462% |

Chrome 32 (3rd) | +98% | +1107% | +1141% | +1285% | +1428% | +1404% | +1515% | +1580% |

V8 3.19.10 (1st) | +74% | +1548% | +1987% | +1271% | +548% | +194% | -12% | -77% |

V8 3.19.10 (2nd) | +80% | +1735% | +2618% | +2233% | +1695% | +1785% | +1548% | +1587% |

V8 3.19.10 (3rd) | +79% | +1643% | +2461% | +2233% | +1695% | +1793% | +1546% | +1590% |

(Feel free to do it multiple times.)

Firefox 26 shows excellent speedups everywhere.

Chrome 32 and V8 3.19.10 mostly show excellent speedups, especially after the first pass, which likely triggers code optimization.

In all environments the memory consumption remains very stable.

The huge performance improvements validate the interest
of `flatorize`

.

A plugin is available that
generates ```
asm.js
```

code, which runs even faster (tested in
Firefox and Chrome).

Two other plugins generate fast code for the C language and the D language.

See also: Speed tests of the various solutions & languages (JS, `asm.js`

, C).

In the future, one could also try WebCL, NaCl or PNaCl.

Big thanks to the V8 developers, who solved issue #2612 in record time (concerning Chrome 24 and V8 3.17).

By no means did I intend `flatorize`

to compete with highly specialized and/or native implementations *of a particular task* (e.g. FFTW to implement DFT).

Rather, this simple, task-agnostic approach – factorizing and flattening – can give *huge enough* performance gains in many cases, which spares the development and maintainance costs of writing optimized native code by hand.

Above, the three main functions were described: `flatorize`

, `flatorize.expr`

and `flatorize.part`

. In addition:

- The method
`getDirect`

and the convenience function`flatorize.now`

give you direct access to the flatorized implementation, which you can use for computations (even better performance).`flatorize.now`

makes sense when you know that you will never use the flatorized function to build further expressions, e.g. in specific cases of a parameterizable implementation (`matmul342`

); else use`getDirect`

. - Within a function that builds an expression, a call to the method
`evalnow`

delivers a value (instead of an expression). See above the call`cpol.evalnow`

within`dft_exprgenF`

.

For more details about these two points, and examples, you can have a look at the sources on github or here (files flatorize.js and examples.js).

.

V8: Several `.sh`

files are provided to run the tests on
V8,
including d8_tryit_speed_all.sh
and d8_tryit_speed_dft1024.sh;
the rest on github.

.

Flatorizing DFT1024 in a tractable time – order of magnitude: 1 second – actually required a bit of work. The method is not specific to DFT, but rather very generic. The resulting, cleaned-up implementation: flatorize.js