On-the-fly tail call optimization in Javascript without trampoline

Programming with tail call optimization combines high performance and algorithmic clarity (short, medium and long intros). Javascript does not optimize tail calls, so I implemented tail call optimization in Javascript itself, and tested it on various browsers.

For those not familiar with tail calls, the first two sections explain it.
For those familiar with tail calls: we're dealing here with a single function calling itself (tail recursion elimination). See also ⇒ mutual recursion.

The hurried ones can directly read the GCD example, check the cross-browser test results, or even jump to conclusion(s). Those interested can check the ⇒ production examples.

I am greatly indebted to Aubrey Jaffer, for his inspiring work, to Thomas Lahn, for his insightful comment on syntax, and to the people on comp.lang.javascript for their stimulating discussions, which resulted in a precision on the current context (2010).

Update 2018-06-28: fext.js supports mutual tail recursion, generates very fast code, and includes easy debugging tools.

The problem

Javascript engines are getting faster and faster, leading to bigger and bigger front-end applications, as well as Javascript servers (see e.g. node.js). Both cases involve large and complex data structures and processes. Resource-efficient implementations are thus needed, especially with the rising trend of mobile platforms.

Recursive processes often lead to an explosive resource cost. A toy example first:

function fact_rec(n) {           // Recursive implementation
    return n > 1 ? n * fact_rec(n - 1) : 1;
}

function fact_iter(n) {        // Iterative implementation
    var a, ret = 1;
    for (a = 2; a <= n; a++) {
        ret = ret * a;
    }
    return ret;
}

	total number of function calls	memory: maximum stack depth	CPU: for n=13, duration in seconds
`fact_rec(n)`	`n`	`n`	`1.758e-6 s (100: base)`
`fact_iter(n)`	`1`	`1`	`1.299e-7 s (7)`

(Whenever you want, you can duration tests on this page.)

For this simple task, the recursive implementation uses already about one order of magnitude more resources than the iterative implementation does.

Whereas the recursive one directly implements a mathematical definition, the iterative implementation directly describes a run-time process. Converting from the former to the latter may well be difficult on more complex tasks.

A solution

A compromise between easing implementation and saving resources is tail optimization, where a compiler executes in an iterative manner a certain type of recursive implementation (example below).

Unfortunately, most Javascript engines do not implement tail optimization. Therefore, I am proposing to optimize Javascript functions on the fly using... Javascript itself (tailopt.js), inspired by the great works of Eugene Lazutkin and Aubrey Jaffer:

Implementation → Generated code

Verbose:

// tailopt() takes a function fact_tail(n), optimizes it,
// and returns the optimized function.

var fact_tailopt = tailopt(function fact_tail(n) {
        return (function fact_sub(p, tmp) {
            if (p) {
                // Tail recursion:
                // return <myself>( new parameters );
                return fact_sub(p - 1, p * tmp);
            }
            return tmp;            
        })(n, 1);       // Initial parameters
    }
);

fact_tailopt.toString():

function (n) { 
    var fact_sub;
    var p = n, p_new, tmp = 1, tmp_new;
L_fact_sub: 
    while (true) {
        if (p) { 
            p_new = p - 1;
            tmp_new = p * tmp;
            p = p_new;
            tmp = tmp_new;
            continue L_fact_sub;
        } 
        return tmp;
    } 
}

Concise:

var fact_tailopt2 = tailopt( function (n) {
    return (function fact_sub(p, tmp) {
        // Still a tail call:
        return p ? fact_sub(p - 1, p * tmp) : tmp; 
    })(n, 1);
} );

fact_tailopt2.toString():

Same as above.

Note: the parentheses around the sub-function expression are not strictly necessary - however I tend to prefer writing them to suggest right away the presence of an actual call. A matter of taste.

	total number of function calls	memory: maximum stack depth	CPU: for n=13, duration in seconds
`fact_rec(n)`	`n`	`n`	`1.758e-6 s (100: base)`
`fact_tailopt(n)`	`1`	`1`	`2.208e-7 s (13)`
`fact_tailopt2(n)`	`1`	`1`	`2.203e-7 s (13)`
`fact_iter(n)`	`1`	`1`	`1.299e-7 s (7)`

Interpretation: The function calls disappear:

The linearly growing - or more - stack size becomes a constant stack size:
- less memory used,
- no risk of stack overflow.
fact_tailopt's speed is:
- much better than the recursive fact_rec,
- near to the manually-coded iterative fact_iter.

How tailopt works: tailopt(fun) detects in fun.toString() the Javascript equivalent of Scheme "named lets" and unrolls them.

Note about very buggy JS engines (e.g. IE7, maybe IE8, sorry I am loosing interest in IE...): if the non-anonymous function expressions are not correctly supported, you can still define your sub-function in an anonymous - but more verbose - manner:

GCD (Greatest Common Divisor) example

(Inspired by the page: "Refactoring - Replace Iteration with Recursion".)

The iterative version has no stack depth issue, but lacks algorithmic clarity:

function gcd_iter(a, b) {
    while (true) {
        if (a > b) {
            a -= b;
        } else if (b > a) {
            b -= a;
        } else {
            break;
        }
    }
    return a;
}

In the recursive code, we have the stack depth issue, but the algorithm is clearly written:

function gcd_rec(a, b) {
    if (a > b) {
        return gcd_rec(a-b, b);
    } else if (a < b) {
        return gcd_rec(a, b-a);
    }
    // a === b
    return a;
}

Could we combine the best of both? Clear code and constant stack depth? That's where tailopt comes in:

var gcd_tailopt = tailopt(gcd_rec);

In this case, we need neither extra auxiliary variable nor sub-function (as opposed to fact_tail).

	total number of function calls	memory: maximum stack depth	CPU: for a=28974330 and b=310200 (gcd: 330), duration in seconds
`gcd_rec(a, b)`	`n`	`n`	`1.450e-5 s (100: base)`
`gcd_tailopt(a, b)`	`1`	`1`	`1.215e-6 s (8)`
`gcd_iter(a, b)`	`1`	`1`	`4.323e-7 s (3)`

Note about IE7 and such, i.e. if you need to use anonymous function expressions:

var gcd_tailopt = tailopt("gcd_rec", function (a, b) {
    if (a > b) {
        return gcd_rec(a-b, b);
    } else if (a < b) {
        return gcd_rec(a, b-a);
    }
    // a === b
    return a;
});

String replication example

The task is to replicate a string many times, as fast as possible - and without blowing the memory (first work on arrays, then do a join('')).

Recursive code:

function string_repli_rec(s, n) {
    if (!((n > 0) && s)) { return ''; }
    return (function sub(s, n) {

        if (n < 1) { return ['']; }
        if (n === 1)  { return [s]; } 

        var half = Math.floor(n / 2), a = sub(s, half);
        return a.concat(a, sub(s, n - 2 * half));
    })(s, n).join('');
}

Tail-recursive code:

function string_repli_tail(s0, n0){
    if (!((n0 > 0) && s0)) { return ''; }

    return (function sub(s, n, tmp){
        if (n & 1) { tmp.push(s); }            
        return n ? sub(s + s, n >> 1, tmp) : tmp.join('');
    })(s0, n0, []);
}

Tail-optimized code (= iterative process):

var string_repli_tailopt = tailopt( string_repli_tail );

Iterative code (taken from dojo 1.4.1):

function string_repli_iter(str, num){

    if(num <= 0 || !str){ return ''; }
    
    var buf = [];
    for(;;){
    	if(num & 1){
    		buf.push(str);
    	}
    	if(!(num >>= 1)){ break; }
    	str += str;
    }
    return buf.join('');
}

Both iterative and tail-recursive codes can be seen as very similar. However, the tail-recursive way *forces* to clearly separate initialization, update and termination, whereas the iterative way does not clearly separate all three.

Average iteration duration on 500 runs on a linux platform, in seconds (relative values in brackets) - the lower, the better:

Stack depth: case name    Firefox 3.6        Google Chrome 4.0  Google V8 Engine   Rhino Engine       Your browser & system
n: string_repli_rec       5.972e-5 (100.0)   5.312e-5 (100.0)   3.324e-5 (100.0)   1.337e-2 (100.0)   *
n: string_repli_tail      1.582e-5 ( 26.5)   4.075e-6 (  7.7)   2.381e-6 (  7.2)   1.243e-3 (  9.3)   *
1: string_repli_tailopt   1.105e-5 ( 18.5)   3.479e-6 (  6.5)   2.231e-6 (  6.7)   1.100e-3 (  8.2)   *
1: string_repli_iter      9.818e-6 ( 16.4)   3.361e-6 (  6.3)   2.166e-6 (  6.5)   8.072e-4 (  6.0)   *

Detailed statistical results are in appendice.

The speed improvement of the iterative version over the tail-optimized version seems marginal at best. The tail-optimized version forces to clearly separate initialization, update and termination, which can help a lot, especially for more complex code.

A tree traversal example

We have a tree of categories, whose non-leaf nodes have counts:

ooi
  tour
    winter
      skiFreeride    (count: 3)
      sledging       (count: 156)
      winterHiking   (count: 157)
      ...
    summer
      ...
  points of interest
    ...

For a given non-leaf node, we want to sum the counts of its children, so we need to traverse the subtree.

	total number of function calls	memory: maximum stack depth	CPU: duration in seconds
`treeforeach_rec()`	`total number of tree nodes (245)`	`maximum tree depth (4)`	`1.870e-4 s (100: base)`
`treeforeach_tailopt()`	`1`	`1`	`3.608e-4 s (193)`

A DOM tree traversal example

Let us list DOM nodes on this web page who (1) are text nodes (2) contain the string ".js":

var node_list = [], 
node_check = function (node) { 
    if ((node.nodeName === '#text') && (node.textContent.indexOf('.js') > -1)) {
        node_list.push(node);
    }
};
DOMtreeforeach_rec(document, node_check);

	total number of function calls	memory: maximum stack depth	CPU: duration in seconds
`DOMtreeforeach_rec()`	`total number of DOM nodes (210)`	`maximum tree depth (12)`	`1.312e-3 s (100: base)`
`DOMtreeforeach_tail()`	`1`	`1`	`1.462e-3 s (111)`

Cross-browser Tests

System: windows 7 professional, dell optiplex 760, intel core quad cpu Q9650 3.00GHz, RAM 8.00 GB, 64 bits

Results: Duration of a single call, in seconds: the lower, the better.

Remember that we are hugely saving on stack depth, so a small duration increase may not be that terrible!

Warning: With Firefox, I have observed slight variations on the CPU time, depending whether Firebug is activated or not. It likely has an impact on Tracemonkey optimizations. All results reported on this page have Firebug turned off. A further discussion on this theme can be found here.


Stack depth: case name      Firefox 3.6              Safari 4.0.4             Google Chrome 3.0        IE8                      Your browser & system

n: fact_rec                 1.758e-6 s (100: base)   2.155e-7 s (100: base)   1.540e-7 s (100: base)   5.635e-6 s (100: base)   *
n: fact_tail                2.834e-6 s (161)         6.681e-7 s (310)         4.056e-7 s (263)         8.785e-6 s (156)         *
1: fact_tailopt             2.208e-7 s (13)          1.532e-7 s (71)          9.653e-8 s (63)          3.548e-6 s (63)          *
1: fact_iter                1.299e-7 s (7)           1.155e-7 s (54)          6.055e-8 s (39)          1.381e-6 s (24)          *
                                                                                                                             
n: fact_tail2               2.891e-6 s (165)         6.807e-7 s (316)         4.084e-7 s (265)         8.785e-6 s (156)         *
1: fact_tailopt2            2.203e-7 s (13)          1.577e-7 s (73)          9.593e-8 s (62)          3.612e-6 s (64)          *

n: gcd_rec                  1.450e-5 s (100: base)   1.798e-6 s (100: base)   1.445e-6 s (100: base)   5.294e-5 s (100: base)   *
1: gcd_tailopt              1.215e-6 s (8)           7.652e-7 s (43)          5.343e-7 s (37)          1.921e-5 s (36)          *
1: gcd_iter                 4.323e-7 s (3)           6.823e-7 s (38)          5.355e-7 s (37)          1.046e-5 s (20)          *

n: treeforeach_rec()        1.870e-4 s (100: base)   5.935e-5 s (100: base)   1.303e-4 s (100: base)   6.286e-4 s (100: base)   *
n: treeforeach_tail()       2.139e-4 s (114)         8.308e-5 s (140)         2.116e-4 s (162)         6.229e-4 s (99)          *
1: treeforeach_tailopt()    3.608e-4 s (193)         7.600e-5 s (128)         3.094e-4 s (238)         4.638e-4 s (74)          *

n: DOMtreeforeach_rec()     1.312e-3 s (100: base)   7.761e-4 s (100: base)   1.566e-3 s (100: base)   7.231e-3 s (100: base)   *
1: DOMtreeforeach_tail()    1.191e-3 s (91)          5.068e-4 s (65)          7.021e-4 s (45)          6.333e-3 s (88)          *
1: DOMtreeforeach_tailopt() 1.462e-3 s (111)         4.822e-4 s (62)          7.671e-4 s (49)          9.923e-3 s (137)         *

1. Sanity checks

1a [ok] Similar durations for fact_tail and fact_tail2, as well as for fact_tailopt & fact_tailopt2.
1b [ok] fact and gcd: the tail-optimized version _tailopt *always* runs faster than the original tail & recursive versions (_tail, _rec).
1c [mixed] Tree traversal cases: _tailopt *sometimes* runs slower than _tail and/or _rec.
This only seems to be a real issue for treeforeach_tailopt() and Firefox 3.6/Google Chrome 3.0.
Possible causes (maybe combined):
- Side effect(s) of built-in JIT-like engine optimizations (Tracemonkey, Webkit, V8...),
- In the tree traversal examples no return value is passed (in contrary to fact_rec and gcd_rec).

2. A huge improvement

We obtain a constant stack depth *and* a much smaller duration than the recursive version:

Similar duration: treeforeach_tailopt() for Safari 4.0.4 and IE8, DOMtreeforeach_tailopt() for Firefox 3.6 and IE8.
Much smaller duration:
- fact_tailopt() and gcd_tailopt() for all browsers. - almost as fast as the iterative code.
- DOMtreeforeach_tailopt() for Safari 4.0.4 and Google Chrome 3.0

Tests with Rhino

System: Ubuntu 8.04 on a Thinkpad T43 (Intel Pentium M, 2 GHz, 1.5 GB RAM)

Java version: 1.5.0, gij (GNU libgcj) version 4.2.4 (Ubuntu 4.2.4-1ubuntu3)

Rhino version: 1.6 release 7 2008 10 14

Test: 10 runs

rhino -e "load('tailopt-js-rerun.js');rerun(10);"

Results: Duration of a single call, in seconds: the lower, the better.

Stack depth: test case
n: fact_rec_____________ iter_sec: 4.002e-4 (100: base)
n: fact_tail____________ iter_sec: 5.825e-4 (146)
1: fact_tailopt_________ iter_sec: 4.899e-4 (122)
n: fact_tail2___________ iter_sec: 5.931e-4 (148)
1: fact_tailopt2________ iter_sec: 4.748e-4 (119)
1: fact_iter____________ iter_sec: 1.207e-4 (30)
n: gcd_rec______________ iter_sec: 3.042e-3 (100: base)
1: gcd_tailopt__________ iter_sec: 3.003e-3 (99)
1: gcd_iter_____________ iter_sec: 7.957e-4 (26)
n: string_repli_rec_____ iter_sec: 1.340e-2 (100: base)
n: string_repli_tail____ iter_sec: 1.215e-3 (9)        
1: string_repli_tailopt_ iter_sec: 1.123e-3 (8)        
1: string_repli_iter____ iter_sec: 8.237e-4 (6)        
n: treeforeach_rec______ iter_sec: 5.478e-2 (100: base)
n: treeforeach_tail_____ iter_sec: 5.668e-2 (103)
1: treeforeach_tailopt__ iter_sec: 6.306e-2 (115)

Stack depth savings apart, the tailopt results do not always bring a speed improvement, probably because Rhino runs on top of Java.

Tests with the Google V8 Engine

System: Ubuntu 8.04 on a Thinkpad T43 (Intel Pentium M, 2 GHz, 1.5 GB RAM)

V8 version: 2.1.0

Test: 10 runs

d8 -e "load('tailopt-js-rerun.js');rerun(10);"

Results: Duration of a single call, in seconds: the lower, the better.

Stack depth: test case
n: fact_rec_____________ iter_sec: 2.674e-7 (100: base)
n: fact_tail____________ iter_sec: 4.149e-7 (155)
1: fact_tailopt_________ iter_sec: 1.904e-7 (71)
n: fact_tail2___________ iter_sec: 4.160e-7 (156)
1: fact_tailopt2________ iter_sec: 1.904e-7 (71)
1: fact_iter____________ iter_sec: 1.338e-7 (50)
n: gcd_rec______________ iter_sec: 2.676e-6 (100: base)
1: gcd_tailopt__________ iter_sec: 1.085e-6 (41)
1: gcd_iter_____________ iter_sec: 1.215e-6 (45)
n: string_repli_rec_____ iter_sec: 3.352e-5 (100: base)
n: string_repli_tail____ iter_sec: 2.486e-6 (7)        
1: string_repli_tailopt_ iter_sec: 2.411e-6 (7)        
1: string_repli_iter____ iter_sec: 2.214e-6 (7)        
n: treeforeach_rec______ iter_sec: 2.247e-4 (100: base)
n: treeforeach_tail_____ iter_sec: 2.922e-4 (130)
1: treeforeach_tailopt__ iter_sec: 2.813e-4 (125)

tailopt is bringing an additional speed improvement, like in the browser cases.