⇒ New implementation (mutual recursion)
 
⇒ New implementation (production examples)
 
⇒ AppendicesTo the appendices
 
⇒ Back home
⇒ email
 
Flattr this
⇒ New implementation (mutual recursion)
 
⇒ New implementation (production examples)
 
Problem
Solution
Current context (2010)
GCD Example
String Replication Example
Tree Traversal Example
DOM Traversal Example
Cross-browser Tests
Rhino Tests
Google V8 Tests
Conclusion
 
⇒ Appendices
 
⇒ Back home
⇒ email
 
Flattr this

On-the-fly tail call optimization in Javascript without trampoline

by Guillaume Lathoud

Programming with tail call optimization combines high performance and algorithmic clarity (short, medium and long intros). Javascript does not optimize tail calls, so I implemented tail call optimization in Javascript itself, and tested it on various browsers.

The hurried ones can directly read the GCD example, check the cross-browser test results, or even jump to conclusion(s). Those interested can check the ⇒ production examples.

I am greatly indebted to Aubrey Jaffer, for his inspiring work, to Thomas Lahn, for his insightful comment on syntax, and to the people on comp.lang.javascript for their stimulating discussions, which resulted in a precision on the current context (2010).

If you like this, you might also like the transfun.js tool for fast map/filter/reduce/...

The problem

Javascript engines are getting faster and faster, leading to bigger and bigger front-end applications, as well as Javascript servers (see e.g. node.js). Both cases involve large and complex data structures and processes. Resource-efficient implementations are thus needed, especially with the rising trend of mobile platforms.

Recursive processes often lead to an explosive resource cost. A toy example first:

function fact_rec(n) {           // Recursive implementation
    return n > 1 ? n * fact_rec(n - 1) : 1;
}

function fact_iter(n) {        // Iterative implementation
    var a, ret = 1;
    for (a = 2; a <= n; a++) {
        ret = ret * a;
    }
    return ret;
}
total number of function callsmemory: maximum stack depthCPU: for n=13, duration in seconds
fact_rec(n)nn1.758e-6 s (100: base)
fact_iter(n)111.299e-7 s (7)

(Whenever you want, you can duration tests on this page.)

For this simple task, the recursive implementation uses already about one order of magnitude more resources than the iterative implementation does.

Whereas the recursive one directly implements a mathematical definition, the iterative implementation directly describes a run-time process. Converting from the former to the latter may well be difficult on more complex tasks.

A solution

A compromise between easing implementation and saving resources is tail optimization, where a compiler executes in an iterative manner a certain type of recursive implementation (example below).

Unfortunately, most Javascript engines do not implement tail optimization. Therefore, I am proposing to optimize Javascript functions on the fly using... Javascript itself (tailopt.js), inspired by the great works of Eugene Lazutkin and Aubrey Jaffer:

Implementation→ Generated code
Verbose:
// tailopt() takes a function fact_tail(n), optimizes it,
// and returns the optimized function.

var fact_tailopt = tailopt(function fact_tail(n) {
        return (function fact_sub(p, tmp) {
            if (p) {
                // Tail recursion:
                // return <myself>( new parameters );
                return fact_sub(p - 1, p * tmp);
            }
            return tmp;            
        })(n, 1);       // Initial parameters
    }
);
   
fact_tailopt.toString():
function (n) { 
    var fact_sub;
    var p = n, p_new, tmp = 1, tmp_new;
L_fact_sub: 
    while (true) {
        if (p) { 
            p_new = p - 1;
            tmp_new = p * tmp;
            p = p_new;
            tmp = tmp_new;
            continue L_fact_sub;
        } 
        return tmp;
    } 
}
Concise:
var fact_tailopt2 = tailopt( function (n) {
    return (function fact_sub(p, tmp) {
        // Still a tail call:
        return p ? fact_sub(p - 1, p * tmp) : tmp; 
    })(n, 1);
} );
fact_tailopt2.toString():

Same as above.


Note: the parentheses around the sub-function expression are not strictly necessary - however I tend to prefer writing them to suggest right away the presence of an actual call. A matter of taste.

total number of function callsmemory: maximum stack depthCPU: for n=13, duration in seconds
fact_rec(n)nn1.758e-6 s (100: base)
fact_tailopt(n)112.208e-7 s (13)
fact_tailopt2(n)112.203e-7 s (13)
fact_iter(n)111.299e-7 s (7)

Interpretation: The function calls disappear:

How tailopt works: tailopt(fun) detects in fun.toString() the Javascript equivalent of Scheme "named lets" and unrolls them.

Note about very buggy JS engines (e.g. IE7, maybe IE8, sorry I am loosing interest in IE...): if the non-anonymous function expressions are not correctly supported, you can still define your sub-function in an anonymous - but more verbose - manner:

var fact_tailopt = tailopt( function fact_tail2(n) {
        var fact_sub;
        return (fact_sub = function (p, tmp) {  // Anonymous function expression on the RHS
            return p ? fact_sub(p - 1, p * tmp) : tmp;
        })(n, 1);    // The initial parameters
    }
);

A precision on the current context (2010)

The use case is tail call optimization (TCO) in Javascript without trampoline.

ECMAScript 3 & 5 neither require the engine to do TCO, nor fully specify function decompilation, so one has to compromise:

Who's not willing to compromise has to wait until all engines implement TCO.

In the case of a Javascript server, things might be simpler, since one knows which engine is running (e.g. the Google's V8 Engine used by the server node.js).

GCD (Greatest Common Divisor) example

(Inspired by the page: "Refactoring - Replace Iteration with Recursion".)

The iterative version has no stack depth issue, but lacks algorithmic clarity:

function gcd_iter(a, b) {
    while (true) {
        if (a > b) {
            a -= b;
        } else if (b > a) {
            b -= a;
        } else {
            break;
        }
    }
    return a;
}

In the recursive code, we have the stack depth issue, but the algorithm is clearly written:

function gcd_rec(a, b) {
    if (a > b) {
        return gcd_rec(a-b, b);
    } else if (a < b) {
        return gcd_rec(a, b-a);
    }
    // a === b
    return a;
}

Could we combine the best of both? Clear code and constant stack depth? That's where tailopt comes in:

var gcd_tailopt = tailopt(gcd_rec);

In this case, we need neither extra auxiliary variable nor sub-function (as opposed to fact_tail).

total number of function callsmemory: maximum stack depthCPU: for a=28974330 and b=310200 (gcd: 330),
duration in seconds
gcd_rec(a, b)nn1.450e-5 s (100: base)
gcd_tailopt(a, b)111.215e-6 s (8)
gcd_iter(a, b)114.323e-7 s (3)

Note about IE7 and such, i.e. if you need to use anonymous function expressions:

var gcd_tailopt = tailopt("gcd_rec", function (a, b) {
    if (a > b) {
        return gcd_rec(a-b, b);
    } else if (a < b) {
        return gcd_rec(a, b-a);
    }
    // a === b
    return a;
});

String replication example

The task is to replicate a string many times, as fast as possible - and without blowing the memory (first work on arrays, then do a join('')).

Recursive code:

function string_repli_rec(s, n) {
    if (!((n > 0) && s)) { return ''; }
    return (function sub(s, n) {

        if (n < 1) { return ['']; }
        if (n === 1)  { return [s]; } 

        var half = Math.floor(n / 2), a = sub(s, half);
        return a.concat(a, sub(s, n - 2 * half));
    })(s, n).join('');
}

Tail-recursive code:

function string_repli_tail(s0, n0){
    if (!((n0 > 0) && s0)) { return ''; }

    return (function sub(s, n, tmp){
        if (n & 1) { tmp.push(s); }            
        return n ? sub(s + s, n >> 1, tmp) : tmp.join('');
    })(s0, n0, []);
}

Tail-optimized code (= iterative process):

var string_repli_tailopt = tailopt( string_repli_tail );

Iterative code (taken from dojo 1.4.1):

function string_repli_iter(str, num){

    if(num <= 0 || !str){ return ''; }
    
    var buf = [];
    for(;;){
    	if(num & 1){
    		buf.push(str);
    	}
    	if(!(num >>= 1)){ break; }
    	str += str;
    }
    return buf.join('');
}

Both iterative and tail-recursive codes can be seen as very similar. However, the tail-recursive way *forces* to clearly separate initialization, update and termination, whereas the iterative way does not clearly separate all three.

Average iteration duration on 500 runs on a linux platform, in seconds (relative values in brackets) - the lower, the better:

Stack depth: case name    Firefox 3.6        Google Chrome 4.0  Google V8 Engine   Rhino Engine       Your browser & system
n: string_repli_rec       5.972e-5 (100.0)   5.312e-5 (100.0)   3.324e-5 (100.0)   1.337e-2 (100.0)   *
n: string_repli_tail      1.582e-5 ( 26.5)   4.075e-6 (  7.7)   2.381e-6 (  7.2)   1.243e-3 (  9.3)   *
1: string_repli_tailopt   1.105e-5 ( 18.5)   3.479e-6 (  6.5)   2.231e-6 (  6.7)   1.100e-3 (  8.2)   *
1: string_repli_iter      9.818e-6 ( 16.4)   3.361e-6 (  6.3)   2.166e-6 (  6.5)   8.072e-4 (  6.0)   *

Detailed statistical results are in appendice.

The speed improvement of the iterative version over the tail-optimized version seems marginal at best. The tail-optimized version forces to clearly separate initialization, update and termination, which can help a lot, especially for more complex code.

A tree traversal example

We have a tree of categories, whose non-leaf nodes have counts:

ooi
  tour
    winter
      skiFreeride    (count: 3)
      sledging       (count: 156)
      winterHiking   (count: 157)
      ...
    summer
      ...
  points of interest
    ...

For a given non-leaf node, we want to sum the counts of its children, so we need to traverse the subtree.

total number of function callsmemory: maximum stack depthCPU: duration in seconds
treeforeach_rec()total number of tree nodes (245)maximum tree depth (4)1.870e-4 s (100: base)
treeforeach_tailopt()113.608e-4 s (193)

A DOM tree traversal example

Let us list DOM nodes on this web page who (1) are text nodes (2) contain the string ".js":

var node_list = [], 
node_check = function (node) { 
    if ((node.nodeName === '#text') && (node.textContent.indexOf('.js') > -1)) {
        node_list.push(node);
    }
};
DOMtreeforeach_rec(document, node_check);
total number of function callsmemory: maximum stack depthCPU: duration in seconds
DOMtreeforeach_rec()total number of DOM nodes (210)maximum tree depth (12)1.312e-3 s (100: base)
DOMtreeforeach_tail()111.462e-3 s (111)

Cross-browser Tests

System: windows 7 professional, dell optiplex 760, intel core quad cpu Q9650 3.00GHz, RAM 8.00 GB, 64 bits

Results: Duration of a single call, in seconds: the lower, the better.

Remember that we are hugely saving on stack depth, so a small duration increase may not be that terrible!

Warning: With Firefox, I have observed slight variations on the CPU time, depending whether Firebug is activated or not. It likely has an impact on Tracemonkey optimizations. All results reported on this page have Firebug turned off. A further discussion on this theme can be found here.


Stack depth: case name      Firefox 3.6              Safari 4.0.4             Google Chrome 3.0        IE8                      Your browser & system

n: fact_rec                 1.758e-6 s (100: base)   2.155e-7 s (100: base)   1.540e-7 s (100: base)   5.635e-6 s (100: base)   *
n: fact_tail                2.834e-6 s (161)         6.681e-7 s (310)         4.056e-7 s (263)         8.785e-6 s (156)         *
1: fact_tailopt             2.208e-7 s (13)          1.532e-7 s (71)          9.653e-8 s (63)          3.548e-6 s (63)          *
1: fact_iter                1.299e-7 s (7)           1.155e-7 s (54)          6.055e-8 s (39)          1.381e-6 s (24)          *
                                                                                                                             
n: fact_tail2               2.891e-6 s (165)         6.807e-7 s (316)         4.084e-7 s (265)         8.785e-6 s (156)         *
1: fact_tailopt2            2.203e-7 s (13)          1.577e-7 s (73)          9.593e-8 s (62)          3.612e-6 s (64)          *

n: gcd_rec                  1.450e-5 s (100: base)   1.798e-6 s (100: base)   1.445e-6 s (100: base)   5.294e-5 s (100: base)   *
1: gcd_tailopt              1.215e-6 s (8)           7.652e-7 s (43)          5.343e-7 s (37)          1.921e-5 s (36)          *
1: gcd_iter                 4.323e-7 s (3)           6.823e-7 s (38)          5.355e-7 s (37)          1.046e-5 s (20)          *

n: treeforeach_rec()        1.870e-4 s (100: base)   5.935e-5 s (100: base)   1.303e-4 s (100: base)   6.286e-4 s (100: base)   *
n: treeforeach_tail()       2.139e-4 s (114)         8.308e-5 s (140)         2.116e-4 s (162)         6.229e-4 s (99)          *
1: treeforeach_tailopt()    3.608e-4 s (193)         7.600e-5 s (128)         3.094e-4 s (238)         4.638e-4 s (74)          *

n: DOMtreeforeach_rec()     1.312e-3 s (100: base)   7.761e-4 s (100: base)   1.566e-3 s (100: base)   7.231e-3 s (100: base)   *
1: DOMtreeforeach_tail()    1.191e-3 s (91)          5.068e-4 s (65)          7.021e-4 s (45)          6.333e-3 s (88)          *
1: DOMtreeforeach_tailopt() 1.462e-3 s (111)         4.822e-4 s (62)          7.671e-4 s (49)          9.923e-3 s (137)         *

1. Sanity checks

2. A huge improvement

We obtain a constant stack depth *and* a much smaller duration than the recursive version:

Tests with Rhino

System: Ubuntu 8.04 on a Thinkpad T43 (Intel Pentium M, 2 GHz, 1.5 GB RAM)

Java version: 1.5.0, gij (GNU libgcj) version 4.2.4 (Ubuntu 4.2.4-1ubuntu3)

Rhino version: 1.6 release 7 2008 10 14

Test: 10 runs

rhino -e "load('tailopt-js-rerun.js');rerun(10);"

Results: Duration of a single call, in seconds: the lower, the better.

Stack depth: test case
n: fact_rec_____________ iter_sec: 4.002e-4 (100: base)
n: fact_tail____________ iter_sec: 5.825e-4 (146)
1: fact_tailopt_________ iter_sec: 4.899e-4 (122)
n: fact_tail2___________ iter_sec: 5.931e-4 (148)
1: fact_tailopt2________ iter_sec: 4.748e-4 (119)
1: fact_iter____________ iter_sec: 1.207e-4 (30)
n: gcd_rec______________ iter_sec: 3.042e-3 (100: base)
1: gcd_tailopt__________ iter_sec: 3.003e-3 (99)
1: gcd_iter_____________ iter_sec: 7.957e-4 (26)
n: string_repli_rec_____ iter_sec: 1.340e-2 (100: base)
n: string_repli_tail____ iter_sec: 1.215e-3 (9)        
1: string_repli_tailopt_ iter_sec: 1.123e-3 (8)        
1: string_repli_iter____ iter_sec: 8.237e-4 (6)        
n: treeforeach_rec______ iter_sec: 5.478e-2 (100: base)
n: treeforeach_tail_____ iter_sec: 5.668e-2 (103)
1: treeforeach_tailopt__ iter_sec: 6.306e-2 (115)

Stack depth savings apart, the tailopt results do not always bring a speed improvement, probably because Rhino runs on top of Java.

Tests with the Google V8 Engine

System: Ubuntu 8.04 on a Thinkpad T43 (Intel Pentium M, 2 GHz, 1.5 GB RAM)

V8 version: 2.1.0

Test: 10 runs

d8 -e "load('tailopt-js-rerun.js');rerun(10);"

Results: Duration of a single call, in seconds: the lower, the better.

Stack depth: test case
n: fact_rec_____________ iter_sec: 2.674e-7 (100: base)
n: fact_tail____________ iter_sec: 4.149e-7 (155)
1: fact_tailopt_________ iter_sec: 1.904e-7 (71)
n: fact_tail2___________ iter_sec: 4.160e-7 (156)
1: fact_tailopt2________ iter_sec: 1.904e-7 (71)
1: fact_iter____________ iter_sec: 1.338e-7 (50)
n: gcd_rec______________ iter_sec: 2.676e-6 (100: base)
1: gcd_tailopt__________ iter_sec: 1.085e-6 (41)
1: gcd_iter_____________ iter_sec: 1.215e-6 (45)
n: string_repli_rec_____ iter_sec: 3.352e-5 (100: base)
n: string_repli_tail____ iter_sec: 2.486e-6 (7)        
1: string_repli_tailopt_ iter_sec: 2.411e-6 (7)        
1: string_repli_iter____ iter_sec: 2.214e-6 (7)        
n: treeforeach_rec______ iter_sec: 2.247e-4 (100: base)
n: treeforeach_tail_____ iter_sec: 2.922e-4 (130)
1: treeforeach_tailopt__ iter_sec: 2.813e-4 (125)

tailopt is bringing an additional speed improvement, like in the browser cases.

Conclusion

We can write clear Javascript recursive code that runs almost as fast as the iterative code.

This seems particularly adequate when having:

Why are Javascript engines *not already* optimizing this?

function gcd_rec(a, b) {
    if (a > b) {
        return gcd_rec(a-b, b);
    } else if (a < b) {
        return gcd_rec(a, b-a);
    }
    // a === b
    return a;
}

Future work

Done (new implementation that supports mutual recursion and solves name collisions).

Appendices

On another page. They include standard deviation statistics and a comparison with trampoline implementations.


Produced on 2016-02-02 by index.scm - by Guillaume Lathoud (glathoud _at_ yahoo _dot_ fr)Valid HTML 4.01 Strict