Another cool feature offered by Spire is the cfor macro. cfor is basically a C-style for loop:
cfor(0)(i => i < maxSize, i => i + 1)(i => {
  result = result + i*i
})
The purpose of cfor is not to do anything you can't do with ordinary Scala iterator primitives. Its sole purpose is performance, since unfortunately the performance of ordinary Scala iteration is pretty terrible.
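For completeness, here is what a runnable version of the snippet above might look like. This is a sketch: the import spire.syntax.cfor._ line assumes a reasonably recent Spire (older versions expose cfor through spire.implicits._ instead), and the SumOfSquares object and maxSize parameter are just illustrative names.

import spire.syntax.cfor._

object SumOfSquares {
  // Equivalent to: for (i = 0; i < maxSize; i++) result += i*i
  def sumOfSquares(maxSize: Int): Long = {
    var result = 0L
    cfor(0)(i => i < maxSize, i => i + 1)(i => {
      result = result + i.toLong * i
    })
    result
  }
}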
I've also written a few other posts on Spire, including this post on boolean algebras in Spire and this post on vector spaces in Spire, which might be useful.
Consider the following problem. We want to apply the affine transformation (x: Double) => 2.0*x + 3.0 to an Array[Double]. The idiomatic way to define this is as follows:
object CFor {
  def mapMultiply(x: Array[Double]): Array[Double] = x.map(x => 2.0*x + 3.0)
}
Now let's benchmark this. We can build a benchmark class using the scala.testing.Benchmark library:
class MultiplyBenchmark(func: Array[Double] => Array[Double]) extends testing.Benchmark {
  var x: Array[Double] = null

  override def setUp() = {
    if (x == null) {
      x = new Array(8*1024*1024)
      cfor(0)(_ < x.size, _ + 1)(i => {
        x(i) = java.lang.Math.random()
      })
    }
  }

  def run = func(x)
}
object CForTest {
  val CForTest = new MultiplyBenchmark(CFor.cforMultiply)
  val MapTest = new MultiplyBenchmark(CFor.mapMultiply)
}
Running this in the console on my machine yields:
scala> CForTest.MapTest.runBenchmark(100).sum / 100.0
res2: Double = 240.03
(This number is in milliseconds; runBenchmark returns one timing per run, so this is the mean time per run.)
Now suppose we implement the same function the way we would in idiomatic C, translated to Scala:
def cforMultiply(x: Array[Double]): Array[Double] = {
  val result = new Array[Double](x.size)
  cfor(0)(_ < x.size, _ + 1)(i => {
    result(i) = 2.0*x(i) + 3.0
  })
  result
}
The identical benchmark runs over 10x faster:
scala> CForTest.CForTest.runBenchmark(100).sum / 100.0
res1: Double = 21.42
Now let's consider an example with two arrays:
def cforMultiply(x: Array[Double], y: Array[Double]): Array[Double] = {
  val result = new Array[Double](x.size)
  cfor(0)(_ < x.size, _ + 1)(i => {
    result(i) = x(i)*y(i)
  })
  result
}
def zipMultiply(x: Array[Double], y: Array[Double]): Array[Double] = x.zip(y).map(p => p._1 * p._2)
The cforMultiply version takes 26.1ms to run. The zipMultiply version takes 7.468 seconds to run! Holy shit, idiomatic Scala is slow.
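To see where all that time goes, here is a rough sketch of the work that zip and map imply. This is illustrative only, not the actual standard-library code (the real implementations go through ArrayOps and builders), but the allocation pattern is roughly this:

// Roughly what x.zip(y).map(p => p._1 * p._2) has to do:
def zipMultiplySketch(x: Array[Double], y: Array[Double]): Array[Double] = {
  // zip allocates a whole intermediate array of Tuple2 objects,
  // boxing both doubles in each pair.
  val pairs = new Array[(Double, Double)](x.size)
  var i = 0
  while (i < x.size) {
    pairs(i) = (x(i), y(i))
    i += 1
  }
  // map then walks the intermediate array, unboxing on every access.
  val result = new Array[Double](pairs.size)
  var j = 0
  while (j < pairs.size) {
    result(j) = pairs(j)._1 * pairs(j)._2
    j += 1
  }
  result
}

The intermediate array of boxed pairs, and the garbage it generates, plausibly accounts for most of the ~300x gap.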
Scala - Comparison to Python/Numpy
Note: A previous version of this post incorrectly said Scala was slower than numpy. I misinterpreted the result of timeit: it does not return a number in milliseconds, it returns the time to run all benchmarks, measured in seconds. Thanks to Ichoran and michael_121 for spotting this mistake.
For comparison, here is multiply in idiomatic Python:
2.0*x+3.0
And the benchmarks:
In [3]: timeit.timeit('2.0*x+3.0',
setup="""from numpy import *;
x = random.rand(8*1024*1024)""",
number=100)
<timeit-src>:2: SyntaxWarning: import * only allowed at module level
Out[3]: 4.537637948989868
This works out to 45.37ms per run, about twice as slow as Scala's cfor - I'm guessing because the Scala version iterates over the array once, while the numpy version does so twice (2.0*x allocates a temporary array in one pass, and adding 3.0 makes a second pass).
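Here is what that two-pass evaluation looks like sketched in Scala (illustrative; numpy's actual implementation is C, but the memory-traffic pattern should be the same):

// 2.0*x + 3.0 in numpy: the first pass builds the temporary 2.0*x,
// the second pass adds 3.0 to the temporary.
def twoPassMultiply(x: Array[Double]): Array[Double] = {
  val tmp = new Array[Double](x.size)
  cfor(0)(_ < x.size, _ + 1)(i => {
    tmp(i) = 2.0 * x(i)          // first pass: writes the temporary array
  })
  val result = new Array[Double](x.size)
  cfor(0)(_ < x.size, _ + 1)(i => {
    result(i) = tmp(i) + 3.0     // second pass: reads the temporary back
  })
  result
}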
Elementwise multiplication, implemented in numpy as x*y, takes 27.5ms (versus 26.1ms for the cfor version above).
Using a while loop
[edit: it was suggested I try a while loop after I wrote this] The cfor version matches the performance you would get simply by using a while loop (21.75ms):
def whileMultiply(x: Array[Double]): Array[Double] = {
  val result = new Array[Double](x.size)
  var i: Int = 0
  while (i < x.size) {
    result(i) = 2.0*x(i) + 3.0
    i += 1
  }
  result
}
This is because cfor is a macro which expands to the corresponding while loop. The implementation can be found here, but the gist of it is:
def oversimplified_cforMacro[A](c: Context)(init: c.Expr[A])
    (test: c.Expr[A => Boolean], next: c.Expr[A => A])
    (body: c.Expr[A => Unit]): c.Expr[Unit] = {
  import c.universe._

  // A fresh name for the loop variable, so it cannot clash with user code.
  val index = TermName(c.freshName("index"))

  val tree = q"""
    var $index = $init
    while ($test($index)) {
      $body($index)
      $index = $next($index)
    }
  """
  new InlineUtil[c.type](c).inlineAndReset[Unit](tree)
}
So using cfor is equivalent to using the corresponding while loop.
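Note also that, per the signature above, cfor is generic in the index type A, so the "counter" doesn't have to be an Int. As a hypothetical sketch, you can walk a List by taking tails:

import spire.syntax.cfor._

// The loop variable can be any type A; here it is a List[Int]
// that we advance with .tail instead of incrementing an index.
def printAll(xs: List[Int]): Unit =
  cfor(xs)(_.nonEmpty, _.tail)(node => {
    println(node.head)
  })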
Lessons learned
Double-check the docs before posting benchmarks. Egg on my face here. (The previous lesson learned was that numpy was faster than Scala+cfor, which is incorrect.)
Another interesting fact I discovered while investigating this is that the JVM has built-in bounds-check elimination. If the JIT compiler can prove that your loop will not run past the bounds of an array, it will eliminate the range check. If I understand right, it does this whenever your indexing function is linear:
cfor(0)(i => i < maxSize, i => i + 1)(i => {
  result(a*i + b) = ...
})
Here the compiler must be able to prove that a and b are constants.
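As a concrete (hypothetical) example of such a loop, with the index linear in i and the bound checkable up front - whether the JIT actually drops the check will depend on your JVM version:

// The index 2*i + 1 is linear in i, and the loop bound x.size / 2
// guarantees 2*i + 1 < x.size, so the per-access range check can
// in principle be hoisted out of the loop.
def everyOther(x: Array[Double]): Array[Double] = {
  val result = new Array[Double](x.size)
  cfor(0)(_ < x.size / 2, _ + 1)(i => {
    result(2*i + 1) = x(2*i + 1)
  })
  result
}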
Code available
Code for this series of blog posts is available on GitHub.