Number Crunching in Scala

name: inverse
class: center, middle, inverse

# Number Crunching
## in Scala
Chris Stucchio, [BayesianWitch](http://www.bayesianwitch.com)

---
# Motivation

- "6 WEIRD Facts about Elephants. You'll be SHOCKED by #4!"
- "An ordinary housewife rode an Elephant. You'll never believe what happened next..."

Which title is more viral?

## How it works

- User visits BW customer website (publisher, blogger, etc).
- Title tag is not populated.
- Browser requests title from server.
- Server must respond in 400ms or less with the optimal title.
- **Finding optimal title involves solving coupled PDEs, minimizing high dimensional objective function, and statistical sampling.**
- If no response in 400ms, random (suboptimal) title is displayed.

---
class: middle
# Scala + Akka

- Great for realtime streaming
- Error tolerant
- Fast
- Concurrent

---
class: middle
# Python

- Easy for beginners
- Not ugly like Perl
- **Numpy**
- **Scipy**
- **Matplotlib**
- **Bokeh**

---
class: middle
# What to do

## Realtime streaming, low latency

Akka sounds great.

## Solve PDE

Python sounds great.

## JSON/Thrift/ProtoBuff network API connecting the two

Doesn't sound great.

---
# Example problem

Have true conversion rate `z`, time is currently `t=0`. Don't know it!

All I have is *opinions* about it:

[ f(0,t=0), f(0.01, t=0), f(0.02, t=0), ... ]

Function `f(x,t=0)` represents *probability* that `z=x` at `t=0`.

Unfortunately conversion rates vary with time - *Mulayam sticks to his guns on new rape law* has a different conversion rate today than in one week.

Key point: measure `z` perfectly today, tomorrow still lots of uncertainty. It's actually `z(t)`.

---
# Example solution

x = [0.00, 0.01, 0.02, ...]

At time `t`, show user article, he clicked it!

f_after_click = x * f_before_click
    f_after_click /= sum(f_after_click)

Update from Bayes Theorem. Use `(1-x)` if he didn't click.

This is how you add new data.

---
# Example solution

Have `f_0` representing our belief at `t=0`. No new data. Want `f_1` representing our (increased) uncertainty at `t=1`. Solve PDE:

$$ \frac{df(z,t)}{dt} = \frac{d}{d z}\left[\theta\left( z-\frac{b-a}{a+b+2}\right) f(z,t)\right] + \frac{d^2}{dz^2} \left[ \sqrt{ \frac{\theta(1-z)}{2(a+b+2)}} f(z,t) \right] $$

Represents *random walk* - if probability of conversion is now 10%, it might be 11% or 9% tomorrow.

---
# Example solution

Have `f_0` representing our belief at `t=0`. No new data. Want `f_1` representing our (increased) uncertainty at `t=1`. Solve PDE:

$$ \frac{df(z,t)}{dt} = \frac{d}{d z}\left[\theta\left( z-\frac{b-a}{a+b+2}\right) f(z,t)\right] + \frac{d^2}{dz^2} \left[ \sqrt{ \frac{\theta(1-z)}{2(a+b+2)}} f(z,t) \right] $$

f_jacobi_basis = matrix @ f_normal_basis

$$ f(z,0) = \sum_n f_n [m(z) Q_n(z)] $$

Matrix used to find $@ f_n $@.

---
# Example solution

Have `f_0` representing our belief at `t=0`. No new data. Want `f_1` representing our (increased) uncertainty at `t=1`. Solve PDE:

$$ \frac{df(z,t)}{dt} = \frac{d}{d z}\left[\theta\left( z-\frac{b-a}{a+b+2}\right) f(z,t)\right] + \frac{d^2}{dz^2} \left[ \sqrt{ \frac{\theta(1-z)}{2(a+b+2)}} f(z,t) \right] $$

f_jacobi_basis = matrix @ f_normal_basis
    f_jacobi_basis *= exp( eigenvalues * t )

$$ f(z,t) = \sum_n f_n e^{ E_n t} [m(z) Q_n(z)] $$
Get solution at future time by multiplying exponential of eigenvalues times time.

---
# Example solution

Have `f_0` representing our belief at `t=0`. No new data. Want `f_1` representing our (increased) uncertainty at `t=1`. Solve PDE:

$$ \frac{df(z,t)}{dt} = \frac{d}{d z}\left[\theta\left( z-\frac{b-a}{a+b+2}\right) f(z,t)\right] + \frac{d^2}{dz^2} \left[ \sqrt{ \frac{\theta(1-z)}{2(a+b+2)}} f(z,t) \right] $$

f_jacobi_basis = matrix @ f_normal_basis
    f_jacobi_basis *= exp( eigenvalues * t )
    f_t = inverse(matrix) @ f_jacobi_basis

Finally get back solution at points of $@z$@ by inverting transform.

---
# Net Result
![changing conversion rate](changing_conversion_rate.png)

[How to measure a changing conversion rate](http://www.chrisstucchio.com/blog/2013/time_varying_conversion_rates.html)

Everything you just saw was done in Python. :(

---
name: inverse
class: center, middle, inverse

# Problem #1:
## Idiomatic Scala is Slow

---
class: center, middle
# 2.0*x+3.0
## for x a vector
---
# Containers are evil

## Idiomatic Scala

val x: List[Double] = ... //size 8M
    x.map(x => 2.0*x+3.0)

Time:

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.Double.valueOf(Double.java:521)
    at scala.runtime.BoxesRunTime.boxToDouble(Unknown Source)
    ...

(Also uglier than `2*x+3`.)

---
# Containers are evil, retry

## Idiomatic Scala

val x: List[Double] = ... //size 32k
    x.map(x => 2.0*x+3.0)

Time: 29ms

900ns for double multiply and add, as well as boxing, unboxing, adding entry to list...

---
# Arrays

val x: Array[Double] = ... //size 8M
    x.map(x => 2.0*x+3.0)

Time: 240ms

30ns for double multiple and add, as well as function call.

8x faster, doesn't fail for "big" data.
---
# For loops

val x: Array[Double] = ... //size 8M
    var i=0
    while (i < x.size) {
      result(i) = 2.0*x(i)+3.0
      i = i + 1
    }

Time: 21.4ms

2ns/double. That's about the best you'll get.

---
# Arrays + For Loops

- 80x faster than Scala containers.
- Don't die on small data sets

## FUGLY!

Compare:

val x: Array[Double] = ... //size 8M
    var i=0
    while (i < x.size) {
      result(i) = 2.0*x(i)+3.0
      i = i + 1
    }

What we want:

2*x+3

---
name: inverse
class: center, middle, inverse

# Problem #2:
## Expressiveness

---
class: center, middle
# 3.0 * x + y
## for x,y a *sparse* vector

---
# Sparse Vector

> 3.0 *: Map("x" -> 5.0, "y" -> 2.5) + Map("y" -> -1.0, "z" -> 3.0)

Map(x -> 15.0, y -> 6.5, z -> 3.0)

---
# Other vector spaces

val f = (x => x*x)
    val g = (x => 2*x)
    f+g //Works in math, why not here?

Functions `T => Double` are a vector space, lots more are.

## Don't Repeat Yourself

For general purpose algorithm (e.g. Gradient Descent), want to write it once at high level. Should be **type safe** as much as possible.

def gradientDescent[V](init: V,
                           deriv: V => V,
                           objective: V => Double,
                           errorTolerance: Double)
                     (implicit vs: VectorSpace[V]): V =
        ....

---
# Mathematical unification

Boolean algebra:

true && false == false

## Boolean Algebra

Applies to more things than just Booleans - predicates, for example:

val f: T => Boolean == ...
    val g: T => Boolean == ...

(f & g)(x) == f(x) && g(x)

---
name: inverse
class: center, middle, inverse

# Spire
[https://github.com/non/spire](https://github.com/non/spire)

---
# Provides math typeclasses

trait Semigroup[T] {
      def plus(x: T, y: T)
    }

Must satisfy:

s.plus(s.plus(x,y),z) == s.plus(x, s.plus(y,z))

(Associative law)

**Warning:** I'm oversimplifying the type signatures.

**Example**: `NonEmptyList[T]`

---
# Provides math primitives

trait Monoid[T] extends Semigroup[T] {
      def id: T
    }

Must also satisfy:

g.plus(x, g.id) == x

(Identity)

**Example**: `List[T]`. Identity is `Nil`.

---
# Provides math primitives

trait Group[T] extends Monoid[T] {
      def inv(x: T): T
    }

Must satisfy:

g.plus(x, g.inv(x)) == g.id

(Existence of inverse)

**Example**: `Int`. Identity is `0`. Inverse is `x => -x`.

---
# Provides math primitives

trait Ring[T] extends Semigroup[T, S] {
      def id: T
      def plus(x: T, y: T): T
      def times(x: T, y: T): T
    }

Must also satisfy:

x * (y+z) == x*y+x*z

(Distributive)

**Example**: `Int`, `Matrix[Double]`.

---
# Vector Spaces

trait VectorSpace[R,V] {
      implicit def scalar: Field[R]
      def timesl(r: R, v: V): V
      def plus(v1: V, v2: V): V
      ..
    }

Implicits are available:

implicit class ArrayVectorSpace[K,V] extends VectorSpace[V, Array[V]] {
      ...
    }

implicit class MapVectorSpace[K,V] extends VectorSpace[V, Map[K,V]] {
      ...
    }

---
# Vector Spaces

Array(1,2,3) + Array(3,2,1) == Array(4,4,4)

Map("x" -> 1, "y" -> 2) + Map("y" -> 1, "z" -> 1)
      == Map("x"->1, "y" -> 3, "z" -> 1)

---
# Cfor macro

var sum = 0
    cfor(0)(_ < x.size, _ + 1)(i => {
      sum += x(i)
    })

Equivalent to standard c for-loop. Everything is inlined. No function calls.

*Scala does need macros, if you care about low level performance.*

---
name: inverse
class: center, middle, inverse

# Breeze
[https://github.com/scalanlp/breeze](https://github.com/scalanlp/breeze)

---
# Goal: Numpy for Scala

![issue tracker](breeze_issue_tracker.png)

---
# DenseVector

scala> val x = DenseVector.rangeD(0,1,0.01)
    res2: breeze.linalg.DenseVector[Double]
        = DenseVector(0.0, 0.01, 0.02...0.98, 0.99)

scala> x :* x
    res8: breeze.linalg.DenseVector[Double]
        = DenseVector(0.0, 1.0E-4, 4.0E-4,... )

scala> x :+ x
    res0: breeze.linalg.DenseVector[Int] = DenseVector(0.0, 0.02, 0.04...1.98)

Thin wrapper around array, so fast. This works:

x*2+3

Doesn't work:

2*x+3

---
# DenseVector

Like Numpy, multiple `DenseVector` objects can be backed by the same array.

scala> val x = DenseVector.rangeD(0,1,0.01)
        = DenseVector(0.0, 0.01, 0.02...0.98, 0.99)
    scala> val y = new DenseVector(x.data, 2, 2, 10)
    y: breeze.linalg.DenseVector[Double]
        = DenseVector(0.02, 0.04, 0.06, 0.08, 0.1, 0.12, 0.14, 0.16, 0.18, 0.2)

Updates to `x` will also affect `y` so be careful:

scala> x.update(2, 23)
    scala> y
    res14: breeze.linalg.DenseVector[Double]
        = DenseVector(23.0, 0.04, 0.06, 0.08, 0.1, 0.12, 0.14, 0.16, 0.18, 0.2)

(This seems weird but is VERY handy sometimes, particularly for avoiding GC.)

---
# SparseVector

scala> val v = SparseVector[Double](50)(0->1.0, 5->1.0, 20->1.0)
    v: breeze.linalg.SparseVector[Double]
        = SparseVector((0,1.0), (5,1.0), (20,1.0))

scala> val w = SparseVector[Double](50)(0->2, 4->1, 20->2)
    w: breeze.linalg.SparseVector[Double]
        = SparseVector((0,2.0), (4,1.0), (20,2.0))

scala> v + w
    res0: breeze.linalg.SparseVector[Double]
        = SparseVector((0,3.0), (4,1.0), (5,1.0), (20,3.0))

Note the first element of the constructor is the maximum index.

Dense vectors play nicely with sparse ones:

scala> val yy = DenseVector.rangeD(0,50,1)
    scala> w + yy
    res3: breeze.linalg.DenseVector[Double] = DenseVector(2.0, 1.0, 2.0,
        3.0, 5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0...)

---
# DenseMatrix

val x = DenseMatrix.eye[Double](3) * 0.5
    x: breeze.linalg.DenseMatrix[Double] =
    0.5  0.0  0.0
    0.0  0.5  0.0
    0.0  0.0  0.5

scala> x * x
    0.25  0.0   0.0
    0.0   0.25  0.0
    0.0   0.0   0.25

scala> val v = DenseVector[Double](1,2,3)
    scala> x * v
    res14: breeze.linalg.DenseVector[Double] = DenseVector(0.5, 1.0, 1.5)

scala> inv(x)
    2.0  -0.0  -0.0
    0.0  2.0   -0.0
    0.0  0.0   2.0

---
# Matrix operations

- Fast. Breeze uses [Netlib-Java](https://github.com/fommil/netlib-java/) for most operations. Uses Blas libraries, using java libs as a fallback.
- Type safe.
- Low level operations can be written using fast primitives - no need for cython-like libs to roll your own low-level implementation.

Unsafe methods give nearly [1] "as fast as C" performance:

var result: Double = 0
    cfor(0)(i => i < v.size, i => i+1)(i => {
      result += v.unsafeValueAt(i)
      v.unsafeUpdate(i, result)
    })

[1] They don't use SIMD. If anyone knows how to make the JVM use SIMD, I'd love to hear.

---
# UFunc - polymorphic functions

scala> val x: Double = 5
    scala> sum(x)
    res4: Double = 5

scala> val y: DenseVector.range(0,10)
    scala> sum(y)
    res5: Double = 55

The `sum` function seems to work on nearly everything.

---
# UFunc - polymorphic functions

Handled via implicits

object sum extends UFunc {
        implicit def vectorImpl[@expand.args(Int, Double, Float, Long), T]
          = new Impl[DenseVector[T],T] {
            def apply(v: T): S = {
               var result: T = 0
               cfor(0)(i => i < v.size, i => i+1)(i => {
                 result += v(i)
              })
              result
          }
        }

implicit def scalarImpl[@expand.args(Int, Double, Float, Long), T]
          = new Impl[T,T] {
            def apply(v: T): T = v
        }
    }

(Not the real implementation.)

---
# UFunc - polymorphic functions

trait UFunc {
      final def apply[@specialized(Int, Double, Float) V,
                      @specialized(Int, Double, Float) VR]
                      (v: V)(implicit impl: Impl[V, VR]):VR
                          = impl(v)

final def apply[@specialized(Int, Double, Float) V1,
                      @specialized(Int, Double, Float) V2,
                      @specialized(Int, Double, Float) VR]
                      (v1: V1, v2: V2)(implicit impl: Impl2[V1, V2, VR]):VR
                          = impl(v1, v2)
      ...
    }

---
# Polynomials

val p = Polynomial.dense(Array[Double](1,2,4))
    p(0.0) === 1.0
    p(0.5) === 3.0

Polynomials work on matrices, vectors, as expected[1]:

p(DenseVector(0.0, 0.5)) === DenseVector(1.0, 3.0)

[1] Some implementations are unfinished. If so let me know and I'll add it for you.

---
# Statistics

scala> val d = new Beta(26, 34)
    scala> d.sample(5)
    res9: IndexedSeq[Double] = Vector(0.5187421555179692, 0.4949550069324163,
                                      0.2992490222823595, 0.3872017877436448,
                                      0.4024336597885077)

All the nice features you expect, e.g. `pdf`:

scala> val x = DenseVector.rangeD(0,1,0.1)
    scala> d.pdf(x)
    res10: breeze.linalg.DenseVector[Double]
      = DenseVector(0.0, 3.181962409216E-9, 0.0021898365265868606, 0.6744642295797582,
                    5.535836809100089, 3.57233751778091, 0.21599972917689683, 7.676170973244001E-4,
                    3.3414253640546295E-8, 7.391881042962405E-17)

---
# Optimization

scala> val opt = new ProjectedQuasiNewton(tolerance=1e-9)
    scala> val f = new DiffFunction[DenseVector[Double]] {
        def calculate(x: DenseVector[Double]) = {
          (sum((x - 3.0) :^ 2.0), (x * 2.0) - 6.0)
        }
      }
    val result = optimizer.minimize(f, init)

---
# Signal Processing

scala> val TriangleKernel = DenseVector(0.25, 0.5, 0.25)
    scala> val data = getSignalFromOutsideWorld
    scala> val smoothed = convolve(data, kernel)
    scala> val correlations = correlate(smoothed, expectedHiddenSignal)

---
# Breeze - good and bad

- Spire and Breeze were developed somewhat independently of each other. As a result, Breeze has it's own *incompatible* implementation of `Ring`, `Field`, `VectorSpace`, etc. Urgh. (David Hall and I have both attempted to fix this, but it's not easy at this point.)
- Numpy Parity (TM) is a LONG way off. Missing features include, e.g., 2d arrays, lots of ufuncs, fitting distributions, etc.

---
name: inverse
class: center, middle, inverse
# Saddle
[https://github.com/saddle/saddle](https://github.com/saddle/saddle)

---
# R/Pandas for Scala

- Provides vectors, matrix similar to Breeze,
- Key feature: `Series` - a `Vec` with an index (roughly a map)

scala> val s = Series(d"2013-01-04T00:00:00.000-05:00" -> 5, d"2013-01-18T00:00:00.000-05:00" -> 7, d"2013-02-01T00:00:00.000-05:00" -> 4)

- Internal representation is `(Array[DateTime], Array[Int])` making computations efficient.

---
# Nice IO

import org.saddle.io._
    val file = CsvFile("P00000001-ALL.csv")

// parse columns 2 and 9 of the CSV and convert the result to a Frame
    // (we know in advance these cols are candidate name and donation amount)
    val frame = CsvParser.parse(List(2,9))(file).withRowIndex(0).withColIndex(0)

// convert frame body data to long primitives, mapping any parse errors to NA
    val data = frame.mapValues(CsvParser.parseLong)

But not nice like it could be:

from pandas import *
    df = read_csv("P00000001-ALL.csv")
    data = (df['name'], df['donation_amount'])

---
# Stuff it doesn't do

- Automagically load data/type infer like Pandas (hard in static typed system). Maybe a job for `scala.Dynamic`?
- Pandas-style joins
- Spire compatibility
- Breeze compatibility

Last one very annoying. Can't call, e.g.

val frame = CsvParser.parse(List(2,9))(csvFile).withRowIndex(0).withColIndex(0)
    val data = frame.mapValues(CsvParser.parseLong)

val dataAsBreeze = new DataFrame(data.underlyingArray) //Don't make me do this!

val sampleDist = breeze.stats.VariableKernelEmpiricalDistribution(dataAsBreeze)

---
name: inverse
class: center, middle, inverse
# Visualization
[Breeze-Bokeh](https://github.com/stucchio/breeze-bokeh) and [Breeze-Viz](https://github.com/scalanlp/breeze-viz)

---
# Breeze-Viz

Native java library based on [JFreeChart](http://www.jfree.org/).

![jfrechart](j_free_chart.png)

---
# Breeze-Bokeh

Javascript-generaing library based on Continuum Analytics' [Bokeh](http://bokeh.pydata.org/), specifically the javascript portions. <a href="bokeh.html" target="#">Click here</a>

---
# Which to use

- Breeze-viz semi abandoned
- Breeze-Bokeh wildly experimental, supports very little
- JFreeChart is not under heavy development, Bokeh certainly is.
- Browser vs Desktop
- I'm a bit biased

---
name: inverse
class: center, middle, inverse
# Bigger Data Sets
[Scalding](https://github.com/twitter/scalding)

---
# Hadoop/Cascading

Hadoop is fun again!

class JennyABTest(args: Args) extends DailySessionJob(args) with SumsStatistics {
      InInterval
      .filter('dailySession)((c:EventSequence) => Users.isValidUser(c.head))
      .flatMap('dailySession -> 'version)((c:EventSequence) => {
        c().find(b => PageViews.isPostShowPage(b)).map( _.data("vw") )
      })
      .map('dailySession -> 'score)((c:EventSequence) => {
        val pageViews: Long = c.count((b:Event) => PageViews.isPageView(b) )
        pageViews
      })
      .map('dailySession -> 'userIdent)( (c:EventSequence) => c.userIdentString )
      .project('userIdent, 'score, 'version)
      .sumL( ('userIdent, 'version), 'score -> 'score, reducers = 6)
      .longToDouble('score -> 'score)
      .sizeMeanVar('version, 'score, ('mean, 'variance, 'count))
      .write( Tsv( args("output") ) )
    }

---
# Hadoop/Cascading

Hadoop is fun again!

class JennyABTest(args: Args) extends DailySessionJob(args) with SumsStatistics {
      InInterval

Data looks like this:

Map('dailySession -> EventSequence)

---
# Hadoop/Cascading

Hadoop is fun again!

Data looks like this:

Map('dailySession -> EventSequence,
        'version      -> versionString)

---
# Hadoop/Cascading

Hadoop is fun again!

Data looks like this:

Map('dailySession -> EventSequence,
        'version      -> versionString,
        'score        -> pageViews)

---
# Hadoop/Cascading

.project('userIdent, 'score, 'version)

Throws away everything but `userIdent`, `score` and `version`.

Map('dailySession -> EventSequence,
        'version      -> versionString,
        'score        -> pageViews)

Aggregation:

.sumL( ('userIdent, 'version), 'score -> 'score, reducers = 6)

Group by `(userIdent, version)` pairs, and compute sum of scores for each element in the group.

---
# Scalding

- Great if you live in Hadoop-land.
- No integration whatsoever with Breeze, Spire, etc.
- Lives entirely in twitter-land, uses [Algebird](https://github.com/twitter/algebird) for math.

My use case: take all the logs, dump data to CSV. Load CSV and then do matrix stuff with Breeze.

Danger of Scalding:

java.lang.OutOfMemoryError: GC overhead limit exceeded

Scala creates a lot more objects than Java. In Hadoop this can be a big performance killer.

---
name: inverse
class: center, middle, inverse
# Conclusion

---
# Conclusion
- Up and coming libraries. Breeze isn't numpy/scipy, but it's getting closer.
- Breeze-Bokeh playing catch-up to Python-Bokeh.
- Saddle - building Pandas for Scala is tough, due to heavy use of dynamic features.
- Scalding is best Hadoop abstraction layer I've seen. Idiomatic Scala FTW!

Why can't we all play nicely together?

![get along](Ennerdale_three_children_hugging.jpg)