name: inverse class: center, middle, inverse # Number Crunching ## in Scala Chris Stucchio, [BayesianWitch](http://www.bayesianwitch.com) --- # Motivation - "6 WEIRD Facts about Elephants. You'll be SHOCKED by #4!" - "An ordinary housewife rode an Elephant. You'll never believe what happened next..." Which title is more viral? ## How it works - User visits BW customer website (publisher, blogger, etc). - Title tag is not populated. - Browser requests title from server. - Server must respond in 400ms or less with the optimal title. - **Finding optimal title involves solving coupled PDEs, minimizing high dimensional objective function, and statistical sampling.** - If no response in 400ms, random (suboptimal) title is displayed. --- class: middle # Scala + Akka - Great for realtime streaming - Error tolerant - Fast - Concurrent --- class: middle # Python - Easy for beginners - Not ugly like Perl - **Numpy** - **Scipy** - **Matplotlib** - **Bokeh** --- class: middle # What to do ## Realtime streaming, low latency Akka sounds great. ## Solve PDE Python sounds great. ## JSON/Thrift/ProtoBuff network API connecting the two Doesn't sound great. --- # Example problem Have true conversion rate `z`, time is currently `t=0`. Don't know it! All I have is *opinions* about it: [ f(0,t=0), f(0.01, t=0), f(0.02, t=0), ... ] Function `f(x,t=0)` represents *probability* that `z=x` at `t=0`. Unfortunately conversion rates vary with time - *Mulayam sticks to his guns on new rape law* has a different conversion rate today than in one week. Key point: measure `z` perfectly today, tomorrow still lots of uncertainty. It's actually `z(t)`. --- # Example solution x = [0.00, 0.01, 0.02, ...] At time `t`, show user article, he clicked it! f_after_click = x * f_before_click f_after_click /= sum(f_after_click) Update from Bayes Theorem. Use `(1-x)` if he didn't click. This is how you add new data. --- # Example solution Have `f_0` representing our belief at `t=0`. No new data. Want `f_1` representing our (increased) uncertainty at `t=1`. Solve PDE: $$ \frac{df(z,t)}{dt} = \frac{d}{d z}\left[\theta\left( z-\frac{b-a}{a+b+2}\right) f(z,t)\right] + \frac{d^2}{dz^2} \left[ \sqrt{ \frac{\theta(1-z)}{2(a+b+2)}} f(z,t) \right] $$ Represents *random walk* - if probability of conversion is now 10%, it might be 11% or 9% tomorrow. --- # Example solution Have `f_0` representing our belief at `t=0`. No new data. Want `f_1` representing our (increased) uncertainty at `t=1`. Solve PDE: $$ \frac{df(z,t)}{dt} = \frac{d}{d z}\left[\theta\left( z-\frac{b-a}{a+b+2}\right) f(z,t)\right] + \frac{d^2}{dz^2} \left[ \sqrt{ \frac{\theta(1-z)}{2(a+b+2)}} f(z,t) \right] $$ f_jacobi_basis = matrix @ f_normal_basis $$ f(z,0) = \sum_n f_n [m(z) Q_n(z)] $$ Matrix used to find $@ f_n $@. --- # Example solution Have `f_0` representing our belief at `t=0`. No new data. Want `f_1` representing our (increased) uncertainty at `t=1`. Solve PDE: $$ \frac{df(z,t)}{dt} = \frac{d}{d z}\left[\theta\left( z-\frac{b-a}{a+b+2}\right) f(z,t)\right] + \frac{d^2}{dz^2} \left[ \sqrt{ \frac{\theta(1-z)}{2(a+b+2)}} f(z,t) \right] $$ f_jacobi_basis = matrix @ f_normal_basis f_jacobi_basis *= exp( eigenvalues * t ) $$ f(z,t) = \sum_n f_n e^{ E_n t} [m(z) Q_n(z)] $$ Get solution at future time by multiplying exponential of eigenvalues times time. --- # Example solution Have `f_0` representing our belief at `t=0`. No new data. Want `f_1` representing our (increased) uncertainty at `t=1`. Solve PDE: $$ \frac{df(z,t)}{dt} = \frac{d}{d z}\left[\theta\left( z-\frac{b-a}{a+b+2}\right) f(z,t)\right] + \frac{d^2}{dz^2} \left[ \sqrt{ \frac{\theta(1-z)}{2(a+b+2)}} f(z,t) \right] $$ f_jacobi_basis = matrix @ f_normal_basis f_jacobi_basis *= exp( eigenvalues * t ) f_t = inverse(matrix) @ f_jacobi_basis Finally get back solution at points of $@z$@ by inverting transform. --- # Net Result ![changing conversion rate](changing_conversion_rate.png) [How to measure a changing conversion rate](http://www.chrisstucchio.com/blog/2013/time_varying_conversion_rates.html) Everything you just saw was done in Python. :( --- name: inverse class: center, middle, inverse # Problem #1: ## Idiomatic Scala is Slow --- class: center, middle # 2.0*x+3.0 ## for x a vector --- # Containers are evil ## Idiomatic Scala val x: List[Double] = ... //size 8M x.map(x => 2.0*x+3.0) Time: java.lang.OutOfMemoryError: GC overhead limit exceeded at java.lang.Double.valueOf(Double.java:521) at scala.runtime.BoxesRunTime.boxToDouble(Unknown Source) ... (Also uglier than `2*x+3`.) --- # Containers are evil, retry ## Idiomatic Scala val x: List[Double] = ... //size 32k x.map(x => 2.0*x+3.0) Time: 29ms 900ns for double multiply and add, as well as boxing, unboxing, adding entry to list... --- # Arrays val x: Array[Double] = ... //size 8M x.map(x => 2.0*x+3.0) Time: 240ms 30ns for double multiple and add, as well as function call. 8x faster, doesn't fail for "big" data. --- # For loops val x: Array[Double] = ... //size 8M var i=0 while (i < x.size) { result(i) = 2.0*x(i)+3.0 i = i + 1 } Time: 21.4ms 2ns/double. That's about the best you'll get. --- # Arrays + For Loops - 80x faster than Scala containers. - Don't die on small data sets ## FUGLY! Compare: val x: Array[Double] = ... //size 8M var i=0 while (i < x.size) { result(i) = 2.0*x(i)+3.0 i = i + 1 } What we want: 2*x+3 --- name: inverse class: center, middle, inverse # Problem #2: ## Expressiveness --- class: center, middle # 3.0 * x + y ## for x,y a *sparse* vector --- # Sparse Vector > 3.0 *: Map("x" -> 5.0, "y" -> 2.5) + Map("y" -> -1.0, "z" -> 3.0) Map(x -> 15.0, y -> 6.5, z -> 3.0) --- # Other vector spaces val f = (x => x*x) val g = (x => 2*x) f+g //Works in math, why not here? Functions `T => Double` are a vector space, lots more are. ## Don't Repeat Yourself For general purpose algorithm (e.g. Gradient Descent), want to write it once at high level. Should be **type safe** as much as possible. def gradientDescent[V](init: V, deriv: V => V, objective: V => Double, errorTolerance: Double) (implicit vs: VectorSpace[V]): V = .... --- # Mathematical unification Boolean algebra: true && false == false ## Boolean Algebra Applies to more things than just Booleans - predicates, for example: val f: T => Boolean == ... val g: T => Boolean == ... (f & g)(x) == f(x) && g(x) --- name: inverse class: center, middle, inverse # Spire [https://github.com/non/spire](https://github.com/non/spire) --- # Provides math typeclasses trait Semigroup[T] { def plus(x: T, y: T) } Must satisfy: s.plus(s.plus(x,y),z) == s.plus(x, s.plus(y,z)) (Associative law) **Warning:** I'm oversimplifying the type signatures. **Example**: `NonEmptyList[T]` --- # Provides math primitives trait Monoid[T] extends Semigroup[T] { def id: T } Must also satisfy: g.plus(x, g.id) == x (Identity) **Example**: `List[T]`. Identity is `Nil`. --- # Provides math primitives trait Group[T] extends Monoid[T] { def inv(x: T): T } Must satisfy: g.plus(x, g.inv(x)) == g.id (Existence of inverse) **Example**: `Int`. Identity is `0`. Inverse is `x => -x`. --- # Provides math primitives trait Ring[T] extends Semigroup[T, S] { def id: T def plus(x: T, y: T): T def times(x: T, y: T): T } Must also satisfy: x * (y+z) == x*y+x*z (Distributive) **Example**: `Int`, `Matrix[Double]`. --- # Vector Spaces trait VectorSpace[R,V] { implicit def scalar: Field[R] def timesl(r: R, v: V): V def plus(v1: V, v2: V): V .. } Implicits are available: implicit class ArrayVectorSpace[K,V] extends VectorSpace[V, Array[V]] { ... } implicit class MapVectorSpace[K,V] extends VectorSpace[V, Map[K,V]] { ... } --- # Vector Spaces Array(1,2,3) + Array(3,2,1) == Array(4,4,4) Map("x" -> 1, "y" -> 2) + Map("y" -> 1, "z" -> 1) == Map("x"->1, "y" -> 3, "z" -> 1) --- # Cfor macro var sum = 0 cfor(0)(_ < x.size, _ + 1)(i => { sum += x(i) }) Equivalent to standard c for-loop. Everything is inlined. No function calls. *Scala does need macros, if you care about low level performance.* --- name: inverse class: center, middle, inverse # Breeze [https://github.com/scalanlp/breeze](https://github.com/scalanlp/breeze) --- # Goal: Numpy for Scala ![issue tracker](breeze_issue_tracker.png) --- # DenseVector scala> val x = DenseVector.rangeD(0,1,0.01) res2: breeze.linalg.DenseVector[Double] = DenseVector(0.0, 0.01, 0.02...0.98, 0.99) scala> x :* x res8: breeze.linalg.DenseVector[Double] = DenseVector(0.0, 1.0E-4, 4.0E-4,... ) scala> x :+ x res0: breeze.linalg.DenseVector[Int] = DenseVector(0.0, 0.02, 0.04...1.98) Thin wrapper around array, so fast. This works: x*2+3 Doesn't work: 2*x+3 --- # DenseVector Like Numpy, multiple `DenseVector` objects can be backed by the same array. scala> val x = DenseVector.rangeD(0,1,0.01) = DenseVector(0.0, 0.01, 0.02...0.98, 0.99) scala> val y = new DenseVector(x.data, 2, 2, 10) y: breeze.linalg.DenseVector[Double] = DenseVector(0.02, 0.04, 0.06, 0.08, 0.1, 0.12, 0.14, 0.16, 0.18, 0.2) Updates to `x` will also affect `y` so be careful: scala> x.update(2, 23) scala> y res14: breeze.linalg.DenseVector[Double] = DenseVector(23.0, 0.04, 0.06, 0.08, 0.1, 0.12, 0.14, 0.16, 0.18, 0.2) (This seems weird but is VERY handy sometimes, particularly for avoiding GC.) --- # SparseVector scala> val v = SparseVector[Double](50)(0->1.0, 5->1.0, 20->1.0) v: breeze.linalg.SparseVector[Double] = SparseVector((0,1.0), (5,1.0), (20,1.0)) scala> val w = SparseVector[Double](50)(0->2, 4->1, 20->2) w: breeze.linalg.SparseVector[Double] = SparseVector((0,2.0), (4,1.0), (20,2.0)) scala> v + w res0: breeze.linalg.SparseVector[Double] = SparseVector((0,3.0), (4,1.0), (5,1.0), (20,3.0)) Note the first element of the constructor is the maximum index. Dense vectors play nicely with sparse ones: scala> val yy = DenseVector.rangeD(0,50,1) scala> w + yy res3: breeze.linalg.DenseVector[Double] = DenseVector(2.0, 1.0, 2.0, 3.0, 5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0...) --- # DenseMatrix val x = DenseMatrix.eye[Double](3) * 0.5 x: breeze.linalg.DenseMatrix[Double] = 0.5 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.5 scala> x * x 0.25 0.0 0.0 0.0 0.25 0.0 0.0 0.0 0.25 scala> val v = DenseVector[Double](1,2,3) scala> x * v res14: breeze.linalg.DenseVector[Double] = DenseVector(0.5, 1.0, 1.5) scala> inv(x) 2.0 -0.0 -0.0 0.0 2.0 -0.0 0.0 0.0 2.0 --- # Matrix operations - Fast. Breeze uses [Netlib-Java](https://github.com/fommil/netlib-java/) for most operations. Uses Blas libraries, using java libs as a fallback. - Type safe. - Low level operations can be written using fast primitives - no need for cython-like libs to roll your own low-level implementation. Unsafe methods give nearly [1] "as fast as C" performance: var result: Double = 0 cfor(0)(i => i < v.size, i => i+1)(i => { result += v.unsafeValueAt(i) v.unsafeUpdate(i, result) }) [1] They don't use SIMD. If anyone knows how to make the JVM use SIMD, I'd love to hear. --- # UFunc - polymorphic functions scala> val x: Double = 5 scala> sum(x) res4: Double = 5 scala> val y: DenseVector.range(0,10) scala> sum(y) res5: Double = 55 The `sum` function seems to work on nearly everything. --- # UFunc - polymorphic functions Handled via implicits object sum extends UFunc { implicit def vectorImpl[@expand.args(Int, Double, Float, Long), T] = new Impl[DenseVector[T],T] { def apply(v: T): S = { var result: T = 0 cfor(0)(i => i < v.size, i => i+1)(i => { result += v(i) }) result } } implicit def scalarImpl[@expand.args(Int, Double, Float, Long), T] = new Impl[T,T] { def apply(v: T): T = v } } (Not the real implementation.) --- # UFunc - polymorphic functions trait UFunc { final def apply[@specialized(Int, Double, Float) V, @specialized(Int, Double, Float) VR] (v: V)(implicit impl: Impl[V, VR]):VR = impl(v) final def apply[@specialized(Int, Double, Float) V1, @specialized(Int, Double, Float) V2, @specialized(Int, Double, Float) VR] (v1: V1, v2: V2)(implicit impl: Impl2[V1, V2, VR]):VR = impl(v1, v2) ... } --- # Polynomials val p = Polynomial.dense(Array[Double](1,2,4)) p(0.0) === 1.0 p(0.5) === 3.0 Polynomials work on matrices, vectors, as expected[1]: p(DenseVector(0.0, 0.5)) === DenseVector(1.0, 3.0) [1] Some implementations are unfinished. If so let me know and I'll add it for you. --- # Statistics scala> val d = new Beta(26, 34) scala> d.sample(5) res9: IndexedSeq[Double] = Vector(0.5187421555179692, 0.4949550069324163, 0.2992490222823595, 0.3872017877436448, 0.4024336597885077) All the nice features you expect, e.g. `pdf`: scala> val x = DenseVector.rangeD(0,1,0.1) scala> d.pdf(x) res10: breeze.linalg.DenseVector[Double] = DenseVector(0.0, 3.181962409216E-9, 0.0021898365265868606, 0.6744642295797582, 5.535836809100089, 3.57233751778091, 0.21599972917689683, 7.676170973244001E-4, 3.3414253640546295E-8, 7.391881042962405E-17) --- # Optimization scala> val opt = new ProjectedQuasiNewton(tolerance=1e-9) scala> val f = new DiffFunction[DenseVector[Double]] { def calculate(x: DenseVector[Double]) = { (sum((x - 3.0) :^ 2.0), (x * 2.0) - 6.0) } } val result = optimizer.minimize(f, init) --- # Signal Processing scala> val TriangleKernel = DenseVector(0.25, 0.5, 0.25) scala> val data = getSignalFromOutsideWorld scala> val smoothed = convolve(data, kernel) scala> val correlations = correlate(smoothed, expectedHiddenSignal) --- # Breeze - good and bad - Spire and Breeze were developed somewhat independently of each other. As a result, Breeze has it's own *incompatible* implementation of `Ring`, `Field`, `VectorSpace`, etc. Urgh. (David Hall and I have both attempted to fix this, but it's not easy at this point.) - Numpy Parity (TM) is a LONG way off. Missing features include, e.g., 2d arrays, lots of ufuncs, fitting distributions, etc. --- name: inverse class: center, middle, inverse # Saddle [https://github.com/saddle/saddle](https://github.com/saddle/saddle) --- # R/Pandas for Scala - Provides vectors, matrix similar to Breeze, - Key feature: `Series` - a `Vec` with an index (roughly a map) scala> val s = Series(d"2013-01-04T00:00:00.000-05:00" -> 5, d"2013-01-18T00:00:00.000-05:00" -> 7, d"2013-02-01T00:00:00.000-05:00" -> 4) - Internal representation is `(Array[DateTime], Array[Int])` making computations efficient. --- # Nice IO import org.saddle.io._ val file = CsvFile("P00000001-ALL.csv") // parse columns 2 and 9 of the CSV and convert the result to a Frame // (we know in advance these cols are candidate name and donation amount) val frame = CsvParser.parse(List(2,9))(file).withRowIndex(0).withColIndex(0) // convert frame body data to long primitives, mapping any parse errors to NA val data = frame.mapValues(CsvParser.parseLong) But not nice like it could be: from pandas import * df = read_csv("P00000001-ALL.csv") data = (df['name'], df['donation_amount']) --- # Stuff it doesn't do - Automagically load data/type infer like Pandas (hard in static typed system). Maybe a job for `scala.Dynamic`? - Pandas-style joins - Spire compatibility - Breeze compatibility Last one very annoying. Can't call, e.g. val frame = CsvParser.parse(List(2,9))(csvFile).withRowIndex(0).withColIndex(0) val data = frame.mapValues(CsvParser.parseLong) val dataAsBreeze = new DataFrame(data.underlyingArray) //Don't make me do this! val sampleDist = breeze.stats.VariableKernelEmpiricalDistribution(dataAsBreeze) --- name: inverse class: center, middle, inverse # Visualization [Breeze-Bokeh](https://github.com/stucchio/breeze-bokeh) and [Breeze-Viz](https://github.com/scalanlp/breeze-viz) --- # Breeze-Viz Native java library based on [JFreeChart](http://www.jfree.org/). ![jfrechart](j_free_chart.png) --- # Breeze-Bokeh Javascript-generaing library based on Continuum Analytics' [Bokeh](http://bokeh.pydata.org/), specifically the javascript portions.
Click here
--- # Which to use - Breeze-viz semi abandoned - Breeze-Bokeh wildly experimental, supports very little - JFreeChart is not under heavy development, Bokeh certainly is. - Browser vs Desktop - I'm a bit biased --- name: inverse class: center, middle, inverse # Bigger Data Sets [Scalding](https://github.com/twitter/scalding) --- # Hadoop/Cascading Hadoop is fun again! class JennyABTest(args: Args) extends DailySessionJob(args) with SumsStatistics { InInterval .filter('dailySession)((c:EventSequence) => Users.isValidUser(c.head)) .flatMap('dailySession -> 'version)((c:EventSequence) => { c().find(b => PageViews.isPostShowPage(b)).map( _.data("vw") ) }) .map('dailySession -> 'score)((c:EventSequence) => { val pageViews: Long = c.count((b:Event) => PageViews.isPageView(b) ) pageViews }) .map('dailySession -> 'userIdent)( (c:EventSequence) => c.userIdentString ) .project('userIdent, 'score, 'version) .sumL( ('userIdent, 'version), 'score -> 'score, reducers = 6) .longToDouble('score -> 'score) .sizeMeanVar('version, 'score, ('mean, 'variance, 'count)) .write( Tsv( args("output") ) ) } --- # Hadoop/Cascading Hadoop is fun again! class JennyABTest(args: Args) extends DailySessionJob(args) with SumsStatistics { InInterval Data looks like this: Map('dailySession -> EventSequence) --- # Hadoop/Cascading Hadoop is fun again! class JennyABTest(args: Args) extends DailySessionJob(args) with SumsStatistics { InInterval .filter('dailySession)((c:EventSequence) => Users.isValidUser(c.head)) .flatMap('dailySession -> 'version)((c:EventSequence) => { c().find(b => PageViews.isPostShowPage(b)).map( _.data("vw") ) }) Data looks like this: Map('dailySession -> EventSequence, 'version -> versionString) --- # Hadoop/Cascading Hadoop is fun again! class JennyABTest(args: Args) extends DailySessionJob(args) with SumsStatistics { InInterval .filter('dailySession)((c:EventSequence) => Users.isValidUser(c.head)) .flatMap('dailySession -> 'version)((c:EventSequence) => { c().find(b => PageViews.isPostShowPage(b)).map( _.data("vw") ) }) .map('dailySession -> 'score)((c:EventSequence) => { val pageViews: Long = c.count((b:Event) => PageViews.isPageView(b) ) pageViews }) Data looks like this: Map('dailySession -> EventSequence, 'version -> versionString, 'score -> pageViews) --- # Hadoop/Cascading .project('userIdent, 'score, 'version) Throws away everything but `userIdent`, `score` and `version`. Map('dailySession -> EventSequence, 'version -> versionString, 'score -> pageViews) Aggregation: .sumL( ('userIdent, 'version), 'score -> 'score, reducers = 6) Group by `(userIdent, version)` pairs, and compute sum of scores for each element in the group. --- # Scalding - Great if you live in Hadoop-land. - No integration whatsoever with Breeze, Spire, etc. - Lives entirely in twitter-land, uses [Algebird](https://github.com/twitter/algebird) for math. My use case: take all the logs, dump data to CSV. Load CSV and then do matrix stuff with Breeze. Danger of Scalding: java.lang.OutOfMemoryError: GC overhead limit exceeded Scala creates a lot more objects than Java. In Hadoop this can be a big performance killer. --- name: inverse class: center, middle, inverse # Conclusion --- # Conclusion - Up and coming libraries. Breeze isn't numpy/scipy, but it's getting closer. - Breeze-Bokeh playing catch-up to Python-Bokeh. - Saddle - building Pandas for Scala is tough, due to heavy use of dynamic features. - Scalding is best Hadoop abstraction layer I've seen. Idiomatic Scala FTW! Why can't we all play nicely together? ![get along](Ennerdale_three_children_hugging.jpg)