# Spark (for Java)

I almost missed this goodie on the Technology Radar -- not only is it shadowed by the popular Apache Spark name, its reference was hidden in a Spring Boot summary... not my favorite family of XML-bloated tools.  Spark is a lightweight web framework for Java 8.  It has modest run-time dependencies -- just Jetty and slf4j -- and a four-line hello-world example (counting imports, but not closing curly braces).

Let's go through a somewhat more complex conversation with Spark than "Hello, World" and set up a simple key-value store.

### Project Setup

Create a Maven project.  Spark has instructions for IntelliJ and Eclipse.  You don't need an archetype; just make sure to select Java SDK 1.8.

### Salutations, Terrene

We'll implement a simple REST dictionary so that we can show off our vocabulary, or our thesaurus skills, and because we're snooty, we'll "protect" our dictionary with a password.

```java
package org.bredin;

import spark.*;

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.*;

public class Main {
  private static Map<String, String> keyStore = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

  public static void main(String[] args) {
    // Reject any request that doesn't carry the right secret.
    Spark.before((request, response) -> {
      if (!"blam".equalsIgnoreCase(request.queryParams("secret"))) {
        Spark.halt(401, "invalid secret");
      }
    });
    Spark.get("/get/:key", (request, response) -> readEntry(request, response));
    Spark.get("/put/:key/:value", (request, response) -> writeEntry(request, response));
  }

  public static Object readEntry(Request request, Response response) {
    String key = request.params(":key");
    String value = keyStore.get(key);
    if (value == null) {
      response.status(404);
      return "unknown key " + key;
    } else {
      return value;
    }
  }

  public static Object writeEntry(Request request, Response response)
      throws UnsupportedEncodingException {
    String key = request.params(":key");
    String value = URLDecoder.decode(request.params(":value"), "UTF-8");
    String oldValue = keyStore.put(key, value);
    if (oldValue == null) return "";
    else return oldValue;
  }
}
```


OK, it's not as terse as Ruby or Node.js, but it's readable (similar to Express), statically-typed, and integrates with the rest of your JVM.  The real beauty of Spark is in the route definitions and filters -- try approaching that level of conciseness with Spring... or even Jetty and annotations.

Spark provides before() and after() filters, presumably for authentication, logging, forwarding, and so on; filters execute in the order they're applied in your code.  Above, there's only an unsophisticated password check.  I've not dug in to discover whether Spark exposes enough bells and whistles for Kerberos.

The Spark.get() methods provide conduits for REST into your application.  Spark checks that the request parameters are present, returning 404 otherwise, and dispatches to your registered handlers.

You can run and test-drive the example:

```shell
$ curl 'localhost:4567/put/foo/What%20precedes%20bar?secret=BLAM'
$ curl 'localhost:4567/get/foo?secret=BLAM'
What precedes bar
```


Neat!  I've always been uneasy that Jetty's annotations aren't thoroughly checked by the compiler.  DropWizard has loads of dependencies with versioning issues that have tripped me up.

# Sweet Georgia Brown

Sweet Georgia Brown was probably the first jazz standard I heard. Maybe that's not unusual for people growing up in mid and southwest suburbs during the seventies and eighties. It probably exposes me as a poseur that the Harlem Globetrotters and Scooby Doo had more impact on my jazz education than Wynton Marsalis, Sunday brunches, Starbucks, or wherever I thought I should have heard jazz.

At any rate, Sweet Georgia Brown has probably been between most Americans' ears at some point; the melody falls naturally on top of the chord changes, and it's a comfortable starting place to learn 2-5-1 chord changes.  Theory wonk alert: this is important stuff for improvising and learning new tunes.

I'll refer in my notes to Brian Oberlin's version in F.

### Chords

The chords to Sweet Georgia Brown are almost entirely a chain of a variant of ii-V7-I changes, and once you can chain those together on the mandolin (which is easy), you have most of the tune under your fingers.  The variant II7-V7-I7 is simpler than the "pure" ii-V7-I, and it's pretty common in swing, dixieland, and blues.

When Ray, my first mandolin teacher, showed me this trick, I thought it was magic.  There are only two three-finger shapes you need for the trick.  Here's how to do it going from G7 to C7:

- V7 (G7): Rotate your left hand slightly so that the bottom two notes each move one fret (half-step) down the neck, and you'll form a rootless dominant-7th chord.
- I7 (C7): Rotate your left hand back to the root-on-top shape, but move your fingers two frets down the neck from where you started....

Hey, you're almost back to where you started! You can continue the 2-5-1 walk... and it actually sounds like music, and it works in any major key from any place higher than the second fret.

More theory wonkery -- the F7, E7, Eb7, D7 sequence at the turnaround is also a 2-5-1 progression, just with some tritone substitution.  Try "un-substituting" on the turnaround, or applying the substitution for other 2-5-1's in the song, and you'll see some variety in chord choices and voicing.

Those two three-finger chord shapes moved up and down the neck will get you everything you need to play Sweet Georgia Brown, except for D minor and F.  For the former, a two-finger bar at the 7th fret is easy, but a three-finger shape with the root on top at the 5th fret might sound better in some places.  For the latter, the F major chord, you could even substitute F7 and be OK.

For me, a swinging chop seems pretty natural.  Don Julin has a good free lesson/video.

### Melody

Once you're comfortable with the chords, and with simply noodling the major pentatonic scales over them, it might not surprise you to find the melody on your own.  There are a couple of odd runs -- at least I thought they were odd until I looked at the chords' scale notes -- and then the melody becomes much easier to remember, even to feel.

Follow Brian Oberlin's notes while thinking about where the first, third, fifth, and flatted seventh are in the chord, and the melody will seep into your finger memory.  I like swinging the melody, and alternating between playing the chords and the melody reinforces the syncopation.

### Sound Bites

I'll circle back, soon, with some audio samples.

# Exploring Python with Data

In the glut of Python data analysis tools, I'm sometimes embarrassed by my lack of comfort with Python for analysis.  Static types, Java/Scaladoc, and slick IDEs working in concert with compilers provide guides that I haven't been able to replace in Python.  Additionally, dynamic types seem to exacerbate problems with library interoperability.  With Anaconda and Jupyter, though, I can share some quick notes on getting started.

Here are some notes on surveying some admittedly canned data to classify malignant/benign tumors.  The Web is littered with examples of using sklearn to classify iris species using feature dimensions, so I thought I would share some notes exploring one of the other datasets included with scikit-learn, the Breast Cancer Wisconsin (Diagnostic) Data Set.  I've also decided to use Python 3 to take advantage of comprehensions and because that's what the Python community uses where I work.

The notebook below illustrates how to load demo data (loading csv is simple, too), convert the scikit-learn matrix to a DataFrame if you want to use Pandas for analysis, and applies linear and logistic regression to classify tumors as malignant or benign.
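As an aside on the CSV point: a `pd.read_csv` call on a file with a header row gets you the same kind of DataFrame.  A minimal sketch (the file name below is hypothetical; for a self-contained demo I rebuild the equivalent frame from scikit-learn's bundled copy):

```python
import pandas as pd
from sklearn import datasets

# Hypothetical CSV with a header row of feature names plus a "target" column:
# df = pd.read_csv("breast_cancer.csv")

# Equivalent self-contained construction from scikit-learn's bundled dataset.
bc = datasets.load_breast_cancer()
df = pd.DataFrame(data=bc.data, columns=bc.feature_names)
df["target"] = bc.target  # 0 = malignant, 1 = benign

print(df.shape)  # (569, 31)
```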

In [7]:

```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn.linear_model import LinearRegression

# Load the demo data, then convert the numpy matrix to a Pandas DataFrame.
bc = datasets.load_breast_cancer()
pbc = pd.DataFrame(data=bc.data, columns=bc.feature_names)
pbc.describe()
```

Out[7]:

```text
      mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
count  569.000000   569.000000     569.000000  569.000000  0.096360->see mean  ...
count  569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 0.062798 ... 16.269190 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946
std 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 0.007060 ... 4.833242 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061
min 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 0.049960 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040
25% 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 0.057700 ... 13.010000 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460
50% 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 0.061540 ... 14.970000 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040
75% 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 0.066120 ... 18.790000 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080
max 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 0.097440 ... 36.040000 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500
```

8 rows × 30 columns

In [8]:

```python
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression

# Plot training-set size versus classifier accuracy.
def make_test_train(train_count):
    n = bc.target.size
    trainX = bc.data[0:train_count, :]
    trainY = bc.target[0:train_count]
    testX = bc.data[n//2:n, :]
    testY = bc.target[n//2:n]
    return trainX, trainY, testX, testY

def eval_lin(trainX, trainY, testX, testY):
    regr = LinearRegression()
    regr.fit(trainX, trainY)
    y = regr.predict(testX)
    err = (y.T > 0.5) - testY  # threshold the linear prediction at 0.5
    correct = [x == 0 for x in err]
    return sum(correct) / err.size, np.std(correct) / sqrt(err.size)

def eval_log(trainX, trainY, testX, testY):
    regr = LogisticRegression()
    regr.fit(trainX, trainY)
    correct = (regr.predict(testX) - testY) == 0
    return sum(correct) / testY.size, np.std(correct) / sqrt(correct.size)

def lin_log_cmp(n):
    trainX, trainY, testX, testY = make_test_train(n)  # min 20
    lin_acc, lin_stderr = eval_lin(trainX, trainY, testX, testY)
    log_acc, log_stderr = eval_log(trainX, trainY, testX, testY)
    return lin_acc, log_acc

xs = range(20, 280, 20)
lin_log_acc = [lin_log_cmp(x) for x in xs]

pl.figure()
lin_lin, = pl.plot(xs, [y[0] for y in lin_log_acc], label='linear')
log_lin, = pl.plot(xs, [y[1] for y in lin_log_acc], label='logistic')
pl.legend(handles=[lin_lin, log_lin])
pl.xlabel('training size from ' + str(bc.target.size))
pl.ylabel('accuracy');
```


Incidentally, I used IPython's nbconvert to paste the notebook here.

Caveats: Without types, it's pretty easy to make mistakes in manipulating the raw data.  Python and numpy scalar, array, and matrix arithmetic operators are gracious in accepting parameters, so you might get a surprise or two if you're not careful.  That, combined with operating black-box analysis tools, leaves me skeptical of any conclusions, but it's a start, and the investment was cheap.
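A concrete instance of the kind of surprise I mean (my own toy example, not from the notebook): subtract a column of predictions from a flat array of labels, and numpy silently broadcasts to a full matrix instead of complaining.

```python
import numpy as np

y_pred = np.array([[0.2], [0.7], [0.9]])  # shape (3, 1): a column of predictions
y_true = np.array([0, 1, 1])              # shape (3,): flat labels

# Broadcasting pairs every prediction with every label: shape (3, 3),
# almost certainly not what you wanted, and no error is raised.
diff = y_pred - y_true
print(diff.shape)  # (3, 3)

# Flattening first gives the element-wise difference you likely intended.
diff_ok = y_pred.ravel() - y_true
print(diff_ok.shape)  # (3,)
```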

Other Plotting Tools: Seaborn's pairplot generates slick scatter plots and histograms that help identify outliers, describe ranges, and demonstrate redundancy in the data dimensions.  I tried removing some of the obviously redundant data columns; that produced no quality change in logistic classification and a statistically insignificant reduction in linear classification.

Linear or Logistic? It surprised me that logistic regression proved an inferior classifier to linear here, but economists frequently use linear regression to model 0/1 variables.  Paul von Hippel has a post comparing the relative advantages of linear versus logistic regression.  As a student, I had trouble both with applying logistic regression and with conveying my travails to a thesis adviser. I wish I had read more commentary comparing the two 20 years ago.
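For the curious, the comparison boils down to a few lines; here's a self-contained sketch (mine, not part of the original notebook) that thresholds linear-regression predictions at 0.5 and scores both models on the second half of the data:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression, LogisticRegression

bc = datasets.load_breast_cancer()
n = bc.target.size
trainX, trainY = bc.data[: n // 2], bc.target[: n // 2]
testX, testY = bc.data[n // 2 :], bc.target[n // 2 :]

# Linear regression as a classifier: threshold the continuous prediction at 0.5.
lin = LinearRegression().fit(trainX, trainY)
lin_acc = np.mean((lin.predict(testX) > 0.5).astype(int) == testY)

# Logistic regression predicts the 0/1 label directly.
log = LogisticRegression(max_iter=5000).fit(trainX, trainY)
log_acc = np.mean(log.predict(testX) == testY)

print(f"linear: {lin_acc:.3f}, logistic: {log_acc:.3f}")
```

Both land well above chance on this split; which one wins can flip with training size, which is exactly what the notebook's plot probes.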