Stochastic Optimization and Automatic Differentiation for Machine Learning (Spring 2018)

Coding sessions

Ipython notebooks

Projects (due May 4th 12th 16th, by email)

You can choose either of the projects below. You can complete this assignment alone or as a group of two, but I will grade them accordingly (I will have higher expectations for an assignment completed jointly by two students). Please send me, in an email to, a zip file containing your repord in pdf (not a .doc) and code in whatever format (you can send a notebook).


Implement the SDCA algorithm to estimate support Vector Machines. Test the algorithm on databases of your choice and compare it with a subgradient descent approach such as this one

Incremental methods with second order information

Implement and discuss the efficiency of the algorithm proposed in this paper by benchmarking it against SVRG (only) on a problem of your choice.

Predicting heat-maps using a 2-layer neural net and a Wasserstein loss

Download the Geographical Origin of Music dataset.

Your task will be to predict where, in the world, comes a given piece of music. The dataset you are given contains features for 1059 songs. There are 116 features per song (described in the default_plus_chromatic_features_1059_tracks file). In the following task, please report simultaneously train error and validation error of your network as a function of the number of updates you perform (you can choose to split the dataset as you see fit). Use a simple gradient descent scheme to update your network by minimizing the loss.

  • Program a 2-layer neural network, and train it using autograd, to predict the exact longitude and latitude of the music given 116 features. To define the loss function between your prediction and the actual longitude and latitude of the song, you should use the arc-cosine distance on the sphere and back-prop automatically through that loss. You can draw inspiration from these examples and other examples from autograd. Use a train/test split of your choice, and network size of your choice, knowing that your network should be of the form f(x) = tanh ( A_2 sigma(A_1 x)) where A_2 is 2times p matrix and A_1 a p times 116 matrix, sigma an activation function such as tanh or a rectified linear unit. This should output two values in [-1,1] that you should rescale and shift to suitable latitude and longitude values that can be compared with those in the dataset (typically within the min and max of these respective quantities).

  • Knowing that it is usually difficult to pinpoint exactly the location associated with a song, I have created a dataset where location is no longer given as a point, but, instead, a heat-map on the world (discretized on a 20x20 grid). This information is stored in this text file. That text file has 1059 lines, each of which contains 400 numbers, which correspond to the 20x20 heatmap matrix (enumerated in column order, as standard). Your task is now to fit a neural network whose output is no longer two numbers, but an entire heatmap covering the world. The function you should fit now is f(x) = s-max ( A_2 sigma(A_1 x)) where A_2 is now a 400times p matrix and A_1 a p times 116 matrix, sigma an activation function, and s-max the soft-max function which ensures that the output of your network is a normalized histogram of size 400, which can be reshaped as a 20times 20 heatmap. For the loss function, the simplest choice is a Kullback-Leibler divergence, which you can easily put in your code. BONUS: Another (more involved) option is the earth mover's distance between these two histograms, using a simple squared-Euclidean distance on the grid. An efficient way to do this would be to use the function Sinkhorn in the POT toolbox, dimension reduction scripts which approximates the earth-mover's distance with a function that can be back-propped through autograd.