08 May 2014

gibbs sampler and skewness test

For the past two years, every time I filled my car with gas ($n = 28$ times) I recorded how many miles I went and how many gallons I used. I want to estimate how many miles per gallon I'm getting on my car. For simplicity, I'll assume my data comes from the Normal distribution with unknown mean and variance.

I take a Bayesian approach: the prior on the mean is Normal with mean 30 and variance 10, and the prior on the variance is Inverse Gamma with shape 20 and scale 0.002. These priors are conditionally conjugate, so each full conditional distribution has a known form and the joint posterior can be sampled with a Gibbs sampler.
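
Written out from the update functions in the R code below, the full conditionals are

$$\mu \mid \sigma^2, y \sim \text{Normal}\!\left(\frac{n\bar{y}/\sigma^2 + \mu_0/\sigma_0^2}{n/\sigma^2 + 1/\sigma_0^2},\ \frac{1}{n/\sigma^2 + 1/\sigma_0^2}\right),$$

$$\sigma^2 \mid \mu, y \sim \text{Inverse Gamma}\!\left(a + \frac{n}{2},\ \text{rate} = \frac{1}{b} + \frac{1}{2}\sum_{i=1}^n (y_i - \mu)^2\right),$$

where $\mu_0 = 30$ and $\sigma_0^2 = 10$ are the prior mean and variance, $a = 20$ and $b = 0.002$ are the prior shape and scale, and the inverse gamma scale used in the code is the reciprocal of this rate.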

Here is the posterior predictive distribution (in red) overlaying the histogram of the data:


The posterior distribution for the mean, with an equal-tailed 95% probability interval in green (calculating the 95% HPD interval isn't necessary here since the posterior of the mean is symmetric, so the two intervals essentially coincide):


The posterior mean is 28.8 with a 95% probability interval of (27.1, 30.7). We could also look at the posterior distribution of the variance, but the mean is far more informative as far as miles per gallon is concerned. The histogram of the data suggests that the assumption of normality might not be appropriate, so we will test the skewness in two ways: with simulation and with the bootstrap.

The equation for sample skewness, as computed by the skew function in the code below, is

$$b_1 = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^3}{\left[\frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\right]^{3/2}}.$$

Note that this is a biased estimator (but who cares). The null hypothesis is that there is no skew. We want to build up the finite-sample distribution (as opposed to the asymptotic one) of the skewness statistic under the null. To do this we repeatedly draw samples of size $n$ from a normal distribution and compute the skewness of each sample (the code uses the posterior predictive mean and standard deviation, but any normal works since skewness is unaffected by location and scale).


We obtain a two-tailed p-value by calculating the proportion of simulated skewness values that are more extreme (in absolute value) than the observed skewness of -0.58. This p-value is approximately 0.135.

For the bootstrap approach, we repeatedly resample the data with replacement and compute the skewness of each bootstrap sample. This gives us a distribution of the skewness statistic. A 95% confidence interval for the skewness comes from the 0.025 and 0.975 quantiles of this distribution: (-1.26, 0.04).


With that p-value, and with zero inside the bootstrap interval, there is not enough evidence to suggest that the data is skewed.

Conclusion: my car gets good gas mileage.

R code:

# the inverse gamma density (shape/scale parameterization, where scale = 1/rate)
igpdf = function(x, shape, scale)
    (1/(gamma(shape)*scale^shape))*x^(-shape-1)*exp(-1/(x*scale))

# returns the posterior parameters for mu, depending
# on the current value of sigma2, mumu and sigma2mu are known
updatemu = function(data, sigma2, mumu, sigma2mu){
    xbar = mean(data)
    n = length(data)
    precision = 1/sigma2
    precisionmu = 1/sigma2mu
    postmu = ((xbar*n*precision+mumu*precisionmu)/(n*precision+precisionmu))
    postvar = (1/(n*precision+precisionmu))
    return (c(postmu, postvar))
    }

# returns the posterior parameters for sigma2, depending
# on the current value of mu, priorSh and priorSc are known
updatesigma2 = function(data, mu, priorSh, priorSc){
    n = length(data)
    SSx = sum((data-mu)^2)
    postSh = n/2+priorSh
    postSc = 2*priorSc/(priorSc*SSx+2)
    return (c(postSh, postSc))
    }

# function that contains the looping; at each iteration each
# parameter is updated based on the most recent value of the
# other parameter (only two parameters in this case)
gibbs = function(data, mumu, sigma2mu, prsh, prsc, loops=10000){
    # initialize the matrix of draws
    out = matrix(0, nrow=loops, ncol=2)
    # choose starting location for mu to be the sample mean
    out[1,1] = mean(data)
    # draw an initial sigma2 conditional on the starting mu
    post = updatesigma2(data, out[1,1], prsh, prsc)
    out[1,2] = 1/rgamma(1, shape=post[1], scale=post[2])
    for (i in 2:loops){
        # update sigma2 given the previous draw of mu
        post = updatesigma2(data, out[i-1,1], prsh, prsc)
        # 1/rgamma yields random draws from the inverse gamma
        out[i,2] = 1/rgamma(1, shape=post[1], scale=post[2])
        # update mu given the current draw of sigma2
        post = updatemu(data, out[i,2], mumu, sigma2mu)
        out[i,1] = rnorm(1, post[1], sqrt(post[2]))
        }
    return (out)
    }

# function to compute highest posterior density intervals
hpd = function(x, prob = 0.95, precision = 1000){
    range = seq(0, 1-prob, length=precision)
    range = cbind(range, range+prob)
    best = range[which.min(apply(range, 1, function(y)
        diff(quantile(x, y)))),]
    return (quantile(x, best))
    }

# get the closest values in density() for the intervals
bounds = function(int, dens, x=TRUE){
    if (x) # returns x-value from density
        return(dens$x[which.min(abs(dens$x-int))])
    return(dens$y[which.min(abs(dens$x-int))])
    }

### Miles and Gallons Data from my Saturn.  Includes both highway and
### surface street miles.
mpg.dat=c(209.5/6.171, 296.0/9.664, 240.5/9.867, 221.8/10.231,
        319.3/10.404, 287.5/9.543, 307.3/9.911, 227.9/9.405,
        318.1/9.812, 309.4/9.634, 269.2/9.271, 297.9/10.334,
        163.3/9.913, 300.0/10.12, 355.1/10.384, 286.6/9.611,
        330.6/10.390, 301.0/9.956, 316.2/10.020, 366.9/9.942,
        262.1/10.055, 215.8/10.048, 283.1/8.690, 357.6/9.879,
        242.5/10.121, 188.9/8.485, 311.3/9.870, 225.3/10.236)
n = length(mpg.dat)

### Prior parameters: normal prior on the mean, inverse gamma prior on the variance
mpg.mumu = 30
mpg.s2mu = 10
mpg.prsh = 20
mpg.prsc = 0.002

### Running the gibbs sampler
mpgpost = gibbs(mpg.dat, mpg.mumu, mpg.s2mu, mpg.prsh, mpg.prsc, 100000)

### posterior predictive
# use each set of parameter draws (mu and sig2) to generate a
# random "observation"
preds = rnorm(nrow(mpgpost),  mpgpost[,1], sqrt(mpgpost[,2]))
png("~/files/R/figs/mpghist.png")
par(mar=c(4.1,3.1,3.1,1.1))
hist(mpg.dat, breaks=10, col='gray', main="", xlab="MPG", ylab="",
    xlim=c(min(mpg.dat)-3,max(mpg.dat)+3),freq=FALSE)
points(density(preds), col='red', type='l', lwd=2)
legend(15, 0.14, c("Posterior Predictive", "Data"), lty=c(1, 1),
    col=c('red', "gray"), lwd=c(2,10), cex=1.3)
dev.off()

### posterior distribution on mu
densmu = density(mpgpost[,1])
int = quantile(mpgpost[,1],c(0.025,0.975))
plot(densmu,lwd=2,col='blue', xlab=expression(mu),
    main=expression(paste("Posterior of ", mu)), cex.main=1.3,
    cex.lab=1.3, ylab="")
polygon(density(mpgpost[,1]), col='lightblue', border='blue')
lines(rep(bounds(int[1], densmu), 2), c(0, bounds(int[1],
    densmu, FALSE)), col='green', lwd=2)
lines(rep(bounds(int[2], densmu), 2), c(0, bounds(int[2],
    densmu, FALSE)), col='green', lwd=2)
legend(24.2, 0.42, "95% Probability Interval", lty=1,
    col='green', lwd=2)

# mean of mu
mean(mpgpost[,1])

# 95% probability interval of mu
quantile(mpgpost[,1],c(0.025,0.975))

### hypothesis test on the skewness of the data
skew = function(x) mean((x-mean(x))^3) / var(x)^(3/2)

# using the posterior predictive (but any normal distribution
# will work since they all have the same skewness)
pred.mean = mean(preds)
pred.sd = sd(preds)
observed.skew = skew(mpg.dat)
null.skew = double(100000)
for (i in 1:length(null.skew))
    null.skew[i] = skew(rnorm(length(mpg.dat), pred.mean, pred.sd))
hist(null.skew, col='gray', freq=FALSE, main="Simulation-based",
    xlab="Sample Skewness", ylab="", cex.main=1.3, cex.lab=1.3)
abline(v=c(-1,1)*observed.skew, col='green', lwd=2)
mean(null.skew < observed.skew) + mean(null.skew > -observed.skew)

# using bootstrapping
boot.skew = double(100000)
for (i in 1:length(boot.skew))
    boot.skew[i] = skew(sample(mpg.dat, replace=TRUE))
hist(boot.skew, col='gray', freq=FALSE, xlab="Sample Skewness",
    main="Bootstrapping", cex.main=1.3, cex.lab=1.3)
abline(v=quantile(boot.skew, c(0.025, 0.975)), col='green', lwd=2)
quantile(boot.skew, c(0.025, 0.975))

probability distributions

A probability function takes some outcome X (in an experiment, say) and assigns some probability p to it, with 0 ≤ p ≤ 1. For example, suppose we flip a coin six times and we want to know the probability of getting X heads. In this experiment, X = 0, 1, 2, 3, 4, 5, or 6.

Without getting into too much detail on how these seven probabilities are calculated, here is a figure:


Notice how the probability of getting 3 heads is highest. Our intuition should confirm this: we expect to see half of the flips be heads. Getting no heads or all heads is possible, just not too probable.

The name of this probability function (or distribution) is the Binomial distribution. It is a discrete distribution, which means it only assigns probability to a countable set of values (here, the integers 0 through 6).
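
For reference, these seven probabilities are easy to compute in base R with dbinom:

# P(X = x heads) in 6 flips of a fair coin, for x = 0, 1, ..., 6
probs = dbinom(0:6, size=6, prob=0.5)
round(probs, 4)
# 0.0156 0.0938 0.2344 0.3125 0.2344 0.0938 0.0156
sum(probs)  # the seven probabilities sum to 1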

For a probability function to be valid, each probability must be between zero and one, inclusive, and the probabilities must sum to one. This again agrees with our intuition: a negative probability doesn't make sense, nor can something occur with more than a 100% chance (probability one).

Many probability distributions exist (infinitely many, really), and we choose which one to use based on what we are measuring.

The big picture: we can use mathematical functions to determine how likely an event is to occur.

07 May 2014

a method for auto-tuning metropolis algorithm acceptance rates

The Metropolis-Hastings algorithm is a tool for generating random draws from a probability distribution. It requires some candidate distribution from which proposals are drawn. When Metropolis et al. (1953) introduced the algorithm, the candidate distribution needed to be symmetric; Hastings (1970) generalized the algorithm to allow non-symmetric candidate densities. In this post I use normal candidate distributions.

A metric for determining whether the draws from the algorithm are feasible is the acceptance rate. Gelman, Roberts, and Gilks (1996) show that, asymptotically, the optimal acceptance rate for the Metropolis-Hastings algorithm is around 0.25. Rates between 0.15 and 0.40 still produce efficient results that are close to the optimal rate. See Roberts and Rosenthal (2001) for a more detailed treatment on choosing the scale parameters to attain these acceptance rates.

The standard deviation, σ, of the candidate normal distribution is critical in determining the acceptance rate: large σ's result in lower acceptance rates, and small σ's result in higher acceptance rates.

In a high-dimensional problem with many such σ's, it becomes a pain to set their values manually to achieve good acceptance rates. I present a method (there are plenty) to automatically adjust the σ's toward values that give good acceptance rates.

Here is the function used to adjust the σ's, where A is the observed acceptance rate and T is the target rate:

$$f(A) = \begin{cases} 1 - \dfrac{1 - 1/k}{\cosh(T) - 1}\left(\cosh(A - T) - 1\right), & A < T \\[1.5ex] 1 + \dfrac{k - 1}{\cosh(1 - T) - 1}\left(\cosh(A - T) - 1\right), & A \ge T. \end{cases}$$

In R code we have:

f = function(accept, target = 0.25, k = 2.5){
    # scaling constants chosen so that f(0) = 1/k and f(1) = k
    k = c((1-1/k)/(cosh(target)-1), (k-1)/(cosh(target-1)-1))
    # the factor is 1 at the target, shrinks toward 1/k below it,
    # and grows toward k above it
    1+sign(accept-target)*(cosh(accept-target)-1)*
        k[(sign(accept-target)+1)/2+1]
    }

The function expects accept to be in [0, 1], target to be in (0, 1), and k > 1. It looks like this:


It is pieced together from hyperbolic cosine functions, which are flat near their minimum and increase exponentially away from it. The flat region (at 0.25 in the picture) is centered at the target acceptance rate.

16 May 2014 edit: Here is a similar function that results in steeper drops when the acceptance rate is less than the target. It is also symmetric.



During the burn-in period, calculate the acceptance rate A for each dimension within some window (say, the latest 100 iterations of the algorithm). Then for each σ, let $\sigma_{new} = \sigma_{current} \cdot f(A)$. This increases σ when a large proportion of candidates are being accepted, which lowers future acceptance rates; the reverse applies when few candidates are accepted.

The parameter k controls the maximum change made to the candidate distribution standard deviation. In the function above, σ is multiplied by at most a factor of k and by at least a factor of 1/k in any single adjustment.
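
As a concrete illustration (a minimal sketch of the recipe above, not the large model mentioned below), here is a one-dimensional random-walk Metropolis sampler for a standard normal target that rescales its candidate standard deviation with f() at the end of every 100-iteration window during burn-in:

set.seed(1)
log.target = function(x) dnorm(x, log=TRUE)  # toy target: standard normal
burn = 2000      # burn-in iterations, during which sigma is tuned
keep = 5000      # post-burn-in draws
window = 100     # window length for computing the acceptance rate
sigma = 10       # deliberately poor starting candidate sd
x = 0
accepts = 0
draws = double(keep)
for (i in 1:(burn + keep)){
    cand = rnorm(1, x, sigma)
    # symmetric candidate, so the acceptance ratio is just the
    # ratio of target densities
    if (log(runif(1)) < log.target(cand) - log.target(x)){
        x = cand
        accepts = accepts + 1
        }
    # during burn-in, rescale sigma at the end of each window
    if (i <= burn && i %% window == 0){
        sigma = sigma * f(accepts/window)
        accepts = 0
        }
    if (i > burn)
        draws[i - burn] = x
    }
sigma             # tuned candidate sd
accepts/keep      # post-burn-in acceptance rate, should be near the 0.25 target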

Using this method on a Bayesian model with 2,500 parameters, I was able to get every parameter's acceptance rate well within [0.15, 0.40] after a burn-in of 2,000 iterations. The figure shows the acceptance rates over 5,000 post-burn-in draws.

06 May 2014

kernel smoothing

This post considers the simple example presented by Higdon (2002) in "Space and Space-Time Modeling Using Process Convolutions" which is to smooth out a lattice of i.i.d. Gamma(2, 1) random variables.

Suppose we have observations y at the locations x. The idea behind kernel smoothing is that, when making a prediction at location z, we up-weight the y's whose x's are closest to z and down-weight those farther from z. We expect observations that are close together to produce similar values and those that are far apart to be largely unrelated.
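
In symbols, the smoothed value at a location z is the kernel-weighted average that the code below computes (the Nadaraya-Watson form of kernel smoothing):

$$\hat{y}(z) = \frac{\sum_{i} K(x_i, z)\, y_i}{\sum_{i} K(x_i, z)},$$

where K is a kernel function that is largest at $x_i = z$ and decays as the distance $|x_i - z|$ grows.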

The following code uses the Gaussian kernel (named for its form) on an equally spaced grid of 15 points from 0 to 1. At each location, a random Gamma(2, 1) value is drawn. The bandwidth parameter "delta" controls how quickly an observation's weight decays with its distance from the prediction location.

mw.smooth = function(x, y){ 
    # gaussian kernel, delta is the bandwidth parameter
    kern = function(x, y, delta=1/16)
        exp(-1/(2*delta^2)*(x-y)^2)
    # m = number of points to do the smoothing,
    # should be greater than length(x) to look nice
    m = 4*length(x)
    outx = seq(min(x), max(x), length=m)
    outy = double(m)
    # loop to compute the weighted value at each new x
    for (i in 1:length(outy))
        outy[i] = sum(y*kern(x, outx[i]))/
            sum(kern(x, outx[i]))
    return (list("x"=outx, "y"=outy))
    }   

xx = seq(0, 1, length=15)
yy = rgamma(length(xx), 2, 1)
plot(xx, yy, type='h')
ss = mw.smooth(xx, yy) 
lines(ss$x, ss$y, type='l', col='red')



The Gaussian kernel is straightforward and produces a rather smooth curve. Other kernels will have somewhat different properties, but all do essentially the same thing. The remaining R code contains other kernel functions: window, Epanechnikov, bisquare, tricube, and Wendland.

n = 30
xx = seq(0, 5, length=n)

# setting d to be the range of the data, related to delta
d = max(xx) - min(xx)

# how far to go beyond the range of the data (as a fraction of the range)
beyond = 0.1

# for the kernels:
# x is the locations for data points
# y is the locations at which to predict
kern.gaussian = function(x, y, delta=d/16) # gaussian
    exp(-1/(2*delta^2)*(x-y)^2)
kern.window = function(x, y, delta=d/16) # window
    1*(abs(x-y) <= delta)
kern.epanech = function(x, y, delta=d/16) # epanechnikov
    0.75*(1-(abs(x-y)/delta)^2)*(abs(x-y) < delta)
kern.bisq = function(x, y, delta=d/16) # bisquare
    (1-(abs(x-y)/delta)^2)^2*(abs(x-y) < delta)
kern.tric = function(x, y, delta=d/16) # tricube
    (1-(abs(x-y)/delta)^3)^3*(abs(x-y) < delta)
kern.wendl = function(x, y, delta=d/4){ # wendland
    r = abs(x-y)/delta  # scaled distance; support is r < 1
    (1-r)^6*(35*r^2+18*r+3)/3*(r < 1)
    }

# for comparing two different kernels
kern1 = function(x, y) kern.gaussian(x, y, d/16)
kern2 = function(x, y) kern.bisq(x, y, d/8)

# yy = rexp(n, 1)
yy = rgamma(n, 2, 1)
plot(xx, yy, type='h', xlim=c(min(xx)-beyond*d, max(xx)+beyond*d))

m = 4*n
# smox = the values of x at which to make predictions
smox1 = seq(min(xx)-beyond*d, max(xx)+beyond*d, length=m)
# smoothed = the predicted values
smoothed1 = double(m)
for (i in 1:m)
    smoothed1[i] = sum(yy*kern1(xx, smox1[i]))/
        sum(kern1(xx, smox1[i]))
points(smox1, smoothed1, type='l', col='red')

smox2 = seq(min(xx)-beyond*d, max(xx)+beyond*d, length=m)
smoothed2 = double(m)
for (i in 1:m)
    smoothed2[i] = sum(yy*kern2(xx, smox2[i]))/
        sum(kern2(xx, smox2[i]))
points(smox2, smoothed2, type='l', col='blue')

wireless connection on a samsung chromebook with no gui ubuntu

If you're weird like me, you like the thought of doing all your work through the terminal. So I got an inexpensive Chromebook, hoping to set it up to boot to a terminal and nothing else.

After following these directions (dual boot ubuntu on chromebook) to install a minimal (no desktop) Ubuntu 12.04 LTS on my ARM-based Samsung Chromebook, I ran into a problem connecting to the Internet. Here is my current solution.

Before running sudo bash s9ryd ubuntu-standard lts, add these lines to the s9ryd script:

apt-get -y install wpasupplicant
apt-get -y install wireless-tools

at the end of the block where the other packages are installed. These two packages provide the commands (such as wpa_passphrase and wpa_supplicant) needed to set up the wireless connection.

After installing Ubuntu, run this set of commands as root (e.g., with sudo):

ifconfig mlan0 up
wpa_passphrase ESSID KEY > /home/user/.wpa.conf
wpa_supplicant -imlan0 -c/home/user/.wpa.conf -Dwext -B
dhclient -r
dhclient mlan0

where ESSID is the network name and KEY is its password. I wrote a little bash script that included these lines. The problem is that I have to run it each time I turn on the computer. I'm still looking for a better solution.