This assignment can be solved via JAGS and library(runjags). Look through the manual to find what you need. However, solutions with an original, hand-coded Gibbs sampling implementation are also welcome.
Your solution to this problem will be discussed during the exam. Send it to vavraj@karlin.mff.cuni.cz and komarek@karlin.mff.cuni.cz no later than 7:00 on day D-2, where D is your exam term.
We will work with the velibCount data from library(MBCbook); see help(velibCount) for more details. The velibCount object is a list with the following structure:
library(MBCbook)
data(velibCount)
names(velibCount)
## [1] "data" "position" "dates" "names"
It covers 1213 city-bike rental stations in the vicinity of Paris. For each hour of one week in September 2014, the number of bikes parked at each station was recorded.
data - matrix containing the counts of bikes stationed at a specific rental station (row) at a specific time (column),
position - matrix containing the coordinates of the rental stations,
dates - vector of dates (in seconds since "1970-01-01") at which the counts were taken,
names - vector of names of the rental bike stations.
Your task is to discover the different evolution curves of the intensity of stationed bikes during a working day and to cluster the stations based on similar profiles.
Let us first have a look at several stations:
X = velibCount$data
Sys.setlocale("LC_TIME", "en_US.UTF-8")
dates = strftime(as.POSIXct(velibCount$dates, origin = "1970-01-01"), format = "%a-%I%P")
days = strftime(as.POSIXct(velibCount$dates, origin = "1970-01-01"), format = "%u")
hours = strftime(as.POSIXct(velibCount$dates, origin = "1970-01-01"), format = "%H")
par(mar = c(4, 4, 2.5, 0.5))
matplot(t(X[1:5, ]), type = "l", lwd = 2,
        main = "The Velib data set", xaxt = "n", xlab = "Hour",
        ylab = "Number of available bikes", ylim = c(0, 1.1 * max(X[1:5, ])))
axis(1, at = seq(2, length(hours), by = 6),
     labels = hours[seq(2, length(hours), by = 6)])
abline(v = seq(14, length(hours), by = 24), col = "slategrey", lty = 5)
text(seq(2, length(hours), by = 24), 1.05 * max(X[1:5, ]),
     labels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday"),
     font = 2, cex = 1.2)
legend(24 * 3 + 14, max(X[1:5, ]), legend = velibCount$names[1:5],
       col = 1:5, lty = 1:5, lwd = 2, xjust = 0.5, yjust = 1)

Some stations fill up during the night, others during the day. These patterns can vary in many ways. The profiles within working days look very similar, but weekends tend to deviate from the regular trends. Therefore, we remove the observations from Saturday and Sunday from the analysis. Thus, only \(24 \cdot 5 = 120\) observations per station should remain:
Y = X[, days %in% 1:5]
dim(Y)
hours = hours[days %in% 1:5]
par(mar = c(4, 4, 2.5, 0.5))
matplot(t(Y[1:5, ]), type = "l", lwd = 2,
        main = "The Velib data set", xaxt = "n", xlab = "Hour",
        ylab = "Number of available bikes", ylim = c(0, 1.1 * max(Y[1:5, ])))
axis(1, at = seq(1, length(hours) + 1, by = 6),
     labels = c(hours[seq(1, length(hours), by = 6)], "00"))
abline(v = seq(1, length(hours) + 1, by = 24), col = "slategrey", lty = 5)
text(seq(13, length(hours), by = 24), 1.05 * max(Y[1:5, ]),
     labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"),
     font = 2, cex = 1.2)
legend(24 * 2 + 13, max(Y[1:5, ]), legend = velibCount$names[1:5],
       col = 1:5, lty = 1:5, lwd = 2, xjust = 0.5, yjust = 1)

We will assume that the evolution pattern (from 00:00 to 23:00) in a working day is the same regardless of the day of the week. In particular, the expected curves for Monday, \(\ldots\), Friday are the same.
Moreover, we will assume \(G\) different types of stations. Each type of station will be distinguished from the others by a specific evolution during a working day. These curves and station types are not known in advance. We will perform unsupervised clustering based on a mixture of Poisson distributions.
Let \(U_i \in \{1, \ldots, G\}\) be the unknown cluster label for station \(i \in \{1, \ldots, n\}\). This label can be equivalently represented by the vector of indicators \(\mathbf{Z}_i = \left(\mathbb{I}_{(U_i = 1)}, \ldots, \mathbb{I}_{(U_i = G)}\right)^\top\), which has a 1 at position \(g\) when \(U_i = g\) and zeros elsewhere. We will specify the distribution of the observed counts through the conditional distribution given the cluster label.
We will assume that the bike counts follow Poisson distributions independently of each other. In mathematical notation, \(Y_{ihd} \,|\, U_i = g \sim \mathsf{Pois}\left(\mu_{ihdg}\right)\), where \(Y_{ihd}\) is the number of bikes parked at station \(i\) at \(h-1\) hours on day \(d\), and \(\mu_{ihdg}\) is its expectation parameter. Naturally, \(h \in \{1, \ldots, 24\}\), \(d \in \{1, \ldots, 5\}\) and \(g \in \{1, \ldots, G\}\).
Assuming complete generality of \(\boldsymbol{\mu}\) would require \(120nG\) unknown parameters. We will decompose \(\boldsymbol{\mu}\) to decrease the number of parameters while still capturing the essentials. Namely, \(\mu_{ihdg} = w_i \cdot \lambda_{hg}\), where \(w_i > 0\) is a station-specific overall intensity and \(\lambda_{hg} > 0\) describes the relative intensity at hour \(h-1\) for a station of type \(g\).
For identifiability purposes we have to enforce the condition \[\begin{equation} \sum\limits_{h=1}^{24} \lambda_{hg} = 1 \quad \text{ for each } g \in \{1, \ldots, G\}, \tag{1} \end{equation}\] otherwise we could obtain an arbitrary rescaling, e.g. \(\mu_{ihdg} = (w_i \cdot c) \cdot (c^{-1} \cdot \lambda_{hg})\). As discussed above, all days are assumed to follow the same evolution curve; hence, no parameter depends on \(d\).
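One convenient way to enforce this, anticipated in the JAGS task below, is to reparametrize through unrestricted positive parameters \(\lambda^{0}_{hg}\) and normalize them: \[ \lambda_{hg} = \frac{\lambda^{0}_{hg}}{\sum\limits_{h'=1}^{24} \lambda^{0}_{h'g}}, \qquad h \in \{1, \ldots, 24\}, \; g \in \{1, \ldots, G\}. \]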
Assume a multinomial distribution for \(\mathbf{Z}_i \sim \mathsf{Mult}_G(1, \boldsymbol{\tau})\), which is equivalent to the discrete categorical distribution \(\mathsf{P}(U_i = g) = \tau_g\), where the parameter \(\boldsymbol{\tau} \in (0, 1)^G\) is a vector of \(G\) cluster probabilities that sum to one, \(\sum\limits_{g=1}^G \tau_g = 1\).
Posit the following independent prior distributions for the model parameters: \[ w_i \sim \mathsf{Gamma}(a, b), \qquad \lambda^{0}_{hg} \sim \mathsf{Gamma}(c, d), \qquad \boldsymbol{\tau} \sim \mathsf{Dir}(\boldsymbol{\alpha}), \] where the hyperparameters \(a\), \(b\), \(c\), \(d\) and \(\boldsymbol{\alpha}\) are chosen to induce a weakly informative prior. A random vector \(\boldsymbol{\tau}\) follows the Dirichlet distribution with parameter \(\boldsymbol{\alpha}\) on a simplex (constraints \(0 < \tau_g < 1\) and \(\tau_1 + \cdots + \tau_G = 1\)) if \(p(\boldsymbol{\tau} \,|\, \boldsymbol{\alpha}) \propto \prod\limits_{g=1}^G \tau_g^{\alpha_g - 1}\). Search for ddirch in the JAGS manual.
Describe in detail how data augmentation works in this model. What would be the observed likelihood in the standard frequentist approach?
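Hint: Marginalizing the complete-data likelihood over the latent labels yields the usual mixture form of the observed likelihood, \[ L(\boldsymbol{\tau}, \mathbf{w}, \boldsymbol{\lambda}) = \prod\limits_{i=1}^{n} \sum\limits_{g=1}^{G} \tau_g \prod\limits_{d=1}^{5} \prod\limits_{h=1}^{24} \frac{\left(w_i \lambda_{hg}\right)^{y_{ihd}}}{y_{ihd}!} \, e^{-w_i \lambda_{hg}}, \] which a frequentist would maximize directly (typically via the EM algorithm).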
Write down the full-conditional probability density functions. Do they belong to well-known parametric families? If so, write down the values of the parameters. Comment on how the choice of hyperparameters for the prior distribution affects these full-conditional distributions. Choose their values appropriately.
Write down the algorithm of Gibbs sampling for this specific model.
Implement the Poisson Mixture Model in JAGS for a given number of clusters \(G\). Show the code and explain the format of inputs. Enforce condition (1) above by distinguishing the unrestricted and restricted parameters \(\lambda_{hg}\).
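Hint: Below is a minimal sketch of such a model, assuming the gamma priors stated above; it is a starting point, not a complete solution. In the code the hyperparameters \(a\), \(b\), \(c\), \(d\) are renamed a.w, b.w, c.l, d.l to avoid a clash with the day index d, and the data matrix Y is reshaped into a 3-dimensional array, assuming its columns are ordered chronologically (Monday 00:00, ..., Friday 23:00). The hyperparameter values shown are placeholders.
model_string <- "model {
  for (i in 1:n) {
    U[i] ~ dcat(tau[1:G])                  # latent cluster label
    w[i] ~ dgamma(a.w, b.w)                # station-specific intensity
    for (h in 1:24) {
      for (d in 1:5) {
        Y[i, h, d] ~ dpois(w[i] * lambda[h, U[i]])
      }
    }
  }
  for (g in 1:G) {
    for (h in 1:24) {
      lam0[h, g] ~ dgamma(c.l, d.l)                    # unrestricted parameters
      lambda[h, g] <- lam0[h, g] / sum(lam0[1:24, g])  # restricted: sum over h is 1
    }
  }
  tau[1:G] ~ ddirch(alpha[1:G])            # cluster probabilities
}"
G <- 3                                     # illustrative choice
Yarr <- array(Y, dim = c(nrow(Y), 24, 5))  # station x hour x day
data_list <- list(Y = Yarr, n = nrow(Y), G = G,
                  a.w = 1, b.w = 0.01, c.l = 1, d.l = 1,  # placeholder hyperparameters
                  alpha = rep(1, G))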
Would you be able to implement the Gibbs sampling on your own without the use of JAGS?
Choose some reasonable number of clusters \(G\) (\(G=10\) is used in the source literature, but \(G=2\) or \(G=3\) could be enough). Sample two Markov chains using JAGS (or your own implementation) to approximate the posterior distribution \(p(\mathbf{U}, \boldsymbol{\tau}, \mathbf{w}, \boldsymbol{\lambda} \,|\, \mathsf{data})\). Initialize them from completely different points. Choose an appropriate burn-in by monitoring some of the trajectories. Do the chains converge to the same solution? What could be the danger here when approximating the posterior distribution from both chains?
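Hint: One possible call via run.jags, reusing model_string and data_list from the sketch above; the two initial-value lists are deliberately very different, and the .RNG. entries merely make the run reproducible:
library(runjags)
n <- nrow(Y)
inits1 <- list(U = rep(1, n), lam0 = matrix(1, 24, G),
               .RNG.name = "base::Mersenne-Twister", .RNG.seed = 1)
inits2 <- list(U = sample(1:G, n, replace = TRUE),
               lam0 = matrix(runif(24 * G, 0.1, 10), 24, G),
               .RNG.name = "base::Mersenne-Twister", .RNG.seed = 2)
fit <- run.jags(model_string, monitor = c("tau", "lambda", "w", "U"),
                data = data_list, n.chains = 2, inits = list(inits1, inits2),
                adapt = 1000, burnin = 2000, sample = 2000)
plot(fit, vars = "tau")   # traceplots help to judge burn-in and convergence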
Hints: The parametric space is huge. Sampling can take hours! Start with a low number of samples and estimate how long it would take you to produce an MCMC run of reasonable length (...$timetaken). Continue sampling with the function extend.jags, which will skip the model compilation. You are allowed to take any reasonable shortcuts to speed up your sampling: parallelization, low \(G\), a single day of the week only, a subset of stations, a short sample length. Defend your shortcuts. Implementing the sampler on your own may severely reduce the computation time.
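For illustration, continuing the run from the sketch above:
fit$timetaken                            # how long the sampling has taken so far
fit <- extend.jags(fit, sample = 5000)   # draw more samples without recompiling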
Decide whether you can use the samples from all chains to approximate the posterior. Provide summaries including equal-tailed (ET) and HPD intervals for a representative subset of the primary model parameters.
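Hint: One way to obtain both interval types via coda, assuming the chains may be pooled (combine.mcmc is provided by runjags; the parameter subset is purely illustrative):
library(coda)
post <- combine.mcmc(fit)    # pool both chains into a single mcmc object
pars <- c("tau[1]", "tau[2]", "lambda[8,1]", "w[1]")
t(apply(as.matrix(post)[, pars], 2, quantile, probs = c(0.025, 0.975)))  # 95% ET
HPDinterval(post, prob = 0.95)[pars, ]                                   # 95% HPD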
Estimate the posterior means of the parameters \(\lambda_{hg}\) and plot the profiles for each \(g\) in one plot. Depict also the cluster sizes. Interpret the discovered clusters.
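A possible sketch, reusing the pooled object post from above, assuming lambda and tau were monitored and ignoring potential label switching (which should be checked first):
post_mat <- as.matrix(post)
lambda_hat <- matrix(colMeans(post_mat[, grep("^lambda\\[", colnames(post_mat))]),
                     nrow = 24)    # 24 x G; columns follow the JAGS ordering
tau_hat <- colMeans(post_mat[, grep("^tau\\[", colnames(post_mat))])
matplot(0:23, lambda_hat, type = "l", lwd = 2,
        xlab = "Hour", ylab = expression(lambda[hg]))
legend("topright", col = 1:G, lty = 1:G, lwd = 2,   # posterior mean of tau as a
       legend = paste0("g = ", 1:G, " (",           # proxy for the cluster size
                       round(100 * tau_hat), "% of stations)"))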
Estimate the posterior medians of the station-specific intensities \(w_i\) and plot a histogram.
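For instance (again with post_mat from above):
w_hat <- apply(post_mat[, grep("^w\\[", colnames(post_mat))], 2, median)
hist(w_hat, breaks = 30, xlab = expression(hat(w)[i]),
     main = "Posterior medians of station intensities")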
Using the sampled cluster labels \(U_i\), assign each station to the cluster with the highest frequency. You may apply a custom rule to keep a station unclustered if two or more clusters compete closely (see the sketch below). Take the locations of the stations and plot them on a real-world map. Distinguish the clusters by different colors. Do you see some logical pattern?
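A minimal sketch of the modal assignment; the 0.5 threshold for leaving a station unclustered is only an illustrative choice:
U_samp <- post_mat[, grep("^U\\[", colnames(post_mat))]
modal <- apply(U_samp, 2, function(u) as.integer(names(which.max(table(u)))))
modal_freq <- apply(U_samp, 2, function(u) max(table(u)) / length(u))
cluster <- ifelse(modal_freq > 0.5, modal, NA)   # NA = no clearly dominating cluster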
Hint: How to plot a real-world map using R? Use either the mapview or the leaflet library.
library(leaflet)
leaflet(velibCount$position) %>%
  addTiles() %>%
  addCircleMarkers(color = "red", radius = 5, fillOpacity = 1, stroke = FALSE) %>%
  setView(lng = mean(velibCount$position$longitude), # to show the map in Rmd HTML output
          lat = mean(velibCount$position$latitude),  # ignore or replace for your PDF
          zoom = 12)
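A variant of the same map with the stations colored by the modal cluster from the sketch above (unclustered stations are dropped for simplicity):
pos <- velibCount$position
pos$cluster <- factor(cluster)
pal <- colorFactor(rainbow(length(levels(pos$cluster))), domain = levels(pos$cluster))
leaflet(pos[!is.na(pos$cluster), ]) %>%
  addTiles() %>%
  addCircleMarkers(color = ~pal(cluster), radius = 5, fillOpacity = 1, stroke = FALSE)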
To which cluster would you assign a newly observed station \(\mathbb{Y}_{\mathsf{new}}\) if
Start by writing down the general formula for the posterior probability \(\mathsf{P}(U_\mathsf{new} = g \,|\, \mathbb{Y}_\mathsf{new})\) and express it as a function of the primary model parameters. Explore the posterior distribution of these probabilities and classify the new stations based on a reasonable rule.
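Hint: Conditionally on the primary parameters and on some value of the new station's overall intensity \(w_\mathsf{new}\) (which itself has to be estimated from \(\mathbb{Y}_\mathsf{new}\) or given its own prior), Bayes' theorem gives \[ \mathsf{P}(U_\mathsf{new} = g \,|\, \mathbb{Y}_\mathsf{new}, \boldsymbol{\tau}, w_\mathsf{new}, \boldsymbol{\lambda}) = \frac{\tau_g \prod\limits_{d=1}^{5}\prod\limits_{h=1}^{24} \mathsf{Pois}\left(y_{\mathsf{new},hd};\, w_\mathsf{new}\lambda_{hg}\right)}{\sum\limits_{g'=1}^{G} \tau_{g'} \prod\limits_{d=1}^{5}\prod\limits_{h=1}^{24} \mathsf{Pois}\left(y_{\mathsf{new},hd};\, w_\mathsf{new}\lambda_{hg'}\right)}, \] where \(\mathsf{Pois}(y; \mu) = \mu^{y} e^{-\mu} / y!\). Evaluating this expression at each MCMC draw yields the posterior distribution of the classification probabilities.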
Here are some points to be discussed: