Yorozuya — Hao Ren (http://invkrh.me/)

Linear Algebra Proofs (2019-06-11)

A list of useful properties and their proofs in linear algebra. Some of them are also very useful in machine learning. Involved proofs are out of the scope of this post, but references are given for the interested reader.

# 0. Basics

## Proof 0.1

Let $A$ and $B$ be two $N \times N$ square matrices; then $det(AB) = det(A) \cdot det(B)$

## Proof 0.2

The determinant of an orthogonal matrix must be $\pm1$

Because $1=det(I)=det\left(Q^{T} Q\right)=det\left(Q^{T}\right) det(Q)=(det(Q))^{2} \Rightarrow det(Q)=\pm 1$

## Proof 0.3

Let $A$ be an $n \times n$ matrix and let $\lambda_1, ..., \lambda_n$ be its eigenvalues; then

$det(A)=\prod_{i=1}^{n} \lambda_{i}$

$tr(A)=\sum_{i=1}^{n} \lambda_{i}$
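These identities are easy to sanity-check numerically. Below is a quick NumPy sketch (a check on random matrices, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# Proof 0.1: det(AB) = det(A) * det(B)
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))

# Proof 0.3: det(A) is the product of the eigenvalues, tr(A) their sum.
# Eigenvalues of a general real matrix may be complex, but the product
# and the sum are real (complex eigenvalues come in conjugate pairs).
eigvals = np.linalg.eigvals(A)
assert np.isclose(np.linalg.det(A), np.prod(eigvals).real)
assert np.isclose(np.trace(A), np.sum(eigvals).real)
```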

# 1. Symmetric Matrix

## Definition

Matrix $A$ is symmetric if $A = A^\mathrm{T}$

## Proof 1.1

If $\lambda$ is an eigenvalue of $A$, so is its conjugate, denoted as $\overline{\lambda}$.
If $\mathbf{x}$ is an eigenvector of $A$, so is its conjugate, denoted as $\overline{\mathbf{x}}$.

Since $A$ is a real matrix,

$A = \overline{A}$

Knowing that $A\mathbf{x} = \lambda \mathbf{x}$

$A \overline{\mathbf{x}} = \overline{A} \overline{\mathbf{x}}=\overline{A \mathbf{x}}=\overline{\lambda \mathbf{x}}=\overline{\lambda} \overline{\mathbf{x}}$

## Proof 1.2

$A$ has only real eigenvalues

Suppose $\lambda$ and $\mathbf{x}$ are an eigenvalue and eigenvector of $A$, respectively. Based on Proof 1.1,

$\overline{\mathbf{x}}^{\mathrm{T}} A \mathbf{x}=\overline{\mathbf{x}}^{\mathrm{T}} \lambda \mathbf{x}=\lambda \overline{\mathbf{x}}^{\mathrm{T}} \mathbf{x}$

Since $\overline{\mathbf{x}}^{\mathrm{T}} A \mathbf{x}$ is a scalar, it equals its own transpose, i.e. $\overline{\mathbf{x}}^{\mathrm{T}} A \mathbf{x} = (A\mathbf{x})^\mathrm{T}\overline{\mathbf{x}}$,

$\overline{\mathbf{x}}^{\mathrm{T}} A \mathbf{x} = \mathbf{x}^\mathrm{T} A^\mathrm{T} \overline{\mathbf{x}} = \mathbf{x}^\mathrm{T} A \overline{\mathbf{x}} = \overline{\lambda} \mathbf{x}^\mathrm{T} \overline{\mathbf{x}}$

Since $\overline{\mathbf{x}}^{\mathrm{T}} \mathbf{x} = \mathbf{x}^\mathrm{T} \overline{\mathbf{x}} = \sum_i |x_i|^2 > 0$ for $\mathbf{x} \neq 0$,

$\lambda = \overline{\lambda}$

## Proof 1.3

$A$ is diagonalizable by an orthogonal matrix.

Schur decomposition:
Every square matrix factors as $A=QTQ^{-1}$ where $T$ is upper triangular and $\overline{Q}^\mathrm{T}=Q^{-1}$. If $A$ has real eigenvalues, then $Q$ and $T$ can be chosen real, with $Q^\mathrm{T}Q = I$ (i.e., $Q$ is an orthogonal matrix).

Based on Proof 1.2, all the eigenvalues of $A$ are real.
Based on the Schur decomposition, $Q^\mathrm{T}AQ = T$.
Then,

$T^\mathrm{T} = (Q^\mathrm{T}AQ)^\mathrm{T} = (AQ)^\mathrm{T}Q = Q^\mathrm{T}A^\mathrm{T}Q = Q^\mathrm{T}AQ = T$

Since $T$ is both upper triangular and symmetric, it is diagonal. Denoting this diagonal matrix as $D$, we have

$A = QDQ^\mathrm{T}$
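Numerically, `numpy.linalg.eigh` computes exactly this factorization for a symmetric matrix; a quick sketch confirming $A = QDQ^\mathrm{T}$ with orthogonal $Q$ (an illustration, not a proof):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2  # symmetrize a random matrix

# eigh is specialized for symmetric matrices: real eigenvalues, orthogonal Q
eigvals, Q = np.linalg.eigh(A)

assert np.allclose(Q.T @ Q, np.eye(4))             # Q is orthogonal
assert np.allclose(Q @ np.diag(eigvals) @ Q.T, A)  # A = Q D Q^T
```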

## Proof 1.4

If $A$ is nonsingular, $A^{-1}$ is symmetric

Since $A$ is invertible,

$A^{-1}A = I$

Taking the transpose, we have

\begin{aligned} I &=I^{\mathrm{T}}=\left(A^{-1} A\right)^{\mathrm{T}} \\ &=A^{\mathrm{T}}\left(A^{-1}\right)^{\mathrm{T}} \\ &=A\left(A^{-1}\right)^{\mathrm{T}} \end{aligned}

Hence, $A^{-1} = (A^{-1})^\mathrm{T}$

# 2. Positive definite symmetric matrix

## Definition

A real symmetric $n \times n$ matrix $A$ is called positive definite if $\mathbf{x}^\mathrm{T}A\mathbf{x}>0$ for all non-zero vectors $\mathbf{x} \in \mathbb{R}^n$.

## Proof 2.1

The eigenvalues of a real symmetric positive-definite matrix $A$ are all positive.

Let $\lambda$ be a (real) eigenvalue of $A$ and let $\mathbf{x}$ be a corresponding real eigenvector. That is, we have

$A\mathbf{x}=\lambda \mathbf{x}$

Then we multiply by $\mathbf{x}^\mathrm{T}$ on the left and obtain

\begin{aligned}\mathbf{x}^\mathrm{T} A \mathbf{x} &=\lambda \mathbf{x}^\mathrm{T} \mathbf{x} \\ &=\lambda\|\mathbf{x}\|^{2}\end{aligned}

The left-hand side is positive because $A$ is positive definite and $\mathbf{x}$, being an eigenvector, is a nonzero vector.
Since the norm $\|\mathbf{x}\|^2$ is positive, $\lambda$ must be positive.
It follows that every eigenvalue $\lambda$ of $A$ is positive.

## Proof 2.2

If eigenvalues of a real symmetric matrix $A$ are all positive, then $A$ is positive-definite.

By Proof 1.3, $A = QDQ^\mathrm{T}$ where $Q^{-1} = Q^\mathrm{T}$, so we have

$\mathbf{x}^\mathrm{T}A\mathbf{x} = \mathbf{x}^\mathrm{T}QDQ^\mathrm{T}\mathbf{x}$

where

$D=\left[\begin{array}{cccc}{\lambda_{1}} & {0} & {\cdots} & {0} \\ {0} & {\lambda_{2}} & {\cdots} & {0} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {0} & {0} & {\cdots} & {\lambda_{n}}\end{array}\right]$

Putting $\mathbf{y} = Q^\mathrm{T}\mathbf{x}$, we can rewrite the above equation as

$\mathbf{x}^\mathrm{T}A\mathbf{x} = \mathbf{y}^\mathrm{T}D\mathbf{y}$

Let

$\mathbf{y}=\left[\begin{array}{c}{y_{1}} \\ {y_{2}} \\ {\vdots} \\ {y_{n}}\end{array}\right]$

Then we have

$\mathbf{x}^{\mathrm{T}} A \mathbf{x}=\mathbf{y}^{\mathrm{T}} D \mathbf{y} = \lambda_{1} y_{1}^{2}+\lambda_{2} y_{2}^{2}+\cdots+\lambda_{n} y_{n}^{2}$

By assumption, the eigenvalues $\lambda_i$ are positive.
Also, since $\mathbf{x}$ is a nonzero vector and $Q$ is invertible, $\mathbf{y}=Q^\mathrm{T}\mathbf{x}$ is not a zero vector.
Thus the sum expression above is positive, hence $\mathbf{x}^\mathrm{T}A\mathbf{x}$ is positive for any nonzero vector $\mathbf{x}$.
Therefore, the matrix $A$ is positive-definite.
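The equivalence established by Proofs 2.1 and 2.2 can be illustrated numerically by constructing $A = QDQ^\mathrm{T}$ with chosen positive eigenvalues and testing $\mathbf{x}^\mathrm{T}A\mathbf{x}$ on random vectors (a sketch, not a proof; the eigenvalues 0.5, 1 and 2 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# Build a symmetric matrix with chosen positive eigenvalues 0.5, 1, 2
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # a random orthogonal matrix
A = Q @ np.diag([0.5, 1.0, 2.0]) @ Q.T

# x^T A x should be positive for every nonzero x
for _ in range(1000):
    x = rng.standard_normal(3)
    assert x @ A @ x > 0
```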

## Proof 2.3

$A$ is invertible

Method 1

By Proof 2.1, the matrix $A$ does not have $0$ as an eigenvalue.

We can see why this implies invertibility by contradiction:
if $A$ were singular, then $A\mathbf{x}=0=0 \cdot \mathbf{x}$ for some $\mathbf{x} \neq 0$, so by the definition of eigenvalues, $\mathbf{x}$ would be an eigenvector with eigenvalue $\lambda = 0$, contradicting Proof 2.1.

Method 2

We can also prove this using the determinant:

$det A = \prod_{i=1}^{n} \lambda_i \gt 0$

A matrix with a nonzero determinant is invertible.

## Proof 2.4

The inverse of $A$ is positive-definite

Based on Proof 2.1 and Proof 2.2, we know the fact that a symmetric matrix is positive-definite if and only if its eigenvalues are all positive.

All eigenvalues of $A^{−1}$ are of the form $1 / \lambda$, where $\lambda$ is an eigenvalue of $A$.
Since $A$ is positive-definite, each eigenvalue $\lambda$ is positive, hence $1 / \lambda$ is positive.

So all eigenvalues of $A^{-1}$ are positive, and it yields that $A^{-1}$ is positive-definite.
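As a quick numerical illustration (using the same construction as before, with assumed eigenvalues 0.5, 1 and 2):

```python
import numpy as np

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
A = Q @ np.diag([0.5, 1.0, 2.0]) @ Q.T  # symmetric positive-definite

# Eigenvalues of A^{-1} are the reciprocals 1/0.5, 1/1, 1/2 -- all positive
inv_eigvals = np.linalg.eigvalsh(np.linalg.inv(A))
assert np.allclose(np.sort(inv_eigvals), [0.5, 1.0, 2.0])
assert np.all(inv_eigvals > 0)
```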

# 3. Matrix calculus

## Definition

https://en.wikipedia.org/wiki/Matrix_calculus

## Derivatives of Matrices, Vectors and Scalar Form

$a$ is a scalar and $\mathbf{x}$ is a column vector $\begin{bmatrix} x_{1} & x_{2} & \cdots & x_{n}\end{bmatrix}^T$

$\frac{\partial a}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial a}{\partial x_1} & \frac{\partial a}{\partial x_2} & \dots & \frac{\partial a}{\partial x_n}\end{bmatrix}^T$

(Denominator layout is used here: the derivative of a scalar with respect to a column vector is itself a column vector, which keeps the identities below consistent.)

$\mathbf{a}$ and $\mathbf{x}$ are column vectors

$\frac{\partial \mathbf{a}^T \mathbf{x}}{\partial \mathbf{x}} = \frac{\partial \mathbf{x}^T \mathbf{a}}{\partial \mathbf{x}} = \mathbf{a}$

$A$ is a matrix, $\mathbf{x}$ is a column vector, and $\mathbf{x}^T A \mathbf{x}$ is a scalar

$\frac{\partial \mathbf{x}^T A \mathbf{x}}{\partial \mathbf{x}} = (A + A^T) \mathbf{x}$

If $A$ is symmetric, then

$\frac{\partial \mathbf{x}^T A \mathbf{x}}{\partial \mathbf{x}} = 2A\mathbf{x}$

$\mathbf{a}$ and $\mathbf{b}$ are column vectors, $X$ is a matrix, and $\mathbf{a}^T X \mathbf{b}$ is a scalar

$\frac{\partial \mathbf{a}^T X \mathbf{b}}{\partial X} = \mathbf{a} \mathbf{b}^T$
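These identities can be verified against finite differences; a small sketch (the helper `num_grad` is ad hoc, not from any library):

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at x (any array shape)."""
    g = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        xp, xm = x.copy(), x.copy()
        xp[idx] += eps
        xm[idx] -= eps
        g[idx] = (f(xp) - f(xm)) / (2 * eps)
    return g

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)
a, b = rng.standard_normal(3), rng.standard_normal(2)
X = rng.standard_normal((3, 2))

# d(x^T A x)/dx = (A + A^T) x
assert np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x)

# d(a^T X b)/dX = a b^T
assert np.allclose(num_grad(lambda M: a @ M @ b, X), np.outer(a, b))
```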

## Derivatives of a Determinant

$X$ is a matrix

$\frac{\partial det(X)}{\partial X} = det(X) \cdot (X^{-1})^T$

## Derivatives of an Inverse

$A$ is a matrix that depends on a scalar $x$

$\frac{\partial A^{-1}}{\partial x}=-A^{-1} \frac{\partial A}{\partial x} A^{-1}$

In the scalar case where $A = x$, this reduces to

$\frac{\partial A^{-1}}{\partial x}=-A^{-2}$
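The determinant and inverse derivatives can be checked the same way with central differences (a sketch; $A(x) = A_0 + xB$ is an assumed parameterization chosen just for the test):

```python
import numpy as np

eps = 1e-6
rng = np.random.default_rng(5)

# d det(X)/dX = det(X) * (X^{-1})^T, checked entrywise
X = np.eye(3) + 0.1 * rng.standard_normal((3, 3))  # well-conditioned
G = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps
        Xm[i, j] -= eps
        G[i, j] = (np.linalg.det(Xp) - np.linalg.det(Xm)) / (2 * eps)
assert np.allclose(G, np.linalg.det(X) * np.linalg.inv(X).T)

# d A^{-1}/dx = -A^{-1} (dA/dx) A^{-1}, with A(x) = A0 + x * B so dA/dx = B
A0 = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
A_inv = lambda x: np.linalg.inv(A0 + x * B)
num_deriv = (A_inv(eps) - A_inv(-eps)) / (2 * eps)
assert np.allclose(num_deriv, -A_inv(0) @ B @ A_inv(0), atol=1e-6)
```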

Expectation-Maximization Algorithm Explained (2019-05-14)

The EM algorithm is used to find (local) maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. Typically these models involve latent variables in addition to unknown parameters and known data observations. In this post, we will talk about how the algorithm works, then prove its correctness, and finally show a concrete and very common use case where the algorithm is applied.

# Description

The typical setting of the EM algorithm is a parameter estimation problem in which we have

• A training set $\mathbf{X} = \{ x_1, \ldots, x_n \}$ consisting of $n$ independent observed data points. Each may be discrete or continuous. Associated with each data point may be a vector of observations $\mathbf{Y} = \{ y_1, \ldots, y_n \}$.
• A set of unobserved latent data or missing values $\mathbf{Z} = \{ z_1, \ldots, z_n \}$ associated with each data point. They are discrete, drawn from a fixed number of values, and with one latent variable per observed unit.
• A vector of unknown parameters $\boldsymbol{\theta}$ which are continuous, and are of two kinds:
• Parameters that are associated with all data points
• Those associated with a specific value of a latent variable (i.e., associated with all data points whose corresponding latent variable has that value).

We wish to fit the parameters of a model $p(x ; \boldsymbol{\theta})$ (which may or may not include $y$), and the log marginal likelihood function of all data points to maximize is given by

\begin{aligned}& L(\boldsymbol{\theta};\mathbf{X}) \\= \ \ &\sum_{i=1}^{n} \log p(x_i ; \boldsymbol{\theta}) \\= \ \ &\sum_{i=1}^{n} \log \sum_{z_i} p(x_i, z_i ; \boldsymbol{\theta})\end{aligned}

However, maximizing $L(\boldsymbol{\theta};\mathbf{X})$ explicitly might be difficult because $\mathbf{Z}$ are latent random variables. So let’s refine it by Bayes’ theorem:

\begin{aligned}& \sum_{i=1}^{n} \log p(x_i ; \boldsymbol{\theta}) \\= \ \ & \sum_{i=1}^{n} (\log p(x_i, z_i ; \boldsymbol{\theta}) - \log p(z_i | x_i; \boldsymbol{\theta})) \\= \ \ & \sum_{i=1}^{n} \log p(x_i, z_i ; \boldsymbol{\theta}) - \sum_{i=1}^{n} \log p(z_i | x_i; \boldsymbol{\theta})\end{aligned}

Denote:

$L(\boldsymbol{\theta}; \mathbf{X}, \mathbf{Z}) = \sum_{i=1}^{n} \log p(x_i, z_i ; \boldsymbol{\theta})$

Then:

$L(\boldsymbol{\theta};\mathbf{X}) = L(\boldsymbol{\theta}; \mathbf{X}, \mathbf{Z}) - \sum_{i=1}^{n} \log p(z_i | x_i; \boldsymbol{\theta})$

This equation will be used to prove the correctness; let’s denote it as equation $(eq.*)$

The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying these two steps:

## Expectation step (E step)

Define $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$ as the expected value of the log likelihood function of $\boldsymbol{\theta}$, with respect to the current conditional distribution of $\mathbf{Z}$, given $\mathbf{X}$ and the current estimates of the parameters $\boldsymbol{\theta}^{(t)}$

\begin{aligned}& Q\left(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)} \right) \\= \ \ & \mathrm{E}_{z \mid x; \boldsymbol{\theta}^{(t)}} \left [ L(\boldsymbol{\theta}; \mathbf{X}, \mathbf{Z}) \right ] \\= \ \ & \mathrm{E}_{z \mid x; \boldsymbol{\theta}^{(t)}} \left [ \sum_{i=1}^{n} \log p(x_i, z_i ; \boldsymbol{\theta}) \right ] \\= \ \ & \sum_{i=1}^{n} \mathrm{E}_{z_i \mid x_i; \boldsymbol{\theta}^{(t)}} \left [ \log p(x_i, z_i ; \boldsymbol{\theta}) \right ] \\= \ \ & \sum_{i=1}^{n} \sum_{z_i} p(z_i \mid x_i; \boldsymbol{\theta}^{(t)}) \log p(x_i, z_i ; \boldsymbol{\theta})\end{aligned}

## Maximization step (M step)

Find the parameters that maximize this quantity:

$\boldsymbol{\theta}^{(t+1)} = \underset{\boldsymbol{\theta}}{\arg \max} \, Q\left(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}\right)$

# Proof of Correctness

The trick of the EM algorithm is to improve $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$ rather than directly improving $L(\boldsymbol{\theta};\mathbf{X})$. Here we show that improvements to the former imply improvements to the latter.

We take the expectation over possible values of the unknown data $z_i$ under the current parameter estimate $\boldsymbol{\theta}^{(t)}$ on both sides of equation $(eq.*)$, by multiplying both sides by $p(z_i \mid x_i; \boldsymbol{\theta}^{(t)})$ and summing (or integrating) over $z_i$. The left-hand side is the expectation of a constant, since it is independent of $z_i$, so we get:

\begin{aligned}& L(\boldsymbol{\theta};\mathbf{X}) \\= \ \ & \mathrm{E}_{z \mid x; \boldsymbol{\theta}^{(t)}} \left [ L(\boldsymbol{\theta}; \mathbf{X}, \mathbf{Z}) \right ] - \mathrm{E}_{z \mid x; \boldsymbol{\theta}^{(t)}} \left[\sum_{i=1}^{n} \log p(z_i | x_i; \boldsymbol{\theta}) \right] \\= \ \ & Q \left(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}\right) - \sum_{i=1}^{n}\sum_{z_i} p \left( z_i \mid x_i; \boldsymbol{\theta}^{(t)} \right) \log p(z_i \mid x_i ; \boldsymbol{\theta}) \\ = \ \ & Q \left( \boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)} \right) + H \left( \boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)} \right)\end{aligned}

where $H \left(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}\right)$ is defined by the negated sum it is replacing.

This last equation holds for every value of $\boldsymbol{\theta}$, including $\boldsymbol{\theta} = \boldsymbol{\theta}^{(t)}$:

$L\left( \boldsymbol{\theta}^{(t)};\mathbf{X} \right) = Q \left( \boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)} \right) + H \left( \boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)} \right)$

and subtracting this last equation from the previous equation gives

$L(\boldsymbol{\theta};\mathbf{X}) - L\left( \boldsymbol{\theta}^{(t)};\mathbf{X} \right) = Q \left( \boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)} \right) - Q \left( \boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)} \right) + H \left( \boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)} \right) - H \left( \boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)} \right)$

Note that

\begin{aligned}& H \left( \boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)} \right) - H \left( \boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)} \right) \\= \ \ & \sum_{i=1}^{n} \underbrace{- \sum_{z_i} p \left( z_i \mid x_i; \boldsymbol{\theta}^{(t)} \right) \log p(z_i \mid x_i ; \boldsymbol{\theta})}_{cross-entropy(z_i \mid x_i; \boldsymbol{\theta}^{(t)}, z_i \mid x_i; \boldsymbol{\theta})} - \underbrace{\left[ - \sum_{z_i} p \left( z_i \mid x_i; \boldsymbol{\theta}^{(t)} \right) \log p(z_i \mid x_i ; \boldsymbol{\theta}^{(t)}) \right]}_{entropy(z_i \mid x_i; \boldsymbol{\theta}^{(t)})}\end{aligned}

Based on Gibbs’ inequality — the information entropy of a distribution $P$ is less than or equal to its cross entropy with any other distribution $Q$ — we have

$H\left(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)} \right) \geq H\left(\boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)} \right)$

Hence, we can conclude that

$L(\boldsymbol{\theta};\mathbf{X}) - L\left( \boldsymbol{\theta}^{(t)};\mathbf{X} \right) \geq Q \left( \boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)} \right) - Q \left( \boldsymbol{\theta}^{(t)} \mid \boldsymbol{\theta}^{(t)} \right)$

In words, choosing $\boldsymbol{\theta}$ to improve $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$ causes $L(\boldsymbol{\theta};\mathbf{X})$ to improve at least as much.

Alternatively, you can also prove the correctness using Jensen’s inequality. More details in this note.
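Gibbs’ inequality, the key step above, is easy to verify numerically for arbitrary discrete distributions:

```python
import numpy as np

rng = np.random.default_rng(6)

# Two arbitrary distributions P and Q over the same support
p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()

entropy = -np.sum(p * np.log(p))        # H(P)
cross_entropy = -np.sum(p * np.log(q))  # H(P, Q)

# Gibbs' inequality: H(P, Q) >= H(P), with equality iff P == Q
assert cross_entropy >= entropy
```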

# Gaussian Mixture Model (GMM)

A typical application of the EM algorithm is the Gaussian mixture model (GMM), which is a clustering problem with the following setting:

• A training set $\mathbf{X} = \{ x_1, \ldots, x_n \}$ which is a sample of $n$ independent observations from a mixture of $k$ multivariate normal distributions of dimension $d$. (since we are in the unsupervised learning setting, these points do not come with any labels.)
• A set of latent variables $\mathbf{Z} = \{ z_1,z_2,\ldots,z_n \}$ that determine the component from which each observation originates. $z_i \sim \text { Multinomial }(\phi)$ where $\phi_j \geq 0, \sum_{j=1}^{k} \phi_j=1$ and the parameter $\phi_j$ gives $p(z_i=j)$.
• The model posits that each $x_i$ was generated by randomly choosing $z_i$ from $\{1, \ldots, k\}$, and then $x_i$ was drawn from one of $k$ Gaussians depending on $z_i$. That is, $x_i \mid z_i=j \sim \mathcal{N}\left(\mu_{j}, \Sigma_{j}\right)$
• The parameters to estimate are $\boldsymbol{\theta} = \{\phi_j, \mu_j, \Sigma_j \mid j \in 1,\ldots,k\}$. For simplicity, write $\theta_j = \{\phi_j, \mu_j, \Sigma_j\}$

And we wish to model the data by specifying the joint distribution $p(x_i, z_i)$. Hence, the log likelihood is

\begin{aligned}& L(\boldsymbol{\theta} ; \mathbf{X}) \\= \ \ & \sum_{i=1}^{n} \log \sum_{j=1}^{k} p(x_i, z_i = j; \theta_j) \\= \ \ & \sum_{i=1}^{n} \log \sum_{j=1}^{k} p(x_i \mid z_i = j; \mu_j, \Sigma_j) \cdot p(z_i = j; \phi_j) \\= \ \ & \sum_{i=1}^{n} \log \sum_{j=1}^{k} f\left(x_i ; \mu_j, \Sigma_j\right) \cdot \phi_j\end{aligned}

However, if we set to zero the derivatives of this formula with respect to the parameters and try to solve, we’ll find that it is not possible to find the maximum likelihood estimates of the parameters in closed form. (Try this yourself at home.) This is where the EM algorithm comes in.

## E-step

• Define $Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)})$

$L(\boldsymbol{\theta} ; \mathbf{X}, \mathbf{Z}) = \sum_{i=1}^{n} \log \left( f\left(x_i ; \mu_{z_i}, \Sigma_{z_i}\right) \cdot \phi_{z_i} \right)$

where

$f\left(x_i ; \mu_j, \Sigma_j\right) = \frac{1}{(2 \pi)^{d / 2}\left|\Sigma_{j}\right|^{1 / 2}} \exp \left(-\frac{1}{2}\left(x_i-\mu_{j}\right)^{T} \Sigma_{j}^{-1}\left(x_i-\mu_{j}\right)\right)$

then

$Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) = \sum_{i=1}^{n} \sum_{j=1}^{k} p(z_i = j \mid x_i; \boldsymbol{\theta}^{(t)}) \cdot \log \left( f\left(x_i ; \mu_j, \Sigma_j\right) \cdot \phi_j \right)$

• Compute $w_{ij}^{(t)}$
Given our current estimate of the parameters $\boldsymbol{\theta}^{(t)}$, the conditional distribution of the $z_i$ is determined by Bayes’ theorem to be the Gaussian density weighted by $\phi_j$ and normalized:

\begin{aligned}& w_{ij}^{(t)} \\= \ \ & p(z_i = j \mid x_i; \boldsymbol{\theta}^{(t)}) \\= \ \ & \frac{p(x_i, z_i = j ; \boldsymbol{\theta}^{(t)})}{p(x_i; \boldsymbol{\theta}^{(t)})} \\= \ \ & \frac{p(x_i, z_i = j ; \boldsymbol{\theta}^{(t)})}{\sum_{l = 1}^{k} p(x_i, z_i = l; \boldsymbol{\theta}^{(t)})} \\= \ \ & \frac{f\left(x_i ; \mu_j^{(t)}, \Sigma_j^{(t)}\right) \cdot \phi_{j}^{(t)}}{\sum_{l = 1}^{k} f\left(x_i ; \mu_l^{(t)}, \Sigma_l^{(t)}\right) \cdot \phi_{l}^{(t)}}\end{aligned}

## M-step

We need to maximize, with respect to our parameters $\boldsymbol{\theta}$, the quantity

\begin{aligned}& Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) \\= \ \ & \sum_{i=1}^{n}\sum_{j=1}^{k} w_{ij}^{(t)} \log \frac{1}{(2 \pi)^{d / 2}\left|\Sigma_{j}\right|^{1 / 2}} \exp \left(-\frac{1}{2}\left(x_i-\mu_{j}\right)^{T} \Sigma_{j}^{-1}\left(x_i-\mu_{j}\right)\right) \cdot \phi_{j} \\= \ \ & \sum_{i=1}^{n}\sum_{j=1}^{k} w_{ij}^{(t)} \left[\log \phi_j - \frac{1}{2} \log \left|\Sigma_{j}\right| - \frac{1}{2}(x_i-\mu_j)^T \Sigma_{j}^{-1}(x_i - \mu_j) - \frac{d}{2} \log(2\pi) \right]\end{aligned}

• Update $\phi_j$
Grouping together only the terms that depend on $\phi_j$, we find that we need to maximize

$\sum_{i=1}^{n}\sum_{j=1}^{k} w_{ij}^{(t)} \log \phi_j$

subject to

$\sum_{j=1}^{k} \phi_j = 1$

So we construct the Lagrangian

$\mathcal{L}(\phi) = \sum_{i=1}^{n}\sum_{j=1}^{k} w_{ij}^{(t)} \log \phi_j + \lambda \left( \sum_{j=1}^{k} \phi_j - 1 \right)$

where $\lambda$ is the Lagrange multiplier. Taking the derivative with respect to $\phi_j$, we find

$\frac{\partial}{\partial \phi_{j}} \mathcal{L}(\phi)=\sum_{i=1}^{n} \frac{w_{ij}^{(t)}}{\phi_{j}}+\lambda$

Setting this to zero and solving, we get

$\phi_{j}=\frac{\sum_{i=1}^{n} w_{ij}^{(t)}}{-\lambda}$

Using the constraint that $\sum_{j=1}^{k} \phi_j = 1$ and knowing the fact that $\sum_{j=1}^{k} w_{ij}^{(t)} = 1$ (probabilities sum to 1), we easily find:

$-\lambda=\sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij}^{(t)} = \sum_{i=1}^{n} 1= n$

We therefore have updates for the parameters $\phi_j$:

$\phi_{j} :=\frac{1}{n} \sum_{i=1}^{n} w_{ij}^{(t)}$

• Update $\mu_j$

\begin{aligned}& \frac{\partial}{\partial \mu_{j}} Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) \\= \ \ & \frac{\partial}{\partial \mu_{j}} \sum_{i=1}^{n} -\frac{1}{2} w_{ij}^{(t)} (x_i-\mu_j)^T \Sigma_{j}^{-1}(x_i - \mu_j) \\= \ \ & \frac{1}{2} \sum_{i=1}^{n} w_{ij}^{(t)} \dfrac{\partial}{\partial \mu_{j}} \left( 2 \mu_j^T\Sigma_j^{-1}x_i - \mu_j^T\Sigma_j^{-1}\mu_j \right) \\= \ \ & \sum_{i=1}^{n} w_{ij}^{(t)} \left( \Sigma_j^{-1} x_i - \Sigma_j^{-1}\mu_j \right)\end{aligned}

Setting this to zero and solving for $\mu_j$ therefore yields the update rule

$\mu_{j} :=\frac{\sum_{i=1}^{n} w_{ij}^{(t)} x_i}{\sum_{i=1}^{n} w_{ij}^{(t)}}$

• Update $\Sigma_j$

\begin{aligned}& \frac{\partial}{\partial \Sigma_{j}} Q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^{(t)}) \\= \ \ & \frac{\partial}{\partial \Sigma_{j}} \sum_{i=1}^{n} -\frac{1}{2} w_{ij}^{(t)} \left[ \log\left|\Sigma_{j}\right| + (x_i-\mu_j)^T \Sigma_{j}^{-1}(x_i - \mu_j)\right] \\= \ \ & -\frac{1}{2} \sum_{i=1}^{n} w_{ij}^{(t)} \dfrac{\partial}{\partial \Sigma_{j}} \left[ \log\left|\Sigma_{j}\right| + (x_i-\mu_j)^T \Sigma_{j}^{-1}(x_i - \mu_j)\right] \\= \ \ & -\frac{1}{2} \sum_{i=1}^{n} w_{ij}^{(t)} \left( \Sigma_{j}^{-1} - (x_i-\mu_j)(x_i - \mu_j)^T \Sigma_{j}^{-2} \right)\end{aligned}

Setting the partial derivative to zero and solving for $\Sigma_j$ therefore yields the update rule

$\Sigma_{j} :=\frac{\sum_{i=1}^{n} w_{ij}^{(t)} (x_i-\mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^{n} w_{ij}^{(t)}}$

More details on the derivative calculus:
Note that $\Sigma_{j}$ is an invertible, symmetric, positive-definite matrix, so the following hold:

• The inverse of a symmetric matrix is symmetric, so $(\Sigma_{j}^{-1})^T = \Sigma_{j}^{-1}$
• $\Sigma_{j}$ is positive-definite, so $\Sigma_{j}^{-1}$ is also positive-definite
• A positive-definite matrix is invertible, so $\Sigma_{j}^{-2}$ exists

So we get

\begin{aligned}& \dfrac{\partial}{\partial \Sigma_{j}} \log\left|\Sigma_{j}\right| \\= \ \ & \frac{1}{\left|\Sigma_{j}\right|} \frac{\partial}{\partial \Sigma_{j}} \left|\Sigma_{j}\right| \\= \ \ & \frac{1}{\left|\Sigma_{j}\right|} \left|\Sigma_{j}\right| \cdot (\Sigma_{j}^{-1})^T \\= \ \ & \Sigma_{j}^{-1}\end{aligned}

and

\begin{aligned}& \frac{\partial}{\partial \Sigma_{j}} (x_i-\mu_j)^T \Sigma_{j}^{-1}(x_i - \mu_j) \\= \ \ & (x_i-\mu_j)(x_i - \mu_j)^T \frac{\partial}{\partial \Sigma_{j}} \Sigma_{j}^{-1} \\= \ \ & - (x_i-\mu_j)(x_i - \mu_j)^T \Sigma_{j}^{-2} \\\end{aligned}
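Putting the E-step and the three M-step updates together, here is one possible minimal NumPy sketch (the function name `gmm_em`, the `mu_init` parameter, and the toy data are illustrative assumptions, not from the post; there is no convergence check or covariance regularization):

```python
import numpy as np

def gmm_em(X, k, mu_init, n_iter=50):
    """A minimal sketch of the E/M updates derived above.
    mu_init is a (k, d) array of starting means."""
    n, d = X.shape
    phi = np.full(k, 1.0 / k)
    mu = mu_init.astype(float).copy()
    sigma = np.stack([np.eye(d)] * k)
    for _ in range(n_iter):
        # E-step: responsibilities w[i, j] = p(z_i = j | x_i; theta^(t))
        w = np.zeros((n, k))
        for j in range(k):
            diff = X - mu[j]
            inv = np.linalg.inv(sigma[j])
            norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma[j]))
            w[:, j] = phi[j] * norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
        w /= w.sum(axis=1, keepdims=True)
        # M-step: the closed-form updates for phi_j, mu_j, Sigma_j
        nj = w.sum(axis=0)
        phi = nj / n
        mu = (w.T @ X) / nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (w[:, j, None] * diff).T @ diff / nj[j]
    return phi, mu, sigma

# Toy data: two well-separated Gaussian blobs around (-3, -3) and (3, 3)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
phi, mu, sigma = gmm_em(X, k=2, mu_init=np.array([[-1.0, -1.0], [1.0, 1.0]]))
```

With this data, the estimated means should land near the two blob centers and the mixing weights near $0.5$ each.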

Compared to K-Means, which uses hard assignment, GMM uses soft assignment, which is a good way to represent uncertainty. But GMM is limited in the number of features it can handle, since it requires storing a covariance matrix whose size is quadratic in the number of features. Even when the number of features does not exceed this limit, the algorithm may perform poorly on high-dimensional data.
This is due to high-dimensional data:

• making it difficult to cluster at all (based on statistical/theoretical arguments)
• numerical issues with Gaussian distributions

### Time

1 hour

### Essential Cookware

Le Creuset cast iron Dutch oven

### Method

1. Remove the chicken’s neck and tail and wash it clean (no need to drain). Rub salt and black pepper inside and out, and marinate for half an hour.
2. Slice the ginger and cut part of the scallions into sections. Coat the bottom of the pot with a layer of oil, then cover it with the ginger slices and scallion sections. Finely chop the remaining scallions and the bird’s eye chilies and set aside.
3. Place the marinated young chicken in the pot and drizzle three large spoonfuls of cooking wine over it.
4. Cover with the lid and cook over medium heat; once steam comes out, reduce to low heat.
5. After 20 minutes on low heat, the chicken is done if a chopstick pierces the meat easily.
6. Without turning off the heat, pour in a little light soy sauce or steamed-fish soy sauce.
7. Add the chopped scallions and chilies and a handful of cilantro, scattering them evenly over the chicken.
8. Heat a wok, pour in some oil, add a small pinch of Sichuan peppercorns and a spoonful of sesame oil, and heat until smoking hot. Turn off the heat and discard the peppercorns.
9. Pour the scalding oil over the chicken to flash-cook the scallions and chilies.

### Finished Dish
Google Hash Code 2019 (2019-04-13)

My thoughts on the qualification round of Google Hash Code 2019. If you are interested in the past Online Qualification Round and Finals problem statements, you should check this page.

## General information

Based on the problem statement, your team of 4 needs to solve 5 different cases of the problem. Each case is represented by an input text file. An output file should be generated for each input file and submitted to the Judge System, which gives a score for each output file. You can submit as many times as you like, but only the highest score is counted for each case. The sum of all the scores is the team’s final score. The top teams are invited to the final round, held at Google Dublin on April 27, 2019.

The problem is always NP-hard, which means you cannot expect to find the optimal solution in polynomial time. Fortunately, you can make the result better and better. Greedy algorithms and dynamic programming are very helpful.

## Problem statement

• A slideshow contains a list of slides
• A slide contains either one horizontal photo or two vertical photos
• A photo contains a list of tags (string type)
• The tag set of a slide is the set of all tags of the photos in the slide
• If the slide contains one horizontal photo: $S = tags(h)$
• If the slide contains two vertical photos: $S = tags(v_1) \cup tags(v_2)$
• The slideshow is scored based on how interesting the transitions between each pair of subsequent slides are
• For two subsequent slides $S_i$ and $S_{i+1}$, the score is the minimum (the smallest number of the three) of:
1. the number of common tags between $S_i$ and $S_{i+1}$
2. the number of tags in $S_i$ but not in $S_{i+1}$
3. the number of tags in $S_{i+1}$ but not in $S_i$
• Score formula: $min(\vert S_i \cap S_{i+1} \vert, \vert S_i \setminus S_{i+1} \vert, \vert S_{i+1} \setminus S_i \vert)$

For simplicity, a horizontal slide refers to a slide containing one horizontal photo, and a vertical slide refers to a slide containing two vertical photos.
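The score formula translates directly into code; a minimal sketch using Python sets (the tag values are made up):

```python
def transition_score(s1: set, s2: set) -> int:
    """Interest score between the tag sets of two subsequent slides:
    min(common tags, tags only in s1, tags only in s2)."""
    return min(len(s1 & s2), len(s1 - s2), len(s2 - s1))

# common = {sun}, only-left = {cat}, only-right = {sea} -> min(1, 1, 1) = 1
assert transition_score({"cat", "sun"}, {"sun", "sea"}) == 1
# no common tags -> score is 0
assert transition_score({"a", "b"}, {"c", "d"}) == 0
```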

## Solution

First of all, let’s look at how the tags are distributed in the input files. This may help us understand the problem better.

| Orientation | Metric | A | B | C | D | E |
|---|---|---|---|---|---|---|
| H | # photos | 2 | 80000 | 500 | 30000 | NA |
| H | # distinct tags | 4 | 840000 | 1558 | 220 | NA |
| H | # tags per photo | 2.5 | 18.0 | 9.43 | 10.04 | NA |
| H | # photos per tag | 1.25 | 1.71 | 3.02 | 1369.09 | NA |
| V | # photos | 2 | NA | 500 | 60000 | 80000 |
| V | # distinct tags | 3 | NA | 1548 | 220 | 500 |
| V | # tags per photo | 2 | NA | 9.52 | 10.01 | 19.09 |
| V | # photos per tag | 1.3 | NA | 3.07 | 2732.06 | 3055.96 |

Some observations:

• B, D, E will contribute most of the score because there are many photos in these cases
• B contains horizontal photos only, which makes it a good starting point for trying ideas
• B has many distinct tags, far more than D and E
• In B and E, photos have nearly twice as many tags as in D
• In terms of photos per tag, D and E are thousands of times larger than B
• E contains vertical photos only, which would be the last challenge

### Intuition

Actually, we can view the problem as building a sequence of slides while maximizing its overall transition score. Given the tag set of the previous slide, we can find a slide that maximizes the current transition. If every subsequent transition is optimized, we can hope to eventually optimize the final score.

### Greedy Algorithm

```
prevSlide <- create a slide of 1 unused horizontal photo or 2 unused vertical photos
result += prevSlide
while some photos are unused:
    (hSlide, hScore) <- best horizontal slide w.r.t. prevSlide
    (vSlide, vScore) <- best vertical slide w.r.t. prevSlide
    if both hSlide and vSlide are found, then
        if hScore >= vScore, then
            prevSlide <- hSlide
        else
            prevSlide <- vSlide
    else if only vSlide is found, then
        prevSlide <- vSlide
    else if only hSlide is found, then
        prevSlide <- hSlide
    else
        prevSlide <- create a slide of 1 unused horizontal photo or 2 unused vertical photos
    result += prevSlide
done
return result
```

#### How to find the best horizontal slide w.r.t the previous slide?

Obviously, we just iterate over the horizontal photos $h_i$ and take the photo with the highest score after the previous slide $p$.

$h_{i}^{*} = \underset{h_i \in H}{\operatorname{argmax}} score(p, h_i)$

Actually, we don’t need to search all the photos in the pool of horizontal photos hPool. Instead, we only need to look at those having at least one common tag with the previous slide; otherwise, the score must be zero. The search space is restricted to those candidates.

Essentially, we can keep track of the photos sharing the same tag. We store this information in a HashMap[String, HashSet[Photo]] called hDict, which is a mapping from a tag to the set of photos containing that tag.
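A minimal Python sketch of this inverted index (the photo ids, tags, and the `candidates` helper are illustrative, not from the actual solution):

```python
from collections import defaultdict

# Toy photo pool: photo id -> tag set
photos = {0: {"cat", "beach"}, 1: {"cat", "sun"}, 2: {"mountain"}}

# hDict: tag -> set of photo ids containing that tag
hDict = defaultdict(set)
for pid, tags in photos.items():
    for tag in tags:
        hDict[tag].add(pid)

def candidates(prev_tags, used):
    """Unused photos sharing at least one tag with the previous slide."""
    cand = set()
    for tag in prev_tags:
        cand |= hDict.get(tag, set())
    return cand - used

# photo 0 is already used, photo 1 shares the tag "cat", photo 2 shares nothing
assert candidates({"cat", "sea"}, used={0}) == {1}
```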

#### How to find the best vertical slide w.r.t the previous slide?

This is the trickiest part of the problem.

One solution: we can iterate over all combinations of 2 vertical photos and choose the best vertical slide as we did for the horizontal slide. However, the time complexity of each step is very high, $O(N_v^2)$. If you compute the candidate set size for each previous slide, you will find that it is, on average, about 7200 vertical photos in case D and 60000 in case E. The brute-force solution may work for D, but never works for E.

A more reasonable solution is a heuristic. Given the previous slide, we first take the best vertical photo, denoted $v_i$, and then find the second vertical photo $v_j$ maximizing the score between the previous slide $p$ and the vertical slide made of $v_i$ and $v_j$.

$v_{i}^{*} = \underset{v_i \in V}{\operatorname{argmax}} score(p, v_i)$

$v_{j}^{*} = \underset{v_j \in (V - v_i^*)}{\operatorname{argmax}} score(p, v_{i}^{*} + v_j)$

#### Can we do better?

We notice that the sequence can be built in parallel. More precisely, we can hash-partition the input photos into $M$ blocks. Each block of photos can be used to build a sub-sequence independently, and then all the sub-sequences are joined into one sequence. This is a classical map-reduce pattern which can largely improve the performance.

### Code

You may find the code here (comments included)

## Result

With the heuristic algorithm and the parallelization trick, we obtained the following result. The final score is 1112255, which is around top 20 in the qualification round (see here). Moreover, the times for D and E are under 5 minutes, which was acceptable during the competition.

| Case | Time | Score |
|---|---|---|
| a_example | 119 ms | 2 |
| b_lovely_landscapes | 17569 ms | 201792 |
| c_memorable_moments | 165 ms | 1736 |
| d_pet_pictures | 228491 ms | 428041 |
| e_shiny_selfies | 294582 ms | 480684 |

Final score = 1112255

## Conclusion

The competition was well organized as usual. However, I am a little disappointed by the problem itself. The provided data set is very easy to crack by brute force: one can just randomize the slide sequence several times and return the best one. Some teams did that and got a good result. Fortunately, these randomized brute-force solutions were not good enough to reach the final. Anyway, this reminds me once again of the maxim: done is better than perfect.

## Auxiliary

The CPU information of the computer on which the programme has been run.

```
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 94
Model name:            Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz
Stepping:              3
CPU MHz:               800.015
CPU max MHz:           4000.0000
CPU min MHz:           800.0000
...
```
My thoughts on the qualification round of [Google Hash Code 2019](/2019/04/13/google-hash-code-2019/qualification.pdf). If you are interested in the past Online Qualification Round and Finals problem statements, you should check this [page](https://codingcompetitions.withgoogle.com/hashcode/archive).
Tensorflow in Practice Learning Note http://invkrh.me/2019/03/11/tensorflow-specialization-learning-note/ 2019-03-11T22:03:33.000Z 2019-09-07T17:02:45.072Z A learning note of the coursera specialization Tensorflow in practice given by deeplearning.ai.

• Course 1: Introduction to TensorFlow for AI, ML and DL
• Course 2: Convolutional Neural Networks in TensorFlow
• Course 3: Natural Language Processing in TensorFlow
• Course 4: Sequences, Time Series and Prediction

# C1W1: A New Programming Paradigm


## Code

### How to fit a line

```python
import tensorflow as tf
import numpy as np
from tensorflow import keras

model = tf.keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
model.compile(optimizer='sgd', loss='mean_squared_error')
xs = np.array([-1.0,  0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
ys = np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0], dtype=float)
model.fit(xs, ys, epochs=500)
print(model.predict([10.0]))
```

The predicted value is not exactly 19.0 but a little under. This is because neural networks deal with probabilities: given the data we fed the NN, it calculated that there is a very high probability that the relationship between $X$ and $Y$ is $Y=2X-1$, but with only 6 data points we can’t know for sure. As a result, the prediction for 10 is very close to 19, but not necessarily 19.

# C1W2: Introduction to Computer Vision

## Note

### Why are the labels numbers instead of words

Using a number is a first step in avoiding bias – instead of labelling it with words in a specific language and excluding people who don’t speak that language! You can learn more about bias and techniques to avoid it here.

### What is cross entropy (CE)

$CE = - \sum_{c=0}^{C - 1} y_c \cdot log( f_c(\vec{x}) )$

where

• $C$: the number of classes
• $\vec{x}$: the feature vector of an example
• $y_c$: 1 if the example belongs to class $c$, 0 otherwise (one-hot label)
• $f_c$: the learned prediction function, which takes the feature vector $\vec{x}$ and returns the probability of the example being of class $c$

When $C = 2$, writing $y$ for the label and $p$ for the predicted probability of the positive class:

$CE = - \big[ y \cdot log( p ) + (1 - y) \cdot log( 1 - p ) \big]$
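As a sanity check, the binary case can be evaluated numerically with plain `math` (a sketch, not the Keras implementation):

```python
import math

def binary_ce(y, p):
    # Cross entropy of a single example with true label y in {0, 1}
    # and predicted probability p of the positive class.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A confident, correct prediction gives a small loss...
low = binary_ce(1, 0.9)   # -ln(0.9) ~ 0.105
# ...while a confident, wrong prediction is heavily penalized.
high = binary_ce(1, 0.1)  # -ln(0.1) ~ 2.303
```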

### Difference between categorical_crossentropy and sparse_categorical_crossentropy

• If your targets are one-hot encoded, use categorical_crossentropy.
Examples of one-hot encodings:
[1,0,0], [0,1,0], [0,0,1]
• But if your targets are integers, use sparse_categorical_crossentropy.
Examples of integer encodings (for the sake of completion):
1, 2, 3
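The two losses compute the same quantity; only the label encoding differs. A minimal illustration in plain Python (not the Keras implementation):

```python
import math

probs = [0.7, 0.2, 0.1]  # model output for one example (3 classes)

# categorical_crossentropy: one-hot target
one_hot = [1, 0, 0]
cat_ce = -sum(t * math.log(p) for t, p in zip(one_hot, probs))

# sparse_categorical_crossentropy: integer target
label = 0
sparse_ce = -math.log(probs[label])
```

Both pick out the log-probability of the true class, so the results are identical.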

## Code

```python
# Early stopping
class myCallback(tf.keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs={}):
    if(logs.get('loss') < 0.4):
      print("\nLoss is below 0.4 so cancelling training!")
      self.model.stop_training = True

callbacks = myCallback()
mnist = tf.keras.datasets.fashion_mnist
(training_images, training_labels), (test_images, test_labels) = mnist.load_data()
# Data normalization
training_images = training_images / 255.0
test_images = test_images / 255.0
model = tf.keras.models.Sequential([tf.keras.layers.Flatten(),
                                    tf.keras.layers.Dense(128, activation=tf.nn.relu),
                                    tf.keras.layers.Dense(10, activation=tf.nn.softmax)])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(training_images, training_labels, epochs=5, callbacks=[callbacks])
model.evaluate(test_images, test_labels)
```

# C1W3: Enhancing Vision with Convolutional Neural Networks

## Note

### Convolution Layer

Each kernel is an edge detector, which is perfect for computer vision: it is often the features highlighted like this that distinguish one item from another, and the amount of information needed is then much less, because you’ll just train on the highlighted features.

### MaxPooling Layer

The convolution layer is followed by a MaxPooling layer, which is designed to compress the image while maintaining the content of the features that were highlighted by the convolution.

### Why CNN works

A CNN tries different filters on the image and learns which ones work when looking at the training data. As a result, you’ll have greatly reduced information passing through the network, but because it isolates and identifies features, you can also get increased accuracy.

## Code

### Model

```python
# Reshape to a 4D tensor, otherwise the convolutions do not recognize the shape
training_images = training_images.reshape(60000, 28, 28, 1)
training_images = training_images / 255.0
test_images = test_images.reshape(10000, 28, 28, 1)
test_images = test_images / 255.0

# 2-convolution-layer NN
model = tf.keras.models.Sequential([
  # default: strides = 1, padding = 'valid'
  tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1)),
  # default: strides = None (same as pool_size), padding = 'valid'
  tf.keras.layers.MaxPooling2D(2, 2),
  tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
  tf.keras.layers.MaxPooling2D(2, 2),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10, activation='softmax')
])
```
```
Layer (type)                 Output Shape              Param #   | Comments
=================================================================
conv2d (Conv2D)              (None, 26, 26, 64)        640       | = 64 x (3 x 3 x 1 + 1)
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 64)        0         |
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        36928     | = 64 x (3 x 3 x 64 + 1)
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         |
_________________________________________________________________
flatten_1 (Flatten)          (None, 1600)              0         |
_________________________________________________________________
dense_2 (Dense)              (None, 128)               204928    | = 128 x (1600 + 1)
_________________________________________________________________
dense_3 (Dense)              (None, 10)                1290      | = 10 x (128 + 1)
=================================================================
Total params: 243,786
Trainable params: 243,786
Non-trainable params: 0
```

### How to compute output size

Convolution layer

$(n + 2p - f + 1) \times (n + 2p - f + 1)$

MaxPooling layer

$Floor(\frac{height - f}{s} + 1) \times Floor(\frac{width - f}{s} + 1)$

• $n$: input size
• $p$: padding size
• $f$: filter size

• Valid: no padding, $p = 0$
• Same: results in padding the input such that the output has the same length as the original input

$n + 2p - f + 1 = n \implies p = (f - 1) / 2$

where $f$ is almost always an odd number
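Applied to the 28x28 network above (valid padding, stride 1 convolutions, 2x2 pooling), these formulas reproduce the shapes in the model summary:

```python
def conv_out(n, f, p=0, s=1):
    # Convolution output size: floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

def pool_out(n, f, s=None):
    # MaxPooling output size; the stride defaults to the pool size.
    s = f if s is None else s
    return (n - f) // s + 1

a = conv_out(28, 3)   # conv2d        -> 26
b = pool_out(a, 2)    # max_pooling2d -> 13
c = conv_out(b, 3)    # conv2d_1      -> 11
d = pool_out(c, 2)    # max_pooling2d_1 -> 5
```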

### How to compute number of parameters

$NF \times (f \times f \times NC_{input} + 1 )$

• $NF$: number of filters
• $NC_{input}$: number of input channels
• Each filter has a bias term
• Convolutions Over Volume
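The same formula reproduces the parameter counts of the two convolution layers in the summary above:

```python
def conv_params(nf, f, nc):
    # nf filters, each of size f x f x nc, plus one bias per filter.
    return nf * (f * f * nc + 1)

first = conv_params(64, 3, 1)    # conv2d:   64 x (3 x 3 x 1 + 1)
second = conv_params(64, 3, 64)  # conv2d_1: 64 x (3 x 3 x 64 + 1)
```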

### Visualizing the Convolutions and Pooling

Each row represents an item. There are 3 shoe images here.
The 4 columns represent the output of the first 4 layers (conv2d, max_pooling2d, conv2d_1, max_pooling2d_1).
We can find the commonality for the same kind of items.

# C1W4: Using Real-world Images

## Note

### ImageGenerator

• ImageGenerator can flow images from a directory and perform operations such as resizing them on the fly.
• You can point it at a directory and then the sub-directories of that will automatically generate labels for you
```
images
|-- training
|   |-- horse
|   |   |-- 1.jpg
|   |   |-- 2.jpg
|   |   `-- 3.jpg
|   `-- human
|       |-- 1.jpg
|       |-- 2.jpg
|       `-- 3.jpg
`-- validation
    |-- horse
    |   |-- 1.jpg
    |   |-- 2.jpg
    |   `-- 3.jpg
    `-- human
        |-- 1.jpg
        |-- 2.jpg
        `-- 3.jpg
```

If you point ImageGenerator to training directory, it will generate a stream of images labelled with horse or human

### Mini-batch

#### Why mini-batch

For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.

• The mini-batches may need to be quite big when adapting fancy methods such as Momentum and RMSProp.
• Big mini-batches are more computationally efficient.

## Code

### Model

```python
import tensorflow as tf
from tensorflow.keras.optimizers import RMSprop

model = tf.keras.models.Sequential([
    # Note the input shape is the desired size of the image 300x300 with 3 bytes color
    # This is the first convolution
    tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(300, 300, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The second convolution
    tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The third convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The fourth convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The fifth convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # Flatten the results to feed into a DNN
    tf.keras.layers.Flatten(),
    # 512 neuron hidden layer
    tf.keras.layers.Dense(512, activation='relu'),
    # Only 1 output neuron. It will contain a value from 0-1,
    # where 0 is for one class ('horses') and 1 for the other ('humans')
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Train our model with the binary_crossentropy loss,
# because it's a binary classification problem and our final activation is a sigmoid.
# More details: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(lr=0.001),
              metrics=['acc'])
model.summary()
```
```
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 298, 298, 16)      448
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 149, 149, 16)      0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 147, 147, 32)      4640
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 73, 73, 32)        0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 71, 71, 64)        18496
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 35, 35, 64)        0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 33, 33, 64)        36928
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 16, 16, 64)        0
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 14, 14, 64)        36928
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 7, 7, 64)          0
_________________________________________________________________
flatten (Flatten)            (None, 3136)              0
_________________________________________________________________
dense (Dense)                (None, 512)               1606144
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 513
=================================================================
Total params: 1,704,097
Trainable params: 1,704,097
Non-trainable params: 0
```

The convolutions reduce the shape from 90000 (300 x 300) down to 3136 (7 x 7 x 64).
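This reduction can be checked arithmetically: each 3x3 valid convolution shrinks the size by 2, and each 2x2 pooling halves it (flooring):

```python
size = 300
for _ in range(5):            # five conv + maxpool stages
    size = size - 2           # 3x3 convolution, valid padding
    size = size // 2          # 2x2 max pooling
flattened = size * size * 64  # 64 channels in the last convolution
```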

### ImageDataGenerator

```python
# All images will be rescaled by 1./255
train_datagen = ImageDataGenerator(rescale=1/255)
validation_datagen = ImageDataGenerator(rescale=1/255)

# Flow training images in batches of 128 using the train_datagen generator
train_generator = train_datagen.flow_from_directory(
        '/tmp/horse-or-human/',  # This is the source directory for training images
        target_size=(300, 300),  # All images will be resized to 300x300
        batch_size=128,          # number of images for each batch
        # Since we use binary_crossentropy loss, we need binary labels
        class_mode='binary')

# Flow validation images in batches of 32 using the validation_datagen generator
validation_generator = validation_datagen.flow_from_directory(
        '/tmp/validation-horse-or-human/',  # This is the source directory for validation images
        target_size=(300, 300),  # All images will be resized to 300x300
        batch_size=32,           # number of images for each batch
        # Since we use binary_crossentropy loss, we need binary labels
        class_mode='binary')

history = model.fit_generator(
      train_generator,
      steps_per_epoch=8,  # number of batches for each epoch during training
      epochs=15,
      verbose=1,
      validation_data=validation_generator,
      validation_steps=8)  # number of batches for each epoch during validation
```

### Visualizing Intermediate Representations

As you can see, we go from the raw pixels of the images to increasingly abstract and compact representations. The representations downstream start highlighting what the network pays attention to, and they show fewer and fewer features being “activated”; most are set to zero. This is called “sparsity.” Representation sparsity is a key feature of deep learning.

These representations carry increasingly less information about the original pixels of the image, but increasingly refined information about the class of the image. You can think of a convnet (or a deep network in general) as an information distillation pipeline.

# C2W1: Exploring a Larger Dataset

## Code

```python
import os
import random
import numpy as np
from tensorflow.keras.preprocessing.image import img_to_array, load_img

# Let's define a new Model that will take an image as input, and will output
# intermediate representations for all layers in the previous model after the first.
successive_outputs = [layer.output for layer in model.layers[1:]]
visualization_model = tf.keras.models.Model(inputs=model.input, outputs=successive_outputs)

# Let's prepare a random input image of a cat or dog from the training set.
cat_img_files = [os.path.join(train_cats_dir, f) for f in train_cat_fnames]
dog_img_files = [os.path.join(train_dogs_dir, f) for f in train_dog_fnames]
img_path = random.choice(cat_img_files + dog_img_files)
img = load_img(img_path, target_size=(150, 150))  # this is a PIL image
x = img_to_array(img)          # Numpy array with shape (150, 150, 3)
x = x.reshape((1,) + x.shape)  # Numpy array with shape (1, 150, 150, 3)
x /= 255.0                     # Rescale by 1/255

# Let's run our image through our network, thus obtaining all
# intermediate representations for this image.
successive_feature_maps = visualization_model.predict(x)

# These are the names of the layers, so we can have them as part of our plot
layer_names = [layer.name for layer in model.layers]

# Now let's display our representations
for layer_name, feature_map in zip(layer_names, successive_feature_maps):
    if len(feature_map.shape) == 4:
        # Just do this for the conv / maxpool layers, not the fully-connected layers
        n_features = feature_map.shape[-1]  # number of features in the feature map
        size = feature_map.shape[1]         # feature map shape (1, size, size, n_features)
        # We will tile our images in this matrix
        display_grid = np.zeros((size, size * n_features))
        # Postprocess the feature to be visually palatable
        for i in range(n_features):
            x = feature_map[0, :, :, i]
            x -= x.mean()
            x /= x.std()
            x *= 64
            x += 128
            x = np.clip(x, 0, 255).astype('uint8')
            display_grid[:, i * size : (i + 1) * size] = x  # Tile each filter into a horizontal grid
        # Display the grid
        scale = 20. / n_features
        plt.figure(figsize=(scale * n_features, scale))
        plt.title(layer_name)
        plt.grid(False)
        plt.imshow(display_grid, aspect='auto', cmap='viridis')
```

# C2W2: Augmentation: A technique to avoid overfitting

## Note

### Image augmentation

• Image augmentation implementation in Keras: https://keras.io/preprocessing/image/

• The image generator library lets you load the images into memory, process them, and then stream them to the neural network for training. The preprocessing doesn’t require you to edit your raw images, nor does it amend them for you on-disk. It does it in-memory as it’s performing the training, allowing you to experiment without impacting your dataset.

• As we start training, we’ll initially see that the accuracy is lower than with the non-augmented version. This is because of the random effects of the different image processing that’s being done. As it runs for a few more epochs, you’ll see the accuracy slowly climbing.

• The image augmentation introduces a random element to the training images, but if the validation set doesn’t have the same variety, its results can fluctuate. You don’t just need a broad set of images for training, you also need them for testing, or the image augmentation won’t help you very much. (This does NOT mean that you should augment your validation set; see below.)

• The validation dataset should not be augmented: the validation set is used to estimate how your method works on real-world data, so it should only contain real-world data. Adding augmented data will not improve the accuracy of the validation. It will at best say something about how well your method responds to the data augmentation, and at worst ruin the validation results and interpretability, as the validation accuracy is no longer a good proxy for the accuracy on new unseen data if you augment it.

## Code

```python
train_datagen = ImageDataGenerator(
      rescale=1./255,
      rotation_range=40,
      width_shift_range=0.2,
      height_shift_range=0.2,
      shear_range=0.2,
      zoom_range=0.2,
      horizontal_flip=True,
      fill_mode='nearest')
```

# C2W3: Transfer Learning

## Note

### What is transfer learning

You can take an existing model, freeze many of its layers to prevent them from being retrained, effectively ‘remembering’ the convolutions it was trained on, then add your own DNN underneath so that you can retrain on your images using the convolutions from the other model.

### Why dropout can do the regularization

The idea behind Dropout is that it randomly removes neurons from your neural network during training. This works very well for two reasons:

• The first is that neighboring neurons often end up with similar weights, which can lead to overfitting, so dropping some out at random can remove this.

• The second is that a neuron can often over-weigh the input from a particular neuron in the previous layer, and can over-specialize as a result. With dropout it cannot rely on any single input, since that input may be randomly dropped; instead, it spreads the weights, which tends to shrink them.
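A plain-Python sketch of (inverted) dropout as applied at training time; the 1/(1-rate) rescaling keeps the expected activation unchanged, which is one common formulation:

```python
import random

def dropout(activations, rate, rng):
    # Zero out each unit with probability `rate`, and scale the survivors
    # by 1 / (1 - rate) so the expected activation stays the same.
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
out = dropout([1.0] * 1000, rate=0.2, rng=rng)
```

At inference time nothing is dropped; because of the training-time rescaling, no correction is needed then.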

## Code

```python
from tensorflow.keras import layers
from tensorflow.keras import Model
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.applications.inception_v3 import InceptionV3

local_weights_file = '/tmp/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5'

pre_trained_model = InceptionV3(input_shape=(150, 150, 3),
                                include_top=False,  # whether to include the fully-connected layer at the top of the network
                                weights=None)       # one of None (random initialization) or 'imagenet' (pre-training on ImageNet)
pre_trained_model.load_weights(local_weights_file)

for layer in pre_trained_model.layers:
  layer.trainable = False

last_layer = pre_trained_model.get_layer('mixed7')
last_output = last_layer.output

# Flatten the output layer to 1 dimension
x = layers.Flatten()(last_output)
# Add a fully connected layer with 1,024 hidden units and ReLU activation
x = layers.Dense(1024, activation='relu')(x)
# Add a dropout rate of 0.2
x = layers.Dropout(0.2)(x)
# Add a final sigmoid layer for classification
x = layers.Dense(1, activation='sigmoid')(x)

model = Model(pre_trained_model.input, x)
model.compile(optimizer=RMSprop(lr=0.0001),
              loss='binary_crossentropy',
              metrics=['acc'])
```

# C2W4: Multiclass Classification

## Note

• Use CGI to generate images for Rock, Paper, Scissors

## Code

```python
train_generator = training_datagen.flow_from_directory(TRAINING_DIR,
                                                       target_size=(150, 150),
                                                       class_mode='categorical')
# Same for validation

model = tf.keras.models.Sequential([
    # Convolution layers
    # ...
    # Flatten the results to feed into a DNN
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    # 512 neuron hidden layer
    tf.keras.layers.Dense(512, activation='relu'),
    # 3 nodes with softmax
    tf.keras.layers.Dense(3, activation='softmax')
])
```

Another way of using the fit_generator API is via (images, labels) arrays, instead of via a directory:

```python
history = model.fit_generator(train_datagen.flow(training_images, training_labels, batch_size=32),
                              steps_per_epoch=len(training_images) / 32,
                              epochs=15,
                              validation_data=validation_datagen.flow(testing_images, testing_labels, batch_size=32),
                              validation_steps=len(testing_images) / 32)
```

# C3W1: Sentiment in text

## Code

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
  'I love my dog',
  'I love my cat'
]

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
```

Remark:

• If the number of distinct words is bigger than num_words, the tokenizer will take the top num_words words by volume
• num_words is optional. If it is not set, all the words in the sentences are kept
• oov_token is used for words that aren’t in the word index
• Punctuation, such as commas, and whitespace are removed
• Tokenization is case insensitive: words are converted to lower case
• word_index is sorted by commonality

```python
sequences = tokenizer.texts_to_sequences(sentences)
```

Remark:
If you train a neural network on a corpus of texts, and the text has a word index generated from it, then when you want to do inference with the trained model, you’ll have to encode the text that you want to infer on with the same word index; otherwise the inference would be meaningless.

```python
test_seq = tokenizer.texts_to_sequences(test_data)
```

Remark:
New words which are not in the index will be lost in the sequences.
In that case:

• We need a very broad corpus
• We need to put a special value for unknown words: Tokenizer(num_words = 100, oov_token="<OOV>")

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded = pad_sequences(sequences)
```

Remark:

```python
padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)
```

Remark:

• If you only want your sentences to have a maximum of five words, you can set maxlen=5
• Sentences longer than maxlen lose information from the beginning by default
• If you want to lose information from the end instead, you can do so with the truncating parameter
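The defaults described above (pre-padding and pre-truncating) can be mimicked in plain Python; `pad` below is a simplified stand-in for Keras’s pad_sequences, handling a single sequence:

```python
def pad(seq, maxlen, padding='pre', truncating='pre', value=0):
    # Mimic the pad_sequences defaults for one sequence of token ids.
    if len(seq) > maxlen:
        # 'pre' truncating drops tokens from the beginning.
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == 'pre' else seq + fill

pre = pad([1, 2, 3, 4, 5, 6, 7], maxlen=5)       # keeps the end of the sentence
post = pad([1, 2, 3], maxlen=5, padding='post')  # zeros appended at the end
```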

# C3W2: Word Embeddings

## Note

### Why subwords work poorly

Not only do the meanings of the words matter, but also the sequence in which they are found.
Subwords are meaningless on their own, and our neural network does not take the order of the words into account.
This is where RNNs come into play.

## Code

### Check TF version

```python
import tensorflow as tf
print(tf.__version__)
```

Remark:

• Use python3
• If the version of tensorflow is 1.x, you should call tf.enable_eager_execution(), which is the default behaviour in tensorflow 2.x

### Download imdb_reviews via tensorflow-datasets

```python
!pip install -q tensorflow-datasets

import tensorflow_datasets as tfds

imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)
train_data, test_data = imdb['train'], imdb['test']
```


### Prepare dataset

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'

# train_sentences is a list of strings
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index
train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences,
                             padding=padding_type,
                             truncating=trunc_type,
                             maxlen=max_length)

# validation_sentences is a list of strings
validation_sequences = tokenizer.texts_to_sequences(validation_sentences)
validation_padded = pad_sequences(validation_sequences,
                                  padding=padding_type,
                                  truncating=trunc_type,
                                  maxlen=max_length)

# labels is a list of strings
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
training_label_seq = np.array(label_tokenizer.texts_to_sequences(train_labels))
validation_label_seq = np.array(label_tokenizer.texts_to_sequences(validation_labels))
```

Remark:

• the number of unique labels is always very small, so there is no need to set num_words and oov_token
• once labels are parsed into a list, we need to convert the list into a numpy array, which is required by the tf.keras APIs used below

### Train word embeddings

```python
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    # tf.keras.layers.Flatten(),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(6, activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

num_epochs = 30
history = model.fit(train_padded, training_label_seq,
                    epochs=num_epochs,
                    validation_data=(validation_padded, validation_label_seq),
                    verbose=2)
```

Remark:

• Flatten(): more parameters => more accurate
• GlobalAveragePooling1D: fewer parameters => less accurate but still good
• GlobalAveragePooling1D averages across the sequence dimension to flatten it out
• Check out the model summaries below
```
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 120, 16)           160000
_________________________________________________________________
flatten (Flatten)            (None, 1920)              0
_________________________________________________________________
dense (Dense)                (None, 6)                 11526
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7
=================================================================
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0
```

```
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 120, 16)           160000
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 102
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 7
=================================================================
Total params: 160,109
Trainable params: 160,109
Non-trainable params: 0
```

As shown above, here is how this network works:

1. Each word in an input sequence is transformed into a one-hot encoded vector, which is why the Embedding layer takes vocab_size as a parameter.
2. Each one-hot vector passes through the same embedding layer and is transformed into a 16-dim vector. For a sequence, we have 120 such vectors.
3. Instead of flattening these 120 vectors, we take the average of them, so the output is still a 16-dim vector.
4. The following 2 dense layers are straightforward.
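Step 3 can be sketched with toy data (120 vectors of dimension 16, values chosen arbitrarily):

```python
seq_len, dim = 120, 16
# A toy "embedded sequence": 120 vectors of 16 values each.
sequence = [[float(t + d) for d in range(dim)] for t in range(seq_len)]

# GlobalAveragePooling1D: average over the time axis, one value per dimension.
pooled = [sum(vec[d] for vec in sequence) / seq_len for d in range(dim)]
```

Whatever the sequence length, the pooled output keeps only the embedding dimension.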

Remark:
Global Average Pooling (GAP) is generally better than a Flatten layer in the structure above, because it needs fewer weights, which provides some regularization and can accelerate the training as well.
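The weight savings are visible in the two summaries above: the dense layer after Flatten sees 120 x 16 = 1920 inputs, while after GAP it sees only 16:

```python
inputs_flatten = 120 * 16                    # 1920 inputs after Flatten
flatten_params = 6 * (inputs_flatten + 1)    # dense after Flatten
gap_params = 6 * (16 + 1)                    # dense after GlobalAveragePooling1D
```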

### Word embedding visualization

```python
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_sentence(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

e = model.layers[0]           # the Embedding layer
weights = e.get_weights()[0]  # get_weights() returns a list of arrays
print(weights.shape)          # shape: (vocab_size, embedding_dim)

import io
out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
for word_num in range(1, vocab_size):
  word = reverse_word_index[word_num]
  embeddings = weights[word_num]
  out_m.write(word + "\n")
  out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()
```


# C3W3: Sequence models

## Note

• In terms of loss and accuracy curves, the 2-layer LSTM is smoother.
• LSTM is more likely to overfit than the flatten and averaged layers.
• This week we tried B-LSTM, B-GRU and Conv1D models. All of them have an over-fitting issue, which is natural because there are words which are out of vocabulary. They cannot be learned during training, which leads to the over-fitting.

### Model comparison

#### IMDB Subwords 8K

Training takes too long to run in colab, so no plots.

```python
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
```

```
Model: "sequential"
Layer (type)                 Output Shape              Param #
embedding (Embedding)        (None, None, 64)          523840
bidirectional (Bidirectional (None, 128)               66048
dense (Dense)                (None, 64)                8256
dense_1 (Dense)              (None, 1)                 65
Total params: 598,209
Trainable params: 598,209
Non-trainable params: 0
```

```python
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
```

```
Model: "sequential"
Layer (type)                 Output Shape              Param #
embedding (Embedding)        (None, None, 64)          523840
bidirectional (Bidirectional (None, None, 128)         66048
bidirectional_1 (Bidirection (None, 64)                41216
dense (Dense)                (None, 64)                4160
dense_1 (Dense)              (None, 1)                 65
Total params: 635,329
Trainable params: 635,329
Non-trainable params: 0
```

```python
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
```

```
Model: "sequential"
Layer (type)                 Output Shape              Param #
embedding (Embedding)        (None, None, 64)          523840
conv1d (Conv1D)              (None, None, 128)         41088
global_average_pooling1d (Gl (None, 128)               0
dense (Dense)                (None, 64)                8256
dense_1 (Dense)              (None, 1)                 65
Total params: 573,249
Trainable params: 573,249
Non-trainable params: 0
```

#### Sarcasm

```python
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
```

```
Model: "sequential"
Layer (type)                 Output Shape              Param #
embedding (Embedding)        (None, 120, 16)           16000
bidirectional (Bidirectional (None, 64)                12544
dense (Dense)                (None, 24)                1560
dense_1 (Dense)              (None, 1)                 25
Total params: 30,129
Trainable params: 30,129
Non-trainable params: 0
```

```python
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
```

```
Model: "sequential"
Layer (type)                 Output Shape              Param #
embedding (Embedding)        (None, 120, 16)           16000
conv1d (Conv1D)              (None, 116, 128)          10368
global_max_pooling1d (Global (None, 128)               0
dense (Dense)                (None, 24)                3096
dense_1 (Dense)              (None, 1)                 25
Total params: 29,489
Trainable params: 29,489
Non-trainable params: 0
```

| | Bidirectional LSTM | 1D Convolutional Layer |
|---|---|---|
| Time per epoch | 85s | 3s |

(Accuracy and loss plots omitted.)

## Code

```python
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size + 1, embedding_dim, input_length=max_length,
                              weights=[embeddings_matrix], trainable=False),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Conv1D(64, 5, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
num_epochs = 50
history = model.fit(training_sequences,
                    training_labels,
                    epochs=num_epochs,
                    validation_data=(test_sequences, test_labels),
                    verbose=2)
```

```
Layer (type)                 Output Shape              Param #
embedding (Embedding)        (None, 16, 100)           13802600
dropout (Dropout)            (None, 16, 100)           0
conv1d (Conv1D)              (None, 12, 64)            32064
max_pooling1d (MaxPooling1D) (None, 3, 64)             0
lstm (LSTM)                  (None, 64)                33024
dense (Dense)                (None, 1)                 65
Total params: 13,867,753
Trainable params: 65,153
Non-trainable params: 13,802,600
```

Applying regularization techniques like dropout can overcome overfitting. We can see from the figures below that the validation loss no longer increases sharply.

Without Dropout vs. With Dropout (accuracy and loss plots omitted)

# C3W4: Sequence models and literature

## Note

When you have a very large body of text with many words, word-based prediction does not work well: the number of unique words in the collection is huge, and the algorithm generates millions of sequences, so the labels alone would require terabytes of RAM.

A better approach is character-based prediction. The number of unique characters in a corpus is far smaller than the number of unique words, at least in English, so the same principles used to predict words apply here as well.

## Code

```python
corpus = data.lower().split("\n")
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1  # add 1 for OOV

# create input sequences using lists of tokens
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i + 1]
        input_sequences.append(n_gram_sequence)

# pad sequences
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences,
                                         maxlen=max_sequence_len,
                                         padding='pre'))

# create predictors and label
predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
label = tensorflow.keras.utils.to_categorical(label, num_classes=total_words)

model = Sequential()
# input_length: minus 1 since the last word is the label
model.add(Embedding(total_words, 100, input_length=max_sequence_len - 1))
model.add(Bidirectional(LSTM(150, return_sequences=True)))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dense(total_words // 2, activation='relu',
                kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
history = model.fit(predictors, label, epochs=100, verbose=1)
```

```
Model: "sequential_1"
Layer (type)                 Output Shape              Param #
embedding (Embedding)        (None, 10, 100)           321100
bidirectional (Bidirectional (None, 10, 300)           301200
dropout (Dropout)            (None, 10, 300)           0
lstm_1 (LSTM)                (None, 100)               160400
dense (Dense)                (None, 1605)              162105
dense_1 (Dense)              (None, 3211)              5156866
Total params: 6,101,671
Trainable params: 6,101,671
Non-trainable params: 0
```

# C4W1: Sequences and Prediction

## Note

• Imputation: filling in missing data, or projecting values into the past
• Trend: an overall upward or downward direction
• Seasonality: repeating patterns
• Autocorrelation: correlation with a delayed copy of itself (a lag)
• Noise: random, occasional values
• A real series is typically a combination of all of the above
• Non-stationary time series: the behavior changes over time, so the model should be trained on a limited time window
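These components are easy to sketch with NumPy; the helper names and parameter values below are illustrative, not from the course:

```python
import numpy as np

def trend(time, slope=0.1):
    # upward or downward drift
    return slope * time

def seasonality(time, period=365, amplitude=40):
    # pattern that repeats every `period` steps
    season_time = (time % period) / period
    return amplitude * np.sin(2 * np.pi * season_time)

def noise(time, level=5, seed=42):
    # random component
    rng = np.random.default_rng(seed)
    return rng.normal(0, level, len(time))

time = np.arange(4 * 365)
series = 10 + trend(time) + seasonality(time) + noise(time)
print(series.shape)  # (1460,)
```

Autocorrelation shows up here because the seasonal component makes the series correlate strongly with itself at a lag of one period.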

### Split training period, validation period, test period

• Fixed partitioning:
If the test period is the most recent data, it carries a strong signal for the future and should eventually be used to train the model; otherwise the model may not be optimal. So it is quite common to use just a training period and a validation period for model development, and let the test set be the future.

• Roll-forward partitioning:
At each iteration, we train the model on a training period and use it to forecast the following day, or the following week, in the validation period. It can be seen as doing fixed partitioning repeatedly, continually refining the model as we go.

### Metric

```python
mse = np.square(errors).mean()
mae = np.abs(errors).mean()
```

mse penalizes large errors more than mae does.
If large errors are potentially dangerous and cost you much more than smaller errors, you may prefer mse. But if your gain or loss is simply proportional to the size of the error, mae may be better.
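A quick numeric check of that difference, using two toy error vectors with the same mean absolute error:

```python
import numpy as np

small = np.array([1.0, 1.0, 1.0, 1.0])   # four small errors
large = np.array([0.0, 0.0, 0.0, 4.0])   # one large error, same total

print(np.abs(small).mean(), np.abs(large).mean())        # mae: 1.0 vs 1.0
print(np.square(small).mean(), np.square(large).mean())  # mse: 1.0 vs 4.0
```

The mae treats both error profiles identically, while the mse penalizes the single large error four times as heavily.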

### Moving average and differencing

1. Use differencing to cancel out the seasonality and trend
2. Use a moving average to forecast the differenced time series
3. Apply a moving average to the past time series
4. Add the smoothed difference back to the smoothed past time series
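The steps above can be sketched on a noise-free synthetic series. The moving_average helper is an assumption mirroring the course's moving_average_forecast, and for simplicity the raw past values are added back in the last step (the course additionally smooths them, which matters once noise is present):

```python
import numpy as np

def moving_average(series, window):
    # trailing moving average: output[i] is the mean of series[i : i + window]
    return np.array([series[i:i + window].mean()
                     for i in range(len(series) - window)])

period, window = 12, 5
t = np.arange(120)
series = 0.5 * t + 10 * np.sin(2 * np.pi * t / period)  # trend + seasonality

# 1. differencing cancels the seasonality: diff[i] = series[i+period] - series[i]
diff = series[period:] - series[:-period]

# 2. smooth the differenced series; diff_smooth[i] forecasts diff[i + window]
diff_smooth = moving_average(diff, window)

# 3./4. add back the series value from one period earlier
forecast = diff_smooth + series[window : window + len(diff_smooth)]
actual = series[period + window:]
print(np.abs(forecast - actual).mean())  # ~0 on this noise-free series
```

On this idealized series the seasonal term cancels exactly under differencing, so the reconstruction is essentially perfect; real data leaves a noisy residual.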

### Trailing windows and centered windows

Moving averages using centered windows can be more accurate than using trailing windows. But we can’t use centered windows to smooth present values since we don’t know future values. However, to smooth past values we can afford to use centered windows.
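A small sketch of the difference, assuming a window of 3 over a simple linear series:

```python
import numpy as np

x = np.arange(10, dtype=float)   # a simple linear series

# trailing window of 3: average of the 3 values up to and including t
trailing = np.array([x[t - 2:t + 1].mean() for t in range(2, len(x))])

# centered window of 3: average of t-1, t, t+1 (needs one future value)
centered = np.array([x[t - 1:t + 2].mean() for t in range(1, len(x) - 1)])

print(trailing[:3])  # lags the series by one step: [1. 2. 3.]
print(centered[:3])  # reproduces the series:      [1. 2. 3.]
```

On this linear series the centered average equals the true value at each step, while the trailing average is systematically one step behind, which is exactly why centered windows are more accurate when future values are available.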

## Code

```python
from tensorflow import keras

def moving_average_forecast(series, window_size):
  """Forecasts the mean of the last few values.
     If window_size=1, then this is equivalent to naive forecast"""
  forecast = []
  for time in range(len(series) - window_size):
    forecast.append(series[time:time + window_size].mean())
  return np.array(forecast)

print(keras.metrics.mean_squared_error(x_valid, naive_forecast).numpy())
print(keras.metrics.mean_absolute_error(x_valid, naive_forecast).numpy())
```

# C4W2: Deep Neural Networks for Time Series

## Note

### Preparing feature and labels

```python
dataset = tf.data.Dataset.range(10)
dataset = dataset.window(5, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(5))
dataset = dataset.map(lambda window: (window[:-1], window[-1:]))
dataset = dataset.shuffle(buffer_size=10)
dataset = dataset.batch(2).prefetch(1)
for x, y in dataset:
  print("x = ", x.numpy())
  print("y = ", y.numpy())
```
• On line 3, each window is an instance of tensorflow.python.data.ops.dataset_ops._VariantDataset containing 5 elements, but we need a tensor, so we batch each window into a single batch of 5 elements. This is why we call window.batch(5).
• On line 5, shuffle fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new ones. Perfect shuffling requires a buffer at least as large as the full dataset, and the downside is that it takes a long time. If you don't need perfect shuffling, a smaller buffer speeds things up. You can even set buffer_size to 1, in which case no shuffling happens at all.
• On line 6, according to the tensorflow doc:
The tf.data API provides a software pipelining mechanism through the tf.data.Dataset.prefetch transformation, which can be used to decouple the time data is produced from the time it is consumed. In particular, the transformation uses a background thread and an internal buffer to prefetch elements from the input dataset ahead of the time they are requested. Thus, to achieve the pipelining effect illustrated above, you can add prefetch(1) as the final transformation to your dataset pipeline (or prefetch(n) if a single training step consumes n elements).

### Sequence Bias

Sequence bias is when the order of things can impact the selection of things. For example, if I were to ask you your favorite TV show, and listed "Game of Thrones", "Killing Eve", "Travellers" and "Doctor Who" in that order, you're probably more likely to select "Game of Thrones" because you are familiar with it and it's the first thing you see, even if it is otherwise equal to the other shows. So when training on a dataset, we don't want the order of examples to impact training in a similar way, which is why it's good to shuffle them.

### Find the best learning rate

```python
lr_schedule = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch: 1e-8 * 10**(epoch / 20))
optimizer = tf.keras.optimizers.SGD(lr=1e-8, momentum=0.9)
model.compile(loss="mse", optimizer=optimizer)
history = model.fit(dataset, epochs=100, callbacks=[lr_schedule], verbose=0)

# plot the loss per epoch against the learning rate per epoch
lrs = 1e-8 * (10 ** (np.arange(100) / 20))
plt.semilogx(lrs, history.history["loss"])
plt.axis([1e-8, 1e-3, 0, 300])
```

Here, the best learning rate is around 7e-6, because it is the lowest point of the curve where it's still relatively stable.

# C4W3: Recurrent Neural Networks for Time Series

## Note

For numeric series, things such as closer numbers in the series might have a greater impact than those further away from our target value.

In some cases, you might want to input a sequence but not output one; you just want a single vector for each instance in the batch. This is typically called a sequence-to-vector RNN. In reality, all you do is ignore every output except the last one. When using Keras in TensorFlow, this is the default behavior.

If you want the recurrent layer to output a sequence, you have to specify return_sequences=True when creating the layer. You’ll need to do this when you stack one RNN layer on top of another.
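The sequence-to-vector idea can be sketched without Keras: run the recurrence over every time step and keep only the last output. All names and shapes below are illustrative:

```python
import numpy as np

def simple_rnn(inputs, Wx, Wh, b, return_sequences=False):
    # inputs: (batch, time, features); a single tanh cell, Keras-style
    batch, time, _ = inputs.shape
    state = np.zeros((batch, Wh.shape[0]))
    outputs = []
    for t in range(time):
        state = np.tanh(inputs[:, t, :] @ Wx + state @ Wh + b)
        outputs.append(state)
    seq = np.stack(outputs, axis=1)          # (batch, time, units)
    return seq if return_sequences else seq[:, -1, :]  # last step only

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 3))               # batch=2, time=5, features=3
Wx = rng.normal(size=(3, 4))
Wh = rng.normal(size=(4, 4))
b = np.zeros(4)

print(simple_rnn(x, Wx, Wh, b).shape)                         # (2, 4)
print(simple_rnn(x, Wx, Wh, b, return_sequences=True).shape)  # (2, 5, 4)
```

The second call is what return_sequences=True gives you: one output per time step, which the next stacked recurrent layer consumes.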

[Huber loss](https://en.wikipedia.org/wiki/Huber_loss)
The Huber function is a loss function that’s less sensitive to outliers and as this data can get a little bit noisy, it’s worth giving it a shot.
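A minimal NumPy sketch of the Huber loss (with the standard threshold parameter delta; this is not course code):

```python
import numpy as np

def huber(error, delta=1.0):
    # quadratic near zero, linear in the tails: less sensitive to outliers
    small = np.abs(error) <= delta
    return np.where(small,
                    0.5 * error**2,
                    delta * (np.abs(error) - 0.5 * delta))

errors = np.array([0.5, 1.0, 10.0])
print(huber(errors))     # the outlier at 10 is penalized linearly
print(0.5 * errors**2)   # squared loss blows up on the outlier
```

For the error of 10, the squared loss contributes 50 while Huber contributes only 9.5, which is why it copes better with noisy series.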

## Code

```python
tf.keras.backend.clear_session()
dataset = windowed_dataset(x_train, window_size, batch_size, shuffle_buffer_size)
model = tf.keras.models.Sequential([
  tf.keras.layers.Lambda(lambda x: tf.expand_dims(x, axis=-1)),
  tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
  tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
  tf.keras.layers.Dense(1),
  tf.keras.layers.Lambda(lambda x: x * 100.0)
])
model.compile(loss="mse",
              optimizer=tf.keras.optimizers.SGD(lr=1e-5, momentum=0.9),
              metrics=["mae"])
history = model.fit(dataset, epochs=500, verbose=1)
```

Note:
The last Lambda layer scales the outputs up by 100, which helps training. The default activation function in the RNN layers is tanh (hyperbolic tangent), which outputs values between -1 and 1. Since the time series values are usually in the tens (40s, 50s, 60s, 70s), scaling the outputs into the same ballpark helps with learning.

# C4W4: Real-world time series data

## Note

```python
model = tf.keras.models.Sequential([
  tf.keras.layers.Conv1D(filters=32, kernel_size=5,
                         strides=1, padding="causal",
                         activation="relu",
                         input_shape=[None, 1]),
  tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
  tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
  tf.keras.layers.Dense(1),
  tf.keras.layers.Lambda(lambda x: x * 200)
])
```

padding="causal"
This simply pads the layer's input with zeros at the front so that we can also predict the values of the early time steps in the window.
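Causal padding can be sketched in NumPy: pad kernel_size − 1 zeros on the left, then slide the kernel, so each output depends only on current and past inputs (the helper name is illustrative):

```python
import numpy as np

def causal_conv1d(x, kernel):
    # pad only on the left so output[t] depends on x[:t+1], never the future
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

x = np.arange(1.0, 6.0)          # [1, 2, 3, 4, 5]
kernel = np.array([0.5, 0.5])    # simple 2-tap averaging filter

print(causal_conv1d(x, kernel))  # same length as input: [0.5 1.5 2.5 3.5 4.5]
```

Note the output has the same length as the input, and the first value only "sees" x[0] plus the zero padding, which is what lets the layer produce predictions for the earliest time steps.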

A good explanation can be found [here](https://theblog.github.io/post/convolution-in-autoregressive-neural-networks/).

]]>
<p>A learning note of the coursera specialization <a href="https://www.deeplearning.ai/tensorflow-in-practice/" target="_blank" rel="noopener">Tensorflow in practice</a> given by <a href="http://deeplearning.ai" target="_blank" rel="noopener">deeplearning.ai</a>.</p> <ul> <li>Course 1: Introduction to TensorFlow for AI, ML and DL</li> <li>Course 2: Convolutional Neural Networks in TensorFlow</li> <li>Course 3: Natural Language Processing in TensorFlow</li> <li>Course 4: Sequences, Time Series and Prediction</li> </ul>

30 minutes

### Steps

1. Wash the chicken and rub it with salt, black pepper, and garlic powder; marinate for 10 minutes.
2. Mince four cloves of garlic, ideally into a paste.
3. Preheat a non-stick pan over medium-high heat with no oil at all (the chicken skin renders its own fat). Place the chicken skin-side down and flip every 2 minutes, for 8 minutes in total.
4. Add the garlic paste, honey, rice vinegar, soy sauce, and 80 ml of water to the center of the pan. Turn the heat up a bit.
5. Reduce the sauce, flip the chicken, and once the sauce thickens, plate it (optionally sprinkle with white sesame seeds and scallions).

### Result ]]>
<h3 id="用时"><a class="markdownIt-Anchor" href="#用时"></a> Time required</h3> <p>30 minutes</p>
Interview Questions in 2018 http://invkrh.me/2018/12/19/intervew-questions-in-2018/ 2018-12-18T23:00:00.000Z 2019-04-13T13:04:26.855Z A list of coding interview questions that I was asked in 2018.

## Question 1: Add Binary

### Statement (leetcode: 67)

Given two binary strings, return their sum (also a binary string).
The input strings are both non-empty and contain only the characters 1 or 0.

```
Example 1:
Input: a = "11", b = "1"
Output: "100"
Example 2:
Input: a = "1010", b = "1011"
Output: "10101"
```

### Solution

```java
String addBinary(String a, String b) {
    int i = a.length() - 1, j = b.length() - 1;
    int carry = 0;
    StringBuilder sb = new StringBuilder();
    while (i >= 0 || j >= 0) {
        int sum = carry;
        if (i >= 0) {
            sum += a.charAt(i) - '0';
            i--;
        }
        if (j >= 0) {
            sum += b.charAt(j) - '0';
            j--;
        }
        carry = sum / 2;
        sb.insert(0, sum % 2);
    }
    if (carry != 0) sb.insert(0, carry);
    return sb.toString();
}
```
Time: O(n)
Space: O(n) (StringBuilder)

### Extention

What if the given strings can be numbers of any base? Replace the base-2 arithmetic with base-k arithmetic:

```java
String addBase(String a, String b, int base) {
    int i = a.length() - 1, j = b.length() - 1;
    int carry = 0;
    StringBuilder sb = new StringBuilder();
    while (i >= 0 || j >= 0) {
        int sum = carry;
        if (i >= 0) sum += Character.digit(a.charAt(i--), base);
        if (j >= 0) sum += Character.digit(b.charAt(j--), base);
        carry = sum / base;
        sb.insert(0, Character.forDigit(sum % base, base));
    }
    if (carry != 0) sb.insert(0, Character.forDigit(carry, base));
    return sb.toString();
}
```

## Question 2: cd command

### Statement

Write a function to simulate the linux command cd

```
Example 1:
Input: cur = "/etc", path = "/bin/"
Output: "/bin"
Example 2:
Input: cur = "/etc", path = "hadoop"
Output: "/etc/hadoop"
Example 3:
Input: cur = "/etc/hadoop/conf", path = "../../hive"
Output: "/etc/hive"
Example 4:
Input: cur = "/etc/hadoop/conf", path = ".././conf"
Output: "/etc/hadoop/conf"
```

### Solution

```java
String cd(String cur, String path) {
    Stack<String> stack = new Stack<>();
    // an absolute path replaces the current directory entirely
    if (!path.startsWith("/")) {
        for (String dir : cur.split("/"))
            if (!dir.isEmpty()) stack.push(dir);
    }
    for (String dir : path.split("/")) {
        if (dir.equals("..")) {
            if (!stack.isEmpty()) stack.pop();
        } else if (!dir.isEmpty() && !dir.equals(".")) {
            stack.push(dir);
        }
    }
    return "/" + String.join("/", stack);
}
```
Time: O(n)
Space: O(n) (Stack)

## Question 3: Custom Sort String

### Statement (leetcode: 791)

S and T are strings composed of lowercase letters. In S, no letter occurs more than once.
S was sorted in some custom order previously. We want to permute the characters of T so that they match the order that S was sorted. More specifically, if x occurs before y in S, then x should occur before y in the returned string.
Return any permutation of T (as a string) that satisfies this property.

```
Example:
Input: S = "cba", T = "abcd"
Output: "cbad"
Explanation:
"a", "b", "c" appear in S, so the order of "a", "b", "c" should be "c", "b", and "a".
Since "d" does not appear in S, it can be at any position in T.
"dcba", "cdba", "cbda" are also valid outputs.
```

Note:

• S has length at most 26, and no character is repeated in S.
• T has length at most 200.
• S and T consist of lowercase letters only.

### Solution

```java
public String customSortString(String S, String T) {
    int[] dict = new int[26];
    for (char c : T.toCharArray()) {
        dict[c - 'a'] += 1;
    }
    StringBuilder sb = new StringBuilder();
    for (char c : S.toCharArray()) {
        for (int i = 0; i < dict[c - 'a']; i++)
            sb.append(c);
        dict[c - 'a'] = 0;
    }
    for (char c = 'a'; c <= 'z'; c++)
        for (int i = 0; i < dict[c - 'a']; i++)
            sb.append(c);
    return sb.toString();
}
```
Time: O(n)
Space: O(n) (StringBuilder)

## Question 4: Position of the leftmost one

### Statement

Given an n × n binary matrix (containing only 0s and 1s) in which every row is sorted, find the position of the leftmost 1.
Note: in case of a tie, return the position with the smallest row number.

```
Example:
Input matrix:
0 1 1 1
0 0 1 1
1 1 1 1   // this row has the leftmost 1
0 0 0 0
Output: [2, 0]
```

### Solution

```java
int[] findPosition(int[][] matrix) {
    int r = matrix.length;
    if (r == 0) return null;
    int c = matrix[0].length;
    if (c == 0) return null;
    int[] res = new int[] {};
    int j = c - 1;
    for (int i = 0; i < r; i++) {
        while (j >= 0 && matrix[i][j] == 1) {
            j--;
            res = new int[] {i, j + 1};
        }
    }
    return res;
}
```
Time: O(r + c) (the scan ends on the boundary)
Space: O(1)

## Question 5: Validate Binary Search Tree

### Statement (leetcode: 98)

Given a binary tree, determine if it is a valid binary search tree (BST).
Assume a BST is defined as follows:

• The left subtree of a node contains only nodes with keys less than the node’s key.
• The right subtree of a node contains only nodes with keys greater than the node’s key.
• Both the left and right subtrees must also be binary search trees.
```
Example 1:
Input:
    2
   / \
  1   3
Output: true
Example 2:
Input:
    5
   / \
  1   4
     / \
    3   6
Output: false
Explanation: The input is: [5,1,4,null,null,3,6]. The root node's value
is 5 but its right child's value is 4.
```

### Solution

```java
boolean validate(TreeNode node, long min, long max) {
    if (node == null) {
        return true;
    } else {
        if (node.val > min && node.val < max) {
            return validate(node.left, min, node.val) && validate(node.right, node.val, max);
        } else {
            return false;
        }
    }
}

boolean isValidBST(TreeNode root) {
    return validate(root, Long.MIN_VALUE, Long.MAX_VALUE);
}
```
Time: O(n) (visits every node)
Space: O(h) recursive call stack, where h is the tree height (O(log n) for a balanced tree)

## Question 6: Search word in the dictionary

### Statement (leetcode: 211)

Design a data structure that supports the following two operations:

```java
class WordDictionary {
    /** Initialize data structure */
    public WordDictionary()

    /** Adds a word into the data structure. */
    public void addWord(String word)

    /** Returns if the word is in the data structure. A word could contain
        the dot character '.' to represent any one letter. */
    public boolean search(String word)
}
```

search(word) can search for a literal word or a pattern string containing only the letters a-z and '.', where '.' matches any single letter.

```
Example:
addWord("bad")
addWord("dad")
addWord("mad")
search("pad") -> false
search("bad") -> true
search(".ad") -> true
search("b..") -> true
```

### Solution

```java
class WordDictionary {
    class TrieNode {
        TrieNode[] next = new TrieNode[26];
        String word = null;
    }

    TrieNode root;

    public WordDictionary() {
        this.root = new TrieNode();
    }

    /** Adds a word into the data structure. */
    public void addWord(String word) {
        TrieNode node = root;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (node.next[c - 'a'] == null) node.next[c - 'a'] = new TrieNode();
            node = node.next[c - 'a'];
        }
        node.word = word;
    }

    /** Returns if the word is in the data structure. A word could contain
        the dot character '.' to represent any one letter. */
    public boolean search(String word) {
        return match(word, 0, root);
    }

    private boolean match(String word, int i, TrieNode node) {
        if (i == word.length()) return node.word != null;
        char c = word.charAt(i);
        if (c == '.') {
            for (TrieNode nextNode : node.next) {
                if (nextNode != null && match(word, i + 1, nextNode)) {
                    return true;
                }
            }
            return false;
        } else {
            TrieNode nextNode = node.next[c - 'a'];
            return nextNode != null && match(word, i + 1, nextNode);
        }
    }
}
```
addWord — Time: O(n), Space: O(n) (node creation)
search — Time: O(n), Space: O(n) (recursive call stack)

## Question 7: Valid Palindrome

### Statement (leetcode: 125)

Given a string, determine if it is a palindrome, considering only alphanumeric characters and ignoring cases.
Note: For the purpose of this problem, we define empty string as valid palindrome.

```
Example 1:
Input: "A man, a plan, a canal: Panama"
Output: true
Example 2:
Input: "race a car"
Output: false
```

### Solution

```java
boolean isValid(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
}

boolean isPalindrome(String s) {
    int i = 0;
    int j = s.length() - 1;
    while (i <= j) {
        if (!isValid(s.charAt(i))) {
            i++;
            continue;
        }
        if (!isValid(s.charAt(j))) {
            j--;
            continue;
        }
        if (Character.toLowerCase(s.charAt(i)) == Character.toLowerCase(s.charAt(j))) {
            i++;
            j--;
        } else {
            return false;
        }
    }
    return true;
}
```
Time: O(n)
Space: O(1)

## Question 8: Shortest Distance To All Stations

### Statement

Given a metro map of London, find the station that is closest to all the other stations.

### Solution (Floyd–Warshall algorithm)

```java
/** graph is a weighted undirected adjacency matrix */
int solve(double[][] graph) {
    int n = graph.length;
    double[][] dist = new double[n][n];
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            dist[i][j] = graph[i][j];
        }
    }
    /** Floyd–Warshall algorithm */
    for (int k = 0; k < n; k++) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (dist[i][j] > dist[i][k] + dist[k][j]) {
                    dist[i][j] = dist[i][k] + dist[k][j];
                }
            }
        }
    }
    double min = Integer.MAX_VALUE;
    int res = -1;
    for (int i = 0; i < n; i++) {
        double sum = 0;
        for (double d : dist[i])
            sum += d;
        if (sum < min) {
            res = i;
            min = sum;
        }
    }
    return res;
}
```
Time: O(n^3)
Space: O(n^2)

## Question 9: Equilibrium Point

### Statement (leetcode: 724)

Given an array of integers nums, write a method that returns the “pivot” index of this array.
We define the pivot index as the index where the sum of the numbers to the left of the index is equal to the sum of the numbers to the right of the index.
If no such index exists, we should return -1. If there are multiple pivot indexes, you should return the left-most pivot index.

```
Example 1:
Input: nums = [1, 7, 3, 6, 5, 6]
Output: 3
Explanation: The sum of the numbers to the left of index 3 (1 + 7 + 3 = 11)
is equal to the sum of the numbers to the right of index 3 (5 + 6 = 11).
Also, 3 is the first index where this occurs.
Example 2:
Input: nums = [1, 2, 3]
Output: -1
Explanation: There is no index that satisfies the conditions in the problem statement.
```

### Solution

```java
int pivotIndex(int[] nums) {
    int sum = 0, leftsum = 0;
    for (int x : nums) sum += x;
    for (int i = 0; i < nums.length; ++i) {
        if (leftsum == sum - leftsum - nums[i]) return i;
        leftsum += nums[i];
    }
    return -1;
}
```
Time: O(n)
Space: O(1)

## Question 10: Complete Binary Tree

### Statement

Given a complete binary tree whose nodes are numbered in level order (root = 1), with several connections removed, determine whether the node with the given number is still reachable from the root.

```
Example 1:
Input: tree = root, num = 5
            1 -> root
           / \
          /   \
         /     \
        /       \
       /         \
      2           3
     /           / \
    /           /   \
   4     5     6     7
  / \   / \   / \   / \
 8   9 10 11 12 13 14 15
Output: false
Example 2:
Input: tree = root, num = 6
            1 -> root
             \
              \
               \
                \
                 \
      2           3
     / \         / \
    /   \       /   \
   4     5     6     7
  / \   / \   / \   / \
 8   9 10 11 12 13 14 15
Output: true
```

### Solution

```java
boolean findInCompleteTree(TreeNode root, int n) {
    List<Boolean> path = new LinkedList<>();
    while (n > 1) {
        if (n % 2 == 0) {
            path.add(0, true);
        } else {
            path.add(0, false);
        }
        n /= 2;
    }
    for (boolean p : path) {
        if (p) root = root.left;
        else root = root.right;
        if (root == null) return false;
    }
    return true;
}
```
Time: O(log n)
Space: O(log n)

### Extension (leetcode: 222)

Count the number of nodes in a complete binary tree.

```
Example 1:
Input: tree = root
            1 -> root
           / \
          /   \
         /     \
        /       \
       /         \
      2           3
     / \         / \
    /   \       /   \
   4     5     6     7
  / \   / \   /
 8   9 10 11 12
Output: 12
```

```java
int countInCompleteTree(TreeNode root) {
    TreeNode node = root;
    int depthLeft = 0;
    while (node != null) {
        depthLeft++;
        node = node.left;
    }
    node = root;
    int depthRight = 0;
    while (node != null) {
        depthRight++;
        node = node.right;
    }
    return depthLeft == depthRight ?
        (1 << depthLeft) - 1 :
        1 + countInCompleteTree(root.left) + countInCompleteTree(root.right);
}
```
Time: O(log n * log n) — log n recursive calls, and each call takes log n to compute the depth
Space: O(log n) — recursive call stack
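An alternative sketch with the same O(log² n) cost (not from the post): the last level's labels range over [2^(d-1), 2^d - 1], so one can binary-search the largest label that exists, reusing the bit-path walk from Question 10. All helper names here are mine:

```java
// TreeNode mirrors the node type assumed throughout the post.
class TreeNode {
    int val;
    TreeNode left, right;
    TreeNode(int val) { this.val = val; }
}

class CompleteTreeCounter {
    // Depth along the left spine; in a complete tree this is the number of levels.
    static int depth(TreeNode root) {
        int d = 0;
        for (TreeNode node = root; node != null; node = node.left) d++;
        return d;
    }

    // True if the node labeled n exists (labels are level-order, root = 1):
    // each bit of n after the leading 1 says go left (0) or right (1).
    static boolean exists(TreeNode root, int n) {
        TreeNode node = root;
        for (int i = 30 - Integer.numberOfLeadingZeros(n); i >= 0 && node != null; i--) {
            node = ((n >> i) & 1) == 0 ? node.left : node.right;
        }
        return node != null;
    }

    static int count(TreeNode root) {
        if (root == null) return 0;
        int d = depth(root);
        int lo = 1 << (d - 1), hi = (1 << d) - 1; // label range of the last level
        while (lo < hi) { // binary-search the largest existing label
            int mid = lo + (hi - lo + 1) / 2;
            if (exists(root, mid)) lo = mid;
            else hi = mid - 1;
        }
        return lo; // in a complete tree the last label equals the node count
    }
}
```

Each `exists` probe costs O(log n) and the search makes O(log n) probes, matching the recursive solution's complexity without the call stack.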

## Question 11: UTF-8 Encoding

### Statement

A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:

• For a 1-byte character, the first bit is 0, followed by its Unicode code point.
• For an n-byte character, the first n bits are all 1's, the (n+1)-th bit is 0, followed by n-1 bytes whose two most significant bits are 10.

This is how the UTF-8 encoding would work:

| Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 7 | U+0000 | U+007F | 0xxxxxxx | | | |
| 2 | 11 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
| 3 | 16 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 4 | 21 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |

Given a byte array which contains only UTF-8 encoded characters and an integer limit,
return the maximum number of bytes among the first limit bytes that contain only complete, valid UTF-8 encodings.

Example 1:
Input:

```
stream = | 0xxxxxxx | 110xxxxx | 10xxxxxx | 1110xxxx | 10xxxxxx | 10xxxxxx | 11110xxx | 10xxxxxx |
         | 10xxxxxx | 10xxxxxx |
limit  = 8
```

Output: 5

Example 2:
Input:

```
stream = | 0xxxxxxx | 110xxxxx | 10xxxxxx |
limit  = 5
```

Output: 2

### Solution

```java
int countUTF8Byte(byte[] stream, int limit) {
    if (stream.length <= limit) {
        return stream.length;
    } else {
        while (limit > 0 && (stream[limit] & 0xFF) >> 6 == 2) {
            limit--;
        }
        return limit;
    }
}
```
Time: O(1) — no more than 6 bytes are scanned
Space: O(1)
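To see the backtracking step with concrete byte values (an assumed example, not from the post): a 1-byte "a", a 2-byte "é" (0xC3 0xA9), and a 3-byte "€" (0xE2 0x82 0xAC). A cut at limit = 4 falls inside "€", so the loop steps back over the continuation bytes to the previous character boundary:

```java
class Utf8Demo {
    // Same truncation logic as the solution above: while the byte at the cut
    // is a continuation byte (10xxxxxx), step back one byte.
    static int countUTF8Byte(byte[] stream, int limit) {
        if (stream.length <= limit) {
            return stream.length;
        }
        while (limit > 0 && (stream[limit] & 0xFF) >> 6 == 2) {
            limit--;
        }
        return limit;
    }

    public static void main(String[] args) {
        // "a" (1 byte) + "é" (0xC3 0xA9) + "€" (0xE2 0x82 0xAC)
        byte[] stream = {0x61, (byte) 0xC3, (byte) 0xA9, (byte) 0xE2, (byte) 0x82, (byte) 0xAC};
        System.out.println(countUTF8Byte(stream, 4)); // cut inside "€" backs up to 3
    }
}
```

Note that `stream[limit] & 0xFF` is needed because Java's `byte` is signed; without the mask, 0x82 would sign-extend to a negative int and the shift comparison would fail.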

## Question 12: Design Rate limiter

### Statement (inspired by leetcode: 362)

Design a rate limiter API based on call-count limits per minute and per hour.
The granularity of the timestamp is one second, if needed.

```java
class RateLimiter {
    /** Initialize the data structure */
    public RateLimiter(long minuteCount, long hourCount)

    /** Return true if the function calls exceeded either minuteCount or hourCount, otherwise return false */
    public boolean isLimited()
}

RateLimiter rl = new RateLimiter(100, 6000);
rl.isLimited(); // returns false
```

### Solution

```java
public class RateLimiter {

    class HitCounter {
        private int   numBucket;
        private int[] time;
        private int[] hit;

        public HitCounter(int numBucket) {
            this.numBucket = numBucket;
            this.time = new int[numBucket];
            this.hit = new int[numBucket];
        }

        public void hit(int ts) {
            int bucket = ts % this.numBucket;
            if (time[bucket] == ts) {
                hit[bucket]++;
            } else {
                time[bucket] = ts;
                hit[bucket] = 1;
            }
        }

        public int count(int ts) {
            int cnt = 0;
            for (int i = 0; i < this.numBucket; i++) {
                if (ts - time[i] < this.numBucket) {
                    cnt += hit[i];
                }
            }
            return cnt;
        }
    }

    private long       minuteLimit;
    private long       hourLimit;
    private HitCounter minuteCounter;
    private HitCounter hourCounter;

    public RateLimiter(long minuteLimit, long hourLimit) {
        this.minuteLimit = minuteLimit;
        this.hourLimit = hourLimit;
        this.minuteCounter = new HitCounter(60);
        this.hourCounter = new HitCounter(3600);
    }

    public boolean isLimited() {
        int tsInSec = (int) (System.currentTimeMillis() / 1000);
        if (this.minuteCounter.count(tsInSec) < this.minuteLimit &&
            this.hourCounter.count(tsInSec) < this.hourLimit) {
            minuteCounter.hit(tsInSec);
            hourCounter.hit(tsInSec);
            return false;
        } else {
            return true;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        RateLimiter rl = new RateLimiter(10, 600);
        int count = 0;
        while (true) {
            Thread.sleep(1000);
            if (rl.isLimited()) {
                break;
            } else {
                count++;
                System.out.println("Limit not reached: " + count);
            }
        }
        System.out.println("Limit exceeded: " + count);
    }
}
```
`hit`: Time O(1); Space O(n), where n is the number of buckets
`count`: Time O(n), where n is the number of buckets; Space O(n)
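For contrast, the other classic approach to LeetCode 362 keeps one timestamp per hit in a queue, so memory grows with the number of hits in the window rather than staying fixed at the bucket count. A rough sketch (class and method names are mine):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Queue-based hit counter: store one timestamp per hit and evict entries
// that fall outside the sliding window on each count() call.
class QueueHitCounter {
    private final int windowSec;
    private final Deque<Integer> hits = new ArrayDeque<>();

    QueueHitCounter(int windowSec) {
        this.windowSec = windowSec;
    }

    void hit(int ts) {
        hits.addLast(ts); // timestamps arrive in non-decreasing order
    }

    int count(int ts) {
        while (!hits.isEmpty() && ts - hits.peekFirst() >= windowSec) {
            hits.pollFirst(); // drop hits outside the window
        }
        return hits.size();
    }
}
```

The bucket version above is the better fit for a rate limiter, since its memory stays O(window size) even under a flood of calls; the queue version is simpler but unbounded under heavy traffic.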

## Question 13: Design Task Scheduler (cron)

### Statement

Implement the following 3 methods. Start with the scheduling part, then the execution part.

```java
public class CronScheduler {
    void schedule(TimerTask task, long delay) {}
    void repeat(TimerTask t, long delay, long period) {}
    void daily(TimerTask t, long delay) {}
}
```

### Solution

Reference: java.util.Timer and java.util.TimerTask

```java
// TODO
```
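The implementation is left as a TODO above. As one possible sketch (not how `java.util.Timer` works internally — Timer uses its own binary-heap `TaskQueue` drained by a single worker thread), the three methods can be delegated to a `ScheduledExecutorService`; the `DAY_MS` constant and `shutdown` method are additions of mine:

```java
import java.util.TimerTask;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: delegate scheduling to a single-threaded executor. TimerTask
// implements Runnable, so it can be submitted directly.
class CronScheduler {
    private static final long DAY_MS = 24L * 60 * 60 * 1000;
    private final ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor();

    // Run the task once after `delay` milliseconds.
    void schedule(TimerTask task, long delay) {
        executor.schedule(task, delay, TimeUnit.MILLISECONDS);
    }

    // Run the task after `delay` ms, then every `period` ms.
    void repeat(TimerTask t, long delay, long period) {
        executor.scheduleAtFixedRate(t, delay, period, TimeUnit.MILLISECONDS);
    }

    // A daily task is just a repeat with a 24-hour period.
    void daily(TimerTask t, long delay) {
        repeat(t, delay, DAY_MS);
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

A heap-based from-scratch version would follow the same shape: a priority queue ordered by next-execution time, plus a worker thread that sleeps until the head entry is due.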
]]>
<p>A list of coding interview questions that I was asked in 2018.</p>