Score Matching 系列 (二) Denoising Score Matching (DSM) 改善效率並可 Scalable

發表於 2022-03-06 | 分類於 ML

這是一篇論文筆記: “A Connection Between Score Matching and Denoising Autoencoders”
建議看本文前請先參前一篇: Score Matching 系列 (一) Non-normalized 模型估計

前言

基於 Score Matching, 提出 Denoising Score Matching (DSM) 的目標函式, 好處是在 energy-based model 下:

不用將 score function 的 gradients 也納入 loss 去計算 (避免二次微分做 backprop 提高效率)
當 input $x$ 的維度很大也沒問題 (可以 scalable)

但缺點是:

最多只能學到加 noise 後的分布
加 noise 的 level 不好調整

這兩個缺點在下一篇 Sliced Score Matching (SSM) 可以得到改善
這篇論文最後也點出了 denoising autoencoder 跟 score matching 的關係 (實際上就是 DSM loss)

以下正文開始

閱讀全文 »

Score Matching 系列 (一) Non-normalized 模型估計

發表於 2022-01-08 | 分類於 ML

這是一篇論文筆記: “Estimation of Non-Normalized Statistical Models by Score Matching”, 其實推薦直接讀論文, 數學式很清楚, 表達也明確, 只是想順著自己的說法做一下筆記

動機介紹

在 Machine Learning 中, 我們常常希望用參數 $\theta$ 估出來的 pdf $p(.;\theta)$ 能跟真實 data (training data) 的 pdf $p_x(.)$ 愈像愈好.
由於是 pdf $p(.;\theta)$, 必須滿足機率形式, i.e. 積分所有 outcomes 等於 1, 因此引入一個 normalization term $Z(\theta)$
$$p(\xi;\theta)=\frac{1}{Z(\theta)}q(\xi;\theta)$$
其中 $\xi\in\mathbb{R}^n$ 為一個 data point
假設我們有 $T$ 個 observations $\{x_1,...,x_T\}$, 套用 empirical expecation 並對 likelihood estimation 找出最佳 $\theta$ (MLE):
$$\theta_{mle}=\arg\max_\theta \sum_{t=1}^T \log p(x_t;\theta)$$
計算 gradient, 會發現由於存在 $Z(\theta)$ 變得很難計算, 導致 gradient-based optimization 也很困難.

山不轉路轉, 如果我們能換個想法:

閱讀全文 »

Score Function and Fisher Information Matrix

發表於 2022-01-07 | 分類於 ML

Bayesian statistics 視 dataset $\mathcal{D}=\{x_1,...,x_n\}$ 為固定, 而 model parameter $\theta$ 為 random variables, 透過假設 prior $p(\theta)$, 利用 Bayes rule 可得 posterior $p(\theta|\mathcal{D})$. 而估計的參數就是 posterior 的 mode, i.e. $\theta_{map}$ (Maximum A Posterior, MAP)

關於 Fisher information matrix 在 Bayesian 觀點的用途, 其中一個為幫助定義一個特殊的 prior distribution (Jeffreys prior), 使得如果 parameter $\theta$ 重新定義成 $\phi$, 例如 $\phi=f(\theta)$, 則 MAP 解不會改變. 關於這部分還請參考 Machine Learning: a Probabilistic Perspective by Kevin Patrick Murphy. Chapters 5 的 Figure 5.2 圖很清楚

反之 Frequentist statistics 視 dataset $\mathcal{D}=\{X_1,...,X_n\}$ 為 random variables, 透過真實 $\theta^*$ 採樣出 $K$ 組 datasets, 每一組 dataset $\mathcal{D}^k$ 都可以求得一個 $\theta_{mle}^k$ (MLE 表示 Maximum Likelihood Estimator), 則 $\theta_{mle}^k$ 跟 $\theta^*$ 的關係可藉由 Fisher information matrix 看出來

本文探討 score function 和 Fisher information matrix 的定義, 重點會放在怎麼直觀理解. 然後會說明 Fisher information matrix 在 Frequentist statistics 角度代表什麼意義.

先定義一波

閱讀全文 »