Notes on LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning

Posted on 2023-10-09 | Category: ML

These are my notes on the paper “LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning” [arxiv].

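Pruning methods based on first-order Taylor importance (introduced below) need the gradient of every weight in order to score its importance, and then prune according to those scores. As models keep growing, computing gradients for all weights becomes too expensive.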
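As a rough illustration of what "scoring each weight by its gradient" means, here is a minimal sketch. It assumes the common form of the first-order Taylor score, $|g_j w_j|$ (the estimated loss change when $w_j$ is set to zero); the toy linear model, the numbers, and the helper names `taylor_importance` and `keep_mask` are all made up for this example and are not taken from the paper.

```python
def taylor_importance(weights, grads):
    """First-order Taylor score |g_j * w_j| for each weight (assumed form)."""
    return [abs(g * w) for w, g in zip(weights, grads)]


def keep_mask(importance, num_keep):
    """Return 1 for the num_keep highest-scoring weights and 0 for the rest."""
    order = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
    kept = set(order[:num_keep])
    return [1 if j in kept else 0 for j in range(len(importance))]


# Toy linear model with squared loss L = 0.5 * (w . x - y)^2 (illustrative numbers).
w = [2.0, -0.01, 0.5]
x = [1.0, 3.0, -2.0]
y = 1.0
residual = sum(wj * xj for wj, xj in zip(w, x)) - y
grads = [residual * xj for xj in x]  # dL/dw_j = (w . x - y) * x_j

scores = taylor_importance(w, grads)
mask = keep_mask(scores, num_keep=2)  # the lowest-scoring weight gets pruned
```

Note that every entry of `grads` is needed to build `scores`, which is exactly the cost that becomes a problem for very large models.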

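On the other hand, fine-tuning an LLM with LoRA (one kind of PEFT, Parameter-Efficient Fine-Tuning) is very efficient: the original weights are frozen and only the small number of parameters in the attached LoRA modules are trained, so only a few gradients need to be computed. But consider attaching LoRA to weights that have already been pruned: after training, the LoRA update cannot be merged back into the original weights, because the merged update is generally nonzero at the pruned positions and would destroy the sparsity pattern. Conversely, if we first fine-tune with LoRA and prune afterwards, we are back to the problem that pruning a large model is too expensive. Splitting the work into two separate steps is also indirect; it would likely be better if LoRA fine-tuning already took into account which weights are going to be pruned.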
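The merge problem is easy to see on a tiny example. The sketch below uses a made-up $2\times2$ weight `W`, a 0/1 pruning mask `M` and rank-1 LoRA factors `B`, `A` (all values are illustrative). Adding $BA$ to the pruned weight fills the pruned positions again, whereas applying the mask after the merge keeps them at zero. The masked variant is shown only as one possible way out, not necessarily how the paper does it.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def hadamard(a, b):
    """Element-wise product."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def add(a, b):
    """Element-wise sum."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


W = [[0.5, -1.2], [0.8, 0.3]]   # original weights (illustrative)
M = [[1, 0], [0, 1]]            # pruning mask: 0 marks a pruned position
B = [[0.1], [0.2]]              # LoRA factor, shape 2x1
A = [[0.3, -0.4]]               # LoRA factor, shape 1x2

delta = matmul(B, A)                        # dense low-rank update B @ A
naive_merge = add(hadamard(W, M), delta)    # pruned positions become nonzero again
masked_merge = hadamard(add(W, delta), M)   # mask applied after merging keeps them at 0
```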

How to prune the original parameters while still benefiting from LoRA's efficiency is what this paper addresses.

$$\begin{array}{|c|c|c|} \hline & \text{Can it prune the original weights?} & \text{Is it efficient?} \\ \hline \text{1st order pruning} & \text{Yes} & \text{No} \\ \hline \text{LoRA} & \text{No} & \text{Yes} \\ \hline \text{LoRAPrune} & \text{Yes} & \text{Yes} \\ \hline \end{array}$$

Below, I first introduce pruning with first-order Taylor importance, then LoRA, and finally show how LoRAPrune combines the strengths of both.


Notes on Movement Pruning: Adaptive Sparsity by Fine-Tuning

Posted on 2023-02-24 | Category: ML

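I start by citing the argument of the paper Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [pdf]:

For the same small model size, training from scratch is worse than first training a large model to good performance and then compressing it to the required size.
So pruning not only shrinks the model; it can also be a good strategy for performance.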

Introduction

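Plain absolute-magnitude pruning does not work well for SSL models. The pretrained weights were optimized for the SSL loss, so there is no guarantee that their magnitudes reflect the same importance under the later fine-tuning (downstream task) loss.
Take the traditional magnitude pruning approach, e.g. the NIPS 2015 paper [Learning both Weights and Connections for Efficient Neural Networks] (cited 5xxx). The procedure is simple:
First train the model to convergence, then prune, then continue training (with the pruned weights fixed at $0$), then prune some more, and iterate until the target amount of pruning is reached.
The authors argue, however, that judging by magnitude alone works poorly: if during fine-tuning a weight has a large magnitude but its gradient updates tend to shrink that magnitude, its importance should be lowered. This is the key idea of the paper.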
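The iterative magnitude-pruning recipe described above can be sketched as follows. This is only an illustration: `toy_train_step` is a hypothetical stand-in for real training (it leaves surviving weights unchanged), and the sparsity schedule and weights are made-up numbers.

```python
def magnitude_mask(weights, sparsity):
    """0/1 mask that zeroes the `sparsity` fraction of smallest-|w| weights."""
    num_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda j: abs(weights[j]))
    pruned = set(order[:num_prune])
    return [0 if j in pruned else 1 for j in range(len(weights))]


def iterative_magnitude_pruning(weights, train_step, sparsity_schedule):
    """Alternate training and pruning until the final sparsity is reached."""
    mask = [1] * len(weights)
    for sparsity in sparsity_schedule:
        weights = train_step(weights, mask)        # train with pruned weights held at 0
        mask = magnitude_mask(weights, sparsity)   # prune a larger fraction each round
        weights = [w * m for w, m in zip(weights, mask)]
    return weights, mask


def toy_train_step(weights, mask):
    """Hypothetical placeholder for training: only enforces that pruned weights stay 0."""
    return [w * m for w, m in zip(weights, mask)]


weights0 = [0.9, -0.05, 0.4, -0.7, 0.1, 0.02]
final_weights, final_mask = iterative_magnitude_pruning(
    weights0, toy_train_step, sparsity_schedule=[0.2, 0.5, 0.7])
```

Because already-pruned weights have magnitude zero, they are always among the smallest and stay pruned in later rounds. The score used here depends only on $|w|$, which is exactly what the authors criticize.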

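We therefore define importance by whether the magnitude of a weight is growing or shrinking: a growing magnitude means high importance, and a shrinking magnitude means low importance.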
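To make "growing or shrinking" concrete, here is a one-line check. It is a sketch that assumes plain gradient descent with a small learning rate $\eta$, a gradient $g=\partial\mathcal{L}/\partial w$, and $w\neq 0$; these symbols are introduced here for illustration only.

$$|w-\eta g|-|w| \approx -\eta\, g\,\operatorname{sign}(w)$$

So one update step increases the magnitude exactly when $g$ and $w$ have opposite signs ($g\,w<0$) and decreases it when they share a sign ($g\,w>0$), regardless of how large $|w|$ currently is.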

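The authors therefore introduce a score for every parameter, called $S$, which is meant to represent the importance of the weight. During fine-tuning, the score $S$ is updated along with the weight $W$.
If the score $S$ happens to reflect the tendency of the weight's gradient, i.e. a larger $S$ means the corresponding weight tends to grow in magnitude during fine-tuning and a smaller $S$ means it tends to shrink, then this $S$ is exactly what we are looking for.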

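To do this, we still need to answer two questions:

How do we introduce the score $S$?
Does the score $S$ really represent importance? In other words, does it reflect the tendency of the weight's magnitude during fine-tuning?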


L0 Regularization: A Detailed Walkthrough

Posted on 2023-01-15 | Category: ML

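This is a detailed set of notes on the paper Learning Sparse Neural Networks through L0 Regularization, together with my own implementation and experiments [My Github].
The main goal is to explain every part in detail so that I can recall it later, so it may not be an easy read.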

Introduction

Let $\theta$ be the parameters of the NN model. We want as few nonzero entries as possible, i.e. $\|\theta\|_0$ as small as possible, so we add the following regularization term:
$$\mathcal{L}_C^0(\theta)=\|\theta\|_0=\sum_{j=1}^{|\theta|}\mathbb{I}[\theta_j\neq0]$$ The loss is then (with $\lambda$ weighting the regularizer):

$$\mathcal{L}_E(\theta)=\frac{1}{N}\left( \sum_{i=1}^N\mathcal{L}(NN(x_i;\theta),y_i) \right) \\ \mathcal{L}(\theta)=\mathcal{L}_E(\theta)+\lambda\mathcal{L}_C^0(\theta)$$

But in practice, how do we control which entries of $\theta$ are nonzero?
One way is to use a random mask variable $Z=\{Z_1,...,Z_{|\theta|}\}$ with $Z_j\sim\text{Bernoulli}(q_j)$ and parameters $q=\{q_1,...,q_{|\theta|}\}$, so the loss is rewritten as follows (note that $\mathcal{L}_C^0$ now has a closed form and, assuming every $\theta_j\neq0$, no longer depends on $\theta$):

$$\begin{align} \mathcal{L}_C^0(q)=\mathbb{E}_{Z\sim\text{Bernoulli}(q)}\left[ \sum_{j=1}^{|\theta|}\mathbb{I}[\theta_j Z_j\neq0] \right] = \mathbb{E}_{Z\sim\text{Bernoulli}(q)}\left[ \sum_{j=1}^{|\theta|} Z_j \right] = \sum_{j=1}^{|\theta|} q_j\\ \mathcal{L}_E(\theta,q)=\mathbb{E}_{Z\sim\text{Bernoulli}(q)}\left[ \frac{1}{N}\left( \sum_{i=1}^N\mathcal{L}(NN(x_i;\theta\odot Z),y_i) \right) \right] \\ \mathcal{L}(\theta,q)=\mathcal{L}_E(\theta,q)+\lambda\mathcal{L}_C^0(q) \end{align}$$
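The closed form of the complexity term, the sum of the $q_j$, can be checked numerically. The sketch below compares it with a Monte Carlo estimate of the expected number of active mask entries; the probabilities in `q`, the sample count and the function names are illustrative choices, and every $\theta_j$ is assumed to be nonzero.

```python
import random


def expected_l0(q):
    """Closed form: E[sum_j Z_j] = sum_j q_j for Z_j ~ Bernoulli(q_j)."""
    return sum(q)


def sampled_l0(q, num_samples, seed=0):
    """Monte Carlo estimate of the expected number of nonzero mask entries."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_samples):
        total += sum(1 for qj in q if rng.random() < qj)  # one sampled mask Z
    return total / num_samples


q = [0.9, 0.5, 0.1, 0.0]          # illustrative keep-probabilities
closed_form = expected_l0(q)       # 1.5
estimate = sampled_l0(q, num_samples=20000)
```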

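The biggest obstacle now is the error loss $\mathcal{L}_E$: Bernoulli sampling cannot be differentiated with respect to $q$, because when computing $\nabla_q\mathcal{L}_E(\theta,q)$ the distribution we sample from inside the expectation itself depends on $q$.

See the introduction at the beginning of Gumbel-Max Trick for an explanation.

The good news is that the reparameterization trick (Gumbel softmax) lets us sample from a r.v. that does not depend on $q$ (so the loss becomes differentiable), and therefore backpropagation can be used in NN training.
The following sections go through this step by step (following the order of [L0 norm稀疏性: hard concrete门变量], with some extra material and content from the paper):
Gumbel max trick $\Rightarrow$ Gumbel softmax trick (the so-called concrete distribution)
$\Rightarrow$ Binary Concrete distribution $\Rightarrow$ Hard (Binary) Concrete distribution $\Rightarrow$ L0 regularization
Finally, I add a model compression experiment that applies $L0$ regularization to the GoogLeNet architecture on CIFAR-10.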
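As a preview of where this chain ends, here is a minimal sketch of sampling a hard concrete gate and of the probability that the gate is nonzero. It follows the formulas as I understand them from the paper; the stretch limits `GAMMA`, `ZETA` and temperature `BETA` are the commonly used defaults and should be treated as assumptions, as should the function names.

```python
import math
import random

GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0  # assumed stretch interval and temperature


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hard_concrete(log_alpha, rng):
    """Draw one gate z in [0, 1]; the randomness comes only from u ~ Uniform(0, 1)."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # avoid log(0)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)  # binary concrete
    s_bar = s * (ZETA - GAMMA) + GAMMA            # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, s_bar))              # hard clip to [0, 1]


def prob_nonzero(log_alpha):
    """P(z != 0), the differentiable replacement for the Bernoulli parameter q_j."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


rng = random.Random(0)
gates = [sample_hard_concrete(0.0, rng) for _ in range(20000)]
empirical = sum(1 for z in gates if z > 0.0) / len(gates)
analytic = prob_nonzero(0.0)
```

Because `u` does not depend on `log_alpha`, gradients can flow through `sample_hard_concrete` to `log_alpha`, which is the point of the reparameterization.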

It is a long post…


Learning Zero Point and Scale in Quantization Parameters

Posted on 2022-12-04 | Category: ML

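In the previous post, Understanding Fake Quantization in Quantization Aware Training, we discussed fake quantization and QAT.
There, the observer was responsible for computing the zero point and scale $(z,s)$. Usually these are determined just from the observed min/max range of the values, so they do not take part in the backward pass.

Intuitively we want a zero point and scale that make the quantization error as small as possible, but optimizing them directly against the task loss should be even better.
This requires $(z,s)$ to take part in the backward pass. Parameters whose gradients are computed and updated in this way are called learnable quantization parameters.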

This post is mainly based on these two papers:
1. LSQ: Learned Step Size Quantization
2. LSQ+: Improving low-bit quantization through learnable offsets and better initialization

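LSQ only discusses updating the scale, while LSQ+ extends this so that the zero point can also be learned. This post derives only the key gradients and does not cover the experimental results in the papers.

A quick definition of the notation: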
- $v$: full precision input value
- $s$: quantizer step size (scale)
- $z$: zero point (offset)
- $Q_P,Q_N$: the number of positive and negative quantization levels
e.g.: for $b$ bits, unsigned $Q_N=0,Q_P=2^b-1$, for signed $Q_N=2^{b-1},Q_P=2^{b-1}-1$
- $\lfloor x \rceil$: round $x$ to nearest integer
Quantize $v$ to $\bar{v}$ (1), then dequantize $\bar{v}$ back to $\hat{v}$ (2); $v-\hat{v}$ is the quantization error.
$$\begin{align} \bar{v}=\text{clip}(\lfloor v/s \rceil+z,-Q_N,Q_P) \\ \hat{v}=(\bar{v}-z)\times s \end{align}$$
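Equations (1) and (2) translate almost directly into code. The sketch below is a plain-Python illustration; the helper names are made up, and rounding ties upward in `round_nearest` is my own assumption since the notation does not specify tie-breaking.

```python
import math


def quant_levels(bits, signed):
    """Return (Q_N, Q_P) for a given bit width, as defined in the notation list."""
    if signed:
        return 2 ** (bits - 1), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def round_nearest(x):
    """Round to the nearest integer (ties go up; an assumption of this sketch)."""
    return math.floor(x + 0.5)


def quantize(v, s, z, q_n, q_p):
    """Eq. (1): v_bar = clip(round(v / s) + z, -Q_N, Q_P)."""
    return min(max(round_nearest(v / s) + z, -q_n), q_p)


def dequantize(v_bar, s, z):
    """Eq. (2): v_hat = (v_bar - z) * s."""
    return (v_bar - z) * s


q_n, q_p = quant_levels(bits=8, signed=True)
v = 0.26
v_bar = quantize(v, s=0.1, z=0, q_n=q_n, q_p=q_p)
v_hat = dequantize(v_bar, s=0.1, z=0)
error = v - v_hat                                     # quantization error
saturated = quantize(100.0, s=0.1, z=0, q_n=q_n, q_p=q_p)  # out-of-range input is clipped
```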


Understanding Fake Quantization in Quantization Aware Training

Posted on 2022-11-19 | Category: ML

After reading this post you will know what fake quantization is and how it relates to QAT (Quantization Aware Training).
You will also understand what PyTorch's torch.ao.quantization.fake_quantize.FakeQuantize class does.

What is fake quantization?

Given a zero point ($z$) and a scale ($s$), the float value $r$ and the integer value $q$ are related as follows:

$$\begin{align} r=s(q-z) \\ q=\operatorname{round}(r/s)+z \end{align}$$ where the scale $s$ is a float and the zero point $z$ is an integer, e.g. int8.
The main idea of fake quantization is to use 256 float points (e.g. for int8) to represent all float values, so each float value is replaced by the nearest of those 256 points.
The original floating-point training flow then stays unchanged, while the precision loss caused by quantization is simulated. This way of training is called Quantization Aware Training (QAT) (see Quantization 的那些事).
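The "snap every float to one of 256 float points" idea can be sketched in a few lines. This is a plain-Python illustration, not PyTorch's implementation: the clamp to `[qmin, qmax]` is added here so that the grid really has only 256 points for int8, and the scale, zero point and input range are made-up values. Python's built-in `round` breaks ties toward the even integer.

```python
def fake_quantize(r, scale, zero_point, qmin=-128, qmax=127):
    """Quantize r to an integer grid, then map it straight back to float."""
    q = round(r / scale) + zero_point      # float -> integer code
    q = min(max(q, qmin), qmax)            # stay inside the int8 range (assumed clamp)
    return scale * (q - zero_point)        # integer code -> float on the grid


scale, zero_point = 0.01, 0
inputs = [i * 0.001 for i in range(-3000, 3001)]     # floats in [-3, 3]
outputs = [fake_quantize(r, scale, zero_point) for r in inputs]
grid_points = set(outputs)                            # distinct float values that survive
example = fake_quantize(0.123, scale, zero_point)     # snaps to the grid point 0.12
```

The outputs are still floats, so the rest of the training loop does not change, but only 256 distinct values can ever appear.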


Notes on Weight Normalization

Posted on 2022-09-26 | Category: ML

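When optimizing with SGD, an ill-conditioned Hessian matrix, i.e. a large ratio $\sigma_1/\sigma_n$ between its largest and smallest eigenvalues, makes convergence inefficient
(ref zig-zag).

You can picture it this way: the less circular the contours of the loss surface (the more elongated they are), the more ill-conditioned the problem.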

We hope to reduce the ill-conditioning through re-parameterization.
A NN layer can generally be written as:
$$y=\phi(w^Tx+b)$$ Rewrite the weight vector $w$ as:

$$w={g\over\|v\|}v\quad\quad(\star)$$ WN splits $w$ into two variables: the unit vector $v/\|v\|$ and the magnitude $g$.
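A small numeric check of $(\star)$ may help. The sketch below (with a made-up function name and values) shows that the norm of $w$ is always $g$ and that rescaling $v$ leaves $w$ unchanged, so direction and magnitude are decoupled.

```python
import math


def weight_norm(v, g):
    """Re-parameterized weight w = g * v / ||v|| from equation (star)."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [g * vi / norm for vi in v]


w = weight_norm([3.0, 4.0], g=2.0)            # direction (0.6, 0.8), length 2
w_rescaled = weight_norm([30.0, 40.0], g=2.0)  # same direction, v scaled by 10
w_length = math.sqrt(sum(wi * wi for wi in w))
```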


Why Stochastic Weight Averaging? averaging results V.S. averaging weights

Posted on 2022-07-20 | Category: ML

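From this earlier post we know that averaging the outputs of several different models usually gives better results.
But what if we average the models' parameters first? Does it still help?
The Stochastic Weight Averaging (SWA) paper, “Averaging Weights Leads to Wider Optima and Better Generalization”, tries to show that this works.
In practice, PyTorch and PyTorch Lightning already ship an SWA API. Even in the speech recognition industry, WeNet, which looks set to replace Kaldi, has a similar mechanism.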
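To make the two kinds of averaging concrete, here is a minimal sketch. It keeps an equal-weight running average of checkpoint weights and compares "the output of the averaged weights" with "the average of the outputs" on a tiny nonlinear model. The checkpoints, the `predict` model and the helper names are made up for illustration; a real SWA run would collect the checkpoints along the SGD trajectory.

```python
import math


def update_swa(swa_weights, new_weights, num_averaged):
    """Equal-weight running average after seeing one more checkpoint."""
    return [ws + (w - ws) / (num_averaged + 1)
            for ws, w in zip(swa_weights, new_weights)]


def predict(weights, x):
    """Tiny nonlinear model: tanh(w0 * x + w1)."""
    return math.tanh(weights[0] * x + weights[1])


checkpoints = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]   # illustrative checkpoints
swa_weights = list(checkpoints[0])
for n, weights in enumerate(checkpoints[1:], start=1):
    swa_weights = update_swa(swa_weights, weights, n)

x = 0.5
output_of_averaged_weights = predict(swa_weights, x)
average_of_outputs = sum(predict(w, x) for w in checkpoints) / len(checkpoints)
```

For a nonlinear model the two numbers differ, which is why it is not obvious in advance that averaging weights should work at all.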

This post consists of screenshots of my own slides; the PDF file is available here.

Slides

Here are the slides:


Notes on the Generalization Ability of SGD

Posted on 2022-05-28 | Category: Optimization

Generalization of Sharp vs. Flat Local Minima

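First, a brief introduction to this paper:
On large-batch training for deep learning: Generalization gap and sharp minima
Consider the two minima in the figure below; their training losses are the same.
From the figure it is easy to see that if we land on a minimum that is too sharp, the mismatch between test and train means that a slight shift in the data at test time has a large effect on the model output.
The paper measures the sharpness of a local minimum empirically. In short, it randomly perturbs the parameters to nearby points and looks at how much the loss changes; the larger the change, the sharper the local minimum is likely to be.
It then finds two local minima, one estimated to be sharper and the other flatter, and plots the loss along the line connecting the two points in parameter space, which looks like this:
This is also a widely held view today: flat local minima generalize better.
So one can imagine that the larger the step size (learning rate), the more likely we are to jump out of a sharp minimum.
And the smaller the batch size, the larger the gradient noise caused by mini-batching, which likewise makes it more likely to "wander" out of a sharp minimum.
But this paper only provides experimental verification. How do step size and batch size relate to generalization, or in other words to the probability of finding a flatter optimum?
Two recent (2021) papers from DeepMind give a very elegant theoretical analysis.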
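The perturbation-based sharpness estimate described above can be sketched on two one-dimensional toy losses. This is a simplified stand-in and not the paper's exact metric: it averages the loss increase over random perturbations within a fixed radius, and the two quadratic losses, the radius and the function names are invented for illustration.

```python
import random


def estimate_sharpness(loss, w_star, radius, num_samples=1000, seed=0):
    """Average loss increase when w_star is randomly perturbed within +/- radius."""
    rng = random.Random(seed)
    base = loss(w_star)
    increase = 0.0
    for _ in range(num_samples):
        increase += loss(w_star + rng.uniform(-radius, radius)) - base
    return increase / num_samples


def sharp_loss(w):
    """Narrow valley with its minimum at w = 1."""
    return 50.0 * (w - 1.0) ** 2


def flat_loss(w):
    """Wide valley with its minimum at w = -1."""
    return 0.5 * (w + 1.0) ** 2


# Both minima have the same training loss (0), but very different sharpness.
sharp_score = estimate_sharpness(sharp_loss, w_star=1.0, radius=0.1)
flat_score = estimate_sharpness(flat_loss, w_star=-1.0, radius=0.1)
```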


Numerical Methods for Ordinary Differential Equations

Posted on 2022-05-15 | Category: ML

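If you have no idea what a differential equation is, I suggest first watching the following two-minute video:
- Solving Differential Equations vs. Solving Algebraic Equations
These notes mainly cover the material on numerical solutions of ODEs from two Coursera courses by Prof. Jeffrey Chasnov: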
1. Differential Equations for Engineers
2. Numerical Methods for Engineers
This post covers:
1️⃣ Introduction to ODE: linear? ordinary? n-th order?
2️⃣ Euler Method: simple, but with a large error
3️⃣ Modified Euler Method: local (per-step) error $O(\Delta t^3)$, one order smaller than the Euler method
4️⃣ Runge Kutta Methods: the Modified Euler method is a special case of second-order RK
5️⃣ Higher-order Runge-Kutta Methods: the local (per-step) error of $n$-th order RK is $O(\Delta t^{n+1})$
6️⃣ Higher-order ODEs and Systems: everything above only covers approximations for first-order ODEs, so how do we solve higher-order ODEs?
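The difference in error order between the Euler and Modified Euler methods is easy to observe numerically. The sketch below integrates the test problem $dy/dt=-y$, $y(0)=1$ up to $t=1$ (my own choice of example, with made-up function names) and compares the error at the end point when the number of steps is doubled. A per-step error of $O(\Delta t^{p+1})$ accumulates to an error of $O(\Delta t^{p})$ at a fixed end time, so halving $\Delta t$ should roughly halve the Euler error and quarter the Modified Euler error.

```python
import math


def euler_step(f, t, y, dt):
    """One Euler step: follow the slope at the start of the interval."""
    return y + dt * f(t, y)


def modified_euler_step(f, t, y, dt):
    """One Modified Euler step: average the slopes at the start and the predicted end."""
    k1 = dt * f(t, y)
    k2 = dt * f(t + dt, y + k1)
    return y + 0.5 * (k1 + k2)


def integrate(step, f, y0, t_end, num_steps):
    """Apply `step` num_steps times from t = 0 to t = t_end."""
    dt = t_end / num_steps
    t, y = 0.0, y0
    for _ in range(num_steps):
        y = step(f, t, y, dt)
        t += dt
    return y


def decay(t, y):
    """Right-hand side of dy/dt = -y, whose exact solution is exp(-t)."""
    return -y


exact = math.exp(-1.0)
euler_err = {n: abs(integrate(euler_step, decay, 1.0, 1.0, n) - exact) for n in (10, 20)}
modified_err = {n: abs(integrate(modified_euler_step, decay, 1.0, 1.0, n) - exact)
                for n in (10, 20)}
euler_ratio = euler_err[10] / euler_err[20]            # close to 2
modified_ratio = modified_err[10] / modified_err[20]   # close to 4
```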

👏 The professor has generously provided the lecture notes for both courses:
Lecture notes: Differential Equations for Engineers
Lecture notes: Numerical Methods for Engineers


Notes on Hamiltonian Monte Carlo (HMC): Understanding It Even If You Have Forgotten Your Physics

Posted on 2022-05-07 | Category: ML

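Updated 2024/07/28 (see the last section of this post): added the connection with Langevin dynamics, which is the sampling technique mentioned in [Score Matching 系列 (五) SM 加上 Langevin Dynamics 變成生成模型] that the model uses once the score function has been trained. In addition, the generative model built from Score Matching + Langevin Dynamics (SMLD) is in fact the same kind of model as DDPM (Denoising Diffusion Probabilistic Models)! Yang Song's paper (Score-Based Generative Modeling through Stochastic Differential Equations), which received an ICLR 2021 Outstanding Paper Award, shows that SMLD and DDPM are really two different views that can both be expressed within the same SDE (Stochastic Differential Equation) framework.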

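Let me say up front that I have forgotten all my physics, so I can only take notes on Hamiltonian dynamics in the way I understand them.

💡 If even I can understand it, I believe everyone can understand HMC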

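Still, I recommend first reading MCMC by Gibbs and Metropolis-Hasting Sampling, because Hamiltonian Monte Carlo (HMC), the topic of this post, is a kind of Metropolis-Hastings (MH) method. The only difference is that the proposal distribution changes from a random walk to Hamiltonian dynamics, which makes it very efficient (high acceptance rate) and also effective for sampling high-dimensional data.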
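To show what "MH with a Hamiltonian proposal" looks like in code, here is a minimal one-dimensional sketch. It assumes a standard normal target, unit mass, a leapfrog integrator and hand-picked step settings; all names and values are illustrative and this is not a general-purpose implementation.

```python
import math
import random


def leapfrog(x, p, grad_u, step_size, num_steps):
    """Simulate Hamiltonian dynamics with the leapfrog scheme (unit mass assumed)."""
    p = p - 0.5 * step_size * grad_u(x)          # half step for momentum
    for i in range(num_steps):
        x = x + step_size * p                    # full step for position
        if i != num_steps - 1:
            p = p - step_size * grad_u(x)        # full step for momentum
    p = p - 0.5 * step_size * grad_u(x)          # final half step for momentum
    return x, p


def hmc_sample(u, grad_u, x0, num_samples, step_size=0.2, num_steps=10, seed=0):
    """Metropolis-Hastings where each proposal is the end of a leapfrog trajectory."""
    rng = random.Random(seed)
    x, samples, accepted = x0, [], 0
    for _ in range(num_samples):
        p = rng.gauss(0.0, 1.0)                  # resample the momentum
        h_old = u(x) + 0.5 * p * p               # potential + kinetic energy
        x_new, p_new = leapfrog(x, p, grad_u, step_size, num_steps)
        h_new = u(x_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, h_old - h_new)):  # MH accept test
            x, accepted = x_new, accepted + 1
        samples.append(x)
    return samples, accepted / num_samples


def potential(x):
    """U(x) = -log density of N(0, 1), up to a constant."""
    return 0.5 * x * x


def grad_potential(x):
    return x


samples, accept_rate = hmc_sample(potential, grad_potential, x0=0.0, num_samples=5000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Because the leapfrog trajectory nearly conserves the total energy, `h_old - h_new` stays close to zero and almost every proposal is accepted, even though each proposal moves far from the current point.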

First, bold symbols such as $\mathbf{x}, \mathbf{v}, \mathbf{p}$ are column vectors, while non-bold symbols denote scalars, e.g. $m,t$.
