梯度下降法

梯度下降法（英語：Gradient descent）是一種求解無約束最佳化問題的一階疊代最佳化算法，它被用來求得可微函數的局部極小值，通常也稱為最陡下降法，但是不該與近似積分的最陡下降法（英語：Method of steepest descent）混淆。要使用梯度下降法找到一個函數的局部極小值，必須向函數上當前點對應梯度（或者是近似梯度）的反方向的規定步長距離點進行疊代搜索，因為這是最陡下降的方向。如果相反地向梯度正方向疊代進行搜索，則會接近函數的局部極大值點，這個過程則被稱為梯度上升法。梯度下降法在機器學習中對於成本的最小化或損失函數的最小化都特別有用。^[1]梯度下降法不應與局部搜索算法（英語：local search algorithms）相混淆，儘管兩者都是疊代最佳化算法。

梯度下降法通常被認為是奧古斯丁-路易·柯西（法語：Augustin-Louis Cauchy）在1847年首次提出的。^[2]雅克·所羅門·阿達馬（法語：Jacques Solomon Hadamard）在1907年獨立提出了一個類似的方法。^[3]^[4]哈斯凱爾·柯里（英語：Haskell Brooks Curry）在1944年首先研究了該方法對非線性最佳化問題的收斂性^[5]；在隨後的幾十年裡，該方法得到了越來越多的研究和使用。^[6] ^[7]

隨機梯度下降法（英語：stochastic gradient descent）作為梯度下降法的一個簡單延展，是目前用於訓練大多數深度學習結構（英語：deep networks）的最基本的算法。

描述

梯度下降方法基於以下的觀察：如果實值函數 $F(\mathbf {x} )$ 在點 $\mathbf {a}$ 處可微且有定義，那麼函數 $F(\mathbf {x} )$ 在 $\mathbf {a}$ 點沿著梯度相反的方向 $-\nabla F(\mathbf {a} )$ 下降最多。

因而，如果

\mathbf {b} =\mathbf {a} -\gamma \nabla F(\mathbf {a} )

對於一個足夠小數值 $\gamma >0$ 時成立，那麼 $F(\mathbf {a} )\geq F(\mathbf {b} )$ 。

考慮到這一點，我們可以從函數 $F$ 的局部極小值的初始估計 $\mathbf {x} _{0}$ 出發，並考慮如下序列 $\mathbf {x} _{0},\mathbf {x} _{1},\mathbf {x} _{2},\dots$ 使得

\mathbf {x} _{n+1}=\mathbf {x} _{n}-\gamma _{n}\nabla F(\mathbf {x} _{n}),\ n\geq 0

。

因此可得到

F(\mathbf {x} _{0})\geq F(\mathbf {x} _{1})\geq F(\mathbf {x} _{2})\geq \cdots ,

如果順利的話序列 $(\mathbf {x} _{n})$ 收斂到期望值的局部極小值。注意每次疊代步長 $\gamma$ 可以改變。

右側的圖片示例了這一過程，這裡假設 $F$ 定義在平面上，並且函數圖像是一個碗形。藍色的曲線是等高線（水平集），即函數 $F$ 為常數的集合構成的曲線。紅色的箭頭指向該點梯度的反方向。（一點處的梯度方向與通過該點的等高線垂直）。沿著梯度下降方向，將最終到達碗底，即函數 $F$ 局部極小值的點。

例子

梯度下降法處理一些複雜的非線性函數會出現問題，例如Rosenbrock函數

f(x,y)=(1-x)^{2}+100(y-x^{2})^{2}.\quad

其最小值在 $(x,y)=(1,1)$ 處，數值為 $f(x,y)=0$ 。但是此函數具有狹窄彎曲的山谷，最小值 $(x,y)=(1,1)$ 就在這些山谷之中，並且谷底很平。最佳化過程是之字形的向極小值點靠近，速度非常緩慢。

下面這個例子也鮮明的示例了"之字"的上升（非下降），這個例子用梯度上升（非梯度下降）法求 $F(x,y)=\sin \left({\frac {1}{2}}x^{2}-{\frac {1}{4}}y^{2}+3\right)\cos(2x+1-e^{y})$ 的局部極大值（非局部極小值）。

|}

缺點

梯度下降法的缺點包括：^[8]

靠近局部極小值時速度減慢。
直線搜索可能會產生一些問題。
可能會「之字型」地下降。

上述例子也已體現出了這些缺點。

參閱

參考文獻

^ Boyd, Stephen; Vandenberghe, Lieven. Convex Optimization. Cambridge University Press. 2004-03-08. ISBN 978-0-521-83378-3. doi:10.1017/cbo9780511804441.
^ ((Lemaréchal, C.)). Cauchy and the gradient method (PDF). ((Grötschel, M.)) (編). Optimization Stories. Documenta Mathematica Series 6 1st. EMS Press. 1 January 2012: 251–254 [2020-01-26]. ISBN 978-3-936609-58-5. doi:10.4171/dms/6/27 . （原始內容 (PDF)存檔於2018-12-29）.
^ Hadamard, Jacques. Mémoire sur le problème d'analyse relatif à l'équilibre des plaques élastiques encastrées. Mémoires présentés par divers savants éstrangers à l'Académie des Sciences de l'Institut de France. 1908, 33.
^ Courant, R. Variational methods for the solution of problems of equilibrium and vibrations. Bulletin of the American Mathematical Society. 1943, 49 (1): 1–23. doi:10.1090/S0002-9904-1943-07818-4 .
^ Curry, Haskell B. The Method of Steepest Descent for Non-linear Minimization Problems. Quart. Appl. Math. 1944, 2 (3): 258–261. doi:10.1090/qam/10667 .
^ Polyak, Boris. Introduction to Optimization. 1987 （英語）.
^ Akilov, G. P.; Kantorovich, L. V. Functional Analysis 2nd. Pergamon Press. 1982. ISBN 0-08-023036-9.
^ David W. A. Bourne. Steepest Descent Method. （原始內容存檔於2009年2月10日）（英語）.

Mordecai Avriel (2003). Nonlinear Programming: Analysis and Methods. Dover Publishing. ISBN 0-486-43227-0.
Jan A. Snyman (2005). Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms. Springer Publishing. ISBN 0-387-24348-8

外部連結

（英文）Interactive examples of gradient descent and some step size selection methods （頁面存檔備份，存於網際網路檔案館）
（英文）Using gradient descent in C++, Boost, Ublas for linear regression （頁面存檔備份，存於網際網路檔案館）

[auto-1] Boyd, Stephen; Vandenberghe, Lieven. Convex Optimization. Cambridge University Press. 2004-03-08. ISBN 978-0-521-83378-3. doi:10.1017/cbo9780511804441.

[2] ((Lemaréchal, C.)). Cauchy and the gradient method (PDF). ((Grötschel, M.)) (編). Optimization Stories. Documenta Mathematica Series 6 1st. EMS Press. 1 January 2012: 251–254 [2020-01-26]. ISBN 978-3-936609-58-5. doi:10.4171/dms/6/27 . （原始內容 (PDF)存檔於2018-12-29）.

[3] Hadamard, Jacques. Mémoire sur le problème d'analyse relatif à l'équilibre des plaques élastiques encastrées. Mémoires présentés par divers savants éstrangers à l'Académie des Sciences de l'Institut de France. 1908, 33.

[4] Courant, R. Variational methods for the solution of problems of equilibrium and vibrations. Bulletin of the American Mathematical Society. 1943, 49 (1): 1–23. doi:10.1090/S0002-9904-1943-07818-4 .

[5] Curry, Haskell B. The Method of Steepest Descent for Non-linear Minimization Problems. Quart. Appl. Math. 1944, 2 (3): 258–261. doi:10.1090/qam/10667 .

[BP-6] Polyak, Boris. Introduction to Optimization. 1987 （英語）.

[AK82-7] Akilov, G. P.; Kantorovich, L. V. Functional Analysis 2nd. Pergamon Press. 1982. ISBN 0-08-023036-9.

[8] David W. A. Bourne. Steepest Descent Method. （原始內容存檔於2009年2月10日）（英語）.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

閱論編可微分計算
概論	可微分編程自動微分張量微積分資訊幾何統計流形神經形態工程（英語：Neuromorphic engineering）模式識別運算學習理論（英語：Computational learning theory）歸納偏置
概念	梯度下降 SGD（英語：Stochastic gradient descent）聚類迴歸過適注意力摺積損失函數反向傳播激勵函數 softmax sigmoid ReLU 正則化資料集擴散（英語：Diffusion process）自我迴歸
應用	機器學習人工神經網絡深度學習科學計算人工智慧語言模型大型語言模型
硬體	TPU VPU IPU（英語：Graphcore）憶阻器 SpiNNaker（英語：SpiNNaker）
軟體庫	Theano TensorFlow Keras PyTorch Caffe JAX MindSpore（英語：MindSpore） Flux.jl（英語：Flux (machine-learning framework)）
主題計算機編程技術分類人工神經網絡機器學習