




Random Matrix Theory Models for Deep Learning
邱才明, 5/9/2022
Deep Learning Theory - A Review

Multilayer neural networks
- A neural network connects many single neurons together: the output of one neuron serves as the input of another.
- A multilayer network can be understood as a "nesting" of nonlinear functions; the forward pass computes
  $f_n(\dots f_2(f_1(z, w_1, b_1), w_2, b_2)\dots, w_n, b_n)$.
- The number of layers can in principle be increased without limit, giving the model class enough capacity to fit essentially arbitrary functions.

Common activation functions
- Sigmoid: $f(z) = \dfrac{1}{1 + e^{-z}}$
- Tanh: $f(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
- Rectified linear unit (ReLU): $f(z) = \begin{cases} z, & z > 0 \\ 0, & z \le 0 \end{cases}$
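To make the nested forward pass and the three activation functions concrete, here is a minimal NumPy sketch (the layer sizes, initialization scale, and function names are illustrative choices, not taken from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # f(z) = 1 / (1 + e^-z)

def tanh(z):
    return np.tanh(z)                  # f(z) = (e^z - e^-z) / (e^z + e^-z)

def relu(z):
    return np.maximum(z, 0.0)          # f(z) = max(z, 0)

def forward(z, weights, biases, activation=relu):
    """Nested forward pass f_n(... f_2(f_1(z, w1, b1), w2, b2) ..., wn, bn)."""
    h = z
    for W, b in zip(weights, biases):
        h = activation(W @ h + b)
    return h

# Toy 3-layer network with arbitrary widths, just to show the nesting.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]
weights = [rng.standard_normal((m, n)) / np.sqrt(n) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(forward(rng.standard_normal(4), weights, biases))
```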
Growing depth
- The number of layers has increased year by year while error rates have dropped year by year; today networks reach 1000 layers.

Why does deep learning perform so well?
- Features are learned rather than hand-crafted.
- More layers capture more invariances.
- More data is available to train deeper networks.
- More computing power (GPUs).
- Better regularization: Dropout.
- New nonlinearities: max pooling, rectified linear units (ReLU).
- Yet the theoretical understanding of deep networks remains shallow.
[1] Razavian, Azizpour, Sullivan, Carlsson. CNN Features off-the-shelf: an Astounding Baseline for Recognition. CVPR Workshops, 2014.
[2] Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.
[3] Ioffe, Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, pages 448-456, 2015.

Insights from neuroscience
- Experimental neuroscience uncovered the neural architecture of the retina, LGN, V1, V2, V3, etc.
- It found neurons with weights and activation functions (simple cells) and pooling neurons (complex cells).
- All of these features are somehow present in deep learning systems.
- Correspondence: simple cells - first layer; complex cells - pooling layer; grandmother cells - last layer.
Olshausen and Field's work (Nature, 1996)
- Olshausen and Field demonstrated that receptive fields can be learned from image patches.
- They showed that an optimization process can drive the learning of image representations.

Harmonic analysis
- The Olshausen-Field representations bear a strong resemblance to well-defined mathematical objects from harmonic analysis: wavelets, ridgelets, curvelets.
- Harmonic analysis has a long history of deriving optimal representations via optimization.
- Research in the 1990s showed that wavelets and related transforms are optimal sparsifying transforms for certain classes of images.

Approximation theory
- A class prediction rule can be viewed as a function f(x) of a high-dimensional argument.
- Curse of dimensionality: the traditional theoretical obstacle to high-dimensional approximation.
- A function of a high-dimensional x can wiggle in too many dimensions to be learned from a finite dataset.
Early theoretical results on deep learning
- Approximation theory: perceptrons and multilayer feedforward networks are universal approximators: Cybenko '89, Hornik '89, Hornik '91, Barron '93 (a small numerical illustration follows the references below).
- Optimization theory:
  - No spurious local optima for linear networks: Baldi & Hornik '89.
  - Backpropagation can get stuck in local minima: Brady '89.
  - Stuck in local minima, but convergence guarantees for linearly separable data: Gori & Tesi '92.
  - Manifold of spurious local optima: Frasconi '97.
[1] Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4):303-314, 1989.
[2] Hornik, Stinchcombe, White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(3):359-366, 1989.
[3] Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257, 1991.
[4] Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, 1993.
[5] Baldi, Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 1989.
[6] Brady, Raghavan, Slawny. Back propagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36(5):665-674, 1989.
[7] Gori, Tesi. On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(1):76-86, 1992.
[8] Frasconi, Gori, Tesi. Successes and failures of backpropagation: A theoretical investigation. Progress in Neural Networks: Architecture, 5:205, 1997.
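As a small numerical illustration of the universal-approximation results cited above, the following sketch fits a one-hidden-layer tanh network to a one-dimensional target by plain gradient descent (the target function, width, learning rate, and iteration count are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x)                              # target function to approximate

h = 50                                     # hidden width
W1, b1 = rng.standard_normal((1, h)), np.zeros(h)
W2, b2 = rng.standard_normal((h, 1)) / np.sqrt(h), np.zeros(1)

lr, n = 0.05, len(x)
for step in range(5000):
    a = np.tanh(x @ W1 + b1)               # hidden layer
    err = a @ W2 + b2 - y
    # Backpropagation for the squared loss.
    gW2, gb2 = a.T @ err / n, err.mean(0)
    da = (err @ W2.T) * (1 - a ** 2)
    gW1, gb1 = x.T @ da / n, da.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

pred = np.tanh(x @ W1 + b1) @ W2 + b2
print("final MSE:", float(np.mean((pred - y) ** 2)))  # should drop well below its initial value
```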
Recent theoretical results on deep learning: invariance, stability, and learning theory
- Scattering networks: Bruna '11, Bruna '13, Mallat '13.
- Deformation stability for Lipschitz nonlinearities: Wiatowski '15.
- Distance- and margin-preserving embeddings: Giryes '15, Sokolic '16.
- Geometry, generalization bounds and depth efficiency: Montufar '15, Neyshabur '15, Shashua '14/'15/'16.
[1] Bruna, Mallat. Classification with scattering operators. CVPR 2011. Invariant scattering convolution networks. arXiv 2012. Mallat, Waldspurger. Deep Learning by Scattering. arXiv 2013.
[2] Wiatowski, Bölcskei. A mathematical theory of deep convolutional neural networks for feature extraction. arXiv 2015.
[3] Giryes, Sapiro, Bronstein. Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy? arXiv:1504.08291.
[4] Sokolic. Margin Preservation of Deep Neural Networks, 2015.
[5] Montufar. Geometric and Combinatorial Perspectives on Deep Neural Networks, 2015.
[6] Neyshabur. The Geometry of Optimization and Generalization in Neural Networks: A Path-based Approach, 2015.
Recent theoretical results on deep learning: optimization theory and algorithms
- Learning low-degree polynomials from random initialization: Andoni '14.
- Characterizing the loss surface and attacking the saddle point problem: Dauphin '14, Choromanska '15, Chaudhuri '15.
- Global optimality in neural network training: Haeffele '15.
- Non-convex optimization: Dauphin '14.
- Training neural networks using tensor methods: Janzamin '15.
[7] Andoni, Panigrahy, Valiant, Zhang. Learning Polynomials with Neural Networks. ICML 2014.
[8] Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014.
[9] Choromanska, Henaff, Mathieu, Arous, LeCun. The Loss Surfaces of Multilayer Networks. AISTATS 2015.
[10] Chaudhuri, Soatto. The Effect of Gradient Noise on the Energy Landscape of Deep Networks. arXiv 2015.
[11] Haeffele, Vidal. Global Optimality in Tensor Factorization, Deep Learning and Beyond. arXiv 2015.
[12] Janzamin, Sedghi, Anandkumar. Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods. arXiv 2015.
Random Matrix Theory of Deep Learning

What is Random Matrix Theory (RMT)?
- Pearson, Fisher, Neyman: classical statistics (1900-1940s).
  - Correlation of infinite-dimensional vectors (Karl Pearson, 1905); correlation of finite vectors (Fisher, 1924).
  - Low-dimensional problems: the number of random variables is N = 2-10.
- High-dimensional hypothesis testing:
  - Gene testing: N = 6033 genes, n = 102 subjects.
  - Power grid monitoring: N = 3000-10000 PMUs, n sampled observations.
- A. N. Kolmogorov: asymptotic theory (1970-1974).
- High-dimensional covariance matrices require a new statistical model in which N and n grow together; the classical central limit theorem no longer applies!
- Random matrix theory: E. Wigner (1955), Marchenko-Pastur (1967).
- Estimation errors accumulate, leaving a finite bias.
- Typical setting: a data matrix $X = (X_{ij}) \in \mathbb{R}^{N \times T}$ with independent zero-mean, unit-variance entries, its sample covariance matrix $S = \frac{1}{T} X X^{\mathsf T}$, and the asymptotic regime $N, T \to \infty$ with $N/T \to c > 0$.
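A quick empirical check of this setting, assuming the standard Marchenko-Pastur form with ratio c = N/T (the dimensions below are arbitrary): the eigenvalues of the sample covariance matrix spread over the whole MP support even though the population covariance is exactly the identity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 500, 2000                       # dimension and sample size, c = N/T = 0.25
X = rng.standard_normal((N, T))        # i.i.d. entries, mean 0, variance 1
S = X @ X.T / T                        # sample covariance matrix
eig = np.linalg.eigvalsh(S)

c = N / T
lam_minus, lam_plus = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
print("empirical eigenvalue range:", eig.min(), eig.max())
print("Marchenko-Pastur support:  ", lam_minus, lam_plus)
# The two ranges should nearly coincide, even though every true eigenvalue is 1:
# in high dimensions the estimation errors accumulate into a finite bias.
```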
Product of non-Hermitian random matrices: noise only
- The eigenvalues of a non-Hermitian random matrix follow a ring law.
- The unit circles bounding the spectrum are predicted by free probability theory.
- The product is $Y = \prod_{i=1}^{L} X_i = X_1 X_2 \cdots X_L$; the slide shows the eigenvalue clouds for L = 1 and L = 5.
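A rough simulation in this direction, using products of square complex Ginibre matrices as a stand-in for the slide's construction (the slide's exact matrix model, e.g. whether rectangular matrices or their singular-value equivalents are used, is not stated in the text): the outer edge of the eigenvalue cloud stays close to the unit circle, as free probability predicts.

```python
import numpy as np

def product_eigs(n, L, rng):
    """Eigenvalues of a product of L independent n x n complex Ginibre matrices,
    each scaled so its entries have variance 1/n."""
    Y = np.eye(n, dtype=complex)
    for _ in range(L):
        X = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2 * n)
        Y = Y @ X
    return np.linalg.eigvals(Y)

rng = np.random.default_rng(0)
for L in (1, 5):
    lam = product_eigs(1000, L, rng)
    radius = np.abs(lam)
    print(f"L = {L}: largest |eigenvalue| = {radius.max():.3f}, "
          f"fraction with |eigenvalue| > 0.5 = {np.mean(radius > 0.5):.2f}")
# For L = 1 this reproduces the circular law (eigenvalues roughly uniform on the
# unit disk); for L = 5 the outer edge is still near 1 but the mass shifts toward
# the origin, so the fraction outside radius 0.5 drops noticeably.
```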
Stieltjes transform and R transform
- The Stieltjes transform G of a probability distribution $\rho$ is
  $G(z) = \int \frac{\rho(t)}{z - t}\, dt$.
- The distribution can be recovered using the inversion formula
  $\rho(\lambda) = -\frac{1}{\pi} \lim_{\epsilon \to 0^{+}} \operatorname{Im} G(\lambda + i\epsilon)$.
- Given the Stieltjes transform G, the R transform is defined as the solution of the functional equation
  $G\!\left(R(z) + \tfrac{1}{z}\right) = z$.
- The benefit of the R transform: if A and B are freely independent, then $R_{A+B}(z) = R_A(z) + R_B(z)$.
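The Stieltjes transform is easy to probe numerically. The sketch below (matrix size and evaluation points chosen arbitrarily) compares the empirical transform $G(z) = \frac{1}{N}\sum_i (z-\lambda_i)^{-1}$ of a Wigner matrix with the closed-form transform of the semicircle law on $[-2, 2]$, $G(z) = \tfrac{1}{2}\big(z - \sqrt{z^2 - 4}\big)$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
A = rng.standard_normal((N, N))
W = (A + A.T) / np.sqrt(2 * N)          # Wigner matrix; spectrum follows the semicircle on [-2, 2]
eig = np.linalg.eigvalsh(W)

def g_empirical(z):
    return np.mean(1.0 / (z - eig))

def g_semicircle(z):
    return 0.5 * (z - np.sqrt(z * z - 4.0 + 0j))

for z in (3.0, 1.0 + 0.5j, 0.0 + 0.1j):
    print(z, g_empirical(z), g_semicircle(z))
# Away from the real axis, or outside the support, the two values agree closely,
# which is exactly the inversion/approximation picture described above.
```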
Gradient descent and the Hessian matrix
- Gradient-descent-type algorithms use first-order information (the gradient), under sufficient regularity conditions.
- Newton-type methods use second-order information (curvature), encoded in the Hessian matrix.
- [Figure: gradient descent (green) and Newton's method (red) trajectories for minimizing a function.]
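The difference is easy to see on a badly conditioned quadratic (the matrix and step size below are arbitrary): gradient descent crawls along the flat direction, while Newton's method uses the curvature and lands on the minimizer in one step.

```python
import numpy as np

# f(x) = 0.5 * x^T A x with an ill-conditioned Hessian A.
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
hess = lambda x: A

x_gd = np.array([1.0, 1.0])
lr = 1.0 / 100.0                       # must be below 2 / lambda_max to stay stable
for _ in range(100):
    x_gd = x_gd - lr * grad(x_gd)

x_newton = np.array([1.0, 1.0])
x_newton = x_newton - np.linalg.solve(hess(x_newton), grad(x_newton))

print("gradient descent after 100 steps:", x_gd)      # still far from 0 along the flat direction
print("Newton after 1 step:            ", x_newton)   # essentially the exact minimizer [0, 0]
```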
Hessian decomposition
- LeCun '98: the Hessian splits into a term G, the sample covariance matrix of the gradients of the model outputs, plus a term H built from the Hessian of the model outputs.
- Generalized Gauss-Newton decomposition of the Hessian (Sagun '17); a numerical check appears after the references below:
  - Model function: $f(w; x)$, the output for input $x$ with parameters $w$.
  - Loss function: $\ell\big(f(w; x), y\big)$, e.g. the squared error, averaged over samples.
  - The gradient of the loss for a fixed sample is $\nabla_w \ell = \ell'\big(f(w;x), y\big)\, \nabla_w f(w; x)$.
  - The Hessian for a fixed sample can then be written as
    $\nabla_w^2 \ell = \ell''\, \nabla_w f \,(\nabla_w f)^{\mathsf T} + \ell'\, \nabla_w^2 f$,
    and averaging over samples gives the Hessian of the full loss.
- Defining the outer-product (Gauss-Newton) term and the residual-weighted term separately, the Hessian of the loss splits into a positive semi-definite piece plus a piece carrying all the explicit dependence on the residuals.

Empirical results
- Increasing the number of neurons enlarges the bulk of the spectrum "proportionally", but the outliers are determined by the data.
[1] LeCun, Bottou, Orr, Müller. Efficient BackProp. Lecture Notes in Computer Science, pages 9-50, 1998.
[2] Sagun, Evci, Guney, Dauphin, Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. 2017.
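A minimal numerical check of this decomposition, assuming a scalar model $f(w;x) = \tanh(w^{\mathsf T}x)$ and squared loss (the model, data sizes, and step size are illustrative): the analytic split $H = H_0 + H_1$ is compared against a finite-difference Hessian of the loss.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 200
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

def loss(w):
    """Mean squared error of the model f(w; x) = tanh(w^T x)."""
    return 0.5 * np.mean((np.tanh(X @ w) - y) ** 2)

# Analytic decomposition H = H0 + H1.
t = np.tanh(X @ w)
dt = 1.0 - t ** 2                        # tanh'
ddt = -2.0 * t * dt                      # tanh''
r = t - y                                # residuals
grads = dt[:, None] * X                  # per-sample gradients of the model output
H0 = grads.T @ grads / n                 # Gauss-Newton part (positive semi-definite)
H1 = X.T @ ((r * ddt)[:, None] * X) / n  # residual-weighted second-derivative part
H = H0 + H1

# Finite-difference Hessian of the loss for comparison.
eps = 1e-4
I = np.eye(d)
H_fd = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        H_fd[i, j] = (loss(w + eps * (I[i] + I[j])) - loss(w + eps * (I[i] - I[j]))
                      - loss(w - eps * (I[i] - I[j])) + loss(w - eps * (I[i] + I[j]))) / (4 * eps ** 2)

print("max |H - H_fd| =", np.max(np.abs(H - H_fd)))   # tiny: only finite-difference error remains
```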
Geometry of loss surfaces via RMT
- H = Wishart + Wigner, i.e. the Hessian splits as $H = H_0 + H_1$.
- $H_0$ is positive semi-definite (the Wishart-like Gauss-Newton part).
- $H_1$ comes from the second derivatives of the model and contains all of the explicit dependence on the residuals (the Wigner-like part).
[1] Pennington, Bahri. Geometry of Neural Network Loss Surfaces via Random Matrix Theory. ICML 2017.
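Following this modelling assumption, the spectrum of such a Hessian can be simulated directly (the sizes, ratio, and noise scale below are arbitrary): a Wishart matrix plays the role of $H_0$, a scaled Wigner matrix the role of $H_1$, and the fraction of negative eigenvalues (the index) grows as the Wigner part grows.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 400, 800                          # parameters and samples, ratio p/n = 0.5

J = rng.standard_normal((p, n)) / np.sqrt(n)
H0 = J @ J.T                             # Wishart part: positive semi-definite

A = rng.standard_normal((p, p))
H1 = (A + A.T) / np.sqrt(2 * p)          # Wigner part: symmetric, zero mean

for sigma in (0.0, 0.2, 0.5, 1.0):       # sigma mimics the size of the residuals
    eig = np.linalg.eigvalsh(H0 + sigma * H1)
    print(f"sigma = {sigma:.1f}: fraction of negative eigenvalues = {np.mean(eig < 0):.3f}")
# With sigma = 0 the spectrum is non-negative (pure Wishart); as sigma grows, more
# negative eigenvalues appear, which is the qualitative picture behind the index
# calculation in Pennington & Bahri.
```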
Spectral distribution
- Under very weak assumptions, and using the R transform, the Stieltjes transform G of the Hessian spectrum is obtained as the solution of a cubic equation.
- The index (the number of negative eigenvalues) can then be computed as a function of the loss level; the critical value $\epsilon_c$ is where the index vanishes, i.e. no negative eigenvalues remain.

Singularity of the Hessian in deep learning
- The eigenvalue distribution of the Hessian intuitively splits into two parts: a bulk and outliers.
- The bulk is concentrated around zero; the outliers are scattered away from zero.
- The bulk reflects the redundancy (over-parametrization) of the network; the outliers reflect the complexity of the input data.
- The bulk of the eigenvalues depends on the architecture; the top discrete eigenvalues depend on the data.
[1] Sagun, Bottou, LeCun. Singularity of the Hessian in Deep Learning. arXiv:1611.07476, 2016.
Errors of shallow networks
- Using RMT and exact solutions for linear models, the authors derive the generalization error and the training dynamics of learning:
  - Zero eigenvalues: no learning dynamics along those directions, so the initialization directly affects generalization performance.
  - Non-zero but small eigenvalues: learning is very slow and overfitting is severe.
  - For avoiding overfitting, the regime where the number of parameters is comparable to the number of samples is the worst case, and early stopping is essential there.
  - For very large (heavily over-parametrized) networks, overtraining has little effect.
  - Reducing the initial weight scale helps reduce the generalization error, i.e. smaller initial values are preferable.
[1] Advani, Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv:1710.03667, 2017.
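A toy simulation in the spirit of these results (a linear student/teacher setup with arbitrary sizes and noise level; an illustration, not the paper's exact model): the test error of the minimum-norm least-squares solution peaks when the number of parameters is close to the number of training samples, the regime where early stopping matters most.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, noise = 100, 1000, 0.5

for d in (20, 50, 90, 100, 110, 200, 400):            # number of parameters
    errs = []
    for _ in range(20):
        w_star = rng.standard_normal(d) / np.sqrt(d)   # "teacher" weights
        X = rng.standard_normal((n_train, d))
        y = X @ w_star + noise * rng.standard_normal(n_train)
        # Minimum-norm least-squares solution: what gradient descent converges to
        # from a zero initialization when trained to completion.
        w_hat = np.linalg.pinv(X) @ y
        Xt = rng.standard_normal((n_test, d))
        yt = Xt @ w_star + noise * rng.standard_normal(n_test)
        errs.append(np.mean((Xt @ w_hat - yt) ** 2))
    print(f"d = {d:4d}  test MSE = {np.mean(errs):.3f}")
# The test error typically peaks near d = n_train, matching the "worst case"
# regime described above.
```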
Nonlinear random matrix theory for deep learning
- Notation and assumptions:
  - The object of study is the Gram matrix $M = \frac{1}{m} Y Y^{\mathsf T}$ of the post-activation features $Y = f(WX)$, where W is a random weight matrix, X is a random data matrix, and f is a pointwise nonlinear activation function.
  - Assumption: the entries of W and X are i.i.d. Gaussian, and the matrix dimensions grow to infinity at fixed ratios.
- Moment method:
  - The moment generating function and the Stieltjes transform of the limiting spectral distribution (LSD) are linked by $G(z) = \sum_{k \ge 0} \frac{m_k}{z^{k+1}}$, where $m_k$ is the k-th moment of the LSD, $m_k = \lim \frac{1}{n}\,\mathbb{E}\,\mathrm{tr}\, M^k$.
  - The idea behind the moment method is to compute the k-th moment by expanding out powers of M inside the trace and evaluating the expectation term by term.
- The Stieltjes transform of the spectral density of M satisfies a closed-form functional equation whose coefficients depend on the shape ratios and on a few scalar statistics of the nonlinearity f.
- Limiting cases: when the relevant statistic of the nonlinearity vanishes, this becomes precisely the equation satisfied by the Stieltjes transform of the Marchenko-Pastur distribution with the corresponding shape parameter, i.e. M then behaves spectrally as if Y had i.i.d. entries.
- How to calculate the moments: the moments are evaluated by counting the terms (graphs) that survive in the expansion of $\mathrm{tr}\, M^k$.
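A small experiment consistent with the limiting case described above (the dimensions and the particular nonlinearity are arbitrary choices): for an activation with no linear component under the Gaussian measure, the spectrum of $M = \frac{1}{m} Y Y^{\mathsf T}$, $Y = f(WX)$, should come close to a scaled Marchenko-Pastur law, as if the entries of Y were independent.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, m = 1000, 500, 2000                        # input dim, width, number of samples

W = rng.standard_normal((n1, n0)) / np.sqrt(n0)    # random weights
X = rng.standard_normal((n0, m))                   # random data, so (WX)_ij is approx N(0, 1)

# A "purely nonlinear" activation with E[Z f(Z)] = 0 for Z ~ N(0, 1).
f = lambda z: z ** 2 - 1.0
Y = f(W @ X)
eig = np.linalg.eigvalsh(Y @ Y.T / m)

eta = 2.0                                          # E[f(Z)^2] for this f, sets the scale
c = n1 / m
mp_edges = eta * np.array([(1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2])
print("empirical eigenvalue range:    ", eig.min(), eig.max())
print("scaled Marchenko-Pastur support:", mp_edges)
# The two ranges should be close for this choice of f; a linear or tanh activation
# would instead deform the spectrum away from the plain Marchenko-Pastur shape.
```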
Dynamical isometry
- Background:
  - Weight initialization in deep networks can have a dramatic impact on learning speed.
  - Ensuring that the mean squared singular value of the network's input-output Jacobian is O(1) is essential for avoiding exponentially vanishing or exploding gradients.
  - In deep linear networks, ensuring that all singular values of the Jacobian are concentrated near 1 can yield a dramatic additional speed-up in learning; this property is known as dynamical isometry.
  - It is unclear how to achieve dynamical isometry in nonlinear deep networks.
[1] Saxe, McClelland, Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013.

- Results:
  - ReLU networks are incapable of dynamical isometry.
  - Sigmoidal networks can achieve dynamical isometry, but only with orthogonal weight initialization.
  - Controlling the entire distribution of Jacobian singular values is an important design consideration in deep learning.
[1] Pennington, Schoenholz, Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. NIPS 2017.
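For deep linear networks the effect is easy to reproduce (the depth, width, and scaling below are arbitrary): with Gaussian initialization the singular values of the end-to-end Jacobian spread over many orders of magnitude as depth grows, while with orthogonal initialization they all stay exactly at 1.

```python
import numpy as np

def jacobian_singvals(depth, width, init, rng):
    """Singular values of the input-output Jacobian of a deep *linear* network."""
    J = np.eye(width)
    for _ in range(depth):
        if init == "gaussian":
            W = rng.standard_normal((width, width)) / np.sqrt(width)  # entry variance 1/width
        else:  # "orthogonal": Q factor of a Gaussian matrix
            W, _ = np.linalg.qr(rng.standard_normal((width, width)))
        J = W @ J
    return np.linalg.svd(J, compute_uv=False)

rng = np.random.default_rng(0)
for depth in (1, 10, 50):
    for init in ("gaussian", "orthogonal"):
        s = jacobian_singvals(depth, 200, init, rng)
        print(f"depth {depth:3d}, {init:10s}: min sv = {s.min():.2e}, max sv = {s.max():.2e}")
# Gaussian products develop a huge spread of singular values with depth, whereas
# orthogonal products keep every singular value equal to 1 (perfect isometry in the
# linear case; the nonlinear case is what the papers above analyse).
```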
Deep learning applied to medical data

EEG analysis for psychiatric disorders
- Problem: analyse EEG recordings from subjects with known status (healthy controls vs. patients) and derive discriminative criteria that are effective for psychiatric disorders.
- Data: 5 minutes of EEG per subject, of size 64 x 304760; conventional statistical methods struggle to extract useful information at this scale.
- Method 1: an LES indicator separates the subjects well.
- Method 2: deep learning is also highly effective.
- Raw data: 40 clinical high-risk individuals (CHR), 40 healthy controls (HC), 40 first-episode patients (FES).
- Data format: 64 x (1000 x 60 x 5) = 64 x 300000 samples per subject; sampling frequency 1000 Hz.

Random matrices vs. deep learning on the EEG data
- The deep network has 7 layers.
- For each group, 75% of the subjects are used for training and 25% for testing, with cross-validation.
- Reported per-class accuracies: HC 0.949, FES 0.895, CHR 0.729; overall figures of 85.7% and 98.1% are quoted for the two methods.
MRI analysis of methamphetamine addiction
- Raw data (resting state): 30 methamphetamine users (MA), 29 healthy controls (HC).
- Sampling time: 8 minutes; sampling frequency: 0.5 Hz; data size: (64 x 64 x 31) x 240 per subject.
- Each MRI file is split into 31 images of 64 x 64.
- 31 corresponding CNN models are built, one per slice, each performing its own classification.
- The final decision is obtained by majority vote over the 31 per-slice classification results.
- Training set: 46 subjects (MA: 24, HC: 22); test set: 12 subjects (MA: 6, HC: 6).
- Deep learning results: HC 100%, MA 70.7%, average 85.33%.
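A hedged sketch of this per-slice ensemble (the architecture, layer sizes, and prediction logic are illustrative guesses; the slides only specify 31 slice-wise CNNs combined by majority vote):

```python
import torch
import torch.nn as nn

class SliceCNN(nn.Module):
    """Small CNN for one 64x64 MRI slice (binary: MA vs. HC)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 64 -> 32
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32 -> 16
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16 -> 8
        )
        self.classifier = nn.Linear(32 * 8 * 8, 2)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

n_slices = 31
models = [SliceCNN() for _ in range(n_slices)]   # one CNN per slice, trained separately

def predict_subject(volume):
    """volume: tensor of shape (31, 1, 64, 64) holding one subject's slices.
    Each slice model votes; the majority label is the final decision."""
    votes = []
    with torch.no_grad():
        for k, model in enumerate(models):
            votes.append(int(model(volume[k:k + 1]).argmax(dim=1)))
    return int(torch.tensor(votes).float().mean() > 0.5)

# Example call with dummy data standing in for a real MRI volume.
print(predict_subject(torch.randn(n_slices, 1, 64, 64)))
```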
Brain CT recognition of hemorrhage vs. infarction
- Three classes (normal, infarction, hemorrhage/contusion), 90 samples in total.
- Input: 256 x 256 JPEG images.
- CNN with 7 convolutional layers.
- 80% of each class is used for training, 20% for testing.
- Accuracy: normal 100%, infarction 100%, hemorrhage/contusion 50%; average 83.33%.

Deep learning applied to microwave images
- Target detection and recognition in microwave remote-sensing images.
- Millimeter-wave body scanners: detecting dangerous or suspicious items in microwave security images.

Transfer learning with small sample sets
- Valid data: samples in which the object under inspection is clearly visible.
- Ceramic knives: 20 groups, 70 valid samples; water bottles: 30 groups, 153 valid samples; guns: 30 groups, 145 valid samples; metal knives: 30 groups, 108 valid samples.
- Deep learning delivers excellent performance ...
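A minimal transfer-learning sketch consistent with the small-sample setting above (the backbone choice, number of classes, and hyperparameters are assumptions; the slides do not specify them): freeze a pretrained backbone and retrain only the final classifier on the four object categories.

```python
import torch
import torch.nn as nn
from torchvision import models

# Four categories from the slide: ceramic knife, water bottle, gun, metal knife.
num_classes = 4

# ImageNet-pretrained backbone; which backbone was actually used is unknown.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False                                       # freeze the feature extractor
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)     # new trainable head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One optimization step on a mini-batch of security-scan image crops."""
    optimizer.zero_grad()
    loss = criterion(backbone(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch just to show the call signature (3-channel 224x224 crops).
print(train_step(torch.randn(8, 3, 224, 224), torch.randint(0, num_classes, (8,))))
```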