Random Matrix Theory Models of Deep Learning (v0.1)
邱才明, 5/9/2022

Deep Learning Theory - A Review

Multilayer neural networks
- A neural network connects many single neurons: the output of one neuron serves as the input of another.
- A multilayer network can be understood as a "nesting" of many nonlinear functions.
- Layers can be stacked without limit, so the model has essentially unlimited capacity and can fit arbitrary functions.
- Forward propagation: f_n(... f_2(f_1(z, w_1, b_1), w_2, b_2) ..., w_n, b_n)  (a minimal sketch follows below)

Common activation functions
- Sigmoid: f(z) = 1 / (1 + e^{-z})
- Tanh: f(z) = (e^z - e^{-z}) / (e^z + e^{-z})
- Rectified linear unit (ReLU): f(z) = z for z >= 0, and f(z) = 0 for z < 0
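To make the nesting concrete, here is a minimal NumPy sketch of a forward pass through a small multilayer network using the three activation functions listed above. The layer sizes and random weights are illustrative choices of mine, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)  # (e^z - e^-z) / (e^z + e^-z)

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights, biases, activations):
    """Nested nonlinear functions: f_n(... f_2(f_1(x W_1 + b_1) W_2 + b_2) ...)."""
    h = x
    for W, b, f in zip(weights, biases, activations):
        h = f(h @ W + b)
    return h

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 3]                      # illustrative layer widths
Ws = [rng.standard_normal((m, n)) / np.sqrt(m) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]
y = forward(rng.standard_normal((5, 8)), Ws, bs, [relu, tanh, sigmoid])
print(y.shape)  # (5, 3)
```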

Growing depth
- The number of layers has increased year after year while the error rate has dropped year after year; today networks reach 1000 layers.

Why does deep learning perform so well?
- Features are learned rather than hand-crafted.
- More layers capture more invariances.
- More data is available to train deeper networks.
- More computing power (GPUs).
- Better regularization: Dropout.
- New nonlinearities: max pooling, rectified linear units (ReLU).
- Yet the theoretical understanding of deep networks remains shallow.

1. Razavian, Azizpour, Sullivan, Carlsson. CNN Features off-the-shelf: an Astounding Baseline for Recognition. CVPRW 2014.
2. Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958.
3. Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448-456.

Insights from neuroscience
- Experimental neuroscience uncovered: the neural architecture of the retina/LGN/V1/V2/V3/etc.; the existence of neurons with weights and activation functions (simple cells); pooling neurons (complex cells).
- All of these features are somehow present in deep learning systems.

  Neuroscience       | Deep network
  -------------------|---------------
  Simple cells       | First layer
  Complex cells      | Pooling layer
  Grandmother cells  | Last layer

Olshausen and Field's work (Nature, 1996)
- Olshausen and Field demonstrated that receptive fields can be learned from image patches.
- They showed that an optimization process can drive the learning of image representations.

Harmonic analysis
- The Olshausen-Field representations bear a strong resemblance to well-defined mathematical objects from harmonic analysis: wavelets, ridgelets, curvelets.
- Harmonic analysis has a long history of deriving optimal representations via optimization.
- Research in the 1990s showed that wavelets and related transforms are optimal sparsifying transforms for certain classes of images.

Approximation theory
- A class prediction rule can be viewed as a function f(x) of a high-dimensional argument.
- Curse of dimensionality: the traditional theoretical obstacle to high-dimensional approximation.
- Functions of a high-dimensional x can wiggle in too many dimensions to be learned from a finite dataset.

Early theoretical results on deep learning
- Approximation theory: perceptrons and multilayer feedforward networks are universal approximators: Cybenko 89, Hornik 89, Hornik 91, Barron 93.
- Optimization theory: no spurious local optima for linear networks: Baldi & Hornik 89; stuck in local minima: Brady 89; stuck in local minima, but convergence guarantees for linearly separable data: Gori & Tesi 92; manifold of spurious local optima: Frasconi 97.

1. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4):303-314, 1989.
2. Hornik, Stinchcombe and White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.
3. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257, 1991.
4. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, 1993.
5. Baldi, P. and Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 1989.
6. Brady, Raghavan, Slawny. Back propagation fails to separate where perceptrons succeed. IEEE Trans. Circuits & Systems, 36(5):665-674, 1989.
7. Gori, Tesi. On the problem of local minima in backpropagation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14(1):76-86, 1992.
8. Frasconi, Gori, Tesi. Successes and failures of backpropagation: A theoretical investigation. Progress in Neural Networks: Architecture, 5:205, 1997.

Recent theoretical results on deep learning
- Invariance, stability, and learning theory: scattering networks: Bruna 11, Bruna 13, Mallat 13; deformation stability for Lipschitz nonlinearities: Wiatowski 15; distance- and margin-preserving embeddings: Giryes 15, Sokolic 16; geometry, generalization bounds and depth efficiency: Montufar 15, Neyshabur 15, Shashua 14-16.

1. Bruna, Mallat. Classification with scattering operators, CVPR 2011; Invariant scattering convolution networks, arXiv 2012; Mallat, Waldspurger. Deep Learning by Scattering, arXiv 2013.
2. Wiatowski, Bölcskei. A mathematical theory of deep convolutional neural networks for feature extraction. arXiv 2015.
3. Giryes, Sapiro, A. Bronstein. Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy? arXiv:1504.08291.
4. Sokolic. Margin Preservation of Deep Neural Networks, 2015.
5. Montufar. Geometric and Combinatorial Perspectives on Deep Neural Networks, 2015.
6. Neyshabur. The Geometry of Optimization and Generalization in Neural Networks: A Path-based Approach, 2015.

Recent theoretical results on deep learning (continued)
- Optimization theory and algorithms: learning low-degree polynomials from random initialization: Andoni 14; characterizing the loss surface and attacking the saddle point problem: Dauphin 14, Choromanska 15, Chaudhuri 15; global optimality in neural network training: Haeffele 15; non-convex optimization: Dauphin 14; training neural networks using tensor methods: Janzamin 15.

7. Andoni, Panigrahy, Valiant, Zhang. Learning Polynomials with Neural Networks. ICML 2014.
8. Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014.
9. Choromanska, Henaff, Mathieu, Arous, LeCun. The Loss Surfaces of Multilayer Networks. AISTATS 2015.
10. Chaudhuri, Soatto. The Effect of Gradient Noise on the Energy Landscape of Deep Networks. arXiv 2015.
11. Haeffele, Vidal. Global Optimality in Tensor Factorization, Deep Learning and Beyond. arXiv 2015.
12. Janzamin, Sedghi, Anandkumar. Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods. arXiv 2015.

RMT of Deep Learning

What is Random Matrix Theory (RMT)?
- Classical statistics, 1900-1940s: Pearson, Fisher, Neyman. Correlation of infinite vectors (Karl Pearson, 1905); correlation of finite vectors (Fisher, 1924); low-dimensional problems, with the number of random variables N = 2-10.
- High-dimensional hypothesis testing today: gene testing with N = 6033 genes and n = 102 subjects; power-grid monitoring with N = 3000-10000 PMUs and n sampled observations.
- A. N. Kolmogorov's asymptotic theory (1970-1974): high-dimensional covariance matrices in a new statistical regime where the dimension and the sample size grow large together; the classical central limit theorem no longer applies.
- Random matrix theory: E. Wigner (1955), Marchenko-Pastur (1967). Estimation errors accumulate, leaving a finite bias.
- The basic object: an N x T data matrix X = (X_ij), i = 1..N, j = 1..T, and its sample covariance (1/T) X X^T, with N and T of comparable size.
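A quick numerical illustration of why the joint limit matters. The NumPy sketch below (dimensions are arbitrary choices of mine, not from the slides) compares the eigenvalues of a sample covariance matrix of white data against the Marchenko-Pastur support when N/T is not small.

```python
import numpy as np

N, T = 500, 1000                       # illustrative dimensions, N/T = 0.5
c = N / T
rng = np.random.default_rng(1)

X = rng.standard_normal((N, T))        # i.i.d. entries, unit variance
S = X @ X.T / T                        # sample covariance of white data
eigs = np.linalg.eigvalsh(S)

# Marchenko-Pastur support for unit variance and ratio c = N/T
lam_minus, lam_plus = (1 - np.sqrt(c))**2, (1 + np.sqrt(c))**2
print(f"empirical eigenvalue range: [{eigs.min():.3f}, {eigs.max():.3f}]")
print(f"Marchenko-Pastur support:   [{lam_minus:.3f}, {lam_plus:.3f}]")
# Although the population covariance is the identity, the sample eigenvalues
# spread over the whole MP interval instead of concentrating at 1.
```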

Products of non-Hermitian random matrices: noise only
- The eigenvalues of a non-Hermitian random matrix follow a ring law.
- The unit circles are predicted by free probability theory.
- Product of L independent matrices: Y = X_1 X_2 ... X_L; the slide shows the eigenvalue clouds for L = 1 and L = 5.
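A minimal simulation of the product above, again with illustrative sizes of my own choosing: the eigenvalues of products of independent, normalized Gaussian matrices stay confined near the unit circle predicted by free probability, and increasing L changes the radial profile of the cloud.

```python
import numpy as np

def product_spectrum(n, L, rng):
    """Eigenvalues of Y = X_1 X_2 ... X_L for independent n x n Gaussian matrices,
    each normalized by 1/sqrt(n) so the limiting outer boundary is the unit circle."""
    Y = np.eye(n)
    for _ in range(L):
        Y = Y @ (rng.standard_normal((n, n)) / np.sqrt(n))
    return np.linalg.eigvals(Y)

rng = np.random.default_rng(2)
for L in (1, 5):                       # the two panels mentioned on the slide
    radii = np.abs(product_spectrum(600, L, rng))
    print(f"L={L}: max |eig| = {radii.max():.3f}, mean |eig| = {radii.mean():.3f}")
# In both cases the spectrum stays near the unit disk; increasing L concentrates
# more eigenvalues toward the origin, changing the radial distribution.
```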

Stieltjes transform and R transform
- The Stieltjes transform G of a probability density rho is G(z) = \int rho(t) / (z - t) dt.
- The distribution can be recovered using the inversion formula rho(lambda) = -(1/pi) lim_{eps -> 0+} Im G(lambda + i eps).
- Given the Stieltjes transform G, the R transform is defined as the solution of the functional equation G(R(z) + 1/z) = z.
- The benefit of the R transform: if A and B are freely independent, then R_{A+B}(z) = R_A(z) + R_B(z).

Hessian matrix
- Gradient-descent-type algorithms use first-order information.
- Under sufficient regularity conditions, Newton-type methods use second-order information (curvature), i.e., the Hessian matrix.
- Figure: gradient descent (green) and Newton's method (red) for minimizing a function.
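A small NumPy comparison of the two update rules on a toy ill-conditioned quadratic (the objective and step size are my own illustrative choices, not from the slides): gradient descent uses only the gradient, while Newton's method rescales the step by the inverse Hessian.

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * x^T A x; gradient A x, Hessian A.
A = np.diag([1.0, 100.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
hess = lambda x: A

x_gd = x_nt = np.array([1.0, 1.0])
lr = 1.0 / 100.0                      # stable step size, limited by the largest curvature
for _ in range(100):
    x_gd = x_gd - lr * grad(x_gd)                           # first-order update
    x_nt = x_nt - np.linalg.solve(hess(x_nt), grad(x_nt))   # second-order (Newton) update

print(f"gradient descent after 100 steps: f = {f(x_gd):.2e}")
print(f"Newton's method after 100 steps:  f = {f(x_nt):.2e}")
# Newton's method reaches the minimum of a quadratic immediately, while gradient
# descent crawls along the flat direction whose curvature is 1.
```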

Hessian decomposition
- Hessian decomposition (LeCun 98): G is the sample covariance matrix of the gradients of the model outputs; H is the Hessian of the model outputs.
- Generalized Gauss-Newton decomposition of the Hessian (Sagun 17): starting from the model function and the loss function, the gradient of the loss for a fixed sample yields a decomposition of the Hessian into a positive semi-definite gradient-covariance term plus a residual-weighted term (see the formula sketched below).

1. LeCun, Y., Bottou, L., Orr, G. B., and Müller, K.-R. Efficient BackProp. Lecture Notes in Computer Science, pages 9-50, 1998.
2. Sagun, L., Evci, U., Guney, V. U., Dauphin, Y., and Bottou, L. (2017). Empirical analysis of the Hessian of over-parametrized neural networks.
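The slide's own equations did not survive extraction; the following is a standard way to write the generalized Gauss-Newton decomposition it describes, for a scalar model output f(x; w) and a per-sample loss ℓ. The notation is mine, not the slide's.

```latex
% Loss over M samples, model output f(x; w), per-sample loss \ell:
L(w) = \frac{1}{M} \sum_{i=1}^{M} \ell\big(f(x_i; w),\, y_i\big)

% Gradient of the loss for a fixed sample i:
\nabla_w \ell_i = \ell'\big(f(x_i; w), y_i\big)\, \nabla_w f(x_i; w)

% Hessian of the loss: a Gauss-Newton (gradient-covariance) term plus a
% residual-weighted term involving second derivatives of the model:
\nabla_w^2 L(w)
  = \underbrace{\frac{1}{M} \sum_i \ell''_i\, \nabla_w f(x_i; w)\, \nabla_w f(x_i; w)^{\top}}_{\text{positive semi-definite}}
  + \underbrace{\frac{1}{M} \sum_i \ell'_i\, \nabla_w^2 f(x_i; w)}_{\text{depends on the residuals}}
```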

Hessian decomposition (continued)
- Defining the gradient-covariance and residual-weighted pieces as above, the Hessian of the loss can be written as the sum of these two matrices.

Empirical results
- Increasing the number of neurons grows the bulk of the spectrum "proportionally", but the outliers are determined by the data.

1. Sagun, L., Evci, U., Guney, V. U., Dauphin, Y., and Bottou, L. (2017). Empirical analysis of the Hessian of over-parametrized neural networks.

Geometry of loss surfaces via RMT
- Model the Hessian as Wishart + Wigner: H = H_0 + H_1.
- H_0 is positive semi-definite.
- H_1 comes from the second derivatives and contains all of the explicit dependence on the residuals.

1. Pennington, Jeffrey, and Yasaman Bahri. Geometry of Neural Network Loss Surfaces via Random Matrix Theory. International Conference on Machine Learning, 2017.
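A quick way to see what this model predicts is to sample it directly. The sketch below (the sizes and the relative scale of the two pieces are arbitrary choices of mine) draws H = H_0 + H_1 with H_0 a Wishart matrix and H_1 a Wigner matrix, and reports the fraction of negative eigenvalues, the quantity the "index" on the next slide refers to.

```python
import numpy as np

def wishart_plus_wigner(n, m, sigma, rng):
    """Sample H = H0 + H1 with H0 ~ Wishart (PSD) and H1 ~ Wigner (symmetric)."""
    J = rng.standard_normal((n, m)) / np.sqrt(m)
    H0 = J @ J.T                                   # positive semi-definite part
    A = rng.standard_normal((n, n))
    H1 = sigma * (A + A.T) / np.sqrt(2 * n)        # symmetric fluctuation part
    return H0 + H1

rng = np.random.default_rng(3)
n, m = 400, 800
for sigma in (0.0, 1.0, 2.0):                      # strength of the Wigner part
    eigs = np.linalg.eigvalsh(wishart_plus_wigner(n, m, sigma, rng))
    frac_neg = np.mean(eigs < 0)
    print(f"sigma={sigma:.1f}: fraction of negative eigenvalues = {frac_neg:.3f}")
# With sigma = 0 the spectrum is non-negative (pure Wishart); a sufficiently
# strong Wigner part pushes some of the bulk below zero, i.e., it creates
# directions of negative curvature.
```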

Spectral distribution
- Under very weak assumptions, using the R transform, the Stieltjes transform G of the spectrum is the solution of a cubic equation.
- Index: the number of negative eigenvalues; the critical value is the point at which the negative eigenvalues disappear.

Singularity of the Hessian in deep learning
- The eigenvalue distribution of the Hessian intuitively splits into two parts: a bulk and outliers.
- The bulk is concentrated around zero; the outliers are scattered away from zero.
- The bulk reflects the redundancy of the network parameters; the outliers reflect the complexity of the input data.
- The bulk of the eigenvalues depends on the architecture; the top discrete eigenvalues depend on the data.

1. Sagun, Levent, Léon Bottou, and Yann LeCun. Singularity of the Hessian in Deep Learning. arXiv preprint arXiv:1611.07476 (2016).

Errors of shallow networks
- Using RMT and exact solutions for linear models, the authors derive the generalization error and the training dynamics of learning.
- Zero eigenvalues: no learning dynamics along those directions, so the initialization directly determines the generalization performance.
- Non-zero but small eigenvalues: very slow learning and severe overfitting.
- The worst case for overfitting is when the number of parameters is roughly equal to the number of samples; early stopping is then essential.
- For very large networks (many parameters), overtraining has little effect.
- Shrinking the initialization helps reduce the generalization error, i.e., choose small initial weights.

1. Advani, Madhu S., and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667 (2017).

Nonlinear random matrix theory for deep learning
- Notation and assumptions: the Gram matrix is built from Y = f(WX), where W is a random weight matrix, X is a random data matrix, and f is a pointwise nonlinear activation function (a numerical illustration follows below).
- Moment method: the moment generating function, the Stieltjes transform, and the k-th moment m_k of the limiting spectral distribution (LSD) are linked; around z = infinity, G(z) = sum_{k >= 0} m_k / z^{k+1}.
- The idea behind the moment method is to compute the k-th moment by expanding out powers of M inside the trace.

The Stieltjes transform of M
- The Stieltjes transform of the spectral density of M satisfies a closed functional equation.
- Limiting cases: in the appropriate limit, this becomes precisely the equation satisfied by the Stieltjes transform of the Marchenko-Pastur distribution with the corresponding shape parameter.
- The moments themselves are calculated by expanding out powers of M inside the trace.
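A hedged numerical illustration of the setup above, in the spirit of Pennington and Worah's nonlinear RMT analysis: the matrix shapes, the normalization M = Y Yᵀ / m, and the particular activations below are my own choices. The point is simply that the bulk of the Gram-matrix spectrum depends on the activation, which is what the theory quantifies.

```python
import numpy as np

def gram_eigs(f, n0, n1, m, rng):
    """Eigenvalues of the Gram matrix M = Y Y^T / m with Y = f(W X)."""
    W = rng.standard_normal((n1, n0)) / np.sqrt(n0)   # random weights, variance 1/n0
    X = rng.standard_normal((n0, m))                  # random data
    Y = f(W @ X)
    return np.linalg.eigvalsh(Y @ Y.T / m)

rng = np.random.default_rng(4)
n0, n1, m = 1000, 500, 2000
activations = {"identity": lambda z: z,
               "tanh": np.tanh,
               "relu": lambda z: np.maximum(z, 0.0)}
for name, f in activations.items():
    eigs = gram_eigs(f, n0, n1, m, rng)
    print(f"{name:8s}: bulk ~ [{eigs.min():.3f}, {eigs.max():.3f}], mean {eigs.mean():.3f}")

# Baseline for comparison: if Y had i.i.d. unit-variance entries, M would follow
# the Marchenko-Pastur law with shape parameter n1/m.
c = n1 / m
print(f"MP baseline support: [{(1 - np.sqrt(c))**2:.2f}, {(1 + np.sqrt(c))**2:.2f}]")
# The empirical bulk shifts and reshapes with the activation, which is exactly
# the dependence the nonlinear RMT analysis characterizes.
```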

Dynamical isometry
- Background: weight initialization in deep networks can have a dramatic impact on learning speed.
- Ensuring that the mean squared singular value of a network's input-output Jacobian is O(1) is essential for avoiding exponentially vanishing or exploding gradients.
- In deep linear networks, ensuring that all singular values of the Jacobian are concentrated near 1 can yield a dramatic additional speed-up in learning; this property is known as dynamical isometry.
- It is unclear how to achieve dynamical isometry in nonlinear deep networks.

1. Saxe, Andrew M., James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120 (2013).

Dynamical isometry: results
- ReLU networks are incapable of dynamical isometry.
- Sigmoidal networks can achieve isometry, but only with orthogonal weight initialization.
- Controlling the entire distribution of Jacobian singular values is an important design consideration in deep learning (see the sketch below).

1. Pennington, Jeffrey, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. Advances in Neural Information Processing Systems, 2017.
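A minimal NumPy check of the linear-network version of this claim (the depth and width are my own choices): with i.i.d. Gaussian initialization the singular values of the depth-L Jacobian, here simply the product of the weight matrices, spread over many orders of magnitude, whereas with orthogonal initialization they all stay exactly at 1.

```python
import numpy as np

def jacobian_singular_values(depth, width, init, rng):
    """Singular values of the input-output Jacobian of a deep *linear* network,
    i.e., of the product W_L ... W_2 W_1."""
    J = np.eye(width)
    for _ in range(depth):
        if init == "gaussian":
            W = rng.standard_normal((width, width)) / np.sqrt(width)  # variance 1/width
        else:  # "orthogonal"
            Q, _ = np.linalg.qr(rng.standard_normal((width, width)))
            W = Q                                                     # exactly orthogonal
        J = W @ J
    return np.linalg.svd(J, compute_uv=False)

rng = np.random.default_rng(5)
for init in ("gaussian", "orthogonal"):
    sv = jacobian_singular_values(depth=50, width=200, init=init, rng=rng)
    print(f"{init:10s}: singular values in [{sv.min():.2e}, {sv.max():.2e}]")
# The product of orthogonal matrices is itself orthogonal, so every singular value
# equals 1 (dynamical isometry); the Gaussian product has singular values spread
# across many orders of magnitude, which is the vanishing/exploding-gradient problem.
```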

Deep learning applied to medical data

EEG analysis for psychiatric disorders
- Problem: analyze EEG recordings from subjects with known status (healthy vs. patient) and derive an effective criterion for discriminating psychiatric illness.
- Data: 5 minutes of EEG per subject, a 64 x 304760 array; conventional statistical methods struggle to extract useful information at this scale.
- Method 1: the LES (linear eigenvalue statistics) indicator discriminates the subjects well.
- Method 2: deep learning is also highly effective.

Raw data: 40 clinical high-risk individuals (CHR), 40 healthy controls (HC), 40 first-episode patients (FES).
Data format: 64 x (1000 * 60 * 5) = 64 x 300000 samples per subject. Sampling frequency: 1000 Hz.

RMT vs. deep learning
- The network has 7 layers; within each class, 75% of the subjects are used for training and 25% for testing, with cross-validation.
- Per-class results recoverable from the slide: HC 0.949, FES 0.895, CHR 0.729; the overall averages of the two approaches are 85.7% and 98.1%.

MRI analysis of methamphetamine addiction
- Raw data (resting state): 30 methamphetamine users (MA), 29 healthy controls (HC).
- Sampling time: 8 minutes; sampling frequency: 0.5 Hz; data size: (64 x 64 x 31) x 240.
- Each MRI file is split into 31 images of size 64 x 64.
- 31 corresponding CNN models are built, one per slice, for classification.
- The final decision is a majority vote over the 31 per-slice classifications (see the sketch below).
- Training set: 46 subjects (MA: 24, HC: 22); test set: 12 subjects (MA: 6, HC: 6).
- Deep learning results (the slide also lists the corresponding RMT results): HC 100%, MA 70.7%, average 85.33%.
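The per-slice voting step can be summarized in a few lines. The sketch below is schematic: the `slice_models` list stands in for the 31 trained CNNs, which are not reproduced here, so each "model" is just a toy function mapping a 64x64 slice to a hard label.

```python
import numpy as np

def predict_subject(slice_images, slice_models):
    """Classify one subject: each of the 31 slice models votes on its own
    64x64 slice, and the majority label wins (0 = HC, 1 = MA)."""
    votes = [model(img) for model, img in zip(slice_models, slice_images)]
    return int(np.sum(votes) > len(votes) / 2)

# Toy stand-ins for the 31 trained CNNs, so the voting logic can run end to end.
rng = np.random.default_rng(6)
slice_models = [lambda img, t=rng.uniform(-1, 1): int(img.mean() > t) for _ in range(31)]
subject = rng.standard_normal((31, 64, 64))
print("predicted class:", predict_subject(subject, slice_models))
```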

Brain CT recognition of hemorrhage and infarction
- Three classes (normal, infarction, hemorrhage), 90 samples in total.
- Input: 256 x 256 JPEG images; the CNN has 7 convolutional layers.
- 80% of each class is used for training and 20% for testing.
- Results: normal 100%, infarction 100%, hemorrhage/contusion 50%, average 83.33%.

Deep learning applied to microwave images
- Target detection and recognition in microwave remote-sensing images.
- Millimeter-wave body scanners: detecting dangerous or suspicious items in microwave security images.
- Valid data (the detected object is clearly visible): ceramic knife: 20 groups, 70 valid samples; water bottle: 30 groups, 153 valid samples; gun: 30 groups, 145 valid samples; metal knife: 30 groups, 108 valid samples.
- Transfer learning: exploiting small sample sets.
- Deep learning delivers excellent performance.
