
Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

arXiv:2504.14772v1 [cs.CL] 20 Apr 2025

Luyang Fang*1, Xiaowei Yu*2, Jiazhang Cai1, Yongkai Chen3, Shushan Wu1, Zhengliang Liu4, Zhenyuan Yang4, Haoran Lu1, Xilin Gong1, Yufang Liu1, Terry Ma5, Wei Ruan4, Ali Abbasi6, Jing Zhang2, Tao Wang1, Ehsan Latif7, Wei Liu8, Wei Zhang9, Soheil Kolouri6, Xiaoming Zhai7, Dajiang Zhu2, Wenxuan Zhong†1, Tianming Liu†4, and Ping Ma†1

1 Department of Statistics, University of Georgia, GA, USA
2 Department of Computer Science and Engineering, The University of Texas at Arlington, TX, USA
3 Department of Statistics, Harvard University, MA, USA
4 School of Computing, University of Georgia, GA, USA
5 School of Computer Science, Carnegie Mellon University, PA, USA
6 Department of Computer Science, Vanderbilt University, TN, USA
7 AI4STEM Education Center, University of Georgia, GA, USA
8 Department of Radiation Oncology, Mayo Clinic Arizona, AZ, USA
9 School of Computer and Cyber Sciences, Augusta University, GA, USA

April 22, 2025

Abstract

The exponential growth of Large Language Models (LLMs) continues to highlight the need for efficient strategies to meet ever-expanding computational and data demands. This survey provides a comprehensive analysis of two complementary paradigms: Knowledge Distillation (KD) and Dataset Distillation (DD), both aimed at compressing LLMs while preserving their advanced reasoning capabilities and linguistic diversity. We first examine key methodologies in KD, such as task-specific alignment, rationale-based training, and multi-teacher frameworks, alongside DD techniques that synthesize compact, high-impact datasets through optimization-based gradient matching, latent space regularization, and generative synthesis. Building on these foundations, we explore how integrating KD and DD can produce more effective and scalable compression strategies. Together, these approaches address persistent challenges in model scalability, architectural heterogeneity, and the preservation of emergent LLM abilities.

We further highlight applications across domains such as healthcare and education, where distillation enables efficient deployment without sacrificing performance. Despite substantial progress, open challenges remain in preserving emergent reasoning and linguistic diversity, enabling efficient adaptation to continually evolving teacher models and datasets, and establishing comprehensive evaluation protocols. By synthesizing methodological innovations, theoretical foundations, and practical insights, our survey charts a path toward sustainable, resource-efficient LLMs through the tighter integration of KD and DD principles.

Keywords: Large Language Models, Knowledge Distillation, Dataset Distillation, Efficiency, Model Compression, Survey

* Co-first authors.
† Co-corresponding authors. {pingma;tliu;wenxuan}@

1 Introduction

The emergence of Large Language Models (LLMs) like GPT-4 (Brown et al., 2020), DeepSeek (Guo et al., 2025), and LLaMA (Touvron et al., 2023) has transformed natural language processing, enabling unprecedented capabilities in tasks like translation, reasoning, and text generation. Despite these landmark achievements, these advancements come with significant challenges that hinder their practical deployment. First, LLMs demand immense computational resources, often requiring thousands of GPU hours for training and inference, which translates to high energy consumption and environmental costs. Second, their reliance on massive training datasets raises concerns about data efficiency, quality, and sustainability, as public corpora become overutilized and maintaining diverse, high-quality data becomes increasingly difficult (Hadi et al., 2023). Additionally, LLMs exhibit emergent abilities, such as chain-of-thought reasoning (Wei et al., 2022), which are challenging to replicate in smaller models without sophisticated knowledge transfer techniques.

To surmount these challenges, distillation has emerged as a pivotal strategy, integrating Knowledge Distillation (KD) (Hinton et al., 2015) and Dataset Distillation (DD) (Wang et al., 2018) to tackle both model compression and data efficiency. Crucially, the success of KD in LLMs hinges on DD techniques, which enable the creation of compact, information-rich synthetic datasets that encapsulate the diverse and complex knowledge of the teacher LLMs.

KD transfers knowledge from a large, pre-trained teacher model to a smaller, more efficient student model by aligning outputs or intermediate representations. While effective for moderate-scale teacher models, traditional KD struggles with LLMs due to their vast scale, where knowledge is distributed across billions of parameters and intricate attention patterns. Moreover, the knowledge is not limited to output distributions or intermediate representations but also includes higher-order capabilities such as reasoning ability and complex problem-solving skills (Wilkins and Rodriguez, 2024; Zhao et al., 2023; Latif et al., 2024). DD aims to condense large training datasets into compact synthetic datasets that retain the essential information required to train models efficiently. Recent work has shown that DD can significantly reduce the computational burden of LLM training while maintaining performance. For example, DD can distill millions of training samples into a few hundred synthetic examples that preserve task-specific knowledge (Cazenavette et al., 2022; Maekawa et al., 2024). When applied to LLMs, DD acts as a critical enabler for KD: it identifies high-impact training examples that reflect the teacher's reasoning processes, thereby guiding the student to learn efficiently without overfitting to redundant data (Sorscher et al., 2022).

The scale of LLMs introduces dual challenges: reliance on unsustainable massive datasets (Hadi et al., 2023) and emergent abilities (e.g., chain-of-thought reasoning (Wei et al., 2022)) requiring precise transfer. These challenges necessitate a dual focus on KD and DD. While KD compresses LLMs by transferring knowledge to smaller models, traditional KD alone cannot address the data efficiency crisis: training newer LLMs on redundant or low-quality data yields diminishing returns (Albalak et al., 2024). DD complements KD by curating compact, high-fidelity datasets (e.g., rare reasoning patterns (Li et al., 2024)), as demonstrated in LIMA, where 1,000 examples achieved teacher-level performance (Zhou et al., 2023). This synergy leverages KD's ability to transfer learned representations and DD's capacity to generate task-specific synthetic data that mirrors the teacher's decision boundaries. Together, they address privacy concerns, computational overhead, and data scarcity, enabling smaller models to retain both the efficiency of distillation and the critical capabilities of their larger counterparts.

This survey comprehensively examines KD and DD techniques for LLMs, followed by a discussion of their integration. Traditional KD transfers knowledge from large teacher models to compact students, but modern LLMs' unprecedented scale introduces challenges like capturing emergent capabilities and preserving embedded knowledge. DD addresses these challenges by synthesizing smaller, high-impact datasets that retain linguistic, semantic, and reasoning diversity for effective training. Our analysis prioritizes standalone advancements in KD and DD while exploring their combined potential to enhance model compression, training efficiency, and resource-aware deployment. This survey underscores their collective role in overcoming scalability, data scarcity, and computational barriers.

The subsequent sections explore the following key aspects:

• Fundamentals of KD and DD (Section 2), distinguishing their roles in compressing LLMs and optimizing training efficiency.

• Methodologies for KD in LLMs (Section 3), including rationale-based distillation, uncertainty-aware approaches, multi-teacher frameworks, dynamic/adaptive strategies, and task-specific distillation. Additionally, we review theoretical studies that offer deeper insights into the underlying principles of KD.

• Methodologies for DD in LLMs (Section 4), covering optimization-based distillation, synthetic data generation, and complementary data selection strategies for compact training data.

• Integration of KD and DD (Section 5), presenting unified frameworks that combine KD and DD strategies for enhancing LLMs.

• Evaluation metrics (Section 6) for assessing the effectiveness of distillation in LLMs, focusing on performance retention, computational efficiency, and robustness.

• Applications across multiple domains (Section 7), including medical and health, education, and bioinformatics, demonstrating the practical benefits of distillation in real-world scenarios.

• Challenges and future directions (Section 8), identifying key areas for improvement.

The taxonomy of this survey is illustrated in Figure 1.

2 Fundamentals of Distillation

This section introduces the definition and core concepts of Knowledge Distillation (KD) and Dataset Distillation (DD). Additionally, it discusses the significance of distillation in LLMs compared to traditional distillation methods.

[Figure 1 (taxonomy tree): the surveyed literature on Distillation in LLMs is organized into Methodologies for KD in LLMs (Rationale-Based KD; Uncertainty & Bayesian KD; Multi-Teacher & Ensemble KD; Dynamic & Adaptive KD; Task-Specific KD; Theoretical Studies), Methodologies for DD in LLMs (Optimization-Based DD; Synthetic Data Generation; Data Selection: Filtering, Coreset, Data Attribution), Integration of Knowledge Distillation & Dataset Distillation (Knowledge Transfer via Dataset Distillation; Prompt-Based Synthetic Data Generation), Evaluation and Metrics, and Applications (Medical & Healthcare; Education & E-Learning; Bioinformatics), with representative citations listed under each branch.]

Figure 1: Taxonomy of Distillation of Large Language Models.

2.1 Knowledge Distillation

2.1.1 Definition and Core Concepts

KD is a model compression paradigm that transfers knowledge from a computationally intensive teacher model f_T to a compact student model f_S. Formally, KD trains f_S to approximate both the output behavior and intermediate representations of f_T. The foundational work of Hinton et al. (2015) introduced the concept of "soft labels": instead of training on hard labels y, the student learns from the teacher's class probability distribution p_T = σ(z_T/τ), where z_T are logits from f_T, σ is the softmax function, and τ is a temperature parameter that controls distribution smoothness. The student's objective combines a cross-entropy loss L_CE (for hard labels) and a distillation loss L_KL:

L_KD = α · L_CE(σ(z_S(x)), y) + (1 − α) · τ² · L_KL(σ(z_T(x)/τ), σ(z_S(x)/τ)),    (1)

where L_KL is the Kullback-Leibler (KL) divergence between student and teacher softened outputs, and α balances the two terms. Beyond logits, later works generalized KD to transfer hidden state activations (Romero et al., 2014), intermediate layers (Sun et al., 2019), attention matrices (Jiao et al., 2019), or relational knowledge (Park et al., 2019), formalized as minimizing distance metrics (e.g., ‖h_T − h_S‖²) between teacher and student representations. This framework enables the student to inherit not only task-specific accuracy but also the teacher's generalization patterns, making KD a cornerstone for efficient model deployment.
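For concreteness, the following is a minimal PyTorch-style sketch of the objective in Eq. (1); the function name and the default values of α and τ are illustrative choices rather than settings prescribed by the surveyed works.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=2.0):
    """Sketch of the classic KD objective in Eq. (1).

    student_logits, teacher_logits: [batch, num_classes] tensors (z_S, z_T)
    labels: [batch] ground-truth class indices (hard labels y)
    alpha:  weight on the hard-label cross-entropy term
    tau:    temperature controlling the smoothness of the softened distributions
    """
    # Hard-label cross-entropy on the student's un-softened outputs
    ce = F.cross_entropy(student_logits, labels)

    # KL divergence between temperature-softened teacher and student distributions
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")

    # tau^2 rescales the soft term so its gradient magnitude stays comparable across temperatures
    return alpha * ce + (1.0 - alpha) * (tau ** 2) * kl
```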

Figure 2: Overview of Knowledge Distillation in LLMs. Knowledge is distilled from a teacher LLM, which has been trained on a large existing database. This knowledge, potentially enriched with current, task-specific data, is transferred to a smaller student LLM. By learning from both the teacher's guidance and the current data, the student LLM becomes more efficient and effective at performing downstream tasks.

2.1.2 KD in the Era of LLMs vs. Traditional Models

The emergence of LLMs, exemplified by models like GPT-3 (175B parameters (Brown et al., 2020)), has necessitated rethinking traditional KD paradigms. While classical KD usually focuses on compressing task-specific models (e.g., ResNet-50 to MobileNet) with homogeneous architectures (Gou et al., 2021), LLM-driven distillation confronts four fundamental shifts:

• Scale-Driven Shifts: Traditional KD operates on static output distributions (e.g., class probabilities), but autoregressive LLMs generate sequential token distributions over vocabularies of ~50k tokens. This demands novel divergence measures for sequence-level knowledge transfer (Shridhar et al., 2023), such as token-level Kullback-Leibler minimization or dynamic temperature scaling (a sketch of token-level KL appears after this list).

• Architectural Heterogeneity: Traditional KD often assumed matched or closely related teacher-student topologies (e.g., both CNNs). LLM distillation often bridges architecturally distinct models (e.g., sparse Mixture-of-Experts teachers to dense students (Fedus et al., 2022)). This requires layer remapping strategies (Jiao et al., 2019) and representation alignment (e.g., attention head distillation (Michel et al., 2019)) to bridge topological gaps while preserving generative coherence.

• Knowledge Localization: LLMs encode knowledge across deep layer stacks and multi-head attention mechanisms, necessitating distillation strategies that address:

  – Structural patterns: Attention head significance (Michel et al., 2019) and layer-specific functional roles (e.g., syntax vs. semantics).

  – Reasoning trajectories: Explicit rationales like Chain-of-Thought (Wei et al., 2022) and implicit latent state progressions.

  Unlike traditional model distillation, which often focuses on replicating localized features, LLM distillation must preserve cross-layer dependencies that encode linguistic coherence and logical inference (Sun et al., 2019).

• Dynamic Adaptation: LLM distillation increasingly employs iterative protocols where teachers evolve via reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022) or synthetic data augmentation (Taori et al., 2023b), diverging from static teacher assumptions in classical KD.
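To illustrate the sequence-level transfer noted under the first shift, the following is a minimal PyTorch-style sketch of token-level KL distillation for an autoregressive student; the function name, padding-mask convention, and temperature handling are assumptions, not a specific method from the literature.

```python
import torch
import torch.nn.functional as F

def sequence_kd_loss(student_logits, teacher_logits, attention_mask, tau=1.0):
    """Token-level KL for autoregressive distillation (illustrative sketch).

    student_logits, teacher_logits: [batch, seq_len, vocab_size]
    attention_mask: [batch, seq_len], 1 for real tokens, 0 for padding
    """
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    p_t = F.softmax(teacher_logits / tau, dim=-1)

    # KL(teacher || student) at each position, summed over the vocabulary
    per_token_kl = (p_t * (torch.log(p_t + 1e-9) - log_p_s)).sum(dim=-1)

    # Average only over non-padding positions
    mask = attention_mask.float()
    return (per_token_kl * mask).sum() / mask.sum() * (tau ** 2)
```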

2.2 Dataset Distillation

2.2.1 Overview of Dataset Distillation

Dataset distillation (Wang et al., 2018) is a technique designed to condense knowledge from large datasets into significantly smaller, synthetic datasets while retaining the ability to train models effectively. Unlike data selection methods (e.g., data pruning or coreset selection (Dasgupta et al., 2009)), which focus on choosing representative real samples, dataset distillation actively synthesizes new, compact samples that encapsulate the essential learning signal. The distilled dataset is often orders of magnitude smaller yet enables models to achieve comparable or even improved performance.

Formally, let D = {(x_i, y_i)}_{i=1}^{|D|} be a large dataset, and D_syn = {(x̃_i, ỹ_i)}_{i=1}^{n} be a synthetic dataset with n ≪ |D|. For a learning model Φ, let θ_D and θ_{D_syn} be the parameters learned from training on D and D_syn, respectively. Dataset distillation aims to make θ_D and θ_{D_syn} produce similar outcomes:

min_{D_syn}  E_{(x,y)} [ | ℓ(Φ_{θ_D}(x), y) − ℓ(Φ_{θ_{D_syn}}(x), y) | ].    (2)

An ε-approximate data summary satisfies:

E_{(x,y)} [ | ℓ(Φ_{θ_D}(x), y) − ℓ(Φ_{θ_{D_syn}}(x), y) | ] ≤ ε.    (3)
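As a simple sanity check on the criterion in Eqs. (2)-(3), the sketch below estimates the expected loss gap between a model trained on D and one trained on D_syn over an evaluation set; the function and argument names are illustrative assumptions.

```python
import torch

@torch.no_grad()
def epsilon_gap(model_full, model_syn, loss_fn, eval_loader):
    """Estimate the loss gap in Eqs. (2)-(3): the mean absolute difference
    between per-example losses of a model trained on the full dataset D and
    a model trained on the distilled dataset D_syn.
    loss_fn should return per-example losses (e.g., reduction="none").
    """
    total_gap, n_examples = 0.0, 0
    for x, y in eval_loader:
        l_full = loss_fn(model_full(x), y)   # per-example losses, full-data model
        l_syn = loss_fn(model_syn(x), y)     # per-example losses, distilled-data model
        total_gap += (l_full - l_syn).abs().sum().item()
        n_examples += y.shape[0]
    # The distilled set is an epsilon-approximate summary if this value <= epsilon
    return total_gap / max(n_examples, 1)
```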

Figure 3: Overview of Dataset Distillation in LLMs. A teacher LLM is trained on a massive original database. Through dataset distillation, a compact, high-quality subset (Distilled Database) is synthesized to preserve essential knowledge. This smaller dataset is then used to train a student LLM, aiming to achieve similar performance as the teacher while requiring significantly less data.

With the increasing scale of modern deep learning models, dataset distillation has gained attention for accelerating training (Ding et al., 2024), enabling continual learning (Deng and Russakovsky, 2022), and improving data efficiency in low-resource settings (Song et al., 2023). However, challenges remain, such as preserving sufficient diversity and robustness in the distilled data and avoiding overfitting (Cazenavette et al., 2023). In LLMs, dataset distillation is crucial for reducing computational overhead while maintaining the rich semantic diversity needed for effective language modeling. Table 1 outlines some commonly used datasets for data distillation.

Table 1: Common Datasets for Data Distillation.

Dataset            | Size        | Category | Related Works
SVHN               | 60K         | Image    | HaBa (Liu et al., 2022), FreD (Shin et al., 2023)
CIFAR-10           | 60K         | Image    | FreD (Shin et al., 2023), IDC (Kim et al., 2022b)
CIFAR-100          | 60K         | Image    | IDM (Zhao et al., 2023), DD (Wang et al., 2018)
TinyImageNet       | 100K        | Image    | CMI (Zhong et al., 2024), DD (Wang et al., 2018)
ImageNet-1K        | 14000K      | Image    | Teddy (Yu et al., 2024), TESLA (Cui et al., 2023)
TDBench            | 23 datasets | Table    | TDColER (Kang et al., 2025)
SpeechCommands     | 8K          | Audio    | IDC (Kim et al., 2022b)
AMC AIME STaR      | 4K          | Text     | Numinamath (Li et al., 2024)
R1-Distill-SFT     | 17K         | Text     | QWTHN (Kong et al., 2025)
OpenThoughts-114k  | 114K        | Text     | FreeEvalLM (Zhao et al., 2025)

2.2.2 Methods and Approaches

Dataset distillation has evolved through various methodological advancements, each aiming to compress large datasets into small yet highly informative subsets (Sachdeva and McAuley, 2023; Yin and Shen, 2024; Yu et al., 2023). From a high-level perspective, two main categories can be distinguished: optimization-based dataset distillation and synthetic data generation for dataset distillation.

Optimization-Based Methods. These methods distill datasets by aligning training dynamics (e.g., gradients, trajectories) or final model performance between synthetic and original data.

• Meta-Learning Optimization: A bi-level optimization approach that trains a model on a synthetic dataset and evaluates its performance on the original dataset. The synthetic samples are iteratively refined to match the model's performance when trained on the full dataset (Wang et al., 2018).

• Gradient Matching: Proposed by Zhao et al. (2020), it aligns the gradients from one-step updates on real vs. synthetic datasets. The objective is to make gradients consistent so that training on synthetic data closely resembles training on real data over short time spans (see the sketch after this list).

• Trajectory Matching: To address the limitation of single-step gradient matching, multi-step parameter matching (a.k.a. trajectory matching) was introduced by Cazenavette et al. (2022). It aligns the endpoints of training trajectories over multiple steps, making the distilled dataset more robust over longer training horizons.
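The sketch below illustrates a gradient-matching objective in the spirit of Zhao et al. (2020); it is an illustrative simplification rather than the reference implementation, and the cosine-distance matching criterion and variable names are assumptions.

```python
import torch

def gradient_matching_loss(model, loss_fn, real_x, real_y, syn_x, syn_y):
    """Single gradient-matching step (illustrative sketch).
    syn_x (and optionally syn_y) are trainable tensors representing the
    distilled data; minimizing this loss w.r.t. them pulls the synthetic
    batch's gradient toward the real batch's gradient."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradients induced by a real batch (treated as constants)
    g_real = torch.autograd.grad(loss_fn(model(real_x), real_y), params)

    # Gradients induced by the synthetic batch; keep the graph so the
    # matching loss can be backpropagated into the synthetic data itself
    g_syn = torch.autograd.grad(loss_fn(model(syn_x), syn_y), params,
                                create_graph=True)

    # Layer-wise cosine distance between the two gradient sets
    match = 0.0
    for gr, gs in zip(g_real, g_syn):
        gr, gs = gr.detach().flatten(), gs.flatten()
        match = match + 1.0 - torch.dot(gr, gs) / (gr.norm() * gs.norm() + 1e-8)
    return match
```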

Synthetic Data Generation. These methods directly create artificial data that approximates the distribution of the original dataset. The representative technique here is Distribution Matching (DM) (Zhao and Bilen, 2023), aiming to generate synthetic data whose empirical distribution aligns with the real dataset using metrics like Maximum Mean Discrepancy (MMD).
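For reference, a squared MMD estimator with an RBF kernel can be sketched as below; the kernel choice, bandwidth, and the use of encoder features as inputs are assumptions for illustration, not the exact criterion of any one surveyed method.

```python
import torch

def mmd_rbf(real_feats, syn_feats, sigma=1.0):
    """Squared Maximum Mean Discrepancy with an RBF kernel (illustrative sketch).

    real_feats: [n, d] features of real samples (e.g., encoder embeddings)
    syn_feats:  [m, d] features of synthetic samples being optimized
    sigma:      kernel bandwidth (assumed hyperparameter)
    """
    def rbf(a, b):
        sq_dist = torch.cdist(a, b) ** 2          # pairwise squared distances
        return torch.exp(-sq_dist / (2 * sigma ** 2))

    k_rr = rbf(real_feats, real_feats).mean()
    k_ss = rbf(syn_feats, syn_feats).mean()
    k_rs = rbf(real_feats, syn_feats).mean()
    return k_rr + k_ss - 2 * k_rs                 # minimized w.r.t. syn_feats
```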

2.2.3 Applications and Use Cases

Dataset distillation has wide-ranging applications where data redundancy must be reduced without sacrificing performance:

• LLM Fine-Tuning and Adaptation: By creating a distilled subset of domain-specific data, researchers can rapidly adapt large-scale LLMs to specialized tasks without incurring the full computational cost.

• Low-Resource Learning: In federated learning and edge AI scenarios (Wu et al., 2024; Qu et al., 2025), a distilled dataset reduces communication and computational overhead.

• Neural Architecture Search and Hyperparameter Tuning: Distilled datasets provide a proxy for evaluating model variants quickly, cutting down on expensive full-dataset training (Prabhakar et al., 2022).

• Privacy and Security: Distilled datasets can serve as privacy-preserving proxies for sensitive data (e.g., medical or financial records), reducing exposure of individual-level information (Yu et al., 2023).

• Continual Learning: By summarizing past tasks, distilled datasets help mitigate catastrophic forgetting when models learn incrementally over time (Binici et al., 2022).

3 Methodologies and Techniques for Knowledge Distillation in LLMs

In this section, we review methodologies for knowledge distillation (KD) in LLMs, including rationale-based approaches, uncertainty-aware techniques, multi-teacher frameworks, dynamic/adaptive strategies, and task-specific distillation. We also explore theoretical studies that uncover the foundational mechanisms driving KD's success in LLMs.

3.1 Rationale-Based KD

Rationale-Based Knowledge Distillation (RBKD) improves traditional knowledge distillation by allowing the student model to learn not only the teacher's final predictions but also the reasoning process behind them, known as Chain-of-Thought (CoT) reasoning. This makes the student model more interpretable and reduces the need for large amounts of labeled data (Hsieh et al., 2023). Instead of merely imitating the teacher's outputs, the student develops a deeper understanding of problem-solving, leading to better generalization and adaptability to new tasks.

Formally, given a dataset D = {(x_i, y_i, r_i)}_{i=1}^{N}, where r_i is the rationale generated by the teacher, the student is trained to jointly predict both the rationale and the final answer. This objective can be formulated as a joint negative log-likelihood over the answer and the rationale:

L(θ) = Σ_{i=1}^{N} [ −log P_θ(y_i | x_i) − log P_θ(r_i | x_i) ],

where P_θ denotes the student model parameterized by θ. This formulation encourages the student to internalize the reasoning path rather than shortcutting to the answer.
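As a concrete illustration of the joint objective above (not the authors' implementation), the sketch below computes the two likelihood terms for one example, assuming a Hugging Face-style seq2seq student whose forward pass returns the token-level cross-entropy in `.loss` when `labels` are supplied; the "answer:"/"rationale:" prompt prefixes and equal weighting of the two terms are assumptions.

```python
def rbkd_loss(student, tokenizer, x_text, y_text, r_text):
    """Joint answer + rationale objective for one example (illustrative sketch)."""
    def nll(prompt, target):
        enc = tokenizer(prompt, return_tensors="pt")
        labels = tokenizer(target, return_tensors="pt").input_ids
        # seq2seq LM loss = -log P_theta(target | prompt), averaged over target tokens
        return student(**enc, labels=labels).loss

    loss_answer = nll("answer: " + x_text, y_text)        # -log P_theta(y | x)
    loss_rationale = nll("rationale: " + x_text, r_text)  # -log P_theta(r | x)
    return loss_answer + loss_rationale
```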

By incorporating reasoning steps, RBKD enhances transparency, making models more reliable in fields like healthcare and law, where understanding decisions is crucial. It also improves efficiency, as smaller models can achieve strong performance without requiring extensive computational resources. This makes RBKD a practical approach that balances accuracy, interpretability, and resource efficiency (Chu et al., 2023).

One promising direction is Keypoint-based Progressive CoT Distillation (KPOD), which addresses both token significance and the order of learning (Feng et al., 2024). In KPOD, the difficulty of each reasoning step is quantified using a weighted token generation loss:

d_k^(i) = Σ_{j=p_k}^{q_k} ω_j^(i) · ( −log P_θ(r_j^(i) | x_i, r_{<j}^(i)) ),

where d_k^(i) is the difficulty score of the k-th step in the i-th rationale, p_k and q_k denote the start and end positions of that step, and ω_j^(i) represents the normalized importance weight for token j. This difficulty-aware design allows the student model to acquire reasoning skills progressively, from simpler to more complex steps, resulting in stronger generalization and interpretability.
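A minimal sketch of this weighted sum, computing the per-step difficulty scores from precomputed token losses and importance weights, is given below; the function and variable names are assumptions for illustration.

```python
import torch

def step_difficulty_scores(token_nll, importance, step_spans):
    """Per-step difficulty d_k as an importance-weighted sum of token
    generation losses (illustrative sketch).

    token_nll:  [T] per-token negative log-likelihoods of one rationale
    importance: [T] normalized importance weights omega_j
    step_spans: list of (p_k, q_k) inclusive start/end token indices per step
    """
    return torch.stack([
        (importance[p:q + 1] * token_nll[p:q + 1]).sum()
        for p, q in step_spans
    ])
```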

Challenges and Future Directions. The growing focus on distilling richer knowledge from teacher LLMs, such as reasoning and other higher-order capabilities, raises important questions about which knowledge to extract and how to extract it effectively. Since many teacher LLMs are closed-source, capturing their advanced capabilities is more challenging than merely collecting hard-label predictions. This highlights the need for further research into diverse knowledge extraction approaches from teacher LLMs, aiming to enhance t
