数据库原理与应用课程组课件

PPT
  • 阅读 115 次
  • 下载 0 次
  • 页数 79 页
  • 大小 731.500 KB
  • 2022-12-05 上传
  • 收藏
  • 违规举报
  • © 版权认领
下载文档30.00 元 加入VIP免费下载
此文档由【小橙橙】提供上传,收益归文档提供者,本网站只提供存储服务。若此文档侵犯了您的版权,欢迎进行违规举报版权认领
数据库原理与应用课程组课件
可在后台配置第一页与第二页中间广告代码
数据库原理与应用课程组课件
可在后台配置第二页与第三页中间广告代码
数据库原理与应用课程组课件
可在后台配置第三页与第四页中间广告代码
数据库原理与应用课程组课件
数据库原理与应用课程组课件
还剩10页未读,继续阅读
【这是免费文档,您可以免费阅读】
/ 79
  • 收藏
  • 违规举报
  • © 版权认领
下载文档30.00 元 加入VIP免费下载
文本内容

【文档说明】数据库原理与应用课程组课件.ppt,共(79)页,731.500 KB,由小橙橙上传

转载请保留链接:https://www.ichengzhen.cn/view-92522.html

以下为本文档部分文字说明:

Copyright:Silberschatz,KorthandSudarshan1Chapter22:AdvancedQueryingandInformationRetrieval陕西师范大学计算机科学学院数据库原理与应用课程组C

opyright:Silberschatz,KorthandSudarshan2Chapter22:AdvancedQueryingandInformationRetrievalDecision-SupportSystemsDataAnalysisOLAPExtendedag

gregationfeaturesinSQLWindowingandrankingDataMiningDataWarehousingInformation-RetrievalSystemsIncludingWebsearchCopyright:Silberschatz,Kor

thandSudarshan3DecisionSupportSystemsDecision-supportsystemsareusedtomakebusinessdecisionsoftenbasedondatacollectedbyon-linetransaction-proce

ssingsystems.Examplesofbusinessdecisions:whatitemstostock?Whatinsurancepremiumtochange?Whotosendadvertisementsto?Example

sofdatausedformakingdecisionsRetailsalestransactiondetailsCustomerprofiles(income,age,sex,etc.)Copyright:Silberschatz,Kor

thandSudarshan4Decision-SupportSystems:OverviewDataanalysistasksaresimplifiedbyspecializedtoolsandSQLextensionsExamplet

asksForeachproductcategoryandeachregion,whatwerethetotalsalesinthelastquarterandhowdotheycomparewiththesamequarterlastyearAsa

bove,foreachproductcategoryandeachcustomercategoryStatisticalanalysispackages(e.g.,:S++)canbeinterfacedwithdatabas

esStatisticalanalysisisalargefieldwillnotstudyithereDataminingseekstodiscoverknowledgeautomaticallyintheformof

statisticalrulesandpatternsfromLargedatabases.Adatawarehousearchivesinformationgatheredfrommultiplesources,an

dstoresitunderaunifiedschema,atasinglesite.Importantforlargebusinesseswhichgeneratedatafrommultipledivisions,possiblyatmult

iplesitesDatamayalsobepurchasedexternallyCopyright:Silberschatz,KorthandSudarshan5DataAnalysisandOLAPAggregatefunctionssummarizelarg

evolumesofdataOnlineAnalyticalProcessing(OLAP)Interactiveanalysisofdata,allowingdatatobesummarizedandviewedindifferentwaysinanonlinefashion(withneg

ligibledelay)Datathatcanbemodeledasdimensionattributesandmeasureattributesarecalledmultidimensionaldata.Givenarelationusedfordata

analysis,wecanidentifysomeofitsattributesasmeasureattributes,sincetheymeasuresomevalue,andcanbeaggregatedupon.Forinstance,theattributenumberofthesa

lesrelationisameasureattribute,sinceitmeasuresthenumberofunitssold.Someoftheotherattributesoftherelationareidentifiedasdimensionattribut

es,sincetheydefinethedimensionsonwhichmeasureattributes,andsummariesofmeasureattributes,areviewed.Copyright:Silberschatz,KorthandSudarshan6

CrossTabulationofsalesbyitem-nameandcolorThetableaboveisanexampleofacross-tabulation(cross-tab),alsoreferredtoasapivot-table.Across-t

abisatablewherevaluesforoneofthedimensionattributesformtherowheaders,valuesforanotherdimensionattributeformthecolumnheadersOtherdimensi

onattributesarelistedontopValuesinindividualcellsare(aggregatesof)thevaluesofthedimensionattributesthatspecifythec

ell.Copyright:Silberschatz,KorthandSudarshan7RelationalRepresentationofCrosstabsCrosstabscanberepresentedasrelat

ionsThevalueallisusedtorepresentaggregatesTheSQL:1999standardactuallyusesnullvaluesinplaceofallMoreonthislater….Copyright:Silbersc

hatz,KorthandSudarshan8Three-DimensionalDataCubeAdatacubeisamultidimensionalgeneralizationofacrosstabCannotviewathree-dimensionalobjectinitsentiret

ybutcrosstabscanbeusedasviewsonadatacubeCopyright:Silberschatz,KorthandSudarshan9OnlineAnalyticalProcessingTheoperationofc

hangingthedimensionsusedinacross-tabiscalledpivotingSupposeananalystwishestoseeacross-tabonitem-name

andcolorforafixedvalueofsize,forexample,large,insteadofthesumacrossallsizes.Suchanoperationisreferredtoasslicing.Theoperationissometimescalleddicin

g,particularlywhenvaluesformultipledimensionsarefixed.Theoperationofmovingfromfiner-granularitydatatoacoarsergranularity

iscalledarollup.Theoppositeoperation-thatofmovingfromcoarser-granularitydatatofiner-granularitydata–iscalledadrilldown.Copyright:Silbe

rschatz,KorthandSudarshan10HierarchiesonDimensionsHierarchyondimensionattributes:letsdimensionstobeviewedatdifferentlevelsofdetailE.g.t

hedimensionDateTimecanbeusedtoaggregatebyhourofday,date,dayofweek,month,quarteroryearCopyright:Silberschat

z,KorthandSudarshan11CrossTabulationWithHierarchyCrosstabscanbeeasilyextendedtodealwithhierarchiesCandrilldownorro

lluponahierarchyCopyright:Silberschatz,KorthandSudarshan12OLAPImplementationTheearliestOLAPsystemsusedm

ultidimensionalarraysinmemorytostoredatacubes,andarereferredtoasmultidimensionalOLAP(MOLAP)systems.OLAPimplementationsusingonlyrelationaldatabasefe

aturesarecalledrelationalOLAP(ROLAP)systemsHybridsystems,whichstoresomesummariesinmemoryandstorethebasedataandothersummariesin

arelationaldatabase,arecalledhybridOLAP(HOLAP)systems.Copyright:Silberschatz,KorthandSudarshan13OLAPImplementation(Cont.)EarlyOLAPs

ystemsprecomputedallpossibleaggregatesinordertoprovideonlineresponseSpaceandtimerequirementsfordoingsocanbeveryhigh2ncombinationsofgrou

pbyItsufficestoprecomputesomeaggregates,andcomputeothersondemandfromoneoftheprecomputedaggregatesCancomputeaggregateon(item-name,color)fro

managgregateon(item-name,color,size)Forallbutafew“non-decomposable”aggregatessuchasmedianischeaperthancomputingitfromscratchSeveraloptimizatio

nsavailableforcomputingmultipleaggregatesCancomputeaggregateon(item-name,color)fromanaggregateon(item-name,color,size)Cancompute

aggregateson(item-name,color,size),(item-name,color)and(item-name)usingasinglesortingofthebasedataCopyright:Sil

berschatz,KorthandSudarshan14ExtendedAggregationSQL-92aggregationquitelimitedManyusefulaggregatesareeith

erveryhardorimpossibletospecifyDatacubeComplexaggregates(median,variance)binaryaggregates(correlation,

regressioncurves)rankingqueries(“assigneachstudentarankbasedonthetotalmarks”SQL:1999OLAPextensionsprovideavarietyo

faggregationfunctionstoaddressabovelimitationsSupportedbyseveraldatabases,includingOracleandIBMDB2Copyright:Silberschatz,KorthandSud

arshan15ExtendedAggregationinSQL:1999Thecubeoperationcomputesunionofgroupby‟soneverysubsetofthespecifiedattributesE.g.considerthequeryse

lectitem-name,color,size,sum(number)fromsalesgroupbycube(item-name,color,size)Thiscomputestheunionofeightdiffer

entgroupingsofthesalesrelation:{(item-name,color,size),(item-name,color),(item-name,size),(color,size),(item-name)

,(color),(size),()}where()denotesanemptygroupbylist.Foreachgrouping,theresultcontainsthenullvalueforat

tributesnotpresentinthegrouping.Copyright:Silberschatz,KorthandSudarshan16ExtendedAggregation(Cont.)Relationalrepresentationofcrosstabthatwesawearl

ier,butwithnullinplaceofall,canbecomputedbyselectitem-name,color,sum(number)fromsalesgroupbycube(item-name,color)Thefunction

grouping()canbeappliedonanattributeReturns1ifthevalueisanullvaluerepresentingall,andreturns0inallotherc

ases.selectitem-name,color,size,sum(number),grouping(item-name)asitem-name-flag,grouping(color)ascolor-flag,grou

ping(size)assize-flag,fromsalesgroupbycube(item-name,color,size)Canusethefunctiondecode()intheselectclauset

oreplacesuchnullsbyavaluesuchasallE.g.replaceitem-nameinfirstquerybydecode(grouping(item-name),1,„all‟,item-na

me)Copyright:Silberschatz,KorthandSudarshan17ExtendedAggregation(Cont.)Therollupconstructgeneratesuniononeveryprefixofspecifiedlistofattribute

sE.g.selectitem-name,color,size,sum(number)fromsalesgroupbyrollup(item-name,color,size)Generatesunio

noffourgroupings:{(item-name,color,size),(item-name,color),(item-name),()}Rollupcanbeusedtogenerateaggregatesatmultiplelevelsofahierarchy.E.g.,su

pposetableitemcategory(item-name,category)givesthecategoryofeachitem.Thenselectcategory,item-name,sum(

number)fromsales,itemcategorywheresales.item-name=itemcategory.item-namegroupbyrollup(category,item-name)wouldgiveahiera

rchicalsummarybyitem-nameandbycategory.Copyright:Silberschatz,KorthandSudarshan18ExtendedAggregation(Co

nt.)MultiplerollupsandcubescanbeusedinasinglegroupbyclauseEachgeneratessetofgroupbylists,crossproductofsetsgivesoverall

setofgroupbylistsE.g.,selectitem-name,color,size,sum(number)fromsalesgroupbyrollup(item-name),rollup(color,size)generatesthegroupings{item-name,()

}X{(color,size),(color),()}={(item-name,color,size),(item-name,color),(item-name),(color,size),(color),()}Copyright:Silbersc

hatz,KorthandSudarshan19RankingRankingisdoneinconjunctionwithanorderbyspecification.Givenarelationstudent-marks(student-id,marks)find

therankofeachstudent.selectstudent-id,rank()over(orderbymarksdesc)ass-rankfromstudent-marksAnextraorderbyclauseisneededtogettheminsortedor

derselectstudent-id,rank()over(orderbymarksdesc)ass-rankfromstudent-marksorderbys-rankRankingmayleave

gaps:e.g.if2studentshavethesametopmark,bothhaverank1,andthenextrankis3dense_rankdoesnotleavegaps,sonextdenserankwouldbe2Copy

right:Silberschatz,KorthandSudarshan20Ranking(Cont.)Rankingcanbedonewithinpartitionofthedata.“Findtherankofstudentsw

ithineachsection.”selectstudent-id,section,rank()over(partitionbysectionorderbymarksdesc)assec-rankfromstudent-marks,student-s

ectionwherestudent-marks.student-id=student-section.student-idorderbysection,sec-rankMultiplerankclausescanoccurinasinglesel

ectclauseRankingisdoneafterapplyinggroupbyclause/aggregationExercises:FindstudentswithtopnranksManysystemsp

rovidespecial(non-standard)syntaxfor“top-n”queriesRankstudentsbysumoftheirmarksindifferentcoursesgivenrelations

tudent-course-marks(student-id,course,marks)Copyright:Silberschatz,KorthandSudarshan21Ranking(Cont.)Otherrankingfunctions:percent_rank(

withinpartition,ifpartitioningisdone)cume_dist(cumulativedistribution)fractionoftupleswithprecedingvaluesrow_number(non-deterministic

inpresenceofduplicates)SQL:1999permitstheusertospecifynullsfirstornullslastselectstudent-id,rank()over(orderbymarksdescnullsla

st)ass-rankfromstudent-marksCopyright:Silberschatz,KorthandSudarshan22Ranking(Cont.)Foragivenconstantn,therankingthefunctio

nntile(n)takesthetuplesineachpartitioninthespecifiedorder,anddividesthemintonbucketswithqualnumbersoftuples.Forinstance,weansortem

ployeesbysalary,andusentile(3)tofindwhichrange(bottomthird,middlethird,ortopthird)eachemployeeisin,andcomputethetotalsalaryearnedbyemployeesineachran

ge:selectthreetile,sum(salary)from(selectsalary,ntile(3)over(orderbysalary)asthreetilefromemployee)assgroupbyt

hreetileCopyright:Silberschatz,KorthandSudarshan23WindowingE.g.:“Givensalesvaluesforeachdate,calculateforeachdat

etheaverageofthesalesonthatday,thepreviousday,andthenextday”Suchmovingaveragequeriesareusedtosmoothoutrandomvariations.Incontrasttogroupby,thes

ametuplecanexistinmultiplewindowsWindowspecificationinSQL:Orderingoftuples,sizeofwindowforeachtuple,aggregatefunctionE.g.gi

venrelationsales(date,value)selectdate,sum(value)over(orderbydatebetweenrows1precedingand1following)fromsalesExamplesofother

windowspecifications:betweenrowsunboundedprecedingandcurrentrowsunboundedprecedingrangebetween10precedingandcurrentrowAllrowswithvaluesbetweencur

rentrowvalue–10tocurrentvaluerangeinterval10dayprecedingNotincludingcurrentrowCopyright:Silberschatz,KorthandSudarshan24Windowing

(Cont.)CandowindowingwithinpartitionsE.g.Givenarelationtransaction(account-number,date-time,value),wherevalueispositiveforadepositandnegativeforaw

ithdrawal“Findtotalbalanceofeachaccountaftereachtransactionontheaccount”selectaccount-number,date-ti

me,sum(value)over(partitionbyaccount-numberorderbydate-timerowsunboundedpreceding)asbalancefromtransactionorderbyaccount-numbe

r,date-timeCopyright:Silberschatz,KorthandSudarshan25DataMiningCopyright:Silberschatz,KorthandSudarshan26DataMini

ngBroadlyspeaking,dataminingistheprocessofsemi-automaticallyanalyzinglargedatabasestofindusefulpatternsLikeknowledgediscoveryinartificiali

ntelligencedataminingdiscoversstatisticalrulesandpatternsDiffersfrommachinelearninginthatitdealswithlarg

evolumesofdatastoredprimarilyondisk.Sometypesofknowledgediscoveredfromadatabasecanberepresentedbyasetofr

ules.e.g.,:“Youngwomenwithannualincomesgreaterthan$50,000aremostlikelytobuysportscars”Othertypesofknowledgerepresentedbye

quations,orbypredictionfunctionsSomemanualinterventionisusuallyrequiredPre-processingofdata,choiceofwhichtypeofpatterntofind,postprocessingtofin

dnovelpatternsCopyright:Silberschatz,KorthandSudarshan27ApplicationsofDataMiningPredictionbasedonpasthistoryPredictifacreditcardapplican

tposesagoodcreditrisk,basedonsomeattributes(income,jobtype,age,..)andpasthistoryPredictifacustomerislikelytoswitch

brandloyaltyPredictifacustomerislikelytorespondto“junkmail”PredictifapatternofphonecallingcardusageislikelytobefraudulentSom

eexamplesofpredictionmechanisms:ClassificationGivenatrainingsetconsistingofitemsbelongingtodifferentclasses,andanewitemwhoseclassisunknown,predict

whichclassitbelongstoRegressionformulaegivenasetofparameter-valuetofunction-resultmappingsforanunknownfunction,pred

ictthefunction-resultforanewparameter-valueCopyright:Silberschatz,KorthandSudarshan28ApplicationsofData

Mining(Cont.)DescriptivePatternsAssociationsFindbooksthatareoftenboughtbythesamecustomers.Ifanewcustomerbuysonesuchbook,s

uggestthathebuystheotherstoo.Othersimilarapplications:cameraaccessories,clothes,etc.Associationsmayalsobeusedasafirststepindetecting

causationE.g.associationbetweenexposuretochemicalXandcancer,ornewmedicineandcardiacproblemsClustersE.g.typhoidcaseswereclusteredinanareasurroun

dingacontaminatedwellDetectionofclustersremainsimportantindetectingepidemicsCopyright:Silberschatz,KorthandSudarshan29ClassificationRu

lesClassificationruleshelpassignnewobjectstoasetofclasses.E.g.,givenanewautomobileinsuranceapplicant,shouldheorshebeclassif

iedaslowrisk,mediumriskorhighrisk?Classificationrulesforaboveexamplecoulduseavarietyofknowledge,suchaseducationalle

velofapplicant,salaryofapplicant,ageofapplicant,etc.personP,P.degree=mastersandP.income>75,000P.credi

t=excellentpersonP,P.degree=bachelorsand(P.income25,000andP.income75,000)P.credit=goodRulesarenotnecessaril

yexact:theremaybesomemisclassificationsClassificationrulescanbecompactlyshownasadecisiontree.Copyright:Silberschatz,KorthandSudarsh

an30DecisionTreeCopyright:Silberschatz,KorthandSudarshan31ConstructionofDecisionTreesTrainingset:adatasampleinwhichthegroupingforeac

htupleisalreadyknown.Considercreditriskexample:Supposedegreeischosentopartitionthedataattheroot.Sincedegreehasasmallnumberofpo

ssiblevalues,onechildiscreatedforeachvalue.Ateachchildnodeoftheroot,furtherclassificationisdoneifrequired.Here

,partitionsaredefinedbyincome.Sinceincomeisacontinuousattribute,somenumberofintervalsarechosen,andonechildcreatedforeachinterval.

Differentclassificationalgorithmsusedifferentwaysofchoosingwhichattributetopartitiononateachnode,andwha

ttheintervals,ifany,are.IngeneralDifferentbranchesofthetreecouldgrowtodifferentlevels.Differentno

desatthesamelevelmayusedifferentpartitioningattributes.Copyright:Silberschatz,KorthandSudarshan32ConstructionofDecisionTrees(Cont.)Greed

ytopdowngenerationofdecisiontrees.Eachinternalnodeofthetreepartitionsthedataintogroupsbasedonapartitioningattribute,andapartitioningconditionforthen

odeMoreonchoosingpartioningattribute/conditionshortlyAlgorithmisgreedy:thechoiceismadeonceandnotrevisitedasmoreofthetreeisconstru

ctedThedataatanodeisnotpartitionedfurtherifeitherall(ormost)oftheitemsatthenodebelongtothesameclass,orallattributeshavebeenc

onsidered,andnofurtherpartitioningispossible.Suchanodeisaleafnode.Otherwisethedataatthenodeispartitionedfurtherbypickinganattributeforpartitioning

dataatthenode.Copyright:Silberschatz,KorthandSudarshan33BestSplitsIdea:evaluatedifferentattributesandpartitioningconditionsandpictheoneth

atbestimprovesthe“purity”ofthetrainingsetexamplesTheinitialtrainingsethasamixtureofinstancesfromdiffere

ntclassesandisthusrelativelyimpureE.g.ifdegreeexactlypredictscreditrisk,partitioningondegreewouldresultineachildhavinginstancesofonlyonecla

ssI.e.,thechildnodeswouldbepureThepurityofasetSoftraininginstancescanbemeasuredquantitativelyinseveralways.

Notation:numberofclasses=k,numberofinstances=|S|,fractionofinstancesinclassi=pi.TheGinimeasureofpurityisdefinedas[Gini(S)=1-Whenalli

nstancesareinasingleclass,theGinivalueis0,whileitreachesitsmaximum(of1–1/k)ifeachclassthesamenumberofinstances.ki-1p2iCopyright:Silberschatz,

KorthandSudarshan34BestSplits(Cont.)Anothermeasureofpurityistheentropymeasure,whichisdefinedasentropy(S)=–WhenasetSisspl

itintomultiplesetsSi,I=1,2,…,r,wecanmeasurethepurityoftheresultantsetofsetsas:purity(S1,S2,…..,Sr)=Theinfo

rmationgainduetoparticularsplitofSintoSi,i=1,2,….,rInformation-gain(S,{S1,S2,….,Sr)=purity(S)–purity(S1,S2,…Sr)ri=1|Si||S|purity(Si)ki-1pilog2pi

Copyright:Silberschatz,KorthandSudarshan35BestSplits(Cont.)Measureof“cost”ofasplit:Information-content(S,{S1,S2,…..,Sr}))=–Information-gain

ratio=Information-gain(S,{S1,S2,……,Sr})Information-content(S,{S1,S2,…..,Sr})Thebestsplitforanattribut

eistheonethatgivesthemaximuminformationgainratioContinuousvaluedattributesCanbeorderedinafashionmeaningfultoclassificatione.g.integerandrealv

aluesCategoricalattributesCannotbemeaningfullyordered(e.g.country,school/university,item-color,.):log2ri-1|Si||S||Si||S|Copyright:Silberschatz,Ko

rthandSudarshan36FindingBestSplitsCategoricalattributes:Multi-waysplit,onechildforeachvaluemayhavetoomanychildreninsomecasesBinarysplit:try

allpossiblebreakupofvaluesintotwosets,andpickthebestContinuousvaluedattributeBinarysplit:Sortvaluesintheinstances,tryeachasaspli

tpointE.g.ifvaluesare1,10,15,25,splitat1,10,15PickthevaluethatgivesbestsplitMulti-waysplit:morecomplicated,seebi

bliographicnotesAseriesofbinarysplitsonthesameattributehasroughlyequivalenteffectCopyright:Silberschatz,KorthandSuda

rshan37Decision-TreeConstructionAlgorithmProcedureGrow.Tree(S)Partition(S);ProcedurePartition(S)if(purity(S)>por|S|<s)thenreturn;foreachattribu

teAevaluatesplitsonattributeA;Usebestsplitfound(acrossallattributes)topartitionSintoS1,S2,….,Sr,fori=1,2,…..,rPar

tition(Si);Copyright:Silberschatz,KorthandSudarshan38DecisionTreeConstr’nAlgo’s.(Cont.)Varietyofalgorithmshavebeendevelope

dtoReduceCPUcostand/orReduceIOcostwhenhandlingdatasetslargerthanmemoryImproveaccuracyofclassificationDecisiontreemaybeoverfitted,i.e.,over

lytunedtogiventrainingsetPruningofdecisiontreemaybedoneonbranchesthathavetoofewtraininginstancesWhenasubtreeis

pruned,aninternalnodebecomesaleafanditsclassissettothemajorityclassoftheinstancesthatmaptothenodePruningcanbedonebyusingapartof

thetrainingsettobuildtree,andasecondparttotestthetreeprunesubtreesthatincreasemisclassificationonsecondpartCopyright:Silberschatz,Korthan

dSudarshan39OtherTypesofClassifiersFurthertypesofclassifiersNeuralnetclassifiersBayesianclassifiersNeura

lnetclassifiersusethetrainingdatatotrainartificialneuralnetsWidelystudiedinAI,won‟tcoverhereBayesianclassifiersuseBayestheorem,

whichsaysp(cj|d)=p(d|cj)p(cj)p(d)wherep(cj|d)=probabilityofinstancedbeinginclasscj,p(d|cj)=probabilityofgeneratinginstancedgiven

classcj,p(cj)=probabilityofoccurrenceofclasscj,andp(d)=probabilityofinstancedoccuringCopyright:Silberschatz,KorthandSuda

rshan40NaïveBayesianClassifiersBayesianclassifiersrequirecomputationofp(d|cj)precomputationofp(cj)

p(d)canbeignoredsinceitisthesameforallclassesTosimplifythetask,naïveBayesianclassifiersassumeattributeshaveindependentdistributions,a

ndtherebyestimatep(d|cj)=p(d1|cj)*p(d2|cj)*….*(p(dn|cj)Eachofthep(di|cj)canbeestimatedfromahistogramondivaluesforeachclasscjthehistogramisc

omputedfromthetraininginstancesHistogramsonmultipleattributesaremoreexpensivetocomputeandstoreCopyright:Silberschatz,KorthandSudarshan41Regression

Regressiondealswiththepredictionofavalue,ratherthanaclass.Givenvaluesforasetofvariables,X1,X2,…,Xn,wewishtopredictthevalueofavariable

Y.Onewayistoinfercoefficientsa0,a1,a1,…,ansuchthatY=a0+a1*X1+a2*X2+…+an*XnFindingsuchalinearpolynomialiscalledlinearregression.Ingeneral

,theprocessoffindingacurvethatfitsthedataisalsocalledcurvefitting.Thefitmayonlybeapproximatebecauseofno

iseinthedata,orbecausetherelationshipisnotexactlyapolynomialRegressionaimstofindcoefficientsthatgivethebestpossiblefit.Copyright:S

ilberschatz,KorthandSudarshan42AssociationRulesRetailshopsareofteninterestedinassociationsbetweendifferentitem

sthatpeoplebuy.SomeonewhobuysbreadisquitelikelyalsotobuymilkApersonwhoboughtthebookDatabaseSystemConceptsisquitelikelyals

otobuythebookOperatingSystemConcepts.Associationsinformationcanbeusedinseveralways.E.g.whenacustomerbuysaparticularbo

ok,anonlineshopmaysuggestassociatedbooks.Associationrules:breadmilkDB-Concepts,OS-ConceptsNetworksLefthandside:antecedent,righthandside:cons

equentAnassociationrulemusthaveanassociatedpopulation;thepopulationconsistsofasetofinstancesE.g.eachtransaction(sale)atashopi

saninstance,andthesetofalltransactionsisthepopulationCopyright:Silberschatz,KorthandSudarshan43AssociationRules(Cont.

)Ruleshaveanassociatedsupport,aswellasanassociatedconfidence.Supportisameasureofwhatfractionofthepopulationsatisfiesbothth

eantecedentandtheconsequentoftherule.E.g.supposeonly0.001percentofallpurchasesincludemilkandscrewdrivers.Thesupport

fortheruleismilkscrewdriversislow.WeusuallywantruleswithareasonablyhighsupportRuleswithlowsupportareusuallynotveryus

efulConfidenceisameasureofhowoftentheconsequentistruewhentheantecedentistrue.E.g.therulebreadmilkhasaconfidenceof80percentif80

percentofthepurchasesthatincludebreadalsoincludemilk.Usuallywantruleswithreasonablylargeconfidence.Arulewithalowconfidenceisnotmeaningfu

l.Notethattheconfidenceofbreadmilkmaybeverydifferentfromtheconfidenceofmilkbread,althoughbothhavethesamesupports.Copyright:Silbers

chatz,KorthandSudarshan44FindingAssociationRulesWearegenerallyonlyinterestedinassociationruleswithreasonablyhighsupport(e

.g.supportof2%orgreater)Naïvealgorithm1.Considerallpossiblesetsofrelevantitems.2.Foreachsetfinditssupport(i.e.c

ounthowmanytransactionspurchaseallitemsintheset).Largeitemsets:setswithsufficientlyhighsupport3.Uselargei

temsetstogenerateassociationrules.1.FromitemsetAgeneratetheruleA-{b}bforeachbA.Supportofrule=support(A).Confidenceof

rule=support(A)/support(A-{b})Copyright:Silberschatz,KorthandSudarshan45FindingSupportFewitemsets:determinesupportofallitemsetsviaasinglepas

sonsetoftransactionsAcountismaintainedforeachitemset,initiallysetto0.Whenatransactionisfetched,thecountisincr

ementedforeachsetofitemsthatiscontainedinthetransaction.Largeitemsets:setswithahighcountattheendofthepassManyitemsets:Ifmemo

rynotenoughtoholdallcountsforallitemsetsusemultiplepasses,consideringonlysomeitemsetsineachpass.Optimization:Onceanitemse

tiseliminatedbecauseitscount(support)istoosmallnoneofitssupersetsneedstobeconsidered.Theaprioritechniquetofindlargeitemsets:Pas

s1:countsupportofallsetswithjust1item.EliminatethoseitemswithlowsupportPassi:candidates:everysetofiitemssucht

hatallitsi-1itemsubsetsarelargeCountsupportofallcandidatesStopiftherearenocandidatesCopyright:Silbersc

hatz,KorthandSudarshan46OtherTypesofAssociationsBasicassociationruleshaveseverallimitationsDeviationsfromthee

xpectedprobabilityaremoreinterestingE.g.ifmanypeoplepurchasebread,andmanypeoplepurchasecereal,quiteafewwouldbeexpec

tedtopurchaseboth(prob1*prob2)WeareinterestedinpositiveaswellasnegativecorrelationsbetweensetsofitemsPositivecorrelation:co-occurrenceishigher

thanpredictedNegativecorrelation:co-occurrenceislowerthanpredictedSequenceassociations/correlationsE.g.wheneverbondsgoup,sto

ckpricesgodownin2daysDeviationsfromtemporalpatternsE.g.deviationfromasteadygrowthE.g.salesofwinterweargodowninsummerNotsurprising,partofaknow

npattern.LookfordeviationfromvaluepredictedusingpastpatternsCopyright:Silberschatz,KorthandSudarshan47Clustering

Clustering:Intuitively,findingclustersofpointsinthegivendatasuchthatsimilarpointslieinthesameclusterCanbeformalizedusingdistancemetri

csinseveralwaysE.g.Grouppointsintoksets(foragivenk)suchthattheaveragedistanceofpointsfromthecentroidoftheirassignedgroupisminimizedCentroid:

pointdefinedbytakingaverageofcoordinatesineachdimension.Anothermetric:minimizeaveragedistancebetweeneverypairofpoints

inaclusterHasbeenstudiedextensivelyinstatistics,butonsmalldatasetsDataminingsystemsaimatclusteringtechniquesthatcanhandleverylargeda

tasetsE.g.theBirchclusteringalgorithm(moreshortly)Copyright:Silberschatz,KorthandSudarshan48Hierarch

icalClusteringExamplefrombiologicalclassification(thewordclassificationheredoesnotmeanapredictionmechanis

m)chordatamammaliareptilialeopardshumanssnakescrocodilesOtherexamples:Internetdirectorysystems(e.g.Yahoo,moreonthislater)Agglomerativeclusteringa

lgorithmsBuildsmallclusters,thenclustersmallclustersintobiggerclusters,andsoonDivisiveclusteringalgorithmsStartwithallitemsinasinglecluster,

repeatedlyrefine(break)clustersintosmalleronesCopyright:Silberschatz,KorthandSudarshan49ClusteringAlgorithmsClusteringalgorithmshavebe

endesignedtohandleverylargedatasetsE.g.theBirchalgorithmMainidea:useanin-memoryR-treetostorepointsthatarebeingclustere

dInsertpointsoneatatimeintotheR-tree,merginganewpointwithanexistingclusterifislessthansomedistanceawayIftherearemoreleafnodesthanfitin

memory,mergeexistingclustersthatareclosetoeachotherAttheendoffirstpasswegetalargenumberofclustersattheleavesoftheR-treeMergeclusterstoreducethenu

mberofclustersCopyright:Silberschatz,KorthandSudarshan50CollaborativeFilteringGoal:predictwhatmovies/books/…apersonmaybeintere

stedin,onthebasisofPastpreferencesofthepersonOtherpeoplewithsimilarpastpreferencesThepreferencesofsuchpeopleforanewmovie/book/…Oneapp

roachbasedonrepeatedclusteringClusterpeopleonthebasisofpreferencesformoviesThenclustermoviesonthebasisofbeingliked

bythesameclustersofpeopleAgainclusterpeoplebasedontheirpreferencesfor(thenewlycreatedclustersof)moviesRepeatabovetillequi

libriumAboveproblemisaninstanceofcollaborativefiltering,whereuserscollaborateinthetaskoffilteringinformationtofindinformationofinterestCopyright:Sil

berschatz,KorthandSudarshan51OtherTypesofMiningTextmining:applicationofdataminingtotextualdocumentsE.g.clusterWebpagestofindrelatedpagesE.g.cluste

rpagesauserhasvisitedtoorganizetheirvisithistoryE.g.classifyWebpagesautomaticallyintoaWebdirectoryDat

avisualizationsystemshelpusersexaminelargevolumesofdataanddetectpatternsvisuallyE.g.maps,charts,andcolor-codingE.g.locationswithprob

lemsshowninredonamapCanvisuallyencodelargeamountsofinformationonasinglescreenHumansareverygoodadetectingvisualpatternsCopyright:Si

lberschatz,KorthandSudarshan52DataWarehousingCopyright:Silberschatz,KorthandSudarshan53DataWarehousingLargeorganizationshavecomplexinternalorganiza

tions,andhavedatastoredatdifferentlocations,ondifferentoperational(transactionprocessing)systems,underdifferentschemasDatasource

softenstoreonlycurrentdata,nothistoricaldataCorporatedecisionmakingrequiresaunifiedviewofallorganizationaldata,includinghi

storicaldataAdatawarehouseisarepository(archive)ofinformationgatheredfrommultiplesources,storedunderau

nifiedschema,atasinglesiteGreatlysimplifiesquerying,permitsstudyofhistoricaltrendsShiftsdecisionsupportqueryloadawayfromtrans

actionprocessingsystemsCopyright:Silberschatz,KorthandSudarshan54DataWarehousingCopyright:Silberschatz,Kort

handSudarshan55ComponentsofDataWarehouseWhenandhowtogatherdataSourcedrivenarchitecture:datasourcestra

nsmitnewinformationtowarehouse,eithercontinuouslyorperiodically(e.g.atnight)Destinationdrivenarchitec

ture:warehouseperiodicallyrequestsnewinformationfromdatasourcesKeepingwarehouseexactlysynchronizedwithdatasources(e.g.usingt

wo-phasecommit)istooexpensiveUsuallyOKtohaveslightlyout-of-datedataatwarehouseData/updatesareperiodicallydownloadedformonlinetransactionprocessin

g(OLTP)systems.WhatschematouseSchemaintegrationCopyright:Silberschatz,KorthandSudarshan56ComponentsofDataWarehouse(Cont.)DatacleansingE.g.correct

mistakesinaddressesE.g.misspellings,zipcodeerrorsMergeaddresslistsfromdifferentsourcesandpurgeduplicate

sKeeponlyoneaddressrecordperhousehold(“householding”)HowtopropagateupdatesWarehouseschemamaybea(materializ

ed)viewofschemafromdatasourcesEfficienttechniquesforupdateofmaterializedviewsWhatdatatosummarizeRawdatamaybetoolarg

etostoreon-lineAggregatevalues(totals/subtotals)oftensufficeQueriesonrawdatacanoftenbetransformedbyqueryopti

mizertouseaggregatevaluesCopyright:Silberschatz,KorthandSudarshan57DataWarehouseSchemasCopyright:Silberschatz,KorthandSudarshan58WarehouseSche

masTypicallywarehousedataismultidimensional,withverylargefacttablesExamplesofdimensions:item-id,date/timeofsale,storewheresal

ewasmade,customeridentifierExamplesofmeasures:numberofitemssold,priceofitemsDimensionvaluesareusual

lyencodedusingsmallintegersandmappedtofullvaluesviadimensiontablesResultantschemaiscalledastarschemaMorecomplicatedschemastructuresSnowfl

akeschema:multiplelevelsofdimensiontablesConstellation:multiplefacttablesCopyright:Silberschatz,KorthandSudarshan59InformationRetrievalCopyright:Sil

berschatz,KorthandSudarshan60InformationRetrievalSystemsInformationretrieval(IR)systemsuseasimplerdatamodelthandatabasesystemsIn

formationorganizedasacollectionofdocumentsDocumentsareunstructured,noschemaInformationretrievallocatesrelevantdocuments,

onthebasisofuserinputsuchaskeywordsorexampledocumentse.g.,finddocumentscontainingthewords“databasesystems”C

anbeusedevenontextualdescriptionsprovidedwithnon-textualdatasuchasimagesIRonWebdocumentshasbecomeextremelyimportantE.g.google,altavist

a,…Copyright:Silberschatz,KorthandSudarshan61InformationRetrievalSystems(Cont.)DifferencesfromdatabasesystemsIRsystemsdon‟tdealwith

transactionalupdates(includingconcurrencycontrolandrecovery)Databasesystemsdealwithstructureddata,withschemasthatdefinethedataorga

nizationIRsystemsdealwithsomequeryingissuesnotgenerallyaddressedbydatabasesystemsApproximatesearchingbykeywordsRankingofretrievedansw

ersbyestimateddegreeofrelevanceCopyright:Silberschatz,KorthandSudarshan62KeywordSearchInfulltextretriev

al,allthewordsineachdocumentareconsideredtobekeywords.WeusethewordtermtorefertothewordsinadocumentInfo

rmation-retrievalsystemstypicallyallowqueryexpressionsformedusingkeywordsandthelogicalconnectivesand,or,and

notAndsareimplicit,evenifnotexplicitlyspecifiedRankingofdocumentsonthebasisofestimatedrelevancetoaqueryiscrit

icalRelevancerankingisbasedonfactorssuchasTermfrequencyFrequencyofoccurrenceofquerykeywordindocumentInversedocumentfrequencyHowma

nydocumentsthequerykeywordoccursinFewergivemoreimportancetokeywordHyperlinkstodocumentsMorelinkstoadocumentdocumentismoreimportantCopyr

ight:Silberschatz,KorthandSudarshan63RelevanceRankingUsingTermsTF-IDF(Termfrequency/InverseDocumentfrequ

ency)ranking:Letn(d)=numberoftermsinthedocumentdn(d,t)=numberofoccurrencesoftermtinthedocumentd.Thenreleva

nceofadocumentdtoatermtThelogfactoristoavoidexcessiveweightagetofrequenttermsAndrelevanceofdocumenttoqueryQn(d)n(d,t)1+r(d,t)=logr(d,Q)=r(d,t)n(t)

tQCopyright:Silberschatz,KorthandSudarshan64RelevanceRankingUsingTerms(Cont.)Mostsystemsaddtotheabovemo

delWordsthatoccurintitle,authorlist,sectionheadings,etc.aregivengreaterimportanceWordswhosefirstoccurrenceislateinthedocumentaregivenlowerimportanc

eVerycommonwordssuchas“a”,“an”,“the”,“it”etcareeliminatedCalledstopwordsProximity:ifkeywordsinqueryoccurclosetogeth

erinthedocument,thedocumenthashigherimportancethaniftheyoccurfarapartDocumentsarereturnedindecreasingorderofrelevancescoreUsuallyonlytopfe

wdocumentsarereturned,notallCopyright:Silberschatz,KorthandSudarshan65RelevanceUsingHyperlinksWhenusingkey

wordqueriesontheWeb,thenumberofdocumentsisenormous(manybillions)Numberofdocumentsrelevanttoaquerycanbeenormousifonlyterm

frequenciesaretakenintoaccountUsingtermfrequenciesmakes“spamming”easyE.g.atravelagentcanaddmanyoccurrencesofthewords“travelage

nt”tohispagetomakeitsrankveryhighMostofthetimepeoplearelookingforpagesfrompopularsitesIdea:usepopularityofWeb

site(e.g.howmanypeoplevisitit)toranksitepagesthatmatchgivenkeywordsProblem:hardtofindactualpopularityofsiteSolu

tion:nextslideCopyright:Silberschatz,KorthandSudarshan66RelevanceUsingHyperlinks(Cont.)Solution:usenumberofhyperlinkstoasiteasameasureof

thepopularityorprestigeofthesiteCountonlyonehyperlinkfromeachsite(why?)Popularitymeasureisforsite,notforindividualpage

MosthyperlinksaretorootofsiteSite-popularitycomputationischeaperthanpagepopularitycomputationRefinements

Whencomputingprestigebasedonlinkstoasite,givemoreweightagetolinksfromsitesthatthemselveshavehigherpr

estigeDefinitioniscircularSetupandsolvesystemofsimultaneouslinearequationsAboveideaisbasisoftheGooglePageRankrankin

gmechanismCopyright:Silberschatz,KorthandSudarshan67RelevanceUsingHyperlinks(Cont.)Connectionstosocialnetworkingtheoriesthatrankedpres

tigeofpeopleE.g.thepresidentoftheUShasahighprestigesincemanypeopleknowhimSomeoneknownbymultipleprestigiouspeoplehashighprestigeHubandauthority

basedrankingAhubisapagethatstoreslinkstomanypages(onatopic)AnauthorityisapagethatcontainsactualinformationonatopicEachpagegetsahubprestigebasedonp

restigeofauthoritiesthatitpointstoEachpagegetsanauthorityprestigebasedonprestigeofhubsthatpointtoitAgain,prestigedefin

itionsarecyclic,andcanbegotbysolvinglinearequationsUseauthorityprestigewhenrankinganswerstoaqueryCopyright:Silberschatz,KorthandSud

arshan68SimilarityBasedRetrievalSimilaritybasedretrieval-retrievedocumentssimilartoagivendocumentSimilaritymaybedefinedonthebasisofc

ommonwordsE.g.findktermsinAwithhighestr(d,t)andusethesetermstofindrelevanceofotherdocuments;eachofthetermscarriesa

weightofr(d,t)SimilaritycanbeusedtorefineanswersettokeywordqueryUserselectsafewrelevantdocumentsfromthoseretrie

vedbykeywordquery,andsystemfindsotherdocumentssimilartotheseCopyright:Silberschatz,KorthandSudarshan69S

ynonymsandHomonymsSynonymsE.g.document:“motorcyclerepair”,query:“motorcyclemaintenance”needtorealizethat“maintenance”a

nd“repair”aresynonymsSystemcanextendqueryas“motorcycleand(repairormaintenance)”HomonymsE.g.“object”hasdifferentmeaningsasnoun

/verbCandisambiguatemeanings(tosomeextent)fromthecontextExtendingqueriesautomaticallyusingsynonymscanbeproblematicNeedtounderstand

intendedmeaninginordertoinfersynonymsOrverifysynonymswithuserSynonymsmayhaveothermeaningsaswellCop

yright:Silberschatz,KorthandSudarshan70IndexingofDocumentsAninvertedindexmapseachkeywordKitoasetofdocumentsSithatcontai

nthekeywordDocumentsidentifiedbyidentifiersInvertedindexmayrecordKeywordlocationswithindocumenttoallowproximitybasedrankingCountsofnu

mberofoccurrencesofkeywordtocomputeTFandoperation:FindsdocumentsthatcontainallofK1,K2,...,Kn.IntersectionS1S2.....Snoroperation:do

cumentsthatcontainatleastoneofK1,K2,…,Knunion,S1S2.....Sn,.EachSiiskeptsortedtoallowefficientintersection/unionbymerging“not”c

analsobeefficientlyimplementedbymergingofsortedlistsCopyright:Silberschatz,KorthandSudarshan71MeasuringRetrievalEffectivenessIRsystemss

avespacebyusingindexstructuresthatsupportonlyapproximateretrieval.Mayresultin:falsenegative(falsedrop)-somerelevantdocumentsmaynotberetrie

ved.falsepositive-someirrelevantdocumentsmayberetrieved.Formanyapplicationsagoodindexshouldnotpermi

tanyfalsedrops,butmaypermitafewfalsepositives.Relevantperformancemetrics:Precision-whatpercentageoftheretrieveddocumentsarerele

vanttothequery.Recall-whatpercentageofthedocumentsrelevanttothequerywereretrieved.Copyright:Silberschatz,KorthandSudarshan72MeasuringRetrievalEf

fectiveness(Cont.)Rankingordercanalsoresultinfalsepositives/falsenegativeRecallvs.precisiontradeoff:Canincreaserecall

byretrievingmanydocuments(downtoalowlevelrelevanceranking),butmanyirrelevantdocumentswouldbefetched,reducingprecisionMeasuresofretrievaleffecti

veness:Recallasafunctionofnumberofdocumentsfetched,orPrecisionasafunctionofrecallEquivalently,asafunctionofn

umberofdocumentsfetchedE.g.“precisionof75%atrecallof50%,and60%atarecallof75%”Ingeneral:drawagraphofprecisionvsrecall.Problem:whic

hdocumentsareactuallyrelevant,andwhichanotUsualsolution:humanjudgesCreateacorpusofdocumentsandqueries,withhumansdeciding

whichdocumentsarerelevanttowhichqueriesTREC(TextREtrievalConference)BenchmarkCopyright:Silberschatz,Kortha

ndSudarshan73WebCrawlingWebcrawlersareprogramsthatlocateandgatherinformationontheWebRecursivelyfollowhyperlinkspresentinknowndocuments,tofindot

herdocumentsStartingfromaseedsetofdocumentsFetcheddocumentsHandedovertoanindexingsystemCanbediscardedafterindexing,orstoreasacachedcopyCrawli

ngtheentireWebwouldtakeaverylargeamountoftimeSearchenginestypicallycoveronlyapartoftheWeb,notallofitTakemonthstoperformasin

glecrawlCopyright:Silberschatz,KorthandSudarshan74WebCrawling(Cont.)Crawlingisdonebymultipleprocessesonmultiplemachines,runni

nginparallelSetoflinkstobecrawledstoredinadatabaseNewlinksfoundincrawledpagesaddedtothisset,tobecrawledl

aterIndexingprocessalsorunsonmultiplemachinesCreatesanewcopyofindexinsteadofmodifyingoldindexOldindexisusedtoanswerqueriesAfteracrawlis“compl

eted”newindexbecomes“old”indexMultiplemachinesusedtoanswerqueriesIndicesmaybekeptinmemoryQueriesmayberoutedtodifferentmachi

nesforloadbalancingCopyright:Silberschatz,KorthandSudarshan75BrowsingStoringrelateddocumentstogetherinalibraryfa

cilitatesbrowsinguserscanseenotonlyrequesteddocumentbutalsorelatedones.Browsingisfacilitatedbyclassificationsystemthato

rganizeslogicallyrelateddocumentstogether.Organizationishierarchical:classificationhierarchyCopyright:Silbersch

atz,KorthandSudarshan76AClassificationHierarchyForALibrarySystemCopyright:Silberschatz,KorthandSudars

han77ClassificationDAGDocumentscanresideinmultipleplacesinahierarchyinaninformationretrievalsystem,sincephysicallo

cationisnotimportant.ClassificationhierarchyisthusDirectedAcyclicGraph(DAG)Copyright:Silberschatz,Kortha

ndSudarshan78AClassificationDAGForALibraryInformationRetrievalSystemCopyright:Silberschatz,KorthandSudarshan79We

bDirectoriesAWebdirectoryisjustaclassificationdirectoryonWebpagesE.g.Yahoo!Directory,OpenDirectoryprojectIssu

es:Whatshouldthedirectoryhierarchybe?Givenadocument,whichnodesofthedirectoryarecategoriesrelevanttothedocumentOftendonemanuallyClas

sificationofdocumentsintoahierarchymaybedonebasedontermsimilarity

小橙橙
小橙橙
文档分享,欢迎浏览!
  • 文档 25747
  • 被下载 7
  • 被收藏 0
相关资源
广告代码123
若发现您的权益受到侵害,请立即联系客服,我们会尽快为您处理。侵权客服QQ:395972555 (支持时间:9:00-21:00) 公众号
Powered by 太赞文库
×
确认删除?