数据库原理与应用课程组课件

PPT
  • 阅读 146 次
  • 下载 0 次
  • 页数 79 页
  • 大小 731.500 KB
  • 2022-12-05 上传
  • 收藏
  • 违规举报
  • © 版权认领
下载文档30.00 元 加入VIP免费下载
此文档由【小橙橙】提供上传,收益归文档提供者,本网站只提供存储服务。若此文档侵犯了您的版权,欢迎进行违规举报版权认领
数据库原理与应用课程组课件
可在后台配置第一页与第二页中间广告代码
数据库原理与应用课程组课件
可在后台配置第二页与第三页中间广告代码
数据库原理与应用课程组课件
可在后台配置第三页与第四页中间广告代码
数据库原理与应用课程组课件
数据库原理与应用课程组课件
还剩10页未读,继续阅读
【这是免费文档,您可以免费阅读】
/ 79
  • 收藏
  • 违规举报
  • © 版权认领
下载文档30.00 元 加入VIP免费下载
文本内容

【文档说明】数据库原理与应用课程组课件.ppt,共(79)页,731.500 KB,由小橙橙上传

转载请保留链接:https://www.ichengzhen.cn/view-92522.html

以下为本文档部分文字说明:

Copyright:Silberschatz,KorthandSudarshan1Chapter22:AdvancedQueryingandInformationRetrieval陕西师范大学计算机科学学院数

据库原理与应用课程组Copyright:Silberschatz,KorthandSudarshan2Chapter22:AdvancedQueryingandInformationRetrievalDecision-SupportSystemsDataAnalysis

OLAPExtendedaggregationfeaturesinSQLWindowingandrankingDataMiningDataWarehousingInformation-RetrievalSystemsInc

ludingWebsearchCopyright:Silberschatz,KorthandSudarshan3DecisionSupportSystemsDecision-supportsystemsareusedtomakebusinessdecisionsoft

enbasedondatacollectedbyon-linetransaction-processingsystems.Examplesofbusinessdecisions:whatitemstostock?Whatinsu

rancepremiumtochange?Whotosendadvertisementsto?ExamplesofdatausedformakingdecisionsRetailsalestransactiondetailsCustomerprofiles(income

,age,sex,etc.)Copyright:Silberschatz,KorthandSudarshan4Decision-SupportSystems:OverviewDataanalysistasksaresimplifiedbyspecialized

toolsandSQLextensionsExampletasksForeachproductcategoryandeachregion,whatwerethetotalsalesinthelastquarterandhowdo

theycomparewiththesamequarterlastyearAsabove,foreachproductcategoryandeachcustomercategoryStatisticalanalysispackages(e.g.,:

S++)canbeinterfacedwithdatabasesStatisticalanalysisisalargefieldwillnotstudyithereDataminingseekstodiscoverknowledgeautomaticallyintheformofstatist

icalrulesandpatternsfromLargedatabases.Adatawarehousearchivesinformationgatheredfrommultiplesources,andstoresitun

deraunifiedschema,atasinglesite.Importantforlargebusinesseswhichgeneratedatafrommultipledivisions,p

ossiblyatmultiplesitesDatamayalsobepurchasedexternallyCopyright:Silberschatz,KorthandSudarshan5DataAna

lysisandOLAPAggregatefunctionssummarizelargevolumesofdataOnlineAnalyticalProcessing(OLAP)Interactiveanalysisofdata,allowingdatatobesummari

zedandviewedindifferentwaysinanonlinefashion(withnegligibledelay)Datathatcanbemodeledasdimensionattributesan

dmeasureattributesarecalledmultidimensionaldata.Givenarelationusedfordataanalysis,wecanidentifysomeofitsattri

butesasmeasureattributes,sincetheymeasuresomevalue,andcanbeaggregatedupon.Forinstance,theattributenumberofthesalesrelationisame

asureattribute,sinceitmeasuresthenumberofunitssold.Someoftheotherattributesoftherelationareidentifiedasdimensio

nattributes,sincetheydefinethedimensionsonwhichmeasureattributes,andsummariesofmeasureattributes,areviewed.Copyright:Silberschatz,KorthandSudarsh

an6CrossTabulationofsalesbyitem-nameandcolorThetableaboveisanexampleofacross-tabulation(cross-tab),alsoreferredtoasapivot-table.Acro

ss-tabisatablewherevaluesforoneofthedimensionattributesformtherowheaders,valuesforanotherdimensionattributeformthecolumnheadersOtherdimensio

nattributesarelistedontopValuesinindividualcellsare(aggregatesof)thevaluesofthedimensionattributesthatspecifythecell.C

opyright:Silberschatz,KorthandSudarshan7RelationalRepresentationofCrosstabsCrosstabscanberepresentedasrelationsTheva

lueallisusedtorepresentaggregatesTheSQL:1999standardactuallyusesnullvaluesinplaceofallMoreonthislater….Copyright:Sil

berschatz,KorthandSudarshan8Three-DimensionalDataCubeAdatacubeisamultidimensionalgeneralizationofacro

sstabCannotviewathree-dimensionalobjectinitsentiretybutcrosstabscanbeusedasviewsonadatacubeCopyright:Silberschatz,KorthandSudarshan9OnlineAnalyti

calProcessingTheoperationofchangingthedimensionsusedinacross-tabiscalledpivotingSupposeananalystwish

estoseeacross-tabonitem-nameandcolorforafixedvalueofsize,forexample,large,insteadofthesumacrossallsizes.Sucha

noperationisreferredtoasslicing.Theoperationissometimescalleddicing,particularlywhenvaluesformultipledimensionsarefixed.Theoperatio

nofmovingfromfiner-granularitydatatoacoarsergranularityiscalledarollup.Theoppositeoperation-thatofmovingfromcoarse

r-granularitydatatofiner-granularitydata–iscalledadrilldown.Copyright:Silberschatz,KorthandSudarshan10Hiera

rchiesonDimensionsHierarchyondimensionattributes:letsdimensionstobeviewedatdifferentlevelsofdetailE.g.thedimensionD

ateTimecanbeusedtoaggregatebyhourofday,date,dayofweek,month,quarteroryearCopyright:Silberschatz,KorthandSudarshan11Cross

TabulationWithHierarchyCrosstabscanbeeasilyextendedtodealwithhierarchiesCandrilldownorrolluponahierarchyCopyright:Silberschatz,Korth

andSudarshan12OLAPImplementationTheearliestOLAPsystemsusedmultidimensionalarraysinmemorytostoredatacubes,andarereferredtoasmu

ltidimensionalOLAP(MOLAP)systems.OLAPimplementationsusingonlyrelationaldatabasefeaturesarecalledrela

tionalOLAP(ROLAP)systemsHybridsystems,whichstoresomesummariesinmemoryandstorethebasedataandothersummariesinarelat

ionaldatabase,arecalledhybridOLAP(HOLAP)systems.Copyright:Silberschatz,KorthandSudarshan13OLAPImplementation(Cont.)EarlyOLAPsystemsprecomputedal

lpossibleaggregatesinordertoprovideonlineresponseSpaceandtimerequirementsfordoingsocanbeveryhigh2ncombinationsofgroupbyItsufficestoprecomputesome

aggregates,andcomputeothersondemandfromoneoftheprecomputedaggregatesCancomputeaggregateon(item-name,color)fromanaggregateon(i

tem-name,color,size)Forallbutafew“non-decomposable”aggregatessuchasmedianischeaperthancomputingitfr

omscratchSeveraloptimizationsavailableforcomputingmultipleaggregatesCancomputeaggregateon(item-name,color)fromanaggregateon(item-name,color,

size)Cancomputeaggregateson(item-name,color,size),(item-name,color)and(item-name)usingasinglesortingofthebasedataCopyright:Silberschatz,KorthandS

udarshan14ExtendedAggregationSQL-92aggregationquitelimitedManyusefulaggregatesareeitherveryhardorimpossibletospecifyDatacube

Complexaggregates(median,variance)binaryaggregates(correlation,regressioncurves)rankingqueries(“a

ssigneachstudentarankbasedonthetotalmarks”SQL:1999OLAPextensionsprovideavarietyofaggregationfunctionstoaddressabov

elimitationsSupportedbyseveraldatabases,includingOracleandIBMDB2Copyright:Silberschatz,KorthandSudarshan15Extend

edAggregationinSQL:1999Thecubeoperationcomputesunionofgroupby‟soneverysubsetofthespecifiedattributesE.g.considerthequeryselectitem-name,c

olor,size,sum(number)fromsalesgroupbycube(item-name,color,size)Thiscomputestheunionofeightdifferentgroupi

ngsofthesalesrelation:{(item-name,color,size),(item-name,color),(item-name,size),(color,size),(item-name),(color),(size),()}where()denotesanemptygrou

pbylist.Foreachgrouping,theresultcontainsthenullvalueforattributesnotpresentinthegrouping.Copyright:Silb

erschatz,KorthandSudarshan16ExtendedAggregation(Cont.)Relationalrepresentationofcrosstabthatwesawearlier,butwit

hnullinplaceofall,canbecomputedbyselectitem-name,color,sum(number)fromsalesgroupbycube(item-name,col

or)Thefunctiongrouping()canbeappliedonanattributeReturns1ifthevalueisanullvaluerepresentingall,andreturns0inallothercases.selectitem-name,c

olor,size,sum(number),grouping(item-name)asitem-name-flag,grouping(color)ascolor-flag,grouping(size)assize-flag,f

romsalesgroupbycube(item-name,color,size)Canusethefunctiondecode()intheselectclausetoreplacesuchnullsbyavaluesuchasallE.g.replaceitem-namei

nfirstquerybydecode(grouping(item-name),1,„all‟,item-name)Copyright:Silberschatz,KorthandSudarshan17Extende

dAggregation(Cont.)TherollupconstructgeneratesuniononeveryprefixofspecifiedlistofattributesE.g.selectitem-name,color,size,sum(number)fromsalesg

roupbyrollup(item-name,color,size)Generatesunionoffourgroupings:{(item-name,color,size),(item-name,color),(item-name),()}Rollup

canbeusedtogenerateaggregatesatmultiplelevelsofahierarchy.E.g.,supposetableitemcategory(item-name,category)givesthecateg

oryofeachitem.Thenselectcategory,item-name,sum(number)fromsales,itemcategorywheresales.item-name=itemcategory.item-namegroupbyro

llup(category,item-name)wouldgiveahierarchicalsummarybyitem-nameandbycategory.Copyright:Silberschatz,KorthandSudarshan18ExtendedAg

gregation(Cont.)MultiplerollupsandcubescanbeusedinasinglegroupbyclauseEachgeneratessetofgroupbylists,crossproductofse

tsgivesoverallsetofgroupbylistsE.g.,selectitem-name,color,size,sum(number)fromsalesgroupbyrollup(item-name),rollup(color,size)generates

thegroupings{item-name,()}X{(color,size),(color),()}={(item-name,color,size),(item-name,color),(item-name),(color,size),(color),()}Copyright:Silb

erschatz,KorthandSudarshan19RankingRankingisdoneinconjunctionwithanorderbyspecification.Givenarelationstudent-marks(

student-id,marks)findtherankofeachstudent.selectstudent-id,rank()over(orderbymarksdesc)ass-rankfroms

tudent-marksAnextraorderbyclauseisneededtogettheminsortedorderselectstudent-id,rank()over(orderbymarksdesc)ass-rankfromstu

dent-marksorderbys-rankRankingmayleavegaps:e.g.if2studentshavethesametopmark,bothhaverank1,andthenextrankis3dense_ran

kdoesnotleavegaps,sonextdenserankwouldbe2Copyright:Silberschatz,KorthandSudarshan20Ranking(Cont.)Ra

nkingcanbedonewithinpartitionofthedata.“Findtherankofstudentswithineachsection.”selectstudent-id,section,rank()over(partitionbysection

orderbymarksdesc)assec-rankfromstudent-marks,student-sectionwherestudent-marks.student-id=student-section.student-idorderbysection,sec-rankMultip

lerankclausescanoccurinasingleselectclauseRankingisdoneafterapplyinggroupbyclause/aggregationExercises:Findstudentswithtop

nranksManysystemsprovidespecial(non-standard)syntaxfor“top-n”queriesRankstudentsbysumoftheirmarksindifferentcoursesgivenrelationstudent-course-

marks(student-id,course,marks)Copyright:Silberschatz,KorthandSudarshan21Ranking(Cont.)Otherrankingfunctions:percent_rank(withinpartition,ifpa

rtitioningisdone)cume_dist(cumulativedistribution)fractionoftupleswithprecedingvaluesrow_number(non-determ

inisticinpresenceofduplicates)SQL:1999permitstheusertospecifynullsfirstornullslastselectstudent-id,rank()over(orderbymarksdescnullslast)ass-rank

fromstudent-marksCopyright:Silberschatz,KorthandSudarshan22Ranking(Cont.)Foragivenconstantn,therankingthefunctionntile(n)takesthetuplesin

eachpartitioninthespecifiedorder,anddividesthemintonbucketswithqualnumbersoftuples.Forinstance,weansortempl

oyeesbysalary,andusentile(3)tofindwhichrange(bottomthird,middlethird,ortopthird)eachemployeeisin,andcomputethetotalsalaryearnedbyemployeesi

neachrange:selectthreetile,sum(salary)from(selectsalary,ntile(3)over(orderbysalary)asthreetilefromemployee)assgroupb

ythreetileCopyright:Silberschatz,KorthandSudarshan23WindowingE.g.:“Givensalesvaluesforeachdate,calculateforeachdatetheaverageofthesalesonthatda

y,thepreviousday,andthenextday”Suchmovingaveragequeriesareusedtosmoothoutrandomvariations.Incontras

ttogroupby,thesametuplecanexistinmultiplewindowsWindowspecificationinSQL:Orderingoftuples,sizeofwindowforeachtuple,aggregatefunctionE.g.givenrela

tionsales(date,value)selectdate,sum(value)over(orderbydatebetweenrows1precedingand1following)fromsalesExamplesofo

therwindowspecifications:betweenrowsunboundedprecedingandcurrentrowsunboundedprecedingrangebetween10preceding

andcurrentrowAllrowswithvaluesbetweencurrentrowvalue–10tocurrentvaluerangeinterval10dayprecedingNotincludingcurr

entrowCopyright:Silberschatz,KorthandSudarshan24Windowing(Cont.)CandowindowingwithinpartitionsE.g.Givenarelationtransaction(account-number,

date-time,value),wherevalueispositiveforadepositandnegativeforawithdrawal“Findtotalbalanceofeachacco

untaftereachtransactionontheaccount”selectaccount-number,date-time,sum(value)over(partitionbyaccount-numberorderbydate-timerowsunbo

undedpreceding)asbalancefromtransactionorderbyaccount-number,date-timeCopyright:Silberschatz,KorthandSudarshan25Dat

aMiningCopyright:Silberschatz,KorthandSudarshan26DataMiningBroadlyspeaking,dataminingistheprocessofsemi-

automaticallyanalyzinglargedatabasestofindusefulpatternsLikeknowledgediscoveryinartificialintelligenceda

taminingdiscoversstatisticalrulesandpatternsDiffersfrommachinelearninginthatitdealswithlargevolumesofdatastoredprimarilyondisk.Somety

pesofknowledgediscoveredfromadatabasecanberepresentedbyasetofrules.e.g.,:“Youngwomenwithannualincomesgreaterthan$50,000aremostlikel

ytobuysportscars”Othertypesofknowledgerepresentedbyequations,orbypredictionfunctionsSomemanualinterventionisusuallyrequiredPre-process

ingofdata,choiceofwhichtypeofpatterntofind,postprocessingtofindnovelpatternsCopyright:Silberschatz,KorthandSudarshan27Appli

cationsofDataMiningPredictionbasedonpasthistoryPredictifacreditcardapplicantposesagoodcreditrisk,basedonsomeattributes(inc

ome,jobtype,age,..)andpasthistoryPredictifacustomerislikelytoswitchbrandloyaltyPredictifacustomerislikelytor

espondto“junkmail”PredictifapatternofphonecallingcardusageislikelytobefraudulentSomeexamplesofpredictionmechanisms:ClassificationGivenatr

ainingsetconsistingofitemsbelongingtodifferentclasses,andanewitemwhoseclassisunknown,predictwhichclassitbelongstoRegressionformulaegivenas

etofparameter-valuetofunction-resultmappingsforanunknownfunction,predictthefunction-resultforanewparameter

-valueCopyright:Silberschatz,KorthandSudarshan28ApplicationsofDataMining(Cont.)DescriptivePatternsAssociationsFindbooksthatareoftenboughtbyth

esamecustomers.Ifanewcustomerbuysonesuchbook,suggestthathebuystheotherstoo.Othersimilarapplications:cameraaccessories,clothes,etc.

AssociationsmayalsobeusedasafirststepindetectingcausationE.g.associationbetweenexposuretochemicalXandcancer,ornewmedicineandcardiacprob

lemsClustersE.g.typhoidcaseswereclusteredinanareasurroundingacontaminatedwellDetectionofclustersremainsim

portantindetectingepidemicsCopyright:Silberschatz,KorthandSudarshan29ClassificationRulesClassificationruleshelpassignnewobjectstoasetofclasses.E.g.,

givenanewautomobileinsuranceapplicant,shouldheorshebeclassifiedaslowrisk,mediumriskorhighrisk?Classificat

ionrulesforaboveexamplecoulduseavarietyofknowledge,suchaseducationallevelofapplicant,salaryofapplicant,age

ofapplicant,etc.personP,P.degree=mastersandP.income>75,000P.credit=excellentpersonP,P.degree=bachelorsand(P.income25,000and

P.income75,000)P.credit=goodRulesarenotnecessarilyexact:theremaybesomemisclassificationsClassific

ationrulescanbecompactlyshownasadecisiontree.Copyright:Silberschatz,KorthandSudarshan30DecisionTreeCopyright:Silberschatz,K

orthandSudarshan31ConstructionofDecisionTreesTrainingset:adatasampleinwhichthegroupingforeachtupleisalreadyknown.C

onsidercreditriskexample:Supposedegreeischosentopartitionthedataattheroot.Sincedegreehasasmallnumberofpossiblevalues,onechildiscreatedforeachvalue

.Ateachchildnodeoftheroot,furtherclassificationisdoneifrequired.Here,partitionsaredefinedbyincome.Sinceincomeisacontinuousattribute,so

menumberofintervalsarechosen,andonechildcreatedforeachinterval.Differentclassificationalgorithmsusedifferentwaysofchoosi

ngwhichattributetopartitiononateachnode,andwhattheintervals,ifany,are.IngeneralDifferentbranchesofthetreecouldgrowtodifferentlevels.Differen

tnodesatthesamelevelmayusedifferentpartitioningattributes.Copyright:Silberschatz,KorthandSudarshan32Constructi

onofDecisionTrees(Cont.)Greedytopdowngenerationofdecisiontrees.Eachinternalnodeofthetreepartitionsthedat

aintogroupsbasedonapartitioningattribute,andapartitioningconditionforthenodeMoreonchoosingpartioningattribute/condit

ionshortlyAlgorithmisgreedy:thechoiceismadeonceandnotrevisitedasmoreofthetreeisconstructedThedataatanodeisnotpartitionedfurtherifei

therall(ormost)oftheitemsatthenodebelongtothesameclass,orallattributeshavebeenconsidered,andnofurtherpartitioningispossible.Suchanodeisalea

fnode.Otherwisethedataatthenodeispartitionedfurtherbypickinganattributeforpartitioningdataatthenode.Copyright:Silberschatz

,KorthandSudarshan33BestSplitsIdea:evaluatedifferentattributesandpartitioningconditionsandpictheonethatbestimprovesthe“purity”ofthetrainingsetexam

plesTheinitialtrainingsethasamixtureofinstancesfromdifferentclassesandisthusrelativelyimpureE.g.ifdegreeexactlypredictscreditrisk,partitioningo

ndegreewouldresultineachildhavinginstancesofonlyoneclassI.e.,thechildnodeswouldbepureThepurityofasetSoftraininginstanc

escanbemeasuredquantitativelyinseveralways.Notation:numberofclasses=k,numberofinstances=|S|,fractionofins

tancesinclassi=pi.TheGinimeasureofpurityisdefinedas[Gini(S)=1-Whenallinstancesareinasingleclass,theGinivalueis0,whileitreachesitsma

ximum(of1–1/k)ifeachclassthesamenumberofinstances.ki-1p2iCopyright:Silberschatz,KorthandSudarshan34BestSplits(Con

t.)Anothermeasureofpurityistheentropymeasure,whichisdefinedasentropy(S)=–WhenasetSissplitintomultiplesetsSi,I=1,2,…,r,wecanmeasurethepurityof

theresultantsetofsetsas:purity(S1,S2,…..,Sr)=TheinformationgainduetoparticularsplitofSintoSi,i=1,2,….,rInformation-gain(S,{S1,S2,….

,Sr)=purity(S)–purity(S1,S2,…Sr)ri=1|Si||S|purity(Si)ki-1pilog2piCopyright:Silberschatz,KorthandSudarshan35BestSplits(Cont.)Measureof“cost”ofa

split:Information-content(S,{S1,S2,…..,Sr}))=–Information-gainratio=Information-gain(S,{S1,S2,……,Sr})Information-content(S,

{S1,S2,…..,Sr})ThebestsplitforanattributeistheonethatgivesthemaximuminformationgainratioContinuousvaluedattributesCanbeorderedinafashionmeani

ngfultoclassificatione.g.integerandrealvaluesCategoricalattributesCannotbemeaningfullyordered(e.g.country,school/university,item-col

or,.):log2ri-1|Si||S||Si||S|Copyright:Silberschatz,KorthandSudarshan36FindingBestSplitsCategoricalattributes:Mul

ti-waysplit,onechildforeachvaluemayhavetoomanychildreninsomecasesBinarysplit:tryallpossiblebreakup

ofvaluesintotwosets,andpickthebestContinuousvaluedattributeBinarysplit:Sortvaluesintheinstances,tryeacha

sasplitpointE.g.ifvaluesare1,10,15,25,splitat1,10,15PickthevaluethatgivesbestsplitMulti-waysplit:morecomplicated,s

eebibliographicnotesAseriesofbinarysplitsonthesameattributehasroughlyequivalenteffectCopyright:Silberscha

tz,KorthandSudarshan37Decision-TreeConstructionAlgorithmProcedureGrow.Tree(S)Partition(S);ProcedurePartition(S)if(purity(S)>por|S|<s)thenretu

rn;foreachattributeAevaluatesplitsonattributeA;Usebestsplitfound(acrossallattributes)topartitionSintoS1,S2,….,Sr,fori=1,2,…..,rPartition(Si);Copyri

ght:Silberschatz,KorthandSudarshan38DecisionTreeConstr’nAlgo’s.(Cont.)VarietyofalgorithmshavebeendevelopedtoReduceCPUcos

tand/orReduceIOcostwhenhandlingdatasetslargerthanmemoryImproveaccuracyofclassificationDecisiontreemaybeoverfitted,i.e.,overlytuned

togiventrainingsetPruningofdecisiontreemaybedoneonbranchesthathavetoofewtraininginstancesWhenasubtreeispruned,aninternalnod

ebecomesaleafanditsclassissettothemajorityclassoftheinstancesthatmaptothenodePruningcanbedonebyusingapartofthetrainingse

ttobuildtree,andasecondparttotestthetreeprunesubtreesthatincreasemisclassificationonsecondpartCopyr

ight:Silberschatz,KorthandSudarshan39OtherTypesofClassifiersFurthertypesofclassifiersNeuralnetclassifiers

BayesianclassifiersNeuralnetclassifiersusethetrainingdatatotrainartificialneuralnetsWidelystudiedinAI,won‟tcoverhereBayesianclassifiersuseB

ayestheorem,whichsaysp(cj|d)=p(d|cj)p(cj)p(d)wherep(cj|d)=probabilityofinstancedbeinginclasscj,p(d|cj)=probabi

lityofgeneratinginstancedgivenclasscj,p(cj)=probabilityofoccurrenceofclasscj,andp(d)=probabilityofinstancedoccuringCopyright:S

ilberschatz,KorthandSudarshan40NaïveBayesianClassifiersBayesianclassifiersrequirecomputationofp(d|cj

)precomputationofp(cj)p(d)canbeignoredsinceitisthesameforallclassesTosimplifythetask,naïveBayesianclassif

iersassumeattributeshaveindependentdistributions,andtherebyestimatep(d|cj)=p(d1|cj)*p(d2|cj)*….*(p(dn|cj)Eachofthep(di|cj)canbeestimatedfromahis

togramondivaluesforeachclasscjthehistogramiscomputedfromthetraininginstancesHistogramsonmultipleattributesaremoreexpensivetocomputeandsto

reCopyright:Silberschatz,KorthandSudarshan41RegressionRegressiondealswiththepredictionofavalue,rathert

hanaclass.Givenvaluesforasetofvariables,X1,X2,…,Xn,wewishtopredictthevalueofavariableY.Onewayistoinfercoefficientsa0,a1,a1,…,

ansuchthatY=a0+a1*X1+a2*X2+…+an*XnFindingsuchalinearpolynomialiscalledlinearregression.Ingeneral,theprocessoffindinga

curvethatfitsthedataisalsocalledcurvefitting.Thefitmayonlybeapproximatebecauseofnoiseinthedata,orbecausetherel

ationshipisnotexactlyapolynomialRegressionaimstofindcoefficientsthatgivethebestpossiblefit.Copyright:Silberschatz,KorthandSudarshan42Associa

tionRulesRetailshopsareofteninterestedinassociationsbetweendifferentitemsthatpeoplebuy.Someonewhobuysbreadisquitelikelyalso

tobuymilkApersonwhoboughtthebookDatabaseSystemConceptsisquitelikelyalsotobuythebookOperatingSystemConcepts.Associationsinformat

ioncanbeusedinseveralways.E.g.whenacustomerbuysaparticularbook,anonlineshopmaysuggestassociatedbooks.Associationrules:breadmi

lkDB-Concepts,OS-ConceptsNetworksLefthandside:antecedent,righthandside:consequentAnassociationrulemusthaveanassociatedpopulation;thep

opulationconsistsofasetofinstancesE.g.eachtransaction(sale)atashopisaninstance,andthesetofalltransactionsisthepopulat

ionCopyright:Silberschatz,KorthandSudarshan43AssociationRules(Cont.)Ruleshaveanassociatedsupport,aswellasanassociatedconfidence.Supportisameasureof

whatfractionofthepopulationsatisfiesboththeantecedentandtheconsequentoftherule.E.g.supposeonly0.001percentofallpurchasesincludemilk

andscrewdrivers.Thesupportfortheruleismilkscrewdriversislow.Weusuallywantruleswithareasonablyhighsupport

RuleswithlowsupportareusuallynotveryusefulConfidenceisameasureofhowoftentheconsequentistruewhentheant

ecedentistrue.E.g.therulebreadmilkhasaconfidenceof80percentif80percentofthepurchasesthatincludebreadal

soincludemilk.Usuallywantruleswithreasonablylargeconfidence.Arulewithalowconfidenceisnotmeaningful.Notethattheconfidenceofbre

admilkmaybeverydifferentfromtheconfidenceofmilkbread,althoughbothhavethesamesupports.Copyright:Silberscha

tz,KorthandSudarshan44FindingAssociationRulesWearegenerallyonlyinterestedinassociationruleswithreasonablyhighs

upport(e.g.supportof2%orgreater)Naïvealgorithm1.Considerallpossiblesetsofrelevantitems.2.Foreachsetfinditssupport(i.e.counthowmanytransac

tionspurchaseallitemsintheset).Largeitemsets:setswithsufficientlyhighsupport3.Uselargeitemsetstogenerateassociationrules.1.FromitemsetAgeneratether

uleA-{b}bforeachbA.Supportofrule=support(A).Confidenceofrule=support(A)/support(A-{b})Copyright:Silberschatz,KorthandSudarshan45FindingSupportFe

witemsets:determinesupportofallitemsetsviaasinglepassonsetoftransactionsAcountismaintainedforeachitemset,initia

llysetto0.Whenatransactionisfetched,thecountisincrementedforeachsetofitemsthatiscontainedinthetransaction.Largeitemsets:sets

withahighcountattheendofthepassManyitemsets:Ifmemorynotenoughtoholdallcountsforallitemsetsusemultiplepasses,consideringonlysomeitemsetsin

eachpass.Optimization:Onceanitemsetiseliminatedbecauseitscount(support)istoosmallnoneofitssupersetsneedstobeconsidered.The

aprioritechniquetofindlargeitemsets:Pass1:countsupportofallsetswithjust1item.EliminatethoseitemswithlowsupportPassi:candidates:everys

etofiitemssuchthatallitsi-1itemsubsetsarelargeCountsupportofallcandidatesStopiftherearenocandidatesCopyright:Silbersc

hatz,KorthandSudarshan46OtherTypesofAssociationsBasicassociationruleshaveseverallimitationsDeviationsfromtheexpectedprobabilitya

remoreinterestingE.g.ifmanypeoplepurchasebread,andmanypeoplepurchasecereal,quiteafewwouldbeexpectedtopurchaseboth(prob1*prob2)

WeareinterestedinpositiveaswellasnegativecorrelationsbetweensetsofitemsPositivecorrelation:co-occurrenceishigherthanpredict

edNegativecorrelation:co-occurrenceislowerthanpredictedSequenceassociations/correlationsE.g.wheneverbond

sgoup,stockpricesgodownin2daysDeviationsfromtemporalpatternsE.g.deviationfromasteadygrowthE.g.salesofwinterweargodowninsummerNotsurprisin

g,partofaknownpattern.LookfordeviationfromvaluepredictedusingpastpatternsCopyright:Silberschatz,KorthandSudarshan47ClusteringClustering:Intu

itively,findingclustersofpointsinthegivendatasuchthatsimilarpointslieinthesameclusterCanbeformalizedusingdista

ncemetricsinseveralwaysE.g.Grouppointsintoksets(foragivenk)suchthattheaveragedistanceofpointsfromthecentroidoftheirassignedgroupismini

mizedCentroid:pointdefinedbytakingaverageofcoordinatesineachdimension.Anothermetric:minimizeaveragedistancebetweeneverypairofpoin

tsinaclusterHasbeenstudiedextensivelyinstatistics,butonsmalldatasetsDataminingsystemsaimatclusteringtechniqu

esthatcanhandleverylargedatasetsE.g.theBirchclusteringalgorithm(moreshortly)Copyright:Silberschatz,KorthandSudarshan48Hie

rarchicalClusteringExamplefrombiologicalclassification(thewordclassificationheredoesnotmeanapredict

ionmechanism)chordatamammaliareptilialeopardshumanssnakescrocodilesOtherexamples:Internetdirectorysystems(e.g.Yahoo,moreonthislater)Ag

glomerativeclusteringalgorithmsBuildsmallclusters,thenclustersmallclustersintobiggerclusters,andsoonDivisiveclusteringalgorithm

sStartwithallitemsinasinglecluster,repeatedlyrefine(break)clustersintosmalleronesCopyright:Silberschatz,KorthandSudars

han49ClusteringAlgorithmsClusteringalgorithmshavebeendesignedtohandleverylargedatasetsE.g.theBircha

lgorithmMainidea:useanin-memoryR-treetostorepointsthatarebeingclusteredInsertpointsoneatatimeintotheR-tree,merginganewpointwithanexistingclusterifi

slessthansomedistanceawayIftherearemoreleafnodesthanfitinmemory,mergeexistingclustersthatareclosetoeachotherAtth

eendoffirstpasswegetalargenumberofclustersattheleavesoftheR-treeMergeclusterstoreducethenumberofclustersCop

yright:Silberschatz,KorthandSudarshan50CollaborativeFilteringGoal:predictwhatmovies/books/…apersonmaybeinterestedin,onthebasisofPastprefer

encesofthepersonOtherpeoplewithsimilarpastpreferencesThepreferencesofsuchpeopleforanewmovie/book/…OneapproachbasedonrepeatedclusteringClust

erpeopleonthebasisofpreferencesformoviesThenclustermoviesonthebasisofbeinglikedbythesameclustersofpeopleAgainclusterpeopleba

sedontheirpreferencesfor(thenewlycreatedclustersof)moviesRepeatabovetillequilibriumAboveproblemisaninstanceof

collaborativefiltering,whereuserscollaborateinthetaskoffilteringinformationtofindinformationofintere

stCopyright:Silberschatz,KorthandSudarshan51OtherTypesofMiningTextmining:applicationofdataminingtotextualdocumentsE.g.clusterWebpagestofindrelated

pagesE.g.clusterpagesauserhasvisitedtoorganizetheirvisithistoryE.g.classifyWebpagesautomaticallyintoaWebdirectoryDatavisualiz

ationsystemshelpusersexaminelargevolumesofdataanddetectpatternsvisuallyE.g.maps,charts,andcolor-cod

ingE.g.locationswithproblemsshowninredonamapCanvisuallyencodelargeamountsofinformationonasinglescreenHumansareverygo

odadetectingvisualpatternsCopyright:Silberschatz,KorthandSudarshan52DataWarehousingCopyright:Silberschatz,KorthandSudarshan53DataWarehousingLarg

eorganizationshavecomplexinternalorganizations,andhavedatastoredatdifferentlocations,ondifferentoperational(transactionprocessing)systems,underd

ifferentschemasDatasourcesoftenstoreonlycurrentdata,nothistoricaldataCorporatedecisionmakingrequiresaunifiedviewofallorganizationaldata,i

ncludinghistoricaldataAdatawarehouseisarepository(archive)ofinformationgatheredfrommultiplesources,storedunderaunifiedschema,atasinglesiteGreatlys

implifiesquerying,permitsstudyofhistoricaltrendsShiftsdecisionsupportqueryloadawayfromtransactionprocessingsystemsCopyright:Silberscha

tz,KorthandSudarshan54DataWarehousingCopyright:Silberschatz,KorthandSudarshan55ComponentsofDataWarehouseWhenandhowtogatherdataSourcedrivenarchite

cture:datasourcestransmitnewinformationtowarehouse,eithercontinuouslyorperiodically(e.g.atnight)Destinationdrivenarchitecture:wa

rehouseperiodicallyrequestsnewinformationfromdatasourcesKeepingwarehouseexactlysynchronizedwithdatasources(e.g.usingtwo-phasecommit)istooexpe

nsiveUsuallyOKtohaveslightlyout-of-datedataatwarehouseData/updatesareperiodicallydownloadedformonlinetransactionprocessing(OLTP)systems.Wh

atschematouseSchemaintegrationCopyright:Silberschatz,KorthandSudarshan56ComponentsofDataWarehouse(Cont.)Dat

acleansingE.g.correctmistakesinaddressesE.g.misspellings,zipcodeerrorsMergeaddresslistsfromdifferentsourcesandpurgeduplicatesKeepon

lyoneaddressrecordperhousehold(“householding”)HowtopropagateupdatesWarehouseschemamaybea(materialized)viewofschemafromdatasources

EfficienttechniquesforupdateofmaterializedviewsWhatdatatosummarizeRawdatamaybetoolargetostoreon-lineAggregatevalues(totals/subtotals)ofte

nsufficeQueriesonrawdatacanoftenbetransformedbyqueryoptimizertouseaggregatevaluesCopyright:Silberschatz,KorthandSudarshan57Dat

aWarehouseSchemasCopyright:Silberschatz,KorthandSudarshan58WarehouseSchemasTypicallywarehousedataismultidimensional,withverylargefacttablesExample

sofdimensions:item-id,date/timeofsale,storewheresalewasmade,customeridentifierExamplesofmeasures:number

ofitemssold,priceofitemsDimensionvaluesareusuallyencodedusingsmallintegersandmappedtofullvaluesviadimensiontablesResultantschemaisca

lledastarschemaMorecomplicatedschemastructuresSnowflakeschema:multiplelevelsofdimensiontablesConstellation:multiplefacttablesCopyrigh

t:Silberschatz,KorthandSudarshan59InformationRetrievalCopyright:Silberschatz,KorthandSudarshan60InformationRetrievalSystemsInformationretrieva

l(IR)systemsuseasimplerdatamodelthandatabasesystemsInformationorganizedasacollectionofdocumentsDocumentsareunstructured,noschemaInformatio

nretrievallocatesrelevantdocuments,onthebasisofuserinputsuchaskeywordsorexampledocumentse.g.,finddocumentscontainingthewords“databasesystems”C

anbeusedevenontextualdescriptionsprovidedwithnon-textualdatasuchasimagesIRonWebdocumentshasbecomeextremel

yimportantE.g.google,altavista,…Copyright:Silberschatz,KorthandSudarshan61InformationRetrievalSystems(Cont.)DifferencesfromdatabasesystemsIRsys

temsdon‟tdealwithtransactionalupdates(includingconcurrencycontrolandrecovery)Databasesystemsdealwithstruc

tureddata,withschemasthatdefinethedataorganizationIRsystemsdealwithsomequeryingissuesnotgenerallyaddressedbydatabasesystemsApproximatese

archingbykeywordsRankingofretrievedanswersbyestimateddegreeofrelevanceCopyright:Silberschatz,KorthandSudarshan62Keywo

rdSearchInfulltextretrieval,allthewordsineachdocumentareconsideredtobekeywords.WeusethewordtermtorefertothewordsinadocumentInforma

tion-retrievalsystemstypicallyallowqueryexpressionsformedusingkeywordsandthelogicalconnectivesand,or,andnotAndsareimplicit,evenifnotexplici

tlyspecifiedRankingofdocumentsonthebasisofestimatedrelevancetoaqueryiscriticalRelevancerankingisbasedonfactorssuchasTermfrequen

cyFrequencyofoccurrenceofquerykeywordindocumentInversedocumentfrequencyHowmanydocumentsthequerykeywordoccur

sinFewergivemoreimportancetokeywordHyperlinkstodocumentsMorelinkstoadocumentdocumentismoreimportantCopyright:Silberscha

tz,KorthandSudarshan63RelevanceRankingUsingTermsTF-IDF(Termfrequency/InverseDocumentfrequency)ranking:Le

tn(d)=numberoftermsinthedocumentdn(d,t)=numberofoccurrencesoftermtinthedocumentd.ThenrelevanceofadocumentdtoatermtThelo

gfactoristoavoidexcessiveweightagetofrequenttermsAndrelevanceofdocumenttoqueryQn(d)n(d,t)1+r(d,t)=logr(d,Q)=r(d,t)

n(t)tQCopyright:Silberschatz,KorthandSudarshan64RelevanceRankingUsingTerms(Cont.)MostsystemsaddtotheabovemodelWordsthatoccurintitle,aut

horlist,sectionheadings,etc.aregivengreaterimportanceWordswhosefirstoccurrenceislateinthedocumentaregivenlowerimpo

rtanceVerycommonwordssuchas“a”,“an”,“the”,“it”etcareeliminatedCalledstopwordsProximity:ifkeywords

inqueryoccurclosetogetherinthedocument,thedocumenthashigherimportancethaniftheyoccurfarapartDocumentsarereturnedindecreasingorderofrelevancescor

eUsuallyonlytopfewdocumentsarereturned,notallCopyright:Silberschatz,KorthandSudarshan65RelevanceUsingHyper

linksWhenusingkeywordqueriesontheWeb,thenumberofdocumentsisenormous(manybillions)Numberofdocumentsrelevanttoaquerycanbe

enormousifonlytermfrequenciesaretakenintoaccountUsingtermfrequenciesmakes“spamming”easyE.g.atravelagentcanaddmanyoccurrencesofthew

ords“travelagent”tohispagetomakeitsrankveryhighMostofthetimepeoplearelookingforpagesfrompopularsitesIdea:usepopularity

ofWebsite(e.g.howmanypeoplevisitit)toranksitepagesthatmatchgivenkeywordsProblem:hardtofindactualpopularityofsiteSolution:nextslideCopyright

:Silberschatz,KorthandSudarshan66RelevanceUsingHyperlinks(Cont.)Solution:usenumberofhyperlinkstoasiteasameasureofthepopularityorpresti

geofthesiteCountonlyonehyperlinkfromeachsite(why?)Popularitymeasureisforsite,notforindividualpageMosthyperlinksaretorootofsiteSite-popula

ritycomputationischeaperthanpagepopularitycomputationRefinementsWhencomputingprestigebasedonlinkstoasite,givemoreweightagetolinksfromsitesthatth

emselveshavehigherprestigeDefinitioniscircularSetupandsolvesystemofsimultaneouslinearequationsAboveideaisbasisoftheGooglePageRankrankingmechan

ismCopyright:Silberschatz,KorthandSudarshan67RelevanceUsingHyperlinks(Cont.)Connectionstosocialnetworkingth

eoriesthatrankedprestigeofpeopleE.g.thepresidentoftheUShasahighprestigesincemanypeopleknowhimSomeoneknownbymultiplepresti

giouspeoplehashighprestigeHubandauthoritybasedrankingAhubisapagethatstoreslinkstomanypages(onatopic)Anauthorityisapagethatcontainsactualinfor

mationonatopicEachpagegetsahubprestigebasedonprestigeofauthoritiesthatitpointstoEachpagegetsanauth

orityprestigebasedonprestigeofhubsthatpointtoitAgain,prestigedefinitionsarecyclic,andcanbegotbysolvinglinearequationsUsea

uthorityprestigewhenrankinganswerstoaqueryCopyright:Silberschatz,KorthandSudarshan68SimilarityBasedRetrievalSimilaritybasedretri

eval-retrievedocumentssimilartoagivendocumentSimilaritymaybedefinedonthebasisofcommonwordsE.g.findktermsinAwithh

ighestr(d,t)andusethesetermstofindrelevanceofotherdocuments;eachofthetermscarriesaweightofr(d,t)Similaritycanbeusedtor

efineanswersettokeywordqueryUserselectsafewrelevantdocumentsfromthoseretrievedbykeywordquery,andsystemfindsotherdocumentssimilartothes

eCopyright:Silberschatz,KorthandSudarshan69SynonymsandHomonymsSynonymsE.g.document:“motorcyclerepair”,query:“motorcyclemaintenance”needtorealizeth

at“maintenance”and“repair”aresynonymsSystemcanextendqueryas“motorcycleand(repairormaintenance)”HomonymsE.g.“object”hasdifferentmeaningsasnoun/v

erbCandisambiguatemeanings(tosomeextent)fromthecontextExtendingqueriesautomaticallyusingsynonymscanbeproblematicNeedtounder

standintendedmeaninginordertoinfersynonymsOrverifysynonymswithuserSynonymsmayhaveothermeaningsaswellCopyright:Silberschatz,KorthandSudarsha

n70IndexingofDocumentsAninvertedindexmapseachkeywordKitoasetofdocumentsSithatcontainthekeywordDocumentsident

ifiedbyidentifiersInvertedindexmayrecordKeywordlocationswithindocumenttoallowproximitybasedrankingCountsofnumberofoccurrencesofkeywordtocomput

eTFandoperation:FindsdocumentsthatcontainallofK1,K2,...,Kn.IntersectionS1S2.....Snoroperation:documentsthatcontainatleastoneofK1,K

2,…,Knunion,S1S2.....Sn,.EachSiiskeptsortedtoallowefficientintersection/unionbymerging“not”canalsobeefficientlyimplementedbymergingofsort

edlistsCopyright:Silberschatz,KorthandSudarshan71MeasuringRetrievalEffectivenessIRsystemssavespacebyusingindex

structuresthatsupportonlyapproximateretrieval.Mayresultin:falsenegative(falsedrop)-somerelevantdocumentsmaynotber

etrieved.falsepositive-someirrelevantdocumentsmayberetrieved.Formanyapplicationsagoodindexshouldnotpermitanyfalsedrops,

butmaypermitafewfalsepositives.Relevantperformancemetrics:Precision-whatpercentageoftheretrieveddocumentsarerelevanttothequery.Recall-what

percentageofthedocumentsrelevanttothequerywereretrieved.Copyright:Silberschatz,KorthandSudarshan72MeasuringRetrievalEffect

iveness(Cont.)Rankingordercanalsoresultinfalsepositives/falsenegativeRecallvs.precisiontradeoff:Canincrea

serecallbyretrievingmanydocuments(downtoalowlevelrelevanceranking),butmanyirrelevantdocumentswouldbefetched,reducingprecisionMeasuresof

retrievaleffectiveness:Recallasafunctionofnumberofdocumentsfetched,orPrecisionasafunctionofrecallEqui

valently,asafunctionofnumberofdocumentsfetchedE.g.“precisionof75%atrecallof50%,and60%atarecallof75%”Ingeneral:drawagraphofp

recisionvsrecall.Problem:whichdocumentsareactuallyrelevant,andwhichanotUsualsolution:humanjudgesCreateacorpus

ofdocumentsandqueries,withhumansdecidingwhichdocumentsarerelevanttowhichqueriesTREC(TextREtrievalConfe

rence)BenchmarkCopyright:Silberschatz,KorthandSudarshan73WebCrawlingWebcrawlersareprogramsthatlocateandgatherinformationontheWebRecursi

velyfollowhyperlinkspresentinknowndocuments,tofindotherdocumentsStartingfromaseedsetofdocumentsFetcheddocumentsHan

dedovertoanindexingsystemCanbediscardedafterindexing,orstoreasacachedcopyCrawlingtheentireWebwouldta

keaverylargeamountoftimeSearchenginestypicallycoveronlyapartoftheWeb,notallofitTakemonthstoperformasinglecrawlCopyright:Silb

erschatz,KorthandSudarshan74WebCrawling(Cont.)Crawlingisdonebymultipleprocessesonmultiplemachines,runninginparallel

SetoflinkstobecrawledstoredinadatabaseNewlinksfoundincrawledpagesaddedtothisset,tobecrawledlaterIndexingprocessalsorunsonmultiplemachinesCreatesa

newcopyofindexinsteadofmodifyingoldindexOldindexisusedtoanswerqueriesAfteracrawlis“completed”newindexbec

omes“old”indexMultiplemachinesusedtoanswerqueriesIndicesmaybekeptinmemoryQueriesmayberoutedtodifferentmachinesforloadbalancingCo

pyright:Silberschatz,KorthandSudarshan75BrowsingStoringrelateddocumentstogetherinalibraryfacilitatesbrowsinguserscanseeno

tonlyrequesteddocumentbutalsorelatedones.Browsingisfacilitatedbyclassificationsystemthatorganizeslogicallyrelateddo

cumentstogether.Organizationishierarchical:classificationhierarchyCopyright:Silberschatz,KorthandSudarshan76AClassificationHierarchyForALibra

rySystemCopyright:Silberschatz,KorthandSudarshan77ClassificationDAGDocumentscanresideinmultipleplacesinahierarchyinaninformationretrieval

system,sincephysicallocationisnotimportant.ClassificationhierarchyisthusDirectedAcyclicGraph(DAG)Copyright:Silberschatz,KorthandS

udarshan78AClassificationDAGForALibraryInformationRetrievalSystemCopyright:Silberschatz,KorthandSudarshan79WebDirectoriesAWebdirectoryisjustacla

ssificationdirectoryonWebpagesE.g.Yahoo!Directory,OpenDirectoryprojectIssues:Whatshouldthedirectoryhierarchybe?Givenado

cument,whichnodesofthedirectoryarecategoriesrelevanttothedocumentOftendonemanuallyClassificationofdocumentsintoahierarchymaybedonebasedontermsimi

larity

小橙橙
小橙橙
文档分享,欢迎浏览!
  • 文档 6209
  • 被下载 8
  • 被收藏 0
相关资源
广告代码123
若发现您的权益受到侵害,请立即联系客服,我们会尽快为您处理。侵权客服QQ:395972555 (支持时间:9:00-21:00) 公众号
Powered by 太赞文库
×
确认删除?