数据库原理与应用课程组课件

PPT
  • 阅读 130 次
  • 下载 0 次
  • 页数 79 页
  • 大小 731.500 KB
  • 2022-12-05 上传
  • 收藏
  • 违规举报
  • © 版权认领
下载文档30.00 元 加入VIP免费下载
此文档由【小橙橙】提供上传,收益归文档提供者,本网站只提供存储服务。若此文档侵犯了您的版权,欢迎进行违规举报版权认领
数据库原理与应用课程组课件
可在后台配置第一页与第二页中间广告代码
数据库原理与应用课程组课件
可在后台配置第二页与第三页中间广告代码
数据库原理与应用课程组课件
可在后台配置第三页与第四页中间广告代码
数据库原理与应用课程组课件
数据库原理与应用课程组课件
还剩10页未读,继续阅读
【这是免费文档,您可以免费阅读】
/ 79
  • 收藏
  • 违规举报
  • © 版权认领
下载文档30.00 元 加入VIP免费下载
文本内容

【文档说明】数据库原理与应用课程组课件.ppt,共(79)页,731.500 KB,由小橙橙上传

转载请保留链接:https://www.ichengzhen.cn/view-92522.html

以下为本文档部分文字说明:

Copyright:Silberschatz,KorthandSudarshan1Chapter22:AdvancedQueryingandInformationRetrieval陕西师范大学计算机科学学院数据库原理与应用课程组Copyrig

ht:Silberschatz,KorthandSudarshan2Chapter22:AdvancedQueryingandInformationRetrievalDecision-SupportSystemsDataAna

lysisOLAPExtendedaggregationfeaturesinSQLWindowingandrankingDataMiningDataWarehousingInformation-RetrievalS

ystemsIncludingWebsearchCopyright:Silberschatz,KorthandSudarshan3DecisionSupportSystemsDecision-supportsy

stemsareusedtomakebusinessdecisionsoftenbasedondatacollectedbyon-linetransaction-processingsystems.Examplesofbusi

nessdecisions:whatitemstostock?Whatinsurancepremiumtochange?Whotosendadvertisementsto?Examplesofdatausedformak

ingdecisionsRetailsalestransactiondetailsCustomerprofiles(income,age,sex,etc.)Copyright:Silberschatz,KorthandSudarshan4Decision-SupportSy

stems:OverviewDataanalysistasksaresimplifiedbyspecializedtoolsandSQLextensionsExampletasksForeachproductcategoryandeach

region,whatwerethetotalsalesinthelastquarterandhowdotheycomparewiththesamequarterlastyearAsabove,foreachproductcategorya

ndeachcustomercategoryStatisticalanalysispackages(e.g.,:S++)canbeinterfacedwithdatabasesStatisticalanalysisisalargefieldw

illnotstudyithereDataminingseekstodiscoverknowledgeautomaticallyintheformofstatisticalrulesandpatternsfromLargedatabases.Adatawarehousearchivesin

formationgatheredfrommultiplesources,andstoresitunderaunifiedschema,atasinglesite.Importantforlargebusi

nesseswhichgeneratedatafrommultipledivisions,possiblyatmultiplesitesDatamayalsobepurchasedexternallyCopyrig

ht:Silberschatz,KorthandSudarshan5DataAnalysisandOLAPAggregatefunctionssummarizelargevolumesofdataOnlineAnalyticalProc

essing(OLAP)Interactiveanalysisofdata,allowingdatatobesummarizedandviewedindifferentwaysinanonlinefashion(withnegligibledelay)

Datathatcanbemodeledasdimensionattributesandmeasureattributesarecalledmultidimensionaldata.Givenarelationusedfordataanalysis,wecanident

ifysomeofitsattributesasmeasureattributes,sincetheymeasuresomevalue,andcanbeaggregatedupon.Forinstance,theattributenumberofthe

salesrelationisameasureattribute,sinceitmeasuresthenumberofunitssold.Someoftheotherattributesoftherelationareidentifiedasdimensionattr

ibutes,sincetheydefinethedimensionsonwhichmeasureattributes,andsummariesofmeasureattributes,areviewed.Copyright:Silberschatz,KorthandSudarshan6Cro

ssTabulationofsalesbyitem-nameandcolorThetableaboveisanexampleofacross-tabulation(cross-tab),alsoreferre

dtoasapivot-table.Across-tabisatablewherevaluesforoneofthedimensionattributesformtherowheaders,valuesforanotherdimensionattribu

teformthecolumnheadersOtherdimensionattributesarelistedontopValuesinindividualcellsare(aggregatesof)t

hevaluesofthedimensionattributesthatspecifythecell.Copyright:Silberschatz,KorthandSudarshan7Relationa

lRepresentationofCrosstabsCrosstabscanberepresentedasrelationsThevalueallisusedtorepresentaggregatesTheSQL:1999standardactuallyuse

snullvaluesinplaceofallMoreonthislater….Copyright:Silberschatz,KorthandSudarshan8Three-DimensionalD

ataCubeAdatacubeisamultidimensionalgeneralizationofacrosstabCannotviewathree-dimensionalobjectinit

sentiretybutcrosstabscanbeusedasviewsonadatacubeCopyright:Silberschatz,KorthandSudarshan9OnlineAnalyticalProcessingTheoperationofchangingthedime

nsionsusedinacross-tabiscalledpivotingSupposeananalystwishestoseeacross-tabonitem-nameandcolorforafixedvalueofsize,forexample,large,insteadof

thesumacrossallsizes.Suchanoperationisreferredtoasslicing.Theoperationissometimescalleddicing,particularlywhenva

luesformultipledimensionsarefixed.Theoperationofmovingfromfiner-granularitydatatoacoarsergranularityiscalledarollup.Theopposite

operation-thatofmovingfromcoarser-granularitydatatofiner-granularitydata–iscalledadrilldown.Copyright:Silberschatz,KorthandS

udarshan10HierarchiesonDimensionsHierarchyondimensionattributes:letsdimensionstobeviewedatdifferentlevelsofdetailE.g.thedimensionDat

eTimecanbeusedtoaggregatebyhourofday,date,dayofweek,month,quarteroryearCopyright:Silberschatz,KorthandSudarshan11CrossTabulationWithHierarchyC

rosstabscanbeeasilyextendedtodealwithhierarchiesCandrilldownorrolluponahierarchyCopyright:Silbersch

atz,KorthandSudarshan12OLAPImplementationTheearliestOLAPsystemsusedmultidimensionalarraysinmemorytostoredatacubes,

andarereferredtoasmultidimensionalOLAP(MOLAP)systems.OLAPimplementationsusingonlyrelationaldatabasefeaturesarecalledrelationalOLAP(ROLAP

)systemsHybridsystems,whichstoresomesummariesinmemoryandstorethebasedataandothersummariesinarelationaldatabase,arecalledhybridOLAP(HOLAP)system

s.Copyright:Silberschatz,KorthandSudarshan13OLAPImplementation(Cont.)EarlyOLAPsystemsprecomputedallpossibleaggregatesino

rdertoprovideonlineresponseSpaceandtimerequirementsfordoingsocanbeveryhigh2ncombinationsofgroupbyItsufficestoprecomputesomeaggre

gates,andcomputeothersondemandfromoneoftheprecomputedaggregatesCancomputeaggregateon(item-name,color)fromanaggregateon

(item-name,color,size)Forallbutafew“non-decomposable”aggregatessuchasmedianischeaperthancomputingi

tfromscratchSeveraloptimizationsavailableforcomputingmultipleaggregatesCancomputeaggregateon(item-name,color)fromanaggregateon(item-name,color,si

ze)Cancomputeaggregateson(item-name,color,size),(item-name,color)and(item-name)usingasinglesortingofthe

basedataCopyright:Silberschatz,KorthandSudarshan14ExtendedAggregationSQL-92aggregationquitelimitedManyusefulaggregatesareeith

erveryhardorimpossibletospecifyDatacubeComplexaggregates(median,variance)binaryaggregates(correlation,regr

essioncurves)rankingqueries(“assigneachstudentarankbasedonthetotalmarks”SQL:1999OLAPextensionsprovideavarietyof

aggregationfunctionstoaddressabovelimitationsSupportedbyseveraldatabases,includingOracleandIBMDB2Copyright:Silberschatz,Kortha

ndSudarshan15ExtendedAggregationinSQL:1999Thecubeoperationcomputesunionofgroupby‟soneverysubsetofthespecifiedattributesE.g.considerthe

queryselectitem-name,color,size,sum(number)fromsalesgroupbycube(item-name,color,size)Thiscomputestheunionofeightdifferentgroupingsofthesalesrela

tion:{(item-name,color,size),(item-name,color),(item-name,size),(color,size),(item-name),(color),(size),()}where()denotesanemptygrou

pbylist.Foreachgrouping,theresultcontainsthenullvalueforattributesnotpresentinthegrouping.Copyright:Silberschatz,Kortha

ndSudarshan16ExtendedAggregation(Cont.)Relationalrepresentationofcrosstabthatwesawearlier,butwithnullinplace

ofall,canbecomputedbyselectitem-name,color,sum(number)fromsalesgroupbycube(item-name,color)Thefunctiongrouping()canbeapp

liedonanattributeReturns1ifthevalueisanullvaluerepresentingall,andreturns0inallothercases.selectitem-name,color,size,sum(number),grou

ping(item-name)asitem-name-flag,grouping(color)ascolor-flag,grouping(size)assize-flag,fromsalesgroupbycube(item-name

,color,size)Canusethefunctiondecode()intheselectclausetoreplacesuchnullsbyavaluesuchasallE.g.replaceitem-

nameinfirstquerybydecode(grouping(item-name),1,„all‟,item-name)Copyright:Silberschatz,KorthandSudarshan17ExtendedAggr

egation(Cont.)TherollupconstructgeneratesuniononeveryprefixofspecifiedlistofattributesE.g.selectitem-name,color,size,sum(nu

mber)fromsalesgroupbyrollup(item-name,color,size)Generatesunionoffourgroupings:{(item-name,color,size),(item-name,c

olor),(item-name),()}Rollupcanbeusedtogenerateaggregatesatmultiplelevelsofahierarchy.E.g.,supposetableitemca

tegory(item-name,category)givesthecategoryofeachitem.Thenselectcategory,item-name,sum(number)fromsales,itemcategorywheresales.item-name=itemcategor

y.item-namegroupbyrollup(category,item-name)wouldgiveahierarchicalsummarybyitem-nameandbycategory.Copyright:Silberschatz,KorthandSuda

rshan18ExtendedAggregation(Cont.)MultiplerollupsandcubescanbeusedinasinglegroupbyclauseEachgeneratessetofgroupb

ylists,crossproductofsetsgivesoverallsetofgroupbylistsE.g.,selectitem-name,color,size,sum(number)fromsalesgroupbyrollup(item-name),rollup(color,s

ize)generatesthegroupings{item-name,()}X{(color,size),(color),()}={(item-name,color,size),(item-name,color),(item-name),(color,si

ze),(color),()}Copyright:Silberschatz,KorthandSudarshan19RankingRankingisdoneinconjunctionwithanorderbyspecification.Givenarelationstudent-ma

rks(student-id,marks)findtherankofeachstudent.selectstudent-id,rank()over(orderbymarksdesc)ass-rankfromstudent-marksAnextraorderby

clauseisneededtogettheminsortedorderselectstudent-id,rank()over(orderbymarksdesc)ass-rankfromstudent-marksorderbys-rankRan

kingmayleavegaps:e.g.if2studentshavethesametopmark,bothhaverank1,andthenextrankis3dense_rankdoesnotleav

egaps,sonextdenserankwouldbe2Copyright:Silberschatz,KorthandSudarshan20Ranking(Cont.)Rankingcanbedonewi

thinpartitionofthedata.“Findtherankofstudentswithineachsection.”selectstudent-id,section,rank()over(partitionbysectionorderbymarksdesc)assec-r

ankfromstudent-marks,student-sectionwherestudent-marks.student-id=student-section.student-idorderbysection,sec-rankMultiplerankclausescanocc

urinasingleselectclauseRankingisdoneafterapplyinggroupbyclause/aggregationExercises:FindstudentswithtopnranksManysystemsprovidesp

ecial(non-standard)syntaxfor“top-n”queriesRankstudentsbysumoftheirmarksindifferentcoursesgivenrelatio

nstudent-course-marks(student-id,course,marks)Copyright:Silberschatz,KorthandSudarshan21Ranking(Cont.)Otherrankingfunctions:percent_

rank(withinpartition,ifpartitioningisdone)cume_dist(cumulativedistribution)fractionoftupleswithprec

edingvaluesrow_number(non-deterministicinpresenceofduplicates)SQL:1999permitstheusertospecifynullsfirstornullslastselectstudent-id,rank()over(o

rderbymarksdescnullslast)ass-rankfromstudent-marksCopyright:Silberschatz,KorthandSudarshan22Ranking(Cont.)Forag

ivenconstantn,therankingthefunctionntile(n)takesthetuplesineachpartitioninthespecifiedorder,anddividesthem

intonbucketswithqualnumbersoftuples.Forinstance,weansortemployeesbysalary,andusentile(3)tofindwhichrange(bottomthir

d,middlethird,ortopthird)eachemployeeisin,andcomputethetotalsalaryearnedbyemployeesineachrange:selectthreetile,sum(salary)from(selectsalary,ntile(3)o

ver(orderbysalary)asthreetilefromemployee)assgroupbythreetileCopyright:Silberschatz,KorthandSudarshan23Windo

wingE.g.:“Givensalesvaluesforeachdate,calculateforeachdatetheaverageofthesalesonthatday,thepreviousday,andthenextday”Suchmovingaverageq

ueriesareusedtosmoothoutrandomvariations.Incontrasttogroupby,thesametuplecanexistinmultiplewindowsWindowspecific

ationinSQL:Orderingoftuples,sizeofwindowforeachtuple,aggregatefunctionE.g.givenrelationsales(date,value)selectdate,sum(val

ue)over(orderbydatebetweenrows1precedingand1following)fromsalesExamplesofotherwindowspecifications:betweenrowsunboundedprecedingand

currentrowsunboundedprecedingrangebetween10precedingandcurrentrowAllrowswithvaluesbetweencurrentrowvalue–10tocurrentvalu

erangeinterval10dayprecedingNotincludingcurrentrowCopyright:Silberschatz,KorthandSudarshan24Windowing(Cont.)Candowindowingwithinpartitions

E.g.Givenarelationtransaction(account-number,date-time,value),wherevalueispositiveforadepositandnegativeforawithdrawal“Find

totalbalanceofeachaccountaftereachtransactionontheaccount”selectaccount-number,date-time,sum(value)over(partitionbyaccount-numberorderbyda

te-timerowsunboundedpreceding)asbalancefromtransactionorderbyaccount-number,date-timeCopyright:Silberschatz,KorthandSudarshan25Da

taMiningCopyright:Silberschatz,KorthandSudarshan26DataMiningBroadlyspeaking,dataminingistheprocessofsemi-

automaticallyanalyzinglargedatabasestofindusefulpatternsLikeknowledgediscoveryinartificialintelligencedataminingdiscoversstatisticalrulesand

patternsDiffersfrommachinelearninginthatitdealswithlargevolumesofdatastoredprimarilyondisk.Sometypesofknowledgediscoveredfromadatab

asecanberepresentedbyasetofrules.e.g.,:“Youngwomenwithannualincomesgreaterthan$50,000aremostlikelytobuysportscars”Othertypesofknowledgerepresen

tedbyequations,orbypredictionfunctionsSomemanualinterventionisusuallyrequiredPre-processingofdata,choiceofwhichtypeofpatternt

ofind,postprocessingtofindnovelpatternsCopyright:Silberschatz,KorthandSudarshan27ApplicationsofDataMiningPredictionbasedonpasthistor

yPredictifacreditcardapplicantposesagoodcreditrisk,basedonsomeattributes(income,jobtype,age,..)andpasth

istoryPredictifacustomerislikelytoswitchbrandloyaltyPredictifacustomerislikelytorespondto“junkmail”Predictifapatternofphonecallingcardus

ageislikelytobefraudulentSomeexamplesofpredictionmechanisms:ClassificationGivenatrainingsetconsistingofitemsbelongingt

odifferentclasses,andanewitemwhoseclassisunknown,predictwhichclassitbelongstoRegressionformulaegivenasetofparameter-va

luetofunction-resultmappingsforanunknownfunction,predictthefunction-resultforanewparameter-valueCopyright:Silberschat

z,KorthandSudarshan28ApplicationsofDataMining(Cont.)DescriptivePatternsAssociationsFindbooksthatareoftenboughtbythesamecustomers.Ifanewcust

omerbuysonesuchbook,suggestthathebuystheotherstoo.Othersimilarapplications:cameraaccessories,clothes,etc.Associations

mayalsobeusedasafirststepindetectingcausationE.g.associationbetweenexposuretochemicalXandcancer,ornewmedicineandcardiacproblemsClustersE.

g.typhoidcaseswereclusteredinanareasurroundingacontaminatedwellDetectionofclustersremainsimportantindetectingepidemicsCopyright:Silb

erschatz,KorthandSudarshan29ClassificationRulesClassificationruleshelpassignnewobjectstoasetofclasses.

E.g.,givenanewautomobileinsuranceapplicant,shouldheorshebeclassifiedaslowrisk,mediumriskorhighrisk?C

lassificationrulesforaboveexamplecoulduseavarietyofknowledge,suchaseducationallevelofapplicant,salaryofapplicant,ageofapplicant,etc.pers

onP,P.degree=mastersandP.income>75,000P.credit=excellentpersonP,P.degree=bachelorsand(P.income25,000andP.income75,000)P.credit=

goodRulesarenotnecessarilyexact:theremaybesomemisclassificationsClassificationrulescanbecompactlyshownasadecisiontree.Copyright:Silber

schatz,KorthandSudarshan30DecisionTreeCopyright:Silberschatz,KorthandSudarshan31ConstructionofDecisionTreesTrainingset:adatasamplei

nwhichthegroupingforeachtupleisalreadyknown.Considercreditriskexample:Supposedegreeischosentopartitionthedataatth

eroot.Sincedegreehasasmallnumberofpossiblevalues,onechildiscreatedforeachvalue.Ateachchildnodeofther

oot,furtherclassificationisdoneifrequired.Here,partitionsaredefinedbyincome.Sinceincomeisacontinuousattribute,somenumberof

intervalsarechosen,andonechildcreatedforeachinterval.Differentclassificationalgorithmsusedifferentwaysofchoosingwhichattributetopartiti

ononateachnode,andwhattheintervals,ifany,are.IngeneralDifferentbranchesofthetreecouldgrowtodifferentlevels.Differentno

desatthesamelevelmayusedifferentpartitioningattributes.Copyright:Silberschatz,KorthandSudarshan32ConstructionofDecisi

onTrees(Cont.)Greedytopdowngenerationofdecisiontrees.Eachinternalnodeofthetreepartitionsthedataintogroupsbasedonapartitioningattribute,an

dapartitioningconditionforthenodeMoreonchoosingpartioningattribute/conditionshortlyAlgorithmisgreedy:thechoiceismadeonceandnotrevisitedasmor

eofthetreeisconstructedThedataatanodeisnotpartitionedfurtherifeitherall(ormost)oftheitemsatthenodebelongtothesameclass,or

allattributeshavebeenconsidered,andnofurtherpartitioningispossible.Suchanodeisaleafnode.Otherwisethedataatt

henodeispartitionedfurtherbypickinganattributeforpartitioningdataatthenode.Copyright:Silberschatz,KorthandSudarshan33BestSplitsIdea:evalu

atedifferentattributesandpartitioningconditionsandpictheonethatbestimprovesthe“purity”ofthetrainingsetexamp

lesTheinitialtrainingsethasamixtureofinstancesfromdifferentclassesandisthusrelativelyimpureE.g.ifde

greeexactlypredictscreditrisk,partitioningondegreewouldresultineachildhavinginstancesofonlyoneclassI.e.,thechildnodeswouldbepureThepurit

yofasetSoftraininginstancescanbemeasuredquantitativelyinseveralways.Notation:numberofclasses=k,numberofinstances=|S|,fractionofinstancesin

classi=pi.TheGinimeasureofpurityisdefinedas[Gini(S)=1-Whenallinstancesareinasingleclass,theGinivalueis0,whileitreachesitsmaximum(of1–1

/k)ifeachclassthesamenumberofinstances.ki-1p2iCopyright:Silberschatz,KorthandSudarshan34BestSplits(Cont.)Anothermeasureofpurityi

stheentropymeasure,whichisdefinedasentropy(S)=–WhenasetSissplitintomultiplesetsSi,I=1,2,…,r,wecanmeasurethepurityofthere

sultantsetofsetsas:purity(S1,S2,…..,Sr)=TheinformationgainduetoparticularsplitofSintoSi,i=1,2,….,rInformation-gain(S,{S1,S2,….,Sr)=

purity(S)–purity(S1,S2,…Sr)ri=1|Si||S|purity(Si)ki-1pilog2piCopyright:Silberschatz,KorthandSudarshan35BestSpli

ts(Cont.)Measureof“cost”ofasplit:Information-content(S,{S1,S2,…..,Sr}))=–Information-gainratio=Information-gain(S,{S1,

S2,……,Sr})Information-content(S,{S1,S2,…..,Sr})ThebestsplitforanattributeistheonethatgivesthemaximuminformationgainratioContinuousvalu

edattributesCanbeorderedinafashionmeaningfultoclassificatione.g.integerandrealvaluesCategoricalattrib

utesCannotbemeaningfullyordered(e.g.country,school/university,item-color,.):log2ri-1|Si||S||Si||S|Copyright:Silberschatz,Kortha

ndSudarshan36FindingBestSplitsCategoricalattributes:Multi-waysplit,onechildforeachvaluemayhavetoomanychildreninsomecas

esBinarysplit:tryallpossiblebreakupofvaluesintotwosets,andpickthebestContinuousvaluedattributeBinarysplit:Sortvaluesintheins

tances,tryeachasasplitpointE.g.ifvaluesare1,10,15,25,splitat1,10,15PickthevaluethatgivesbestsplitMulti-waysplit:morecomplicated,seebibliog

raphicnotesAseriesofbinarysplitsonthesameattributehasroughlyequivalenteffectCopyright:Silberschatz,KorthandSudarshan3

7Decision-TreeConstructionAlgorithmProcedureGrow.Tree(S)Partition(S);ProcedurePartition(S)if(purity(S)>por|S|<s)thenreturn;foreachattribut

eAevaluatesplitsonattributeA;Usebestsplitfound(acrossallattributes)topartitionSintoS1,S2,….,Sr,fori=1,2,…..,rPartition(Si);Copyright:Silber

schatz,KorthandSudarshan38DecisionTreeConstr’nAlgo’s.(Cont.)Varietyofalgorithmshavebeendevelopedto

ReduceCPUcostand/orReduceIOcostwhenhandlingdatasetslargerthanmemoryImproveaccuracyofclassificationDecisiontreemaybe

overfitted,i.e.,overlytunedtogiventrainingsetPruningofdecisiontreemaybedoneonbranchesthathavetoofew

traininginstancesWhenasubtreeispruned,aninternalnodebecomesaleafanditsclassissettothemajorityclassoftheinstancesthatmaptothenodePrunin

gcanbedonebyusingapartofthetrainingsettobuildtree,andasecondparttotestthetreeprunesubtreesthatincreasemisclass

ificationonsecondpartCopyright:Silberschatz,KorthandSudarshan39OtherTypesofClassifiersFurthertypesofclassifie

rsNeuralnetclassifiersBayesianclassifiersNeuralnetclassifiersusethetrainingdatatotrainartificialneu

ralnetsWidelystudiedinAI,won‟tcoverhereBayesianclassifiersuseBayestheorem,whichsaysp(cj|d)=p(d|cj)p(cj)p(d)w

herep(cj|d)=probabilityofinstancedbeinginclasscj,p(d|cj)=probabilityofgeneratinginstancedgivenclassc

j,p(cj)=probabilityofoccurrenceofclasscj,andp(d)=probabilityofinstancedoccuringCopyright:Silberschatz,KorthandSudarshan40NaïveBayesianClassifiersBay

esianclassifiersrequirecomputationofp(d|cj)precomputationofp(cj)p(d)canbeignoredsinceitisthesameforallclassesTosimplifyt

hetask,naïveBayesianclassifiersassumeattributeshaveindependentdistributions,andtherebyestimatep(d|cj)=p(d1|cj)*p(d

2|cj)*….*(p(dn|cj)Eachofthep(di|cj)canbeestimatedfromahistogramondivaluesforeachclasscjthehistogramiscomputedfromthetraininginstancesHistogramson

multipleattributesaremoreexpensivetocomputeandstoreCopyright:Silberschatz,KorthandSudarshan41RegressionRegressiondealswit

hthepredictionofavalue,ratherthanaclass.Givenvaluesforasetofvariables,X1,X2,…,Xn,wewishtopredictthevalueofavariab

leY.Onewayistoinfercoefficientsa0,a1,a1,…,ansuchthatY=a0+a1*X1+a2*X2+…+an*XnFindingsuchalinearpolynomialiscalledlinearregression.Ingener

al,theprocessoffindingacurvethatfitsthedataisalsocalledcurvefitting.Thefitmayonlybeapproximatebecauseofnoiseinthedata,orbecausetherelationshipis

notexactlyapolynomialRegressionaimstofindcoefficientsthatgivethebestpossiblefit.Copyright:Silberschatz

,KorthandSudarshan42AssociationRulesRetailshopsareofteninterestedinassociationsbetweendifferentitemsthatpeoplebuy.S

omeonewhobuysbreadisquitelikelyalsotobuymilkApersonwhoboughtthebookDatabaseSystemConceptsisquitelikelyalsotobuy

thebookOperatingSystemConcepts.Associationsinformationcanbeusedinseveralways.E.g.whenacustomerbuysaparticularbook,anonlineshopmaysuggestas

sociatedbooks.Associationrules:breadmilkDB-Concepts,OS-ConceptsNetworksLefthandside:antecedent,righthandside:consequentAnassoci

ationrulemusthaveanassociatedpopulation;thepopulationconsistsofasetofinstancesE.g.eachtransaction(s

ale)atashopisaninstance,andthesetofalltransactionsisthepopulationCopyright:Silberschatz,KorthandSuda

rshan43AssociationRules(Cont.)Ruleshaveanassociatedsupport,aswellasanassociatedconfidence.Supportisameasureofwhatfract

ionofthepopulationsatisfiesboththeantecedentandtheconsequentoftherule.E.g.supposeonly0.001percentofallpurchasesincl

udemilkandscrewdrivers.Thesupportfortheruleismilkscrewdriversislow.WeusuallywantruleswithareasonablyhighsupportRuleswithlowsupportareusuallynotv

eryusefulConfidenceisameasureofhowoftentheconsequentistruewhentheantecedentistrue.E.g.therulebreadmilkhasa

confidenceof80percentif80percentofthepurchasesthatincludebreadalsoincludemilk.Usuallywantruleswithreasonablylargeconfidence.A

rulewithalowconfidenceisnotmeaningful.Notethattheconfidenceofbreadmilkmaybeverydifferentfromtheconfidenceofmilkbread,althoughbothhavethesam

esupports.Copyright:Silberschatz,KorthandSudarshan44FindingAssociationRulesWearegenerallyonlyinterestedinassociationruleswithreasona

blyhighsupport(e.g.supportof2%orgreater)Naïvealgorithm1.Considerallpossiblesetsofrelevantitems.2.Foreachse

tfinditssupport(i.e.counthowmanytransactionspurchaseallitemsintheset).Largeitemsets:setswithsufficientlyhighsupport3.Uselargeitemsetstoge

nerateassociationrules.1.FromitemsetAgeneratetheruleA-{b}bforeachbA.Supportofrule=support(A).Confidenceofrule=support(A)/supp

ort(A-{b})Copyright:Silberschatz,KorthandSudarshan45FindingSupportFewitemsets:determinesupportofallitemsetsviaasinglepassonsetoftransactionsAco

untismaintainedforeachitemset,initiallysetto0.Whenatransactionisfetched,thecountisincrementedforeachsetofitemsthatiscontainedinthet

ransaction.Largeitemsets:setswithahighcountattheendofthepassManyitemsets:Ifmemorynotenoughtoholdallcountsforallitemsetsusemultiplepasses,con

sideringonlysomeitemsetsineachpass.Optimization:Onceanitemsetiseliminatedbecauseitscount(support)istoosmallnoneofitssupersetsneed

stobeconsidered.Theaprioritechniquetofindlargeitemsets:Pass1:countsupportofallsetswithjust1item.Eliminatetho

seitemswithlowsupportPassi:candidates:everysetofiitemssuchthatallitsi-1itemsubsetsarelargeCountsupportof

allcandidatesStopiftherearenocandidatesCopyright:Silberschatz,KorthandSudarshan46OtherTypesofAssociationsB

asicassociationruleshaveseverallimitationsDeviationsfromtheexpectedprobabilityaremoreinterestingE.g.ifmanypeoplepurchasebread,a

ndmanypeoplepurchasecereal,quiteafewwouldbeexpectedtopurchaseboth(prob1*prob2)Weareinterestedinpositiveaswellasnegativecorrelationsbetweensets

ofitemsPositivecorrelation:co-occurrenceishigherthanpredictedNegativecorrelation:co-occurrenceislowerthanpredictedSequ

enceassociations/correlationsE.g.wheneverbondsgoup,stockpricesgodownin2daysDeviationsfromtemporalpatternsE.g.deviationfromastea

dygrowthE.g.salesofwinterweargodowninsummerNotsurprising,partofaknownpattern.Lookfordeviationfromvaluepredictedusingpastpattern

sCopyright:Silberschatz,KorthandSudarshan47ClusteringClustering:Intuitively,findingclustersofpointsinthegivendatasuchthatsimilarpo

intslieinthesameclusterCanbeformalizedusingdistancemetricsinseveralwaysE.g.Grouppointsintoksets(fo

ragivenk)suchthattheaveragedistanceofpointsfromthecentroidoftheirassignedgroupisminimizedCentroid:pointdefinedbytakingaverageof

coordinatesineachdimension.Anothermetric:minimizeaveragedistancebetweeneverypairofpointsinaclusterHa

sbeenstudiedextensivelyinstatistics,butonsmalldatasetsDataminingsystemsaimatclusteringtechniquesthatcanhandleverylargedatasets

E.g.theBirchclusteringalgorithm(moreshortly)Copyright:Silberschatz,KorthandSudarshan48HierarchicalClusteringExamplefrombiologicalclassi

fication(thewordclassificationheredoesnotmeanapredictionmechanism)chordatamammaliareptilialeopardshumanssnakescrocodile

sOtherexamples:Internetdirectorysystems(e.g.Yahoo,moreonthislater)AgglomerativeclusteringalgorithmsBuildsmallclus

ters,thenclustersmallclustersintobiggerclusters,andsoonDivisiveclusteringalgorithmsStartwithallitemsinasinglecluster,repe

atedlyrefine(break)clustersintosmalleronesCopyright:Silberschatz,KorthandSudarshan49ClusteringAlgorithmsClusteringalgorithmshavebeend

esignedtohandleverylargedatasetsE.g.theBirchalgorithmMainidea:useanin-memoryR-treetostorepointsthatarebeingclusteredInsertpointsoneata

timeintotheR-tree,merginganewpointwithanexistingclusterifislessthansomedistanceawayIftherearemoreleafnodesthanfitinmemory,mergeexisti

ngclustersthatareclosetoeachotherAttheendoffirstpasswegetalargenumberofclustersattheleavesoftheR-treeMergeclusterstoreducethenumb

erofclustersCopyright:Silberschatz,KorthandSudarshan50CollaborativeFilteringGoal:predictwhatmovies/books/…ap

ersonmaybeinterestedin,onthebasisofPastpreferencesofthepersonOtherpeoplewithsimilarpastpreferencesThepre

ferencesofsuchpeopleforanewmovie/book/…OneapproachbasedonrepeatedclusteringClusterpeopleonthebasisofpreferencesformovi

esThenclustermoviesonthebasisofbeinglikedbythesameclustersofpeopleAgainclusterpeoplebasedontheirpreferencesfor(the

newlycreatedclustersof)moviesRepeatabovetillequilibriumAboveproblemisaninstanceofcollaborativefiltering,whereuserscollaborateinthetaskoffilteringi

nformationtofindinformationofinterestCopyright:Silberschatz,KorthandSudarshan51OtherTypesofMiningTextmining:applicationofdataminingtotextualdo

cumentsE.g.clusterWebpagestofindrelatedpagesE.g.clusterpagesauserhasvisitedtoorganizetheirvisithis

toryE.g.classifyWebpagesautomaticallyintoaWebdirectoryDatavisualizationsystemshelpusersexaminelargevolumesofdataanddetectpatternsvisuall

yE.g.maps,charts,andcolor-codingE.g.locationswithproblemsshowninredonamapCanvisuallyencodelargeamountsofinformationonasinglescreenHumansarev

erygoodadetectingvisualpatternsCopyright:Silberschatz,KorthandSudarshan52DataWarehousingCopyright:Silberschatz,KorthandSudarshan53DataWarehousingLar

georganizationshavecomplexinternalorganizations,andhavedatastoredatdifferentlocations,ondifferentoperational(transact

ionprocessing)systems,underdifferentschemasDatasourcesoftenstoreonlycurrentdata,nothistoricaldataCorporatedecisionmakingrequiresaunifiedv

iewofallorganizationaldata,includinghistoricaldataAdatawarehouseisarepository(archive)ofinformationgatheredfrommultipleso

urces,storedunderaunifiedschema,atasinglesiteGreatlysimplifiesquerying,permitsstudyofhistoricaltrendsShiftsdecisionsupportqueryloadawayfromtransact

ionprocessingsystemsCopyright:Silberschatz,KorthandSudarshan54DataWarehousingCopyright:Silberschatz,Kortha

ndSudarshan55ComponentsofDataWarehouseWhenandhowtogatherdataSourcedrivenarchitecture:datasourcestransmitnewinformationtowarehouse,eithercontinuou

slyorperiodically(e.g.atnight)Destinationdrivenarchitecture:warehouseperiodicallyrequestsnewinformati

onfromdatasourcesKeepingwarehouseexactlysynchronizedwithdatasources(e.g.usingtwo-phasecommit)istooexpensiveUsuallyOKtohaveslightlyout-of-date

dataatwarehouseData/updatesareperiodicallydownloadedformonlinetransactionprocessing(OLTP)systems.WhatschematouseS

chemaintegrationCopyright:Silberschatz,KorthandSudarshan56ComponentsofDataWarehouse(Cont.)DatacleansingE.g.correctmistakesinaddressesE.g.mi

sspellings,zipcodeerrorsMergeaddresslistsfromdifferentsourcesandpurgeduplicatesKeeponlyoneaddressrecordperhousehold(“householdi

ng”)HowtopropagateupdatesWarehouseschemamaybea(materialized)viewofschemafromdatasourcesEfficienttechniquesforupda

teofmaterializedviewsWhatdatatosummarizeRawdatamaybetoolargetostoreon-lineAggregatevalues(totals/subtotals)oftensufficeQu

eriesonrawdatacanoftenbetransformedbyqueryoptimizertouseaggregatevaluesCopyright:Silberschatz,KorthandSuda

rshan57DataWarehouseSchemasCopyright:Silberschatz,KorthandSudarshan58WarehouseSchemasTypicallywarehousedataismultidimensional,withverylargefacttab

lesExamplesofdimensions:item-id,date/timeofsale,storewheresalewasmade,customeridentifierExamplesofmeasures:numberofitemssold,p

riceofitemsDimensionvaluesareusuallyencodedusingsmallintegersandmappedtofullvaluesviadimensiontablesResultantschemaiscalledastars

chemaMorecomplicatedschemastructuresSnowflakeschema:multiplelevelsofdimensiontablesConstellation:multiplefacttablesCopyri

ght:Silberschatz,KorthandSudarshan59InformationRetrievalCopyright:Silberschatz,KorthandSudarshan60InformationRetrievalSystemsInformatio

nretrieval(IR)systemsuseasimplerdatamodelthandatabasesystemsInformationorganizedasacollectionofdocumentsDocumentsareunstructured,noschema

Informationretrievallocatesrelevantdocuments,onthebasisofuserinputsuchaskeywordsorexampledocumentse.g.,finddocu

mentscontainingthewords“databasesystems”Canbeusedevenontextualdescriptionsprovidedwithnon-textualdatasuch

asimagesIRonWebdocumentshasbecomeextremelyimportantE.g.google,altavista,…Copyright:Silberschatz,KorthandSudarshan61Informa

tionRetrievalSystems(Cont.)DifferencesfromdatabasesystemsIRsystemsdon‟tdealwithtransactionalupdates(includingconcurrencycontrola

ndrecovery)Databasesystemsdealwithstructureddata,withschemasthatdefinethedataorganizationIRsystemsdealwithsomequeryingissuesnotgenerallyaddres

sedbydatabasesystemsApproximatesearchingbykeywordsRankingofretrievedanswersbyestimateddegreeofrelevanceCopyright:Silber

schatz,KorthandSudarshan62KeywordSearchInfulltextretrieval,allthewordsineachdocumentareconsideredtobekeywo

rds.WeusethewordtermtorefertothewordsinadocumentInformation-retrievalsystemstypicallyallowqueryexpressi

onsformedusingkeywordsandthelogicalconnectivesand,or,andnotAndsareimplicit,evenifnotexplicitlyspecified

RankingofdocumentsonthebasisofestimatedrelevancetoaqueryiscriticalRelevancerankingisbasedonfactorssuchasTermfrequencyFreq

uencyofoccurrenceofquerykeywordindocumentInversedocumentfrequencyHowmanydocumentsthequerykeywordoccursinFewergivemoreimportancetokeywordHyperl

inkstodocumentsMorelinkstoadocumentdocumentismoreimportantCopyright:Silberschatz,KorthandSudarshan63RelevanceR

ankingUsingTermsTF-IDF(Termfrequency/InverseDocumentfrequency)ranking:Letn(d)=numberoftermsinthedocumentdn(d,

t)=numberofoccurrencesoftermtinthedocumentd.ThenrelevanceofadocumentdtoatermtThelogfactoristoavoidexcessiveweightagetofrequenttermsAndrelevance

ofdocumenttoqueryQn(d)n(d,t)1+r(d,t)=logr(d,Q)=r(d,t)n(t)tQCopyright:Silberschatz,KorthandSudarshan64Rel

evanceRankingUsingTerms(Cont.)MostsystemsaddtotheabovemodelWordsthatoccurintitle,authorlist,sectionheadings,etc.aregivengreaterimport

anceWordswhosefirstoccurrenceislateinthedocumentaregivenlowerimportanceVerycommonwordssuchas“a”,“an”,“the”,“it”etcareeliminatedCalled

stopwordsProximity:ifkeywordsinqueryoccurclosetogetherinthedocument,thedocumenthashigherimportancethan

iftheyoccurfarapartDocumentsarereturnedindecreasingorderofrelevancescoreUsuallyonlytopfewdocumentsarereturned,notallCo

pyright:Silberschatz,KorthandSudarshan65RelevanceUsingHyperlinksWhenusingkeywordqueriesontheWeb,thenumberofdocumentsisenormous(manybillions)Number

ofdocumentsrelevanttoaquerycanbeenormousifonlytermfrequenciesaretakenintoaccountUsingtermfrequenciesmakes“spamming”easyE.g.atravelagentcana

ddmanyoccurrencesofthewords“travelagent”tohispagetomakeitsrankveryhighMostofthetimepeoplearelookingforpagesfrompopularsitesIdea:usepopularityofWe

bsite(e.g.howmanypeoplevisitit)toranksitepagesthatmatchgivenkeywordsProblem:hardtofindactualpopularityofsiteSolution:nextslideCopyri

ght:Silberschatz,KorthandSudarshan66RelevanceUsingHyperlinks(Cont.)Solution:usenumberofhyperlinkstoasiteas

ameasureofthepopularityorprestigeofthesiteCountonlyonehyperlinkfromeachsite(why?)Popularitymeasureisforsite,notforindividualpageMosthyperl

inksaretorootofsiteSite-popularitycomputationischeaperthanpagepopularitycomputationRefinementsWhencomputingprestigebas

edonlinkstoasite,givemoreweightagetolinksfromsitesthatthemselveshavehigherprestigeDefinitioniscircular

SetupandsolvesystemofsimultaneouslinearequationsAboveideaisbasisoftheGooglePageRankrankingmechanismCopyright:Silber

schatz,KorthandSudarshan67RelevanceUsingHyperlinks(Cont.)ConnectionstosocialnetworkingtheoriesthatrankedprestigeofpeopleE.g.thepresiden

toftheUShasahighprestigesincemanypeopleknowhimSomeoneknownbymultipleprestigiouspeoplehashighprestigeHubandauthoritybasedrankingAhubisa

pagethatstoreslinkstomanypages(onatopic)AnauthorityisapagethatcontainsactualinformationonatopicEachpagegetsahubprestigebasedon

prestigeofauthoritiesthatitpointstoEachpagegetsanauthorityprestigebasedonprestigeofhubsthatpointtoitAgain,prestigedefinitionsarecyclic,and

canbegotbysolvinglinearequationsUseauthorityprestigewhenrankinganswerstoaqueryCopyright:Silberschatz,KorthandSudarsha

n68SimilarityBasedRetrievalSimilaritybasedretrieval-retrievedocumentssimilartoagivendocumentSimilaritymaybedefinedontheb

asisofcommonwordsE.g.findktermsinAwithhighestr(d,t)andusethesetermstofindrelevanceofotherdocuments;eachofthetermscar

riesaweightofr(d,t)SimilaritycanbeusedtorefineanswersettokeywordqueryUserselectsafewrelevantdocumentsfromthoseretrievedb

ykeywordquery,andsystemfindsotherdocumentssimilartotheseCopyright:Silberschatz,KorthandSudarshan69Sy

nonymsandHomonymsSynonymsE.g.document:“motorcyclerepair”,query:“motorcyclemaintenance”needtorealizethat“maintenan

ce”and“repair”aresynonymsSystemcanextendqueryas“motorcycleand(repairormaintenance)”HomonymsE.g.“obj

ect”hasdifferentmeaningsasnoun/verbCandisambiguatemeanings(tosomeextent)fromthecontextExtendingqueriesa

utomaticallyusingsynonymscanbeproblematicNeedtounderstandintendedmeaninginordertoinfersynonymsOrverifysy

nonymswithuserSynonymsmayhaveothermeaningsaswellCopyright:Silberschatz,KorthandSudarshan70IndexingofDocumentsAninvertedindexmapseachkeywordKitoa

setofdocumentsSithatcontainthekeywordDocumentsidentifiedbyidentifiersInvertedindexmayrecordKeywordlocationswithindocumenttoallowp

roximitybasedrankingCountsofnumberofoccurrencesofkeywordtocomputeTFandoperation:FindsdocumentsthatcontainallofK1,K2,...,Kn.I

ntersectionS1S2.....Snoroperation:documentsthatcontainatleastoneofK1,K2,…,Knunion,S1S2.....Sn,.EachSiiskeptsortedtoallowefficien

tintersection/unionbymerging“not”canalsobeefficientlyimplementedbymergingofsortedlistsCopyright:Silberschatz,KorthandSudarshan71Mea

suringRetrievalEffectivenessIRsystemssavespacebyusingindexstructuresthatsupportonlyapproximateretrieval.Mayresultin:falsenegative(falsed

rop)-somerelevantdocumentsmaynotberetrieved.falsepositive-someirrelevantdocumentsmayberetrieved.Formanyapplicationsagoodindexshouldnotpe

rmitanyfalsedrops,butmaypermitafewfalsepositives.Relevantperformancemetrics:Precision-whatpercentageoftheretrieveddocumentsarereleva

nttothequery.Recall-whatpercentageofthedocumentsrelevanttothequerywereretrieved.Copyright:Silbersch

atz,KorthandSudarshan72MeasuringRetrievalEffectiveness(Cont.)Rankingordercanalsoresultinfalsepositives/falsenegativ

eRecallvs.precisiontradeoff:Canincreaserecallbyretrievingmanydocuments(downtoalowlevelrelevanceranking),butmanyirrelevantdocumentswouldb

efetched,reducingprecisionMeasuresofretrievaleffectiveness:Recallasafunctionofnumberofdocumentsfetched,orPrecisiona

safunctionofrecallEquivalently,asafunctionofnumberofdocumentsfetchedE.g.“precisionof75%atrecallof50%,and60%atarecallof75%”Ingeneral:dra

wagraphofprecisionvsrecall.Problem:whichdocumentsareactuallyrelevant,andwhichanotUsualsolution:humanjudgesCreateacorp

usofdocumentsandqueries,withhumansdecidingwhichdocumentsarerelevanttowhichqueriesTREC(TextREtrievalConferen

ce)BenchmarkCopyright:Silberschatz,KorthandSudarshan73WebCrawlingWebcrawlersareprogramsthatlocateandgathe

rinformationontheWebRecursivelyfollowhyperlinkspresentinknowndocuments,tofindotherdocumentsStartingfromaseedsetofdocumentsFetcheddocumentsHa

ndedovertoanindexingsystemCanbediscardedafterindexing,orstoreasacachedcopyCrawlingtheentireWebwouldtakeaverylargea

mountoftimeSearchenginestypicallycoveronlyapartoftheWeb,notallofitTakemonthstoperformasinglecrawlCopy

right:Silberschatz,KorthandSudarshan74WebCrawling(Cont.)Crawlingisdonebymultipleprocessesonmultiplem

achines,runninginparallelSetoflinkstobecrawledstoredinadatabaseNewlinksfoundincrawledpagesaddedtothisset,tobecrawle

dlaterIndexingprocessalsorunsonmultiplemachinesCreatesanewcopyofindexinsteadofmodifyingoldindexOldindexisusedtoanswerqueriesAft

eracrawlis“completed”newindexbecomes“old”indexMultiplemachinesusedtoanswerqueriesIndicesmaybekeptinmemoryQueriesmayberoutedtodifferent

machinesforloadbalancingCopyright:Silberschatz,KorthandSudarshan75BrowsingStoringrelateddocumentstoget

herinalibraryfacilitatesbrowsinguserscanseenotonlyrequesteddocumentbutalsorelatedones.Browsingisfacilitatedb

yclassificationsystemthatorganizeslogicallyrelateddocumentstogether.Organizationishierarchical:classificationhierarchyCopyright:Silberschatz,Kort

handSudarshan76AClassificationHierarchyForALibrarySystemCopyright:Silberschatz,KorthandSudarshan77ClassificationDAG

Documentscanresideinmultipleplacesinahierarchyinaninformationretrievalsystem,sincephysicallocationisnotimportant.Classificationhi

erarchyisthusDirectedAcyclicGraph(DAG)Copyright:Silberschatz,KorthandSudarshan78AClassificationDAGForALibraryInformationRetrievalSy

stemCopyright:Silberschatz,KorthandSudarshan79WebDirectoriesAWebdirectoryisjustaclassificationdirectoryo

nWebpagesE.g.Yahoo!Directory,OpenDirectoryprojectIssues:Whatshouldthedirectoryhierarchybe?Givenadocument,whic

hnodesofthedirectoryarecategoriesrelevanttothedocumentOftendonemanuallyClassificationofdocumentsintoahierarchymaybedonebasedonterms

imilarity

小橙橙
小橙橙
文档分享,欢迎浏览!
  • 文档 25747
  • 被下载 7
  • 被收藏 0
相关资源
广告代码123
若发现您的权益受到侵害,请立即联系客服,我们会尽快为您处理。侵权客服QQ:395972555 (支持时间:9:00-21:00) 公众号
Powered by 太赞文库
×
确认删除?