【文档说明】数据库原理与应用课程组课件.ppt,共(79)页,731.500 KB,由小橙橙上传
转载请保留链接:https://www.ichengzhen.cn/view-92522.html
以下为本文档部分文字说明:
Copyright:Silberschatz,KorthandSudarshan1Chapter22:AdvancedQueryingandInformationRetrieval陕西师范大学计算机科学学院数
据库原理与应用课程组Copyright:Silberschatz,KorthandSudarshan2Chapter22:AdvancedQueryingandInformationRetrievalDecision-SupportSystemsDataAnalysis
OLAPExtendedaggregationfeaturesinSQLWindowingandrankingDataMiningDataWarehousingInformation-RetrievalSystemsInc
ludingWebsearchCopyright:Silberschatz,KorthandSudarshan3DecisionSupportSystemsDecision-supportsystemsareusedtomakebusinessdecisionsoft
enbasedondatacollectedbyon-linetransaction-processingsystems.Examplesofbusinessdecisions:whatitemstostock?Whatinsu
rancepremiumtochange?Whotosendadvertisementsto?ExamplesofdatausedformakingdecisionsRetailsalestransactiondetailsCustomerprofiles(income
,age,sex,etc.)Copyright:Silberschatz,KorthandSudarshan4Decision-SupportSystems:OverviewDataanalysistasksaresimplifiedbyspecialized
toolsandSQLextensionsExampletasksForeachproductcategoryandeachregion,whatwerethetotalsalesinthelastquarterandhowdo
theycomparewiththesamequarterlastyearAsabove,foreachproductcategoryandeachcustomercategoryStatisticalanalysispackages(e.g.,:
S++)canbeinterfacedwithdatabasesStatisticalanalysisisalargefieldwillnotstudyithereDataminingseekstodiscoverknowledgeautomaticallyintheformofstatist
icalrulesandpatternsfromLargedatabases.Adatawarehousearchivesinformationgatheredfrommultiplesources,andstoresitun
deraunifiedschema,atasinglesite.Importantforlargebusinesseswhichgeneratedatafrommultipledivisions,p
ossiblyatmultiplesitesDatamayalsobepurchasedexternallyCopyright:Silberschatz,KorthandSudarshan5DataAna
lysisandOLAPAggregatefunctionssummarizelargevolumesofdataOnlineAnalyticalProcessing(OLAP)Interactiveanalysisofdata,allowingdatatobesummari
zedandviewedindifferentwaysinanonlinefashion(withnegligibledelay)Datathatcanbemodeledasdimensionattributesan
dmeasureattributesarecalledmultidimensionaldata.Givenarelationusedfordataanalysis,wecanidentifysomeofitsattri
butesasmeasureattributes,sincetheymeasuresomevalue,andcanbeaggregatedupon.Forinstance,theattributenumberofthesalesrelationisame
asureattribute,sinceitmeasuresthenumberofunitssold.Someoftheotherattributesoftherelationareidentifiedasdimensio
nattributes,sincetheydefinethedimensionsonwhichmeasureattributes,andsummariesofmeasureattributes,areviewed.Copyright:Silberschatz,KorthandSudarsh
an6CrossTabulationofsalesbyitem-nameandcolorThetableaboveisanexampleofacross-tabulation(cross-tab),alsoreferredtoasapivot-table.Acro
ss-tabisatablewherevaluesforoneofthedimensionattributesformtherowheaders,valuesforanotherdimensionattributeformthecolumnheadersOtherdimensio
nattributesarelistedontopValuesinindividualcellsare(aggregatesof)thevaluesofthedimensionattributesthatspecifythecell.C
opyright:Silberschatz,KorthandSudarshan7RelationalRepresentationofCrosstabsCrosstabscanberepresentedasrelationsTheva
lueallisusedtorepresentaggregatesTheSQL:1999standardactuallyusesnullvaluesinplaceofallMoreonthislater….Copyright:Sil
berschatz,KorthandSudarshan8Three-DimensionalDataCubeAdatacubeisamultidimensionalgeneralizationofacro
sstabCannotviewathree-dimensionalobjectinitsentiretybutcrosstabscanbeusedasviewsonadatacubeCopyright:Silberschatz,KorthandSudarshan9OnlineAnalyti
calProcessingTheoperationofchangingthedimensionsusedinacross-tabiscalledpivotingSupposeananalystwish
estoseeacross-tabonitem-nameandcolorforafixedvalueofsize,forexample,large,insteadofthesumacrossallsizes.Sucha
noperationisreferredtoasslicing.Theoperationissometimescalleddicing,particularlywhenvaluesformultipledimensionsarefixed.Theoperatio
nofmovingfromfiner-granularitydatatoacoarsergranularityiscalledarollup.Theoppositeoperation-thatofmovingfromcoarse
r-granularitydatatofiner-granularitydata–iscalledadrilldown.Copyright:Silberschatz,KorthandSudarshan10Hiera
rchiesonDimensionsHierarchyondimensionattributes:letsdimensionstobeviewedatdifferentlevelsofdetailE.g.thedimensionD
ateTimecanbeusedtoaggregatebyhourofday,date,dayofweek,month,quarteroryearCopyright:Silberschatz,KorthandSudarshan11Cross
TabulationWithHierarchyCrosstabscanbeeasilyextendedtodealwithhierarchiesCandrilldownorrolluponahierarchyCopyright:Silberschatz,Korth
andSudarshan12OLAPImplementationTheearliestOLAPsystemsusedmultidimensionalarraysinmemorytostoredatacubes,andarereferredtoasmu
ltidimensionalOLAP(MOLAP)systems.OLAPimplementationsusingonlyrelationaldatabasefeaturesarecalledrela
tionalOLAP(ROLAP)systemsHybridsystems,whichstoresomesummariesinmemoryandstorethebasedataandothersummariesinarelat
ionaldatabase,arecalledhybridOLAP(HOLAP)systems.Copyright:Silberschatz,KorthandSudarshan13OLAPImplementation(Cont.)EarlyOLAPsystemsprecomputedal
lpossibleaggregatesinordertoprovideonlineresponseSpaceandtimerequirementsfordoingsocanbeveryhigh2ncombinationsofgroupbyItsufficestoprecomputesome
aggregates,andcomputeothersondemandfromoneoftheprecomputedaggregatesCancomputeaggregateon(item-name,color)fromanaggregateon(i
tem-name,color,size)Forallbutafew“non-decomposable”aggregatessuchasmedianischeaperthancomputingitfr
omscratchSeveraloptimizationsavailableforcomputingmultipleaggregatesCancomputeaggregateon(item-name,color)fromanaggregateon(item-name,color,
size)Cancomputeaggregateson(item-name,color,size),(item-name,color)and(item-name)usingasinglesortingofthebasedataCopyright:Silberschatz,KorthandS
udarshan14ExtendedAggregationSQL-92aggregationquitelimitedManyusefulaggregatesareeitherveryhardorimpossibletospecifyDatacube
Complexaggregates(median,variance)binaryaggregates(correlation,regressioncurves)rankingqueries(“a
ssigneachstudentarankbasedonthetotalmarks”SQL:1999OLAPextensionsprovideavarietyofaggregationfunctionstoaddressabov
elimitationsSupportedbyseveraldatabases,includingOracleandIBMDB2Copyright:Silberschatz,KorthandSudarshan15Extend
edAggregationinSQL:1999Thecubeoperationcomputesunionofgroupby‟soneverysubsetofthespecifiedattributesE.g.considerthequeryselectitem-name,c
olor,size,sum(number)fromsalesgroupbycube(item-name,color,size)Thiscomputestheunionofeightdifferentgroupi
ngsofthesalesrelation:{(item-name,color,size),(item-name,color),(item-name,size),(color,size),(item-name),(color),(size),()}where()denotesanemptygrou
pbylist.Foreachgrouping,theresultcontainsthenullvalueforattributesnotpresentinthegrouping.Copyright:Silb
erschatz,KorthandSudarshan16ExtendedAggregation(Cont.)Relationalrepresentationofcrosstabthatwesawearlier,butwit
hnullinplaceofall,canbecomputedbyselectitem-name,color,sum(number)fromsalesgroupbycube(item-name,col
or)Thefunctiongrouping()canbeappliedonanattributeReturns1ifthevalueisanullvaluerepresentingall,andreturns0inallothercases.selectitem-name,c
olor,size,sum(number),grouping(item-name)asitem-name-flag,grouping(color)ascolor-flag,grouping(size)assize-flag,f
romsalesgroupbycube(item-name,color,size)Canusethefunctiondecode()intheselectclausetoreplacesuchnullsbyavaluesuchasallE.g.replaceitem-namei
nfirstquerybydecode(grouping(item-name),1,„all‟,item-name)Copyright:Silberschatz,KorthandSudarshan17Extende
dAggregation(Cont.)TherollupconstructgeneratesuniononeveryprefixofspecifiedlistofattributesE.g.selectitem-name,color,size,sum(number)fromsalesg
roupbyrollup(item-name,color,size)Generatesunionoffourgroupings:{(item-name,color,size),(item-name,color),(item-name),()}Rollup
canbeusedtogenerateaggregatesatmultiplelevelsofahierarchy.E.g.,supposetableitemcategory(item-name,category)givesthecateg
oryofeachitem.Thenselectcategory,item-name,sum(number)fromsales,itemcategorywheresales.item-name=itemcategory.item-namegroupbyro
llup(category,item-name)wouldgiveahierarchicalsummarybyitem-nameandbycategory.Copyright:Silberschatz,KorthandSudarshan18ExtendedAg
gregation(Cont.)MultiplerollupsandcubescanbeusedinasinglegroupbyclauseEachgeneratessetofgroupbylists,crossproductofse
tsgivesoverallsetofgroupbylistsE.g.,selectitem-name,color,size,sum(number)fromsalesgroupbyrollup(item-name),rollup(color,size)generates
thegroupings{item-name,()}X{(color,size),(color),()}={(item-name,color,size),(item-name,color),(item-name),(color,size),(color),()}Copyright:Silb
erschatz,KorthandSudarshan19RankingRankingisdoneinconjunctionwithanorderbyspecification.Givenarelationstudent-marks(
student-id,marks)findtherankofeachstudent.selectstudent-id,rank()over(orderbymarksdesc)ass-rankfroms
tudent-marksAnextraorderbyclauseisneededtogettheminsortedorderselectstudent-id,rank()over(orderbymarksdesc)ass-rankfromstu
dent-marksorderbys-rankRankingmayleavegaps:e.g.if2studentshavethesametopmark,bothhaverank1,andthenextrankis3dense_ran
kdoesnotleavegaps,sonextdenserankwouldbe2Copyright:Silberschatz,KorthandSudarshan20Ranking(Cont.)Ra
nkingcanbedonewithinpartitionofthedata.“Findtherankofstudentswithineachsection.”selectstudent-id,section,rank()over(partitionbysection
orderbymarksdesc)assec-rankfromstudent-marks,student-sectionwherestudent-marks.student-id=student-section.student-idorderbysection,sec-rankMultip
lerankclausescanoccurinasingleselectclauseRankingisdoneafterapplyinggroupbyclause/aggregationExercises:Findstudentswithtop
nranksManysystemsprovidespecial(non-standard)syntaxfor“top-n”queriesRankstudentsbysumoftheirmarksindifferentcoursesgivenrelationstudent-course-
marks(student-id,course,marks)Copyright:Silberschatz,KorthandSudarshan21Ranking(Cont.)Otherrankingfunctions:percent_rank(withinpartition,ifpa
rtitioningisdone)cume_dist(cumulativedistribution)fractionoftupleswithprecedingvaluesrow_number(non-determ
inisticinpresenceofduplicates)SQL:1999permitstheusertospecifynullsfirstornullslastselectstudent-id,rank()over(orderbymarksdescnullslast)ass-rank
fromstudent-marksCopyright:Silberschatz,KorthandSudarshan22Ranking(Cont.)Foragivenconstantn,therankingthefunctionntile(n)takesthetuplesin
eachpartitioninthespecifiedorder,anddividesthemintonbucketswithqualnumbersoftuples.Forinstance,weansortempl
oyeesbysalary,andusentile(3)tofindwhichrange(bottomthird,middlethird,ortopthird)eachemployeeisin,andcomputethetotalsalaryearnedbyemployeesi
neachrange:selectthreetile,sum(salary)from(selectsalary,ntile(3)over(orderbysalary)asthreetilefromemployee)assgroupb
ythreetileCopyright:Silberschatz,KorthandSudarshan23WindowingE.g.:“Givensalesvaluesforeachdate,calculateforeachdatetheaverageofthesalesonthatda
y,thepreviousday,andthenextday”Suchmovingaveragequeriesareusedtosmoothoutrandomvariations.Incontras
ttogroupby,thesametuplecanexistinmultiplewindowsWindowspecificationinSQL:Orderingoftuples,sizeofwindowforeachtuple,aggregatefunctionE.g.givenrela
tionsales(date,value)selectdate,sum(value)over(orderbydatebetweenrows1precedingand1following)fromsalesExamplesofo
therwindowspecifications:betweenrowsunboundedprecedingandcurrentrowsunboundedprecedingrangebetween10preceding
andcurrentrowAllrowswithvaluesbetweencurrentrowvalue–10tocurrentvaluerangeinterval10dayprecedingNotincludingcurr
entrowCopyright:Silberschatz,KorthandSudarshan24Windowing(Cont.)CandowindowingwithinpartitionsE.g.Givenarelationtransaction(account-number,
date-time,value),wherevalueispositiveforadepositandnegativeforawithdrawal“Findtotalbalanceofeachacco
untaftereachtransactionontheaccount”selectaccount-number,date-time,sum(value)over(partitionbyaccount-numberorderbydate-timerowsunbo
undedpreceding)asbalancefromtransactionorderbyaccount-number,date-timeCopyright:Silberschatz,KorthandSudarshan25Dat
aMiningCopyright:Silberschatz,KorthandSudarshan26DataMiningBroadlyspeaking,dataminingistheprocessofsemi-
automaticallyanalyzinglargedatabasestofindusefulpatternsLikeknowledgediscoveryinartificialintelligenceda
taminingdiscoversstatisticalrulesandpatternsDiffersfrommachinelearninginthatitdealswithlargevolumesofdatastoredprimarilyondisk.Somety
pesofknowledgediscoveredfromadatabasecanberepresentedbyasetofrules.e.g.,:“Youngwomenwithannualincomesgreaterthan$50,000aremostlikel
ytobuysportscars”Othertypesofknowledgerepresentedbyequations,orbypredictionfunctionsSomemanualinterventionisusuallyrequiredPre-process
ingofdata,choiceofwhichtypeofpatterntofind,postprocessingtofindnovelpatternsCopyright:Silberschatz,KorthandSudarshan27Appli
cationsofDataMiningPredictionbasedonpasthistoryPredictifacreditcardapplicantposesagoodcreditrisk,basedonsomeattributes(inc
ome,jobtype,age,..)andpasthistoryPredictifacustomerislikelytoswitchbrandloyaltyPredictifacustomerislikelytor
espondto“junkmail”PredictifapatternofphonecallingcardusageislikelytobefraudulentSomeexamplesofpredictionmechanisms:ClassificationGivenatr
ainingsetconsistingofitemsbelongingtodifferentclasses,andanewitemwhoseclassisunknown,predictwhichclassitbelongstoRegressionformulaegivenas
etofparameter-valuetofunction-resultmappingsforanunknownfunction,predictthefunction-resultforanewparameter
-valueCopyright:Silberschatz,KorthandSudarshan28ApplicationsofDataMining(Cont.)DescriptivePatternsAssociationsFindbooksthatareoftenboughtbyth
esamecustomers.Ifanewcustomerbuysonesuchbook,suggestthathebuystheotherstoo.Othersimilarapplications:cameraaccessories,clothes,etc.
AssociationsmayalsobeusedasafirststepindetectingcausationE.g.associationbetweenexposuretochemicalXandcancer,ornewmedicineandcardiacprob
lemsClustersE.g.typhoidcaseswereclusteredinanareasurroundingacontaminatedwellDetectionofclustersremainsim
portantindetectingepidemicsCopyright:Silberschatz,KorthandSudarshan29ClassificationRulesClassificationruleshelpassignnewobjectstoasetofclasses.E.g.,
givenanewautomobileinsuranceapplicant,shouldheorshebeclassifiedaslowrisk,mediumriskorhighrisk?Classificat
ionrulesforaboveexamplecoulduseavarietyofknowledge,suchaseducationallevelofapplicant,salaryofapplicant,age
ofapplicant,etc.personP,P.degree=mastersandP.income>75,000P.credit=excellentpersonP,P.degree=bachelorsand(P.income25,000and
P.income75,000)P.credit=goodRulesarenotnecessarilyexact:theremaybesomemisclassificationsClassific
ationrulescanbecompactlyshownasadecisiontree.Copyright:Silberschatz,KorthandSudarshan30DecisionTreeCopyright:Silberschatz,K
orthandSudarshan31ConstructionofDecisionTreesTrainingset:adatasampleinwhichthegroupingforeachtupleisalreadyknown.C
onsidercreditriskexample:Supposedegreeischosentopartitionthedataattheroot.Sincedegreehasasmallnumberofpossiblevalues,onechildiscreatedforeachvalue
.Ateachchildnodeoftheroot,furtherclassificationisdoneifrequired.Here,partitionsaredefinedbyincome.Sinceincomeisacontinuousattribute,so
menumberofintervalsarechosen,andonechildcreatedforeachinterval.Differentclassificationalgorithmsusedifferentwaysofchoosi
ngwhichattributetopartitiononateachnode,andwhattheintervals,ifany,are.IngeneralDifferentbranchesofthetreecouldgrowtodifferentlevels.Differen
tnodesatthesamelevelmayusedifferentpartitioningattributes.Copyright:Silberschatz,KorthandSudarshan32Constructi
onofDecisionTrees(Cont.)Greedytopdowngenerationofdecisiontrees.Eachinternalnodeofthetreepartitionsthedat
aintogroupsbasedonapartitioningattribute,andapartitioningconditionforthenodeMoreonchoosingpartioningattribute/condit
ionshortlyAlgorithmisgreedy:thechoiceismadeonceandnotrevisitedasmoreofthetreeisconstructedThedataatanodeisnotpartitionedfurtherifei
therall(ormost)oftheitemsatthenodebelongtothesameclass,orallattributeshavebeenconsidered,andnofurtherpartitioningispossible.Suchanodeisalea
fnode.Otherwisethedataatthenodeispartitionedfurtherbypickinganattributeforpartitioningdataatthenode.Copyright:Silberschatz
,KorthandSudarshan33BestSplitsIdea:evaluatedifferentattributesandpartitioningconditionsandpictheonethatbestimprovesthe“purity”ofthetrainingsetexam
plesTheinitialtrainingsethasamixtureofinstancesfromdifferentclassesandisthusrelativelyimpureE.g.ifdegreeexactlypredictscreditrisk,partitioningo
ndegreewouldresultineachildhavinginstancesofonlyoneclassI.e.,thechildnodeswouldbepureThepurityofasetSoftraininginstanc
escanbemeasuredquantitativelyinseveralways.Notation:numberofclasses=k,numberofinstances=|S|,fractionofins
tancesinclassi=pi.TheGinimeasureofpurityisdefinedas[Gini(S)=1-Whenallinstancesareinasingleclass,theGinivalueis0,whileitreachesitsma
ximum(of1–1/k)ifeachclassthesamenumberofinstances.ki-1p2iCopyright:Silberschatz,KorthandSudarshan34BestSplits(Con
t.)Anothermeasureofpurityistheentropymeasure,whichisdefinedasentropy(S)=–WhenasetSissplitintomultiplesetsSi,I=1,2,…,r,wecanmeasurethepurityof
theresultantsetofsetsas:purity(S1,S2,…..,Sr)=TheinformationgainduetoparticularsplitofSintoSi,i=1,2,….,rInformation-gain(S,{S1,S2,….
,Sr)=purity(S)–purity(S1,S2,…Sr)ri=1|Si||S|purity(Si)ki-1pilog2piCopyright:Silberschatz,KorthandSudarshan35BestSplits(Cont.)Measureof“cost”ofa
split:Information-content(S,{S1,S2,…..,Sr}))=–Information-gainratio=Information-gain(S,{S1,S2,……,Sr})Information-content(S,
{S1,S2,…..,Sr})ThebestsplitforanattributeistheonethatgivesthemaximuminformationgainratioContinuousvaluedattributesCanbeorderedinafashionmeani
ngfultoclassificatione.g.integerandrealvaluesCategoricalattributesCannotbemeaningfullyordered(e.g.country,school/university,item-col
or,.):log2ri-1|Si||S||Si||S|Copyright:Silberschatz,KorthandSudarshan36FindingBestSplitsCategoricalattributes:Mul
ti-waysplit,onechildforeachvaluemayhavetoomanychildreninsomecasesBinarysplit:tryallpossiblebreakup
ofvaluesintotwosets,andpickthebestContinuousvaluedattributeBinarysplit:Sortvaluesintheinstances,tryeacha
sasplitpointE.g.ifvaluesare1,10,15,25,splitat1,10,15PickthevaluethatgivesbestsplitMulti-waysplit:morecomplicated,s
eebibliographicnotesAseriesofbinarysplitsonthesameattributehasroughlyequivalenteffectCopyright:Silberscha
tz,KorthandSudarshan37Decision-TreeConstructionAlgorithmProcedureGrow.Tree(S)Partition(S);ProcedurePartition(S)if(purity(S)>por|S|<s)thenretu
rn;foreachattributeAevaluatesplitsonattributeA;Usebestsplitfound(acrossallattributes)topartitionSintoS1,S2,….,Sr,fori=1,2,…..,rPartition(Si);Copyri
ght:Silberschatz,KorthandSudarshan38DecisionTreeConstr’nAlgo’s.(Cont.)VarietyofalgorithmshavebeendevelopedtoReduceCPUcos
tand/orReduceIOcostwhenhandlingdatasetslargerthanmemoryImproveaccuracyofclassificationDecisiontreemaybeoverfitted,i.e.,overlytuned
togiventrainingsetPruningofdecisiontreemaybedoneonbranchesthathavetoofewtraininginstancesWhenasubtreeispruned,aninternalnod
ebecomesaleafanditsclassissettothemajorityclassoftheinstancesthatmaptothenodePruningcanbedonebyusingapartofthetrainingse
ttobuildtree,andasecondparttotestthetreeprunesubtreesthatincreasemisclassificationonsecondpartCopyr
ight:Silberschatz,KorthandSudarshan39OtherTypesofClassifiersFurthertypesofclassifiersNeuralnetclassifiers
BayesianclassifiersNeuralnetclassifiersusethetrainingdatatotrainartificialneuralnetsWidelystudiedinAI,won‟tcoverhereBayesianclassifiersuseB
ayestheorem,whichsaysp(cj|d)=p(d|cj)p(cj)p(d)wherep(cj|d)=probabilityofinstancedbeinginclasscj,p(d|cj)=probabi
lityofgeneratinginstancedgivenclasscj,p(cj)=probabilityofoccurrenceofclasscj,andp(d)=probabilityofinstancedoccuringCopyright:S
ilberschatz,KorthandSudarshan40NaïveBayesianClassifiersBayesianclassifiersrequirecomputationofp(d|cj
)precomputationofp(cj)p(d)canbeignoredsinceitisthesameforallclassesTosimplifythetask,naïveBayesianclassif
iersassumeattributeshaveindependentdistributions,andtherebyestimatep(d|cj)=p(d1|cj)*p(d2|cj)*….*(p(dn|cj)Eachofthep(di|cj)canbeestimatedfromahis
togramondivaluesforeachclasscjthehistogramiscomputedfromthetraininginstancesHistogramsonmultipleattributesaremoreexpensivetocomputeandsto
reCopyright:Silberschatz,KorthandSudarshan41RegressionRegressiondealswiththepredictionofavalue,rathert
hanaclass.Givenvaluesforasetofvariables,X1,X2,…,Xn,wewishtopredictthevalueofavariableY.Onewayistoinfercoefficientsa0,a1,a1,…,
ansuchthatY=a0+a1*X1+a2*X2+…+an*XnFindingsuchalinearpolynomialiscalledlinearregression.Ingeneral,theprocessoffindinga
curvethatfitsthedataisalsocalledcurvefitting.Thefitmayonlybeapproximatebecauseofnoiseinthedata,orbecausetherel
ationshipisnotexactlyapolynomialRegressionaimstofindcoefficientsthatgivethebestpossiblefit.Copyright:Silberschatz,KorthandSudarshan42Associa
tionRulesRetailshopsareofteninterestedinassociationsbetweendifferentitemsthatpeoplebuy.Someonewhobuysbreadisquitelikelyalso
tobuymilkApersonwhoboughtthebookDatabaseSystemConceptsisquitelikelyalsotobuythebookOperatingSystemConcepts.Associationsinformat
ioncanbeusedinseveralways.E.g.whenacustomerbuysaparticularbook,anonlineshopmaysuggestassociatedbooks.Associationrules:breadmi
lkDB-Concepts,OS-ConceptsNetworksLefthandside:antecedent,righthandside:consequentAnassociationrulemusthaveanassociatedpopulation;thep
opulationconsistsofasetofinstancesE.g.eachtransaction(sale)atashopisaninstance,andthesetofalltransactionsisthepopulat
ionCopyright:Silberschatz,KorthandSudarshan43AssociationRules(Cont.)Ruleshaveanassociatedsupport,aswellasanassociatedconfidence.Supportisameasureof
whatfractionofthepopulationsatisfiesboththeantecedentandtheconsequentoftherule.E.g.supposeonly0.001percentofallpurchasesincludemilk
andscrewdrivers.Thesupportfortheruleismilkscrewdriversislow.Weusuallywantruleswithareasonablyhighsupport
RuleswithlowsupportareusuallynotveryusefulConfidenceisameasureofhowoftentheconsequentistruewhentheant
ecedentistrue.E.g.therulebreadmilkhasaconfidenceof80percentif80percentofthepurchasesthatincludebreadal
soincludemilk.Usuallywantruleswithreasonablylargeconfidence.Arulewithalowconfidenceisnotmeaningful.Notethattheconfidenceofbre
admilkmaybeverydifferentfromtheconfidenceofmilkbread,althoughbothhavethesamesupports.Copyright:Silberscha
tz,KorthandSudarshan44FindingAssociationRulesWearegenerallyonlyinterestedinassociationruleswithreasonablyhighs
upport(e.g.supportof2%orgreater)Naïvealgorithm1.Considerallpossiblesetsofrelevantitems.2.Foreachsetfinditssupport(i.e.counthowmanytransac
tionspurchaseallitemsintheset).Largeitemsets:setswithsufficientlyhighsupport3.Uselargeitemsetstogenerateassociationrules.1.FromitemsetAgeneratether
uleA-{b}bforeachbA.Supportofrule=support(A).Confidenceofrule=support(A)/support(A-{b})Copyright:Silberschatz,KorthandSudarshan45FindingSupportFe
witemsets:determinesupportofallitemsetsviaasinglepassonsetoftransactionsAcountismaintainedforeachitemset,initia
llysetto0.Whenatransactionisfetched,thecountisincrementedforeachsetofitemsthatiscontainedinthetransaction.Largeitemsets:sets
withahighcountattheendofthepassManyitemsets:Ifmemorynotenoughtoholdallcountsforallitemsetsusemultiplepasses,consideringonlysomeitemsetsin
eachpass.Optimization:Onceanitemsetiseliminatedbecauseitscount(support)istoosmallnoneofitssupersetsneedstobeconsidered.The
aprioritechniquetofindlargeitemsets:Pass1:countsupportofallsetswithjust1item.EliminatethoseitemswithlowsupportPassi:candidates:everys
etofiitemssuchthatallitsi-1itemsubsetsarelargeCountsupportofallcandidatesStopiftherearenocandidatesCopyright:Silbersc
hatz,KorthandSudarshan46OtherTypesofAssociationsBasicassociationruleshaveseverallimitationsDeviationsfromtheexpectedprobabilitya
remoreinterestingE.g.ifmanypeoplepurchasebread,andmanypeoplepurchasecereal,quiteafewwouldbeexpectedtopurchaseboth(prob1*prob2)
WeareinterestedinpositiveaswellasnegativecorrelationsbetweensetsofitemsPositivecorrelation:co-occurrenceishigherthanpredict
edNegativecorrelation:co-occurrenceislowerthanpredictedSequenceassociations/correlationsE.g.wheneverbond
sgoup,stockpricesgodownin2daysDeviationsfromtemporalpatternsE.g.deviationfromasteadygrowthE.g.salesofwinterweargodowninsummerNotsurprisin
g,partofaknownpattern.LookfordeviationfromvaluepredictedusingpastpatternsCopyright:Silberschatz,KorthandSudarshan47ClusteringClustering:Intu
itively,findingclustersofpointsinthegivendatasuchthatsimilarpointslieinthesameclusterCanbeformalizedusingdista
ncemetricsinseveralwaysE.g.Grouppointsintoksets(foragivenk)suchthattheaveragedistanceofpointsfromthecentroidoftheirassignedgroupismini
mizedCentroid:pointdefinedbytakingaverageofcoordinatesineachdimension.Anothermetric:minimizeaveragedistancebetweeneverypairofpoin
tsinaclusterHasbeenstudiedextensivelyinstatistics,butonsmalldatasetsDataminingsystemsaimatclusteringtechniqu
esthatcanhandleverylargedatasetsE.g.theBirchclusteringalgorithm(moreshortly)Copyright:Silberschatz,KorthandSudarshan48Hie
rarchicalClusteringExamplefrombiologicalclassification(thewordclassificationheredoesnotmeanapredict
ionmechanism)chordatamammaliareptilialeopardshumanssnakescrocodilesOtherexamples:Internetdirectorysystems(e.g.Yahoo,moreonthislater)Ag
glomerativeclusteringalgorithmsBuildsmallclusters,thenclustersmallclustersintobiggerclusters,andsoonDivisiveclusteringalgorithm
sStartwithallitemsinasinglecluster,repeatedlyrefine(break)clustersintosmalleronesCopyright:Silberschatz,KorthandSudars
han49ClusteringAlgorithmsClusteringalgorithmshavebeendesignedtohandleverylargedatasetsE.g.theBircha
lgorithmMainidea:useanin-memoryR-treetostorepointsthatarebeingclusteredInsertpointsoneatatimeintotheR-tree,merginganewpointwithanexistingclusterifi
slessthansomedistanceawayIftherearemoreleafnodesthanfitinmemory,mergeexistingclustersthatareclosetoeachotherAtth
eendoffirstpasswegetalargenumberofclustersattheleavesoftheR-treeMergeclusterstoreducethenumberofclustersCop
yright:Silberschatz,KorthandSudarshan50CollaborativeFilteringGoal:predictwhatmovies/books/…apersonmaybeinterestedin,onthebasisofPastprefer
encesofthepersonOtherpeoplewithsimilarpastpreferencesThepreferencesofsuchpeopleforanewmovie/book/…OneapproachbasedonrepeatedclusteringClust
erpeopleonthebasisofpreferencesformoviesThenclustermoviesonthebasisofbeinglikedbythesameclustersofpeopleAgainclusterpeopleba
sedontheirpreferencesfor(thenewlycreatedclustersof)moviesRepeatabovetillequilibriumAboveproblemisaninstanceof
collaborativefiltering,whereuserscollaborateinthetaskoffilteringinformationtofindinformationofintere
stCopyright:Silberschatz,KorthandSudarshan51OtherTypesofMiningTextmining:applicationofdataminingtotextualdocumentsE.g.clusterWebpagestofindrelated
pagesE.g.clusterpagesauserhasvisitedtoorganizetheirvisithistoryE.g.classifyWebpagesautomaticallyintoaWebdirectoryDatavisualiz
ationsystemshelpusersexaminelargevolumesofdataanddetectpatternsvisuallyE.g.maps,charts,andcolor-cod
ingE.g.locationswithproblemsshowninredonamapCanvisuallyencodelargeamountsofinformationonasinglescreenHumansareverygo
odadetectingvisualpatternsCopyright:Silberschatz,KorthandSudarshan52DataWarehousingCopyright:Silberschatz,KorthandSudarshan53DataWarehousingLarg
eorganizationshavecomplexinternalorganizations,andhavedatastoredatdifferentlocations,ondifferentoperational(transactionprocessing)systems,underd
ifferentschemasDatasourcesoftenstoreonlycurrentdata,nothistoricaldataCorporatedecisionmakingrequiresaunifiedviewofallorganizationaldata,i
ncludinghistoricaldataAdatawarehouseisarepository(archive)ofinformationgatheredfrommultiplesources,storedunderaunifiedschema,atasinglesiteGreatlys
implifiesquerying,permitsstudyofhistoricaltrendsShiftsdecisionsupportqueryloadawayfromtransactionprocessingsystemsCopyright:Silberscha
tz,KorthandSudarshan54DataWarehousingCopyright:Silberschatz,KorthandSudarshan55ComponentsofDataWarehouseWhenandhowtogatherdataSourcedrivenarchite
cture:datasourcestransmitnewinformationtowarehouse,eithercontinuouslyorperiodically(e.g.atnight)Destinationdrivenarchitecture:wa
rehouseperiodicallyrequestsnewinformationfromdatasourcesKeepingwarehouseexactlysynchronizedwithdatasources(e.g.usingtwo-phasecommit)istooexpe
nsiveUsuallyOKtohaveslightlyout-of-datedataatwarehouseData/updatesareperiodicallydownloadedformonlinetransactionprocessing(OLTP)systems.Wh
atschematouseSchemaintegrationCopyright:Silberschatz,KorthandSudarshan56ComponentsofDataWarehouse(Cont.)Dat
acleansingE.g.correctmistakesinaddressesE.g.misspellings,zipcodeerrorsMergeaddresslistsfromdifferentsourcesandpurgeduplicatesKeepon
lyoneaddressrecordperhousehold(“householding”)HowtopropagateupdatesWarehouseschemamaybea(materialized)viewofschemafromdatasources
EfficienttechniquesforupdateofmaterializedviewsWhatdatatosummarizeRawdatamaybetoolargetostoreon-lineAggregatevalues(totals/subtotals)ofte
nsufficeQueriesonrawdatacanoftenbetransformedbyqueryoptimizertouseaggregatevaluesCopyright:Silberschatz,KorthandSudarshan57Dat
aWarehouseSchemasCopyright:Silberschatz,KorthandSudarshan58WarehouseSchemasTypicallywarehousedataismultidimensional,withverylargefacttablesExample
sofdimensions:item-id,date/timeofsale,storewheresalewasmade,customeridentifierExamplesofmeasures:number
ofitemssold,priceofitemsDimensionvaluesareusuallyencodedusingsmallintegersandmappedtofullvaluesviadimensiontablesResultantschemaisca
lledastarschemaMorecomplicatedschemastructuresSnowflakeschema:multiplelevelsofdimensiontablesConstellation:multiplefacttablesCopyrigh
t:Silberschatz,KorthandSudarshan59InformationRetrievalCopyright:Silberschatz,KorthandSudarshan60InformationRetrievalSystemsInformationretrieva
l(IR)systemsuseasimplerdatamodelthandatabasesystemsInformationorganizedasacollectionofdocumentsDocumentsareunstructured,noschemaInformatio
nretrievallocatesrelevantdocuments,onthebasisofuserinputsuchaskeywordsorexampledocumentse.g.,finddocumentscontainingthewords“databasesystems”C
anbeusedevenontextualdescriptionsprovidedwithnon-textualdatasuchasimagesIRonWebdocumentshasbecomeextremel
yimportantE.g.google,altavista,…Copyright:Silberschatz,KorthandSudarshan61InformationRetrievalSystems(Cont.)DifferencesfromdatabasesystemsIRsys
temsdon‟tdealwithtransactionalupdates(includingconcurrencycontrolandrecovery)Databasesystemsdealwithstruc
tureddata,withschemasthatdefinethedataorganizationIRsystemsdealwithsomequeryingissuesnotgenerallyaddressedbydatabasesystemsApproximatese
archingbykeywordsRankingofretrievedanswersbyestimateddegreeofrelevanceCopyright:Silberschatz,KorthandSudarshan62Keywo
rdSearchInfulltextretrieval,allthewordsineachdocumentareconsideredtobekeywords.WeusethewordtermtorefertothewordsinadocumentInforma
tion-retrievalsystemstypicallyallowqueryexpressionsformedusingkeywordsandthelogicalconnectivesand,or,andnotAndsareimplicit,evenifnotexplici
tlyspecifiedRankingofdocumentsonthebasisofestimatedrelevancetoaqueryiscriticalRelevancerankingisbasedonfactorssuchasTermfrequen
cyFrequencyofoccurrenceofquerykeywordindocumentInversedocumentfrequencyHowmanydocumentsthequerykeywordoccur
sinFewergivemoreimportancetokeywordHyperlinkstodocumentsMorelinkstoadocumentdocumentismoreimportantCopyright:Silberscha
tz,KorthandSudarshan63RelevanceRankingUsingTermsTF-IDF(Termfrequency/InverseDocumentfrequency)ranking:Le
tn(d)=numberoftermsinthedocumentdn(d,t)=numberofoccurrencesoftermtinthedocumentd.ThenrelevanceofadocumentdtoatermtThelo
gfactoristoavoidexcessiveweightagetofrequenttermsAndrelevanceofdocumenttoqueryQn(d)n(d,t)1+r(d,t)=logr(d,Q)=r(d,t)
n(t)tQCopyright:Silberschatz,KorthandSudarshan64RelevanceRankingUsingTerms(Cont.)MostsystemsaddtotheabovemodelWordsthatoccurintitle,aut
horlist,sectionheadings,etc.aregivengreaterimportanceWordswhosefirstoccurrenceislateinthedocumentaregivenlowerimpo
rtanceVerycommonwordssuchas“a”,“an”,“the”,“it”etcareeliminatedCalledstopwordsProximity:ifkeywords
inqueryoccurclosetogetherinthedocument,thedocumenthashigherimportancethaniftheyoccurfarapartDocumentsarereturnedindecreasingorderofrelevancescor
eUsuallyonlytopfewdocumentsarereturned,notallCopyright:Silberschatz,KorthandSudarshan65RelevanceUsingHyper
linksWhenusingkeywordqueriesontheWeb,thenumberofdocumentsisenormous(manybillions)Numberofdocumentsrelevanttoaquerycanbe
enormousifonlytermfrequenciesaretakenintoaccountUsingtermfrequenciesmakes“spamming”easyE.g.atravelagentcanaddmanyoccurrencesofthew
ords“travelagent”tohispagetomakeitsrankveryhighMostofthetimepeoplearelookingforpagesfrompopularsitesIdea:usepopularity
ofWebsite(e.g.howmanypeoplevisitit)toranksitepagesthatmatchgivenkeywordsProblem:hardtofindactualpopularityofsiteSolution:nextslideCopyright
:Silberschatz,KorthandSudarshan66RelevanceUsingHyperlinks(Cont.)Solution:usenumberofhyperlinkstoasiteasameasureofthepopularityorpresti
geofthesiteCountonlyonehyperlinkfromeachsite(why?)Popularitymeasureisforsite,notforindividualpageMosthyperlinksaretorootofsiteSite-popula
ritycomputationischeaperthanpagepopularitycomputationRefinementsWhencomputingprestigebasedonlinkstoasite,givemoreweightagetolinksfromsitesthatth
emselveshavehigherprestigeDefinitioniscircularSetupandsolvesystemofsimultaneouslinearequationsAboveideaisbasisoftheGooglePageRankrankingmechan
ismCopyright:Silberschatz,KorthandSudarshan67RelevanceUsingHyperlinks(Cont.)Connectionstosocialnetworkingth
eoriesthatrankedprestigeofpeopleE.g.thepresidentoftheUShasahighprestigesincemanypeopleknowhimSomeoneknownbymultiplepresti
giouspeoplehashighprestigeHubandauthoritybasedrankingAhubisapagethatstoreslinkstomanypages(onatopic)Anauthorityisapagethatcontainsactualinfor
mationonatopicEachpagegetsahubprestigebasedonprestigeofauthoritiesthatitpointstoEachpagegetsanauth
orityprestigebasedonprestigeofhubsthatpointtoitAgain,prestigedefinitionsarecyclic,andcanbegotbysolvinglinearequationsUsea
uthorityprestigewhenrankinganswerstoaqueryCopyright:Silberschatz,KorthandSudarshan68SimilarityBasedRetrievalSimilaritybasedretri
eval-retrievedocumentssimilartoagivendocumentSimilaritymaybedefinedonthebasisofcommonwordsE.g.findktermsinAwithh
ighestr(d,t)andusethesetermstofindrelevanceofotherdocuments;eachofthetermscarriesaweightofr(d,t)Similaritycanbeusedtor
efineanswersettokeywordqueryUserselectsafewrelevantdocumentsfromthoseretrievedbykeywordquery,andsystemfindsotherdocumentssimilartothes
eCopyright:Silberschatz,KorthandSudarshan69SynonymsandHomonymsSynonymsE.g.document:“motorcyclerepair”,query:“motorcyclemaintenance”needtorealizeth
at“maintenance”and“repair”aresynonymsSystemcanextendqueryas“motorcycleand(repairormaintenance)”HomonymsE.g.“object”hasdifferentmeaningsasnoun/v
erbCandisambiguatemeanings(tosomeextent)fromthecontextExtendingqueriesautomaticallyusingsynonymscanbeproblematicNeedtounder
standintendedmeaninginordertoinfersynonymsOrverifysynonymswithuserSynonymsmayhaveothermeaningsaswellCopyright:Silberschatz,KorthandSudarsha
n70IndexingofDocumentsAninvertedindexmapseachkeywordKitoasetofdocumentsSithatcontainthekeywordDocumentsident
ifiedbyidentifiersInvertedindexmayrecordKeywordlocationswithindocumenttoallowproximitybasedrankingCountsofnumberofoccurrencesofkeywordtocomput
eTFandoperation:FindsdocumentsthatcontainallofK1,K2,...,Kn.IntersectionS1S2.....Snoroperation:documentsthatcontainatleastoneofK1,K
2,…,Knunion,S1S2.....Sn,.EachSiiskeptsortedtoallowefficientintersection/unionbymerging“not”canalsobeefficientlyimplementedbymergingofsort
edlistsCopyright:Silberschatz,KorthandSudarshan71MeasuringRetrievalEffectivenessIRsystemssavespacebyusingindex
structuresthatsupportonlyapproximateretrieval.Mayresultin:falsenegative(falsedrop)-somerelevantdocumentsmaynotber
etrieved.falsepositive-someirrelevantdocumentsmayberetrieved.Formanyapplicationsagoodindexshouldnotpermitanyfalsedrops,
butmaypermitafewfalsepositives.Relevantperformancemetrics:Precision-whatpercentageoftheretrieveddocumentsarerelevanttothequery.Recall-what
percentageofthedocumentsrelevanttothequerywereretrieved.Copyright:Silberschatz,KorthandSudarshan72MeasuringRetrievalEffect
iveness(Cont.)Rankingordercanalsoresultinfalsepositives/falsenegativeRecallvs.precisiontradeoff:Canincrea
serecallbyretrievingmanydocuments(downtoalowlevelrelevanceranking),butmanyirrelevantdocumentswouldbefetched,reducingprecisionMeasuresof
retrievaleffectiveness:Recallasafunctionofnumberofdocumentsfetched,orPrecisionasafunctionofrecallEqui
valently,asafunctionofnumberofdocumentsfetchedE.g.“precisionof75%atrecallof50%,and60%atarecallof75%”Ingeneral:drawagraphofp
recisionvsrecall.Problem:whichdocumentsareactuallyrelevant,andwhichanotUsualsolution:humanjudgesCreateacorpus
ofdocumentsandqueries,withhumansdecidingwhichdocumentsarerelevanttowhichqueriesTREC(TextREtrievalConfe
rence)BenchmarkCopyright:Silberschatz,KorthandSudarshan73WebCrawlingWebcrawlersareprogramsthatlocateandgatherinformationontheWebRecursi
velyfollowhyperlinkspresentinknowndocuments,tofindotherdocumentsStartingfromaseedsetofdocumentsFetcheddocumentsHan
dedovertoanindexingsystemCanbediscardedafterindexing,orstoreasacachedcopyCrawlingtheentireWebwouldta
keaverylargeamountoftimeSearchenginestypicallycoveronlyapartoftheWeb,notallofitTakemonthstoperformasinglecrawlCopyright:Silb
erschatz,KorthandSudarshan74WebCrawling(Cont.)Crawlingisdonebymultipleprocessesonmultiplemachines,runninginparallel
SetoflinkstobecrawledstoredinadatabaseNewlinksfoundincrawledpagesaddedtothisset,tobecrawledlaterIndexingprocessalsorunsonmultiplemachinesCreatesa
newcopyofindexinsteadofmodifyingoldindexOldindexisusedtoanswerqueriesAfteracrawlis“completed”newindexbec
omes“old”indexMultiplemachinesusedtoanswerqueriesIndicesmaybekeptinmemoryQueriesmayberoutedtodifferentmachinesforloadbalancingCo
pyright:Silberschatz,KorthandSudarshan75BrowsingStoringrelateddocumentstogetherinalibraryfacilitatesbrowsinguserscanseeno
tonlyrequesteddocumentbutalsorelatedones.Browsingisfacilitatedbyclassificationsystemthatorganizeslogicallyrelateddo
cumentstogether.Organizationishierarchical:classificationhierarchyCopyright:Silberschatz,KorthandSudarshan76AClassificationHierarchyForALibra
rySystemCopyright:Silberschatz,KorthandSudarshan77ClassificationDAGDocumentscanresideinmultipleplacesinahierarchyinaninformationretrieval
system,sincephysicallocationisnotimportant.ClassificationhierarchyisthusDirectedAcyclicGraph(DAG)Copyright:Silberschatz,KorthandS
udarshan78AClassificationDAGForALibraryInformationRetrievalSystemCopyright:Silberschatz,KorthandSudarshan79WebDirectoriesAWebdirectoryisjustacla
ssificationdirectoryonWebpagesE.g.Yahoo!Directory,OpenDirectoryprojectIssues:Whatshouldthedirectoryhierarchybe?Givenado
cument,whichnodesofthedirectoryarecategoriesrelevanttothedocumentOftendonemanuallyClassificationofdocumentsintoahierarchymaybedonebasedontermsimi
larity