【文档说明】数据库原理与应用课程组课件.ppt,共(79)页,731.500 KB,由小橙橙上传
转载请保留链接:https://www.ichengzhen.cn/view-92522.html
以下为本文档部分文字说明:
Copyright:Silberschatz,KorthandSudarshan1Chapter22:AdvancedQueryingandInformationRetrieval陕西师范大学计算机科学学院数据库原理与应用课程组C
opyright:Silberschatz,KorthandSudarshan2Chapter22:AdvancedQueryingandInformationRetrievalDecision-SupportSystemsDataAnalysisOLAPExtendedag
gregationfeaturesinSQLWindowingandrankingDataMiningDataWarehousingInformation-RetrievalSystemsIncludingWebsearchCopyright:Silberschatz,Kor
thandSudarshan3DecisionSupportSystemsDecision-supportsystemsareusedtomakebusinessdecisionsoftenbasedondatacollectedbyon-linetransaction-proce
ssingsystems.Examplesofbusinessdecisions:whatitemstostock?Whatinsurancepremiumtochange?Whotosendadvertisementsto?Example
sofdatausedformakingdecisionsRetailsalestransactiondetailsCustomerprofiles(income,age,sex,etc.)Copyright:Silberschatz,Kor
thandSudarshan4Decision-SupportSystems:OverviewDataanalysistasksaresimplifiedbyspecializedtoolsandSQLextensionsExamplet
asksForeachproductcategoryandeachregion,whatwerethetotalsalesinthelastquarterandhowdotheycomparewiththesamequarterlastyearAsa
bove,foreachproductcategoryandeachcustomercategoryStatisticalanalysispackages(e.g.,:S++)canbeinterfacedwithdatabas
esStatisticalanalysisisalargefieldwillnotstudyithereDataminingseekstodiscoverknowledgeautomaticallyintheformof
statisticalrulesandpatternsfromLargedatabases.Adatawarehousearchivesinformationgatheredfrommultiplesources,an
dstoresitunderaunifiedschema,atasinglesite.Importantforlargebusinesseswhichgeneratedatafrommultipledivisions,possiblyatmult
iplesitesDatamayalsobepurchasedexternallyCopyright:Silberschatz,KorthandSudarshan5DataAnalysisandOLAPAggregatefunctionssummarizelarg
evolumesofdataOnlineAnalyticalProcessing(OLAP)Interactiveanalysisofdata,allowingdatatobesummarizedandviewedindifferentwaysinanonlinefashion(withneg
ligibledelay)Datathatcanbemodeledasdimensionattributesandmeasureattributesarecalledmultidimensionaldata.Givenarelationusedfordata
analysis,wecanidentifysomeofitsattributesasmeasureattributes,sincetheymeasuresomevalue,andcanbeaggregatedupon.Forinstance,theattributenumberofthesa
lesrelationisameasureattribute,sinceitmeasuresthenumberofunitssold.Someoftheotherattributesoftherelationareidentifiedasdimensionattribut
es,sincetheydefinethedimensionsonwhichmeasureattributes,andsummariesofmeasureattributes,areviewed.Copyright:Silberschatz,KorthandSudarshan6
CrossTabulationofsalesbyitem-nameandcolorThetableaboveisanexampleofacross-tabulation(cross-tab),alsoreferredtoasapivot-table.Across-t
abisatablewherevaluesforoneofthedimensionattributesformtherowheaders,valuesforanotherdimensionattributeformthecolumnheadersOtherdimensi
onattributesarelistedontopValuesinindividualcellsare(aggregatesof)thevaluesofthedimensionattributesthatspecifythec
ell.Copyright:Silberschatz,KorthandSudarshan7RelationalRepresentationofCrosstabsCrosstabscanberepresentedasrelat
ionsThevalueallisusedtorepresentaggregatesTheSQL:1999standardactuallyusesnullvaluesinplaceofallMoreonthislater….Copyright:Silbersc
hatz,KorthandSudarshan8Three-DimensionalDataCubeAdatacubeisamultidimensionalgeneralizationofacrosstabCannotviewathree-dimensionalobjectinitsentiret
ybutcrosstabscanbeusedasviewsonadatacubeCopyright:Silberschatz,KorthandSudarshan9OnlineAnalyticalProcessingTheoperationofc
hangingthedimensionsusedinacross-tabiscalledpivotingSupposeananalystwishestoseeacross-tabonitem-name
andcolorforafixedvalueofsize,forexample,large,insteadofthesumacrossallsizes.Suchanoperationisreferredtoasslicing.Theoperationissometimescalleddicin
g,particularlywhenvaluesformultipledimensionsarefixed.Theoperationofmovingfromfiner-granularitydatatoacoarsergranularity
iscalledarollup.Theoppositeoperation-thatofmovingfromcoarser-granularitydatatofiner-granularitydata–iscalledadrilldown.Copyright:Silbe
rschatz,KorthandSudarshan10HierarchiesonDimensionsHierarchyondimensionattributes:letsdimensionstobeviewedatdifferentlevelsofdetailE.g.t
hedimensionDateTimecanbeusedtoaggregatebyhourofday,date,dayofweek,month,quarteroryearCopyright:Silberschat
z,KorthandSudarshan11CrossTabulationWithHierarchyCrosstabscanbeeasilyextendedtodealwithhierarchiesCandrilldownorro
lluponahierarchyCopyright:Silberschatz,KorthandSudarshan12OLAPImplementationTheearliestOLAPsystemsusedm
ultidimensionalarraysinmemorytostoredatacubes,andarereferredtoasmultidimensionalOLAP(MOLAP)systems.OLAPimplementationsusingonlyrelationaldatabasefe
aturesarecalledrelationalOLAP(ROLAP)systemsHybridsystems,whichstoresomesummariesinmemoryandstorethebasedataandothersummariesin
arelationaldatabase,arecalledhybridOLAP(HOLAP)systems.Copyright:Silberschatz,KorthandSudarshan13OLAPImplementation(Cont.)EarlyOLAPs
ystemsprecomputedallpossibleaggregatesinordertoprovideonlineresponseSpaceandtimerequirementsfordoingsocanbeveryhigh2ncombinationsofgrou
pbyItsufficestoprecomputesomeaggregates,andcomputeothersondemandfromoneoftheprecomputedaggregatesCancomputeaggregateon(item-name,color)fro
managgregateon(item-name,color,size)Forallbutafew“non-decomposable”aggregatessuchasmedianischeaperthancomputingitfromscratchSeveraloptimizatio
nsavailableforcomputingmultipleaggregatesCancomputeaggregateon(item-name,color)fromanaggregateon(item-name,color,size)Cancompute
aggregateson(item-name,color,size),(item-name,color)and(item-name)usingasinglesortingofthebasedataCopyright:Sil
berschatz,KorthandSudarshan14ExtendedAggregationSQL-92aggregationquitelimitedManyusefulaggregatesareeith
erveryhardorimpossibletospecifyDatacubeComplexaggregates(median,variance)binaryaggregates(correlation,
regressioncurves)rankingqueries(“assigneachstudentarankbasedonthetotalmarks”SQL:1999OLAPextensionsprovideavarietyo
faggregationfunctionstoaddressabovelimitationsSupportedbyseveraldatabases,includingOracleandIBMDB2Copyright:Silberschatz,KorthandSud
arshan15ExtendedAggregationinSQL:1999Thecubeoperationcomputesunionofgroupby‟soneverysubsetofthespecifiedattributesE.g.considerthequeryse
lectitem-name,color,size,sum(number)fromsalesgroupbycube(item-name,color,size)Thiscomputestheunionofeightdiffer
entgroupingsofthesalesrelation:{(item-name,color,size),(item-name,color),(item-name,size),(color,size),(item-name)
,(color),(size),()}where()denotesanemptygroupbylist.Foreachgrouping,theresultcontainsthenullvalueforat
tributesnotpresentinthegrouping.Copyright:Silberschatz,KorthandSudarshan16ExtendedAggregation(Cont.)Relationalrepresentationofcrosstabthatwesawearl
ier,butwithnullinplaceofall,canbecomputedbyselectitem-name,color,sum(number)fromsalesgroupbycube(item-name,color)Thefunction
grouping()canbeappliedonanattributeReturns1ifthevalueisanullvaluerepresentingall,andreturns0inallotherc
ases.selectitem-name,color,size,sum(number),grouping(item-name)asitem-name-flag,grouping(color)ascolor-flag,grou
ping(size)assize-flag,fromsalesgroupbycube(item-name,color,size)Canusethefunctiondecode()intheselectclauset
oreplacesuchnullsbyavaluesuchasallE.g.replaceitem-nameinfirstquerybydecode(grouping(item-name),1,„all‟,item-na
me)Copyright:Silberschatz,KorthandSudarshan17ExtendedAggregation(Cont.)Therollupconstructgeneratesuniononeveryprefixofspecifiedlistofattribute
sE.g.selectitem-name,color,size,sum(number)fromsalesgroupbyrollup(item-name,color,size)Generatesunio
noffourgroupings:{(item-name,color,size),(item-name,color),(item-name),()}Rollupcanbeusedtogenerateaggregatesatmultiplelevelsofahierarchy.E.g.,su
pposetableitemcategory(item-name,category)givesthecategoryofeachitem.Thenselectcategory,item-name,sum(
number)fromsales,itemcategorywheresales.item-name=itemcategory.item-namegroupbyrollup(category,item-name)wouldgiveahiera
rchicalsummarybyitem-nameandbycategory.Copyright:Silberschatz,KorthandSudarshan18ExtendedAggregation(Co
nt.)MultiplerollupsandcubescanbeusedinasinglegroupbyclauseEachgeneratessetofgroupbylists,crossproductofsetsgivesoverall
setofgroupbylistsE.g.,selectitem-name,color,size,sum(number)fromsalesgroupbyrollup(item-name),rollup(color,size)generatesthegroupings{item-name,()
}X{(color,size),(color),()}={(item-name,color,size),(item-name,color),(item-name),(color,size),(color),()}Copyright:Silbersc
hatz,KorthandSudarshan19RankingRankingisdoneinconjunctionwithanorderbyspecification.Givenarelationstudent-marks(student-id,marks)find
therankofeachstudent.selectstudent-id,rank()over(orderbymarksdesc)ass-rankfromstudent-marksAnextraorderbyclauseisneededtogettheminsortedor
derselectstudent-id,rank()over(orderbymarksdesc)ass-rankfromstudent-marksorderbys-rankRankingmayleave
gaps:e.g.if2studentshavethesametopmark,bothhaverank1,andthenextrankis3dense_rankdoesnotleavegaps,sonextdenserankwouldbe2Copy
right:Silberschatz,KorthandSudarshan20Ranking(Cont.)Rankingcanbedonewithinpartitionofthedata.“Findtherankofstudentsw
ithineachsection.”selectstudent-id,section,rank()over(partitionbysectionorderbymarksdesc)assec-rankfromstudent-marks,student-s
ectionwherestudent-marks.student-id=student-section.student-idorderbysection,sec-rankMultiplerankclausescanoccurinasinglesel
ectclauseRankingisdoneafterapplyinggroupbyclause/aggregationExercises:FindstudentswithtopnranksManysystemsp
rovidespecial(non-standard)syntaxfor“top-n”queriesRankstudentsbysumoftheirmarksindifferentcoursesgivenrelations
tudent-course-marks(student-id,course,marks)Copyright:Silberschatz,KorthandSudarshan21Ranking(Cont.)Otherrankingfunctions:percent_rank(
withinpartition,ifpartitioningisdone)cume_dist(cumulativedistribution)fractionoftupleswithprecedingvaluesrow_number(non-deterministic
inpresenceofduplicates)SQL:1999permitstheusertospecifynullsfirstornullslastselectstudent-id,rank()over(orderbymarksdescnullsla
st)ass-rankfromstudent-marksCopyright:Silberschatz,KorthandSudarshan22Ranking(Cont.)Foragivenconstantn,therankingthefunctio
nntile(n)takesthetuplesineachpartitioninthespecifiedorder,anddividesthemintonbucketswithqualnumbersoftuples.Forinstance,weansortem
ployeesbysalary,andusentile(3)tofindwhichrange(bottomthird,middlethird,ortopthird)eachemployeeisin,andcomputethetotalsalaryearnedbyemployeesineachran
ge:selectthreetile,sum(salary)from(selectsalary,ntile(3)over(orderbysalary)asthreetilefromemployee)assgroupbyt
hreetileCopyright:Silberschatz,KorthandSudarshan23WindowingE.g.:“Givensalesvaluesforeachdate,calculateforeachdat
etheaverageofthesalesonthatday,thepreviousday,andthenextday”Suchmovingaveragequeriesareusedtosmoothoutrandomvariations.Incontrasttogroupby,thes
ametuplecanexistinmultiplewindowsWindowspecificationinSQL:Orderingoftuples,sizeofwindowforeachtuple,aggregatefunctionE.g.gi
venrelationsales(date,value)selectdate,sum(value)over(orderbydatebetweenrows1precedingand1following)fromsalesExamplesofother
windowspecifications:betweenrowsunboundedprecedingandcurrentrowsunboundedprecedingrangebetween10precedingandcurrentrowAllrowswithvaluesbetweencur
rentrowvalue–10tocurrentvaluerangeinterval10dayprecedingNotincludingcurrentrowCopyright:Silberschatz,KorthandSudarshan24Windowing
(Cont.)CandowindowingwithinpartitionsE.g.Givenarelationtransaction(account-number,date-time,value),wherevalueispositiveforadepositandnegativeforaw
ithdrawal“Findtotalbalanceofeachaccountaftereachtransactionontheaccount”selectaccount-number,date-ti
me,sum(value)over(partitionbyaccount-numberorderbydate-timerowsunboundedpreceding)asbalancefromtransactionorderbyaccount-numbe
r,date-timeCopyright:Silberschatz,KorthandSudarshan25DataMiningCopyright:Silberschatz,KorthandSudarshan26DataMini
ngBroadlyspeaking,dataminingistheprocessofsemi-automaticallyanalyzinglargedatabasestofindusefulpatternsLikeknowledgediscoveryinartificiali
ntelligencedataminingdiscoversstatisticalrulesandpatternsDiffersfrommachinelearninginthatitdealswithlarg
evolumesofdatastoredprimarilyondisk.Sometypesofknowledgediscoveredfromadatabasecanberepresentedbyasetofr
ules.e.g.,:“Youngwomenwithannualincomesgreaterthan$50,000aremostlikelytobuysportscars”Othertypesofknowledgerepresentedbye
quations,orbypredictionfunctionsSomemanualinterventionisusuallyrequiredPre-processingofdata,choiceofwhichtypeofpatterntofind,postprocessingtofin
dnovelpatternsCopyright:Silberschatz,KorthandSudarshan27ApplicationsofDataMiningPredictionbasedonpasthistoryPredictifacreditcardapplican
tposesagoodcreditrisk,basedonsomeattributes(income,jobtype,age,..)andpasthistoryPredictifacustomerislikelytoswitch
brandloyaltyPredictifacustomerislikelytorespondto“junkmail”PredictifapatternofphonecallingcardusageislikelytobefraudulentSom
eexamplesofpredictionmechanisms:ClassificationGivenatrainingsetconsistingofitemsbelongingtodifferentclasses,andanewitemwhoseclassisunknown,predict
whichclassitbelongstoRegressionformulaegivenasetofparameter-valuetofunction-resultmappingsforanunknownfunction,pred
ictthefunction-resultforanewparameter-valueCopyright:Silberschatz,KorthandSudarshan28ApplicationsofData
Mining(Cont.)DescriptivePatternsAssociationsFindbooksthatareoftenboughtbythesamecustomers.Ifanewcustomerbuysonesuchbook,s
uggestthathebuystheotherstoo.Othersimilarapplications:cameraaccessories,clothes,etc.Associationsmayalsobeusedasafirststepindetecting
causationE.g.associationbetweenexposuretochemicalXandcancer,ornewmedicineandcardiacproblemsClustersE.g.typhoidcaseswereclusteredinanareasurroun
dingacontaminatedwellDetectionofclustersremainsimportantindetectingepidemicsCopyright:Silberschatz,KorthandSudarshan29ClassificationRu
lesClassificationruleshelpassignnewobjectstoasetofclasses.E.g.,givenanewautomobileinsuranceapplicant,shouldheorshebeclassif
iedaslowrisk,mediumriskorhighrisk?Classificationrulesforaboveexamplecoulduseavarietyofknowledge,suchaseducationalle
velofapplicant,salaryofapplicant,ageofapplicant,etc.personP,P.degree=mastersandP.income>75,000P.credi
t=excellentpersonP,P.degree=bachelorsand(P.income25,000andP.income75,000)P.credit=goodRulesarenotnecessaril
yexact:theremaybesomemisclassificationsClassificationrulescanbecompactlyshownasadecisiontree.Copyright:Silberschatz,KorthandSudarsh
an30DecisionTreeCopyright:Silberschatz,KorthandSudarshan31ConstructionofDecisionTreesTrainingset:adatasampleinwhichthegroupingforeac
htupleisalreadyknown.Considercreditriskexample:Supposedegreeischosentopartitionthedataattheroot.Sincedegreehasasmallnumberofpo
ssiblevalues,onechildiscreatedforeachvalue.Ateachchildnodeoftheroot,furtherclassificationisdoneifrequired.Here
,partitionsaredefinedbyincome.Sinceincomeisacontinuousattribute,somenumberofintervalsarechosen,andonechildcreatedforeachinterval.
Differentclassificationalgorithmsusedifferentwaysofchoosingwhichattributetopartitiononateachnode,andwha
ttheintervals,ifany,are.IngeneralDifferentbranchesofthetreecouldgrowtodifferentlevels.Differentno
desatthesamelevelmayusedifferentpartitioningattributes.Copyright:Silberschatz,KorthandSudarshan32ConstructionofDecisionTrees(Cont.)Greed
ytopdowngenerationofdecisiontrees.Eachinternalnodeofthetreepartitionsthedataintogroupsbasedonapartitioningattribute,andapartitioningconditionforthen
odeMoreonchoosingpartioningattribute/conditionshortlyAlgorithmisgreedy:thechoiceismadeonceandnotrevisitedasmoreofthetreeisconstru
ctedThedataatanodeisnotpartitionedfurtherifeitherall(ormost)oftheitemsatthenodebelongtothesameclass,orallattributeshavebeenc
onsidered,andnofurtherpartitioningispossible.Suchanodeisaleafnode.Otherwisethedataatthenodeispartitionedfurtherbypickinganattributeforpartitioning
dataatthenode.Copyright:Silberschatz,KorthandSudarshan33BestSplitsIdea:evaluatedifferentattributesandpartitioningconditionsandpictheoneth
atbestimprovesthe“purity”ofthetrainingsetexamplesTheinitialtrainingsethasamixtureofinstancesfromdiffere
ntclassesandisthusrelativelyimpureE.g.ifdegreeexactlypredictscreditrisk,partitioningondegreewouldresultineachildhavinginstancesofonlyonecla
ssI.e.,thechildnodeswouldbepureThepurityofasetSoftraininginstancescanbemeasuredquantitativelyinseveralways.
Notation:numberofclasses=k,numberofinstances=|S|,fractionofinstancesinclassi=pi.TheGinimeasureofpurityisdefinedas[Gini(S)=1-Whenalli
nstancesareinasingleclass,theGinivalueis0,whileitreachesitsmaximum(of1–1/k)ifeachclassthesamenumberofinstances.ki-1p2iCopyright:Silberschatz,
KorthandSudarshan34BestSplits(Cont.)Anothermeasureofpurityistheentropymeasure,whichisdefinedasentropy(S)=–WhenasetSisspl
itintomultiplesetsSi,I=1,2,…,r,wecanmeasurethepurityoftheresultantsetofsetsas:purity(S1,S2,…..,Sr)=Theinfo
rmationgainduetoparticularsplitofSintoSi,i=1,2,….,rInformation-gain(S,{S1,S2,….,Sr)=purity(S)–purity(S1,S2,…Sr)ri=1|Si||S|purity(Si)ki-1pilog2pi
Copyright:Silberschatz,KorthandSudarshan35BestSplits(Cont.)Measureof“cost”ofasplit:Information-content(S,{S1,S2,…..,Sr}))=–Information-gain
ratio=Information-gain(S,{S1,S2,……,Sr})Information-content(S,{S1,S2,…..,Sr})Thebestsplitforanattribut
eistheonethatgivesthemaximuminformationgainratioContinuousvaluedattributesCanbeorderedinafashionmeaningfultoclassificatione.g.integerandrealv
aluesCategoricalattributesCannotbemeaningfullyordered(e.g.country,school/university,item-color,.):log2ri-1|Si||S||Si||S|Copyright:Silberschatz,Ko
rthandSudarshan36FindingBestSplitsCategoricalattributes:Multi-waysplit,onechildforeachvaluemayhavetoomanychildreninsomecasesBinarysplit:try
allpossiblebreakupofvaluesintotwosets,andpickthebestContinuousvaluedattributeBinarysplit:Sortvaluesintheinstances,tryeachasaspli
tpointE.g.ifvaluesare1,10,15,25,splitat1,10,15PickthevaluethatgivesbestsplitMulti-waysplit:morecomplicated,seebi
bliographicnotesAseriesofbinarysplitsonthesameattributehasroughlyequivalenteffectCopyright:Silberschatz,KorthandSuda
rshan37Decision-TreeConstructionAlgorithmProcedureGrow.Tree(S)Partition(S);ProcedurePartition(S)if(purity(S)>por|S|<s)thenreturn;foreachattribu
teAevaluatesplitsonattributeA;Usebestsplitfound(acrossallattributes)topartitionSintoS1,S2,….,Sr,fori=1,2,…..,rPar
tition(Si);Copyright:Silberschatz,KorthandSudarshan38DecisionTreeConstr’nAlgo’s.(Cont.)Varietyofalgorithmshavebeendevelope
dtoReduceCPUcostand/orReduceIOcostwhenhandlingdatasetslargerthanmemoryImproveaccuracyofclassificationDecisiontreemaybeoverfitted,i.e.,over
lytunedtogiventrainingsetPruningofdecisiontreemaybedoneonbranchesthathavetoofewtraininginstancesWhenasubtreeis
pruned,aninternalnodebecomesaleafanditsclassissettothemajorityclassoftheinstancesthatmaptothenodePruningcanbedonebyusingapartof
thetrainingsettobuildtree,andasecondparttotestthetreeprunesubtreesthatincreasemisclassificationonsecondpartCopyright:Silberschatz,Korthan
dSudarshan39OtherTypesofClassifiersFurthertypesofclassifiersNeuralnetclassifiersBayesianclassifiersNeura
lnetclassifiersusethetrainingdatatotrainartificialneuralnetsWidelystudiedinAI,won‟tcoverhereBayesianclassifiersuseBayestheorem,
whichsaysp(cj|d)=p(d|cj)p(cj)p(d)wherep(cj|d)=probabilityofinstancedbeinginclasscj,p(d|cj)=probabilityofgeneratinginstancedgiven
classcj,p(cj)=probabilityofoccurrenceofclasscj,andp(d)=probabilityofinstancedoccuringCopyright:Silberschatz,KorthandSuda
rshan40NaïveBayesianClassifiersBayesianclassifiersrequirecomputationofp(d|cj)precomputationofp(cj)
p(d)canbeignoredsinceitisthesameforallclassesTosimplifythetask,naïveBayesianclassifiersassumeattributeshaveindependentdistributions,a
ndtherebyestimatep(d|cj)=p(d1|cj)*p(d2|cj)*….*(p(dn|cj)Eachofthep(di|cj)canbeestimatedfromahistogramondivaluesforeachclasscjthehistogramisc
omputedfromthetraininginstancesHistogramsonmultipleattributesaremoreexpensivetocomputeandstoreCopyright:Silberschatz,KorthandSudarshan41Regression
Regressiondealswiththepredictionofavalue,ratherthanaclass.Givenvaluesforasetofvariables,X1,X2,…,Xn,wewishtopredictthevalueofavariable
Y.Onewayistoinfercoefficientsa0,a1,a1,…,ansuchthatY=a0+a1*X1+a2*X2+…+an*XnFindingsuchalinearpolynomialiscalledlinearregression.Ingeneral
,theprocessoffindingacurvethatfitsthedataisalsocalledcurvefitting.Thefitmayonlybeapproximatebecauseofno
iseinthedata,orbecausetherelationshipisnotexactlyapolynomialRegressionaimstofindcoefficientsthatgivethebestpossiblefit.Copyright:S
ilberschatz,KorthandSudarshan42AssociationRulesRetailshopsareofteninterestedinassociationsbetweendifferentitem
sthatpeoplebuy.SomeonewhobuysbreadisquitelikelyalsotobuymilkApersonwhoboughtthebookDatabaseSystemConceptsisquitelikelyals
otobuythebookOperatingSystemConcepts.Associationsinformationcanbeusedinseveralways.E.g.whenacustomerbuysaparticularbo
ok,anonlineshopmaysuggestassociatedbooks.Associationrules:breadmilkDB-Concepts,OS-ConceptsNetworksLefthandside:antecedent,righthandside:cons
equentAnassociationrulemusthaveanassociatedpopulation;thepopulationconsistsofasetofinstancesE.g.eachtransaction(sale)atashopi
saninstance,andthesetofalltransactionsisthepopulationCopyright:Silberschatz,KorthandSudarshan43AssociationRules(Cont.
)Ruleshaveanassociatedsupport,aswellasanassociatedconfidence.Supportisameasureofwhatfractionofthepopulationsatisfiesbothth
eantecedentandtheconsequentoftherule.E.g.supposeonly0.001percentofallpurchasesincludemilkandscrewdrivers.Thesupport
fortheruleismilkscrewdriversislow.WeusuallywantruleswithareasonablyhighsupportRuleswithlowsupportareusuallynotveryus
efulConfidenceisameasureofhowoftentheconsequentistruewhentheantecedentistrue.E.g.therulebreadmilkhasaconfidenceof80percentif80
percentofthepurchasesthatincludebreadalsoincludemilk.Usuallywantruleswithreasonablylargeconfidence.Arulewithalowconfidenceisnotmeaningfu
l.Notethattheconfidenceofbreadmilkmaybeverydifferentfromtheconfidenceofmilkbread,althoughbothhavethesamesupports.Copyright:Silbers
chatz,KorthandSudarshan44FindingAssociationRulesWearegenerallyonlyinterestedinassociationruleswithreasonablyhighsupport(e
.g.supportof2%orgreater)Naïvealgorithm1.Considerallpossiblesetsofrelevantitems.2.Foreachsetfinditssupport(i.e.c
ounthowmanytransactionspurchaseallitemsintheset).Largeitemsets:setswithsufficientlyhighsupport3.Uselargei
temsetstogenerateassociationrules.1.FromitemsetAgeneratetheruleA-{b}bforeachbA.Supportofrule=support(A).Confidenceof
rule=support(A)/support(A-{b})Copyright:Silberschatz,KorthandSudarshan45FindingSupportFewitemsets:determinesupportofallitemsetsviaasinglepas
sonsetoftransactionsAcountismaintainedforeachitemset,initiallysetto0.Whenatransactionisfetched,thecountisincr
ementedforeachsetofitemsthatiscontainedinthetransaction.Largeitemsets:setswithahighcountattheendofthepassManyitemsets:Ifmemo
rynotenoughtoholdallcountsforallitemsetsusemultiplepasses,consideringonlysomeitemsetsineachpass.Optimization:Onceanitemse
tiseliminatedbecauseitscount(support)istoosmallnoneofitssupersetsneedstobeconsidered.Theaprioritechniquetofindlargeitemsets:Pas
s1:countsupportofallsetswithjust1item.EliminatethoseitemswithlowsupportPassi:candidates:everysetofiitemssucht
hatallitsi-1itemsubsetsarelargeCountsupportofallcandidatesStopiftherearenocandidatesCopyright:Silbersc
hatz,KorthandSudarshan46OtherTypesofAssociationsBasicassociationruleshaveseverallimitationsDeviationsfromthee
xpectedprobabilityaremoreinterestingE.g.ifmanypeoplepurchasebread,andmanypeoplepurchasecereal,quiteafewwouldbeexpec
tedtopurchaseboth(prob1*prob2)WeareinterestedinpositiveaswellasnegativecorrelationsbetweensetsofitemsPositivecorrelation:co-occurrenceishigher
thanpredictedNegativecorrelation:co-occurrenceislowerthanpredictedSequenceassociations/correlationsE.g.wheneverbondsgoup,sto
ckpricesgodownin2daysDeviationsfromtemporalpatternsE.g.deviationfromasteadygrowthE.g.salesofwinterweargodowninsummerNotsurprising,partofaknow
npattern.LookfordeviationfromvaluepredictedusingpastpatternsCopyright:Silberschatz,KorthandSudarshan47Clustering
Clustering:Intuitively,findingclustersofpointsinthegivendatasuchthatsimilarpointslieinthesameclusterCanbeformalizedusingdistancemetri
csinseveralwaysE.g.Grouppointsintoksets(foragivenk)suchthattheaveragedistanceofpointsfromthecentroidoftheirassignedgroupisminimizedCentroid:
pointdefinedbytakingaverageofcoordinatesineachdimension.Anothermetric:minimizeaveragedistancebetweeneverypairofpoints
inaclusterHasbeenstudiedextensivelyinstatistics,butonsmalldatasetsDataminingsystemsaimatclusteringtechniquesthatcanhandleverylargeda
tasetsE.g.theBirchclusteringalgorithm(moreshortly)Copyright:Silberschatz,KorthandSudarshan48Hierarch
icalClusteringExamplefrombiologicalclassification(thewordclassificationheredoesnotmeanapredictionmechanis
m)chordatamammaliareptilialeopardshumanssnakescrocodilesOtherexamples:Internetdirectorysystems(e.g.Yahoo,moreonthislater)Agglomerativeclusteringa
lgorithmsBuildsmallclusters,thenclustersmallclustersintobiggerclusters,andsoonDivisiveclusteringalgorithmsStartwithallitemsinasinglecluster,
repeatedlyrefine(break)clustersintosmalleronesCopyright:Silberschatz,KorthandSudarshan49ClusteringAlgorithmsClusteringalgorithmshavebe
endesignedtohandleverylargedatasetsE.g.theBirchalgorithmMainidea:useanin-memoryR-treetostorepointsthatarebeingclustere
dInsertpointsoneatatimeintotheR-tree,merginganewpointwithanexistingclusterifislessthansomedistanceawayIftherearemoreleafnodesthanfitin
memory,mergeexistingclustersthatareclosetoeachotherAttheendoffirstpasswegetalargenumberofclustersattheleavesoftheR-treeMergeclusterstoreducethenu
mberofclustersCopyright:Silberschatz,KorthandSudarshan50CollaborativeFilteringGoal:predictwhatmovies/books/…apersonmaybeintere
stedin,onthebasisofPastpreferencesofthepersonOtherpeoplewithsimilarpastpreferencesThepreferencesofsuchpeopleforanewmovie/book/…Oneapp
roachbasedonrepeatedclusteringClusterpeopleonthebasisofpreferencesformoviesThenclustermoviesonthebasisofbeingliked
bythesameclustersofpeopleAgainclusterpeoplebasedontheirpreferencesfor(thenewlycreatedclustersof)moviesRepeatabovetillequi
libriumAboveproblemisaninstanceofcollaborativefiltering,whereuserscollaborateinthetaskoffilteringinformationtofindinformationofinterestCopyright:Sil
berschatz,KorthandSudarshan51OtherTypesofMiningTextmining:applicationofdataminingtotextualdocumentsE.g.clusterWebpagestofindrelatedpagesE.g.cluste
rpagesauserhasvisitedtoorganizetheirvisithistoryE.g.classifyWebpagesautomaticallyintoaWebdirectoryDat
avisualizationsystemshelpusersexaminelargevolumesofdataanddetectpatternsvisuallyE.g.maps,charts,andcolor-codingE.g.locationswithprob
lemsshowninredonamapCanvisuallyencodelargeamountsofinformationonasinglescreenHumansareverygoodadetectingvisualpatternsCopyright:Si
lberschatz,KorthandSudarshan52DataWarehousingCopyright:Silberschatz,KorthandSudarshan53DataWarehousingLargeorganizationshavecomplexinternalorganiza
tions,andhavedatastoredatdifferentlocations,ondifferentoperational(transactionprocessing)systems,underdifferentschemasDatasource
softenstoreonlycurrentdata,nothistoricaldataCorporatedecisionmakingrequiresaunifiedviewofallorganizationaldata,includinghi
storicaldataAdatawarehouseisarepository(archive)ofinformationgatheredfrommultiplesources,storedunderau
nifiedschema,atasinglesiteGreatlysimplifiesquerying,permitsstudyofhistoricaltrendsShiftsdecisionsupportqueryloadawayfromtrans
actionprocessingsystemsCopyright:Silberschatz,KorthandSudarshan54DataWarehousingCopyright:Silberschatz,Kort
handSudarshan55ComponentsofDataWarehouseWhenandhowtogatherdataSourcedrivenarchitecture:datasourcestra
nsmitnewinformationtowarehouse,eithercontinuouslyorperiodically(e.g.atnight)Destinationdrivenarchitec
ture:warehouseperiodicallyrequestsnewinformationfromdatasourcesKeepingwarehouseexactlysynchronizedwithdatasources(e.g.usingt
wo-phasecommit)istooexpensiveUsuallyOKtohaveslightlyout-of-datedataatwarehouseData/updatesareperiodicallydownloadedformonlinetransactionprocessin
g(OLTP)systems.WhatschematouseSchemaintegrationCopyright:Silberschatz,KorthandSudarshan56ComponentsofDataWarehouse(Cont.)DatacleansingE.g.correct
mistakesinaddressesE.g.misspellings,zipcodeerrorsMergeaddresslistsfromdifferentsourcesandpurgeduplicate
sKeeponlyoneaddressrecordperhousehold(“householding”)HowtopropagateupdatesWarehouseschemamaybea(materializ
ed)viewofschemafromdatasourcesEfficienttechniquesforupdateofmaterializedviewsWhatdatatosummarizeRawdatamaybetoolarg
etostoreon-lineAggregatevalues(totals/subtotals)oftensufficeQueriesonrawdatacanoftenbetransformedbyqueryopti
mizertouseaggregatevaluesCopyright:Silberschatz,KorthandSudarshan57DataWarehouseSchemasCopyright:Silberschatz,KorthandSudarshan58WarehouseSche
masTypicallywarehousedataismultidimensional,withverylargefacttablesExamplesofdimensions:item-id,date/timeofsale,storewheresal
ewasmade,customeridentifierExamplesofmeasures:numberofitemssold,priceofitemsDimensionvaluesareusual
lyencodedusingsmallintegersandmappedtofullvaluesviadimensiontablesResultantschemaiscalledastarschemaMorecomplicatedschemastructuresSnowfl
akeschema:multiplelevelsofdimensiontablesConstellation:multiplefacttablesCopyright:Silberschatz,KorthandSudarshan59InformationRetrievalCopyright:Sil
berschatz,KorthandSudarshan60InformationRetrievalSystemsInformationretrieval(IR)systemsuseasimplerdatamodelthandatabasesystemsIn
formationorganizedasacollectionofdocumentsDocumentsareunstructured,noschemaInformationretrievallocatesrelevantdocuments,
onthebasisofuserinputsuchaskeywordsorexampledocumentse.g.,finddocumentscontainingthewords“databasesystems”C
anbeusedevenontextualdescriptionsprovidedwithnon-textualdatasuchasimagesIRonWebdocumentshasbecomeextremelyimportantE.g.google,altavist
a,…Copyright:Silberschatz,KorthandSudarshan61InformationRetrievalSystems(Cont.)DifferencesfromdatabasesystemsIRsystemsdon‟tdealwith
transactionalupdates(includingconcurrencycontrolandrecovery)Databasesystemsdealwithstructureddata,withschemasthatdefinethedataorga
nizationIRsystemsdealwithsomequeryingissuesnotgenerallyaddressedbydatabasesystemsApproximatesearchingbykeywordsRankingofretrievedansw
ersbyestimateddegreeofrelevanceCopyright:Silberschatz,KorthandSudarshan62KeywordSearchInfulltextretriev
al,allthewordsineachdocumentareconsideredtobekeywords.WeusethewordtermtorefertothewordsinadocumentInfo
rmation-retrievalsystemstypicallyallowqueryexpressionsformedusingkeywordsandthelogicalconnectivesand,or,and
notAndsareimplicit,evenifnotexplicitlyspecifiedRankingofdocumentsonthebasisofestimatedrelevancetoaqueryiscrit
icalRelevancerankingisbasedonfactorssuchasTermfrequencyFrequencyofoccurrenceofquerykeywordindocumentInversedocumentfrequencyHowma
nydocumentsthequerykeywordoccursinFewergivemoreimportancetokeywordHyperlinkstodocumentsMorelinkstoadocumentdocumentismoreimportantCopyr
ight:Silberschatz,KorthandSudarshan63RelevanceRankingUsingTermsTF-IDF(Termfrequency/InverseDocumentfrequ
ency)ranking:Letn(d)=numberoftermsinthedocumentdn(d,t)=numberofoccurrencesoftermtinthedocumentd.Thenreleva
nceofadocumentdtoatermtThelogfactoristoavoidexcessiveweightagetofrequenttermsAndrelevanceofdocumenttoqueryQn(d)n(d,t)1+r(d,t)=logr(d,Q)=r(d,t)n(t)
tQCopyright:Silberschatz,KorthandSudarshan64RelevanceRankingUsingTerms(Cont.)Mostsystemsaddtotheabovemo
delWordsthatoccurintitle,authorlist,sectionheadings,etc.aregivengreaterimportanceWordswhosefirstoccurrenceislateinthedocumentaregivenlowerimportanc
eVerycommonwordssuchas“a”,“an”,“the”,“it”etcareeliminatedCalledstopwordsProximity:ifkeywordsinqueryoccurclosetogeth
erinthedocument,thedocumenthashigherimportancethaniftheyoccurfarapartDocumentsarereturnedindecreasingorderofrelevancescoreUsuallyonlytopfe
wdocumentsarereturned,notallCopyright:Silberschatz,KorthandSudarshan65RelevanceUsingHyperlinksWhenusingkey
wordqueriesontheWeb,thenumberofdocumentsisenormous(manybillions)Numberofdocumentsrelevanttoaquerycanbeenormousifonlyterm
frequenciesaretakenintoaccountUsingtermfrequenciesmakes“spamming”easyE.g.atravelagentcanaddmanyoccurrencesofthewords“travelage
nt”tohispagetomakeitsrankveryhighMostofthetimepeoplearelookingforpagesfrompopularsitesIdea:usepopularityofWeb
site(e.g.howmanypeoplevisitit)toranksitepagesthatmatchgivenkeywordsProblem:hardtofindactualpopularityofsiteSolu
tion:nextslideCopyright:Silberschatz,KorthandSudarshan66RelevanceUsingHyperlinks(Cont.)Solution:usenumberofhyperlinkstoasiteasameasureof
thepopularityorprestigeofthesiteCountonlyonehyperlinkfromeachsite(why?)Popularitymeasureisforsite,notforindividualpage
MosthyperlinksaretorootofsiteSite-popularitycomputationischeaperthanpagepopularitycomputationRefinements
Whencomputingprestigebasedonlinkstoasite,givemoreweightagetolinksfromsitesthatthemselveshavehigherpr
estigeDefinitioniscircularSetupandsolvesystemofsimultaneouslinearequationsAboveideaisbasisoftheGooglePageRankrankin
gmechanismCopyright:Silberschatz,KorthandSudarshan67RelevanceUsingHyperlinks(Cont.)Connectionstosocialnetworkingtheoriesthatrankedpres
tigeofpeopleE.g.thepresidentoftheUShasahighprestigesincemanypeopleknowhimSomeoneknownbymultipleprestigiouspeoplehashighprestigeHubandauthority
basedrankingAhubisapagethatstoreslinkstomanypages(onatopic)AnauthorityisapagethatcontainsactualinformationonatopicEachpagegetsahubprestigebasedonp
restigeofauthoritiesthatitpointstoEachpagegetsanauthorityprestigebasedonprestigeofhubsthatpointtoitAgain,prestigedefin
itionsarecyclic,andcanbegotbysolvinglinearequationsUseauthorityprestigewhenrankinganswerstoaqueryCopyright:Silberschatz,KorthandSud
arshan68SimilarityBasedRetrievalSimilaritybasedretrieval-retrievedocumentssimilartoagivendocumentSimilaritymaybedefinedonthebasisofc
ommonwordsE.g.findktermsinAwithhighestr(d,t)andusethesetermstofindrelevanceofotherdocuments;eachofthetermscarriesa
weightofr(d,t)SimilaritycanbeusedtorefineanswersettokeywordqueryUserselectsafewrelevantdocumentsfromthoseretrie
vedbykeywordquery,andsystemfindsotherdocumentssimilartotheseCopyright:Silberschatz,KorthandSudarshan69S
ynonymsandHomonymsSynonymsE.g.document:“motorcyclerepair”,query:“motorcyclemaintenance”needtorealizethat“maintenance”a
nd“repair”aresynonymsSystemcanextendqueryas“motorcycleand(repairormaintenance)”HomonymsE.g.“object”hasdifferentmeaningsasnoun
/verbCandisambiguatemeanings(tosomeextent)fromthecontextExtendingqueriesautomaticallyusingsynonymscanbeproblematicNeedtounderstand
intendedmeaninginordertoinfersynonymsOrverifysynonymswithuserSynonymsmayhaveothermeaningsaswellCop
yright:Silberschatz,KorthandSudarshan70IndexingofDocumentsAninvertedindexmapseachkeywordKitoasetofdocumentsSithatcontai
nthekeywordDocumentsidentifiedbyidentifiersInvertedindexmayrecordKeywordlocationswithindocumenttoallowproximitybasedrankingCountsofnu
mberofoccurrencesofkeywordtocomputeTFandoperation:FindsdocumentsthatcontainallofK1,K2,...,Kn.IntersectionS1S2.....Snoroperation:do
cumentsthatcontainatleastoneofK1,K2,…,Knunion,S1S2.....Sn,.EachSiiskeptsortedtoallowefficientintersection/unionbymerging“not”c
analsobeefficientlyimplementedbymergingofsortedlistsCopyright:Silberschatz,KorthandSudarshan71MeasuringRetrievalEffectivenessIRsystemss
avespacebyusingindexstructuresthatsupportonlyapproximateretrieval.Mayresultin:falsenegative(falsedrop)-somerelevantdocumentsmaynotberetrie
ved.falsepositive-someirrelevantdocumentsmayberetrieved.Formanyapplicationsagoodindexshouldnotpermi
tanyfalsedrops,butmaypermitafewfalsepositives.Relevantperformancemetrics:Precision-whatpercentageoftheretrieveddocumentsarerele
vanttothequery.Recall-whatpercentageofthedocumentsrelevanttothequerywereretrieved.Copyright:Silberschatz,KorthandSudarshan72MeasuringRetrievalEf
fectiveness(Cont.)Rankingordercanalsoresultinfalsepositives/falsenegativeRecallvs.precisiontradeoff:Canincreaserecall
byretrievingmanydocuments(downtoalowlevelrelevanceranking),butmanyirrelevantdocumentswouldbefetched,reducingprecisionMeasuresofretrievaleffecti
veness:Recallasafunctionofnumberofdocumentsfetched,orPrecisionasafunctionofrecallEquivalently,asafunctionofn
umberofdocumentsfetchedE.g.“precisionof75%atrecallof50%,and60%atarecallof75%”Ingeneral:drawagraphofprecisionvsrecall.Problem:whic
hdocumentsareactuallyrelevant,andwhichanotUsualsolution:humanjudgesCreateacorpusofdocumentsandqueries,withhumansdeciding
whichdocumentsarerelevanttowhichqueriesTREC(TextREtrievalConference)BenchmarkCopyright:Silberschatz,Kortha
ndSudarshan73WebCrawlingWebcrawlersareprogramsthatlocateandgatherinformationontheWebRecursivelyfollowhyperlinkspresentinknowndocuments,tofindot
herdocumentsStartingfromaseedsetofdocumentsFetcheddocumentsHandedovertoanindexingsystemCanbediscardedafterindexing,orstoreasacachedcopyCrawli
ngtheentireWebwouldtakeaverylargeamountoftimeSearchenginestypicallycoveronlyapartoftheWeb,notallofitTakemonthstoperformasin
glecrawlCopyright:Silberschatz,KorthandSudarshan74WebCrawling(Cont.)Crawlingisdonebymultipleprocessesonmultiplemachines,runni
nginparallelSetoflinkstobecrawledstoredinadatabaseNewlinksfoundincrawledpagesaddedtothisset,tobecrawledl
aterIndexingprocessalsorunsonmultiplemachinesCreatesanewcopyofindexinsteadofmodifyingoldindexOldindexisusedtoanswerqueriesAfteracrawlis“compl
eted”newindexbecomes“old”indexMultiplemachinesusedtoanswerqueriesIndicesmaybekeptinmemoryQueriesmayberoutedtodifferentmachi
nesforloadbalancingCopyright:Silberschatz,KorthandSudarshan75BrowsingStoringrelateddocumentstogetherinalibraryfa
cilitatesbrowsinguserscanseenotonlyrequesteddocumentbutalsorelatedones.Browsingisfacilitatedbyclassificationsystemthato
rganizeslogicallyrelateddocumentstogether.Organizationishierarchical:classificationhierarchyCopyright:Silbersch
atz,KorthandSudarshan76AClassificationHierarchyForALibrarySystemCopyright:Silberschatz,KorthandSudars
han77ClassificationDAGDocumentscanresideinmultipleplacesinahierarchyinaninformationretrievalsystem,sincephysicallo
cationisnotimportant.ClassificationhierarchyisthusDirectedAcyclicGraph(DAG)Copyright:Silberschatz,Kortha
ndSudarshan78AClassificationDAGForALibraryInformationRetrievalSystemCopyright:Silberschatz,KorthandSudarshan79We
bDirectoriesAWebdirectoryisjustaclassificationdirectoryonWebpagesE.g.Yahoo!Directory,OpenDirectoryprojectIssu
es:Whatshouldthedirectoryhierarchybe?Givenadocument,whichnodesofthedirectoryarecategoriesrelevanttothedocumentOftendonemanuallyClas
sificationofdocumentsintoahierarchymaybedonebasedontermsimilarity