【文档说明】数据库原理与应用课程组课件.ppt,共(79)页,731.500 KB,由小橙橙上传
转载请保留链接:https://www.ichengzhen.cn/view-92522.html
以下为本文档部分文字说明:
Copyright:Silberschatz,KorthandSudarshan1Chapter22:AdvancedQueryingandInformationRetrieval陕西师范大学计算机科学学院数据库原理与应用课程组Copyrig
ht:Silberschatz,KorthandSudarshan2Chapter22:AdvancedQueryingandInformationRetrievalDecision-SupportSystemsDataAna
lysisOLAPExtendedaggregationfeaturesinSQLWindowingandrankingDataMiningDataWarehousingInformation-RetrievalS
ystemsIncludingWebsearchCopyright:Silberschatz,KorthandSudarshan3DecisionSupportSystemsDecision-supportsy
stemsareusedtomakebusinessdecisionsoftenbasedondatacollectedbyon-linetransaction-processingsystems.Examplesofbusi
nessdecisions:whatitemstostock?Whatinsurancepremiumtochange?Whotosendadvertisementsto?Examplesofdatausedformak
ingdecisionsRetailsalestransactiondetailsCustomerprofiles(income,age,sex,etc.)Copyright:Silberschatz,KorthandSudarshan4Decision-SupportSy
stems:OverviewDataanalysistasksaresimplifiedbyspecializedtoolsandSQLextensionsExampletasksForeachproductcategoryandeach
region,whatwerethetotalsalesinthelastquarterandhowdotheycomparewiththesamequarterlastyearAsabove,foreachproductcategorya
ndeachcustomercategoryStatisticalanalysispackages(e.g.,:S++)canbeinterfacedwithdatabasesStatisticalanalysisisalargefieldw
illnotstudyithereDataminingseekstodiscoverknowledgeautomaticallyintheformofstatisticalrulesandpatternsfromLargedatabases.Adatawarehousearchivesin
formationgatheredfrommultiplesources,andstoresitunderaunifiedschema,atasinglesite.Importantforlargebusi
nesseswhichgeneratedatafrommultipledivisions,possiblyatmultiplesitesDatamayalsobepurchasedexternallyCopyrig
ht:Silberschatz,KorthandSudarshan5DataAnalysisandOLAPAggregatefunctionssummarizelargevolumesofdataOnlineAnalyticalProc
essing(OLAP)Interactiveanalysisofdata,allowingdatatobesummarizedandviewedindifferentwaysinanonlinefashion(withnegligibledelay)
Datathatcanbemodeledasdimensionattributesandmeasureattributesarecalledmultidimensionaldata.Givenarelationusedfordataanalysis,wecanident
ifysomeofitsattributesasmeasureattributes,sincetheymeasuresomevalue,andcanbeaggregatedupon.Forinstance,theattributenumberofthe
salesrelationisameasureattribute,sinceitmeasuresthenumberofunitssold.Someoftheotherattributesoftherelationareidentifiedasdimensionattr
ibutes,sincetheydefinethedimensionsonwhichmeasureattributes,andsummariesofmeasureattributes,areviewed.Copyright:Silberschatz,KorthandSudarshan6Cro
ssTabulationofsalesbyitem-nameandcolorThetableaboveisanexampleofacross-tabulation(cross-tab),alsoreferre
dtoasapivot-table.Across-tabisatablewherevaluesforoneofthedimensionattributesformtherowheaders,valuesforanotherdimensionattribu
teformthecolumnheadersOtherdimensionattributesarelistedontopValuesinindividualcellsare(aggregatesof)t
hevaluesofthedimensionattributesthatspecifythecell.Copyright:Silberschatz,KorthandSudarshan7Relationa
lRepresentationofCrosstabsCrosstabscanberepresentedasrelationsThevalueallisusedtorepresentaggregatesTheSQL:1999standardactuallyuse
snullvaluesinplaceofallMoreonthislater….Copyright:Silberschatz,KorthandSudarshan8Three-DimensionalD
ataCubeAdatacubeisamultidimensionalgeneralizationofacrosstabCannotviewathree-dimensionalobjectinit
sentiretybutcrosstabscanbeusedasviewsonadatacubeCopyright:Silberschatz,KorthandSudarshan9OnlineAnalyticalProcessingTheoperationofchangingthedime
nsionsusedinacross-tabiscalledpivotingSupposeananalystwishestoseeacross-tabonitem-nameandcolorforafixedvalueofsize,forexample,large,insteadof
thesumacrossallsizes.Suchanoperationisreferredtoasslicing.Theoperationissometimescalleddicing,particularlywhenva
luesformultipledimensionsarefixed.Theoperationofmovingfromfiner-granularitydatatoacoarsergranularityiscalledarollup.Theopposite
operation-thatofmovingfromcoarser-granularitydatatofiner-granularitydata–iscalledadrilldown.Copyright:Silberschatz,KorthandS
udarshan10HierarchiesonDimensionsHierarchyondimensionattributes:letsdimensionstobeviewedatdifferentlevelsofdetailE.g.thedimensionDat
eTimecanbeusedtoaggregatebyhourofday,date,dayofweek,month,quarteroryearCopyright:Silberschatz,KorthandSudarshan11CrossTabulationWithHierarchyC
rosstabscanbeeasilyextendedtodealwithhierarchiesCandrilldownorrolluponahierarchyCopyright:Silbersch
atz,KorthandSudarshan12OLAPImplementationTheearliestOLAPsystemsusedmultidimensionalarraysinmemorytostoredatacubes,
andarereferredtoasmultidimensionalOLAP(MOLAP)systems.OLAPimplementationsusingonlyrelationaldatabasefeaturesarecalledrelationalOLAP(ROLAP
)systemsHybridsystems,whichstoresomesummariesinmemoryandstorethebasedataandothersummariesinarelationaldatabase,arecalledhybridOLAP(HOLAP)system
s.Copyright:Silberschatz,KorthandSudarshan13OLAPImplementation(Cont.)EarlyOLAPsystemsprecomputedallpossibleaggregatesino
rdertoprovideonlineresponseSpaceandtimerequirementsfordoingsocanbeveryhigh2ncombinationsofgroupbyItsufficestoprecomputesomeaggre
gates,andcomputeothersondemandfromoneoftheprecomputedaggregatesCancomputeaggregateon(item-name,color)fromanaggregateon
(item-name,color,size)Forallbutafew“non-decomposable”aggregatessuchasmedianischeaperthancomputingi
tfromscratchSeveraloptimizationsavailableforcomputingmultipleaggregatesCancomputeaggregateon(item-name,color)fromanaggregateon(item-name,color,si
ze)Cancomputeaggregateson(item-name,color,size),(item-name,color)and(item-name)usingasinglesortingofthe
basedataCopyright:Silberschatz,KorthandSudarshan14ExtendedAggregationSQL-92aggregationquitelimitedManyusefulaggregatesareeith
erveryhardorimpossibletospecifyDatacubeComplexaggregates(median,variance)binaryaggregates(correlation,regr
essioncurves)rankingqueries(“assigneachstudentarankbasedonthetotalmarks”SQL:1999OLAPextensionsprovideavarietyof
aggregationfunctionstoaddressabovelimitationsSupportedbyseveraldatabases,includingOracleandIBMDB2Copyright:Silberschatz,Kortha
ndSudarshan15ExtendedAggregationinSQL:1999Thecubeoperationcomputesunionofgroupby‟soneverysubsetofthespecifiedattributesE.g.considerthe
queryselectitem-name,color,size,sum(number)fromsalesgroupbycube(item-name,color,size)Thiscomputestheunionofeightdifferentgroupingsofthesalesrela
tion:{(item-name,color,size),(item-name,color),(item-name,size),(color,size),(item-name),(color),(size),()}where()denotesanemptygrou
pbylist.Foreachgrouping,theresultcontainsthenullvalueforattributesnotpresentinthegrouping.Copyright:Silberschatz,Kortha
ndSudarshan16ExtendedAggregation(Cont.)Relationalrepresentationofcrosstabthatwesawearlier,butwithnullinplace
ofall,canbecomputedbyselectitem-name,color,sum(number)fromsalesgroupbycube(item-name,color)Thefunctiongrouping()canbeapp
liedonanattributeReturns1ifthevalueisanullvaluerepresentingall,andreturns0inallothercases.selectitem-name,color,size,sum(number),grou
ping(item-name)asitem-name-flag,grouping(color)ascolor-flag,grouping(size)assize-flag,fromsalesgroupbycube(item-name
,color,size)Canusethefunctiondecode()intheselectclausetoreplacesuchnullsbyavaluesuchasallE.g.replaceitem-
nameinfirstquerybydecode(grouping(item-name),1,„all‟,item-name)Copyright:Silberschatz,KorthandSudarshan17ExtendedAggr
egation(Cont.)TherollupconstructgeneratesuniononeveryprefixofspecifiedlistofattributesE.g.selectitem-name,color,size,sum(nu
mber)fromsalesgroupbyrollup(item-name,color,size)Generatesunionoffourgroupings:{(item-name,color,size),(item-name,c
olor),(item-name),()}Rollupcanbeusedtogenerateaggregatesatmultiplelevelsofahierarchy.E.g.,supposetableitemca
tegory(item-name,category)givesthecategoryofeachitem.Thenselectcategory,item-name,sum(number)fromsales,itemcategorywheresales.item-name=itemcategor
y.item-namegroupbyrollup(category,item-name)wouldgiveahierarchicalsummarybyitem-nameandbycategory.Copyright:Silberschatz,KorthandSuda
rshan18ExtendedAggregation(Cont.)MultiplerollupsandcubescanbeusedinasinglegroupbyclauseEachgeneratessetofgroupb
ylists,crossproductofsetsgivesoverallsetofgroupbylistsE.g.,selectitem-name,color,size,sum(number)fromsalesgroupbyrollup(item-name),rollup(color,s
ize)generatesthegroupings{item-name,()}X{(color,size),(color),()}={(item-name,color,size),(item-name,color),(item-name),(color,si
ze),(color),()}Copyright:Silberschatz,KorthandSudarshan19RankingRankingisdoneinconjunctionwithanorderbyspecification.Givenarelationstudent-ma
rks(student-id,marks)findtherankofeachstudent.selectstudent-id,rank()over(orderbymarksdesc)ass-rankfromstudent-marksAnextraorderby
clauseisneededtogettheminsortedorderselectstudent-id,rank()over(orderbymarksdesc)ass-rankfromstudent-marksorderbys-rankRan
kingmayleavegaps:e.g.if2studentshavethesametopmark,bothhaverank1,andthenextrankis3dense_rankdoesnotleav
egaps,sonextdenserankwouldbe2Copyright:Silberschatz,KorthandSudarshan20Ranking(Cont.)Rankingcanbedonewi
thinpartitionofthedata.“Findtherankofstudentswithineachsection.”selectstudent-id,section,rank()over(partitionbysectionorderbymarksdesc)assec-r
ankfromstudent-marks,student-sectionwherestudent-marks.student-id=student-section.student-idorderbysection,sec-rankMultiplerankclausescanocc
urinasingleselectclauseRankingisdoneafterapplyinggroupbyclause/aggregationExercises:FindstudentswithtopnranksManysystemsprovidesp
ecial(non-standard)syntaxfor“top-n”queriesRankstudentsbysumoftheirmarksindifferentcoursesgivenrelatio
nstudent-course-marks(student-id,course,marks)Copyright:Silberschatz,KorthandSudarshan21Ranking(Cont.)Otherrankingfunctions:percent_
rank(withinpartition,ifpartitioningisdone)cume_dist(cumulativedistribution)fractionoftupleswithprec
edingvaluesrow_number(non-deterministicinpresenceofduplicates)SQL:1999permitstheusertospecifynullsfirstornullslastselectstudent-id,rank()over(o
rderbymarksdescnullslast)ass-rankfromstudent-marksCopyright:Silberschatz,KorthandSudarshan22Ranking(Cont.)Forag
ivenconstantn,therankingthefunctionntile(n)takesthetuplesineachpartitioninthespecifiedorder,anddividesthem
intonbucketswithqualnumbersoftuples.Forinstance,weansortemployeesbysalary,andusentile(3)tofindwhichrange(bottomthir
d,middlethird,ortopthird)eachemployeeisin,andcomputethetotalsalaryearnedbyemployeesineachrange:selectthreetile,sum(salary)from(selectsalary,ntile(3)o
ver(orderbysalary)asthreetilefromemployee)assgroupbythreetileCopyright:Silberschatz,KorthandSudarshan23Windo
wingE.g.:“Givensalesvaluesforeachdate,calculateforeachdatetheaverageofthesalesonthatday,thepreviousday,andthenextday”Suchmovingaverageq
ueriesareusedtosmoothoutrandomvariations.Incontrasttogroupby,thesametuplecanexistinmultiplewindowsWindowspecific
ationinSQL:Orderingoftuples,sizeofwindowforeachtuple,aggregatefunctionE.g.givenrelationsales(date,value)selectdate,sum(val
ue)over(orderbydatebetweenrows1precedingand1following)fromsalesExamplesofotherwindowspecifications:betweenrowsunboundedprecedingand
currentrowsunboundedprecedingrangebetween10precedingandcurrentrowAllrowswithvaluesbetweencurrentrowvalue–10tocurrentvalu
erangeinterval10dayprecedingNotincludingcurrentrowCopyright:Silberschatz,KorthandSudarshan24Windowing(Cont.)Candowindowingwithinpartitions
E.g.Givenarelationtransaction(account-number,date-time,value),wherevalueispositiveforadepositandnegativeforawithdrawal“Find
totalbalanceofeachaccountaftereachtransactionontheaccount”selectaccount-number,date-time,sum(value)over(partitionbyaccount-numberorderbyda
te-timerowsunboundedpreceding)asbalancefromtransactionorderbyaccount-number,date-timeCopyright:Silberschatz,KorthandSudarshan25Da
taMiningCopyright:Silberschatz,KorthandSudarshan26DataMiningBroadlyspeaking,dataminingistheprocessofsemi-
automaticallyanalyzinglargedatabasestofindusefulpatternsLikeknowledgediscoveryinartificialintelligencedataminingdiscoversstatisticalrulesand
patternsDiffersfrommachinelearninginthatitdealswithlargevolumesofdatastoredprimarilyondisk.Sometypesofknowledgediscoveredfromadatab
asecanberepresentedbyasetofrules.e.g.,:“Youngwomenwithannualincomesgreaterthan$50,000aremostlikelytobuysportscars”Othertypesofknowledgerepresen
tedbyequations,orbypredictionfunctionsSomemanualinterventionisusuallyrequiredPre-processingofdata,choiceofwhichtypeofpatternt
ofind,postprocessingtofindnovelpatternsCopyright:Silberschatz,KorthandSudarshan27ApplicationsofDataMiningPredictionbasedonpasthistor
yPredictifacreditcardapplicantposesagoodcreditrisk,basedonsomeattributes(income,jobtype,age,..)andpasth
istoryPredictifacustomerislikelytoswitchbrandloyaltyPredictifacustomerislikelytorespondto“junkmail”Predictifapatternofphonecallingcardus
ageislikelytobefraudulentSomeexamplesofpredictionmechanisms:ClassificationGivenatrainingsetconsistingofitemsbelongingt
odifferentclasses,andanewitemwhoseclassisunknown,predictwhichclassitbelongstoRegressionformulaegivenasetofparameter-va
luetofunction-resultmappingsforanunknownfunction,predictthefunction-resultforanewparameter-valueCopyright:Silberschat
z,KorthandSudarshan28ApplicationsofDataMining(Cont.)DescriptivePatternsAssociationsFindbooksthatareoftenboughtbythesamecustomers.Ifanewcust
omerbuysonesuchbook,suggestthathebuystheotherstoo.Othersimilarapplications:cameraaccessories,clothes,etc.Associations
mayalsobeusedasafirststepindetectingcausationE.g.associationbetweenexposuretochemicalXandcancer,ornewmedicineandcardiacproblemsClustersE.
g.typhoidcaseswereclusteredinanareasurroundingacontaminatedwellDetectionofclustersremainsimportantindetectingepidemicsCopyright:Silb
erschatz,KorthandSudarshan29ClassificationRulesClassificationruleshelpassignnewobjectstoasetofclasses.
E.g.,givenanewautomobileinsuranceapplicant,shouldheorshebeclassifiedaslowrisk,mediumriskorhighrisk?C
lassificationrulesforaboveexamplecoulduseavarietyofknowledge,suchaseducationallevelofapplicant,salaryofapplicant,ageofapplicant,etc.pers
onP,P.degree=mastersandP.income>75,000P.credit=excellentpersonP,P.degree=bachelorsand(P.income25,000andP.income75,000)P.credit=
goodRulesarenotnecessarilyexact:theremaybesomemisclassificationsClassificationrulescanbecompactlyshownasadecisiontree.Copyright:Silber
schatz,KorthandSudarshan30DecisionTreeCopyright:Silberschatz,KorthandSudarshan31ConstructionofDecisionTreesTrainingset:adatasamplei
nwhichthegroupingforeachtupleisalreadyknown.Considercreditriskexample:Supposedegreeischosentopartitionthedataatth
eroot.Sincedegreehasasmallnumberofpossiblevalues,onechildiscreatedforeachvalue.Ateachchildnodeofther
oot,furtherclassificationisdoneifrequired.Here,partitionsaredefinedbyincome.Sinceincomeisacontinuousattribute,somenumberof
intervalsarechosen,andonechildcreatedforeachinterval.Differentclassificationalgorithmsusedifferentwaysofchoosingwhichattributetopartiti
ononateachnode,andwhattheintervals,ifany,are.IngeneralDifferentbranchesofthetreecouldgrowtodifferentlevels.Differentno
desatthesamelevelmayusedifferentpartitioningattributes.Copyright:Silberschatz,KorthandSudarshan32ConstructionofDecisi
onTrees(Cont.)Greedytopdowngenerationofdecisiontrees.Eachinternalnodeofthetreepartitionsthedataintogroupsbasedonapartitioningattribute,an
dapartitioningconditionforthenodeMoreonchoosingpartioningattribute/conditionshortlyAlgorithmisgreedy:thechoiceismadeonceandnotrevisitedasmor
eofthetreeisconstructedThedataatanodeisnotpartitionedfurtherifeitherall(ormost)oftheitemsatthenodebelongtothesameclass,or
allattributeshavebeenconsidered,andnofurtherpartitioningispossible.Suchanodeisaleafnode.Otherwisethedataatt
henodeispartitionedfurtherbypickinganattributeforpartitioningdataatthenode.Copyright:Silberschatz,KorthandSudarshan33BestSplitsIdea:evalu
atedifferentattributesandpartitioningconditionsandpictheonethatbestimprovesthe“purity”ofthetrainingsetexamp
lesTheinitialtrainingsethasamixtureofinstancesfromdifferentclassesandisthusrelativelyimpureE.g.ifde
greeexactlypredictscreditrisk,partitioningondegreewouldresultineachildhavinginstancesofonlyoneclassI.e.,thechildnodeswouldbepureThepurit
yofasetSoftraininginstancescanbemeasuredquantitativelyinseveralways.Notation:numberofclasses=k,numberofinstances=|S|,fractionofinstancesin
classi=pi.TheGinimeasureofpurityisdefinedas[Gini(S)=1-Whenallinstancesareinasingleclass,theGinivalueis0,whileitreachesitsmaximum(of1–1
/k)ifeachclassthesamenumberofinstances.ki-1p2iCopyright:Silberschatz,KorthandSudarshan34BestSplits(Cont.)Anothermeasureofpurityi
stheentropymeasure,whichisdefinedasentropy(S)=–WhenasetSissplitintomultiplesetsSi,I=1,2,…,r,wecanmeasurethepurityofthere
sultantsetofsetsas:purity(S1,S2,…..,Sr)=TheinformationgainduetoparticularsplitofSintoSi,i=1,2,….,rInformation-gain(S,{S1,S2,….,Sr)=
purity(S)–purity(S1,S2,…Sr)ri=1|Si||S|purity(Si)ki-1pilog2piCopyright:Silberschatz,KorthandSudarshan35BestSpli
ts(Cont.)Measureof“cost”ofasplit:Information-content(S,{S1,S2,…..,Sr}))=–Information-gainratio=Information-gain(S,{S1,
S2,……,Sr})Information-content(S,{S1,S2,…..,Sr})ThebestsplitforanattributeistheonethatgivesthemaximuminformationgainratioContinuousvalu
edattributesCanbeorderedinafashionmeaningfultoclassificatione.g.integerandrealvaluesCategoricalattrib
utesCannotbemeaningfullyordered(e.g.country,school/university,item-color,.):log2ri-1|Si||S||Si||S|Copyright:Silberschatz,Kortha
ndSudarshan36FindingBestSplitsCategoricalattributes:Multi-waysplit,onechildforeachvaluemayhavetoomanychildreninsomecas
esBinarysplit:tryallpossiblebreakupofvaluesintotwosets,andpickthebestContinuousvaluedattributeBinarysplit:Sortvaluesintheins
tances,tryeachasasplitpointE.g.ifvaluesare1,10,15,25,splitat1,10,15PickthevaluethatgivesbestsplitMulti-waysplit:morecomplicated,seebibliog
raphicnotesAseriesofbinarysplitsonthesameattributehasroughlyequivalenteffectCopyright:Silberschatz,KorthandSudarshan3
7Decision-TreeConstructionAlgorithmProcedureGrow.Tree(S)Partition(S);ProcedurePartition(S)if(purity(S)>por|S|<s)thenreturn;foreachattribut
eAevaluatesplitsonattributeA;Usebestsplitfound(acrossallattributes)topartitionSintoS1,S2,….,Sr,fori=1,2,…..,rPartition(Si);Copyright:Silber
schatz,KorthandSudarshan38DecisionTreeConstr’nAlgo’s.(Cont.)Varietyofalgorithmshavebeendevelopedto
ReduceCPUcostand/orReduceIOcostwhenhandlingdatasetslargerthanmemoryImproveaccuracyofclassificationDecisiontreemaybe
overfitted,i.e.,overlytunedtogiventrainingsetPruningofdecisiontreemaybedoneonbranchesthathavetoofew
traininginstancesWhenasubtreeispruned,aninternalnodebecomesaleafanditsclassissettothemajorityclassoftheinstancesthatmaptothenodePrunin
gcanbedonebyusingapartofthetrainingsettobuildtree,andasecondparttotestthetreeprunesubtreesthatincreasemisclass
ificationonsecondpartCopyright:Silberschatz,KorthandSudarshan39OtherTypesofClassifiersFurthertypesofclassifie
rsNeuralnetclassifiersBayesianclassifiersNeuralnetclassifiersusethetrainingdatatotrainartificialneu
ralnetsWidelystudiedinAI,won‟tcoverhereBayesianclassifiersuseBayestheorem,whichsaysp(cj|d)=p(d|cj)p(cj)p(d)w
herep(cj|d)=probabilityofinstancedbeinginclasscj,p(d|cj)=probabilityofgeneratinginstancedgivenclassc
j,p(cj)=probabilityofoccurrenceofclasscj,andp(d)=probabilityofinstancedoccuringCopyright:Silberschatz,KorthandSudarshan40NaïveBayesianClassifiersBay
esianclassifiersrequirecomputationofp(d|cj)precomputationofp(cj)p(d)canbeignoredsinceitisthesameforallclassesTosimplifyt
hetask,naïveBayesianclassifiersassumeattributeshaveindependentdistributions,andtherebyestimatep(d|cj)=p(d1|cj)*p(d
2|cj)*….*(p(dn|cj)Eachofthep(di|cj)canbeestimatedfromahistogramondivaluesforeachclasscjthehistogramiscomputedfromthetraininginstancesHistogramson
multipleattributesaremoreexpensivetocomputeandstoreCopyright:Silberschatz,KorthandSudarshan41RegressionRegressiondealswit
hthepredictionofavalue,ratherthanaclass.Givenvaluesforasetofvariables,X1,X2,…,Xn,wewishtopredictthevalueofavariab
leY.Onewayistoinfercoefficientsa0,a1,a1,…,ansuchthatY=a0+a1*X1+a2*X2+…+an*XnFindingsuchalinearpolynomialiscalledlinearregression.Ingener
al,theprocessoffindingacurvethatfitsthedataisalsocalledcurvefitting.Thefitmayonlybeapproximatebecauseofnoiseinthedata,orbecausetherelationshipis
notexactlyapolynomialRegressionaimstofindcoefficientsthatgivethebestpossiblefit.Copyright:Silberschatz
,KorthandSudarshan42AssociationRulesRetailshopsareofteninterestedinassociationsbetweendifferentitemsthatpeoplebuy.S
omeonewhobuysbreadisquitelikelyalsotobuymilkApersonwhoboughtthebookDatabaseSystemConceptsisquitelikelyalsotobuy
thebookOperatingSystemConcepts.Associationsinformationcanbeusedinseveralways.E.g.whenacustomerbuysaparticularbook,anonlineshopmaysuggestas
sociatedbooks.Associationrules:breadmilkDB-Concepts,OS-ConceptsNetworksLefthandside:antecedent,righthandside:consequentAnassoci
ationrulemusthaveanassociatedpopulation;thepopulationconsistsofasetofinstancesE.g.eachtransaction(s
ale)atashopisaninstance,andthesetofalltransactionsisthepopulationCopyright:Silberschatz,KorthandSuda
rshan43AssociationRules(Cont.)Ruleshaveanassociatedsupport,aswellasanassociatedconfidence.Supportisameasureofwhatfract
ionofthepopulationsatisfiesboththeantecedentandtheconsequentoftherule.E.g.supposeonly0.001percentofallpurchasesincl
udemilkandscrewdrivers.Thesupportfortheruleismilkscrewdriversislow.WeusuallywantruleswithareasonablyhighsupportRuleswithlowsupportareusuallynotv
eryusefulConfidenceisameasureofhowoftentheconsequentistruewhentheantecedentistrue.E.g.therulebreadmilkhasa
confidenceof80percentif80percentofthepurchasesthatincludebreadalsoincludemilk.Usuallywantruleswithreasonablylargeconfidence.A
rulewithalowconfidenceisnotmeaningful.Notethattheconfidenceofbreadmilkmaybeverydifferentfromtheconfidenceofmilkbread,althoughbothhavethesam
esupports.Copyright:Silberschatz,KorthandSudarshan44FindingAssociationRulesWearegenerallyonlyinterestedinassociationruleswithreasona
blyhighsupport(e.g.supportof2%orgreater)Naïvealgorithm1.Considerallpossiblesetsofrelevantitems.2.Foreachse
tfinditssupport(i.e.counthowmanytransactionspurchaseallitemsintheset).Largeitemsets:setswithsufficientlyhighsupport3.Uselargeitemsetstoge
nerateassociationrules.1.FromitemsetAgeneratetheruleA-{b}bforeachbA.Supportofrule=support(A).Confidenceofrule=support(A)/supp
ort(A-{b})Copyright:Silberschatz,KorthandSudarshan45FindingSupportFewitemsets:determinesupportofallitemsetsviaasinglepassonsetoftransactionsAco
untismaintainedforeachitemset,initiallysetto0.Whenatransactionisfetched,thecountisincrementedforeachsetofitemsthatiscontainedinthet
ransaction.Largeitemsets:setswithahighcountattheendofthepassManyitemsets:Ifmemorynotenoughtoholdallcountsforallitemsetsusemultiplepasses,con
sideringonlysomeitemsetsineachpass.Optimization:Onceanitemsetiseliminatedbecauseitscount(support)istoosmallnoneofitssupersetsneed
stobeconsidered.Theaprioritechniquetofindlargeitemsets:Pass1:countsupportofallsetswithjust1item.Eliminatetho
seitemswithlowsupportPassi:candidates:everysetofiitemssuchthatallitsi-1itemsubsetsarelargeCountsupportof
allcandidatesStopiftherearenocandidatesCopyright:Silberschatz,KorthandSudarshan46OtherTypesofAssociationsB
asicassociationruleshaveseverallimitationsDeviationsfromtheexpectedprobabilityaremoreinterestingE.g.ifmanypeoplepurchasebread,a
ndmanypeoplepurchasecereal,quiteafewwouldbeexpectedtopurchaseboth(prob1*prob2)Weareinterestedinpositiveaswellasnegativecorrelationsbetweensets
ofitemsPositivecorrelation:co-occurrenceishigherthanpredictedNegativecorrelation:co-occurrenceislowerthanpredictedSequ
enceassociations/correlationsE.g.wheneverbondsgoup,stockpricesgodownin2daysDeviationsfromtemporalpatternsE.g.deviationfromastea
dygrowthE.g.salesofwinterweargodowninsummerNotsurprising,partofaknownpattern.Lookfordeviationfromvaluepredictedusingpastpattern
sCopyright:Silberschatz,KorthandSudarshan47ClusteringClustering:Intuitively,findingclustersofpointsinthegivendatasuchthatsimilarpo
intslieinthesameclusterCanbeformalizedusingdistancemetricsinseveralwaysE.g.Grouppointsintoksets(fo
ragivenk)suchthattheaveragedistanceofpointsfromthecentroidoftheirassignedgroupisminimizedCentroid:pointdefinedbytakingaverageof
coordinatesineachdimension.Anothermetric:minimizeaveragedistancebetweeneverypairofpointsinaclusterHa
sbeenstudiedextensivelyinstatistics,butonsmalldatasetsDataminingsystemsaimatclusteringtechniquesthatcanhandleverylargedatasets
E.g.theBirchclusteringalgorithm(moreshortly)Copyright:Silberschatz,KorthandSudarshan48HierarchicalClusteringExamplefrombiologicalclassi
fication(thewordclassificationheredoesnotmeanapredictionmechanism)chordatamammaliareptilialeopardshumanssnakescrocodile
sOtherexamples:Internetdirectorysystems(e.g.Yahoo,moreonthislater)AgglomerativeclusteringalgorithmsBuildsmallclus
ters,thenclustersmallclustersintobiggerclusters,andsoonDivisiveclusteringalgorithmsStartwithallitemsinasinglecluster,repe
atedlyrefine(break)clustersintosmalleronesCopyright:Silberschatz,KorthandSudarshan49ClusteringAlgorithmsClusteringalgorithmshavebeend
esignedtohandleverylargedatasetsE.g.theBirchalgorithmMainidea:useanin-memoryR-treetostorepointsthatarebeingclusteredInsertpointsoneata
timeintotheR-tree,merginganewpointwithanexistingclusterifislessthansomedistanceawayIftherearemoreleafnodesthanfitinmemory,mergeexisti
ngclustersthatareclosetoeachotherAttheendoffirstpasswegetalargenumberofclustersattheleavesoftheR-treeMergeclusterstoreducethenumb
erofclustersCopyright:Silberschatz,KorthandSudarshan50CollaborativeFilteringGoal:predictwhatmovies/books/…ap
ersonmaybeinterestedin,onthebasisofPastpreferencesofthepersonOtherpeoplewithsimilarpastpreferencesThepre
ferencesofsuchpeopleforanewmovie/book/…OneapproachbasedonrepeatedclusteringClusterpeopleonthebasisofpreferencesformovi
esThenclustermoviesonthebasisofbeinglikedbythesameclustersofpeopleAgainclusterpeoplebasedontheirpreferencesfor(the
newlycreatedclustersof)moviesRepeatabovetillequilibriumAboveproblemisaninstanceofcollaborativefiltering,whereuserscollaborateinthetaskoffilteringi
nformationtofindinformationofinterestCopyright:Silberschatz,KorthandSudarshan51OtherTypesofMiningTextmining:applicationofdataminingtotextualdo
cumentsE.g.clusterWebpagestofindrelatedpagesE.g.clusterpagesauserhasvisitedtoorganizetheirvisithis
toryE.g.classifyWebpagesautomaticallyintoaWebdirectoryDatavisualizationsystemshelpusersexaminelargevolumesofdataanddetectpatternsvisuall
yE.g.maps,charts,andcolor-codingE.g.locationswithproblemsshowninredonamapCanvisuallyencodelargeamountsofinformationonasinglescreenHumansarev
erygoodadetectingvisualpatternsCopyright:Silberschatz,KorthandSudarshan52DataWarehousingCopyright:Silberschatz,KorthandSudarshan53DataWarehousingLar
georganizationshavecomplexinternalorganizations,andhavedatastoredatdifferentlocations,ondifferentoperational(transact
ionprocessing)systems,underdifferentschemasDatasourcesoftenstoreonlycurrentdata,nothistoricaldataCorporatedecisionmakingrequiresaunifiedv
iewofallorganizationaldata,includinghistoricaldataAdatawarehouseisarepository(archive)ofinformationgatheredfrommultipleso
urces,storedunderaunifiedschema,atasinglesiteGreatlysimplifiesquerying,permitsstudyofhistoricaltrendsShiftsdecisionsupportqueryloadawayfromtransact
ionprocessingsystemsCopyright:Silberschatz,KorthandSudarshan54DataWarehousingCopyright:Silberschatz,Kortha
ndSudarshan55ComponentsofDataWarehouseWhenandhowtogatherdataSourcedrivenarchitecture:datasourcestransmitnewinformationtowarehouse,eithercontinuou
slyorperiodically(e.g.atnight)Destinationdrivenarchitecture:warehouseperiodicallyrequestsnewinformati
onfromdatasourcesKeepingwarehouseexactlysynchronizedwithdatasources(e.g.usingtwo-phasecommit)istooexpensiveUsuallyOKtohaveslightlyout-of-date
dataatwarehouseData/updatesareperiodicallydownloadedformonlinetransactionprocessing(OLTP)systems.WhatschematouseS
chemaintegrationCopyright:Silberschatz,KorthandSudarshan56ComponentsofDataWarehouse(Cont.)DatacleansingE.g.correctmistakesinaddressesE.g.mi
sspellings,zipcodeerrorsMergeaddresslistsfromdifferentsourcesandpurgeduplicatesKeeponlyoneaddressrecordperhousehold(“householdi
ng”)HowtopropagateupdatesWarehouseschemamaybea(materialized)viewofschemafromdatasourcesEfficienttechniquesforupda
teofmaterializedviewsWhatdatatosummarizeRawdatamaybetoolargetostoreon-lineAggregatevalues(totals/subtotals)oftensufficeQu
eriesonrawdatacanoftenbetransformedbyqueryoptimizertouseaggregatevaluesCopyright:Silberschatz,KorthandSuda
rshan57DataWarehouseSchemasCopyright:Silberschatz,KorthandSudarshan58WarehouseSchemasTypicallywarehousedataismultidimensional,withverylargefacttab
lesExamplesofdimensions:item-id,date/timeofsale,storewheresalewasmade,customeridentifierExamplesofmeasures:numberofitemssold,p
riceofitemsDimensionvaluesareusuallyencodedusingsmallintegersandmappedtofullvaluesviadimensiontablesResultantschemaiscalledastars
chemaMorecomplicatedschemastructuresSnowflakeschema:multiplelevelsofdimensiontablesConstellation:multiplefacttablesCopyri
ght:Silberschatz,KorthandSudarshan59InformationRetrievalCopyright:Silberschatz,KorthandSudarshan60InformationRetrievalSystemsInformatio
nretrieval(IR)systemsuseasimplerdatamodelthandatabasesystemsInformationorganizedasacollectionofdocumentsDocumentsareunstructured,noschema
Informationretrievallocatesrelevantdocuments,onthebasisofuserinputsuchaskeywordsorexampledocumentse.g.,finddocu
mentscontainingthewords“databasesystems”Canbeusedevenontextualdescriptionsprovidedwithnon-textualdatasuch
asimagesIRonWebdocumentshasbecomeextremelyimportantE.g.google,altavista,…Copyright:Silberschatz,KorthandSudarshan61Informa
tionRetrievalSystems(Cont.)DifferencesfromdatabasesystemsIRsystemsdon‟tdealwithtransactionalupdates(includingconcurrencycontrola
ndrecovery)Databasesystemsdealwithstructureddata,withschemasthatdefinethedataorganizationIRsystemsdealwithsomequeryingissuesnotgenerallyaddres
sedbydatabasesystemsApproximatesearchingbykeywordsRankingofretrievedanswersbyestimateddegreeofrelevanceCopyright:Silber
schatz,KorthandSudarshan62KeywordSearchInfulltextretrieval,allthewordsineachdocumentareconsideredtobekeywo
rds.WeusethewordtermtorefertothewordsinadocumentInformation-retrievalsystemstypicallyallowqueryexpressi
onsformedusingkeywordsandthelogicalconnectivesand,or,andnotAndsareimplicit,evenifnotexplicitlyspecified
RankingofdocumentsonthebasisofestimatedrelevancetoaqueryiscriticalRelevancerankingisbasedonfactorssuchasTermfrequencyFreq
uencyofoccurrenceofquerykeywordindocumentInversedocumentfrequencyHowmanydocumentsthequerykeywordoccursinFewergivemoreimportancetokeywordHyperl
inkstodocumentsMorelinkstoadocumentdocumentismoreimportantCopyright:Silberschatz,KorthandSudarshan63RelevanceR
ankingUsingTermsTF-IDF(Termfrequency/InverseDocumentfrequency)ranking:Letn(d)=numberoftermsinthedocumentdn(d,
t)=numberofoccurrencesoftermtinthedocumentd.ThenrelevanceofadocumentdtoatermtThelogfactoristoavoidexcessiveweightagetofrequenttermsAndrelevance
ofdocumenttoqueryQn(d)n(d,t)1+r(d,t)=logr(d,Q)=r(d,t)n(t)tQCopyright:Silberschatz,KorthandSudarshan64Rel
evanceRankingUsingTerms(Cont.)MostsystemsaddtotheabovemodelWordsthatoccurintitle,authorlist,sectionheadings,etc.aregivengreaterimport
anceWordswhosefirstoccurrenceislateinthedocumentaregivenlowerimportanceVerycommonwordssuchas“a”,“an”,“the”,“it”etcareeliminatedCalled
stopwordsProximity:ifkeywordsinqueryoccurclosetogetherinthedocument,thedocumenthashigherimportancethan
iftheyoccurfarapartDocumentsarereturnedindecreasingorderofrelevancescoreUsuallyonlytopfewdocumentsarereturned,notallCo
pyright:Silberschatz,KorthandSudarshan65RelevanceUsingHyperlinksWhenusingkeywordqueriesontheWeb,thenumberofdocumentsisenormous(manybillions)Number
ofdocumentsrelevanttoaquerycanbeenormousifonlytermfrequenciesaretakenintoaccountUsingtermfrequenciesmakes“spamming”easyE.g.atravelagentcana
ddmanyoccurrencesofthewords“travelagent”tohispagetomakeitsrankveryhighMostofthetimepeoplearelookingforpagesfrompopularsitesIdea:usepopularityofWe
bsite(e.g.howmanypeoplevisitit)toranksitepagesthatmatchgivenkeywordsProblem:hardtofindactualpopularityofsiteSolution:nextslideCopyri
ght:Silberschatz,KorthandSudarshan66RelevanceUsingHyperlinks(Cont.)Solution:usenumberofhyperlinkstoasiteas
ameasureofthepopularityorprestigeofthesiteCountonlyonehyperlinkfromeachsite(why?)Popularitymeasureisforsite,notforindividualpageMosthyperl
inksaretorootofsiteSite-popularitycomputationischeaperthanpagepopularitycomputationRefinementsWhencomputingprestigebas
edonlinkstoasite,givemoreweightagetolinksfromsitesthatthemselveshavehigherprestigeDefinitioniscircular
SetupandsolvesystemofsimultaneouslinearequationsAboveideaisbasisoftheGooglePageRankrankingmechanismCopyright:Silber
schatz,KorthandSudarshan67RelevanceUsingHyperlinks(Cont.)ConnectionstosocialnetworkingtheoriesthatrankedprestigeofpeopleE.g.thepresiden
toftheUShasahighprestigesincemanypeopleknowhimSomeoneknownbymultipleprestigiouspeoplehashighprestigeHubandauthoritybasedrankingAhubisa
pagethatstoreslinkstomanypages(onatopic)AnauthorityisapagethatcontainsactualinformationonatopicEachpagegetsahubprestigebasedon
prestigeofauthoritiesthatitpointstoEachpagegetsanauthorityprestigebasedonprestigeofhubsthatpointtoitAgain,prestigedefinitionsarecyclic,and
canbegotbysolvinglinearequationsUseauthorityprestigewhenrankinganswerstoaqueryCopyright:Silberschatz,KorthandSudarsha
n68SimilarityBasedRetrievalSimilaritybasedretrieval-retrievedocumentssimilartoagivendocumentSimilaritymaybedefinedontheb
asisofcommonwordsE.g.findktermsinAwithhighestr(d,t)andusethesetermstofindrelevanceofotherdocuments;eachofthetermscar
riesaweightofr(d,t)SimilaritycanbeusedtorefineanswersettokeywordqueryUserselectsafewrelevantdocumentsfromthoseretrievedb
ykeywordquery,andsystemfindsotherdocumentssimilartotheseCopyright:Silberschatz,KorthandSudarshan69Sy
nonymsandHomonymsSynonymsE.g.document:“motorcyclerepair”,query:“motorcyclemaintenance”needtorealizethat“maintenan
ce”and“repair”aresynonymsSystemcanextendqueryas“motorcycleand(repairormaintenance)”HomonymsE.g.“obj
ect”hasdifferentmeaningsasnoun/verbCandisambiguatemeanings(tosomeextent)fromthecontextExtendingqueriesa
utomaticallyusingsynonymscanbeproblematicNeedtounderstandintendedmeaninginordertoinfersynonymsOrverifysy
nonymswithuserSynonymsmayhaveothermeaningsaswellCopyright:Silberschatz,KorthandSudarshan70IndexingofDocumentsAninvertedindexmapseachkeywordKitoa
setofdocumentsSithatcontainthekeywordDocumentsidentifiedbyidentifiersInvertedindexmayrecordKeywordlocationswithindocumenttoallowp
roximitybasedrankingCountsofnumberofoccurrencesofkeywordtocomputeTFandoperation:FindsdocumentsthatcontainallofK1,K2,...,Kn.I
ntersectionS1S2.....Snoroperation:documentsthatcontainatleastoneofK1,K2,…,Knunion,S1S2.....Sn,.EachSiiskeptsortedtoallowefficien
tintersection/unionbymerging“not”canalsobeefficientlyimplementedbymergingofsortedlistsCopyright:Silberschatz,KorthandSudarshan71Mea
suringRetrievalEffectivenessIRsystemssavespacebyusingindexstructuresthatsupportonlyapproximateretrieval.Mayresultin:falsenegative(falsed
rop)-somerelevantdocumentsmaynotberetrieved.falsepositive-someirrelevantdocumentsmayberetrieved.Formanyapplicationsagoodindexshouldnotpe
rmitanyfalsedrops,butmaypermitafewfalsepositives.Relevantperformancemetrics:Precision-whatpercentageoftheretrieveddocumentsarereleva
nttothequery.Recall-whatpercentageofthedocumentsrelevanttothequerywereretrieved.Copyright:Silbersch
atz,KorthandSudarshan72MeasuringRetrievalEffectiveness(Cont.)Rankingordercanalsoresultinfalsepositives/falsenegativ
eRecallvs.precisiontradeoff:Canincreaserecallbyretrievingmanydocuments(downtoalowlevelrelevanceranking),butmanyirrelevantdocumentswouldb
efetched,reducingprecisionMeasuresofretrievaleffectiveness:Recallasafunctionofnumberofdocumentsfetched,orPrecisiona
safunctionofrecallEquivalently,asafunctionofnumberofdocumentsfetchedE.g.“precisionof75%atrecallof50%,and60%atarecallof75%”Ingeneral:dra
wagraphofprecisionvsrecall.Problem:whichdocumentsareactuallyrelevant,andwhichanotUsualsolution:humanjudgesCreateacorp
usofdocumentsandqueries,withhumansdecidingwhichdocumentsarerelevanttowhichqueriesTREC(TextREtrievalConferen
ce)BenchmarkCopyright:Silberschatz,KorthandSudarshan73WebCrawlingWebcrawlersareprogramsthatlocateandgathe
rinformationontheWebRecursivelyfollowhyperlinkspresentinknowndocuments,tofindotherdocumentsStartingfromaseedsetofdocumentsFetcheddocumentsHa
ndedovertoanindexingsystemCanbediscardedafterindexing,orstoreasacachedcopyCrawlingtheentireWebwouldtakeaverylargea
mountoftimeSearchenginestypicallycoveronlyapartoftheWeb,notallofitTakemonthstoperformasinglecrawlCopy
right:Silberschatz,KorthandSudarshan74WebCrawling(Cont.)Crawlingisdonebymultipleprocessesonmultiplem
achines,runninginparallelSetoflinkstobecrawledstoredinadatabaseNewlinksfoundincrawledpagesaddedtothisset,tobecrawle
dlaterIndexingprocessalsorunsonmultiplemachinesCreatesanewcopyofindexinsteadofmodifyingoldindexOldindexisusedtoanswerqueriesAft
eracrawlis“completed”newindexbecomes“old”indexMultiplemachinesusedtoanswerqueriesIndicesmaybekeptinmemoryQueriesmayberoutedtodifferent
machinesforloadbalancingCopyright:Silberschatz,KorthandSudarshan75BrowsingStoringrelateddocumentstoget
herinalibraryfacilitatesbrowsinguserscanseenotonlyrequesteddocumentbutalsorelatedones.Browsingisfacilitatedb
yclassificationsystemthatorganizeslogicallyrelateddocumentstogether.Organizationishierarchical:classificationhierarchyCopyright:Silberschatz,Kort
handSudarshan76AClassificationHierarchyForALibrarySystemCopyright:Silberschatz,KorthandSudarshan77ClassificationDAG
Documentscanresideinmultipleplacesinahierarchyinaninformationretrievalsystem,sincephysicallocationisnotimportant.Classificationhi
erarchyisthusDirectedAcyclicGraph(DAG)Copyright:Silberschatz,KorthandSudarshan78AClassificationDAGForALibraryInformationRetrievalSy
stemCopyright:Silberschatz,KorthandSudarshan79WebDirectoriesAWebdirectoryisjustaclassificationdirectoryo
nWebpagesE.g.Yahoo!Directory,OpenDirectoryprojectIssues:Whatshouldthedirectoryhierarchybe?Givenadocument,whic
hnodesofthedirectoryarecategoriesrelevanttothedocumentOftendonemanuallyClassificationofdocumentsintoahierarchymaybedonebasedonterms
imilarity