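Machine Learning and Data Mining
  Takashi Okada
  Information and Media Education Center, Kwansei Gakuin University
  TUT'2000/06/07

Introduction
  What is data mining?
  Knowledge discovery from databases
  Intelligent data analysis
  Diapers and canned beer
  Large-scale databases
  Database marketing
  Business intelligence
  Diversification of targets
  Text on the Web
  Multimedia

Knowledge discovery process
  Core DB, external DB -> extraction, transformation, integration -> data warehouse -> data mining -> patterns -> evaluation, visualization -> knowledge
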
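Mining techniques
  Statistics
  Pattern recognition
  Clustering
  Neural networks
  Decision trees
  Rough sets
  Association rules
  Graph Based Induction
  Inductive logic programming
  Variable selection
  Data visualization

What is supervised learning?
  Input instances contain class attributes and explanatory attributes.
  Generate rules that describe the classes inductively:
    IF conditions THEN class
  Learning from examples; incorporation of background knowledge
  cf. regression, discriminant analysis, neural networks, nearest neighbor

Typical applications
  Knowledge acquisition for use in a plant-operation expert system
  Predicting the actions of opposing teams in sports matches
  Diagnosis from medical tests
  Discovery of active motifs in chemical compounds from structure-activity relationship datasets

Classification of problems
  Type            Output                              Understanding  Examples
  Classification  definite answers to all questions   unnecessary    plant operation, character recognition
  Guess           probable answers to some questions  unnecessary    sports action prediction, stock price prediction
  Understanding   probabilities for all questions     necessary      medical diagnosis, grammar acquisition

Streams in learning research I. Classification

Decision tree method: source data
  Eye color  Height  Hair color  Class (target variable)
  blue       short   black       -
  blue       tall    black       -
  brown      tall    black       -
  brown      tall    blond       -
  brown      short   blond       -
  blue       short   blond       +
  blue       tall    blond       +
  blue       tall    red         +

Decision tree (leaves list height, hair color, eye color: class)
  Hair color
    black: (short, black, blue: -) (tall, black, blue: -) (tall, black, brown: -)
    red:   (tall, red, blue: +)
    blond: Eye color
      blue:  (short, blond, blue: +) (tall, blond, blue: +)
      brown: (tall, blond, brown: -) (short, blond, brown: -)

Variable selection by average information (entropy)
  Average information before splitting:
  $$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
  For the source data (p = 3, n = 5):
  $$I(p, n) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} = 0.954\ \text{bit}$$

Gain in average information from splitting
  Split by height:      0.003 bit
  Split by hair color:  0.454 bit
  Split by eye color:   0.347 bit
  Worked example for height:
  $$\text{tall:}\quad -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971\ \text{bit}$$
  $$\text{short:}\quad -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} = 0.918\ \text{bit}$$
  $$\text{average:}\quad \frac{5}{8}\times 0.971 + \frac{3}{8}\times 0.918 = 0.951\ \text{bit}$$
  $$\text{gain:}\quad 0.954 - 0.951 = 0.003\ \text{bit}$$

Diabetes diagnosis tree built from association rules among numeric attributes
  [Figure]

Progress in decision trees
  Variables with continuous values
  Entropy gain ratio, Gini index
  Sampling
  Pruning
  Bagging, boosting
  User interface
    Interactive expansion of a tree
    Visualization
  Rules

Gini index vs. entropy
  $$\text{Gini index} = \sum_i P_i (1 - P_i) = 1 - \sum_i P_i^2$$
  [Figure: entropy, Gini index and variance curves; horizontal axis 0 to 1, vertical axis 0 to 0.48]
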
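The gain figures on the decision-tree slides can be re-derived from the eight-row source table. The following is a minimal sketch, not the lecturer's code: the names `ROWS`, `COLUMNS`, `entropy`, `gini` and `information_gain` are chosen here for illustration, and the attribute values are the English translations used above.

```python
from collections import Counter, defaultdict
from math import log2

# Source data from the decision-tree slide: (eye color, height, hair color, class).
ROWS = [
    ("blue", "short", "black", "-"),
    ("blue", "tall", "black", "-"),
    ("brown", "tall", "black", "-"),
    ("brown", "tall", "blond", "-"),
    ("brown", "short", "blond", "-"),
    ("blue", "short", "blond", "+"),
    ("blue", "tall", "blond", "+"),
    ("blue", "tall", "red", "+"),
]
COLUMNS = {"eye": 0, "height": 1, "hair": 2}


def entropy(labels):
    """Average information I(p, n) of a list of class labels, in bits."""
    total = len(labels)
    return sum(-c / total * log2(c / total) for c in Counter(labels).values())


def gini(labels):
    """Gini index 1 - sum(P_i^2) of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def information_gain(rows, column):
    """Entropy before the split minus the weighted entropy after it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - after


if __name__ == "__main__":
    for name, column in COLUMNS.items():
        print(f"{name}: {information_gain(ROWS, column):.3f} bit")
```

Hair color gives the largest gain, which is why it sits at the root of the tree. The computed values agree with the slide to within rounding; the eye-color gain evaluates to roughly 0.3476, which the slide reports as 0.347.
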
Decision tree method: references
  Akiba, Almuallim, Kaneda: Toward applications of learning-from-examples techniques (in Japanese), Journal of the Information Processing Society of Japan, Vol. 39, No. 2, pp. 145-151; No. 3, pp. 245-251 (1998).
  Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J.: Classification and Regression Trees, The Wadsworth & Brooks/Cole (1984). [CART]
  Quinlan, J. R.: C4.5: Programs for Machine Learning, Morgan Kaufmann (1993). Japanese translation by Furukawa: AIによるデータ解析, Toppan (1995).

Streams in learning research III. Rough set
  Characteristics
    Non-exploratory
    Methodology for decision tables
    Analysis of variable dependencies
    NP-hard in the number of attributes and values
  References
    Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers (1991).
    Ziarko, W.: Review of Basics of Rough Sets in the Context of Data Mining, Proc. Fourth International Workshop on Rough Sets, Fuzzy Sets, and Machine Discovery, pp. 447-457, Tokyo (1996).
    Datalogic/R: Reduct Systems Inc.

What is a rough set?
  [Figure: positive region, boundary region, negative region]

Computation step 1: discretization and classification
  Raw data: objects 1 to 9558 with continuous values of S, H, E, C and T (for example S = 12.0, 18.2, 19.0, …; H = 132.2, 148.0, 175.8, …; C = 27.0, 32.3, 11.2, …), discretized into the classes below.
  Class  S  H  E  C  T
  1      0  0  1  0  0
  2      1  0  2  1  1
  3      1  1  1  0  0
  4      0  2  1  1  1
  5      1  2  1  0  1
  6      1  0  1  0  0
  7      1  2  2  1  1
  8      0  0  2  1  1
  Reduct1 = {Size, Height, Energy}
  Reduct2 = {Size, Height, Current}
  Core = {Size, Height}

Explanatory variables P and target variable Q
  P = {Size, Height, Energy, Current}
  Q = {Temperature}
  Reduct1(P, Q) = {Height, Energy}
  Reduct2(P, Q) = {Height, Current}
  Core(P, Q) = {Height}
  Height  Energy  Temperature
  0       1       0
  0       2       1
  1       1       0
  2       1       1
  2       2       1

Computation step 2: deriving rules with a decision matrix
  Rows i are the classes with T = 1, columns j the classes with T = 0. Each entry lists the (attribute, value) pairs of the row class that differ from the column class, e.g. e1 = (S, H, E, C, T) = (0, 0, 1, 0, 0) against e2 = (1, 0, 2, 1, 1).
  i  OBJ  j=1 (e1)              j=2 (e3)              j=3 (e6)
  1  e2   (S,1)(E,2)(C,1)       (H,0)(E,2)(C,1)       (E,2)(C,1)
  2  e4   (H,2)(C,1)            (S,0)(H,2)(C,1)       (S,0)(H,2)(C,1)
  3  e5   (S,1)(H,2)            (H,2)                 (H,2)
  4  e7   (S,1)(H,2)(E,2)(C,1)  (H,2)(E,2)(C,1)       (H,2)(E,2)(C,1)
  5  e8   (E,2)(C,1)            (S,0)(H,0)(E,2)(C,1)  (S,0)(E,2)(C,1)
  B11 = ((S,1)∨(E,2)∨(C,1)) ∧ ((H,0)∨(E,2)∨(C,1)) ∧ ((E,2)∨(C,1)) = (E,2)∨(C,1)
  B12 = ((H,2)∨(C,1)) ∧ ((S,0)∨(H,2)∨(C,1)) ∧ ((S,0)∨(H,2)∨(C,1)) = (H,2)∨(C,1)
  B13 = ((S,1)∨(H,2)) ∧ ((H,2)) ∧ ((H,2)) = (H,2)
  B14 = ((S,1)∨(H,2)∨(E,2)∨(C,1)) ∧ ((H,2)∨(E,2)∨(C,1)) ∧ ((H,2)∨(E,2)∨(C,1)) = (H,2)∨(E,2)∨(C,1)
  B15 = ((E,2)∨(C,1)) ∧ ((S,0)∨(H,0)∨(E,2)∨(C,1)) ∧ ((S,0)∨(E,2)∨(C,1)) = (E,2)∨(C,1)
  Rules
    (Energy = 2) → (Temperature = 1)
    (Current = 1) → (Temperature = 1)
    (Height = 2) → (Temperature = 1)

Variable Precision Rough Set Model
  [Figure: positive region, boundary region, negative region]

Variable dependency analysis
  Necessary and sufficient variable sets
  [Figure: Reduct1 to Reduct5 overlapping in the Core]

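The relative reducts and core quoted for the eight discretized classes can be checked by brute force. The sketch below is an illustration under simple assumptions, not the method used by Datalogic/R: it treats a subset of P as sufficient when equal attribute values always imply the same Temperature, and keeps only the minimal such subsets. All function names are hypothetical.

```python
from itertools import combinations

ATTRIBUTES = ("Size", "Height", "Energy", "Current")

# Discretized classes from the slides: (S, H, E, C) values and T (Temperature).
CLASSES = {
    "e1": ((0, 0, 1, 0), 0),
    "e2": ((1, 0, 2, 1), 1),
    "e3": ((1, 1, 1, 0), 0),
    "e4": ((0, 2, 1, 1), 1),
    "e5": ((1, 2, 1, 0), 1),
    "e6": ((1, 0, 1, 0), 0),
    "e7": ((1, 2, 2, 1), 1),
    "e8": ((0, 0, 2, 1), 1),
}


def determines_decision(subset):
    """True if equal values on `subset` always imply equal Temperature."""
    seen = {}
    for values, decision in CLASSES.values():
        key = tuple(values[ATTRIBUTES.index(a)] for a in subset)
        if seen.setdefault(key, decision) != decision:
            return False
    return True


def relative_reducts():
    """Minimal attribute subsets that still determine the decision."""
    reducts = []
    # Subsets are visited by increasing size, so any subset that contains
    # an earlier reduct cannot be minimal and is skipped.
    for size in range(1, len(ATTRIBUTES) + 1):
        for subset in combinations(ATTRIBUTES, size):
            minimal = not any(set(r) <= set(subset) for r in reducts)
            if minimal and determines_decision(subset):
                reducts.append(subset)
    return reducts


def core(reducts):
    """Attributes shared by every reduct."""
    return set.intersection(*(set(r) for r in reducts))


if __name__ == "__main__":
    found = relative_reducts()
    print(found, core(found))
```

Enumerating all subsets is exponential in the number of attributes, which is acceptable for four attributes but illustrates why the slides flag the cost in attributes and values.
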
Cars example
  Reducts
    (1) cyl, fuelsys, comp, power, weight
    (2) size, fuelsys, comp, power, weight
    (3) size, fuelsys, displace, weight
    (4) size, cyl, fuelsys, power, weight
    (5) cyl, turbo, fuelsys, displace, comp, trans, weight
    (6) size, cyl, fuelsys, comp, weight
    (7) size, cyl, turbo, fuelsys, trans, weight
  Core: {fuelsys, weight}
  No  Size        Cyl  Turbo  Fuelsys  Displace  Comp    Power   Trans   Weight  Mileage
  1   compact     6    yes    EFI      medium    high    high    auto    medium  medium
  2   compact     6    no     EFI      medium    medium  high    manual  medium  medium
  3   compact     4    no     EFI      medium    high    high    manual  medium  medium
  4   compact     6    yes    EFI      medium    high    high    manual  light   high
  5   compact     6    no     EFI      medium    medium  medium  manual  medium  medium
  6   compact     6    no     2-BBL    medium    medium  medium  auto    heavy   low
  7   compact     6    no     EFI      medium    medium  high    manual  heavy   low
  8   subcompact  4    no     2-BBL    small     high    low     manual  light   high
  9   compact     4    no     2-BBL    small     high    low     manual  medium  medium
  10  compact     4    no     2-BBL    small     high    medium  auto    medium  medium
  11  subcompact  4    no     EFI      small     high    low     manual  light   high
  12  subcompact  4    no     EFI      medium    medium  medium  manual  medium  high
  13  compact     4    no     2-BBL    medium    medium  medium  manual  medium  medium
  14  subcompact  4    yes    EFI      small     high    high    manual  medium  high
  15  subcompact  4    no     2-BBL    small     medium  low     manual  medium  high
  16  compact     4    yes    EFI      medium    medium  high    manual  medium  medium
  17  compact     6    no     EFI      medium    medium  high    auto    medium  medium
  18  compact     4    no     EFI      medium    medium  high    auto    medium  medium
  19  subcompact  4    no     EFI      small     high    medium  manual  medium  high
  20  compact     4    no     EFI      small     high    medium  manual  medium  high
  21  compact     4    no     2-BBL    small     high    medium  manual  medium  medium
  Ziarko: The discovery, analysis, and representation of data dependencies in databases, Knowledge Discovery in Databases, pp. 195-209, Piatetsky-Shapiro & Frawley ed., AAAI Press (1991).

Reduct & Core: effects on the sum of squares
  [Figure: bar chart labeled Net-power; horizontal axis: Variables (Size, cyl, turbo, fuelsys, displace, comp, power, trans, weight); vertical axis 0 to 12]

Rough set method as a tool of data analysis
  Very good rules for understanding
  However
    Too many reducts
    The number of reducts changes with the confidence value in VPRSM
    Frequencies are disregarded

Rough set
  Pawlak, Z.: Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers (1991).
  Ziarko, W.: Review of Basics of Rough Sets in the Context of Data Mining, Proc. Fourth International Workshop on Rough Sets, Fuzzy Sets, and Machine Discovery, pp. 447-457, Tokyo (1996).
  Datalogic/R: Reduct Systems Inc.
  Characteristics of the methodology
    A methodology for discrete representations
    Knowledge can be acquired from co-occurrence distributions
    Computational cost: N in the number of cases, exp(N) in the number of attributes and attribute values

Streams in learning research II. Characteristic rules
  Evaluation by usefulness
  Patterns with accuracy & support
  Statistical estimation of generality and accuracy
    Suzuki (1999): A method for simultaneously evaluating the reliability of generality and accuracy for discovering characteristic rules from databases (in Japanese), Journal of the Japanese Society for Artificial Intelligence, 14, 139-147.
  Exceptions as interestingness
    Suzuki, Shimura (1997): Discovery of exceptional knowledge from databases using information-theoretic methods (in Japanese), Journal of the Japanese Society for Artificial Intelligence, 12, 305-312.
  Rating usefulness by human estimation
  Rule generation by genetic algorithm
    Terano, T. and Ishino, Y. (1996): Interactive knowledge discovery from marketing questionnaire using simulated breeding and inductive learning methods, Proc. KDD-96, 279-282.
  Market basket analysis

Association rules mining
  Source data
    No.  Purchased items
    101  juice, cola, beer
    102  juice, cola, wine
    103  juice, gin
    104  cola, beer
    105  juice, cola, beer, wine
    106  gin
    107  juice, cola, wine, gin
    108  wine
  Association rules
    Support  Confidence  Rule
    37.5%    100.0%      juice, wine -> cola
    37.5%    100.0%      beer -> cola
    50.0%    80.0%       juice -> cola
    50.0%    80.0%       cola -> juice
    37.5%    75.0%       wine -> cola
    37.5%    75.0%       juice, cola -> wine
    37.5%    75.0%       wine -> juice, cola
    37.5%    75.0%       wine -> juice

Apriori algorithm
  1-itemsets     Support
    {juice}      62.5%
    {cola}       62.5%
    {beer}       37.5%
    {wine}       50.0%
    {gin}        37.5%
  2-itemsets         Support
    {juice, cola}    50.0%
    {juice, beer}    25.0%
    {juice, wine}    37.5%
    {juice, gin}     25.0%
    {cola, beer}     37.5%
    {cola, wine}     37.5%
    {cola, gin}      12.5%
    {beer, wine}     12.5%
    {beer, gin}      0.0%
    {wine, gin}      12.5%
  3-itemsets               Support
    {juice, cola, wine}    37.5%
    {juice, cola, beer}    excluded from the candidates
    {cola, beer, wine}     excluded from the candidates
  Legend: large itemsets, small itemsets, candidate itemsets

Analysis of time-series data (1)
  Name      Tid  Items
  Yamada    105  beer
  Yamada    210  brandy
  Hirosawa  010  juice, cola
  Hirosawa  012  beer
  Hirosawa  109  wine, water, cider
  Mita      103  beer, gin, cider
  Yoshino   002  beer
  Yoshino   106  wine, cider
  Yoshino   205  brandy
  Haneda    011  brandy

Analysis of time-series data (2)
  Name      (Tid items) …
  Yamada    (105 beer) (210 brandy)
  Hirosawa  (010 juice, cola) (012 beer) (109 wine, water, cider)
  Mita      (103 beer, gin, cider)
  Yoshino   (002 beer) (106 wine, cider) (205 brandy)
  Haneda    (011 brandy)
  Support  Pattern
  40.0%    (beer) -> (brandy)
  40.0%    (beer) -> (wine, cider)

Introducing a classification hierarchy
  Beverages
    Soft drinks: cola, juice
    Weak liquors: wine, beer
    Strong liquors: gin, brandy

Handling numeric attributes
  Discretization
  Merge adjacent ranges until max-support is exceeded
  Multiple ranges
  Frequent itemset computation
  Rule derivation
  Pruning by rule interest
  Partial completeness concept
    Interval setting, ensuring soundness
  Srikant, R. & Agrawal, R.: Mining Quantitative Association Rules in Large Relational Tables, Proc. ACM SIGMOD, pp. 1-12 (1996).
  People
    ID   Age  Married  #Cars
    100  23   No       1
    200  25   Yes      1
    300  29   No       0
    400  34   Yes      2
    500  38   Yes      2
  Intervals: 20..24, 25..29, 30..34, 35..39
  Frequent itemsets (part)
    Itemset                           Support
    {<Age:20..29>}                    3
    {<Age:30..39>}                    2
    {<Married:Yes>}                   3
    {<Married:No>}                    2
    {<#Cars:0..1>}                    3
    {<Age:30..39>, <Married:Yes>}     2
  Rules
    Rule                                             Support  Confidence
    <Age:30..39> and <Married:Yes> -> <#Cars:2>      40%      100%
    <Age:20..29> -> <#Cars:0..1>                     60%      66.6%

Factor analysis by introducing virtual transactions
  Numao, Shimizu: Data mining in the retail industry (in Japanese), Journal of the Japanese Society for Artificial Intelligence, Vol. 12, No. 4, pp. 528-535 (1997).
  Customer#  Age  Sex     Occupation
  1          25   male    student
  2          32   female  office worker
  …
  Customer#  Date      Purchased item
  1          00/00/00  S-male
  1          00/00/00  A-20s
  1          97/01/30  CD-X
  1          97/02/05  CD-Y
  1          …
  2          00/00/00  S-female
  2          00/00/00  A-30s
  2          97/01/15  Video-A
  2          97/03/03  Video-B
  …

Pattern recognition by symbolizing time series
  A time-series data mining case study
  Supervised inductive learning
  [Figure: pressure and temperature traces, anomaly occurrence, maximum delay time, conversion into a database]
  Time  Variable pattern
  T1    temperature rise
  T2    pressure rise
  T3    temperature fall
  T4    pressure fall
  T5    temperature rise
  T6    pressure rise
  T7    temperature rise
  T8    temperature fall
  T9    pressure rise
  …
  IF pressure rise AND temperature fall THEN anomaly occurrence: probability 80%
  Sato: Rule induction technology for data mining and its applications (in Japanese), IPSJ Kansai Branch, 1st Software Workshop of fiscal year Heisei 9.

Extension to graph structures: analysis of WWW access histories
  [Figure: access graphs over pages A to N]
  Inokuchi et al.: JSAI SIG on Fundamental Problems, SIG-FAI-9801-10, pp. 55-60 (1998).

Search for association rules
  [Figure]

Graph Based Induction: stepwise pair expansion algorithm
  Input
  Step 1: rewrite the input
  Step 2: count the pairs in the input
  Step 3: select a pair
  Extracted patterns: A, B, C
  [Figure: numbered input graph and its successive rewritings with the chunked pairs A, B and C]
  Yoshida, Motoda: Inductive inference based on stepwise pair expansion (in Japanese), Journal of the Japanese Society for Artificial Intelligence, Vol. 12, pp. 58-67 (1997).

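Returning to the Apriori slide, the level-wise construction of large itemsets for the eight beverage transactions can be sketched as follows. This is an illustrative sketch: the thresholds `MIN_SUPPORT = 0.375` and `MIN_CONFIDENCE = 0.75` are assumptions read off the smallest values shown on the slides, and the function names are chosen here, not taken from the lecture.

```python
from itertools import combinations

# Transactions from the association-rule slide (item names translated).
TRANSACTIONS = {
    101: {"juice", "cola", "beer"},
    102: {"juice", "cola", "wine"},
    103: {"juice", "gin"},
    104: {"cola", "beer"},
    105: {"juice", "cola", "beer", "wine"},
    106: {"gin"},
    107: {"juice", "cola", "wine", "gin"},
    108: {"wine"},
}
MIN_SUPPORT = 0.375     # assumed threshold (3 of 8 transactions)
MIN_CONFIDENCE = 0.75   # assumed threshold


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in TRANSACTIONS.values() if itemset <= t)
    return hits / len(TRANSACTIONS)


def apriori():
    """Return {large itemset: support}, built level by level."""
    items = {item for t in TRANSACTIONS.values() for item in t}
    level = [frozenset([i]) for i in items]
    large = {}
    size = 1
    while level:
        kept = [s for s in level if support(s) >= MIN_SUPPORT]
        large.update((s, support(s)) for s in kept)
        size += 1
        # Join step: unions of two large itemsets with exactly `size` items.
        candidates = {a | b for a in kept for b in kept if len(a | b) == size}
        # Prune step: every (size - 1)-subset must itself be large.
        level = [c for c in candidates
                 if all(frozenset(s) in large
                        for s in combinations(c, size - 1))]
    return large


def rules(large):
    """Return {(antecedent, consequent): confidence} above MIN_CONFIDENCE."""
    found = {}
    for itemset, sup in large.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                confidence = sup / large[lhs]
                if confidence >= MIN_CONFIDENCE:
                    found[(lhs, itemset - lhs)] = confidence
    return found


if __name__ == "__main__":
    for (lhs, rhs), conf in sorted(rules(apriori()).items(), key=lambda kv: -kv[1]):
        print(sorted(lhs), "->", sorted(rhs), f"{conf:.1%}")
```

The prune step is what removes {juice, cola, beer} and {cola, beer, wine} before their support is ever counted, matching the "excluded from the candidates" entries. Under the assumed thresholds the sketch also reports cola, wine -> juice with 100% confidence, a rule that follows from the same data but is not among those listed on the slide.
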
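Application of GBI to the analysis of command operation histories
  [Figure: graph of command nodes (emacs, latex, xdvi, dvi2ps, lpr) and file nodes (paper.tex, paper.dvi, paper.ps)]
  Method               Accuracy
  Most frequent        22.6%
  Immediately previous 20.7%
  Linear discriminant  22.6%
  1-NN                 20.8%
  CART                 34.6%
  GBI                  57.8%

Characteristics of Graph Based Induction
  Fast analysis of structured objects
  Applicable to concept acquisition, classification rule learning, and speeding up inference
  Application to sequences (DNA, protein)
  Expressing negative conditions requires some ingenuity
  Limited to ordered graphs
  Rules and concepts are limited to connected graphs
  Complex objects are hard to handle because of the duplication problem

Inductive logic programming
  The simplest example run
  Background knowledge
    parent(1,2). parent(1,3). ...
  Positive examples (all others are negative examples)
    grandparent(1,4). grandparent(1,5). …
  Result
    grandparent(X,Y) :- parent(X,Z), parent(Z,Y).
  [Figure: family tree with nodes ① to ⑦]

Hypothesis search in the version space
  Grandparent(X,Y) :- ???
  Covering-set algorithm
  Adding a new literal
  Replacing variables with constants
  Hypothesis selection by the minimum description length principle
  FOIL: Quinlan (1990)
    Best-first search guided by entropy
  Progol: Muggleton (1995)
    Shrinking the search space by inverse entailment
  [Figure: accepted hypotheses, rejected hypotheses, positive examples, negative examples]

Identifying mutagenic substances with Progol
  230 nitro compounds: Ames test positive 138 / negative 92, Debnath et al.: J. Med. Chem. 34: 786-797 (1991).
  188 compounds: analyzed by multiple regression
  Progol: analyzed separately as 188 (12 hr) / 42 (6 hr)
    atm(compound, atom, element, type, charge).
    bond(compound, atom1, atom2, bondtype).
  9 rules
  Comparable classification accuracy
  Automatic discovery of indicator variables
    Phenanthrene skeleton, exceptional acetylene
  Difficult to use, long computation time
  King et al.: Relating chemical activity to structure: an examination of ILP successes, New Generation Computing, Vol. 13, pp. 411-433 (1995).
  [Figure: input structures A to F and a rule over atoms U to Z]

Inductive logic programming: references
  Special issue: Inductive Logic Programming, Journal of the Japanese Society for Artificial Intelligence, Vol. 12, No. 5, pp. 654-688 (1997).
  Lavrac & Dzeroski: Inductive Logic Programming: Techniques and Applications, Hertfordshire, Ellis Horwood (1994).
  Dzeroski, S.: Inductive Logic Programming and Knowledge Discovery in Databases, in Fayyad et al., Advances in Knowledge Discovery and Data Mining, pp. 117-152, AAAI Press (1996).
  Quinlan, J. R.: Learning Logical Definitions from Relations, Machine Learning, Vol. 5, pp. 239-266 (1990).
  Muggleton, S.: Inductive Logic Programming, New Generation Computing, Vol. 8, pp. 295-318 (1991); Inverse Entailment and Progol, ibid. Vol. 13, pp. 245-286 (1995).
  King, R. D. et al.: Relating Chemical Activity to Structure: an examination of ILP successes, ibid. Vol. 13, pp. 411-433 (1995).
  http://gruffle.comlab.ox.ac.uk/oucl/groups/machlearn/

Reference material

Akame 48 Waterfalls (赤目四十八滝), Mie Prefecture
  [Photo]

Rule induction as a data analysis tool
  Rules accurate?       Yes.
  Software available?   Yes.
  Computing fast?       Yes.
  Easy understanding?   Yes.
  Popular?              No.

Possible reasons
  Conservative users
  Unix environment
  No familiar examples
  Too many methods
  Too many rules
  Self-evident rules
  Impressions: ad hoc methods, exploratory
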
Response of users from expected results
  Regression by a few variables
    TSS = ESS + RSS (100% = 99% + 1%)
    Hypothesis confirmed => Satisfactory
  Rule induction
    A few simple rules
    Average accuracy: 99%
    Sum of coverage: 99%
    Self-evident rules => Unsatisfactory
  with Datascape / without Datascape

What is Datascape?
  Quantification
    Problem quantification
    Solution quantification
  Multiple data dependencies
    Explanation from plural viewpoints
    Correlation among explanation variables
    Concise & levelwise deepening descriptions
  Views of solution
    Inspection of individual datum
    Surroundings of solution

Answers to Datascape by the cascade model
  Quantification by SS (SS: sum of squares)
  Data dependencies
    Detection of local interactions
  Unified mechanism for
    Discrimination rules
    Characteristic rules
  Levelwise creation of rule sets

Problem in decision tree 1
  Heuristic search is used to get the best tree.
  A   B   C   Class
  a1  b1  c1  p
  a1  b1  c2  n
  a1  b2  c1  n
  a1  b2  c2  n
  a2  b1  c1  p
  a2  b1  c2  p
  a2  b2  c1  n
  a2  b2  c2  p
  Tree rooted at A (counts are positives/negatives)
    A 4/4
      a1: B 1/3
        b1: C 1/1
          c1: 1/0
          c2: 0/1
        b2: 0/2
      a2: B 3/1
        b1: 2/0
        b2: C 1/1
          c1: 0/1
          c2: 1/0
  Tree rooted at C
    C 4/4
      c1: B 2/2
        b1: 2/0
        b2: 0/2
      c2: A 2/2
        a1: 0/2
        a2: 2/0

Problem in decision tree 2
  Distinction between main conditions and pre-conditions
  Tree rooted at C (as above)
  IF c1 and b1 THEN positive
    Accuracy: 100%; Cover: 25%
  IF b1 added on c1 THEN positive
    Accuracy: 50% → 100%; Cover: 25%

Problem in decision tree 3
  Discrimination power of a rule
  How can we order rules?
  Tree
    A 11/9
      a1: B 4/6
        b1: C 4/1
          c1: 1/0
          c2: 3/1
        b2: 0/5
      a2: B 7/3
        b1: 3/0
        b2: C 2/5
          c1: 1/3
          c2: 1/2
  Rule set (pruned): Accuracy: 85%; Coverage: 100%
    IF a1 and b1 THEN positive   Accuracy: 80%   Cover: 25%
    IF a1 and b2 THEN negative   Accuracy: 100%  Cover: 25%
    IF a2 and b1 THEN positive   Accuracy: 100%  Cover: 15%
    IF a2 and b2 THEN negative   Accuracy: 71%   Cover: 35%

SAR Discovery on the Mutagenicity of Aromatic Nitro Compounds Studied by the Cascade Model
  Takashi Okada
  Kwansei Gakuin University
  okada@kwansei.ac.jp

Contents
  Previous work on this dataset
    Regression
    Inductive logic programming
  Aim of this work
  Cascade model
  Item generation
    Relative indexing of graph vertices
    Types of item expressions
  Results
    Interpretation of rules
    Characteristic structural patterns

Regression analysis
  230 aromatic and heteroaromatic nitro compounds
    Debnath et al.: J. Med. Chem. 34: 786-797 (1991).
  Division into 2 datasets
    188 compounds subject to regression
    42 compounds with diverse structures
  $$\log(\text{Activity}) = 0.65\log P - 2.90\log\left(\beta\,10^{\log P} + 1\right) - 1.38\,\varepsilon_{\mathrm{LUMO}} + 1.88\,I_1 - 2.89\,I_a - 4.15$$
  Conclusion
    Importance of hydrophobicity (log P)
    An electron-attracting element conjugating with the nitro group enhances mutagenicity
    Compounds with 3 or more fused rings (I1 = 1) are much more mutagenic
    Less active aceanthrylenes (Ia = 1)
  [Figure: example structures A to D]

Analysis by Progol (inductive logic programming)
  King et al.: Proc. Natl. Acad. Sci. USA Vol. 93, pp. 438-442 (1996).
    atm(compound, atom, element, type, charge).
    bond(compound, atom1, atom2, bondtype).
  Progol processing
    5 rules from 188 compounds (12 hrs.)
    1 rule from 42 compounds (6 hrs.)
    Accuracy comparable to regression
  Automatic discovery of indicator variables
    3 or more fused rings
    aceanthrylenes
  [Figure: rule substructure over atoms U to Z with partial-charge conditions > 0.01, < 0.005, < -0.406, = 0.146]

Aim of this work
  Uniform treatment of 230 compounds
  Using 2D structural formulae
  Feasibility study of the cascade model for large-scale structural data mining
    Large toxicology database
    Huge SAR database from high-throughput screening

Scheme of analysis
  Structural formulae (SDF file)
  -> Relative indexing of graph vertices
  -> Pattern table
       Compound  Activity  Ptrn-1  Ptrn-2  …  Ptrn-999
       1         1.5       Y       Y       …  N
       2         3.2       N       Y       …  N
  -> Cascade model
  -> Rules
  -> Interpretation

Cascade model 1
  Itemset lattice
    itemset as node
    item inclusion as link
    class distribution as node property
  Data (A, B: class)
    a1,b1: p   a2,b1: p   a2,b1: p
    a1,b2: n   a1,b2: n   a1,b2: p   a2,b2: n   a2,b2: n
  Lattice nodes (positives/negatives)
    []       4/4
    [a1]     2/2    [a2]     2/2    [b1]     3/0    [b2]     1/4
    [a1 b1]  1/0    [a1 b2]  1/2    [a2 b1]  2/0    [a2 b2]  0/2

Cascade model 2
  Nodes as lakes with potential
  Links as waterfalls with power
  High-power waterfalls as rules
  Potential: class purity (from mixed to pure)
  Questions
    Potential definition
    Power definition
    Cascade construction
    Selection of waterfalls

Sum of squares decomposition for categorical data
  Using the SS definition by Gini (n: number of cases, p(a): share of class a, superscript g: group)
  $$TSS = \frac{n}{2}\Bigl(1 - \sum_a p(a)^2\Bigr)$$
  $$WSS^{g} = \frac{n^{g}}{2}\Bigl(1 - \sum_a p^{g}(a)^2\Bigr)$$
  $$BSS^{g} = \frac{n^{g}}{2}\sum_a \bigl(p^{g}(a) - p(a)\bigr)^2$$
  $$TSS = WSS(1) + BSS(1) + WSS(2) + BSS(2)$$

WSS as potential & BSS as power given by TSS decomposition
              #positives/#negatives  Sample variance  Sum of squares
  Whole data  800/200                0.16             TSS: 160
  Group 1     760/40                 0.0475           WSS(1): 38, BSS(1): 18
  Group 2     40/160                 0.16             WSS(2): 32, BSS(2): 72
  18 + 38 + 72 + 32 = 160

A link, distribution of veiled items & a rule
  Node [A: y]
    Item  y   n   WSS   Variance
    B     60  40  24.0  .24
    C     50  50  25.0  .25
    D     60  40  24.0  .24
    E     40  60  24.0  .24
  Node [A: y, B: y]
    Item  y   n   WSS   Variance  BSS
    B     60  0   0.00  .000      9.60
    C     30  30  15.0  0.250     0.00
    D     56  4   3.73  .062      6.67
    E     6   54  5.40  .090      5.40
  IF [B: y] added on [A: y] THEN [D: y; E: n]
    Cases: 100 → 60
    [D: y] 60% → 93%, BSS = 6.67
    [E: n] 60% → 90%, BSS = 5.40
  No need to generate [A: y, B: y, D: y, E: n]

Item generation scheme
  Example molecule: H2C=CH-CH(OH)-CH2-NH2, atoms numbered <1> to <6>
  Linear substructure pattern, e.g. <1>=<2>-<3>-<5>-<6>
    Two terminal atom parts
      Coordination no. and/or hydrogen
    Connecting part along the shortest path
      Element name and/or bond type
      Coordination no. and/or hydrogen
      Number of bonds
  Items
    C3H=C3H        C3H-C4H        C4H-N3H
    C3H=C-C4H      C3H-C-O2H      C4H-C-N3H
    C3H=C-C-O2H    C3H-C-C4H      N3H-C-C-O2H
    C3H=C-C-C4H    C3H-C-C-N3H    C4H-O2H
    C3H=C-C-C-N3H  C4H-C-O2H

Item types and results
  Each cell: #features generated -> #features analyzed, SS explained (#rules) in the first rule set, using minsup = 0.1, thres = 0.1, thr-BSS = 0.01
  Terminal atom part     Connecting part:
                         element & bond type   bond type              #bonds & bond types at terminals
  coord. no. & hydrogen  2044 -> 77, 10.7 (7)  1710 -> 85, 13.0 (8)   822 -> 72, 14.1 (10)
  coord. no.             1678 -> 46, 9.8 (7)   1198 -> 59, 14.2 (9)   531 -> 47, 11.4 (4)
  element                1587 -> 51, 8.5 (7)   1062 -> 57, 14.2 (10)  412 -> 51, 11.6 (5)

Computation by DISCAS
  Categorization into 4 activity classes
    inactive; low: y < 0.0; medium: 0.0 ≤ y < 3.0; high: y ≥ 3.0
  Parameters
    minsup = 0.05, thres = 0.1, thr-BSS = 0.01
  Lattice generation
    PC with 266 MHz Pentium II, 256 MB
    109 sec. for 6939 nodes
    #items  0  1   2     3
    #nodes  1  91  1910  4937
  209 rules in 10 rule sets
  10 rules in the first rule set

Interpretation of a rule (1): basic contents
  IF [C3HrCrC-CrCrCrC-N3: y] added on [C3rCrCrCrC3: n]
  THEN Activity = low: 40.8% -> 14.0%; BSS: 3.25; Cases: 157 -> 43
  Activity distribution (inactive, low, medium, high):
    0.10 0.41 0.41 0.08 ==> 0.00 0.14 0.58 0.28
  [Figure: structures I and II]

Interpretation of a rule (2): an optional RHS term
  IF [C3HrCrC-CrCrCrC-N3: y] (structure II) added on [C3rCrCrCrC3: n]
  THEN C3rCrCrC-N-O1 = y (structure III): 68.2% -> 100.0%; BSS: 4.36; Cases: 157 -> 43
  [Figure: structures II, III and IV]

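The BSS values attached to the rules above come from the sum-of-squares decomposition slides. The sketch below recomputes the worked numbers (TSS = 160 split into 18 + 38 + 72 + 32, and the link BSS values 9.60, 6.67 and 5.40) from class counts. It is a minimal illustration; the function names `wss` and `bss` are chosen here and are not taken from DISCAS.

```python
def wss(counts):
    """Sum of squares of one node: (n / 2) * (1 - sum of squared class shares).

    Applied to the whole data set this is TSS; applied to a group it is WSS.
    """
    n = sum(counts)
    return n / 2 * (1 - sum((c / n) ** 2 for c in counts))


def bss(child, parent):
    """Between-group sum of squares of a child node relative to its parent."""
    n_child, n_parent = sum(child), sum(parent)
    return n_child / 2 * sum(
        (c / n_child - p / n_parent) ** 2 for c, p in zip(child, parent)
    )


if __name__ == "__main__":
    whole = (800, 200)                    # (#positives, #negatives)
    groups = [(760, 40), (40, 160)]
    parts = [(wss(g), bss(g, whole)) for g in groups]
    print("TSS:", round(wss(whole), 2))
    print("WSS, BSS per group:", [(round(w, 2), round(b, 2)) for w, b in parts])
    # Link [A: y] -> [A: y, B: y]: item D moves from 60/40 to 56/4.
    print("BSS for D:", round(bss((56, 4), (60, 40)), 2))
```

The same function reproduces the rule in "Interpretation of a rule (2)" if one assumes that 68.2% of the 157 cases means 107 cases: `bss((43, 0), (107, 50))` evaluates to about 4.36.
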
Interpretation of a rule (3): integration of information
  Incorporation of other optional RHS terms with 100% confidence
  Consultation of the supporting instances in the database
  Possible substructure patterns for discrimination
  Selection by chemist's intuition
  [Figure: structures V and VI; 43 compounds, 114 compounds]

Interpretation of a rule (4): absence of a substructure
  IF [C3rCrCrC-N3: n] added on [C3HrCrCrC-N-O1: y]
  THEN Activity = low: 30.7% -> 56.4%; BSS: 3.03; Cases: 137 -> 39
  Activity distribution (inactive, low, medium, high):
    0.13 0.31 0.38 0.18 ==> 0.26 0.56 0.18 0.00
  Co-occurrence of
    Absence of [C3rCrCrC-N3]
    Presence of [C3HrCrCrC-N3]
  All substituents at the 4th position from N are hydrogen.
  [Figure: structures VII and VIII]

Characteristic substructure patterns (1): lower activity
  L1: 0.13 0.31 0.38 0.18 ==> 0.26 0.56 0.18 0.00; Cases: 137 -> 39
    All substituents at the 4th position from the NO2 group must be hydrogen, as in VII.
  L2: 0.12 0.22 0.42 0.25 ==> 0.14 0.71 0.14 0.00; Cases: 69 -> 14
    All substituents at the 5th position from the NO2 group must be hydrogen, as in IX, as far as polycyclic aromatic nitro compounds with hydrogen at the 4th position are concerned.
  Specific effects of hydrogen atoms?
  Stopping the expansion of polycyclic aromatic rings?
  [Figure: structures VII and IX]

Characteristic substructure patterns (2): lower activity
  L3: 0.11 0.20 0.45 0.23 ==> 0.28 0.56 0.17 0.00; Cases: 115 -> 18
    Polycyclic aromatic compounds with nitrogen.
  L4: 0.11 0.32 0.40 0.17 ==> 0.25 0.75 0.00 0.00; Cases: 128 -> 12
    Monocyclic aromatic nitro compounds.
  [Figure: structures X and XI; scatter plot with a LogP axis]

Characteristic substructure patterns (3): medium to higher activity
  M1: 0.10 0.41 0.41 0.08 ==> 0.00 0.14 0.58 0.28; Cases: 157 -> 43
    4-nitrobiphenyl (VI) among aromatic compounds without fused ring systems.
  M2: 0.18 0.43 0.39 0.00 ==> 0.00 0.12 0.88 0.00; Cases: 56 -> 17
    NO2 substitution at a heteroaromatic ring or at a double bond in acenaphthylenes and aceanthrylenes.
  [Figure: structures VI and XII]

Characteristic substructure patterns (4): medium to higher activity
  M3: 0.10 0.41 0.41 0.08 ==> 0.00 0.19 0.55 0.26; Cases: 157 -> 47
    Almost the same as M1, but includes XIII.
  M4: 0.10 0.41 0.41 0.08 ==> 0.00 0.17 0.54 0.29; Cases: 157 -> 41
    A biphenyl structure. One side is a monocyclic benzene with an ortho substituent. Most compounds have one of the skeletons XIV, XV, XVI.
  [Figure: structures XIII, XIV, XV and XVI]

Characteristic substructure patterns (5): higher activity
  H1: 0.10 0.31 0.44 0.15 ==> 0.09 0.09 0.48 0.34; Cases: 230 -> 65
    Polycyclic aromatic nitro compounds. The 5th position from the NO2 group is substituted to form a larger structure, as in XVII.
  H2: 0.12 0.30 0.36 0.21 ==> 0.00 0.00 0.35 0.65; Cases: 56 -> 17
    Compounds with the skeleton XVIII are the support of this rule.
  [Figure: structures XVII and XVIII; plot with an axis from -4 to 6]

Concluding remarks: positive aspects
  New feature generation method for graphs
  Efficient lattice construction
    Using thousands of features
    Employing a hundred features
  Detection of substructure patterns
    Comprehensible by experts
    Tractable number of rules

Concluding remarks: problems
  Groups consisting of a small number of compounds are missed
    17 compounds with an NH2 group
    All medium or low activity
  Interpretation of rules needs more help from the computer
    Remove optional RHS terms with redundant substructures
    Maximal common subgraph
  Future development
    Incorporation of regression function
    Extension to 3D structures