;;; -*- Mode: LISP; Syntax: Common-lisp; Package: USER; Base: 10 -*-
;
;******************************************************************************
;
;   HOW TO RUN AUTOCLASS II - Automatic Classification of Data
;
;   Authors : Peter Cheeseman - RIACS - Research Institute for Advanced
;                               Computer Science
;             Matthew Self - Sterling Software
;
;             John Stutz - RIA - Artificial Intelligence Research Branch
;
;             William Taylor - Sterling Software
;
;   Address : MS 244-20, NASA Ames Research Center, Moffett Field, CA 94035
;
;   Phone : (415) 694-3364 - FTS 464-3364
;
;   Arpanet : taylor@pluto.arc.nasa.gov
;
;   Environment : Common Lisp
;
;   Revision History : 11 Jan 88 - original released version - AutoClass 30D
;                      15 Aug 88 - revised version - AutoClass 30E
;
;******************************************************************************
;;;
;;; File : >u>taylor>ui>autoclass-x>nograph-ui>READ-ME.TEXT
;;;
;;; Description : AUTOCLASS II DATA BASE PROCESSING INSTRUCTIONS
;;;
;;; Purpose : This file is an introductory presentation of AutoClass II
;;;           processing, with sample execution forms.
;;;
;;;***************************************************
;;; FILE LIST FOR AUTOCLASS CARRY TAPE

; COPYRIGHT NOTICE & DOCUMENTATION
("cha:>u>taylor>ui>autoclass>doc>riacs-policy.text.newest"
 "cha:>u>taylor>ui>autoclass>doc>how-to-run-autoclass.text.newest"
 ; AUTOCLASS PROGRAM
 "cha:>u>taylor>ui>autoclass-x>ac>*.lisp.newest"
 "cha:>u>taylor>ui>autoclass-x>ac>*.bin.newest"
 ; GRAPHICAL USER INTERFACE
 "cha:>sys>site>ui-library.system.newest"
 "cha:>sys>site>ui-library.translations.newest"
 "cha:>u>taylor>ui>library>*.*.newest"
 "cha:>sys>site>autoclass-ui.system.newest"
 "cha:>sys>site>autoclass-ui.translations.newest"
 "cha:>u>taylor>ui>autoclass>**>*.*.newest"
 "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>iras-pre-processing.lisp.newest"
 "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>iras-data-mods.lisp.newest"
 "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>dynamic-meta-basic-classes-mapping.lisp.newest"
 "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>iras-pre-processing.bin.newest"
 "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>iras-data-mods.bin.newest"
 ; NON-GRAPHICAL USER INTERFACE
 "cha:>u>taylor>ui>autoclass-x>nograph-ui>*.*.newest"
 ; SMALL DATA BASE COMMAND FILES & DATA BASES
 "cha:>u>marshall>autoclass>iris>*.*.newest"
 "cha:>u>taylor>ui>autoclass-x>data>soybean>*.*.newest"
 "cha:>u>taylor>ui>autoclass-x>data>med>*.*.newest"
 ; LARGE DATA BASE COMMAND FILES
 "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>run-it.lisp.newest"
 "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>run-it-splits.lisp.newest"
 "cha:>u>marshall>autoclass>psc>run-it.lisp.newest"
 )
;;;***************************************************

1.0 PREPARATION OF DATA BASES FOR AUTOCLASS

An AutoClass II data base consists of N-DATA data objects, each referred to
as a datum. Each datum consists of N-ATTRIBUTES attributes. AutoClass II
handles three types of datum attributes: IGNORE, DISCRETE and REAL-PT.

IGNORE attributes can be single symbols (e.g. NASA), numbers (e.g.
23, 2.3e-12, etc.) or string constants containing multiple symbols (e.g.
"NASA Ames Research Center"). They are not used in the classification; they
serve only to identify each datum after the classification is completed.

DISCRETE attributes are integers, which are "1 based": their smallest value
is 1, not 0. The value 0 is reserved by AutoClass II to denote unknown
discrete values. Typically every value within the range of a discrete
attribute is potentially usable. The value "?" also denotes an unknown value.

REAL-PT attributes are real numbers (in any Common Lisp notation) or
integers. Unknown values are designated by "?".

To describe each attribute, its type (*VARIABLE-TYPES*), range
(*VARIABLE-RANGES*), and description (*VARIABLE-DESCRIPTIONS*) must be
specified.

Example: A data base consists of 10 data objects, each with four attributes.
The first attribute is the observation number, the second is a person's
name, the third is the answer to the question "Do you drink Coca-Cola?"
(0 = unknown, 1 = yes, 2 = no) and the fourth is the number of bottles/cans
of Coca-Cola consumed in one year. The information required by AutoClass II,
in addition to the actual data, is the following:

     N-DATA                  : 10
     N-ATTRIBUTES            : 4
     *VARIABLE-TYPES*        : (ignore ignore discrete real-pt)
     *VARIABLE-RANGES*       : ((0 0) (0 0) (1 2) (0 10000))
     *VARIABLE-DESCRIPTIONS* : ("observation number" "name of person"
                                "drink Coca-Cola?" "Cokes per year")
     data (in file ..>coke-usage.dat) :
           1  "Joe Piscapoe"     1  23
           2  "Ronald Reagan"    2  0
           3  "Gerald Ford"      ?  ?
           4  "Ollie North"      1  9999
           5  "Alan Cranston"    1  ?
           6  "Pete Wilson"      2  0
           7  "Diane Feinstein"  1  365
           8  "Jane Goodall"     0  0
           9  "Cher"             1  52
          10  "Sonny Bono"       1  666

The data base definitional information is used to build the data base header
file, which provides AutoClass II with the information it needs to process
the data base. Before it can be classified, the data base must go through two
pre-processing steps (.dat -> .db & .db -> .base).
Finally, the header file (-hd.db) and the pre-processed data base (.base) are
processed by AutoClass II to generate the classifications and reports.

2.0 PRE-PROCESSING DATA FOR AUTOCLASS II

Load AutoClass II functions (required for all phases: pre-processing,
classification & report generation)

     (send *terminal-io* :set-more-p nil)
          ; turn off *MORE* processing
     (send *terminal-io* :set-deexposed-typeout-action :permit)
          ; allow Lisp Listener to typeout when de-exposed

     (load "cha:>u>taylor>ui>autoclass-x>AC>development-30E.bin")
     (load "cha:>u>taylor>ui>autoclass-x>AC>dev-will.bin")

2.1 PRE-PROCESS DATA - STEP 1 : DATA BASE PREPARATION

Edit a file to contain the assignment of the following four global variables
(see ..>autoclass-x>data>med>prepare.lisp)

Variable-types (*VARIABLE-TYPES*)
     list of the type of each variable (attribute): IGNORE, DISCRETE or
     REAL-PT. IGNORE attributes are normally put at the beginning of the
     attribute ordering. (Entered by hand.)

Variable Ranges (*VARIABLE-RANGES*)
     list of sublists, each holding the minimum and maximum range values of
     one attribute. For a DISCRETE attribute whose minimum range value is 0,
     enter its index in *DISCRETES-ADJUSTED+1*; PROCESS-DATA-RANGES &
     GENERATE-HEADER-FILE will take care of adjusting its values by +1.
     (0 0) is used for IGNORE variable types. (Entered by hand.)

Discrete variables whose range includes 0 (*DISCRETES-ADJUSTED+1*)
     the value 0 is reserved by AutoClass II for unknown values.
     (Entered by hand.)

Variable-descriptions (*VARIABLE-DESCRIPTIONS*)
     a list of strings containing a short description of each attribute.
     (Entered by hand.)

2.2 PRE-PROCESS DATA - STEP 2 : RANGE CHECKING - .DAT -> .DB

Assumes that the .dat file is in Lisp readable format. :log-file contains a
log of the changes, if any, to the data base.

     (process-data-ranges
       "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history.DAT"
       "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history.DB"
       :n-variables 279
       :log-file "cha:>u>taylor>ui>autoclass-x>data>med>range-ck-changes.log")

2.3 PRE-PROCESS DATA - STEP 3 : BUILD DATA BASE HEADER FILE

     (generate-header-file
       "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history-HD.DB"
       :n-data 150
       :title "Medical Cancer History Data Base Header from UCSF - Jonathan")

The following four variables are generated by GENERATE-HEADER-FILE and placed
into the file.

Attribute ranges for discretes (*DISC-VAR-RANGES*)
     a list of integers for discretes or NIL for IGNORE or REAL-PT
     attributes. For discretes, the integer is the number of possible values
     that the attribute can assume, on the assumption that discrete
     variables are 1 based, not 0 based.
     NOTE that the value 0 is used to designate unknown values, so if a
     discrete attribute uses 0 for a meaningful value, then all its data
     values must be increased by 1.
     NOTE that READ-DB adds 1 to all discrete variable range values that it
     reads in, to handle the 0 values it assigns to unknown values (?).
     E.g. an attribute with values 0 and 1 would have an input range of 2
     and altered input datum values of 1 and 2. Another example: possible
     datum values 1 and 45 => range of 45.
     (Generated by GENERATE-*DISC-VAR-RANGES*)

Discrete priors (*DISC-PRIORS*)
     normally the form NIL, which indicates that defaults are used

Continuous variable priors for real-pts (*REAL-PRIORS*)
     a list of NIL for IGNORE or DISCRETE attributes, or a sub-list of
     default mean and standard deviation, e.g. (1 1), for REAL-PT
     attributes. After the initial real priors are computed
     (COLLECT-REAL-PRIORS-FROM-DATA), this list is replaced by the value of
     *REAL-PRIORS* for the classification processing (FIND-BEST-N).
     (Generated by GENERATE-*REAL-PRIORS*)

No. of datum (*N-DATA*)
     the number of data items containing (length *variable-types*)
     attributes - an integer

2.4 PRE-PROCESS DATA - STEP 4 : REAL-PT CHECKING - .DB -> .BASE

This step identifies REAL-PT variables which are in fact "discrete scalars"
and corrects them by transforming them to DISCRETEs. :log-file contains a log
of the changes, if any, to the data base.

     (process-real-pt-data
       "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history.DB"
       "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history.BASE"
       :n-data 150
       :variable-descriptions t
       :header-file "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history-hd.db"
       :log-file "cha:>u>taylor>ui>autoclass-x>data>med>real-pt-changes.log")

3.0 GENERATE BEST CLASSIFICATIONS

These forms are to be run in the Lisp Listener, if on a Lisp machine.

The *current-results* file does not need to exist before the generation of
classifications begins. AutoClass II creates it if it does not exist.

The number of classes to be used by the partition is an upper bound selected
by the user, depending on the nature of the data. (It appears in the
MAKE-PART&CLASSES call, in the *CURRENT-RESULTS* file name and in the
:N-classes argument of FIND-BEST-N-2.) The upper bound selected should result
in at least one empty class in the completed classification.

3.1 FOR SMALL DATA BASES: (* N-ATTRIBUTES N-DATA) < 10,000

     (read-*base-data*
       "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history.base")
     (setf *current-results*
           "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history-20.wt-set")
     (setf *partition* (make-part&classes 20))
     (find-best-n-2 5                  ; N-held number-of-best-classifications-to-save
                    *partition*        ; classification partition
                    'DISPERSE-CYCLE    ; search-method
                    5                  ; N-cycles needed to check for termination
                    *current-results*  ; weight set results written here
                    :N-classes 20      ; number-of-class-bins
                    :file-header "Cancer history data base"
                                       ; description for top of *current-results*
                    )

The FIND-BEST-N-2 run should be stopped by typing ~~~ in the Lisp Listener.
This will allow the current cycle to finish smoothly and the classification
results to be written to the .wt-set file, if appropriate. Since the
classification results are written to this file incrementally, the process
can be stopped (by ~~~) and re-started at any time; the previous .WT-SET
results will be retained. This form can be run as long or as many times as
desired to get the best of a large set of randomly seeded classifications.
About 15 such classifications are sufficient. The count reached so far can be
checked by "viewing" the beginning of the .WT-SET file while the form is
running: the second item in the file is the number of classifications that
had been generated when the best N-HELD results were last written to the
file.

3.2 FOR LARGE DATA BASES: (* N-ATTRIBUTES N-DATA) > 10,000

Since the "disperse-cycle" search method is particularly time intensive for
large data bases, a two step search strategy is used: multiple
"dynamic-cycle" searches followed by a "disperse-cycle" search.

     (read-*base-data*
       "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>spectra-5425.base")
     (setf *current-results*
           "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>spectra-80.wt-set")
     (setf *partition* (make-part&classes 80))
     (find-best-n-2 5                  ; N-held number-of-best-classifications-to-save
                    *partition*        ; classification partition
                    'DYNAMIC-CYCLE     ; search-method
                    5                  ; N-cycles needed to check for termination
                    *current-results*  ; weight set results written here
                    :N-classes 80      ; number-of-class-bins
                    :file-header "LRS-5425 data base"
                                       ; description for top of *current-results*
                    :split-.wt-set-file t)
                                       ; creates separate files for each classification

The :split-.wt-set-file keyword reduces the i/o time of writing a large
.wt-set file holding all N-held (5) classifications. It creates an individual
file for each classification instead, named after *current-results* with the
suffix "-n" (n = 0->4).
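
The split file naming described above can be sketched as follows. This is an
illustration only, not part of AutoClass II: SPLIT-RESULT-NAMES is a
hypothetical helper, and placing the "-n" suffix before the file type is an
assumption inferred from the SPECTRA-80-4.wt-set name used in this file.

```lisp
;; Hypothetical helper, not an AutoClass II function.
;; Assumption: the "-n" suffix goes just before the last "." of the name.
(defun split-result-names (results-file n-held)
  "Return the N-HELD split file names derived from RESULTS-FILE (a string)."
  (let* ((dot  (position #\. results-file :from-end t)) ; start of file type
         (stem (subseq results-file 0 dot))             ; whole name if no "."
         (ext  (if dot (subseq results-file dot) "")))
    (loop for n from 0 below n-held
          collect (format nil "~A-~D~A" stem n ext))))

;; (split-result-names "spectra-80.wt-set" 5)
;; => ("spectra-80-0.wt-set" "spectra-80-1.wt-set" "spectra-80-2.wt-set"
;;     "spectra-80-3.wt-set" "spectra-80-4.wt-set")
```

Such a list may be handy for checking which split files a run has written
before choosing one of them for the next step.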

The .wt-set file with the lowest MML (mean message length), the third item in
the .wt-set file, is selected for the "disperse-cycle" run. In this case the
"-4" file was selected:

     (read-spectral-*base-data*
       "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>spectra-5425.base")
     (setf *current-results*
           "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>disperse>SPECTRA-80-4.wt-set")
     (setf *partition* (make-part&classes 80))
     (update-.wt-set-file *partition*
                          *current-results*
                          :update-method 'disperse-cycle
                          :n-cycles 2
                          :cache-file *current-results*
                          :split-.wt-set-file t)

4.0 GENERATE REPORTS FOR BEST CLASSIFICATION

FOR NON-GRAPHICAL USER INTERFACE: (AutoClass must be loaded first)

     (load "cha:>u>taylor>ui>autoclass-x>NOGRAPH-UI>load-it.lisp")

FOR GRAPHICAL USER INTERFACE: (AutoClass must be loaded first)

     (load "cha:>u>taylor>ui>autoclass>load-it.lisp")

4.1 NON-GRAPHICAL METRICS AND CROSS-REFERENCE REPORTS

The following is used after all classifications are complete to generate
reports based on the best classification in the .WT-SET file. There are three
reports: metrics, xref-by-class & xref-by-case. Additional datum variables
can be added to the xref reports by extending the code in DATA-MODS.LISP.

     (generate-reports-for-best-classification
       :base-file "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history.base"
                                       ; nil -> already loaded
       :results-file "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history-10.wt-set"
       :N-classes 10                   ; number-of-class-bins
       :metrics-report-file "cha:>u>taylor>ui>autoclass-x>data>med>metrics-sample.text"
       :xref-by-class-file "cha:>u>taylor>ui>autoclass-x>data>med>xref-by-class.text"
       :xref-by-case-file "cha:>u>taylor>ui>autoclass-x>data>med>xref-by-case.text")

4.2 GRAPHICAL METRICS AND CROSS-REFERENCE REPORTS

If the current (loaded) classification is not the one desired, the desired
one must be generated with the forms below. (*current-results* can be either
a non-split .wt-set file, as below, or a split one.)

     (read-*base-data*
       "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history.base")
     (setf *current-results*
           "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history-20.wt-set")
     (setf *partition* (make-part&classes 20))
     (get-best-of-weightings *current-results*)
     (collect-weights *partition*)
     (update-partition-mmls *partition*)
     (setf *class-wt-ordering* (get-class-weight-ordering))
     (setf *class-assignments* (get-max-weight-classes *curr-wts*))

The graphical user interface can produce screen hardcopies of its displays:
the class populations, the attribute influence coefficients (listed
numerically in the metrics report) and several data plots. It also allows the
metrics report to be started with a mouse click.

The cross-reference reports are invoked by executing the desired functions.
See the large data base command files for examples. The IRAS-DATA-MODS.LISP
file demonstrates how >NOGRAPH-UI>DATA-MODS.LISP can be extended.