;;; -*- Mode: LISP; Syntax: Common-lisp; Package: USER; Base: 10 -*-

;
;******************************************************************************
;
;  HOW TO RUN AUTOCLASS II - Automatic Classification of Data
;
;  Authors : Peter Cheeseman - RIACS - Research Institute for Advanced
;					Computer Science
;	     Matthew Self    - Sterling Software
;
;	     John Stutz      - RIA - Artificial Intelligence Research Branch
;
;	     William Taylor  - Sterling Software
;
;  Address : MS 244-20, NASA Ames Research Center, Moffett Field, CA 94035
;
;  Phone : (415) 694-3364   -   FTS 464-3364
;
;  Arpanet : taylor@pluto.arc.nasa.gov
;  
;  Environment : Common Lisp
;
;  Revision History : 11 Jan 88 - original released version - AutoClass 30D
;	              15 Aug 88 - revised version - AutoClass 30E
;
;******************************************************************************

;;;
;;; File : >u>taylor>ui>autoclass-x>nograph-ui>READ-ME.TEXT
;;;
;;; Description : AUTOCLASS II DATA BASE PROCESSING INSTRUCTIONS
;;;
;;; Purpose : This file is an introductory presentation of AutoClass II
;;;		processing, with sample execution forms.
;;;
;;;***************************************************

;;; FILE LIST FOR AUTOCLASS CARRY TAPE

 ; COPYRIGHT NOTICE & DOCUMENTATION 
("cha:>u>taylor>ui>autoclass>doc>riacs-policy.text.newest"
 "cha:>u>taylor>ui>autoclass>doc>how-to-run-autoclass.text.newest"
 ; AUTOCLASS PROGRAM
 "cha:>u>taylor>ui>autoclass-x>ac>*.lisp.newest"
 "cha:>u>taylor>ui>autoclass-x>ac>*.bin.newest"
 ; GRAPHICAL USER INTERFACE
 "cha:>sys>site>ui-library.system.newest"
 "cha:>sys>site>ui-library.translations.newest"
 "cha:>u>taylor>ui>library>*.*.newest"
 "cha:>sys>site>autoclass-ui.system.newest"
 "cha:>sys>site>autoclass-ui.translations.newest"
 "cha:>u>taylor>ui>autoclass>**>*.*.newest"
 "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>iras-pre-processing.lisp.newest"
 "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>iras-data-mods.lisp.newest"
 "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>dynamic-meta-basic-classes-mapping.lisp.newest"
 "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>iras-pre-processing.bin.newest"
 "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>iras-data-mods.bin.newest"
 ; NON-GRAPHICAL USER INTERFACE
 "cha:>u>taylor>ui>autoclass-x>nograph-ui>*.*.newest"
 ; SMALL DATA BASE COMMAND FILES & DATA BASES
 "cha:>u>marshall>autoclass>iris>*.*.newest"
 "cha:>u>taylor>ui>autoclass-x>data>soybean>*.*.newest"
 "cha:>u>taylor>ui>autoclass-x>data>med>*.*.newest"
 ; LARGE DATA BASE COMMAND FILES
 "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>run-it.lisp.newest"
 "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>run-it-splits.lisp.newest"
 "cha:>u>marshall>autoclass>psc>run-it.lisp.newest"
 )
;;;***************************************************

 1.0 PREPARATION OF DATA BASES FOR AUTOCLASS

An AutoClass II data base consists of N-DATA data objects, each referred
to as a datum.  Each datum consists of N-ATTRIBUTES attributes.
AutoClass II handles three types of datum attributes:
IGNORE, DISCRETE and REAL-PT.

IGNORE attributes can be single symbols (e.g. NASA), numbers
(e.g. 23, 2.3e-12, etc) or string constants containing multiple symbols
(e.g. "NASA Ames Research Center").  They are not used in the classification;
they serve only to identify each datum after the classification
is completed.

DISCRETE attributes are integer numbers, which are "1 based". That means
that their smallest value is 1, not 0. The value 0 is reserved for use by
AutoClass II to denote unknown discrete values.  Typically all values are
potentially usable within the range of the discrete attribute. The value "?"
also denotes an unknown value.

REAL-PT attributes are real numbers (any Common Lisp notation) or integers.
Unknown values are designated by "?".

To describe each attribute, its type (*VARIABLE-TYPES*),
range (*VARIABLE-RANGES*), and description (*VARIABLE-DESCRIPTIONS*)
must be specified.

Example: A data base consists of 10 data objects, each with four attributes.
The first attribute is the observation number, the second is a person's
name, the third is the answer to the question "Do you drink Coca-Cola?"
(0 = unknown, 1 = yes, 2 = no) and the fourth is the number of
bottles/cans of Coca-Cola consumed in one year.

The information required by AutoClass II, in addition to the actual
data, is the following:

N-DATA : 10
N-ATTRIBUTES : 4
*VARIABLE-TYPES* : (ignore ignore discrete real-pt)
*VARIABLE-RANGES* : ((0 0) (0 0) (1 2) (0 10000))
*VARIABLE-DESCRIPTIONS* : ("observation number" "name of person"
	 "drink Coca-Cola?" "Cokes per year")

data (in file ..>coke-usage.dat) :

 1	"Joe Piscapoe"      1	  23
 2	"Ronald Reagan"     2	   0
 3	"Gerald Ford"       ?	   ?
 4	"Ollie North"       1	9999
 5	"Alan Cranston"     1	   ?
 6	"Pete Wilson"       2	   0
 7	"Diane Feinstein"   1	 365
 8	"Jane Goodall"      0	   0
 9	"Cher"              1	  52
10	"Sonny Bono"        1	 666

The data base definitional information is used to build the data base header
file, which provides AutoClass II with the necessary information to process
the data base. In order to properly process the data base, it must go through
two conversion steps (<file>.dat -> <file>.db & <file>.db -> <file>.base).
Finally, the header file (<file>-hd.db) and the pre-processed data base
(<file>.base) are processed by AutoClass II to generate the classifications and
reports.
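
The file naming convention above can be summarized in a few lines of Lisp.
This is only a sketch: DERIVED-FILE-NAMES is a hypothetical helper, and it
works on plain strings rather than on Lisp Machine pathnames.

```lisp
;; Hypothetical helper for illustration; it is not part of AutoClass II.
(defun derived-file-names (stem)
  "Return the four file names used for the data base named STEM, as strings."
  (list (concatenate 'string stem ".dat")      ; raw data, Lisp readable
        (concatenate 'string stem ".db")       ; after range checking
        (concatenate 'string stem "-hd.db")    ; data base header file
        (concatenate 'string stem ".base")))   ; after real-pt checking

;; Example: (derived-file-names "coke-usage")
```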


 2.0 PRE-PROCESSING DATA FOR AUTOCLASS II 

   Load AutoClass II functions (required for all phases: pre-processing,
	classification & report generation)

   (send *terminal-io* :set-more-p nil)				; turn off *MORE* processing
   (send *terminal-io* :set-deexposed-typeout-action :permit)	; allow Lisp Listener to typeout when de-exposed

   (load "cha:>u>taylor>ui>autoclass-x>AC>development-30E.bin")
   (load "cha:>u>taylor>ui>autoclass-x>AC>dev-will.bin")

 2.1 PRE-PROCESS DATA - STEP 1 : DATA BASE PREPARATION

   Edit a file to contain the assignment of the following four
	global variables (see ..>autoclass-x>data>med>prepare.lisp)

   Variable-types  (*VARIABLE-TYPES*)
	list of the type of each variable (attribute) : IGNORE, DISCRETE or REAL-PT.
	IGNORE attributes are normally put at the beginning of the attribute ordering.
	(Entered by hand.)

   Variable Ranges  (*VARIABLE-RANGES*)
	list of sublists, each holding the minimum & maximum range values of one variable.
	For each DISCRETE variable which has a minimum range value of 0,
	   enter its index in *DISCRETES-ADJUSTED+1*.  PROCESS-DATA-RANGES &
	   GENERATE-HEADER-FILE will take care of adjusting its values by +1.
	(0 0) is used for IGNORE variable types.
	(Entered by hand.)

   Discrete variables whose range includes 0 (*DISCRETES-ADJUSTED+1*)
	list of the indices of DISCRETE variables whose data values include 0,
	a value reserved by AutoClass II for unknown values.
	(Entered by hand.)

   Variable-descriptions (*VARIABLE-DESCRIPTIONS*)
	a list of strings containing a short description of each attribute.
	(Entered by hand.)
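
   As an illustration, the four variables for the Coke usage example of
   section 1.0 might be assigned as sketched below.  DEFPARAMETER is used
   only so that the sketch stands alone, and ATTRIBUTE-LISTS-CONSISTENT-P is
   a hypothetical check added here, not an AutoClass II function.

```lisp
(defparameter *variable-types* '(ignore ignore discrete real-pt))
(defparameter *variable-ranges* '((0 0) (0 0) (1 2) (0 10000)))
;; No discrete attribute in this example has a minimum range value of 0.
(defparameter *discretes-adjusted+1* nil)
(defparameter *variable-descriptions*
  '("observation number" "name of person" "drink Coca-Cola?" "Cokes per year"))

;; Hypothetical sanity check, for illustration only.
(defun attribute-lists-consistent-p (types ranges descriptions)
  "True if the three lists have equal length and every range is a (min max) pair."
  (and (= (length types) (length ranges) (length descriptions))
       (every (lambda (range)
                (and (= (length range) 2)
                     (<= (first range) (second range))))
              ranges)))
```

   Evaluating (attribute-lists-consistent-p *variable-types* *variable-ranges*
   *variable-descriptions*) before pre-processing catches a missing or extra
   entry early.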


 2.2 PRE-PROCESS DATA - STEP 2 : RANGE CHECKING - <FILE>.DAT -> <FILE>.DB 

   Assumes that <file>.dat is in Lisp readable format. :log-file contains a log of the
   changes, if any, to the data base.

   (process-data-ranges "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history.DAT"
		        "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history.DB"
		        :n-variables 279
		        :log-file "cha:>u>taylor>ui>autoclass-x>data>med>range-ck-changes.log")
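
   The kind of test involved in range checking can be sketched as follows.
   This is an assumption about the idea only, not the PROCESS-DATA-RANGES
   implementation; OUT-OF-RANGE-ATTRIBUTES is a hypothetical name, and what
   the real function does with an out-of-range value is not shown here.

```lisp
;; Hypothetical helper for illustration; it is not part of AutoClass II.
(defun out-of-range-attributes (datum types ranges)
  "Return the 0-based positions of the attributes of DATUM whose known
values lie outside the corresponding (min max) entry of RANGES."
  (loop for value in datum
        for type in types
        for (low high) in ranges
        for position from 0
        when (and (not (string-equal type "IGNORE"))   ; IGNOREs are not checked
                  (realp value)                        ; skips the ? marker
                  ;; 0 denotes an unknown DISCRETE value, so it is not an error.
                  (not (and (string-equal type "DISCRETE") (eql value 0)))
                  (not (<= low value high)))
          collect position))

;; Example, using the Coke usage definitions of section 1.0:
;; (out-of-range-attributes '(4 "Ollie North" 1 9999)
;;                          '(ignore ignore discrete real-pt)
;;                          '((0 0) (0 0) (1 2) (0 10000)))
```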

 2.3 PRE-PROCESS DATA - STEP 3 : BUILD DATA BASE HEADER FILE

   (generate-header-file "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history-HD.DB"
	                 :n-data 150
	                 :title "Medical Cancer History Data Base Header from UCSF - Jonathan")


   The following four variables are generated by GENERATE-HEADER-FILE and placed into the file.

   Attribute ranges for discretes (*DISC-VAR-RANGES*)
	a list holding an integer for each DISCRETE attribute and NIL for each IGNORE
	or REAL-PT attribute.  For discretes, the integer is the number of possible
	integer values that the attribute can assume, based on the assumption that
	discrete variables are 1 based, not 0 based.
	NOTE that the value of 0 is used to designate unknown values, so if a
	discrete attribute uses 0 for a meaningful value, then all its data values
	must be increased by 1. NOTE that READ-DB adds 1 to all discrete variable range
	values that it reads in, to handle the 0 values it assigns to unknown values (?).
	E.g. an attribute with values 0 and 1 would have an input range of 2 and
	altered input datum values of 1 and 2.
	Another example: possible datum values: 1 and 45 => range of 45.
	(Generated by GENERATE-*DISC-VAR-RANGES*)
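
   The arithmetic behind the two examples above can be written out as a
   short sketch.  ADJUSTED-DISCRETE-RANGE is a hypothetical helper, not the
   GENERATE-*DISC-VAR-RANGES* code; it only restates the rule that a discrete
   attribute whose minimum value is 0 has its values shifted by +1.

```lisp
;; Hypothetical helper for illustration; it is not part of AutoClass II.
(defun adjusted-discrete-range (min-value max-value)
  "Return two values for a discrete attribute whose data values run from
MIN-VALUE to MAX-VALUE: the input range, and the offset added to each value."
  (if (zerop min-value)
      (values (1+ max-value) 1)   ; 0 based: shift values up by 1
      (values max-value 0)))      ; 1 based: values are used as they are

;; (adjusted-discrete-range 0 1)   ; values 0 and 1 become 1 and 2
;; (adjusted-discrete-range 1 45)
```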

   Discrete priors (*DISC-PRIORS*)
	normally the form NIL, which indicates that defaults are used

   Continuous variable priors for real-pts (*REAL-PRIORS*)
	a list holding NIL for each IGNORE or DISCRETE attribute and, for each REAL-PT
	attribute, a sub-list of default mean and standard deviation, e.g. (1 1).
	After the initial real priors are computed (COLLECT-REAL-PRIORS-FROM-DATA),
	this list is replaced by the value of *REAL-PRIORS* for the classification
	processing (FIND-BEST-N).
	(Generated by GENERATE-*REAL-PRIORS*)

   No. of data (*N-DATA*)
	the number of data items, each containing (length *variable-types*) attributes - an integer

	
 2.4 PRE-PROCESS DATA - STEP 4 : REAL-PT CHECKING  - <FILE>.DB -> <FILE>.BASE

   This step identifies and corrects (by transforming them to DISCRETEs)
	REAL-PT variables which are in fact "discrete scalars".
   	:log-file contains a log of the changes, if any, to the data base.

   (process-real-pt-data
    "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history.DB"
    "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history.BASE"
    :n-data 150
    :variable-descriptions t
    :header-file "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history-hd.db"
    :log-file "cha:>u>taylor>ui>autoclass-x>data>med>real-pt-changes.log")
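
   The criterion that PROCESS-REAL-PT-DATA uses to recognize a "discrete
   scalar" is not stated here.  The sketch below shows one plausible
   heuristic only, under the assumption that such a variable takes a small
   number of whole-number values; LOOKS-DISCRETE-P and its MAX-DISTINCT
   threshold of 10 are hypothetical.

```lisp
;; Hypothetical heuristic for illustration; it is not the AutoClass II test.
(defun looks-discrete-p (values &key (max-distinct 10))
  "True if every known number in VALUES is a whole number and there are
at most MAX-DISTINCT different known values.  Non-numbers such as ? are skipped."
  (let ((known (remove-if-not #'realp values)))
    (and known
         (every (lambda (v) (= v (round v))) known)
         (<= (length (remove-duplicates known :test #'=)) max-distinct))))

;; (looks-discrete-p '(1.0 2.0 ? 3.0 2.0))
;; (looks-discrete-p '(23 0 ? 9999 365.5))
```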

 3.0 GENERATE BEST CLASSIFICATIONS

   These forms are to be run in the Lisp Listener, if on a Lisp machine.
   The *current-results* file does not initially need to exist to begin the generation
   of classifications. It is created by AutoClass II, if it does not exist.

   The number of classes to be used by the partition is an upper bound selected by the
   user, depending on the nature of the data. (It appears in MAKE-PART&CLASSES, the
   *CURRENT-RESULTS* file name and FIND-BEST-N-2 [ :N-classes ].) The upper bound selected
   should result in at least one empty class in the completed classification.

 3.1 FOR SMALL DATA BASES: (* N-ATTRIBUTES N-DATA) < 10,000

  (read-*base-data* "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history.base")
  (setf *current-results* "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history-20.wt-set")
  (setf *partition* (make-part&classes 20))
  (find-best-n-2
	5					; N-held number-of-best-classifications-to-save
	*partition*				; classification partition
	'DISPERSE-CYCLE				; search-method
	5					; N-cycles needed to check for termination
	*current-results*			; weight set results written here
	:N-classes 20				; number-of-class-bins
	:file-header "Cancer history data base" ; description for top of *current-results*
	)

   The FIND-BEST-N-2 run should be stopped by typing ~~~ in the Lisp Listener.
   This will allow the current cycle to finish smoothly and the classification results
   to be written to the .wt-set file, if appropriate.   Since the classification
   results are written to this file incrementally, the process can be stopped
   (by ~~~) and re-started (the previous .WT-SET results will be retained)
   at any time.  This form can be run as long or as many times as desired to
   get the best of a large set of randomly seeded classifications.  About 15
   such classifications are sufficient.  The number generated so far can easily
   be determined by "viewing" the beginning of the .WT-SET file (as the form is running).
   The second item in the file is the number of classifications which have
   been generated up to the time that the last best of N-HELD results were
   written to the file.
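
   Instead of viewing the file, the count could be read with the Lisp
   reader.  The sketch below assumes that the items at the start of the
   .WT-SET file are Lisp readable objects, which is not confirmed here;
   NTH-ITEM-FROM-STREAM and ENOUGH-CLASSIFICATIONS-P are hypothetical helpers.

```lisp
;; Hypothetical helpers for illustration; they are not part of AutoClass II.
(defun nth-item-from-stream (stream n)
  "Read Lisp objects from STREAM and return the Nth one, counting from 1."
  (let ((item nil))
    (dotimes (i n item)
      (declare (ignorable i))
      (setf item (read stream)))))

(defun enough-classifications-p (count &optional (target 15))
  "True once COUNT classifications have reached TARGET (about 15, per the text)."
  (>= count target))

;; Possible use while FIND-BEST-N-2 is running:
;; (with-open-file (in *current-results*)
;;   (enough-classifications-p (nth-item-from-stream in 2)))
```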

 3.2  FOR LARGE DATA BASES: (* N-ATTRIBUTES N-DATA) > 10,000

   Since the "disperse-cycle" search method is particularly time intensive for large data
   bases, a two step search strategy is used: multiple "dynamic-cycle" searches
   followed by a "disperse-cycle" search.
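
   The choice between the strategies of sections 3.1 and 3.2 can be
   sketched as below.  LARGE-DATA-BASE-P and SEARCH-PLAN are hypothetical
   helpers; treating a product of exactly 10,000 as large is an assumption,
   since that case is not specified.

```lisp
;; Hypothetical helpers for illustration; they are not part of AutoClass II.
(defun large-data-base-p (n-attributes n-data &optional (threshold 10000))
  "True if (* N-ATTRIBUTES N-DATA) reaches THRESHOLD."
  (>= (* n-attributes n-data) threshold))

(defun search-plan (n-attributes n-data)
  "Return the list of search methods to run, in order."
  (if (large-data-base-p n-attributes n-data)
      '(dynamic-cycle disperse-cycle)   ; two step strategy
      '(disperse-cycle)))               ; single step for small data bases

;; The Coke usage example has 4 attributes and 10 data:
;; (search-plan 4 10)
```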

  (read-*base-data* "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>spectra-5425.base")
  (setf *current-results* "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>spectra-80.wt-set")
  (setf *partition* (make-part&classes 80))
  (find-best-n-2
	5					; N-held number-of-best-classifications-to-save
	*partition*				; classification partition
	'DYNAMIC-CYCLE				; search-method
	5					; N-cycles needed to check for termination
	*current-results*			; weight set results written here
	:N-classes 80				; number-of-class-bins
	:file-header "LRS-5425 data base"  	; description for top of *current-results*
	:split-.wt-set-file t)			; creates separate files for each classification


   The :split-.wt-set-file keyword reduces the i/o time of writing large .wt-set
   files with N-held (5) classifications in them, by creating individual files for
   each classification and naming the files *current-results* suffixed with "-n" (n = 0->4).

   The .wt-set file with the lowest MML (minimum message length), the third item in the
   .wt-set file, is selected for the "disperse-cycle" run.  In this case the "-4" file
   was selected:

  (read-spectral-*base-data* "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>spectra-5425.base")
  (setf *current-results* "cha:>u>taylor>ui>autoclass-x>data>lrs-5425>disperse>SPECTRA-80-4.wt-set")
  (setf *partition* (make-part&classes 80))
  (update-.wt-set-file *partition* *current-results*
		       :update-method 'disperse-cycle :n-cycles 2
		       :cache-file *current-results*
		       :split-.wt-set-file t)
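
   Selecting the split file with the lowest MML, as done by hand above,
   amounts to a minimum search.  A minimal sketch follows; LOWEST-MML-ENTRY
   is a hypothetical helper, and the file names and MML values in the
   example are made up for illustration.

```lisp
;; Hypothetical helper for illustration; it is not part of AutoClass II.
(defun lowest-mml-entry (entries)
  "ENTRIES is a list of (file-name . mml) pairs.
Return the pair with the smallest MML, or NIL if ENTRIES is empty."
  (let ((best (first entries)))
    (dolist (entry (rest entries) best)
      (when (< (cdr entry) (cdr best))
        (setf best entry)))))

;; Example with invented values:
;; (lowest-mml-entry '(("run-0" . 5210.4) ("run-1" . 5198.7) ("run-2" . 5203.0)))
```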

 4.0 GENERATE REPORTS FOR BEST CLASSIFICATION

   FOR NON-GRAPHICAL USER INTERFACE: (AutoClass must be loaded first)

   (load "cha:>u>taylor>ui>autoclass-x>NOGRAPH-UI>load-it.lisp")

   FOR GRAPHICAL USER INTERFACE: (AutoClass must be loaded first)

   (load "cha:>u>taylor>ui>autoclass>load-it.lisp")


 4.1 NON-GRAPHICAL METRICS AND CROSS-REFERENCE REPORTS

   The following is used after all classifications are complete to generate
   reports based on the best classification in the .WT-SET file.  There are
   three reports: metrics, xref-by-class & xref-by-case.  Additional datum
   variables can be added to the xref reports by extending the code in
   DATA-MODS.LISP.

   (generate-reports-for-best-classification
    :base-file "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history.base"	; nil -> already loaded
    :results-file "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history-10.wt-set"
    :N-classes 10				; number-of-class-bins 
    :metrics-report-file "cha:>u>taylor>ui>autoclass-x>data>med>metrics-sample.text"
    :xref-by-class-file "cha:>u>taylor>ui>autoclass-x>data>med>xref-by-class.text"
    :xref-by-case-file "cha:>u>taylor>ui>autoclass-x>data>med>xref-by-case.text")


 4.2 GRAPHICAL METRICS AND CROSS-REFERENCE REPORTS

   If the current (loaded) classification is not the one desired, then it must be generated.
   (*current-results* can be either a non-split (below) or a split .wt-set file.)

   (read-*base-data* "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history.base")
   (setf *current-results* "cha:>u>taylor>ui>autoclass-x>data>med>cancer-history-20.wt-set")
   (setf *partition* (make-part&classes 20))
   (get-best-of-weightings *current-results*)
   (collect-weights *partition*)
   (update-partition-mmls *partition*)
   (setf *class-wt-ordering* (get-class-weight-ordering))
   (setf *class-assignments* (get-max-weight-classes *curr-wts*))
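
   The name GET-MAX-WEIGHT-CLASSES suggests that each datum is assigned to
   the class in which it has the largest weight.  The sketch below shows
   that idea for a single datum only; MAX-WEIGHT-CLASS is a hypothetical
   helper, and the actual representation of *curr-wts* is not described here.

```lisp
;; Hypothetical helper for illustration; it is not part of AutoClass II.
(defun max-weight-class (weights)
  "Return the 0-based index of the largest element of the non-empty list
WEIGHTS.  When several weights tie, the first one wins."
  (let ((best-index 0)
        (best-weight (first weights)))
    (loop for weight in (rest weights)
          for index from 1
          when (> weight best-weight)
            do (setf best-index index
                     best-weight weight))
    best-index))

;; (max-weight-class '(0.1 0.7 0.2))
```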

   The graphical user interface can produce screen hardcopies of displays
   of the class populations, the attribute influence coefficients (listed
   numerically in the metrics report) and several data plots. The user interface
   also allows mouse-clicked initiation of the metrics report.

   The cross-reference reports are invoked by executing the desired functions.
   See the large data base command files for examples. The IRAS-DATA-MODS.LISP
   file demonstrates how >NOGRAPH-UI>DATA-MODS.LISP can be extended.