C5.0: An Informal Tutorial - RuleQuest
Welcome to C5.0, a system that extracts informative patterns from data. The following sections show how to prepare data files for C5.0 and illustrate the options for using the system.
- Preparing Data for C5.0
- Application files
- Names file
- What's in a name?
- Specifying the classes
- Explicitly-defined attributes
- Attributes defined by formulas
- Dates, times, and timestamps
- Selecting the attributes that can appear in classifiers
- Data file
- Test and cases files (optional)
- Costs file (optional)
- Constructing Classifiers
- Decision trees
- Evaluation
- Discrete value subsets
- Rulesets
- Rule utility ordering
- Boosting
- Winnowing attributes
- Soft thresholds
- Advanced pruning options
- Sampling from large datasets
- Cross-validation trials
- Differential misclassification costs
- Weighting individual cases
- Using Classifiers
- Segmentation fault errors
- Linux GUI
- Linking to Other Programs
- Appendix: Summary of Options
Preparing Data for C5.0
We will illustrate C5.0 using a medical application -- mining a database of thyroid assays from the Garvan Institute of Medical Research, Sydney, to construct diagnostic rules for hypothyroidism. Each case concerns a single referral and contains information on the source of the referral, assays requested, patient data, and referring physician's comments. Here are three examples:
| Attribute | Case 1 | Case 2 | Case 3 | ..... |
| --- | --- | --- | --- | --- |
| age | 41 | 23 | 46 | |
| sex | F | F | M | |
| on thyroxine | f | f | f | |
| query on thyroxine | f | f | f | |
| on antithyroid medication | f | f | f | |
| sick | f | f | f | |
| pregnant | f | f | not applicable | |
| thyroid surgery | f | f | f | |
| I131 treatment | f | f | f | |
| query hypothyroid | f | f | f | |
| query hyperthyroid | f | f | f | |
| lithium | f | f | f | |
| tumor | f | f | f | |
| goitre | f | f | f | |
| hypopituitary | f | f | f | |
| psych | f | f | f | |
| TSH | 1.3 | 4.1 | 0.98 | |
| T3 | 2.5 | 2 | unknown | |
| TT4 | 125 | 102 | 109 | |
| T4U | 1.14 | unknown | 0.91 | |
| FTI | 109 | unknown | unknown | |
| referral source | SVHC | other | other | |
| diagnosis | negative | negative | negative | |
| ID | 3733 | 1442 | 2965 | |
This is exactly the sort of task for which C5.0 was designed. Each case belongs to one of a small number of mutually exclusive classes (negative, primary, secondary, compensated). Properties of every case that may be relevant to its class are provided, although some cases may have unknown or non-applicable values for some attributes. There are 24 attributes in this example, but C5.0 can deal with any number of attributes.
C5.0's job is to find how to predict a case's class from the values of the other attributes. C5.0 does this by constructing a classifier that makes this prediction. As we will see, C5.0 can construct classifiers expressed as decision trees or as sets of rules.
Application files
Every C5.0 application has a short name called a filestem; we will use the filestem hypothyroid for this illustration. All files read or written by C5.0 for an application have names of the form filestem.extension, where filestem identifies the application and extension describes the contents of the file.
Here is a summary table of the extensions used by C5.0 (to be described in later sections):
| Extension | Contents | Status |
| --- | --- | --- |
| names | description of the application's attributes | [required] |
| data | cases used to generate a classifier | [required] |
| test | unseen cases used to test a classifier | [optional] |
| cases | cases to be classified subsequently | [optional] |
| costs | differential misclassification costs | [optional] |
| tree | decision tree classifier produced by C5.0 | [output] |
| rules | ruleset classifier produced by C5.0 | [output] |
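The naming scheme in the table can be sketched in a few lines of Python. This is only an illustration of the filestem.extension convention, not part of C5.0; the function names `application_files` and `missing_required` are hypothetical.

```python
from pathlib import Path

# Extensions and their status, copied from the summary table above.
EXTENSIONS = {
    "names": "required",
    "data": "required",
    "test": "optional",
    "cases": "optional",
    "costs": "optional",
    "tree": "output",
    "rules": "output",
}


def application_files(filestem, directory="."):
    """Map each extension to the path filestem.extension in `directory`."""
    return {ext: Path(directory) / f"{filestem}.{ext}" for ext in EXTENSIONS}


def missing_required(filestem, directory="."):
    """List the required extensions whose files are not present on disk."""
    files = application_files(filestem, directory)
    return [
        ext
        for ext, status in EXTENSIONS.items()
        if status == "required" and not files[ext].is_file()
    ]
```

Calling `missing_required("hypothyroid")` before a run gives a quick check that both hypothyroid.names and hypothyroid.data are in place.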
Names file
Two files are essential for all C5.0 applications and there are three further optional files, each identified by its extension. The first essential file is the names file (e.g. hypothyroid.names) that describes the attributes and classes. There are two important subgroups of attributes:
- The value of an explicitly-defined attribute is given directly in the data in one of several forms. A discrete attribute has a value drawn from a set of nominal values, a continuous attribute has a numeric value, a date attribute holds a calendar date, a time attribute holds a clock time, a timestamp attribute holds a date and time, and a label attribute serves only to identify a particular case.
- The value of an implicitly-defined attribute is specified by a formula.
The file hypothyroid.names looks like this:
    diagnosis.                     | the target attribute
    age:                           continuous.
    sex:                           M, F.
    on thyroxine:                  f, t.
    query on thyroxine:            f, t.
    on antithyroid medication:     f, t.
    sick:                          f, t.
    pregnant:                      f, t.
    thyroid surgery:               f, t.
    I131 treatment:                f, t.
    query hypothyroid:             f, t.
    query hyperthyroid:            f, t.
    lithium:                       f, t.
    tumor:                         f, t.
    goitre:                        f, t.
    hypopituitary:                 f, t.
    psych:                         f, t.
    TSH:                           continuous.
    T3:                            continuous.
    TT4:                           continuous.
    T4U:                           continuous.
    FTI:=                          TT4 / T4U.
    referral source:               WEST, STMW, SVHC, SVI, SVHD, other.
    diagnosis:                     primary, compensated, secondary, negative.
    ID:                            label.
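To make the structure of a names file concrete, here is a minimal Python sketch that reads entries like those above. It is not C5.0's own parser: it assumes one entry per line, treats everything after a vertical bar as a comment, handles only the attribute types named in this section, and does not handle escaped characters. The function name `parse_names` is hypothetical.

```python
# Attribute types named in the text above; any other text after a colon is
# treated here as a comma-separated list of discrete values.
SIMPLE_TYPES = {"continuous", "date", "time", "timestamp", "label"}

SAMPLE = """\
diagnosis.          | the target attribute
age:                continuous.
sex:                M, F.
FTI:=               TT4 / T4U.
ID:                 label.
"""


def parse_names(text):
    """Parse a names file with one entry per line (a simplifying assumption)."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("|", 1)[0].strip()   # drop a trailing comment
        if line.endswith("."):
            line = line[:-1].strip()          # drop the terminating period
        if line:
            entries.append(line)
    if not entries:
        raise ValueError("no entries found")

    attributes = []
    for entry in entries[1:]:
        if ":=" in entry:                     # implicitly-defined attribute
            name, formula = entry.split(":=", 1)
            attributes.append((name.strip(), "formula", formula.strip()))
            continue
        name, spec = entry.split(":", 1)      # explicitly-defined attribute
        spec = spec.strip()
        if spec in SIMPLE_TYPES:
            attributes.append((name.strip(), spec, None))
        else:
            values = [v.strip() for v in spec.split(",")]
            attributes.append((name.strip(), "discrete", values))

    # The first entry specifies the classes; it is returned unparsed.
    return {"classes": entries[0], "attributes": attributes}
```

Running `parse_names(SAMPLE)` returns the first entry (`diagnosis`) separately from the attribute list, which mirrors the way the first entry of a names file plays a different role from the rest.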
What's in a name?
Names, labels, classes, and discrete values are represented by arbitrary strings of characters, with some fine print:
- Tabs and spaces are permitted inside a name or value, but C5.0 collapses every sequence of these characters to a single space.
- Special characters (comma, colon, period, vertical bar `|') can appear in names and values, but must be prefixed by the escape character `\'. For example, the name "Filch, Grabbit, and Co." would be written as `Filch\, Grabbit\, and Co\.'. (Colons in times and periods in numbers do not need to be escaped.)
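The two rules above can be illustrated with a short Python helper for writing names and discrete values into C5.0 files. This is a sketch under the rules as stated here; the helper name `escape_name` is hypothetical, and it is meant for names and values only, since colons in times and periods in numbers need no escaping.

```python
import re

# Characters that the text above says must be escaped inside names and values.
SPECIAL = ",:.|"


def escape_name(name):
    """Collapse runs of tabs/spaces and backslash-escape special characters."""
    collapsed = re.sub(r"[ \t]+", " ", name)
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in collapsed)


print(escape_name("Filch, Grabbit, and Co."))
```

The printed result matches the escaped form of the example name given above.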
Specifying the classes
The first entry in the names file specifies the classes in one of three formats:
- A list of class names separated by commas, e.g.
    primary, compensated, secondary, negative.
- The name of a discrete attribute (the target attribute) that contains the class value, e.g.
    diagnosis.
- The name of a continuous target attribute followed by a colon and one or more thresholds in increasing order and separated by commas, e.g.
    age: 12, 19.
  If there are t thresholds X1, X2, ..., Xt then the values of the attribute are divided into t+1 ranges:
  - less than or equal to X1
  - greater than X1 and less than or equal to X2
  - . . .
  - greater than Xt.
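The threshold rule for a continuous target can be checked with a small Python sketch. It only illustrates how t thresholds split values into t+1 ranges as described above; the function names are hypothetical and this is not how C5.0 itself is implemented.

```python
from bisect import bisect_left


def threshold_class(value, thresholds):
    """Return the 0-based index of the range that `value` falls into.

    Range 0 is "<= X1", range i is "> Xi and <= Xi+1", and the last range is
    "> Xt". bisect_left counts the thresholds strictly below `value`, which is
    exactly that index when `thresholds` is in increasing order.
    """
    return bisect_left(thresholds, value)


def range_labels(thresholds):
    """Describe the t+1 ranges defined by t increasing thresholds."""
    labels = [f"<= {thresholds[0]}"]
    for low, high in zip(thresholds, thresholds[1:]):
        labels.append(f"> {low} and <= {high}")
    labels.append(f"> {thresholds[-1]}")
    return labels


# The entry "age: 12, 19." gives two thresholds and therefore three ranges.
print(range_labels([12, 19]))
print(threshold_class(41, [12, 19]))
```

With the entry `age: 12, 19.`, an age of 12 falls in the first range, 13 and 19 fall in the second, and the three example cases (ages 41, 23 and 46) all fall in the third.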
Next, predict asks whether the same case is to be tried again with changed attribute values (a kind of `what if' scenario), a new case is to be classified, or all cases are complete. If a case is retried, each prompt for an attribute value shows the previous value in square brackets. A new value can be entered, followed by the enter key, or the enter key alone can be used to indicate that the value is unchanged.
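The retry behaviour can be pictured with a tiny Python sketch: an empty entry keeps the previous value and anything else replaces it. This models only the rule described above, not the predict program itself; the function name `retry_case` and the dictionary layout are hypothetical.

```python
def retry_case(previous, entered):
    """Build the attribute values for a retried case.

    `previous` maps attribute names to the values used last time; `entered`
    maps attribute names to what was typed at each prompt, where an empty
    string (or a missing key) stands for pressing the enter key alone.
    """
    return {
        attribute: (entered.get(attribute, "").strip() or old_value)
        for attribute, old_value in previous.items()
    }


# A "what if" scenario: change TSH only and keep everything else.
print(retry_case({"age": "41", "TSH": "1.3"}, {"TSH": "6.2"}))
```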
Classifiers can also be used in batch mode. The sample application provided in the public source code reads cases from a cases file and shows the predicted class and the confidence for each.
Segmentation fault errors
For applications with very many cases or attributes, C5.0 may crash with a message like Segmentation fault (core dumped). This usually occurs because a C5.0 thread has exhausted its allocated stack space.
Different releases of Linux have varying default stack sizes; for example, Fedora Core uses a default 10MB while Ubuntu uses 8MB. If you experience a segmentation fault error, you must override the default stack size setting by typing the following before running C5.0:
- if you are using csh: limit stacksize <new limit>
- if you are using sh: ulimit -Ss <new limit>
Please note! You should not set the stack size limit to unlimited -- this will not change the default stack size limit for subsidiary threads. You must use a specific value in KB. A little experimentation may be necessary to find a value that works with your application.
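When C5.0 is launched from a script rather than typed at a shell prompt, the same soft stack limit can be set on the child process. The following Python sketch uses the standard `resource` and `subprocess` modules on Linux; the helper names are hypothetical, the 65536 KB figure is only a placeholder to experiment with, and the executable name `c5.0` in the comment is an assumption.

```python
import resource
import subprocess
import sys


def stack_limiter(limit_kb):
    """Return a function that sets the soft stack limit to `limit_kb` KB."""
    limit_bytes = limit_kb * 1024

    def apply():
        # Keep the existing hard limit; change only the soft limit, as
        # "ulimit -Ss" does. A specific value is used, never "unlimited".
        _, hard = resource.getrlimit(resource.RLIMIT_STACK)
        resource.setrlimit(resource.RLIMIT_STACK, (limit_bytes, hard))

    return apply


def run_with_stack_limit(argv, limit_kb):
    """Run `argv` as a child process with the given soft stack limit."""
    return subprocess.run(
        argv,
        preexec_fn=stack_limiter(limit_kb),  # runs in the child before exec
        capture_output=True,
        text=True,
    )


# Hypothetical usage (executable name and limit are assumptions):
# run_with_stack_limit(["c5.0", "-f", "hypothyroid"], 65536)
```

The limit applies only to the child process, so the calling script's own stack size is left unchanged.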
Linux GUI
Linux users who have installed a recent version of Wine can invoke a slightly simplified version of the See5 user interface. The executable program gui starts the graphical user interface whose main window is similar to See5's, with five buttons: Locate Data invokes a browser to find the files for your application, or to change the current application; Construct Classifier selects the type of classifier to be constructed and sets other options; Stop interrupts the classifier-generating process; Review Output re-displays the output from the last classifier construction (if any), saved automatically in a file filestem.out; and Cross-Reference shows how cases in training or test data relate to (parts of) a classifier and vice versa. For more details on these, please see the See5 tutorial.
The graphical interface calls C5.0 directly, so use of the GUI has minimal impact on performance when generating a classifier.
Please note: C5.0 should be run for the first time from the command-line interface, not the GUI. The first run installs the licence in C5.0 -- after that, C5.0 can be used from either interface.
Linking to Other Programs
The classifiers generated by C5.0 are retained in files filestem.tree (for decision trees) and filestem.rules (for rulesets). Free C source code is available to read these classifier files and to make predictions with them, enabling you to use C5.0 classifiers in other programs.
As an example, the source includes a program sample.c to input new cases and to show how each is classified by boosted or single trees or rulesets. The program reads the application's names file, the tree or rules file generated by C5.0, and an optional costs file. It then reads cases from a cases file in a format similar to a data file, except that a case's class can be given as `?' meaning "unknown". For each case, the program outputs the given class, the class predicted by the classifier, and the confidence with which this prediction is made.
Please see the file sample.c for compilation instructions and program options.
The public source code can be downloaded as a gzipped tar file.
Appendix: Summary of Options
| Option | Effect |
| --- | --- |
| -f filestem | select the application |
| -s | partition discrete values into subsets |
| -r | generate rule-based classifiers |
| -u bands | sort rules by their utility into bands |
| -b | use boosting with 10 trials |
| -t trials | use boosting with the specified number of trials |
| -w | winnow the attributes before constructing a classifier |
| -p | show soft thresholds |
| -g | do not use global tree pruning |
| -c CF | set the CF value for pruning trees |
| -m cases | set the minimum cases for at least two branches of a split |
| -S x | use a sample of x% for training and a disjoint sample for testing |
| -I seed | set the sampling seed value |
| -X folds | carry out a cross-validation |
| -e | ignore any costs file |
| -h | print a short summary of the options |
© RULEQUEST RESEARCH 2019. Last updated April 2019.