C5.0: An Informal Tutorial - RuleQuest

Welcome to C5.0, a system that extracts informative patterns from data. The following sections show how to prepare data files for C5.0 and illustrate the options for using the system.

In this tutorial, file names and C5.0 input appear in blue fixed-width font while file extensions and other general forms are shown highlighted in green.

Preparing Data for C5.0
- Application files
- Names file
  - What's in a name?
  - Specifying the classes
  - Explicitly-defined attributes
  - Attributes defined by formulas
  - Dates, times, and timestamps
  - Selecting the attributes that can appear in classifiers
- Data file
- Test and cases files (optional)
- Costs file (optional)
Constructing Classifiers
- Decision trees
- Evaluation
- Discrete value subsets
- Rulesets
  - Rule utility ordering
- Boosting
- Winnowing attributes
- Soft thresholds
- Advanced pruning options
- Sampling from large datasets
- Cross-validation trials
- Differential misclassification costs
- Weighting individual cases
Using Classifiers
Segmentation fault errors
Linux GUI
Linking to Other Programs
Appendix: Summary of Options

Preparing Data for C5.0

We will illustrate C5.0 using a medical application -- mining a database of thyroid assays from the Garvan Institute of Medical Research, Sydney, to construct diagnostic rules for hypothyroidism. Each case concerns a single referral and contains information on the source of the referral, assays requested, patient data, and referring physician's comments. Here are three examples:

Attribute	Case 1	Case 2	Case 3	.....
age	41	23	46
sex	F	F	M
on thyroxine	f	f	f
query on thyroxine	f	f	f
on antithyroid medication	f	f	f
sick	f	f	f
pregnant	f	f	not applicable
thyroid surgery	f	f	f
I131 treatment	f	f	f
query hypothyroid	f	f	f
query hyperthyroid	f	f	f
lithium	f	f	f
tumor	f	f	f
goitre	f	f	f
hypopituitary	f	f	f
psych	f	f	f
TSH	1.3	4.1	0.98
T3	2.5	2	unknown
TT4	125	102	109
T4U	1.14	unknown	0.91
FTI	109	unknown	unknown
referral source	SVHC	other	other
diagnosis	negative	negative	negative
ID	3733	1442	2965

This is exactly the sort of task for which C5.0 was designed. Each case belongs to one of a small number of mutually exclusive classes (negative, primary, secondary, compensated). Properties of every case that may be relevant to its class are provided, although some cases may have unknown or non-applicable values for some attributes. There are 24 attributes in this example, but C5.0 can deal with any number of attributes.

C5.0's job is to find how to predict a case's class from the values of the other attributes. C5.0 does this by constructing a classifier that makes this prediction. As we will see, C5.0 can construct classifiers expressed as decision trees or as sets of rules.

Application files

Every C5.0 application has a short name called a filestem; we will use the filestem hypothyroid for this illustration. All files read or written by C5.0 for an application have names of the form filestem.extension, where filestem identifies the application and extension describes the contents of the file.

Here is a summary table of the extensions used by C5.0 (to be described in later sections):

names	description of the application's attributes	[required]
data	cases used to generate a classifier	[required]
test	unseen cases used to test a classifier	[optional]
cases	cases to be classified subsequently	[optional]
costs	differential misclassification costs	[optional]
tree	decision tree classifier produced by C5.0	[output]
rules	ruleset classifier produced by C5.0	[output]
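
As a rough illustration of the filestem.extension convention, the following Python sketch lists the file names an application would use and reports which required files are absent. The helper names application_files and missing_required are made up for this example; they are not part of C5.0.

```python
from pathlib import Path

# Extensions taken from the summary table above.
REQUIRED = ("names", "data")
OPTIONAL = ("test", "cases", "costs")
OUTPUT = ("tree", "rules")


def application_files(filestem, directory="."):
    """Map each extension to the path directory/filestem.extension."""
    base = Path(directory)
    return {ext: base / f"{filestem}.{ext}" for ext in REQUIRED + OPTIONAL + OUTPUT}


def missing_required(filestem, directory="."):
    """Return the required extensions whose files do not exist yet."""
    files = application_files(filestem, directory)
    return [ext for ext in REQUIRED if not files[ext].is_file()]


if __name__ == "__main__":
    for ext, path in application_files("hypothyroid").items():
        print(f"{ext:6s} {path}")
    print("missing required files:", missing_required("hypothyroid"))
```

A check like this can be run before invoking C5.0 to confirm that the names and data files are both in place.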

Names file

Two files are essential for all C5.0 applications and there are three further optional files, each identified by its extension. The first essential file is the names file (e.g. hypothyroid.names) that describes the attributes and classes. There are two important subgroups of attributes:

- The value of an explicitly-defined attribute is given directly in the data in one of several forms. A discrete attribute has a value drawn from a set of nominal values, a continuous attribute has a numeric value, a date attribute holds a calendar date, a time attribute holds a clock time, a timestamp attribute holds a date and time, and a label attribute serves only to identify a particular case.
- The value of an implicitly-defined attribute is specified by a formula.

The file hypothyroid.names looks like this:

    diagnosis.                     | the target attribute

    age:                           continuous.
    sex:                           M, F.
    on thyroxine:                  f, t.
    query on thyroxine:            f, t.
    on antithyroid medication:     f, t.
    sick:                          f, t.
    pregnant:                      f, t.
    thyroid surgery:               f, t.
    I131 treatment:                f, t.
    query hypothyroid:             f, t.
    query hyperthyroid:            f, t.
    lithium:                       f, t.
    tumor:                         f, t.
    goitre:                        f, t.
    hypopituitary:                 f, t.
    psych:                         f, t.
    TSH:                           continuous.
    T3:                            continuous.
    TT4:                           continuous.
    T4U:                           continuous.
    FTI:=                          TT4 / T4U.
    referral source:               WEST, STMW, SVHC, SVI, SVHD, other.

    diagnosis:                     primary, compensated, secondary, negative.

    ID:                            label.

What's in a name?

Names, labels, classes, and discrete values are represented by arbitrary strings of characters, with some fine print:

- Tabs and spaces are permitted inside a name or value, but C5.0 collapses every sequence of these characters to a single space.
- Special characters (comma, colon, period, vertical bar `|') can appear in names and values, but must be prefixed by the escape character `\'. For example, the name "Filch, Grabbit, and Co." would be written as `Filch\, Grabbit\, and Co\.'. (Colons in times and periods in numbers do not need to be escaped.)

Whitespace (blank lines, spaces, and tab characters) is ignored except inside a name or value and can be used to improve legibility. Unless it is escaped as above, the vertical bar `|' causes the remainder of the line to be ignored and is handy for including comments. This use of `|' should not occur inside a value.
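
The fine print above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not C5.0's own parser: escape_name and strip_comment are hypothetical helpers, escape_name is meant for names and discrete values (it would also escape the period in a number, which the text says is unnecessary), and how C5.0 treats a literal backslash is not covered here.

```python
import re

SPECIAL = ",:.|"  # comma, colon, period, vertical bar


def escape_name(text):
    """Collapse runs of tabs/spaces to one space and escape special characters."""
    collapsed = re.sub(r"[ \t]+", " ", text.strip())
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in collapsed)


def strip_comment(line):
    """Drop everything from the first unescaped '|' to the end of the line."""
    kept = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            kept.append(line[i:i + 2])  # keep the escape pair intact
            i += 2
        elif ch == "|":
            break  # unescaped bar: the rest of the line is a comment
        else:
            kept.append(ch)
            i += 1
    return "".join(kept).rstrip()


if __name__ == "__main__":
    print(escape_name("Filch, Grabbit, and Co."))
    print(strip_comment("diagnosis.      | the target attribute"))
```

Applied to the example in the text, escape_name reproduces the escaped form `Filch\, Grabbit\, and Co\.'.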

Specifying the classes

The first entry in the names file specifies the classes in one of three formats:

- A list of class names separated by commas, e.g.
    primary, compensated, secondary, negative.
- The name of a discrete attribute (the target attribute) that contains the class value, e.g.
    diagnosis.
- The name of a continuous target attribute followed by a colon and one or more thresholds in increasing order and separated by commas. If there are t thresholds X1, X2, ..., Xt then the values of the attribute are divided into t+1 ranges:
  - less than or equal to X1
  - greater than X1 and less than or equal to X2
  - . . .
  - greater than Xt.
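
The division into ranges can be sketched as follows. This is a minimal Python illustration of the rule just described, not code from C5.0; the function names and the thresholds 12 and 19 used in the demonstration are chosen for the example.

```python
from bisect import bisect_left


def range_index(value, thresholds):
    """Return the range (0..t) that value falls in, for t increasing thresholds.

    Range 0 is value <= X1, range i is Xi < value <= X(i+1),
    and range t is value > Xt.
    """
    thresholds = list(thresholds)
    if not thresholds or thresholds != sorted(set(thresholds)):
        raise ValueError("need one or more strictly increasing thresholds")
    # bisect_left counts the thresholds strictly below value, so a value
    # equal to a threshold stays in the lower range.
    return bisect_left(thresholds, value)


def range_label(name, index, thresholds):
    """Describe range number index in words for attribute name."""
    thresholds = list(thresholds)
    if index == 0:
        return f"{name} <= {thresholds[0]}"
    if index == len(thresholds):
        return f"{name} > {thresholds[-1]}"
    return f"{thresholds[index - 1]} < {name} <= {thresholds[index]}"


if __name__ == "__main__":
    cuts = [12, 19]
    for value in (5, 12, 13, 19, 20):
        i = range_index(value, cuts)
        print(value, "->", range_label("age", i, cuts))
```

With t thresholds the function returns one of t+1 indices, one per range.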

Each range defines a class, so there are t+1 classes. For example, a hypothetical entry

    age: 12, 19.

would define three classes: age <= 12, 12 < age <= 19, and age > 19.

. . .

    . . . 6
    ->  class compensated  [0.570]

    Rule 3: (63/6, lift 39.3)
        TSH > 6
        FTI . . . class primary  [0.892]

    Rule 4: (296, lift 1.1)
        on thyroxine = t
        FTI > 65.3
        ->  class negative  [0.997]

    Rule 5: (240, lift 1.1)
        TT4 > 153
        ->  class negative  [0.996]

    Rule 6: (29, lift 1.1)
        thyroid surgery = t
        FTI > 65.3
        ->  class negative  [0.968]

    Rule 7: (31, lift 42.7)
        thyroid surgery = f
        TSH > 6
        TT4 . . . class primary  [0.970]

The rules are divided into four bands of roughly equal sizes and a further summary is generated for both training and test cases. Here is the output for test cases:

    Evaluation on test data (1000 cases):

                Rules
          ----------------
            No      Errors

             7    5( 0.5%)

    . . .

    . . . = 159 (153): negative (6/0.1)
    TT4 . . . 153: negative (6.2/0.2)
    TT4 . . . 61: compensated (179/29.3)
    TT4 . . .

    . . . compensated [0.85]  negative [0.13]  primary [0.01]
    Retry, new case or quit [r,n,q]: r¤
    TSH [7.4]: ¤
    TT4 [108]: ¤
    T4U [1.08]: ¤
    on thyroxine [f]: t¤
        -> negative [1.00]
    Retry, new case or quit [r,n,q]: q¤

The values of some attributes might not affect the classification, so predict prompts only for the values of those attributes that are required. The reply `?' indicates that a requested attribute value is unknown. (Similarly, use `N/A' for non-applicable values.) When all the relevant information has been entered, the most likely class (or classes) are printed, each with a confidence value.

Next, predict asks whether the same case is to be tried again with changed attribute values (a kind of `what if' scenario), a new case is to be classified, or all cases are complete. If a case is retried, each prompt for an attribute value shows the previous value in square brackets. A new value can be entered, followed by the enter key, or the enter key alone can be used to indicate that the value is unchanged.

Classifiers can also be used in batch mode. The sample application provided in the public source code reads cases from a cases file and shows the predicted class and the confidence for each.

Segmentation fault errors

For applications with very many cases or attributes, C5.0 may crash with a message like Segmentation fault (core dumped). This usually occurs because a C5.0 thread has exhausted its allocated stack space.

Different releases of Linux have varying default stack sizes; for example, Fedora Core uses a default 10MB while Ubuntu uses 8MB. If you experience a segmentation fault error, you must override the default stack size setting by typing the following before running C5.0:

if you are using csh: limit stacksize new limit
if you are using sh: ulimit -Ss new limit

where new limit is the new stack size limit in kilobytes. For example, a csh user might type limit stacksize 20000 to increase the default stack size limit to 20MB.

Please note! You should not set the stack size limit to unlimited -- this will not change the default stack size limit for subsidiary threads. You must use a specific value in KB. A little experimentation may be necessary to find a value that works with your application.
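
When C5.0 is launched from a script rather than an interactive shell, the same effect can be approximated by setting the soft stack limit in the child process. The Python sketch below is one possible way to do this on Linux, under the assumption that the executable is on the path as c5.0 (shown only in the comment at the end); run_with_stack and stack_limit_bytes are hypothetical helper names. Note that the shell commands take kilobytes while setrlimit takes bytes.

```python
import resource
import subprocess


def stack_limit_bytes(kilobytes):
    """Convert a limit in KB (the unit used by the shell commands) to bytes."""
    if kilobytes <= 0:
        raise ValueError("use a specific positive value in KB, not 'unlimited'")
    return kilobytes * 1024


def run_with_stack(command, kilobytes):
    """Run command in a child process whose soft stack limit is kilobytes KB."""
    soft = stack_limit_bytes(kilobytes)

    def set_limit():
        # Runs in the child just before the program starts.
        # The existing hard limit is kept; a soft value above it is rejected.
        _, hard = resource.getrlimit(resource.RLIMIT_STACK)
        resource.setrlimit(resource.RLIMIT_STACK, (soft, hard))

    return subprocess.run(command, preexec_fn=set_limit, check=False)


if __name__ == "__main__":
    print(stack_limit_bytes(20000), "bytes requested for a 20000 KB limit")
    # Assumed invocation; the executable name is not confirmed here:
    # run_with_stack(["c5.0", "-f", "hypothyroid"], 20000)
```

As with the shell commands, some experimentation with the value may be needed.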

Linux GUI

Linux users who have installed a recent version of Wine can invoke a slightly simplified version of the See5 user interface. The executable program gui starts the graphical user interface whose main window is similar to See5's, with five buttons:

- Locate Data invokes a browser to find the files for your application, or to change the current application
- Construct Classifier selects the type of classifier to be constructed and sets other options
- Stop interrupts the classifier-generating process
- Review Output re-displays the output from the last classifier construction (if any), saved automatically in a file filestem.out
- Cross-Reference shows how cases in training or test data relate to (parts of) a classifier and vice versa

For more details on these, please see the See5 tutorial.

The graphical interface calls C5.0 directly, so use of the GUI has minimal impact on performance when generating a classifier.

Please note: C5.0 should be run for the first time from the command-line interface, not the GUI. The first run installs the licence in C5.0 -- after that, C5.0 can be used from either interface.

Linking to Other Programs

The classifiers generated by C5.0 are retained in files filestem.tree (for decision trees) and filestem.rules (for rulesets). Free C source code is available to read these classifier files and to make predictions with them, enabling you to use C5.0 classifiers in other programs.

As an example, the source includes a program sample.c to input new cases and to show how each is classified by boosted or single trees or rulesets. The program reads the application's names file, the tree or rules file generated by C5.0, and an optional costs file. It then reads cases from a cases file in a format similar to a data file, except that a case's class can be given as `?' meaning "unknown". For each case, the program outputs the given class, the class predicted by the classifier, and the confidence with which this prediction is made.

Please see the file sample.c for compilation instructions and program options.

The public source code is available for download as a gzipped tar file.

Appendix: Summary of Options

-f filestem	select the application
-s	partition discrete values into subsets
-r	generate rule-based classifiers
-u bands	sort rules by their utility into bands
-b	use boosting with 10 trials
-t trials	use boosting with the specified number of trials
-w	winnow the attributes before constructing a classifier
-p	show soft thresholds
-g	do not use global tree pruning
-c CF	set the CF value for pruning trees
-m cases	set the minimum cases for at least two branches of a split
-S x	use a sample of x% for training and a disjoint sample for testing
-I seed	set the sampling seed value
-X folds	carry out a cross-validation
-e	ignore any costs file
-h	print a short summary of the options

Last updated April 2019
