Tutorial on Classification
Igor Baskin and Alexandre Varnek

Introduction
The tutorial demonstrates possibilities offered by the Weka software to build classification models for SAR (Structure-Activity Relationships) analysis. Two types of classification tasks will be considered – two-class and multi-class classification. In all cases protein-ligand binding data will analyzed, ligands exhibiting strong binding affinity towards a certain protein being considered as “active” with respect to it. If it is not known about the binding affinity of a ligand towards the protein, such ligand is conventionally considered as “nonactive” one. In this case, the goal of classification models is to be able to predict whether a new ligand will exhibit strong binding activity toward certain protein biotargets. In the latter case one can expect that such ligands might possess the corresponding type of biological activity and therefore could be used as ‘’hits” for drug design. All ligands in this tutorial are described by means of an extended set of MACCS fingerprints, each of them comprising 1024 bits, the “on” value of each of them indicating the presence of a certain structural feature in ligand, otherwise its value being “off”.

Part 1. Two-Class Classification Models.
1. Data and descriptors.
The dataset for this tutorial contains 49 ligands of the Angiotensin-Converting Enzyme (ACE) and 1797 decoy compounds chosen from the DUD database. The set of "extended" MACCS fingerprints is used as descriptors.

2. Files
The following file is supplied for the tutorial:
• ace.arff – descriptor and activity values
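For orientation, ace.arff is a plain-text file in Weka's ARFF format. The excerpt below is a hypothetical illustration of its layout (the exact declarations in the supplied file may differ): a header listing the 1024 fingerprint attributes and the activity attribute, followed by one comma-separated row per compound.

@relation ace

@attribute fp_1 {off,on}
@attribute fp_2 {off,on}
...
@attribute fp_1024 {off,on}
@attribute activity {active,nonactive}

@data
off,on,off,...,nonactive
on,off,on,...,active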

3. Exercise 1: Building the Trivial model ZeroR
In this exercise, we build the trivial model ZeroR, in which all compounds are classified as “nonactive”. The goal is to demonstrate that accuracy is not an appropriate measure of classification performance for unbalanced datasets, in which the number of “nonactive” compounds is much larger than the number of “active” ones.

Step by step instructions
Important note for Windows users: During the installation, the ARFF files should be associated with Weka.

• In the starting interface of Weka, click on the button Explorer.
• In the Preprocess tab, click on the button Open File. In the file selection interface, select the file ace.arff.

The dataset is characterized in the Current relation frame: the name, the number of instances (compounds), and the number of attributes (descriptors + activity/property). We see in this frame that the number of compounds is 1846, whereas the number of descriptors is 1024, which is the number of attributes (1025) minus the activity field. The Attributes frame allows the user to modify the set of attributes using the select and remove options. Information about the selected attribute is given in the Selected attribute frame, in which a histogram depicts the attribute distribution. One can see that the value of the currently selected descriptor fp_1 (the first bit in the corresponding fingerprint) is “on” in 201 compounds and “off” in 1645 compounds in the dataset.

• Select the last attribute “activity” in the Attributes frame.


One can read from the Selected attribute frame that there are 1797 nonactive and 49 active compounds in the dataset. Nonactive compounds are depicted in blue, whereas active compounds are depicted in red in the histogram.
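The same inspection can be reproduced with the Weka Java API. The sketch below is a minimal example assuming ace.arff is in the working directory and that, as in this tutorial, the class attribute is the last one; it loads the data and prints the number of compounds, the number of descriptors and the class distribution.

import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectAce {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file and declare the last attribute ("activity") as the class
        Instances data = new DataSource("ace.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println("Compounds:   " + data.numInstances());        // expected: 1846
        System.out.println("Descriptors: " + (data.numAttributes() - 1)); // expected: 1024

        // Number of "active" and "nonactive" compounds
        AttributeStats stats = data.attributeStats(data.classIndex());
        for (int i = 0; i < data.classAttribute().numValues(); i++) {
            System.out.println(data.classAttribute().value(i) + ": " + stats.nominalCounts[i]);
        }
    }
}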

• Click on the tab Classify.
The ZeroR method is already selected by default. For assessing the predictive performance of all models to be built, the 10-fold cross-validation method has also been specified by default.

• Click on the Start button to build a model.

The predictive performance of the model is characterized in the right-hand Classifier output frame. The Confusion Matrix for the model is presented at the bottom of the Classifier output window. It can be seen from it that all compounds have been classified as “nonactive”. It is clear that such a trivial model is useless and cannot be used for discovering “active” compounds. However, note that the accuracy of this trivial model (Correctly Classified Instances) is very high: 97.3456 %. This fact clearly indicates that accuracy cannot be used for assessing the usefulness of classification models built using unbalanced datasets. For this purpose a good choice is the “Kappa statistic”, which is zero in this case. The “Kappa statistic” is an analog of a correlation coefficient: its value is zero in the absence of any relation and approaches one for a very strong statistical relation between the class label and the attributes of instances, i.e. between the class of biological activity of chemical compounds and the values of their descriptors. Another useful statistical characteristic is the “ROC Area”, for which a value near 0.5 means the lack of any statistical dependence.
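The zero Kappa value can be verified by hand: with kappa = (Po - Pe) / (1 - Pe), the observed agreement of ZeroR is Po = 1797/1846 ≈ 0.9735, the agreement expected by chance for a model that always predicts “nonactive” is Pe = 1 × (1797/1846) + 0 × (49/1846) ≈ 0.9735, and therefore kappa = 0 despite the 97.3456 % accuracy. The same evaluation can also be scripted with the Weka Java API; the sketch below is a minimal example assuming ace.arff is in the working directory and using a fixed random seed for the 10-fold cross-validation.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ZeroRBaseline {
    public static void main(String[] args) throws Exception {
        // Load the dataset and declare the last attribute ("activity") as the class
        Instances data = new DataSource("ace.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation of the trivial ZeroR model
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new ZeroR(), data, 10, new Random(1));

        System.out.printf("Accuracy: %.4f %%%n", eval.pctCorrect()); // ~97.35 %
        System.out.printf("Kappa:    %.4f%n", eval.kappa());         // 0 for ZeroR
        System.out.println(eval.toMatrixString("Confusion Matrix"));
    }
}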

4. Exercise 2: Building the Naïve Bayesian Model
In this exercise, we build a Naïve Bayesian model for predicting the ability of chemical compounds to bind to the Angiotensin-Converting Enzyme (ACE). The goal is to demonstrate the ability of Weka to build statistically significant classification models for predicting the biological activity of chemical compounds, as well as to show different ways of assessing the statistical significance and usefulness of classification models.

• In the classifier frame, click Choose, then select the NaiveBayes method from the bayes submenu.
• Click on the Start button to build a model.

Although the accuracy of the model became lower (93.8787 % instead of 97.3456 %), its real statistical significance became much stronger. This follows from the value of the “Kappa statistic”, 0.42, which indicates the existence of a moderate statistical dependence. It can be analyzed using the “Confusion Matrix” at the bottom of the Classifier output window. So, there are 45 true positive, 1688 true negative, 109 false positive, and 4 false negative compounds. It is because of the considerable number of false positives that the precision for “active” compounds, 0.292 (45/(45+109)), is rather low, although the recall, 45/(45+4) ≈ 0.92, remains high. Nonetheless, the model exhibits an excellent value of “ROC Area” for “active” compounds, 0.98. This indicates that this Naïve Bayesian model could very advantageously be used for discovering biologically active compounds through virtual screening. This can clearly be shown by analyzing ROC and Cost/Benefit plots. The Naïve Bayes method provides probabilistic outputs. This means that Naïve Bayes models can assess the probability (varying from 0 to 1) that a given compound will be predicted as “active”. By moving the threshold from 0 to 1 and imposing that a compound is predicted as “active” if the corresponding probability exceeds the current threshold, one can build the ROC (Receiver Operating Characteristic) curve.

• Visualize the ROC curve by clicking the right mouse button on the model type bayes.NaiveBayes in the Result list frame and selecting the menu item Visualize threshold curve / active.

The ROC curve is shown in the Plot frame of the window. Its X axis corresponds to the false positive rate, whereas its Y axis corresponds to the true positive rate. The color depicts the value of the threshold: the “colder” (closer to blue) the color, the lower the threshold value. All compounds whose probability of being “active” exceeds the current threshold are predicted as “active”. If such a prediction made for a given compound is correct, then the compound is a true positive, otherwise it is a false positive. If for some values of the threshold the true positive rate greatly exceeds the false positive rate (which is indicated by the angle A being close to 90 degrees), then the classification model with such a threshold can be used to selectively extract “active” compounds from their mixture with a large number of “nonactive” ones in the course of virtual screening.
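The data behind the ROC curve can also be generated programmatically. The sketch below is a minimal example using the Weka Java API: it cross-validates a Naïve Bayes model as above and uses the ThresholdCurve class to compute, for each threshold, the true and false positive rates for the “active” class together with the corresponding sample size.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.evaluation.ThresholdCurve;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RocForActive {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("ace.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        // Index of the "active" class label
        int activeIndex = data.classAttribute().indexOfValue("active");

        // One row per threshold: counts, rates and sample size for the "active" class
        ThresholdCurve tc = new ThresholdCurve();
        Instances curve = tc.getCurve(eval.predictions(), activeIndex);

        System.out.println("ROC Area (active): " + ThresholdCurve.getROCArea(curve));
        System.out.println(curve); // the raw threshold curve data
    }
}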

In order to find the optimal value of the threshold (or the optimal part of compounds to be selected in the course of virtual screening), one can perform the cost/benefit analysis.

• Close the window with the ROC curve.
• Open the window for the cost/benefit analysis by clicking the right mouse button on the model type bayes.NaiveBayes in the Result list frame and selecting the menu item Cost/Benefit analysis / active.
• Click on the Minimize Cost/Benefit button at the right bottom corner of the window.


Consider attentively the window for the Cost/Benefit Analysis. It consists of several panels. The left part of the window contains the Plot: ThresholdCurve frame with the Threshold Curve (also called the Lift curve). The Threshold curve looks very similar to the ROC curve. In both of them the Y axis corresponds to the true positive rate. However, in contrast to the ROC curve, the X axis in the Threshold curve corresponds to the fraction of selected instances (the “Sample Size”). In other words, the Threshold curve depicts the dependence of the fraction of “active” compounds retrieved in the course of virtual screening upon the fraction of compounds selected from the whole dataset used for screening. Recall that only those compounds are selected in the course of virtual screening for which the estimated probability of being “active” exceeds the chosen threshold. The value of the threshold can be modified interactively by moving the slider in the Threshold frame of the Cost/Benefit Analysis window. The confusion matrix for the current value of the threshold is shown in the Confusion Matrix frame at the left bottom corner of the window.

Note that the confusion matrix for the current value of the threshold differs sharply from the previously obtained one. In particular, the classification accuracy 97.8873 % is considerably higher than the previous value 93.8787 %, the number of false positives has greatly decreased from 109 to 31, whereas the number of false negatives has increased from 4 to 8. Why does this happen? In order to answer this question and explain the corresponding phenomenon, let us take a look at the right side of the window. Its right bottom corner contains the Cost Matrix frame.

The left part of the frame contains the Cost matrix itself. Its four entries indicate the cost one should pay for decisions taken on the basis of the classification model. The cost values are expressed in the table in abstract units; in real case studies they can be expressed on a monetary scale, for example in euros. The left bottom cell of the Cost matrix defines the cost of false positives. Its default value is 1 unit. In the case of virtual screening this corresponds to the mean price one should pay in order to synthesize (or purchase) and test a compound wrongly predicted by the model as “active”. The right top cell of the Cost matrix defines the cost of false negatives. Its default value is 1 unit. In the case of virtual screening this corresponds to the mean price one should pay for “throwing away” a very useful compound and losing profit because of the wrong prediction made by the classification model. It is also assumed by default that one does not pay a price for correct decisions taken using the classification model. It is clear that all these settings can be changed in order to match the real situation taking place in the process of drug design.

The overall cost corresponding to the current value of the threshold is indicated at the right side of the frame. Its current value is 39 (the cost of 31 false positives and 8 false negatives). In order to find the threshold corresponding to the minimum cost, it is sufficient to press the button Minimize Cost/Benefit. This explains the afore-mentioned difference in confusion matrices: the initial confusion matrix corresponds to the threshold 0.5, whereas the second confusion matrix results from the value of the threshold found by minimizing the cost function. The current value of the cost is compared by the program with the cost of selecting the same number of instances at random. Its value, 117.18, is indicated at the right side of the frame. The difference between the cost of random selection and the current value of the cost is called the Gain. Its current value, 78.18, is also indicated at the right side of the frame. In the context of virtual screening, the Gain can be interpreted as the profit obtained by using the classification model instead of random selection of the same number of chemical compounds. Unfortunately, the current version of the Weka software does not provide a means of automatic maximization of the Gain function. However, this can easily be done interactively by moving the slider in the Threshold frame of the Cost/Benefit Analysis window. The current model corresponds to the minimum value of the Cost function. Read the values for the current threshold from the right side of the Threshold frame.
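With the default unit costs, the reported figures can be reproduced by hand from the confusion matrix shown for the optimized threshold:

Cost = 1 × FP + 1 × FN = 1 × 31 + 1 × 8 = 39
Gain = Cost(random selection of the same number of compounds) − Cost(model) = 117.18 − 39 = 78.18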

So, the current model (with the threshold obtained by minimizing the cost) specifies that it is optimal to select 3.9003 % of the compounds in the course of virtual screening, and this ensures the retrieval of 83.6735 % of the active compounds.

• Close the window with the Cost/Benefit Analysis.

Exercise: Find the model corresponding to the maximum Gain.

5. Exercise 3: Building the Nearest Neighbors Models (k-NN)
In this exercise, we build k-NN models for predicting the ability of chemical compounds to bind to the Angiotensin-Converting Enzyme (ACE). The goal is to learn how to use instance-based (lazy) methods.

• In the classifier frame, click Choose, then select the IBk method from the lazy submenu.

The lazy submenu contains a group of methods in which the training phase is almost omitted – it actually amounts to memorizing all instances from the training set. Instead, all the main calculations are delayed until the test phase. That is why such methods are called lazy, instance-based or memory-based. The price for this “laziness” is, however, rather high – computations at the test phase are very intensive, which is why such methods work very slowly during prediction, especially for big training sets. So, the abbreviation IBk means that this is an Instance-Based method with k neighbours. The default value of k is 1. So, let us build a 1-NN model.

• Click on the Start button to build a 1-NN model.

One can see that the 1 Nearest Neighbour model is statistically much stronger than the previous Naïve Bayes one. In particular, the number of Incorrectly Classified Instances has decreased from 113 to 13, whereas the Kappa statistic has increased from 0.42 to 0.8702. Nonetheless, the ROC Area became slightly smaller in comparison with the Naïve Bayes model. Now perform the Cost/Benefit Analysis for the 1-NN model.

• Click the right mouse button on the model type lazy.IBk in the Result list frame and select the menu item Cost/Benefit analysis / active.
• Click on the Minimize Cost/Benefit button at the right bottom corner of the window.

One can see that the Cost became considerably lower (13 vs 39), and the Gain became higher (87.13 vs 78.18). It can also be checked that the initial 1-NN model corresponds to the lowest Cost and the highest Gain. It can also be seen that, when using the 1-NN classifier in virtual screening, it is sufficient to select 2.9252 % of compounds in order to retrieve 91.8367 % of the “active” ones. Can this result be further improved? Yes, this can be achieved by using the weighted modification of the k-NN method.

• Close the window with the Cost/Benefit Analysis.
• Click with the left mouse button on the word IBk in the Classifier frame. The window for setting options for the k-NN method pops up.
• Change the option distanceWeighting to Weight by 1-distance.

• Click on the OK button.
• Click on the Start button.
One can see that the ROC Area has increased from 0.95 to 0.977, although the accuracy of prediction and the Kappa statistic have not changed.
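For reference, the same weighted 1-NN configuration can be set up through the Weka Java API. The sketch below is a minimal example; the -K and -F options are assumed to select the number of neighbours and 1-distance weighting respectively, as documented for IBk in Weka 3.6 (check the options of your version if it differs).

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class WeightedKnn {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("ace.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        IBk knn = new IBk();
        // -K 1 : one nearest neighbour; -F : weight neighbours by 1 - distance
        // (assumed option names, taken from the IBk documentation for Weka 3.6)
        knn.setOptions(Utils.splitOptions("-K 1 -F"));

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(knn, data, 10, new Random(1));
        System.out.printf("Kappa: %.4f  ROC Area (active): %.3f%n",
                eval.kappa(),
                eval.areaUnderROC(data.classAttribute().indexOfValue("active")));
    }
}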

• Repeat the Cost/Benefit analysis.

So, after the introduction of weighting, the Cost became lower (11 vs 13), the Gain became slightly higher (87.24 vs 87.13), and now it is sufficient to screen 2.8169 % (instead of 2.9252 %) of the compounds in order to retrieve the same number of “active” ones. Thus, a moderate improvement has been achieved. Overall, the Nearest Neighbours approach appeared to be considerably more efficient than the Naïve Bayes for predicting the ability of chemical compounds to bind to the Angiotensin-Converting Enzyme (ACE) using the set of "extended" MACCS fingerprints as descriptors. The question arises: is the Nearest Neighbours approach always more efficient than the Naïve Bayes?
The answer is: no. In this case the exceptionally high performance of the 1-NN method can be explained by the fact that the MACCS fingerprints have specially been optimized so as to provide high performance of retrieving “active” compounds in similarity search, which is actually 1-NN. With other sets of descriptors, the results might be different.

Exercise: Build two 3-NN models (with and without weighting) for the same dataset and analyze their relative performances in comparison with the corresponding 1-NN models. Hint: the kNN option in the k-NN parameters window should be changed.

6. Exercise 4: Building the Support Vector Machine Models
In this exercise, we build Support Vector Machine (SVM) models for predicting the ability of chemical compounds to bind to the Angiotensin-Converting Enzyme (ACE). The goal is to learn the possibilities offered by the Weka software for that.

• In the classifier frame, click Choose, then select the SMO method from the functions submenu.
The Weka software implements John Platt’s Sequential Minimal Optimization (SMO) algorithm for training a support vector classifier, which explains the abbreviation SMO used in Weka for this method.

• Click on the Start button.


We have obtained a very good model with a small number of misclassification errors (13) and a rather high value of the Kappa statistic, 0.8536. The only thing that became worse in comparison with the previous models is the ROC Area. This can, however, easily be explained by the fact that the original SVM method is not probabilistic, and only a single optimal value of the threshold (which in the case of the standard SVM approach is the distance between the separating hyperplane and the coordinate origin) is provided. Without such a freely moving threshold, it would not be possible to perform virtual screening based on ranking chemical compounds and adjusting the selection threshold. This results in the relatively poor value of the ROC Area. Nonetheless, this can be improved by using a special modification of the original SVM approach, which assigns a probability value to each prediction. Since the algorithm for assigning probability values to SVM predictions is based on the use of logistic functions, such models are called Logistic Models in Weka.

• Click with the left mouse button on the word SMO in the Classifier frame. The window for setting options for the SVM method pops up.


• Change the option buildLogisticModels to True.
• Click on the OK button.
• Click on the Start button.

Although the accuracy of prediction has not changed, the ROC Area became very high – 0.993 for “active” compounds. For such a probabilistic variant of the SVM method, good ROC curves can be built, and the Cost/Benefit analysis can easily be performed.

• Click the right mouse button on the model type functions.SMO in the Result list frame and select the menu item Cost/Benefit analysis / active.
• Click on the Minimize Cost/Benefit button at the right bottom corner of the window.


One can see that the value of the Cost function is low (11), whereas the Gain is rather high (83.45). In order to retrieve 87.7551 % of active compounds in the course of virtual screening, it is possible to select only 2.6002 % of compounds. The Threshold Curve at the left side of the window also demonstrates very good performance of the probabilistic SVM approach in virtual screening.

• Close the window with the Cost/Benefit Analysis.
The obtained results can be improved further. All these models have been built using the linear kernel chosen by default. Such a kernel takes into account only the individual impacts of descriptors (in this case, fingerprint bits), but does not consider their interactions (in this case, interactions of features corresponding to different fingerprint bits). All pairwise interactions of features can be captured using a quadratic kernel. Let us build a new probabilistic SVM model with the quadratic kernel.

• Click with the left mouse button on the word SMO in the Classifier frame. The window for setting options for the SVM method pops up.

Now change the kernel from the linear to the quadratic one. Both are particular cases of the polynomial kernel with different exponents (one for the linear and two for the quadratic kernel). Therefore, in order to obtain the quadratic kernel, it is sufficient to set the exponent parameter of the polynomial kernel to 2; it is not necessary to change the type of kernel.

• Click with the left mouse button on the word PolyKernel near the kernel label.

A new window with parameters of the polynomial kernel pops up.

• Change the value of the exponent option from 1.0 to 2.0.


• Click on the OK button to close the window with polynomial kernel options.
• Click on the OK button to close the window with SVM options.
• Click on the Start button.

So, all statistical parameters have improved substantially in comparison with the case of the linear kernel. In particular, the number of misclassification errors has dropped from 13 to 9, and the value of the Kappa statistic has risen from 0.8536 to 0.9007.
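A minimal sketch of the equivalent configuration through the Weka Java API is given below, assuming the Weka 3.6.x method names used in this tutorial (in particular setBuildLogisticModels for the probabilistic outputs; newer releases may name some options differently).

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class QuadraticSvm {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("ace.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        SMO smo = new SMO();
        smo.setC(1.0);                    // complexity parameter (default value)
        smo.setBuildLogisticModels(true); // fit logistic models to obtain class probabilities

        PolyKernel kernel = new PolyKernel();
        kernel.setExponent(2.0);          // exponent 2 turns the polynomial kernel into a quadratic one
        smo.setKernel(kernel);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(smo, data, 10, new Random(1));
        System.out.printf("Errors: %.0f  Kappa: %.4f  ROC Area (active): %.3f%n",
                eval.incorrect(), eval.kappa(),
                eval.areaUnderROC(data.classAttribute().indexOfValue("active")));
    }
}

Rerunning this sketch with different values passed to setC corresponds to the exercise on the parameter C below.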

• Perform the Cost/Benefit analysis.


In comparison with the linear kernel, the Cost has fallen even further, from 9 to 7. With the quadratic kernel, it is sufficient to select only 2.2752 % of the compounds in order to retrieve 85.7143 % of the “active” ones. So, the transition from the linear kernel to the quadratic one has led to a substantial improvement of the SVM classification models. This means that it is important to consider not just the individual features coded by fingerprints, but also the nonlinear interactions between them. Unfortunately, the very popular Tanimoto similarity measure does not take this into account.

Exercise: Rebuild the probabilistic SVM model with the quadratic kernel for different values of the parameter C (the trade-off between errors and model complexity). Try the values 0.1, 0.5, 2, 10. Can any improvement be achieved in comparison with the default value of 1?

7. Exercise 5: Building a Classification Tree Model
In this exercise, we build a classification tree model (using the C4.5 method, named J48 in Weka) for predicting the ability of chemical compounds to bind to the Angiotensin-Converting Enzyme (ACE). The goal is to learn the possibilities offered by the Weka software to build and visualize classification trees.

• In the classifier frame, click Choose, then select the J48 method from the trees submenu.
• Click on the Start button.

The statistical parameters of the J48 model appear rather modest, especially in comparison with the previously considered methods. Nonetheless, the main strength of individual classification trees stems not from the high statistical significance of the models, but from their interpretability. In order to visualize the classification tree in text mode, scroll the text field in the Classifier output frame up.

In order to obtain a more conventional representation of the same tree, do the following:

• Click the right mouse button on the model type trees.J48 in the Result list frame and select the menu item Visualize tree.
• Resize the new window with the graphical representation of the tree.
• Click with the right mouse button on an empty space in this window, and in the popup menu select the item Fit to screen.


The Tree View graphical diagram can be used to visualize decision trees. It contains two types of nodes, ovals and rectangles. Each oval contains a query of the sort: does the chemical structure contain the feature encoded by the specified fingerprint bit? If the answer is “yes”, then the node connected to the previous one by the “= on” branch is queried next; otherwise, the “= off” branch is followed. The top node of the tree is queried first. The “leaves” of the tree, depicted by rectangles, contain the final decisions on whether the current compound is active or not.
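The same tree can also be produced programmatically; a minimal sketch is given below. The toString() output of a trained J48 model is the textual tree shown in the Classifier output frame (with “= on”/“= off” branches), and graph() returns the same tree in GraphViz dot format for external rendering.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TreeView {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("ace.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();             // C4.5 decision tree with default pruning settings
        tree.buildClassifier(data);

        System.out.println(tree);         // textual tree: fingerprint bit queries and class leaves
        System.out.println(tree.graph()); // the same tree in GraphViz dot format
    }
}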

Exercise: Build the ROC curve and perform the Cost/Benefit analysis of the J48 model.

8. Exercise 6: Building a Random Forest Model
In this exercise, we build a Random Forest model for predicting the ability of chemical compounds to bind to the Angiotensin-Converting Enzyme (ACE). The goal is to learn the possibilities offered by the Weka software to build Random Forest models. Although models built using individual decision trees are not very strong from the statistical point of view, they can largely be improved by applying ensemble modeling. In the latter case, an ensemble of several models is built instead of a single one, and the prediction of the ensemble model is made as a consensus of the predictions made by all its individual members. The most widely used ensemble modeling method is Random Forest, which has recently become very popular in chemoinformatics.

• In the classifier frame, click Choose, then select the RandomForest method from the trees submenu.
• Click with the left mouse button on the word RandomForest in the Classifier frame. The window for setting options for the Random Forest method pops up.
• Change the value of the numTrees option from 10 to 100.


We have changed the default number of trees in the ensemble from 10 to 100.

• Click on the OK button to close the window with the Random Forest options.
• Click on the Start button.

The resulting model is rather strong. Although its classification accuracy and the value of the Kappa statistic are worse than for the SVM model, the ROC Area appears to be very high. This means that it can advantageously be applied in virtual screening. Indeed, let us perform the Cost/Benefit analysis of the model.

• Click the right mouse button on the model type trees.RandomForest in the Result list frame and select the menu item Cost/Benefit analysis / active.
• Click on the Minimize Cost/Benefit button at the right bottom corner of the window.


Very good Cost/Benefit parameters are observed: the Cost is rather low (10), the Gain is rather high (85.4), and it is sufficient to select only 2.6544 % of the compounds in order to retrieve 89.7959 % of the “active” ones.

Exercise: Study the dependence of the Kappa statistic and the ROC Area upon the number of trees in the ensemble. Try 10, 20, 30, 40, 50, 100, 200 trees.
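A minimal sketch of this scan through the Weka Java API is given below, assuming the Weka 3.6/3.7 option name setNumTrees (newer releases expose the ensemble size as setNumIterations instead).

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ForestSizeScan {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("ace.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        int active = data.classAttribute().indexOfValue("active");

        for (int n : new int[]{10, 20, 30, 40, 50, 100, 200}) {
            RandomForest rf = new RandomForest();
            rf.setNumTrees(n); // ensemble size (setNumIterations in Weka 3.8 and later)

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(rf, data, 10, new Random(1));
            System.out.printf("trees=%3d  Kappa=%.4f  ROC Area (active)=%.3f%n",
                    n, eval.kappa(), eval.areaUnderROC(active));
        }
    }
}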


Part 2. Multi-Class Classification Models.
1. Data and descriptors.
The dataset for this tutorial contains 3961 ligands to 40 different protein biotargets and 3127 decoy compounds chosen from the DUD database []. The extended set of MACCS fingerprints is used as descriptors.

2. Files
The following file is supplied for the tutorial:
• dud.arff – descriptor and activity values

3. Exercise 7: Building the Naïve Bayes Model
In this exercise we will show how the Naïve Bayes method implemented in the Weka software can be applied for building a multi-class model capable of predicting affinity to 40 pharmaceutically important protein biotargets. In this case the output attribute is called “classes” and it can take 41 different values: the names of the biotargets and “none” for the lack of affinity.

Step by step instructions
Important note for Windows users: During installation, the ARFF files should have been associated with Weka. In this case, it is highly recommended to locate and double click on the file dud.arff, and to skip the following points.

• In the starting interface of Weka, click on the button Explorer.
• In the Preprocess tab, click on the button Open File. In the file selection interface, select the file dud.arff.


The dataset is characterized in the Current relation frame: the name (dud), the number of instances (compounds), and the number of attributes (descriptors + activity/property). We see in this frame that the number of compounds is 7088, whereas the number of descriptors is 1024, which is the number of attributes (1025) minus the “classes” field. The Attributes frame allows the user to modify the set of attributes using the select and remove options. Information about the selected attribute is given in the Selected attribute frame, in which a histogram depicts the attribute distribution. One can see that the value of the currently selected descriptor fp_1 (the first bit in the corresponding fingerprint) is “on” in 1675 compounds and “off” in 5413 compounds in the dataset.

• Select the last attribute “classes” in the Attributes frame.

One can read from the Selected attribute frame the list of different classes (40 types of biotargets and ‘none’ for decoys) and the number of compounds belonging to each of the classes (i.e. the number of ligands strongly binding to the corresponding protein). Compounds binding to different biotargets are depicted with different colors in the histogram; the last, black color corresponds to the “decoys”.

• Click on the tab Classify.
• In the classifier frame, click Choose, then select the NaiveBayes method from the bayes submenu.
• Click on the Start button to build a model.
All information concerning the predictive performance of the resulting model can be extracted from the text field in the right-hand Classifier output frame. Consider first the global statistics.

We can see that for 81.3488 % of the ligands the corresponding biotargets have been correctly predicted. The value of the Kappa statistic, 0.7688, means that the statistical significance of the model is rather high. Therefore, it can be applied in “target fishing”, i.e. in the prediction of putative biological targets for a given compound. Consider the individual statistics for each of the targets.


For each of the targets, several statistical characteristics are presented: the True Positive (TP) rate, False Positive (FP) rate, Precision, Recall, F-Measure, and ROC Area. One can see that the recognition performance differs considerably between targets. For example, the models for dhfr and gart ligands are very strong, whereas those for pr, hivrt and hivpr are not so good. For each of the targets, an individual ROC curve can be built and a Cost/Benefit analysis can be performed.
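These per-target statistics can also be obtained programmatically. The sketch below is a minimal example based on the Weka Java API; toClassDetailsString() prints the per-class table shown in the Classifier output frame, and areaUnderROC() gives the ROC Area for one particular target (the class label “ace” is assumed to match the spelling used in dud.arff).

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MultiClassDetails {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("dud.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // the "classes" attribute

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        System.out.printf("Accuracy: %.4f %%  Kappa: %.4f%n", eval.pctCorrect(), eval.kappa());
        System.out.println(eval.toClassDetailsString()); // TP rate, FP rate, precision, recall, F-measure, ROC Area per class

        // ROC Area for one particular target
        int ace = data.classAttribute().indexOfValue("ace");
        if (ace >= 0) {
            System.out.println("ROC Area (ace): " + eval.areaUnderROC(ace));
        }
    }
}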

Exercise: Perform the Cost/Benefit analysis for the ace target and compare its results with the case of the two-class classification Naïve Bayes model obtained in Exercise 2.

4. Exercise 8: Building a Joint Classification Tree for 40 Targets
In this exercise we will show how decision trees can be applied for building a multi-class model capable of predicting affinity to 40 pharmaceutically important protein biotargets and how the resulting decision tree can be visualized.

• In the classifier frame, click Choose, then select the J48 method from the trees submenu.
• Click on the Start button.
The global statistics of the multi-class classification model are as follows:

So, the classification tree model is characterized by better statistical parameters than the Naïve Bayes one (compare with the previous exercise). Now let us visualize the joint classification tree.

• Click the right mouse button on the model type trees.J48 in the Result list frame and select the menu item Visualize tree.
• Resize the new window with the graphical representation of the tree.
• Click with the right mouse button on an empty space in this window, and in the popup menu select the item Auto Scale.
The graphical representation of the tree is very big and does not fit into the window, so use the scroll bars to scroll it inside the window.


5. Exercise 9: Building the Multi-Class Random Forest Model
In this exercise we will show how the Random Forest method can be applied for building a multi-class model capable of predicting affinity to 40 pharmaceutically important protein biotargets.

• In the classifier frame, click Choose, then select the RandomForest method from the trees submenu.
• Click with the left mouse button on the word RandomForest in the Classifier frame. The window for setting options for the Random Forest method pops up.
• Click on the Start button.
The global statistics of the multi-class classification model are as follows:

We can see that the Random Forest method provides the strongest multi-class classification models.

Exercise: Perform Cost/Benefit analysis for the ace target and compare its results with the case of the two-class classification Random Forest model obtained in Exercise 6.


Appendix

1. Notes for Windows
On Windows, Weka should be located in the usual program launcher, in a folder named after the Weka version (e.g., weka-3-6-2). It is recommended to associate Weka with ARFF files. Thus, by double clicking an ARFF file, Weka/Explorer will be launched and the default directory for loading and writing data will be set to the same directory as the loaded file. Otherwise, the default directory will be the Weka directory. If you want to change the default directory for datasets in Weka, proceed as follows:

• Extract the weka/gui/explorer/Explorer.props file from the java archive weka.jar. This can be done using an archive program such as WinRAR or 7-zip.
• Copy this file into your home directory. To identify your home directory, type the command echo %USERPROFILE% in a DOS command terminal.
• Edit the file Explorer.props with WordPad.
• Change the line InitialDirectory=%c to InitialDirectory=C:/Your/Own/Path

If you need to change the memory available to Weka in the JVM, you need to edit the file RunWeka.ini or RunWeka.bat in the installation directory of Weka (administrator privilege may be required). Change the line maxheap=128m to maxheap=1024m. You cannot assign more than about 1.4 GB to a JVM because of limitations of Windows.

2. Notes for Linux
To launch Weka, open a terminal and type: java -jar /installation/directory/weka.jar.

If you need to assign additional memory to the JVM, use the option -XmxMemorySizem, replacing MemorySize by the required size in megabytes. For instance, to launch Weka with 1024 MB, type: java -Xmx1024m -jar /installation/directory/weka.jar.

