The Chinese University of Hong Kong
Department of Computer Science and
Engineering

Final Year Project
Trading Strategy and Portfolio
Management (LWC 1301)
Implementing Portfolio Selection By
Using Data Mining

Tseng Ling Chun (1155005610)
Supervisor: Professor Chan Lai Wan
Marker: Professor Xu Lei

Table of Contents

1. Introduction
   1.1 Financial Portfolio
   1.2 Data Mining and Decision Trees
   1.3 Flow of Report
2. Classification and Regression Trees (CART)
   2.1 Detailed Description
   2.2 Tree Construction
       2.2.1 Application of Impurity Function in CART
   2.3 Splitting Rules
3. Optimizing Size of Tree
   3.1 Parameterization of Trees
   3.2 Cost-Complexity Function
   3.3 V-Fold Cross-Validation
4. Iterative Dichotomiser 3 (ID3)
   4.1 Entropy and Information Gain
5. Data Used
   5.1 Platform and Open Source Library
       5.1.1 Testing and Development Environment
       5.1.2 Robotrader
       5.1.3 Java Object Oriented Neural Engine
       5.1.4 TA-Lib (Technical Analysis Library)
       5.1.5 WEKA
   5.2 Historical Stock Data Source
       5.2.1 Raw Data Format
   5.3 Pre-processing of information
       5.3.1 Simple moving average (SMA)
       5.3.2 Exponential moving average (EMA)
       5.3.3 Relative Strength Index (RSI)
       5.3.4 Momentum and rate of change
6. Experiments and Results
   6.1 Trading strategy
       6.1.1 Planning of Trading strategy
       6.1.2 Choice of stock
       6.1.3 Finding rising / dropping stock in the following 30 days
           6.1.3.1 Flow of testing
           6.1.3.2 Choice of method
           6.1.3.3 Test without using SMA in CART
           6.1.3.4 Test with SMA in CART, adding momentum-related attributes and modifying the class
           6.1.3.5 Classification using the ID3 algorithm
   6.2 Portfolio management
       6.2.1 Planning of Portfolio management
       6.2.2 Choice of stock list
       6.2.3 Finding rising stock in following year
           6.2.3.1 Flow of testing
           6.2.3.2 Test using exact values
           6.2.3.3 Test using proportion values
           6.2.3.4 Summary on choice of first step
       6.2.4 Choosing low-relation stock from the list
           6.2.4.1 Flow of testing
           6.2.4.2 Method
           6.2.4.3 Testing using monthly percentage change data
7. Conclusion
8. Difficulties and Challenges
9. Contribution of Work
Appendix A
   A.1 Decision Tree from CART
   A.2 ID3 Testing Result
   A.3 Sample Data used in ID3 Algorithm
   A.4 Sample Data used in CART Algorithm
   A.5 Sample Data used in K-means Clustering


1. Introduction
With the rapid expansion of computer technologies and the Internet over the past decade, a massive number of financial investment tools and portfolios have been published by different financial institutions. Since many people trade in different markets through different financial institutions and portfolios, this situation has attracted many economists and mathematicians to investigate portfolio simulations so as to obtain higher returns from the market with less risk. At the same time, the ability of computers to handle the enormous amounts of data generated by the markets has improved, and during this improvement different data mining techniques have been used.
In this report, we try to select the best portfolio among different combinations by using different data mining techniques, including Classification and Regression Trees (CART) and Iterative Dichotomiser 3 (ID3).

1.1 Financial Portfolio
The word “portfolio” carries the sense of “collection”; in finance it directly means a collection of different investments. It is important not to put all resources into one single security, because different securities have different characteristics, such as liquidity, over time. Therefore, diversification of portfolios has recently become a hot issue, with the aim of reducing systematic and specific risk. However, this kind of concern is not the underlying subject of this report.

1.2 Data Mining and Decision Trees
Data mining is a set of techniques for digging out useful information hidden in vast amounts of data. It can be used to reduce noise, analyze data from different dimensions and summarize relationships between data. Basically, it is a process of finding correlations or patterns in a large relational database.
Data mining methods include association, clustering, classification, regression, deviation detection, summarization and sequential mining. This report uses Classification and Regression Trees to select a portfolio.
A decision tree is a non-parametric learning method with a schematic tree shape, used for classification and regression. The goal of a decision tree is to derive a model that can predict the values of target variables by applying simple decision rules induced from the features of the data. In the learning example below, the decision tree model learns from data and approximates a curve with if-then-else rules.
Figure. 1.1: Example of decision tree model with if-then-else decision rules

1.3 Flow of Report
The structure of this report is as follows: Chapter 2 presents the decision tree methodology used in the report, CART, by introducing important components such as impurity measures and the Gini splitting rule. Chapter 3 shows how to optimize the tree size, applying cross-validation to the model to get a better result. Chapter 4 introduces the second method, ID3. A detailed description of the data and platforms used for the experiments is given in Chapter 5. Chapter 6 provides the experiments and results, and Chapter 7 concludes.


2. Classification and Regression Trees (CART)
2.1 Detailed Description
Classification and Regression Trees (CART) is a non-parametric method that uses available data in the form $(X, Y)$, where $X$ represents the matrix of explanatory variables and $Y$ represents the vector of data classes. There may be situations in which the available data do not yet belong to any class; further computation is then needed for the data to meet the characteristics of $Y$, and the computed available data form $X$.
To select a portfolio or stock, we first have to know what kind of portfolio or stock it is, i.e., to which class in $Y$ it belongs. For example, a stock can be defined as bearish, bullish or, subjectively, neutral; the class vector then consists of these predefined classes of stocks. $X$ may contain available data such as technical variables related to the three predefined classes. The detailed construction of the vector $Y$ is given in a later chapter.
Since there are many observations inside a learning sample, each with its own class, we can treat the sample as combinations of available data $X$ and class vector $Y$, which are used to extract useful data patterns by “learning”. The decision tree $T$ takes outcomes from those samples and tries to evaluate the relationship between $X$ and $Y$. After learning, the model can assign new data from outside the sample pool to classes from $Y$. This process is important, since market information is constantly updated in large amounts, and the trained model is then able to suggest buying, selling or holding from the new data.
Now consider the example illustrated in Figure 2.1. CART tends to simplify the data by splitting them with a minimal number of questions. Note that CART only answers Yes/No questions of the form “$X_i \le c$?”. In this example, we classify the nodes tagged Apple, Orange, Banana, Grape and Melon as terminal nodes $t_k$, where $k = 1, 2, \ldots, n$. If the answer to a question is yes, the left branch is taken.
Figure 2.1: Example of a classification tree with a minimal number of questions; splits on $X_1$ and $X_2$ (at the thresholds 0.5, 0.75 and 0.25) lead to the terminal nodes Apple, Orange, Banana, Grape and Melon.
CART goes through all the available variables to find the splits $s$, which are combinations of available data $X$ and suitable question values. Among all these splits, the optimal split $s^*$ is the one that splits the data into two parts with maximum homogeneity. The process is repeated until all splits are “optimal”; the resulting tree then has “optimal” size, meaning it contains a minimal number of questions.


Figure 2.2: Result of applying CART to the tree in Figure 2.1 with 2 variables. Four splits are enough for the data to be separated into the different nodes; red lines indicate the questions.

2.2 Tree Construction
In this project we apply the decision tree to a vast amount of data. In order to separate the observations and extract the relationships between them, constructing a tree is an important topic, and it consists of three major steps:
1. constructing a maximum tree $T_{MAX}$,
2. adjusting the tree to the right size, and
3. inputting new data into the newly constructed tree.
First of all, the available data are $X = (X_1, X_2, \ldots, X_P)$, where $P$ is the number of variables, and the class vector $Y$ has the same length as the number of observations, because each observation belongs to exactly one class. We let $J$ be the number of unique classes in $Y$; in the bearish/bullish/neutral example above, $J = 3$.

We now let $t_P$ be the parent node, while $t_L$ and $t_R$ represent the left and right child nodes, with $p_L$ the fraction of observations sent from the parent node to the left node and $p_R = 1 - p_L$ the fraction sent to the right node. If $n_P$ is the number of observations in the parent node while $n_L$ and $n_R$ are the numbers of observations in the left and right child nodes, then

Eq. 2.1: $p_L = \frac{n_L}{n_P}, \qquad p_R = \frac{n_R}{n_P}, \qquad n_L + n_R = n_P$

As mentioned in Ch. 2.1, CART splits the initial data (parent node) into two separate parts (child nodes) so as to find the most homogeneous groups and thereby obtain an “optimal” tree. Homogeneity is defined through the impurity function $i(t)$.

2.2.1 Application of Impurity Function in CART
The impurity function measures the purity of a region containing data from different classes. Assume there are $K$ classes; the impurity function is then a function of the probabilities $p_1, p_2, \ldots, p_K$ of the different regions belonging to classes $y_1, y_2, \ldots, y_K$. We can state two major properties of this function:

1. It achieves its maximum at the point $(p_1, \ldots, p_K) = (\tfrac{1}{K}, \ldots, \tfrac{1}{K})$, i.e., at the uniform distribution.
2. It achieves its minimum at the points $(1, 0, \ldots, 0), (0, 1, \ldots, 0), \ldots, (0, 0, \ldots, 1)$.
The impurity function provides a means to split the data into left and right child nodes of maximum homogeneity compared to the parent node. The chosen split is the one that maximizes the change of the impurity function at node $t$:

Eq. 2.2: $\Delta i(t) = i(t) - \sum_{c \in C} p_c \, i(c)$

where $C$ is the set of child nodes of $t$ and $p_c$ is the fraction of observations of $t$ sent to child $c$.

To apply this to CART, we let the two fractions $p_L$ and $p_R$ be the estimated probabilities of the left and right child nodes, and define the goodness of a split $s$ for node $t$ as:

Eq. 2.3: $\Delta i(s, t) = i(t) - p_L \, i(t_L) - p_R \, i(t_R)$

By Eq. 2.3, we obtain the optimization problem solved at each node:

Eq. 2.4: $s^* = \arg\max_{s \in S} \Delta i(s, t)$

Eq. 2.4 is the CART algorithm's search for the best split $s^*$ through all variables constituting the matrix space $X$, maximizing the change of the impurity function. By applying Eq. 2.4 repeatedly, a maximum tree $T_{MAX}$ is established: the tree containing the maximum number of nodes for a given data sample set. Eq. 2.4 is applied to the originally given data set and to the resulting data portions until the condition in Eq. 2.5 holds:

Eq. 2.5: $\max_{j} p(j \mid t) = 1 \quad \text{for all } t \in \tilde{T}$

where $\tilde{T}$ is the set of terminal nodes in tree $T$ and $p(j \mid t)$ is the estimated posterior probability of class $j$ given that a point is in node $t$. This condition means that, in each terminal node of the maximum tree $T_{MAX}$, all observations are of the same class $j$.


2.3 Splitting Rules
In this report, we employ the concept of the Gini index to define the impurity function. Using the Gini index, the impurity function takes the form:

Eq. 2.6: $i(t) = \sum_{k \neq l} p(k \mid t) \, p(l \mid t) = 1 - \sum_{j=1}^{J} p^2(j \mid t)$

where $k, l = 1, \ldots, J$ are class indices, and:

Eq. 2.7: $p(j \mid t) = \frac{n_j(t)}{n(t)}$

where $n_j(t)$ is the number of observations from the variable set $X$ of class $j$ at node $t$ and $n(t) = \sum_j n_j(t)$. Note that these observations have been assigned to the corresponding node $t$ by a given split $s$.
By applying the Gini impurity function to Eq. 2.3 and Eq. 2.4, we obtain:

Eq. 2.8: $\Delta i(s, t) = -\sum_{j} p^2(j \mid t) + p_L \sum_{j} p^2(j \mid t_L) + p_R \sum_{j} p^2(j \mid t_R)$

Eq. 2.9: $s^* = \arg\max_{s \in S} \left[ p_L \sum_{j} p^2(j \mid t_L) + p_R \sum_{j} p^2(j \mid t_R) \right]$

(the term $-\sum_j p^2(j \mid t)$ is constant for a given node $t$, so it can be dropped from the maximization). The Gini algorithm tends to look for the largest, “most important” class and isolate it from the rest of the data.
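To make the splitting rule concrete, the following is a minimal Java sketch (our illustration, not the project's actual code; names and sample values are made up). It evaluates Eq. 2.9 for one numeric attribute: every midpoint between consecutive sorted values is a candidate split s, and the one maximizing the bracketed term is kept.

import java.util.Arrays;

// Sketch of the Gini splitting rule (Eq. 2.6 - 2.9) for one numeric attribute.
// Class labels are 0 .. numClasses-1.
public class GiniSplit {

    // sum_j p(j|t)^2 for the label counts of a node (1 minus this is Eq. 2.6).
    static double sumSq(int[] counts, int total) {
        double s = 0.0;
        for (int c : counts) {
            double p = (double) c / total;
            s += p * p;
        }
        return s;
    }

    // Finds the threshold maximizing the goodness of split (Eq. 2.9).
    static double bestThreshold(double[] x, int[] y, int numClasses) {
        Integer[] order = new Integer[x.length];
        for (int i = 0; i < x.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(x[a], x[b]));

        int[] left = new int[numClasses];   // class counts in t_L
        int[] right = new int[numClasses];  // class counts in t_R
        for (int label : y) right[label]++;

        double bestScore = Double.NEGATIVE_INFINITY, bestThr = Double.NaN;
        for (int k = 0; k < x.length - 1; k++) {
            int i = order[k];
            left[y[i]]++;                   // move one observation to the left node
            right[y[i]]--;
            if (x[order[k]] == x[order[k + 1]]) continue; // no threshold between equal values
            int nL = k + 1, nR = x.length - nL;
            double pL = (double) nL / x.length, pR = (double) nR / x.length;
            double score = pL * sumSq(left, nL) + pR * sumSq(right, nR); // Eq. 2.9
            if (score > bestScore) {
                bestScore = score;
                bestThr = (x[order[k]] + x[order[k + 1]]) / 2.0;
            }
        }
        return bestThr;
    }

    public static void main(String[] args) {
        double[] currentPrice = {59.1, 60.2, 66.8, 73.0, 75.5, 120.1}; // toy data
        int[] cls = {0, 0, 1, 1, 2, 2};                                // c0, c1, c2
        System.out.println("best split at " + bestThreshold(currentPrice, cls, 3));
    }
}

The same search, run over every available variable, yields the split s* of Eq. 2.4.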


3. Optimizing Size of Tree
From Chapter 2, we can summarize the process of creating a tree: (1) apply Eq. 2.9 to the learning sample data set, then (2) apply it again to the newly created nodes of the tree, and (3) stop when the condition in Eq. 2.5 holds for every terminal node, or when the tree size has been optimized. In this chapter we show what the optimized tree size is and why the maximum tree is not always the best choice.

3.1 Parameterization of Trees
Since this report aims to use the CART technique to obtain satisfying portfolio selection results, applying Eq. 2.9 until condition Eq. 2.5 holds would seem to achieve the optimization. However, because the homogeneity of the classes increases at each stage by filtering observations of other classes out to other nodes, even the smallest, “least important” random noise is taken into account as part of the final decision. We only want the reliable parts of the tree, not the noise: new data pass through the whole trained tree, including the noisy parts, and the classification then has a higher probability of being wrong.
In other words, a tree that is too large suffers from over-parameterization caused by noise, while a tree that is too small suffers from under-parameterization, since it cannot learn the significant patterns inside the learning sample. The solution to these problems is to apply cross-validation to subtrees of $T$ of different sizes and compare their performance.


3.2 Cost-Complexity Function
Tree complexity means how large the size of the tree is, and it depends on how many nodes there are. There are two extreme cases: (1) a maximum tree $T_{MAX}$ is penalized for its large size, since a larger portion of noisy parts is produced, even though it makes perfect predictions on the learning sample; (2) very small trees receive a much lower penalty for their size, but their predictive ability is limited, since they may ignore the significant learning sample data mentioned in Chapter 3.1.
In order to find the balance between tree size, which is penalized, and predictive power, which depends on tree size, we apply the cost-complexity function, i.e., tree pruning, to achieve our goal.
Imagine that we stop the growing process at various points and times; we can then evaluate the misclassification error at each such point. This leads to an error versus tree size diagram. There are two types of classification error:

Figure 3.1: (a) Training error, i.e. mistakes made on the training set, versus tree size; (b) testing error, i.e. mistakes made on the testing set, versus tree size.


Training error decreases as tree size increases. However, the same does not hold for testing. Testing error first decreases with increasing tree size, since we expect a result similar to the training stage. But during testing, a larger amount of the noisy structure learned by the model comes into play, and the model becomes less accurate on the testing data: it is actually fitting the noise rather than the test data. Therefore the testing error curve starts to increase once the tree is so large that it cannot perform well on the test data. This is the overfitting problem.
To obtain the optimal result by considering the tradeoff between tree size and error, we first define the misclassification error at node $t$:

Eq. 3.1: $r(t) = 1 - \max_{j} p(j \mid t)$

Eq. 3.2: $R(t) = r(t) \, p(t)$

where $p(t)$ is the probability that an observation falls into node $t$. The misclassification error of the tree is then:

Eq. 3.3: $R(T) = \sum_{t \in \tilde{T}} R(t)$

where $\tilde{T}$ is the set of terminal nodes. We define the size of $T$ as $|\tilde{T}|$, the number of its terminal nodes, for any subtree $T \preceq T_{MAX}$. The pruning method, the cost-complexity function, is then defined as:

Eq. 3.4: $R_\alpha(T) = R(T) + \alpha |\tilde{T}|$

where $\alpha \ge 0$ is the complexity parameter and $R(T)$ is the cost component.

Since the set of subtrees of $T_{MAX}$, which can be considered pruned trees, is finite, we can write it as $\{T_1, T_2, \ldots, T_N\}$ with decreasing numbers of nodes. A tree $T(\alpha)$ is considered “optimal” for some $\alpha$, and it remains “optimal” until another parameter $\alpha'$ in Eq. 3.4 makes a different subtree more “optimal”, at which point it is replaced. The process carries on until the most optimal one is found. The optimal tree $T(\alpha)$ satisfies:

Eq. 3.5: $R_\alpha(T(\alpha)) = \min_{T \preceq T_{MAX}} R_\alpha(T)$, or equivalently Eq. 3.6: $T(\alpha) = \arg\min_{T \preceq T_{MAX}} R_\alpha(T)$

After determining the misclassification error for the optimization of tree size, we can apply the technique of V-fold cross-validation to determine the optimal tree.
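As a small illustration (our own sketch; the subtree costs and sizes below are hypothetical numbers, not project results), Eq. 3.4 can be evaluated over a pruning sequence to pick T(α):

// Sketch: choosing among candidate subtrees by cost-complexity (Eq. 3.4).
public class CostComplexity {
    // R_alpha(T) = R(T) + alpha * |T~|
    static double cost(double r, int leaves, double alpha) {
        return r + alpha * leaves;
    }

    public static void main(String[] args) {
        // hypothetical pruning sequence T1 .. T4 with decreasing size
        double[] R = {0.10, 0.12, 0.16, 0.25};   // misclassification cost R(T)
        int[] leaves = {140, 60, 20, 5};          // number of terminal nodes |T~|
        double alpha = 0.002;                     // complexity parameter
        int best = 0;
        for (int i = 1; i < R.length; i++)
            if (cost(R[i], leaves[i], alpha) < cost(R[best], leaves[best], alpha)) best = i;
        System.out.println("T(alpha) = T" + (best + 1));
    }
}

Larger α favours smaller subtrees; α = 0 always selects the maximum tree.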

3.3 V-Fold Cross-Validation
When a tree is built by learning a specific dataset and separate, independent testing datasets are then run through the tree, the classification error first decreases with increasing tree size until it reaches a minimum. If the tree grows beyond this minimum point, the classification error increases again; an example is shown in Figure 3.1(b).
However, it is difficult to hold back data from the learning dataset for separate tests, and testing on independent data frequently is also expensive. Therefore V-fold cross-validation is used to test tree sizes independently, without separating out a test dataset and without reducing the data used to build the tree.


The brief principle of V-fold cross-validation is as follows (a small code sketch is given at the end of this section):
1. Since all the data in the dataset may be used to build a tree, the tree is first intentionally grown larger than the “optimal” one. We refer to this tree as the unpruned tree, or the maximum tree $T_{MAX}$, which fits the learning dataset perfectly.
2. The whole learning dataset is then separated into a number of groups called “folds”, such that the distributions of classes of the available variables are similar across groups. The number of groups is $V$.
3. Assume there are 10 partitions ($V = 10$). The model first combines 9 of the partitions into a new pseudo-learning dataset, and a test tree is built on it. This tree fits only 90% of the data behind the unpruned tree; the unused 10% of the learning data is independent and can be used as a test sample for the test tree.
4. During testing, the classification error of test tree 1 is recorded as an independent error of tree 1.
5. Repeat steps 3 and 4 with different pseudo-learning datasets, $V$ times in total.
6. Once 10 test trees have been built and 10 classification errors extracted, stop the process.
7. Average the classification errors by tree size into an average error rate. This average error rate for a specific tree size represents the “cross-validation cost” CV. After computing CV for each size of the tested trees, the tree size producing the minimum CV is found.
8. The unpruned tree is then pruned to the size with minimum CV. During pruning, we remove the “least important” nodes, found by the cost-complexity function of Chapter 3.2.
However, selecting the optimal tree by the minimum value of CV, $E_{CV}(T)$, may not be the best solution, since there may exist more than one candidate: several tree sizes may give CV values within a narrow range of each other.
Also, if the value of $V$ is less than the number of observations in the available variable set $X$, V-fold cross-validation may give a different result from the original cross-validation.
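The following is a minimal, self-contained Java sketch of the V-fold mechanics above (our illustration only: the “model” is simply the majority class of the pseudo-learning set, standing in for a tree of a given size, and the labels are hypothetical):

// Sketch of V-fold cross-validation: train on V-1 folds, test on the held-out fold,
// and average the V classification errors into the cross-validation cost CV.
public class VFoldCV {
    public static void main(String[] args) {
        int[] y = {0,1,1,2,1,0,1,1,2,1,0,1,2,1,1,0,1,1,1,2}; // hypothetical labels
        int V = 5, n = y.length, numClasses = 3;
        double totalError = 0.0;
        for (int fold = 0; fold < V; fold++) {
            // step 3: "build the model" on the pseudo-learning set (here: majority class)
            int[] counts = new int[numClasses];
            for (int i = 0; i < n; i++)
                if (i % V != fold) counts[y[i]]++;
            int majority = 0;
            for (int c = 1; c < numClasses; c++)
                if (counts[c] > counts[majority]) majority = c;
            // step 4: independent error on the held-out fold
            int errors = 0, held = 0;
            for (int i = 0; i < n; i++)
                if (i % V == fold) { held++; if (y[i] != majority) errors++; }
            totalError += (double) errors / held;
        }
        // step 7: the averaged error is the cross-validation cost CV
        System.out.println("CV = " + totalError / V);
    }
}

A real run would repeat this for each candidate tree size and keep the size with minimum CV, as in step 8.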


4. Iterative Dichotomiser 3 (ID3)
Besides CART, this project also includes the ID3 technique, which differs slightly from CART, to obtain the best selection.
ID3 was first developed by Ross Quinlan in 1986. The algorithm creates a simple and efficient tree with small depth. The major difference between ID3 and CART is in the splitting method: CART is a binary tree model, so each parent node can only develop 2 child nodes, whereas an ID3 tree node can have multiple children and siblings at the same time. Since the main method of ID3 is based on the Concept Learning System (CLS), the training approach over a learning dataset C is:
1. A TRUE node is created if all data in C are positive; then stop immediately.
2. A FALSE node is created if all data in C are negative; then stop immediately.
3. Otherwise, divide the training data in C into subsets C1, C2, ..., CN according to their attribute values.
4. Repeat steps 1 - 3 for each Ci, where i = 1, 2, ..., N.
In the above procedure, ID3 goes through all attributes of the training dataset and selects the attribute that best separates the given dataset, stopping once the best attribute is found. Note that ID3 does not turn back to reconsider data once it has passed through them.
The requirements on the learning dataset are as follows:
1. Each attribute must have a fixed number of values.
2. Attributes must be predefined before being used in examples.
3. Continuous classes are not directly allowed.
4. There must be a sufficient amount of data, since ID3 needs a large amount of data to distinguish useful patterns from chance occurrences. More data results in better accuracy.

4.1 Entropy and Information Gain
In order to decide which attribute to choose, a technique called Information Gain is introduced, based on the entropy function:

Eq. 4.1: $E(C) = -\sum_{i=1}^{J} p(j_i) \log_2 p(j_i)$

where $p(j_i)$ is the probability of class $j_i$ in the training dataset $C$, defined as the fraction of examples in $C$ that belong to class $j_i$.
All attributes are evaluated to investigate which of them can reduce the impurity the most when used to divide $C$. Suppose the number of possible values of attribute $A_i$ is $w$, so that it divides the dataset $C$ into subsets $D_1, D_2, \ldots, D_w$. The entropy of this division is:

Eq. 4.2: $E(C, A_i) = \sum_{v=1}^{w} \frac{|D_v|}{|C|} \, E(D_v)$

The information gain of $A_i$ is therefore:

Eq. 4.3: $IG(A_i) = E(C) - E(C, A_i)$

Information gain measures the difference between the entropies before and after the split, i.e., how much uncertainty has been removed by splitting dataset C on attribute A. The larger the information gain, the larger the portion of uncertainty removed. Note that if IG(A) is too small, the process halts automatically.
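A minimal Java sketch of Eq. 4.1 - 4.3 for a boolean attribute (our illustration; the attribute name and sample labels are hypothetical). ID3 would compute this gain for every attribute and split on the one with the largest value:

// Entropy and information gain for a boolean attribute (Eq. 4.1 - 4.3).
public class InfoGain {
    // E(C) = -sum_j p(j) log2 p(j)
    static double entropy(int[] labels, int numClasses) {
        int[] counts = new int[numClasses];
        for (int l : labels) counts[l]++;
        double e = 0.0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / labels.length;
            e -= p * (Math.log(p) / Math.log(2));
        }
        return e;
    }

    // IG(A) = E(C) - sum_v |D_v|/|C| * E(D_v); a boolean attribute gives two subsets.
    static double informationGain(boolean[] a, int[] y, int numClasses) {
        int nTrue = 0;
        for (boolean v : a) if (v) nTrue++;
        int[] dTrue = new int[nTrue], dFalse = new int[y.length - nTrue];
        int i = 0, j = 0;
        for (int k = 0; k < y.length; k++) {
            if (a[k]) dTrue[i++] = y[k]; else dFalse[j++] = y[k];
        }
        double after = (double) dTrue.length / y.length * entropy(dTrue, numClasses)
                     + (double) dFalse.length / y.length * entropy(dFalse, numClasses);
        return entropy(y, numClasses) - after;
    }

    public static void main(String[] args) {
        boolean[] pbigrise = {true, true, false, false, true, false}; // hypothetical attribute
        int[] cls = {0, 0, 1, 1, 0, 2};                               // classes c0/c1/c2
        System.out.println("IG = " + informationGain(pbigrise, cls, 3));
    }
}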


5. Data Used
5.1 Platform and Open Source Library
For testing and finding the trading strategy, some open source libraries and applications are used to visualize stock charts and trading results, and some libraries are used to calculate technical analysis indicators for testing.

5.1.1 Testing and Development Environment
For development, Windows 7 is used as the platform. Since several finance / data mining libraries are based on Java, Java was chosen as the development language. For the Integrated Development Environment, Eclipse was chosen, since it provides plugins for obtaining the source code of the open source applications used.


5.1.2 Robotrader
Robotrader (official website: http://jrobotrader.atspace.com) is a simulation platform for automated stock trading. It is an open source application built in Java under the LGPLv2 open source license. The main usages of this software are to:
● Visualize the stock chart
● Obtain and convert stock data
● Provide the main framework of automated trading
The simple structure of Robotrader is as follows:

● GUI — code of the GUI, linking of the different modules, showing the total return report. Important file: ReportModule.java
● Market — classes holding the data of the market, such as the user's current money and the data of each stock (including pre-computed data); also registers the trading strategy (indicator). Important files: HistoricData.java, IIndicatorContainer.java
● Quotedb — downloading / reading of stock files. Important files: InstrumentQuoteFile.java, YahooUSHistoricLoader.java
● Trader — directory that stores the various trading strategies; our implementation and testing of trading is based here.
● Stat — directory that generates statistics data.

Table. 5.1 Structure of Robotrader


Figure. 5.2 The relationship between the structural components of Robotrader: trading strategies register with the Market, stock data are loaded through quotedb, the Market signals the Trader as each day passes, and reports are output through stat.

5.1.3 Java Object Oriented Neural Engine
JOONE is a neural network framework built in Java. This library is used in the predicting algorithm.


5.1.4 TA-Lib (Technical Analysis Library)
TA-Lib (official webpage: http://ta-lib.org/) is an open source library that can perform technical analysis of financial market data. TA-Lib is under a BSD License and is available in different programming languages; in our testing, the Java version of TA-Lib is used.

5.1.5 WEKA
Weka is machine-learning software written in Java. It provides a GUI for doing machine learning easily, as well as a back-end Java library for machine learning programs. In this project we mainly use the library instead of testing through the GUI directly, and we combine Robotrader and Weka into one testing platform for testing the strategy.

5.2 Historical Stock Data Source
5.2.1 Raw Data Format
The data used is EOD (End Of Day) data extracted from Yahoo (US). The following table shows its format:

Field / Type / Description
● Stock / String / Stock number in the market
● Date / Integer / The date of that row of data
● Open / Float / Opening price of the day
● High / Float / Highest price of the day
● Low / Float / Lowest price of the day
● Close / Float / Closing price of the day
● Volume / Integer / Total traded amount of the stock for the day
● Adjusted Close / Float / Adjusted closing price of the stock, i.e., the original price before splits / dividends (see Yahoo Finance Help SLN2311: http://help.yahoo.com/kb/index?page=content&y=PROD_FIN&locale=en_CA&id=SLN2311)

Table. 5.2 EOD data format
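As an illustration only (a sketch assuming the common Yahoo CSV column order Date, Open, High, Low, Close, Volume, Adj Close; the sample row values below are made up), one line of EOD data could be parsed into the fields of Table 5.2 as follows:

import java.util.Locale;

// Sketch: parsing one line of EOD data into the fields of Table 5.2.
public class EodRow {
    String stock;
    int date;                 // e.g. 20100104, in the yyyymmdd style used in Chapter 6
    float open, high, low, close, adjClose;
    long volume;

    static EodRow parse(String stock, String csvLine) {
        String[] f = csvLine.split(",");
        EodRow r = new EodRow();
        r.stock = stock;
        r.date = Integer.parseInt(f[0].replace("-", "")); // "2010-01-04" -> 20100104
        r.open = Float.parseFloat(f[1]);
        r.high = Float.parseFloat(f[2]);
        r.low = Float.parseFloat(f[3]);
        r.close = Float.parseFloat(f[4]);
        r.volume = Long.parseLong(f[5]);
        r.adjClose = Float.parseFloat(f[6]);
        return r;
    }

    public static void main(String[] args) {
        // made-up sample row for 0005.HK
        EodRow r = parse("0005.HK", "2010-01-04,89.40,90.00,88.85,89.95,18040000,77.23");
        System.out.println(String.format(Locale.US, "%s %d close=%.2f", r.stock, r.date, r.close));
    }
}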

5.3 Pre-processing of information
Each row of raw data only contains one day's stock price information; the data are discrete, and it is hard to find relationships such as trends in the stock. Therefore some stock information, such as technical indicators, is pre-processed for the testing. The following are some technical indicators that may be used as extra information.

5.3.1 Simple moving average (SMA)
The simple moving average is an unweighted mean of the price over the previous n days. By changing n, the average price of the stock over n days can be shown. SMA can also be applied to volume.
The formula of the SMA at day t is:

Eq. 5.1: $SMA_t(n) = \frac{1}{n} \sum_{i=0}^{n-1} p_{t-i}$

where $p$ is the closing price.

Figure. 5.4 HSBC Holdings plc from 2010-11 to 2011-06; the blue area is the stock price, red is the SMA with n = 15.

5.3.2 Exponential moving average (EMA)
The exponential moving average is a weighted mean of the price over the previous days, similar to the SMA, except that a day further in the past receives a smaller weight and therefore has a smaller effect on the indicator. A common recursive form is:

Eq. 5.2: $EMA_t = \alpha \, p_t + (1 - \alpha) \, EMA_{t-1}, \qquad \alpha = \frac{2}{n + 1}$

Figure. 5.5 HSBC Holdings plc from 2010-11 to 2011-06; blue is the stock price, red is the EMA with n = 15.
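A short Java sketch of the two averages (our own, with toy prices; seeding the recursive EMA with the first price is one common convention):

// SMA (Eq. 5.1) and recursive EMA (Eq. 5.2) over a closing price series.
public class MovingAverages {
    // SMA over the n most recent closing prices ending at index t.
    static double sma(double[] close, int t, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) sum += close[t - i];
        return sum / n;
    }

    // EMA with smoothing factor alpha = 2/(n+1); earlier days get
    // exponentially smaller weight.
    static double ema(double[] close, int t, int n) {
        double alpha = 2.0 / (n + 1);
        double e = close[0];                      // seed with the first price
        for (int i = 1; i <= t; i++) e = alpha * close[i] + (1 - alpha) * e;
        return e;
    }

    public static void main(String[] args) {
        double[] close = {80, 81, 82, 84, 83, 85, 86, 88, 87, 89}; // toy series
        System.out.println("SMA(5) = " + sma(close, close.length - 1, 5));
        System.out.println("EMA(5) = " + ema(close, close.length - 1, 5));
    }
}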


5.3.3 Relative Strength Index (RSI)
The Relative Strength Index is a technical indicator that shows the trend of a given period. It was developed by J. Welles Wilder in 1978. The basic equations of the RSI concept are:

Eq. 5.3: $RSI = 100 - \frac{100}{1 + RS}$

Eq. 5.4: $RS = \frac{\text{average of } U \text{ over } n \text{ days}}{\text{average of } D \text{ over } n \text{ days}}$

For a day on which the price rises:

Eq. 5.5: $U = close_{today} - close_{yesterday}, \qquad D = 0$

For a day on which the price drops:

Eq. 5.6: $D = close_{yesterday} - close_{today}, \qquad U = 0$

5.3.4 Momentum and rate of change
Momentum (MTM) and rate of change (ROC) are similar indicators; both analyse the rate of price change.
For the MTM:

Eq. 5.7: $MTM_t = p_t - p_{t-n}$

For the ROC:

Eq. 5.8: $ROC_t = \frac{p_t - p_{t-n}}{p_{t-n}} \times 100\%$
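A compact Java sketch of these three indicators (our own; note that Wilder's original RSI uses a particular smoothed average, so the simple averages here are an approximation, and the price series is made up):

// RSI (Eq. 5.3 - 5.6), momentum (Eq. 5.7) and rate of change (Eq. 5.8).
public class Indicators {
    static double rsi(double[] close, int t, int n) {
        double up = 0.0, down = 0.0;
        for (int i = t - n + 1; i <= t; i++) {
            double change = close[i] - close[i - 1];
            if (change > 0) up += change; else down -= change; // U and D sums
        }
        if (down == 0) return 100.0;            // no losses: RSI saturates at 100
        double rs = (up / n) / (down / n);      // RS = avg gain / avg loss
        return 100.0 - 100.0 / (1.0 + rs);
    }

    static double mtm(double[] close, int t, int n) {
        return close[t] - close[t - n];
    }

    static double roc(double[] close, int t, int n) {
        return (close[t] - close[t - n]) / close[t - n] * 100.0;
    }

    public static void main(String[] args) {
        double[] close = {80, 81, 82, 84, 83, 85, 86, 88, 87, 89, 90, 92, 91, 93, 95};
        int t = close.length - 1;
        System.out.printf("RSI(14)=%.1f MTM(10)=%.1f ROC(10)=%.1f%%%n",
                rsi(close, t, 14), mtm(close, t, 10), roc(close, t, 10));
    }
}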


6. Experiments and Results
6.1 Trading strategy
A trading strategy is about finding rules or a model that can achieve a higher return; transactions can then depend on the rules found, without further human decisions.
To find the rules of the trading strategy, this part of the project uses classification methods to classify whether the stock has a high chance of rising or dropping in a certain period.

6.1.1 Planning of Trading strategy
In finding a trading strategy by classification, we apply different classification methods to the stock. We will also find out which type of attribute is more suitable for finding the rules.

6.1.2 Choice of stock
For this part, we select 0005.HK for testing. The period is from 20000101 to 20131231: training on 20000101 to 20091231 (about 3000 instances) and testing on 20100101 to 20131231 (about 1000 instances).
The reason for this choice is that over this period the stock shows many types of trend; the more trends that appear, the more accurate the prediction of the stock price can be.


6.1.3 Finding rising / dropping stock in the following 30 days
In this part, we apply two decision tree methods to classify whether the stock has a rising or dropping trend.
6.1.3.1 Flow of testing
1. Introduce the method
2. Introduce the attributes
3. Study the result

6.1.3.2 Choice of method
For the trading strategy part, we choose two decision tree methods for the classification. The first is CART (Classification And Regression Tree); the second is a decision tree built with the ID3 algorithm.
6.1.3.3 Test without using SMA in CART
In this part we use the data without SMA for the test. For the input setting, we guess that the exact price values have their own meaning in the trend.
Attribute:
● past_30price = -30closing_price() — closing price 30 days before the current day
● currentprice = closing_price() — current closing price
● past_30volume = -30closing_volume() — volume 30 days before the current day
● volume = closing_volume() — current volume
● class ∈ {c0, c1, c2} — c0: (+30closing_price)/closing_price > 1.1; c2: (+30closing_price)/closing_price < 0.9; c1: otherwise
(+x means x days after the current day; -y means y days before the current day)
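A tiny Java sketch (ours, with a toy price series) of how these class labels can be generated from the series:

// Assigns c0 / c1 / c2 by comparing the closing price 30 days ahead with the
// current closing price, using the 1.1 / 0.9 thresholds above.
public class Labeler {
    static String label(double[] close, int t) {
        double ratio = close[t + 30] / close[t];
        if (ratio > 1.1) return "c0";  // big rise within 30 days
        if (ratio < 0.9) return "c2";  // big drop within 30 days
        return "c1";                    // otherwise
    }

    public static void main(String[] args) {
        double[] close = new double[40];
        for (int i = 0; i < close.length; i++) close[i] = 100 + i; // toy series
        System.out.println(label(close, 0)); // 130/100 = 1.3 -> c0
    }
}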
This is the resulting tree:
currentprice < 73.7750015258789
| past_30price < 72.95000076293945
| | past_30price < 56.92500114440918
| | | currentprice < 59.77499961853027: c0(23.0/0.0)
| | | currentprice >= 59.77499961853027: c1(27.0/2.0)
| | past_30price >= 56.92500114440918: c0(59.0/7.0)
| past_30price >= 72.95000076293945
| | currentprice < 67.25: c2(25.0/4.0)
| | currentprice >= 67.25
| | | currentprice < 72.95000076293945: c1(16.0/5.0)
| | | currentprice >= 72.95000076293945: c2(7.0/4.0)
currentprice >= 73.7750015258789
| currentprice < 122.45000076293945
| | past_30volume < 1.926585E7
| | | currentprice < 76.6500015258789
| | | | past_30price < 81.17500305175781: c1(7.0/1.0)
| | | | past_30price >= 81.17500305175781: c0(8.0/3.0)
| | | currentprice >= 76.6500015258789
| | | | currentprice < 118.95000076293945: c1(1015.0/160.0)
| | | | currentprice >= 118.95000076293945
| | | | | volume < 5656500.0
| | | | | | currentprice < 120.70000076293945: c2(17.0/2.0)
| | | | | | currentprice >= 120.70000076293945: c1(6.0/2.0)
| | | | | volume >= 5656500.0: c1(69.0/7.0)
| | past_30volume >= 1.926585E7
| | | past_30price < 119.8499984741211
| | | | currentprice < 84.57500076293945
| | | | | past_30price < 80.79999923706055: c1(61.0/4.0)
| | | | | past_30price >= 80.79999923706055
| | | | | | currentprice < 81.82500076293945: c1(23.0/9.0)
| | | | | | currentprice >= 81.82500076293945: c2(16.0/5.0)
| | | | currentprice >= 84.57500076293945: c1(109.0/17.0)
| | | past_30price >= 119.8499984741211
| | | | past_30price < 130.20000457763672: c2(19.0/17.0)
| | | | past_30price >= 130.20000457763672: c1(9.0/1.0)
| currentprice >= 122.45000076293945
| | currentprice < 148.79999542236328: c1(929.0/49.0)
| | currentprice >= 148.79999542236328
| | | past_30price < 139.8499984741211: c2(8.0/1.0)
| | | past_30price >= 139.8499984741211: c1(12.0/3.0)


Study on result:
We guess that the constructed tree depends on the price only, and it is highly related to the price level. The tree shows that it merely memorizes patterns from certain periods (e.g. the node “currentprice < 118.95000076293945” with many classifications under it). Further runs on the testing data show that the tree's decisions are barely related to the outcome: the c0 classification has a true positive rate of 0%.
6.1.3.4 Test with SMA in CART, adding momentum-related attributes and modifying the class
The previous test showed that exact price values lead the tree to memorize price levels, so we add momentum (percentage change) attributes. For the class, after a series of tests we judge that 1.1 and 0.9 are too high / too low as thresholds. By the volatility formula:

Volatility = SD(change) × sqrt(days)

the volatility of 0005 is approximately 0.05, so we modify the thresholds to 1.05 and 0.95.
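A small sketch (our own, with a made-up price series) of this volatility estimate, using daily log changes consistent with the ln-based momentum attributes below:

// Volatility = SD(daily change) * sqrt(days), as used to set the thresholds.
public class Volatility {
    static double volatility(double[] close, int days) {
        double[] change = new double[close.length - 1];
        double mean = 0.0;
        for (int i = 1; i < close.length; i++) {
            change[i - 1] = Math.log(close[i] / close[i - 1]); // daily log change
            mean += change[i - 1];
        }
        mean /= change.length;
        double var = 0.0;
        for (double c : change) var += (c - mean) * (c - mean);
        var /= change.length;                                   // SD^2 of the changes
        return Math.sqrt(var) * Math.sqrt(days);
    }

    public static void main(String[] args) {
        double[] close = {80, 80.5, 79.9, 80.8, 80.2, 81.0, 80.6}; // toy prices
        System.out.println("30-day volatility = " + volatility(close, 30));
    }
}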
Attribute:
● past_30average = average(from -59 to -30 closing_price()) — SMA of price from day -59 to day -30
● 30average = average(from -29 to 0 closing_price()) — SMA of price from day -29 to the current day
● past_30volume = average(from -59 to -30 closing_volume()) — SMA of volume from day -59 to day -30
● 30volume = average(from -29 to 0 closing_volume()) — SMA of volume from day -29 to the current day
● price_MT = ln(30average/past_30average) — percentage change of price
● volume_MT = ln(30volume/past_30volume) — percentage change of volume
● class ∈ {c0, c1, c2} — c0: (+30average)/30average > 1.05; c2: (+30average)/30average < 0.95; c1: otherwise
(+x means x days after the current day; -y means y days before the current day)
The resulting tree:
30average < 83.06250047683716
| past_30average < 75.5733317732811
| | past_30average < 48.232500433921814
| | | 30volume < 4.164597165625E7: c2(3.0/0.0)
| | | 30volume >= 4.164597165625E7: c1(16.0/4.0)
| | past_30average >= 48.232500433921814
| | | 30average < 68.06916570663452: c0(86.0/1.0)
| | | 30average >= 68.06916570663452
| | | | 30volume < 3.4278431484375E7: c0(22.0/2.0)
| | | | 30volume >= 3.4278431484375E7: c1(4.0/0.0)
| past_30average >= 75.5733317732811
| | past_30volume < 1.680683175E7: c0(47.0/1.0)
| | past_30volume >= 1.680683175E7: c2(53.0/0.0)
30average >= 83.06250047683716
| past_30volume < 2.03377383515625E7
| | 30volume < 1.077808501953125E7
| | | past_30volume < 6514848.3203125
| | | | 30average < 114.25833308696747
| | | | | 30average < 88.99583327770233
….
| | | | | | | | price_MT < 0.07610166271935453: c1(5.0/0.0)
| | | | | | | | price_MT >= 0.07610166271935453
| | | | | | | | | past_30average < 133.67000150680542: c1(3.0/1.0)
| | | | | | | | | past_30average >= 133.67000150680542: c0(3.0/0.0)
| | | | | past_30average >= 139.19499897956848
| | | | | | past_30volume < 2.7342849921875E7
| | | | | | | past_30average < 141.0199966430664: c2(17.0/1.0)
| | | | | | | past_30average >= 141.0199966430664
| | | | | | | | past_30volume < 2.29203465625E7: c0(7.0/2.0)
| | | | | | | | past_30volume >= 2.29203465625E7
| | | | | | | | | past_30average < 144.56166315078735: c1(19.0/2.0)
| | | | | | | | | past_30average >= 144.56166315078735: c2(3.0/0.0)
| | | | | | past_30volume >= 2.7342849921875E7: c2(40.0/1.0)


Number of leaf nodes: 140. Please see Appendix A.1 for the fuller tree.

The testing data put into the classification tree:

real \ classified as     c0     c1     c2
c0                      150     26      6
c1                      184    148     37
c2                       45     19     63

True positive rate: c0 82%, c1 40%, c2 50%

Study of the result:
This decision tree no longer depends on only certain attributes. Even though the accuracy in the training stage is not high, the tree performs quite well on the testing data set. With three types of action, an accuracy above 33% should be considered quite good. Also, not many rising signals are turned into dropping signals; the prediction of rises is quite good in this classification.


6.1.3.5 Classification using the ID3 algorithm
In this part of the project, we implement the classification with an ID3 decision tree. Since this implementation of the decision tree only accepts inputs with a fixed set of values, we modify the attributes into {true, false} type.
Attribute:
● pre30pricerise — true if average(from -29 to 0 closing_price()) − average(from -59 to -30 closing_price()) > 0, i.e., the 30-day price SMA has risen
● pbigrise — true if average(from -29 to 0 closing_price()) / average(from -59 to -30 closing_price()) > 1.05, i.e., the price SMA rose by more than 5%
● pbigdrop — true if the same price SMA ratio is < 0.95, i.e., the price SMA dropped by more than 5%
● pre30volumerise — true if the corresponding 30-day volume SMA has risen
● vbigrise — true if the corresponding volume SMA ratio is > 1.05
● vbigdrop — true if the corresponding volume SMA ratio is < 0.95
● class ∈ {c0, c1, c2} — defined as in the previous test: c0: (+30average)/30average > 1.05; c2: (+30average)/30average < 0.95; c1: otherwise

Decision Tree:
pbigdrop = false
| pbigrise = false
| | vbigdrop = false
| | | pre30pricerise = false
| | | | vbigrise = false
| | | | | pre30volumnrise = false: c1
| | | | | pre30volumnrise = true: c1
| | | | vbigrise = true: c1
| | | pre30pricerise = true
| | | | vbigrise = false
| | | | | pre30volumnrise = false: c1
| | | | | pre30volumnrise = true: c1
| | | | vbigrise = true: c1
| | vbigdrop = true
| | | pre30pricerise = false: c1
| | | pre30pricerise = true: c1
| pbigrise = true
| | vbigrise = false
| | | pre30volumnrise = false
| | | | vbigdrop = false: c1
| | | | vbigdrop = true: c1
| | | pre30volumnrise = true: c1
| | vbigrise = true: c1
pbigdrop = true
| vbigdrop = false
| | vbigrise = false
| | | pre30volumnrise = false: c1
| | | pre30volumnrise = true: c1
| | vbigrise = true: c1
| vbigdrop = true: c1

Testing result:

real \ classified as     c0     c1     c2
c0                        0     51      0
c1                        0    565      0
c2                        0     62      0
We guess that using attributes with only 2 values was a total failure: nothing is classified as c0 (big rise) or c2 (big drop).


6.2 Portfolio management
Portfolio management is about distributing the investment over different assets in different proportions. These investments include securities such as shares, bonds, real estate, etc. The main reason for portfolio management is not only to increase the return of the total investment; most basically, it is to increase the return while keeping the volatility within an acceptable range.
This part of the project finds a strategy for choosing stocks that reaches portfolio management's aim: increasing the return and lowering the volatility of the total investment.

6.2.1 Planning of Portfolio management
In this portfolio management we divide the work into two parts for choosing the stocks.
In the first part, we find stocks that have a high chance of rising in the following year. The main reason for this step is to optimize the return of the individual stocks.
In the second part, we select various stocks from the first part to construct a stock combination. The main reason for this step is to optimize the total volatility of the investment combination, reaching a more stable return.

6.2.2 Choice of stock list
For this part, we select the following stocks for finding the stock combination:

0001, 0002, 0003, 0004, 0005, 0006, 0008, 0010, 0011, 0012, 0013, 0014, 0016, 0017, 0019, 0020, 0023, 0041, 0069, 0083, 0097, 0101, 0142, 0179, 0267, 0291, 0293, 0315, 0363, 0941, 1038

These are the stocks which were components of the HSI in 2000.
The reason for choosing stocks from the 2000 list is to get a fair test result. The HSI changes its components depending on the performance of the stocks, and “performance” always means good performance: if a particular stock did not perform well in the past, it was removed. If we chose components from the current list, those stocks would have a higher chance of showing a good (rising) result. So we choose the stocks of the earlier list.


6.2.3 Finding rising stock in following year
In this part we use the same technique as in the trading strategy, CART (Classification and Regression Tree), since this method performed reasonably well in the trading strategy part.
6.2.3.1 Flow of testing
1. Introduce the method
2. Introduce the attributes
3. Study the result

6.2.3.2 Test using exact values
In the first attempt, we use CART to do the classification, applying the method from the trading strategy test with the attributes modified from 30 days to 260 days, which is roughly 1 trading year. Since we want to find the long-period trend, we changed to 260 days.
Here is the attribute table:
● past_260average = average(from -519 to -260 closing_price()) — SMA of price from 2 years ago to 1 year ago
● 260average = average(from -259 to 0 closing_price()) — SMA of price from last year to the current day
● past_260volume = average(from -519 to -260 closing_volume()) — SMA of volume from 2 years ago to 1 year ago
● 260volume = average(from -259 to 0 closing_volume()) — SMA of volume from last year to the current day
● class ∈ {c0, c1, c2} — c0: (+260closing_price)/260average > 1.2; c2: (+260closing_price)/260average < 0.8; c1: otherwise
(+x means x days after the current day; -y means y days before the current day)
Tree result:
past_260volume < 8142249.042480469
| past_260average < 98.87307676672935
| | 260average < 86.81057678163052
| | | past_260average < 89.88461546599865:
| | | past_260average >= 89.88461546599865:
| | 260average >= 86.81057678163052:
| past_260average >= 98.87307676672935:
past_260volume >= 8142249.042480469
| past_260average < 133.7886544317007:
| past_260average >= 133.7886544317007
| | 260average < 126.74211592972279
| | | past_260average < 142.62557727098465:
| | | past_260average >= 142.62557727098465
| | | | 260average < 124.98875057697296:
| | | | 260average >= 124.98875057697296:
| | 260average >= 126.74211592972279
| | | 260average < 128.8205772638321
| | | | past_260average < 142.49903884530067:
| | | | past_260average >= 142.49903884530067:
| | | 260average >= 128.8205772638321:

Study on result:
We guess this is a bad result: just transplanting the same concept to 1 year directly almost amounts to using the stock price level to predict next year's drop/rise. Even though the classification rate is high, we do not think the tree learned anything. Also, the SMA covers a very long period (260 days), only SMA attributes are included for learning, and the volume of the stock has no effect, while we both think volume is important in classification. So we are not going to use this tree.


6.2.3.3 Test using proportion values
In the second attempt we still use CART to do the classification, but we modify the attributes.
Attribute:
● past_260average_MT = closing_price() − average(from -259 to 0 closing_price()) — the current closing price minus the SMA of the price over the last year; used to find the momentum of the stock price in the previous year
● past_260_over_current = closing_price() / average(from -259 to 0 closing_price()) — used to find the momentum rate change of the stock price over the previous year
● past_260volume_MT = closing_volume() − average(from -259 to 0 closing_volume()) — the current volume minus the SMA of the volume over the last year; used to find the momentum of the volume in the previous year
● past_260_over_currentvolume = closing_volume() / average(from -259 to 0 closing_volume()) — used to find the momentum rate change of the stock volume over the previous year
● class ∈ {c0, c1} — c0: (+260closing_price)/closing_price > 1.3; c1: otherwise
(+x means x days after the current day; -y means y days before the current day)


Tree result:
past_260_over_current < 0.9727344170056904
| past_260volume_MT < 4509568.572265625
| | past_260_over_currentvolume < 0.9747692049082883: c1(132.0/0.0)
| | past_260_over_currentvolume >= 0.9747692049082883
| | | past_260average_MT < -26.765577137470245
| | | | past_260average_MT < -28.83451946079731: c0(60.0/0.0)
| | | | past_260average_MT >= -28.83451946079731
| | | | | past_260average_MT < -28.60211554169655: c1(3.0/0.0)
| | | | | past_260average_MT >= -28.60211554169655: c0(16.0/0.0)
| | | past_260average_MT >= -26.765577137470245
| | | | past_260average_MT < -11.35817302763462: c1(114.0/0.0)
| | | | past_260average_MT >= -11.35817302763462
| | | | | past_260volume_MT < 2026206.537109375
| | | | | | past_260average_MT < -7.8269229382276535
| | | | | | | past_260average_MT < -8.202884435653687
| | | | | | | | past_260average_MT < -9.647596016526222
| | | | | | | | | past_260average_MT < -10.551922991871834: c0(9.0/1.0)
| | | | | | | | | past_260average_MT >= -10.551922991871834: c1(7.0/3.0)
| | | | | | | | past_260average_MT >= -9.647596016526222: c0(17.0/0.0)
| | | | | | | past_260average_MT >= -8.202884435653687: c1(6.0/3.0)
| | | | | | past_260average_MT >= -7.8269229382276535: c0(87.0/3.0)
| | | | | past_260volume_MT >= 2026206.537109375
| | | | | | past_260average_MT < -3.1514424234628677: c1(29.0/0.0)
| | | | | | past_260average_MT >= -3.1514424234628677
| | | | | | | past_260_over_currentvolume < 1.3149355996046213: c1(18.0/0.0)
| | | | | | | past_260_over_currentvolume >= 1.3149355996046213
| | | | | | | | past_260volume_MT < 2363666.533203125: c0(23.0/1.0)
| | | | | | | | past_260volume_MT >= 2363666.533203125
| | | | | | | | | past_260_over_current < 0.9664495898604564: c1(4.0/0.0)
| | | | | | | | | past_260_over_current >= 0.9664495898604564: c0(2.0/0.0)
| past_260volume_MT >= 4509568.572265625: c1(310.0/0.0)
past_260_over_current >= 0.9727344170056904: c1(1230.0/0.0)

Study on result:
We modified the attributes of the stock to ratios instead of exact values. However, from the tree-building result we guess there is overfitting: the tree depends almost entirely on the proportion between the past and current stock price. The decision always ends at the first step, and the classifications seem unrelated to anything else. We guess such a straightforward view does not make a good decision tree.


6.2.3.4 Summary on choice of first step
The decision is not good for longer period (1 year) classification. So we think choosing high rising chance stock in yearly is difficult to do. We guess this is due to the pattern of yearly data is so small for building the decision tree. So the tree will always obey into certain direction.


6.2.4 Choosing low-relation stock from the list
In this part, we divide the stocks into different categories, where each category contains a different kind of stock and stocks in the same category have similar properties. For the test we aim to use the percentage change.
6.2.4.1 Flow of testing
1. Introduce the method
2. Introduce the attributes
3. Study the result

6.2.4.2 Method
K-means clustering is a method that can define the number of k cluster from the existing data. The value of k is given by user. Every cluster will have its centroid.
After the process of clustering, the data will be divided into k group.
Working principle: it firstly random define k point as centroid if any of cluster have changing for each data point for each centroid calculate the distance between cluster and data point divide the point into nearest cluster for each cluster, reconstruct a new centroid by calculate the means of data

after the process of clustering, the data point will be divided into k groups.


Figure. 6.1 Clustering with k = 3.
There are many types of distance measures. Details are as follows:

1. Euclidean distance: Eq. 6.1: $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$

2. Manhattan distance: Eq. 6.2: $d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$

where n represents the dimension and p and q represent point 1 and point 2.
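A minimal, self-contained Java sketch of k-means with Euclidean distance (ours: it uses a fixed number of iterations and seeds the centroids with the first k points, and the sample data are made up; each row would be the 36 monthly percentage changes of one stock):

// Sketch of k-means clustering with Euclidean distance (Eq. 6.1).
public class KMeans {
    static int[] cluster(double[][] data, int k, int iterations) {
        double[][] centroid = new double[k][];
        for (int c = 0; c < k; c++) centroid[c] = data[c].clone(); // simple seeding (needs k <= data.length)
        int[] assign = new int[data.length];
        for (int it = 0; it < iterations; it++) {
            // assignment step: put each point into the nearest cluster
            for (int i = 0; i < data.length; i++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = 0.0;
                    for (int j = 0; j < data[i].length; j++) {
                        double diff = data[i][j] - centroid[c][j];
                        d += diff * diff;              // squared Euclidean distance
                    }
                    if (d < best) { best = d; assign[i] = c; }
                }
            }
            // update step: each centroid becomes the mean of its points
            for (int c = 0; c < k; c++) {
                double[] sum = new double[data[0].length];
                int count = 0;
                for (int i = 0; i < data.length; i++)
                    if (assign[i] == c) {
                        count++;
                        for (int j = 0; j < sum.length; j++) sum[j] += data[i][j];
                    }
                if (count > 0)
                    for (int j = 0; j < sum.length; j++) centroid[c][j] = sum[j] / count;
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] changes = {{0.02, 0.03}, {0.025, 0.028}, {-0.04, -0.05}, {-0.045, -0.04}};
        System.out.println(java.util.Arrays.toString(cluster(changes, 2, 10)));
    }
}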


6.2.4.3 Testing using monthly percentage change data
In this part, we are using K means clustering to divide the stock into 5 set. For each attribute of testing, is the percentage change of certain month. The reason of setting this input is that similar stock will have high chance that they have similar moving trend.
Attribute:
● percentage_change1 = ln(closing_price() / -30closing_price()) — the percentage change of the stock in one month
● ... percentage_change2 to percentage_change36 — the same, repeated for 36 months (3 years of data)
(-y means y days before the current day)

The result of the clustering is as follows. Since the data form a 36-dimensional space that cannot be plotted directly, the actual clustering is:
Cluster / Stock list / Number of stocks
● 0 / 0004, 0010, 0012, 0014, 0016, 0017, 0019, 0020, 0023, 0041, 0069, 0083, 0101, 0267 / 14
● 1 / 0002, 0003, 0006, 0011, 0097, 0142, 0179, 0293, 0315, 0941, 1038 / 11
● 2 / 0001, 0013 / 2
● 3 / 0291, 0363 / 2
● 4 / 0005 / 1

Study on result:
The clustering result is quite similar to the conventional division: people with business knowledge would cluster the stocks in much the same way. For example, all the utility stocks (0002, 0003, 0006) are in cluster 1, and the Hutchison stocks (0001, 0013) are in cluster 2 (though one related stock, 1038, is in cluster 1).
The result may seem useless for stock selection: one may think that, even without the k-means algorithm, the stocks could still be divided into different sectors by industry, so choosing stocks this way is no great problem.
However, the result shows more than similar stocks landing in the same cluster: stocks in the same cluster have similar price-change directions, so investments within the same cluster will tend to have similar returns over a certain period.
For portfolio management, we may therefore divide the investment equally across the different clusters.


7. Conclusion
For this project, we divided the work into 2 parts: trading strategy and portfolio management. The portfolio management part was further divided into (1) increasing the return and (2) lowering the volatility of the portfolio.
The aim of this project is to investigate what kinds of methods can be used to increase the return of an investment. When investing in the stock market, increasing the rate of successful prediction by even 1% may bring a large profit.
In the trading strategy part, we found quite a good input set for training rules over a short period. For the stock market, numerical input is better than category-only input. For the same input, we found that the choice of algorithm is also important in the application, and that attributes may have complex relationships.
However, the same method cannot fit every situation. We could not find useful inputs for increasing the return in portfolio management (long-period stock selection): the long-term (yearly) test is not as sensitive to the same inputs as the shorter (monthly) one, so its results are not as good as the previous ones.
For lowering the volatility, we used the k-means clustering method. It separates the stocks in a way similar to the industry categories found on stock websites. There could be further applications of the group selection; however, for simplicity, the model does the simplest thing to reduce volatility: hold an equal percentage of stock in each of the different groups, and the volatility may then be lower than before (assuming prices move randomly).

Some of the inputs may seem quite simple for classifying stocks. However, in testing we found that a good setup of the model is more important than extra input. In fact, there are only two variables coming from the stock market, price and volume; every other attribute is just a derivative of these two. Some of the tests are just prototypes of the concept. The most important outcome of this project is that we found which attributes are related to the result, which is important for further improvement of this part of the program.


8. Difficulties and Challenges
Lack of knowledge of finance open source software:
For the experiments we always need to download the data and process them into a form that is easy to analyse. However, neither of us had much idea of the available software and what it can do. We spent half a semester finding and setting up a complex piece of software that, in the end, did not work. Once we found Robotrader, a relatively simple application, we spent most of the remaining time studying its structure and code.
EOD data limitation:
We found that EOD data is not suitable for finding long-period trends of the stock price (we also do not know whether any long-period trend exists); maybe the problem is simply that we did not transform the data into useful input.
Trading period limitation:
Since we are not doing high-frequency trading, many technical indicators do not work when converted into input.
Machine learning method choosing:
In the first semester we chose neural networks to train the strategy. However, a neural network is a “black box” method: it is hard to determine whether the result is correct until the testing results are in, and it was quite hard for us to improve the performance of the prediction. So we chose another type of prediction method, the decision tree, which generates explicit rules and is easier for us to improve.

9. Contribution of Work
1st Semester:
Last summer, my final year project partner and I decided to do a project on stock price prediction using a neural network approach. After agreeing on this topic, we started to do some research and to test the systems that we found and built. However, the results were quite disappointing, since we could not fully understand the whole picture of neural networks and how they work for stock price prediction.
We were confused about the application of recurrent versus feed-forward neural networks.
Therefore we could not obtain what we expected from this project in the 1st semester. After the project presentation, our supervisor suggested we change the topic: so much research related to it has been launched in the past decade that it is hard to make a great advance at the present stage.
2nd Semester
After the failure in the 1st semester, we discussed what had made us fail and finally decided to change the topic to a more appropriate one: investigating different trading strategies and how to apply them to portfolio selection, rather than focusing only on trading strategies.
In this semester we use different data mining methods to achieve our goal. I suggested using CART, while my partner Frank suggested using ID3. Rather than argue with each other, we decided to do both together, and the result is quite encouraging.


Finally, I want to thank my partner Frank. This project has been a great learning experience: without his suggestion of the topic, we would barely have had the chance to learn in this area. In addition, he supported me a lot and always provided suggestions. This project would not have been done without his effort.



Appendix A
A.1 Decision Tree from CART
(6.1.3.4 Test with using SMA in CART by adding momentum related attribute and modify the class)
30average < 83.06250047683716
| past_30average < 75.5733317732811
| | past_30average < 48.232500433921814
| | | 30volume < 4.164597165625E7: c2(3.0/0.0)
| | | 30volume >= 4.164597165625E7: c1(16.0/4.0)
| | past_30average >= 48.232500433921814
| | | 30average < 68.06916570663452: c0(86.0/1.0)
| | | 30average >= 68.06916570663452
| | | | 30volume < 3.4278431484375E7: c0(22.0/2.0)
| | | | 30volume >= 3.4278431484375E7: c1(4.0/0.0)
| past_30average >= 75.5733317732811
| | past_30volume < 1.680683175E7: c0(47.0/1.0)
| | past_30volume >= 1.680683175E7: c2(53.0/0.0)
30average >= 83.06250047683716
| past_30volume < 2.03377383515625E7
| | 30volume < 1.077808501953125E7
| | | past_30volume < 6514848.3203125
| | | | 30average < 114.25833308696747
| | | | | 30average < 88.99583327770233
| | | | | | past_30average < 86.87916648387909: c0(2.0/0.0)
| | | | | | past_30average >= 86.87916648387909: c1(43.0/0.0)
| | | | | 30average >= 88.99583327770233
| | | | | | past_30average < 89.0083338022232: c0(27.0/1.0)
| | | | | | past_30average >= 89.0083338022232
| | | | | | | past_30average < 97.06666696071625
| | | | | | | | volume_MT < -0.21804166104012898: c2(30.0/2.0)
| | | | | | | | volume_MT >= -0.21804166104012898
| | | | | | | | | past_30average < 95.23333370685577
| | | | | | | | | | past_30average < 89.57499992847443: c1(6.0/0.0)
| | | | | | | | | | past_30average >= 89.57499992847443
| | | | | | | | | | | 30average < 90.00833404064178: c0(4.0/2.0)
| | | | | | | | | | | 30average >= 90.00833404064178
| | | | | | | | | | | | 30volume < 6488559.9765625: c2(21.0/1.0)
| | | | | | | | | | | | 30volume >= 6488559.9765625
| | | | | | | | | | | | | past_30average < 90.30833351612091
| | | | | | | | | | | | | | past_30average < 89.8125: c0(2.0/1.0)
| | | | | | | | | | | | | | past_30average >= 89.8125: c2(7.0/0.0)
| | | | | | | | | | | | | past_30average >= 90.3083335: c1(6.0/1.0)
| | | | | | | | | past_30average >= 95.23333370685577: c1(8.0/0.0)
| | | | | | | past_30average >= 97.06666696071625
| | | | | | | | past_30average < 116.51666676998138
| | | | | | | | | past_30average < 97.9208334684372: c1(10.0/0.0)
| | | | | | | | | past_30average >= 97.9208334684372
| | | | | | | | | | past_30average < 109.94166707992554
| | | | | | | | | | | 30volume < 5409938.328125: c1(2.0/0.0)

| | | | | | | | 30volume >= 5409938.328125
| | | | | | | | | volume_MT < -0.2725792239463332: c1(2.0/0.0)
| | | | | | | | | volume_MT >= -0.272579223: c0(9.0/1.0)
| | | | | | | past_30average >= 109.94166707992554: c1(9.0/1.0)
| | | | | past_30average >= 116.51666676998138: c0(15.0/2.0)
| 30average >= 114.25833308696747: c2(47.0/0.0)
past_30volume >= 6514848.3203125
| past_30average < 85.07083308696747
| | price_MT < -0.11886135627750044: c2(8.0/0.0)
| | price_MT >= -0.11886135627750044
| | | past_30volume < 9937318.3203125
| | | | price_MT < -0.05290842702612067: c2(9.0/3.0)
| | | | price_MT >= -0.05290842702612067: c1(11.0/3.0)
| | | past_30volume >= 9937318.3203125: c1(52.0/0.0)
| past_30average >= 85.07083308696747
| | 30volume < 8215306.63671875
| | | 30volume < 6421098.345703125
| | | | 30average < 87.20416641235352: c0(17.0/0.0)
| | | | 30average >= 87.20416641235352
| | | | | volume_MT < 0.23452068635315515
| | | | | | 30average < 89.77500057220459: c0(13.0/0.0)
| | | | | | 30average >= 89.77500057220459
| | | | | | | past_30volume < 7179101.6484375
| | | | | | | | past_30average < 111.65833270549774
| | | | | | | | | price_MT < -0.010370028402811551
| | | | | | | | | | 30average < 91.65833365917206
| | | | | | | | | | | past_30average < 89.25: c1(2.0/0.0)
| | | | | | | | | | | past_30average >= 89.25: c2(3.0/0.0)
| | | | | | | | | | 30average >= 91.658333: c1(13.0/0.0)
| | | | | | | | | price_MT >= -0.010370028402811551
| | | | | | | | | | 30volume < 5697111.71875: c1(5.0/2.0)
| | | | | | | | | | 30volume >= 5697111.71875: c0(3.0/0.0)
| | | | | | | | past_30average >= 111.6583327: c2(2.0/0.0)
| | | | | | | past_30volume >= 7179101.6484375: c0(11.0/3.0)
| | | | | volume_MT >= 0.23452068635315515
| | | | | | past_30average < 108.6958349943161
| | | | | | | 30average < 96.48333370685577
| | | | | | | | volume_MT < 0.5989818068801052: c1(25.0/10.0)
| | | | | | | | volume_MT >= 0.5989818068801052
| | | | | | | | | past_30average < 93.6958335: c2(5.0/0.0)
| | | | | | | | | past_30average >= 93.6958335: c1(5.0/2.0)
| | | | | | | 30average >= 96.48333370685577: c2(15.0/2.0)
| | | | | | past_30average >= 108.6958349943161: c1(12.0/0.0)
| | | 30volume >= 6421098.345703125
| | | | past_30volume < 1.365124340625E7
| | | | | past_30volume < 6653846.6640625
| | | | | | past_30average < 100.30000030994415
| | | | | | | 30volume < 6828783.34765625: c2(2.0/1.0)
| | | | | | | 30volume >= 6828783.34765625: c1(3.0/0.0)
| | | | | | past_30average >= 100.30000030994415: c0(3.0/0.0)
| | | | | past_30volume >= 6653846.6640625
| | | | | | 30average < 108.05000042915344
| | | | | | | 30average < 98.37916719913483
| | | | | | | | price_MT < 0.028488586996955848
| | | | | | | | | past_30average < 88.2708331: c1(17.0/1.0)

| | | | | | | past_30average >= 88.27083313465118
| | | | | | | | past_30average < 88.5124998: c0(8.0/0.0)
| | | | | | | | past_30average >= 88.51249980926514
| | | | | | | | | past_30average < 89.51666688919067
| | | | | | | | | | past_30average < 89.30833351612091
| | | | | | | | | | | 30average < 88.00416672229767
| | | | | | | | | | | | price_MT < 0.022957: c0(4.0/0.0)
| | | | | | | | | | | | price_MT >= 0.022957: c1(8.0/1.0)
| | | | | | | | | | | 30average >= 88.004166: c1(6.0/0.0)
| | | | | | | | | | past_30average >= 89.3083: c0(5.0/0.0)
| | | | | | | | | past_30average >= 89.51666: c1(10.0/0.0)
| | | | | | price_MT >= 0.028488586996955848: c1(32.0/1.0)
| | | | | 30average >= 98.37916719913483: c0(6.0/0.0)
| | | | 30average >= 108.05000042915344: c1(61.0/5.0)
| | past_30volume >= 1.365124340625E7: c0(8.0/0.0)
30volume >= 8215306.63671875
| past_30average < 120.50833451747894
| | 30average < 93.59583342075348
| | | 30average < 87.72083365917206
| | | | past_30volume < 9248506.66796875
| | | | | past_30average < 88.01249969005585
| | | | | | 30volume < 9010618.3359375: c0(3.0/0.0)
| | | | | | 30volume >= 9010618.3359375: c1(16.0/2.0)
| | | | | past_30average >= 88.01249969005585: c0(11.0/0.0)
| | | | past_30volume >= 9248506.66796875: c0(18.0/0.0)
| | | 30average >= 87.72083365917206
| | | | 30volume < 9662961.67578125: c1(28.0/2.0)
| | | | 30volume >= 9662961.67578125
| | | | | past_30average < 87.6749997138977: c0(2.0/0.0)
| | | | | past_30average >= 87.6749997138977: c2(9.0/3.0)
| | 30average >= 93.59583342075348
| | | 30average < 107.48333442211151
| | | | price_MT < -0.027045834387435536: c0(61.0/3.0)
| | | | price_MT >= -0.027045834387435536
| | | | | price_MT < -0.01687964036108963: c1(7.0/2.0)
| | | | | price_MT >= -0.01687964036108963: c0(7.0/1.0)
| | | 30average >= 107.48333442211151
| | | | past_30average < 103.18750047683716
| | | | | past_30average < 93.69999957084656: c2(2.0/0.0)
| | | | | past_30average >= 93.69999957084656: c1(20.0/0.0)
| | | | past_30average >= 103.18750047683716
| | | | | 30average < 123.299999833107
| | | | | | 30average < 116.06666648387909: c0(16.0/2.0)
| | | | | | 30average >= 116.06666648387909
| | | | | | | 30volume < 1.0631763359375E7
| | | | | | | | past_30average < 118.69166767597198
| | | | | | | | | past_30average < 106.08333: c0(2.0/0.0)
| | | | | | | | | past_30average >= 106.0833: c1(24.0/5.0)
| | | | | | | | past_30average >= 118.69166767: c2(3.0/0.0)
| | | | | | | 30volume >= 1.0631763359375E7: c0(5.0/0.0)
| | | | | 30average >= 123.299999833107: c0(15.0/0.0)
| past_30average >= 120.50833451747894
| | past_30volume < 8881448.365234375: c2(5.0/0.0)
| | past_30volume >= 8881448.365234375
| | | past_30average < 127.58333241939545: c1(17.0/0.0)

| | | | | | past_30average >= 127.58333241939545: c2(2.0/0.0)
30volume >= 1.077808501953125E7
| 30volume < 2.28420066484375E7
| | past_30average < 92.55833280086517
| | | past_30average < 90.25416588783264
| | | | past_30average < 87.72916662693024
| | | | | past_30volume < 9556166.65625: c0(13.0/0.0)
| | | | | past_30volume >= 9556166.65625: c1(8.0/0.0)
| | | | past_30average >= 87.72916662693024: c1(15.0/1.0)
| | | past_30average >= 90.25416588783264
| | | | 30volume < 1.21910567578125E7: c2(15.0/1.0)
| | | | 30volume >= 1.21910567578125E7: c1(3.0/0.0)
| | past_30average >= 92.55833280086517
| | | past_30average < 131.8416666984558
| | | | 30volume < 2.03765516015625E7
| | | | | 30volume < 1.136689003515625E7
| | | | | | price_MT < -0.04136368921396885: c0(6.0/0.0)
| | | | | | price_MT >= -0.04136368921396885
| | | | | | | past_30average < 129.14999961853027
| | | | | | | | 30average < 115.40833258628845: c0(3.0/1.0)
| | | | | | | | 30average >= 115.40833258628845: c1(45.0/6.0)
| | | | | | | past_30average >= 129.14999961853027: c2(3.0/0.0)
| | | | | 30volume >= 1.136689003515625E7
| | | | | | past_30volume < 1.038330999609375E7
| | | | | | | past_30volume < 1.0308056671875E7
| | | | | | | | price_MT < -0.013769810465573058: c1(28.0/0.0)
| | | | | | | | price_MT >= -0.013769810465573058
| | | | | | | | | past_30average < 123.60833239555359
| | | | | | | | | | 30average < 124.09166646003723: c1(29.0/0.0)
| | | | | | | | | | 30average >= 124.09166646003723: c2(4.0/0.0)
| | | | | | | | | past_30average >= 123.60833239555: c2(7.0/0.0)
| | | | | | | past_30volume >= 1.0308056671875E7: c2(3.0/0.0)
| | | | | | past_30volume >= 1.038330999609375E7
| | | | | | | past_30volume < 2.006368996875E7
| | | | | | | | price_MT < 0.04761932146681137: c1(345.0/4.0)
| | | | | | | | price_MT >= 0.04761932146681137
| | | | | | | | | past_30average < 122.23333370685: c0(4.0/0.0)
| | | | | | | | | past_30average >= 122.23333370685: c1(17.0/0.0)
| | | | | | | past_30volume >= 2.006368996875E7
| | | | | | | | past_30average < 129.7716679573059: c2(2.0/0.0)
| | | | | | | | past_30average >= 129.7716679573059: c1(2.0/0.0)
| | | | 30volume >= 2.03765516015625E7
| | | | | 30average < 127.9516670703888: c1(14.0/0.0)
| | | | | 30average >= 127.9516670703888: c0(10.0/2.0)
| | | past_30average >= 131.8416666984558
| | | | 30average < 140.32666850090027
| | | | | past_30average < 141.5266673564911
| | | | | | 30volume < 1.3384851640625E7: c2(8.0/0.0)
| | | | | | 30volume >= 1.3384851640625E7
| | | | | | | 30average < 140.01333475112915
| | | | | | | | 30average < 128.41166615486145
| | | | | | | | | past_30volume < 1.584262825E7
| | | | | | | | | | past_30average < 132.9549994468: c2(3.0/0.0)
| | | | | | | | | | past_30average >= 132.9549994468: c1(8.0/0.0)
| | | | | | | | | past_30volume >= 1.584262825E7: c0(8.0/1.0)

| | | | | | | | | 30average >= 128.41166615486145: c1(91.0/7.0)
| | | | | | | | 30average >= 140.01333475112915
| | | | | | | | | past_30average < 135.96666598320007: c1(3.0/0.0)
| | | | | | | | | past_30average >= 135.9666659832000: c0(14.0/3.0)
| | | | | | past_30average >= 141.5266673564911: c0(24.0/1.0)
| | | | | 30average >= 140.32666850090027
| | | | | | past_30volume < 1.9316740078125E7
| | | | | | | past_30volume < 1.3976396609375E7
| | | | | | | | price_MT < -0.03302021305750843
| | | | | | | | | past_30average < 140.36666870117188: c2(8.0/0.0)
| | | | | | | | | past_30average >= 140.36666870117188: c1(2.0/0.0)
| | | | | | | | price_MT >= -0.03302021305750843: c1(28.0/1.0)
| | | | | | | past_30volume >= 1.3976396609375E7: c1(125.0/7.0)
| | | | | | past_30volume >= 1.9316740078125E7
| | | | | | | past_30average < 145.57166695594788
| | | | | | | | past_30average < 144.38000059127808: c2(3.0/0.0)
| | | | | | | | past_30average >= 144.38000059127808: c1(4.0/0.0)
| | | | | | | past_30average >= 145.57166695594788: c2(4.0/0.0)
| | 30volume >= 2.28420066484375E7
| | | 30average < 122.91666626930237: c2(21.0/2.0)
| | | 30average >= 122.91666626930237
| | | | 30average < 142.6049988269806: c0(11.0/1.0)
| | | | 30average >= 142.6049988269806: c1(11.0/1.0)
past_30volume >= 2.03377383515625E7
| past_30average < 117.99999868869781
| | price_MT < 0.12036383786541163
| | | past_30volume < 2.35881233125E7: c2(6.0/1.0)
| | | past_30volume >= 2.35881233125E7
| | | | past_30average < 81.12166583538055
| | | | | past_30average < 76.39416575431824
| | | | | | past_30average < 68.06916570663452: c0(2.0/0.0)
| | | | | | past_30average >= 68.06916570663452
| | | | | | | past_30average < 72.4608324766159
| | | | | | | | past_30average < 70.5474990606308: c1(3.0/1.0)
| | | | | | | | past_30average >= 70.5474990606308: c2(3.0/0.0)
| | | | | | | past_30average >= 72.4608324766159: c1(6.0/0.0)
| | | | | past_30average >= 76.39416575431824: c0(9.0/0.0)
| | | | past_30average >= 81.12166583538055: c1(41.0/2.0)
| | price_MT >= 0.12036383786541163: c2(10.0/0.0)
| past_30average >= 117.99999868869781
| | 30average < 121.36500036716461
| | | 30average < 113.46416628360748
| | | | 30average < 94.62666630744934
| | | | | past_30average < 120.43833386898041: c2(5.0/0.0)
| | | | | past_30average >= 120.43833386898041: c1(4.0/0.0)
| | | | 30average >= 94.62666630744934: c2(15.0/0.0)
| | | 30average >= 113.46416628360748: c0(36.0/0.0)
| | 30average >= 121.36500036716461
| | | past_30average < 129.61166787147522: c2(52.0/0.0)
| | | past_30average >= 129.61166787147522
| | | | past_30average < 139.19499897956848
| | | | | 30volume < 2.393230975E7: c1(11.0/3.0)
| | | | | 30volume >= 2.393230975E7
| | | | | | 30volume < 2.48086630625E7: c0(2.0/0.0)
| | | | | | 30volume >= 2.48086630625E7

| | | price_MT < 0.07610166271935453: c1(5.0/0.0)
| | | price_MT >= 0.07610166271935453
| | | | past_30average < 133.67000150680542: c1(3.0/1.0)
| | | | past_30average >= 133.67000150680542: c0(3.0/0.0)
past_30average >= 139.19499897956848
| past_30volume < 2.7342849921875E7
| | past_30average < 141.0199966430664: c2(17.0/1.0)
| | past_30average >= 141.0199966430664
| | | past_30volume < 2.29203465625E7: c0(7.0/2.0)
| | | past_30volume >= 2.29203465625E7
| | | | past_30average < 144.56166315078735: c1(19.0/2.0)
| | | | past_30average >= 144.56166315078735: c2(3.0/0.0)
| past_30volume >= 2.7342849921875E7: c2(40.0/1.0)

A.2 ID3 Testing Result
(6.1.3.5 Classification of using ID3 algorithm)
Confusion matrix (rows: real class; columns: classified as):

            C0     C1     C2
Real C0      0     51      0
Real C1      0    565      0
Real C2      0     62      0

ID3 classified every sample as c1.

Comparing with the CART algorithm:

True positive rate:
          ID3    CART
C0        0      0.82
C1        1      0.40
C2        0      0.50
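The true positive rates above follow directly from the confusion matrix. Below is a minimal sketch of the computation, assuming rows are the real classes (the class and method names are illustrative only):

// Hypothetical helper: per-class true positive rate from a confusion matrix
// whose rows are real classes and whose columns are predicted classes.
public class TprDemo {
    static double[] truePositiveRates(int[][] m) {
        double[] tpr = new double[m.length];
        for (int c = 0; c < m.length; c++) {
            int rowTotal = 0;
            for (int p = 0; p < m[c].length; p++) {
                rowTotal += m[c][p];
            }
            tpr[c] = (rowTotal == 0) ? 0.0 : (double) m[c][c] / rowTotal;
        }
        return tpr;
    }

    public static void main(String[] args) {
        // ID3 matrix from A.2: every sample was classified as c1.
        int[][] id3 = { { 0, 51, 0 }, { 0, 565, 0 }, { 0, 62, 0 } };
        double[] tpr = truePositiveRates(id3);
        for (int c = 0; c < tpr.length; c++) {
            System.out.printf("c%d TPR = %.2f%n", c, tpr[c]); // 0.00, 1.00, 0.00
        }
    }
}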


A.3 Sample Data used in ID3 Algorithm
(6.1.3.5 Classification of using ID3 algorithm)

A.4 Sample Data used in CART Algorithm
(6.1.3.3 Test without using SMA in CART)


A.5 Sample Data used in K-means Clustering
(6.2.4.3 Testing using monthly percentage change data)

0001.hk and 0002.hk:
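As a sketch of the preprocessing behind this appendix, the hypothetical helper below (not the project's code) computes the monthly percentage change series, from month-end closing prices, that serves as the K-means input:

// Illustrative sketch (hypothetical helper): monthly percentage changes
// used as K-means input features.
public class MonthlyChange {
    static double[] monthlyPctChange(double[] monthEndClose) {
        double[] pct = new double[monthEndClose.length - 1];
        for (int i = 1; i < monthEndClose.length; i++) {
            pct[i - 1] = (monthEndClose[i] - monthEndClose[i - 1])
                         / monthEndClose[i - 1] * 100.0;
        }
        return pct;
    }
}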

