Active Learning with Support Vector Machines

Kim Steenstrup Pedersen
Department of Computer Science
University of Copenhagen
2200 Copenhagen, Denmark
kimstp@di.ku.dk

Jan Kremer
Department of Computer Science
University of Copenhagen
2200 Copenhagen, Denmark
jan.kremer@di.ku.dk

Christian Igel
Department of Computer Science
University of Copenhagen
2200 Copenhagen, Denmark
igel@di.ku.dk

Abstract
In machine learning, active learning refers to algorithms that autonomously select the data points from which they will learn. There are many data mining applications in which large amounts of unlabeled data are readily available, but labels
(e.g., human annotations or results from complex experiments) are costly to obtain. In such scenarios, an active learning algorithm aims at identifying data points that, if labeled and used for training, would most improve the learned model. Labels are then obtained only for the most promising data points. This speeds up learning and reduces labeling costs. Support vector machine (SVM) classifiers are particularly well-suited for active learning due to their convenient mathematical properties. They perform linear classification, typically in a kernel-induced feature space, which makes measuring the distance of a data point from the decision boundary straightforward. Furthermore, heuristics can efficiently estimate how strongly learning from a data point influences the current model. This information can be used to actively select training samples. After a brief introduction to the active learning problem, we discuss different query strategies for selecting informative data points and review how these strategies give rise to different variants of active learning with SVMs.

1 Introduction

In many applications of supervised learning in data mining, huge amounts of unlabeled data samples are cheaply available while obtaining their labels for training a classifier is costly. To minimize labeling costs, we want to request labels only for potentially informative samples. These are usually the ones that we expect to improve the accuracy of the classifier to the greatest extent when used for training. Another consideration is the reduction of training time. Even when all samples are labeled, we may want to consider only a subset of the available data because training the classifier of choice using all the data might be computationally too demanding. Instead of sampling a subset uniformly at random, which is referred to as passive learning, we would like to select informative samples to maximize accuracy with less training data. Active learning denotes the process of autonomously selecting promising data points to learn from. By choosing samples actively, we introduce a selection bias. This violates the assumption underlying most learning algorithms that training and test data are identically distributed: an issue we have to address to avoid detrimental effects on the generalization performance.

In theory, active learning is possible with any classifier that is capable of passive learning. This review focuses on the support vector machine (SVM) classifier. It is a state-of-the-art method, which has proven to give highly accurate results in the passive learning scenario and which has some favorable properties that make it especially suitable for active learning: (i) SVMs learn a linear decision boundary, typically in a kernel-induced feature space. Measuring the distance of a sample to this boundary is straightforward and provides an estimate of its informativeness. (ii)
Efficient online learning algorithms make it possible to obtain a sufficiently accurate approximation of the optimal SVM solution without retraining on the whole dataset. (iii) The SVM can weight the influence of single samples in a simple manner. This allows for compensating the selection bias that active learning introduces.

2 Active Learning

In the following we focus on supervised learning for classification. There also exists a body of work on active learning with SVMs in other settings such as regression [15] and ranking [8, 48]. A discussion of these settings is, however, beyond the scope of this article.
The training set is given by L = {(x_1, y_1), ..., (x_ℓ, y_ℓ)} ⊂ X × Y. It consists of labeled samples that are drawn independently from an unknown distribution D. This distribution is defined over X × Y, the cross product of a feature space X and a label space Y, with Y = {−1, 1} in the binary case. We try to infer a hypothesis f : X → Z mapping inputs to a prediction space Z for predicting the labels of samples drawn from D. To measure the quality of our prediction, we define a loss function L : Z × X × Y → R⁺. Thus, our learning goal is minimizing the expected loss

$$R(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\bigl[L(f(x), x, y)\bigr], \qquad (1)$$

which is called the risk of f. We call the average loss over a finite sample L the training error or empirical risk. If a loss function does not depend on the second argument, we simply omit it.
In sampling-based active learning, there are two scenarios: stream-based and pool-based. In stream-based active learning, we analyze incoming unlabeled samples sequentially, one sample at a time.
In contrast, in pool-based active learning we have access to a pool of unlabeled samples at once. In this case, we can rank samples based on a selection criterion and query the most informative ones.
Although some of the methods in this review are also applicable to stream-based learning, most of them consider the pool-based scenario. In the case of pool-based active learning, we have, in addition to the labeled set L, access to a set of m unlabeled samples U = {x_{ℓ+1}, ..., x_{ℓ+m}}. We assume that there exists a way to provide us with a label for any sample from this set (the probability of the label is given by D conditioned on the sample). This may involve labeling costs, and the number of queries we are allowed to make may be restricted by a budget. After labeling a sample, we simply add it to our training set.
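The pool-based setting just described can be written as a simple query loop. The following sketch is illustrative only and not taken from the article; `train_svm`, `query_strategy`, and `oracle` are hypothetical placeholders for a concrete SVM trainer, one of the selection criteria discussed below, and the (costly) labeling process.

```python
def pool_based_active_learning(labeled, unlabeled, oracle, train_svm,
                               query_strategy, budget):
    """Generic pool-based active learning loop (illustrative sketch).

    labeled:   list of (x, y) pairs, the initial training set L
    unlabeled: list of samples x, the pool U
    oracle:    callable x -> y, provides a label at some cost
    budget:    maximum number of label queries
    """
    model = train_svm(labeled)
    for _ in range(budget):
        if not unlabeled:
            break
        # Rank the pool with the chosen selection criterion and
        # query the most informative sample.
        x_query = query_strategy(model, unlabeled)
        y_query = oracle(x_query)
        unlabeled.remove(x_query)
        labeled.append((x_query, y_query))
        model = train_svm(labeled)  # or update incrementally (Section 7.1)
    return model
```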
In general, we aim at achieving a minimum risk by requesting as few labels as possible. We can estimate this risk by computing the average error over an independent test set not used in the training process. Ultimately, we hope to require fewer labeled samples for inferring a hypothesis that performs as well as a hypothesis generated by passive learning on L together with a completely labeled U.
In practice, one can profit from an active learner if only a few labeled samples are available and labeling is costly, or when learning has to be restricted to a subset of the available data to render the computation feasible. A list of real-world applications is given in the general active learning survey [39] and in a review paper which considers active learning for natural language processing [29].
Active learning can also be employed in the context of transfer learning [41]. In this setting, samples from the unlabeled target domain are selected for labeling and included in the source domain.
A classifier trained on the augmented source dataset can then exploit the additional samples to increase its accuracy in the target domain. This technique has been used successfully, for example, in an astronomy application [34] to address a sample selection bias, which causes source and target probability distributions to mismatch [33].
3 Support Vector Machine

Support vector machines (SVMs) are state-of-the-art classifiers [6, 12, 26, 36, 38, 40]. They have proven to provide well-generalizing solutions in practice and are well understood theoretically [42].
The kernel trick [38] allows for an easy handling of diverse data representations (e.g., biological sequences or multimodal data). Support vector machines perform linear discrimination in a kernel-induced feature space and are based on the idea of large margin separation: they try to maximize the distance between the decision boundary and the correctly classified points closest to this boundary.
In the following, we formalize SVMs to fix our notation; for a detailed introduction we refer to the recent WIREs articles [26, 36].
An SVM for binary classification labels an input x according to the sign of a decision function of the form

$$f(x) = \langle w, \phi(x)\rangle + b = \sum_{i=1}^{\ell} \alpha_i y_i\, \kappa(x_i, x) + b, \qquad (2)$$

where κ is a positive semi-definite kernel function [1] and φ is a mapping X → F to a kernel-induced Hilbert space F such that κ(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩. We call F the feature space, which includes the weight vector w = Σ_{i=1}^{ℓ} α_i y_i φ(x_i) ∈ F. The training patterns x_i with α_i > 0 are called support vectors. The decision boundary is linear in F and the offset from the origin is controlled by b ∈ R.
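As a concrete illustration of (2), the following sketch evaluates the decision function from a set of support vectors with a Gaussian kernel. The names (`alpha`, `support_vectors`, `labels`, `b`, `gamma`) are not from the article; they simply mirror the symbols α_i, x_i, y_i, b and the kernel parameter, and the sketch assumes NumPy is available.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian kernel kappa(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decision_function(x, support_vectors, labels, alpha, b, gamma=1.0):
    """Evaluate f(x) = sum_i alpha_i y_i kappa(x_i, x) + b as in Eq. (2)."""
    return sum(a * y * rbf_kernel(sv, x, gamma)
               for a, y, sv in zip(alpha, labels, support_vectors)) + b

# The predicted label is the sign of the decision function:
# y_hat = np.sign(decision_function(x, support_vectors, labels, alpha, b))
```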
The distance of a pattern (x, y) from the decision boundary is given by |f(x)|/‖w‖. We call yf(x) the functional margin and yf(x)/‖w‖ the geometric margin: a positive margin implies correct classification. Let us assume that the training data L is linearly separable in F. Then m(L, f) = min_{(x,y)∈L} yf(x) defines the margin of the whole data set L with respect to f (in the following we do not indicate the dependency on L and f if it is clear from the context). We call the feature space region {x ∈ X | |f(x)| ≤ 1} the margin band [9].
A hard margin SVM computes the linear hypothesis that separates the data and yields a maximum margin by solving

$$\max_{w, b, \gamma} \; \gamma \quad \text{subject to} \quad y_i\bigl(\langle w, \phi(x_i)\rangle + b\bigr) \ge \gamma, \;\; i = 1, \ldots, \ell, \quad \|w\| = 1, \qquad (3)$$

with w ∈ F, b ∈ R and γ ∈ R [40]. Instead of maximizing γ and keeping the norm of w fixed to one, one can equivalently minimize ‖w‖ and fix a target margin, typically γ = 1.
In general, we cannot or do not want to separate the full training data correctly. Soft margin SVMs relax the concept of large margin separation. They are best understood as the solutions of the regularized risk minimization problem

$$\min_{w, b} \; \frac{1}{2}\|w\|^2 + \sum_{i=1}^{\ell} C_i\, L_{\mathrm{hinge}}\bigl(\langle w, \phi(x_i)\rangle + b,\, y_i\bigr). \qquad (4)$$

Here, L_hinge(f(x_i), y_i) = max(0, 1 − y_i f(x_i)) denotes the hinge loss. An optimal solution w* = Σ_{i=1}^{ℓ} α_i* y_i φ(x_i) has the property that 0 ≤ α_i* ≤ C_i for i = 1, ..., ℓ. For soft margin SVMs, the patterns in L need not be linearly separable in F. If they are, increasing the C_i until an optimal solution satisfies α_i* < C_i for all i leads to the same hypothesis as training a hard margin SVM.
Usually, all samples are given the same weight C_i = C.
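Per-sample regularization parameters C_i can be realized in standard SVM software. A minimal sketch, assuming scikit-learn is available: `SVC.fit` accepts a per-sample `sample_weight`, which effectively scales C for each training point; the toy data and weight values below are purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two labeled samples per class.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, 1, 1])

# Per-sample weights w_i; the effective regularization is C_i = w_i * C.
weights = np.array([1.0, 1.0, 2.0, 0.5])

clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y, sample_weight=weights)  # soft margin SVM with C_i = w_i * C
print(clf.decision_function(X))       # f(x) for the training points
```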

4 Uncertainty Sampling

It seems intuitive to query labels for samples that cannot be easily classified using our current classifier. Consider the contrary: if we are very certain about the class of a sample, then we might regard any label that does not reflect our expectation as noise. On the other hand, uncovering an expected label would not make us modify our current hypothesis.

Uncertainty sampling was introduced by Lewis and Gale [23]. The idea is that the samples the learner is most uncertain about provide the greatest insight into the underlying data distribution.
Figure 1 shows an example in the case of an SVM. Among the three different unlabeled candidates, our intuition may suggest asking for the label of the sample closest to the decision boundary: the labels of the other candidates seem to clearly match the class of the samples on the respective side or are otherwise simply mislabeled. In the following, we want to show how this intuitive choice can be justified and how it leads to a number of active learning algorithms that make use of the special properties of an SVM.

Figure 1: The three rectangles depict unlabeled samples while the blue circles and orange triangles represent positively and negatively labeled samples, respectively. Intuitively, the label of the sample xa might tell us the most about the underlying distribution of labeled samples, since in the feature space, φ(xa ) is closer to the decision boundary than φ(xb ) or φ(xc ).

4.1 Version Space

The version space is a construct that helps to keep track of all hypotheses that are able to perfectly classify our current observations [27]. Thus, for the moment, we assume that our data are linearly separable in the feature space. The idea is to speed up learning by selecting samples in a way that rapidly reduces the version space with each labeled sample.
We can express the hard margin SVM classifier (3) in terms of a geometric representation of the version space. For this purpose, we restrict our consideration to hypotheses f(x) = ⟨w, φ(x)⟩ without bias (i.e., b = 0). The following results, however, can be extended to SVMs with b ≠ 0.
The version space V(L) refers to the subset of F that includes all hypotheses consistent with the training set L [27]:

$$\mathcal{V}(\mathcal{L}) := \bigl\{ w \in \mathcal{F} \;\big|\; \|w\| = 1,\; y f(x) > 0 \;\; \forall (x, y) \in \mathcal{L} \bigr\}. \qquad (5)$$
In this representation, we can interpret the hypothesis space as the unit hypersphere given by ‖w‖ = 1. The surface of the hypersphere includes all possible hypotheses classifying samples that are mapped into the feature space F. We define Λ(V) as the area the version space occupies on the surface of this hypersphere. This is depicted in Figure 2. The hypothesis space is represented by the big sphere. The white front, which is cut out by the two hyperplanes, depicts the version space.
Each sample x can be interpreted as defining a hyperplane through the origin of F with the normal vector φ(x). Each hyperplane divides the feature space into two half-spaces. Depending on the label y of the sample x, the version space is restricted to the surface of the hypersphere that lies on the respective side of the hyperplane. For example, a sample x that is labeled with y = +1 restricts the version space to all w on the unit hypersphere for which ⟨w, φ(x)⟩ > 0, i.e., the ones that lie on the positive side of the hyperplane defined by the normal vector φ(x). Thus, the version space is defined by the intersection of all half-spaces and the surface of the hypothesis hypersphere.
Figure 2a illustrates this geometric relationship.

(a) The sphere depicts the hypothesis space. The two hyperplanes are induced by two labeled samples. The version space is the part of the sphere surface (in white) that is on one side of each hyperplane. The respective side is defined by its label.

(b) The center (in black) of the orange sphere depicts the SVM solution within the version space.
It has the maximum distance to the hyperplanes delimiting the version space. The normals of these hyperplanes, which are touched by the orange sphere, correspond to the support vectors.

Figure 2: Geometric representation of the version space in 3D following Tong [43].

If we consider a feature mapping with the property ∀x, z ∈ X : ‖φ(x)‖ = ‖φ(z)‖, as is the case for normalized kernels, including the frequently used Gaussian kernel, then the SVM solution (3) has a particularly nice geometric interpretation in the version space. Under this condition, the decision function f(x) = ⟨w, φ(x)⟩ maximizing the margin m(L, f) (i.e., the minimum y⟨w, φ(x)⟩ over L) also maximizes the minimum distance between w and any hyperplane defined by a normal φ(x_i), i = 1, ..., ℓ. The solution is a point within the version space, which is the center of a hypersphere, depicted in orange in Figure 2b. This hypersphere has the maximum radius possible without intersecting the hyperplanes that delimit the version space. The radius is given by r = y⟨w, φ(x)⟩/‖φ(x)‖, where (x, y) is any support vector. Changing our perspective, we can interpret the normals φ(x_i) of the hyperplanes touching the hypersphere as points in the feature space F. Then, these are exactly the support vectors, since they have the minimum distance to our decision boundary defined by w.
This distance is m(L, f), which turns out to be proportional to the radius r.
4.2 Implicit Version Space

An explicit version space, as defined above, only exists if the data are separable, which is often not the case in practice. The Bayes optimal solution need not have vanishing training error (as soon as under D we have p(y_1|x) ≥ p(y_2|x) > 0 for some x ∈ X and y_1, y_2 ∈ Y with y_1 ≠ y_2). Thus, minimizing the version space might exclude hypotheses with non-zero training error that are in fact optimal. In agnostic active learning [2], we do not make the assumption of an existing optimal zero-error decision boundary. An algorithm that is theoretically capable of active learning in an agnostic setting is the A² algorithm [2]. Here, a hypothesis cannot be deleted due to its disagreement with a single sample. If, however, all hypotheses that are part of the current version space agree on a region within the feature space, this region can be discarded. For each hypothesis the algorithm keeps an upper and a lower bound on its training error (see the work by Balcan et al. [2] for details).
It subsequently excludes all hypotheses which have a lower bound that is higher than the global minimal upper bound. Despite being intractable in practice, this algorithm forms the basis of some important algorithms compensating the selection bias as discussed below.
4.3 Uncertainty-Based Active Learning

Although the version space is restricted to separable problems, it motivates many general active selection strategies. Which samples should we query to reduce the version space? As we have seen previously, each labeled sample that becomes a support vector restricts the version space to one side of the hyperplane it induces in F. If we do not know the correct label of a sample in advance, we should always query the sample that ideally halves the version space. This is a safe choice, as we will reduce the version space regardless of the label. Computing the version space in a high-dimensional feature space is usually intractable, but we can approximate it efficiently using the SVM. In the version space, the SVM solution w is the center of the hypersphere touching the hyperplanes induced by the support vectors. Each hyperplane delimits the version space. Assuming that the center of this hypersphere is close to the center of the version space, we can use it as an approximation. If we now choose a hyperplane that is close to this center, we approximately bisect the version space. Therefore, we want to query the sample x that induces a hyperplane as close to w as possible:

$$\hat{x} = \operatorname*{argmin}_{x \in \mathcal{U}} |\langle w, \phi(x)\rangle| = \operatorname*{argmin}_{x \in \mathcal{U}} |f(x)|. \qquad (6)$$

(a) Simple Margin will query the sample that induces a hyperplane lying closest to the SVM solution. In this case, it would query sample x_a.

(b) Here the SVM does not provide a good approximation of the version space area. Simple Margin would query sample x_a, while x_c might have been a more suitable choice.

Figure 3: The version space area is shown in white, the solid lines depict the hyperplanes induced by the support vectors, and the center of the orange circle is the weight vector w of the current SVM. The dotted lines show the hyperplanes that are induced by unlabeled samples. This visualization is inspired by Tong [43].

This strategy queries the sample closest to the current decision boundary and is called Simple Margin [43]. Figure 3 shows this principle geometrically, projected to two dimensions.
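A minimal sketch of the Simple Margin rule (6): among all pool samples, pick the one with the smallest absolute decision value. It assumes a trained model exposing a scikit-learn-style `decision_function`; the function name is illustrative.

```python
import numpy as np

def simple_margin_query(model, pool):
    """Return the index of the pool sample closest to the decision boundary,
    i.e., argmin_x |f(x)| as in Eq. (6)."""
    scores = np.abs(model.decision_function(pool))
    return int(np.argmin(scores))
```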
By querying the samples closest to the separating hyperplane, we try to minimize the version space by requesting as few labels as possible. However, depending on the actual shape of the version space, the SVM solution may not provide a good approximation and another query strategy would have achieved a greater reduction of the version space area. This is illustrated in Figure 3b. The strategy of myopically querying the samples with the smallest margin may even perform worse than a passive learner.
Therefore, we can choose a different heuristic to approximate the version space more accurately [37, 45]. For instance, we could compute two SVMs for each sample: one for the case we labeled it positively and one assuming a negative label. We can then, for each case, compute the margins m⁺ = +⟨w, φ(x)⟩ and m⁻ = −⟨w, φ(x)⟩ of the respective solutions. Finally, we query the sample that attains the maximum value of min(m⁺, m⁻). This quantity will be very small if the corresponding version spaces are very different. Thus, we take the maximum to favor an equal split. This strategy is called MaxMin Margin [43] and allows us to make a better choice in case of an irregular version space area, as we can see in Figure 4. This, however, comes with the additional cost of computing the margin for each potential labeling.
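A sketch of MaxMin Margin under simplifying assumptions: for every candidate, retrain the SVM once with a positive and once with a negative label and score the candidate by the smaller of the two resulting margins. This is illustrative and computationally naive (it retrains 2|U| SVMs), assumes scikit-learn, a linear kernel (so the margin can be read off as 1/‖w‖ from `coef_`), a large C to approximate a hard margin, and that both classes are present in the labeled set.

```python
import numpy as np
from sklearn.svm import SVC

def maxmin_margin_query(X_labeled, y_labeled, pool):
    """MaxMin Margin (illustrative, linear kernel): for each candidate x,
    train one SVM per tentative label and score x by min(m+, m-)."""
    def margin_with(x, y):
        X = np.vstack([X_labeled, x])
        y_all = np.append(y_labeled, y)
        clf = SVC(kernel="linear", C=1e6).fit(X, y_all)  # ~hard margin
        return 1.0 / np.linalg.norm(clf.coef_)           # geometric margin
    scores = [min(margin_with(x, +1), margin_with(x, -1)) for x in pool]
    return int(np.argmax(scores))
```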
Uncertainty sampling can also be motivated by trying to minimize the training error directly [9]. Depending on the assumptions made, we can arrive at different strategies. When the classifier has just been trained on only a few labeled samples, we assume the prospective labels of the yet unlabeled data to be uncorrelated with the predicted labels. Therefore, we want to select the sample for which we can expect the largest error, namely

$$\hat{x} = \operatorname*{argmax}_{x \in \mathcal{U}} \; \frac{1}{2}\Bigl[\max(0, 1 - f(x)) + \max(0, 1 + f(x))\Bigr], \qquad (7)$$

Figure 4: In this case, the MaxMin Margin strategy would query sample x_c. Each of the two orange circles corresponds to an SVM trained with a positive and a negative labeling of x_c, respectively [43].

where we assume the hinge loss and f(x) as defined in (2). Assuming a separable dataset, we are only interested in uncertain samples, i.e., those within the margin band. Under these constraints, any choice of x leads to the same value of the objective (7): we select the sample at random in this case. However, as soon as some labeled samples are available for SVM training, the prediction of the SVM for an unlabeled point x is expected to be positively correlated with the label of x. Thus, we assume correct labeling and look, for each sample, at the minimum error we incur regardless of the labeling. We want to find the sample that maximizes this quantity, i.e.,

$$\hat{x} = \operatorname*{argmax}_{x \in \mathcal{U}} \; \min\bigl(\max(0, 1 - f(x)),\, \max(0, 1 + f(x))\bigr) = \operatorname*{argmin}_{x \in \mathcal{U}} |f(x)|, \qquad (8)$$

which gives us the same selection criterion as in (6), the Simple Margin strategy. If all unlabeled samples meet the target margin (i.e., the hinge loss of the samples is zero) and if we assume the SVM labels them correctly (i.e., |f(x)| ≥ 1), it seems that we have arrived at a hypothesis that already generalizes well. Both picking samples near the boundary and picking samples far away from it appear to be non-optimal. Therefore, we simply proceed by choosing a random sample from the unlabeled data.
In practice, we may start by training on a random subset and perform uncertainty sampling until the user stops the process or until all unlabeled samples meet the target margin. In this case, we query another random subset as a validation set and estimate the error. We may repeat the last step until we reach a satisfactory solution.
4.4 Expected Model Change

Instead of trying to minimize an explicit or implicit version space, we can find the most informative sample by selecting it based on its expected effect on the current model, the expected model change.
In gradient-based learning, this means selecting the sample x ∈ U that, if labeled, would maximize the expected gradient length, where we take the expectation Ey over all possible labels y.
Non-linear SVMs are usually trained by solving the underlying quadratic optimization problem in its Wolfe dual representation [12, 38]. Let us assume SVMs without bias. If we add a new sample (x_{ℓ+1}, y_{ℓ+1}) to the current SVM solution and initialize its coefficient with α_{ℓ+1} = 0, the partial derivative of the dual problem W(α) to be maximized with respect to α_{ℓ+1} is

$$g_{\ell+1} = \frac{\partial W(\alpha)}{\partial \alpha_{\ell+1}} = 1 - y_{\ell+1} \sum_{i=1}^{\ell} \alpha_i y_i\, \kappa(x_i, x_{\ell+1}) = 1 - y_{\ell+1} f(x_{\ell+1}). \qquad (9)$$

As α_{ℓ+1} is constrained to be non-negative, we only change the model if the partial derivative is positive, that is, if y_{ℓ+1} f(x_{ℓ+1}) < 1. Note that y_{ℓ+1} f(x_{ℓ+1}) > 1 implies that (x_{ℓ+1}, y_{ℓ+1}) is already correctly classified by the current model and meets the target margin.
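The quantity in (9) is cheap to evaluate once f is available. A small sketch, with illustrative names, computes the dual gradient for a candidate (x, y) and checks whether adding it would change the current model.

```python
def dual_gradient(f_x, y):
    """g = 1 - y * f(x) as in Eq. (9); f_x is the current decision value f(x)."""
    return 1.0 - y * f_x

def would_change_model(f_x, y):
    """Adding (x, y) only changes the solution if the gradient is positive,
    i.e., if the sample violates the target margin: y * f(x) < 1."""
    return dual_gradient(f_x, y) > 0.0
```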

Let us assume that our current model classifies any sample perfectly and that the dataset is linearly separable in the feature space:

$$p(y \mid x) = \begin{cases} 1 & \text{if } y f(x) > 0 \\ 0 & \text{otherwise}. \end{cases} \qquad (10)$$

If we just consider the partial derivative g_{ℓ+1} in the expected model change selection criterion, we arrive at selecting

$$\hat{x} = \operatorname*{argmax}_{x \in \mathcal{U}} \; p(y{=}1 \mid x)\,|1 - f(x)| + p(y{=}{-}1 \mid x)\,|1 + f(x)|$$
$$\;\;= \operatorname*{argmax}_{x \in \mathcal{U}} \; p(y{=}1 \mid x)\,(1 - f(x)) + p(y{=}{-}1 \mid x)\,(1 + f(x))$$
$$\;\;= \operatorname*{argmax}_{x \in \mathcal{U}} \begin{cases} 1 - f(x) & \text{if } f(x) > 0 \\ 1 + f(x) & \text{if } f(x) < 0 \end{cases} \;=\; \operatorname*{argmin}_{x \in \mathcal{U}} |f(x)|. \qquad (11)$$

Thus, uncertainty sampling can also be motivated by maximizing the expected model change. Next, we want to have a look at approaches that try to exploit the uncertainty of samples and simultaneously explore undiscovered regions of the feature space.

5 Combining Informativeness and Representativeness

By performing mere uncertainty sampling, we may pay too much attention to certain regions of the feature space and neglect other regions that are more representative of the underlying distribution.
This leads to a sub-optimal classifier. To counteract this effect, we could sample close to the decision boundary, but also systematically include samples that are farther away [18, 20].
In Figure 5, we see an example where uncertainty sampling can mislead the classifier. Although φ(x_a) is closest to the separating hyperplane, it is also far away from all other samples in feature space and thus may not be representative of the underlying distribution. To avoid querying outliers, one would like to select samples not only based on their informativeness, but also based on their representativeness [13]. In our example, selecting sample x_b would be a better choice, because it is located in a more densely populated region where correct classification matters more for obtaining an accurate model.

Figure 5: The white rectangles depict unlabeled samples. The blue circle and the orange triangle are labeled as positive and negative, respectively. In feature space, φ(x_a) lies closer to the separating hyperplane than φ(x_b), but it is located in a region that is not densely populated. Using pure uncertainty sampling, e.g., Simple Margin, we would query the label of sample x_a.

Informativeness is a measure of how much querying a sample would reduce the uncertainty of our model. As we have seen, uncertainty sampling is a viable method to exploit informativeness. Representativeness measures how well a sample represents the underlying distribution of unlabeled data [39]. By using a selection criterion that maximizes both measures, we try to improve our model with fewer samples than a passive learner while carefully avoiding a model that is too biased.
In Figure 6 we can see a comparison of the different strategies using a toy example where we sequentially query six samples. Figure 6a shows a biased classifier as the result of uncertainty sampling. While the solution in Figure 6b is closer to the optimal hyperplane, it also converges more slowly, as it additionally explores regions where we are relatively certain about the labels. Combining both strategies, as shown in Figure 6c, yields a decision boundary that is close to the optimal one (Figure 6d) with fewer labels.

(a) Uncertainty sampling.

(b) Selecting representative samples.

(c) Combining informativeness and representativeness.

(d) Optimal hyperplane, obtained by training on the whole dataset.

Figure 6: Binary classification with active learning on six samples and passive learning on the full dataset.

5.1 Semi-Supervised Active Learning

To avoid oversampling unrepresentative outliers, we can combine uncertainty sampling and clustering [47]. First, we train an SVM on the labeled samples and then apply k-means clustering to all unlabeled samples within the margin band to identify k groups. Finally, we query the k medoids.
As only samples within the margin band are considered, they are all subject to high uncertainty.
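A minimal sketch of this clustering-based query step, assuming scikit-learn: train an SVM, restrict the pool to the margin band |f(x)| ≤ 1, cluster those samples with k-means, and query the medoid of each cluster (the pool sample closest to each centroid). For simplicity the sketch clusters in input space, whereas the method described above operates in the kernel-induced feature space; the helper name is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def query_margin_band_medoids(model, pool, k):
    """Select k representative samples inside the margin band |f(x)| <= 1."""
    pool = np.asarray(pool)
    in_band = pool[np.abs(model.decision_function(pool)) <= 1.0]
    if len(in_band) <= k:
        return in_band                      # too few candidates: take them all
    km = KMeans(n_clusters=k, n_init=10).fit(in_band)
    medoids = []
    for center in km.cluster_centers_:
        distances = np.linalg.norm(in_band - center, axis=1)
        medoids.append(in_band[np.argmin(distances)])
    return np.array(medoids)
```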
The clustering ensures that the informativeness of our selection is increased by avoiding redundant samples. However, when using clustering, one has to decide what constitutes a cluster [14]. Depending on the scale and the selected number of clusters, different choices could be equally plausible. To avoid this dilemma, we can choose another strategy to incorporate density information [22]. We build on the min-max formulation [21] of active learning and request the sample

$$\hat{x} = \operatorname*{argmin}_{x_s \in \mathcal{U}} \; \max_{y_s \in \{-1,+1\}} \; \min_{w, b} \; \frac{1}{2}\|w\|^2 + C \sum_{(x,y) \in \mathcal{L}_s} L(f(x), x, y), \qquad (12)$$

where L_s = L ∪ {(x_s, y_s)}. We take the minimum regularized risk when including the sample x_s ∈ U with the label y_s that yields the maximum error. Selecting the sample x̂ minimizing this quantity can be approximated by uncertainty sampling (e.g., using Simple Margin).
In this formulation, however, we base our decision only on the labeled samples and do not take into account the distribution of unlabeled samples. Assuming we knew the labels for each sample in U, we define the set L_su containing the labeled samples (x, y) for x ∈ U and the training set L. We select the sample

$$\hat{x} = \operatorname*{argmin}_{x_s \in \mathcal{U}} \; \min_{y_u \in \{\pm 1\}^{n_u - 1}} \; \max_{y_s \in \{-1,+1\}} \; \min_{w, b} \; \frac{1}{2}\|w\|^2 + C \sum_{(x,y) \in \mathcal{L}_{su}} L(f(x), x, y), \qquad (13)$$

where n_u = |U| and y_u is the label vector assigned to the samples x ∈ U without the label for x_s. Thus, we also maximize the representativeness of the selected sample by incorporating the possible labelings of all unlabeled samples. By using a quadratic loss function and relaxing y_u to continuous values, we can approximate the solution through the minimization of a convex problem [22].
Both clustering and label estimation are of high computational complexity. A simpler algorithm, which the authors call Hinted SVM, considers unlabeled samples without resorting to these techniques [24]. Instead, the unlabeled samples are taken as so-called hints that inform the algorithm of feature space regions which it should be less confident in. To achieve this, we try to simultaneously find a decision boundary that produces a low training error on the labeled samples while being close to the unlabeled samples, the hints. This can be viewed as semi-supervised learning [10]. It is, however, in contrast to typical semi-supervised SVM approaches that push the decision boundary away from the pool of unlabeled samples.
The performance of this algorithm depends on the hint selection strategy. Using all unlabeled samples might be too costly for large datasets while uniform sampling of the unlabeled pool does not consider the information provided by labeled samples. Therefore, we can start with the pool of all unlabeled samples and iteratively drop instances that are close to already labeled ones. When the ratio of hints to all samples is below a certain threshold, we can switch to uncertainty sampling.
Another problem with uncertainty sampling is that it assumes our current hypothesis to be very certain about regions far from the decision boundary. If this assumption is violated, we will end up with a classifier that is worse than the one obtained by passive learning. One way to address this issue is to measure the uncertainty of our current hypothesis and to adjust our query strategy accordingly [28].
To achieve this, we compute a heuristic measure expressing the confidence that the current set of support vectors will not change if we train on more data. It is calculated as

$$c = \frac{2}{|\mathcal{L}_{SV}| \cdot k} \sum_{(x,y) \in \mathcal{L}_{SV}} \min(k_x^+, k_x^-), \qquad (14)$$

where L_SV are the support vectors and k_x^+ and k_x^- are the numbers of positively and negatively labeled samples within the k nearest neighbors of (x, y) ∈ L_SV. In the extremes, we get c = 1 if k_x^+ = k_x^- and c = 0 if k_x^+ = 0 ∨ k_x^- = 0 for all (x, y) ∈ L_SV. We can use this measure to decide whether a labeled data point (x, y) should be kept for training by adding it with probability

$$p(x) = \begin{cases} c & \text{if } y f(x) \le 1 \\ 1 - c & \text{otherwise} \end{cases} \qquad (15)$$

to the training data set. This means that more samples are queried within the margin if we are very confident that the current hyperplane represents the optimal one. In the following, we discuss a related idea, which not only queries samples with a certain probability, but also subsequently incorporates this probability to weight the impact of the training set samples.
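Before moving on, here is a minimal sketch of the confidence measure (14) and the acceptance probability (15). The inputs are illustrative assumptions: arrays with the positive and negative neighbor counts around each support vector, and the decision value of the candidate point.

```python
import numpy as np

def confidence(k_pos, k_neg, k):
    """c = 2 / (|L_SV| * k) * sum_i min(k_i^+, k_i^-) as in Eq. (14).
    k_pos, k_neg: arrays with the positive/negative label counts among the
    k nearest neighbors of each support vector."""
    k_pos, k_neg = np.asarray(k_pos), np.asarray(k_neg)
    return 2.0 / (len(k_pos) * k) * np.sum(np.minimum(k_pos, k_neg))

def acceptance_probability(c, y, f_x):
    """p(x) = c inside the margin (y * f(x) <= 1), 1 - c otherwise, Eq. (15)."""
    return c if y * f_x <= 1.0 else 1.0 - c
```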
5.2 Importance-Weighted Active Learning

When we query samples actively instead of selecting them uniformly at random, the training and test samples are not independent and identically distributed (i.i.d.). Thus, the training set will have a sample selection bias. As most classifiers rely on the i.i.d. assumption, this can severely impair the prediction performance.

Assume that we sample the training data points from the biased sample distribution D̃ over X × Y, while our goal is minimizing the risk (1) with respect to the true distribution D. If we know the relationship between D̃ and D, we can still arrive at an unbiased hypothesis by re-weighting the loss for each sample [11, 49]. We introduce the weighted loss L_w(z, x, y) = w(x, y) L(z, x, y) and define the weighting function

$$w(x, y) = \frac{p_{\mathcal{D}}(x, y)}{p_{\tilde{\mathcal{D}}}(x, y)}, \qquad (16)$$

reflecting how likely it is to observe (x, y) under D compared to D̃, under the assumption that the support of D is included in D̃. This leads us to the basic result

$$\mathbb{E}_{(x,y)\sim\tilde{\mathcal{D}}}\bigl[L_w(f(x), x, y)\bigr] = \int_{\mathcal{X}\times\mathcal{Y}} p_{\tilde{\mathcal{D}}}(x, y)\,\frac{p_{\mathcal{D}}(x, y)}{p_{\tilde{\mathcal{D}}}(x, y)}\, L(f(x), x, y)\,\mathrm{d}(x, y) = \int_{\mathcal{X}\times\mathcal{Y}} p_{\mathcal{D}}(x, y)\, L(f(x), x, y)\,\mathrm{d}(x, y) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\bigl[L(f(x), x, y)\bigr]. \qquad (17)$$

Thus, by choosing appropriate weights we can modify our loss function such that we can compute an unbiased estimator of the generalization error R(f). This technique for addressing the sample selection bias is called importance weighting [3, 4].
We define a weighted sample set L_w as the training set L augmented with non-negative weights w_1, ..., w_ℓ for each point in L. These weights are used to set w(x_i, y_i) = w_i when computing the weighted loss. For the soft margin SVM, minimizing the weighted loss can easily be achieved by multiplying each regularization parameter C_i in (4) with the corresponding weight, i.e., C_i = w_i · C [49]. While the weighting gives an unbiased estimator, it may be difficult to estimate the weights reliably, and the variance of the estimator may be very high. Controlling the variance is a crucial problem when using importance weighting.
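A minimal sketch of training such a weighted soft margin SVM, assuming scikit-learn: given importance weights w_i (for example w_i = 1/p_i when sample i was queried with probability p_i, as in the algorithm described below), pass them as `sample_weight` so that effectively C_i = w_i · C. Names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def fit_importance_weighted_svm(X, y, query_probabilities, C=1.0):
    """Train a soft margin SVM with weights w_i = 1 / p_i, i.e., C_i = w_i * C."""
    weights = 1.0 / np.asarray(query_probabilities)
    clf = SVC(kernel="rbf", C=C)
    clf.fit(X, y, sample_weight=weights)
    return clf
```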
The original importance-weighted active learning formulation works in a stream-based scenario and inspects one sample x_t at each time step t [3]. Iteration t of the algorithm works as follows:
1. Receive the unlabeled sample x_t.
2. Choose p_t ∈ [0, 1] based on all information available in this round.
3. With probability p_t, query the label y_t for x_t, add (x_t, y_t) to the weighted training set with weight w_t = 1/p_t, and retrain the classifier.
In step 2 the query probability p_t has to be chosen based on earlier observations: this could be, for instance, the probability that two hypotheses disagree on the received sample x_t.
This algorithm can also be adapted to the pool-based scenario [16]. In this case, we can simply define a probability distribution over all unlabeled samples in the pool. We set the probability for each point in proportion to its uncertainty, i.e., its distance to the decision boundary. This works well if we assume a noise-free setting. Otherwise, this method suffers from the same problems as other approaches that are based on a version space. Given a mislabeled sample (i.e., the label has a very low probability given the features), the active learner can be distracted and focus on regions within the hypothesis space which do not include the optimal decision boundary.
One way to circumvent these problems is to combine importance weighting with ideas from agnostic active learning [50]. We keep an ensemble of SVMs H = {f_1, ..., f_K} and train each on a bootstrap sample subset from L, which may be initialized with a random subset of the unlabeled pool U for which labels are requested. After the initialization, we choose points x ∈ U with selection probability

$$p_t(x) = p_{\text{threshold}} + (1 - p_{\text{threshold}})\,\bigl(p_{\max}(x) - p_{\min}(x)\bigr), \qquad (18)$$

where p_threshold > 0 is a small minimum probability that ensures p_t(x) > 0. Using Platt's method [31], we define p_i(x) ∈ [0, 1] to be the probabilistic output of the SVM f_i, with p_min(x) = min_{1≤i≤K} p_i(x) and p_max(x) = max_{1≤i≤K} p_i(x). Thus, p_t is high if there is strong disagreement within the ensemble and low if all classifiers agree. This allows us to deal with noise, because no hypothesis gets excluded forever.
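A sketch of the selection probability (18), assuming each ensemble member is a scikit-learn `SVC` fitted with `probability=True` so that Platt-scaled probabilities are available via `predict_proba`; names and the default threshold are illustrative, and the positive class is assumed to be labeled 1.

```python
import numpy as np

def query_probability(ensemble, x, p_threshold=0.05):
    """p_t(x) = p_threshold + (1 - p_threshold) * (p_max - p_min), Eq. (18).
    ensemble: list of SVC models fitted with probability=True on {-1, 1} labels."""
    x = np.atleast_2d(x)
    # Platt-scaled probability of the positive class for each ensemble member.
    probs = [clf.predict_proba(x)[0, list(clf.classes_).index(1)]
             for clf in ensemble]
    return p_threshold + (1.0 - p_threshold) * (max(probs) - min(probs))
```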

6 Multi-Class Active Learning

The majority of research on active learning with SVMs focuses on the binary case, because dealing with more categories makes estimating the uncertainty of a sample more difficult. Furthermore, multi-class SVMs are in general more time-consuming to train. There are different approaches to extend SVMs to multi-class classification. A popular way is to reduce the learning task to multiple binary problems. This is done by using either a one-vs-one [19, 32] or a one-vs-all [35, 46] approach.
However, performing uncertainty sampling with respect to each single SVM may cause the problem that one sample is informative for one binary task, but bears little information for the other tasks and, thus, for the overall multi-class classification.
It is possible to extend the version space minimization strategy to the multi-class case [43]. The area of the version space is proportional to the probability of classifying the training set correctly, given a hypothesis sampled at random from the version space. In the one-vs-all approach for N classes, we maintain N binary SVM models f_1, ..., f_N, where the i-th SVM is trained to separate class i from the other classes. We consider minimizing the maximum product of all N version space areas to select

$$\hat{x} = \operatorname*{argmin}_{x \in \mathcal{U}} \; \max_{y \in \{-1, 1\}} \; \prod_{i=1}^{N} \Lambda(\mathcal{V}^{(i)}_{x,y}), \qquad (19)$$

where Λ(V^{(i)}_{x,y}) is the area of the version space of the i-th SVM if the sample (x, y), x ∈ U, was included in the training set. To approximate the area of the version space, we can use MaxMin Margin for each binary SVM. The margin of a sample in a single SVM only reflects the uncertainty with respect to the specific binary problem and not in relation to the other classification problems.
Therefore, we have to modify our approximation if we want to extend the Simple Margin strategy to the multi-class case.

Figure 7: Single version space for the multi-class problem with N one-vs-all SVMs [43]. The area Λ(V^{(i)}_{x, y=i}) corresponds to the version space area if the label of the picked sample is y = i. The area Λ(V^{(i)}_{x, y≠i}) corresponds to the case where y ≠ i. In the multi-class case, we want to measure both quantities to approximate the version space area.

We can interpret each f_i(x) as a quantity that measures to what extent x splits the version space. As we have N different classifiers that influence the area of the version space, we have to quantify this influence for each one. In Figure 7, we can see the version space for one single binary problem where we want to discriminate between class i and the rest. Thus, we want to approximate the area Λ(V^{(i)}_{x, y=i}). If we choose a sample x where f_i(x) = 0, we approximately halve the version space; for f_i(x) = 1, the area approximately stays the same; and for f_i(x) = −1, we obtain a zero area; similarly for the area Λ(V^{(i)}_{x, y≠i}). Therefore, we can use the approximation

$$\Lambda(\mathcal{V}^{(i)}_{x,y}) = \begin{cases} 0.5\,(1 + f_i(x))\,\Lambda(\mathcal{V}^{(i)}) & \text{if } y = i \\ 0.5\,(1 - f_i(x))\,\Lambda(\mathcal{V}^{(i)}) & \text{if } y \ne i. \end{cases} \qquad (20)$$

We can also employ one-versus-one multi-class SVMs [25]. Again, we use Platt's algorithm [31] to approximate posterior probabilities for our predictions. We simultaneously fit the probabilistic model and the SVM hyperparameters via grid search to derive a classification probability p_k(x), k = 1, ..., K, for each of the K SVMs given the sample x. We can use these probabilities for active sample selection in different ways. A simple approach is to just select the sample with the least classification confidence. This corresponds to the selection criterion

$$\hat{x}_{\mathrm{LC}} = \operatorname*{argmin}_{x \in \mathcal{U}} \; \min_{k \in \{1, \ldots, K\}} p_k(x). \qquad (21)$$

This approach suffers from the same problem we mentioned earlier: the probability is connected only to each single binary problem instead of providing a measure that relates it to the other classifiers. To alleviate this, we can choose another approach, called breaking ties [25]. Here, we select the sample with the minimum difference between the highest class confidences, namely

$$\hat{x}_{\mathrm{BT}} = \operatorname*{argmin}_{x \in \mathcal{U}} \; \min_{k, l \in \{1, \ldots, K\},\, k \ne l} \bigl( p_k(x) - p_l(x) \bigr). \qquad (22)$$

This way, we prefer samples for which two classifiers are almost equally confident, and thus avoid considering the uncertainty of one classifier in isolation.
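A sketch of the least-confidence rule (21) and the breaking-ties selection, assuming a matrix of per-classifier probabilities p_k(x) for every pool sample (e.g., obtained with Platt scaling); names are illustrative. The first function follows the formula as stated in (21); the second implements the textual description of breaking ties, i.e., the smallest gap between the two most confident classes.

```python
import numpy as np

def least_confidence_query(probs):
    """probs: array of shape (n_pool, K) with classifier confidences p_k(x).
    Select the sample minimizing min_k p_k(x), following Eq. (21)."""
    return int(np.argmin(probs.min(axis=1)))

def breaking_ties_query(probs):
    """Select the sample with the smallest difference between the two most
    confident classes, cf. the breaking-ties criterion around Eq. (22)."""
    top_two = np.sort(probs, axis=1)[:, -2:]   # two highest p_k per sample
    gaps = top_two[:, 1] - top_two[:, 0]
    return int(np.argmin(gaps))
```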

7 Efficient Active Learning

In passive learning, selecting a training set comes almost for free: we just select a random subset of the labeled data. In active learning, we have to evaluate whether a sample should be added to the training set based on its ability to improve our prediction. A simple strategy to reduce computational complexity is not to select single samples, but to collect them in batches instead. However, to profit from this strategy, we have to make sure to create batches that minimize redundancy.
Another consideration that makes efficient computation necessary is that for many algorithms we have to retrain our model on different training subsets. Usually, these subsets differ only by one or the few samples we consider for selection. Thus, we can employ online learning to train our model incrementally. In particular, this makes sense if we analyze the effect that adding a sample has on our model.
7.1 Online Learning

After we have selected a sample to be included in the training set, we have to retrain our model to reflect the additional data. Using the whole dataset for retraining is computationally expensive. A better option would be to incrementally improve our model with each selected sample through online learning. LASVM [5, 17] is an SVM solver for fast online training. It is based on a decomposition method [30] solving the learning problem iteratively by considering only a subset of the α-variables in each iteration. In LASVM, this subset considers one unlabeled sample (corresponding to a new variable) in every second iteration, which can, for instance, be picked by uncertainty sampling [5].
7.2 Batch-Mode Active Learning

When confronted with large amounts of unlabeled data, estimating the effect of single samples with respect to the learning objective is costly. Besides online learning, we can also gain a speed-up by labeling samples in batches. A naive strategy is to just select the n samples that are closest to the decision boundary [44]. This approach, however, does not take into account that the samples within the batch might bear a high level of redundancy.
To counteract this redundancy, we can select samples not only due to their individual informativeness, but also if they maximize the diversity within each batch [7]. One heuristic is to maximize the angle between the hyperplanes that the samples induce in version space:

$$|\cos(\angle(\phi(x_i), \phi(x_j)))| = \frac{|\langle\phi(x_i), \phi(x_j)\rangle|}{\|\phi(x_i)\|\,\|\phi(x_j)\|} = \frac{|\kappa(x_i, x_j)|}{\sqrt{\kappa(x_i, x_i)\,\kappa(x_j, x_j)}}. \qquad (23)$$

Let S be the batch of samples, which we initialize with one sample x_S. We subsequently add the sample x̂ whose hyperplane minimizes the maximum cosine similarity to the hyperplanes already induced by samples in the batch (i.e., maximizes the minimum angle). It is computed as

$$\hat{x} = \operatorname*{argmin}_{x \in \mathcal{U} \setminus \mathcal{S}} \; \max_{z \in \mathcal{S}} \; |\cos(\angle(\phi(x), \phi(z)))|. \qquad (24)$$

We can form a convex combination of this diversity measure and the well-known uncertainty measure (distance to the hyperplane) with a trade-off parameter λ ∈ [0, 1]. Samples to be added to the batch are then iteratively chosen as

$$\hat{x} = \operatorname*{argmin}_{x \in \mathcal{U} \setminus \mathcal{S}} \; \lambda\,|f(x)| + (1 - \lambda)\,\max_{z \in \mathcal{S}} |\cos(\angle(\phi(x), \phi(z)))|. \qquad (25)$$

We can choose λ = 0.5 to give equal weight to the uncertainty and diversity measures. An optimal value, however, might depend on how certain we are about the accuracy of the current classifier.
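A sketch of greedy batch construction with criterion (25), assuming a kernel function `k(x, z)` and a fitted model exposing a scikit-learn-style `decision_function`; the trade-off λ, the initialization with the most uncertain sample, and all names are illustrative assumptions.

```python
import numpy as np

def cosine_angle(k, xi, xj):
    """|cos| of the angle between the hyperplanes induced by xi and xj, Eq. (23)."""
    return abs(k(xi, xj)) / np.sqrt(k(xi, xi) * k(xj, xj))

def select_batch(model, pool, k, batch_size, lam=0.5):
    """Greedy batch selection combining uncertainty and diversity, Eq. (25)."""
    pool = list(pool)
    uncertainty = np.abs(model.decision_function(np.asarray(pool)))
    batch = [int(np.argmin(uncertainty))]   # start with the most uncertain sample
    while len(batch) < batch_size:
        best, best_score = None, np.inf
        for i, x in enumerate(pool):
            if i in batch:
                continue
            diversity = max(cosine_angle(k, x, pool[j]) for j in batch)
            score = lam * uncertainty[i] + (1.0 - lam) * diversity
            if score < best_score:
                best, best_score = i, score
        batch.append(best)
    return [pool[i] for i in batch]
```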

8 Conclusion

Access to unlabeled data allows us to improve predictive models in data mining applications. If it is only possible to label a limited amount of the available data due to labeling costs, we should choose this subset carefully and focus on patterns carrying the information most helpful for enhancing the model. Support vector machines (SVMs) have convenient properties that make it easy to evaluate how unlabeled samples would influence the model if they were labeled and included in the training set. Therefore, SVMs are particularly well-suited for active learning. However, there are several challenges we have to address, such as efficient learning, dealing with multiple classes, and compensating the selection bias introduced by actively choosing the training data. Importance weighting seems to be the most promising way to counteract this bias, and it can be easily incorporated into an active SVM learner.
Devising parallel algorithms for sample selection can speed up learning in many cases. Most of the research in active SVM learning so far has focused on binary decision problems. A challenge for future research is to develop efficient active learning algorithms for multi-class SVMs that address the nature of the multi-class decision in a more principled way.
Acknowledgments.
The authors gratefully acknowledge support from The Danish Council for Independent Research through the project SkyML (FNU 12-125149).
References
[1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical
Society, 68(3):337–404, 1950.
[2] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 65–72. ACM Press, 2006.
[3] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance weighted active learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 49–56. ACM
Press, 2009.
[4] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems (NIPS), pages 199–207. MIT
Press, 2010.
[5] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research (JMLR), 6:1579–1619, 2005.
[6] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In
Workshop on Computational Learning Theory (COLT), pages 144–152. ACM Press, 1992.
[7] K. Brinker. Incorporating diversity in active learning with support vector machines. In Proceedings of the International Conference on Machine Learning (ICML), pages 59–66. AAAI
Press, 2003.
[8] K. Brinker. Active learning of label ranking functions. In Proceedings of the International
Conference on Machine Learning (ICML), pages 129–136. ACM Press, 2004.
[9] C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In
Proceedings of the International Conference on Machine Learning (ICML), pages 111–118.
Morgan Kaufmann, 2000.

[10] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, 2006.
[11] C. Cortes, M. Mohri, M. Riley, and A. Rostamizadeh. Sample selection bias correction theory. In N. Bshouty, G. Stoltz, N. Vayatis, and T. Zeugmann, editors, Algorithmic Learning Theory, pages 38–53. Springer, 2008.
[12] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[13] S. Dasgupta. Two faces of active learning. Theoretical Computer Science, 412(19):1767–1781,
2011.
[14] S. Dasgupta and D. Hsu. Hierarchical sampling for active learning. In Proceedings of the
International Conference on Machine Learning (ICML), pages 208–215. ACM Press, 2008.
[15] B. Demir and L. Bruzzone. A novel active learning method for support vector regression to estimate biophysical parameters from remotely sensed images. In L. Bruzzone, editor, Proceedings of SPIE 8537, Image and Signal Processing for Remote Sensing XVIII, volume 8537, page 85370L. International Society for Optics and Photonics, 2012.
[16] R. Ganti and A. Gray. Upal: Unbiased pool based active learning. In Proceedings of the
International Conference on Artificial Intelligence and Statistics (AISTATS), pages 422–431,
2012.
[17] T. Glasmachers and C. Igel. Second-order SMO improves SVM online and active learning.
Neural Computation, 20(2):374–382, 2008.
[18] I. Guyon, G. Cawley, G. Dror, and V. Lemaire. Results of the active learning challenge. Journal of Machine Learning Research (JMLR): Workshop and Conference Proceedings, 16:19–45,
2011.
[19] T. Hastie and R. Tibshirani. Classification by pairwise coupling. Annals of Statistics, 26(2):451–471, 1998.

[20] C.-H. Ho, M.-H. Tsai, and C.-J. Lin. Active learning and experimental design with SVMs.
Journal of Machine Learning Research (JMLR): Workshop and Conference Proceedings,
16:71–84, 2011.
[21] S. Hoi, R. Jin, J. Zhu, and M. Lyu. Semi-supervised SVM batch mode active learning for image retrieval. In Proceedings of the Conference on Computer Vision and Pattern Recognition
(CVPR), pages 1–7. IEEE, 2008.
[22] S.-J. Huang, R. Jin, and Z.-H. Zhou. Active learning by querying informative and representative examples. In Advances in Neural Information Processing Systems (NIPS), pages 892–900.
MIT Press, 2010.
[23] D. Lewis and W. Gale. A sequential algorithm for training text classifiers. In Proceedings of the SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages
3–12. ACM Press, 1994.
[24] C.-L. Li, C.-S. Ferng, and H.-T. Lin. Active learning with hinted support vector machine. Journal of Machine Learning Research (JMLR): Workshop and Conference Proceedings, 25:221–
235, 2012.
[25] T. Luo, K. Kramer, D. Goldgof, L. Hall, S. Samson, A. Remsen, and T. Hopkins. Active learning to recognize multiple types of plankton. In Proceedings of the International Conference on Pattern Recognition (ICPR), volume 3, pages 478–481. IEEE, 2004.
[26] A. Mammone, M. Turchi, and N. Cristianini. Support vector machines. Wiley Interdisciplinary
Reviews: Computational Statistics, 1(3):283–289, 2009.
[27] T. M. Mitchell. Generalization as search. Artificial intelligence, 18(2):203–226, 1982.
[28] P. Mitra, C. Murthy, and S. Pal. A probabilistic active support vector learning algorithm.
Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 26(3):413–418, 2004.
[29] F. Olsson. A literature survey of active machine learning in the context of natural language processing. Technical report, Swedish Institute of Computer Science, 2009.
[30] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning, chapter 12, pages 185–208. MIT Press, 1999.

[31] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999.
[32] J. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification.
Advances in Neural Information Processing Systems (NIPS), 12(3):547–553, 2000.
[33] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. Lawrence. Dataset Shift in
Machine Learning. MIT Press, 2009.
[34] J. W. Richards, D. L. Starr, H. Brink, A. A. Miller, J. S. Bloom, N. R. Butler, J. B. James,
J. P. Long, and J. Rice. Active learning to overcome sample selection bias: Application to photometric variable star classification. The Astrophysical Journal, 744(2):192, 2012.
[35] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning
Research (JMLR), 5:101–141, 2004.
[36] S. Salcedo-Sanz, J. L. Rojo-Álvarez, M. Martínez-Ramón, and G. Camps-Valls. Support vector machines in engineering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(3):234–267, 2014.
[37] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In
Proceedings of the International Conference on Machine Learning (ICML), pages 839–846.
Morgan Kaufmann, 2000.
[38] B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[39] B. Settles. Active Learning. Morgan & Claypool, 2012.
[40] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[41] X. Shi, W. Fan, and J. Ren. Actively transfer domain knowledge. In Proceedings of the
European Conference on Machine Learning and Knowledge Discovery in Databases (ECML
PKDD), pages 342–357. Springer, 2008.
[42] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[43] S. Tong. Active learning: Theory and applications. PhD thesis, Stanford University, 2001.
[44] S. Tong and E. Chang. Support vector machine active learning for image retrieval. In Proceedings of the International Conference on Multimedia (MM), pages 107–118. ACM Press,
2001.
[45] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research (JMLR), 2:45–66, 2002.
[46] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.
[47] Z. Xu, K. Yu, V. Tresp, X. Xu, and J. Wang. Representative sampling for text classification using support vector machines. In Proceedings of the European Conference on Information
Retrieval (ECIR), pages 393–407. Springer, 2003.
[48] H. Yu. SVM selective sampling for ranking with application to data retrieval. In Proceedings of the International Conference on Knowledge Discovery in Data Mining (SIGKDD), pages
354–363. ACM Press, 2005.
[49] B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the International Conference on Data Mining (ICDM), pages
435–442. IEEE, 2003.
[50] L. Zhao, G. Sukthankar, and R. Sukthankar. Importance-weighted label prediction for active learning with noisy annotations. In Proceedings of the International Conference on Pattern
Recognition (ICPR), pages 3476–3479. IEEE, 2012.


Words: 608 - Pages: 3

Free Essay

Mat222 Week 1 Assignment

...conservationist’s estimate of the size of the bear population? You will notice while reading question #56 on page 437, we are to assume that the ratio of originally tagged bears to the whole population is equal to the ratio of recaptured tagged bears to the size of the sample. The ratio of originally tagged bears to the whole population is 50X The ration of recaptured tagged bears to the sample size is 2100 50X=2100 This is the proportion set up and ready to solve. I will cross multiply setting the extremes equal to the means. 10050=2x 100and 50 are the extremes, while X and 2 are the means. 50002=2x2 Divide both sides by 2 X = 2500 The bear population on the Keweenaw Peninsula is around 2500 bears. The second problem for assignment one week one I am asked to solve the below equation for y. The first thing I notice is that a single fraction (ratio) on both sides of the equal sign so basically it is a proportion which can be solved by cross multiplying the extremes and the means. y-1x+3=-34 Is the equation I am asked to solve. 3y-1=-3x-4 The result of the cross multiplying. 3y-3=-3x+12 Distribute 3 on the left side and -3 on the right. 3y-3+3=-3x+12+3 Subtract 3 from both sides. 3y=-3x+15 3y3=-3x3+153 Divide both sides by 3 y=-x+5 This is a linear equation in the form of y=mx+b. This equation is in its simplest form. I like how we can take just a couple of numbers from a word equation and put it in an order that will help us solve many estimates...

Words: 340 - Pages: 2

Premium Essay

Pt1420 Unit 3 Assignment

...6.1) Comparison of three numbers Problem is of comparison of values among three values , a very simple problem. Here, the decision making process is done based on the value of the parameters a,b and c. The data types of all these variables are interger. 6.2) Binary search A binary search locates the position of an item in a sorted array. Binary search works by comparing an input value to the middle element of the array. The comparison determines whether the element equals the input, less than the input or greater. When the element being compared to equals the input the search stops and typically returns the position of the element. If the element is not equal to the input then a comparison is made to determine whether the input is less than or greater than the element. Depending on which it is the algorithm then starts over but only searching the top or bottom subset of the array's elements. If the input is not located within the array the algorithm will usually output a unique value indicating this. Binary search algorithms typically halve the number of items to check with each successive iteration, thus locating the given item (or determining its absence) in logarithmic time. A binary search is a dichotomic divide and conquer search algorithm. 6.3) Quadratic problem In Quadratic equation solver problem(called the procedure under test) is chosen. This procedure has three predicates involving both linear and non-linear functions of the input variables. In this there are three...

Words: 728 - Pages: 3

Free Essay

Mobile Service Provider

...11108944 Name: ASHWINI KUMAR Roll No. : RE3R02B32 PART- A 1. Ans :- (a) unary and ternary operator Unary operator:- It pecedes an operand . The operand (the value on which the operator operates ) of the unary operator must have arithmetic or pointer type and the result is the value of the argument. Example:- If a=5 then +a means 5 If a=0 then +a means 0. If a=-4 then +a means -4. Ternary operator:- It precedes an operand. The operand of the unary operator must have arithmetic type and the result is the negation of the operand’s value. Example:- If a=5 then –a means -5 If a=0 then –a means 0 If a=-4 then –a means 4. (b) Assignment and equalto operator Assignment operator:- Equal to operator: An assignment operator assigns value In this we put the To a variable. value as it is. Example – Example- a*=5 means a=5*5. Int a; a=5 means a is initialized with 5 if(a==5) { return true; } return false; (c) Expression and statement Expression:- An expression is any valid combination of operators , constants , and variables. Example:- ...

Words: 399 - Pages: 2

Premium Essay

Student

...Problem Solving with Computing Homework - WEEK 2 [30 points] This is a review of some of the material from Chapter 2 and lectures from class. No credit for answers that are copies or near verbatim transcripts – please use your own words1 and document sources where appropriate. 1 This will apply to all assignments in this class. Answer the following questions: Chapter 2 1. Short Answers [1 point each, 2 points total] 1. What does a professional programmer usually do first to gain an understanding of a problem? The first thing that a professional programmer usually do first to gain an understanding of a program is to closely relate customer (Interview ) to inquire or gather information about the problem. 2. What two things must you normally specify in a variable declaration? The two things normally specified in a variable declaration are the variable type and identifier. 2. Algorithms / Pseudocode [1 point each, 5 points total] 1. Design an algorithm that prompts the user to enter his or her height and stores the user’s input in a variable named height. Declare height Display “Enter Your Height” Input Height Display “Height” 2. Write assignment statements that perform the following operations with the variables a and b. - Adds 2 to a and stores the result in b. - Subtracts 8 from b and stores the result in a Set b=2+a Set a=b-8 3. Write a pseudocode statement that declares the variable cost so it can hold real numbers. Floating Point-Variable...

Words: 1823 - Pages: 8

Free Essay

Prg/211 Calorie Count Tool

...Team B Calorie Count Tool PRG/211 May 5, 2014 Team B Calorie Count Tool PROBLEM STATEMENT Team B was asked to develop a program which would calculate the user’s daily intake of calories and measure those calories against the overall calories expended. The core purpose of this program will do two primary functions. First, it will record the user intake of calories as acquired through meals throughout the day. Second, the user will record caloric output associated with physical activity. This information will be calculated together to determine the caloric surplus or deficit for the user. In order for the program to execute accurately, and provide customized results, the user will be required to input personal data to include gender, age, weight, and height. This additional information is essential to determine the user’s default caloric burn rate, otherwise known as the basal metabolic rate (BMR). The BMR and the calories burned as a result of physical activity will be calculated against the intake of calories to determine the overall success for the user. As the program is executed it must: * Record user name, age, height, weight to enable more accurate calculations * Record the users specific caloric values entered for each meal * Record the user activity and caloric burn values for that activity * Calculate the basal metabolic rate (BMR) for the individual * Subtotal the total caloric values for the day * Combine the physical activity and...

Words: 1524 - Pages: 7

Premium Essay

Week 2 Assigment

...Week 2 Assignment: Understanding Effective Money Management Assessment A, Part 1: Creating a Personal Financial Statement - Assets | 1 point | Car: Bluebook value $1250.00Cash: $378.00Savings Accounts: $826.00 | Assessment A, Part 2: Creating a Personal Financial Statement - Debts | 1 point | Rent: $750.00Electric/ Gas bill: $131.75Cable/ internet/ Phone bill: $80.42Credit Card: $31.00Cell phone bill: $72.37 | Assessment A, Part 3: Identify Money Management Tool | 1 point | Explain to Monica how the money management tools were identified. | Students should explain how they evaluated various cash management products and services. | Assessment A, Part 4: Creating a Personal Financial Statement – Steps | 1 point | Drag the steps listed on the right into their correct sequences on the left. When done click the Send button | Step 1: I got all my financial stuff together – bills, loans, bank statements, etc. | Step 2: I balance my checkbook. | Step 3: I decided what were my assets and what were my debts. | Step 4: I enter my assets in the program. | Step 5: I enter my debts in the program. | Step 6: The program gave me a Net worth figure at the end. | Assessment B: Creating a Monthly Cash Flow Statement ...

Words: 255 - Pages: 2

Premium Essay

Andy Owes Bill a Debt.

...Law Written Assignment 3 Case Study 1 Parks, a 7-foot, 265-pound center for the San Diego Slick, objected when his contract was assigned from the ABC Corporation to the XYZ Corporation, the team’s new owner. The change of owners did not cause a change in the composition of the team although a new coach was hired. Parks’s compensation and his responsibilities remained the same. Was this contract assignable? Facts of the Case: 1) Parks contract was assigned from the ABC Corporation to XYZ Corporation. 2) Parks compensation and his responsibilities remained the same. Issues: 1) The reason why we are in court today is to identify if Park’s contract was assignable. Rules of the Law: 1) Personal Service Contract – The parties agree that a personal service contract may be assigned. This allows the trade of an athlete from one team to another team. 2) Notice of Assignment – Assignee is under a duty to notify the obligor that the assignment has been made and performance must be rendered to the assignee. 3) Anti-Assignment Clause – Prohibits the assignment of rights under the contract. 4) Approval Clause – requires that the obligor approves any assignment of contract. Analysis & Conclusion: Since we do not have all the facts we can assume the following: 1) Parks contract did include the Personal service contract. 2) Notice of assignment was made by XYZ Corporation. 3) Parks contract did NOT include Anti-Assignment Clause. ...

Words: 495 - Pages: 2

Free Essay

Eopp

...Reflection assignment In this assignment I will be using the Gibbs Reflective Model, reference, to reflect on an incident that occurred in placement that demonstrates an understanding of the Outcome : 3.1: Demonstrate that they respect diversity and individual preferences and value differences, regardless of their own personal views. To do this I will first, briefly describe the event, supporting my outline with relevant information. I will then explore the event, and discuss why it is important and how it relates to the learning outcome. I will also be discussing why materials such as law and guidelines say this is important. I will then proceed to analyse the incident by breaking it down and picking out the main features of the experience, discussing why they are important, whilst linking the main points together. I will attempt to think about opposing arguments to what I have explored, and discuss the advantages and disadvantages of the arguments. Finally I will be using SMART goals, to create an action plan for future development. Explain incident with evidence Whilst on a shift, we had an elderly patient arrive on the ward. The patient suffered from a Frank Haematuria, Colovesciular fistula as well as incontinence. It was suggested that the patient received surgery to have this corrected, but the patient refused surgery, stating that at his age he did not want to go through with it, and wanted to put a DNAR in place. I along with the other nurses respected his choice...

Words: 370 - Pages: 2

Free Essay

Misconceptions of Algebra

...Diagnostic Algebra Assessment Definitions Categories Equality Symbol Misconception Graphing Misconception Definition Concept of a Variable Misconception Equality Symbol Misconception As algebra teachers, we all know how frustrating it can be to teach a particular concept and to have a percentage of our students not get it. We try different approaches and activities but to no avail. These students just do not seem to grasp the concept. Often, we blame the students for not trying hard enough. Worse yet, others blame us for not teaching students well enough. Students often learn the equality symbol misconception when they begin learning mathematics. Rather than understanding that the equal sign indicates equivalence between the expressions on the left side and the right side of an equation, students interpret the equal sign as meaning “do something” or the sign before the answer. This problem is exacerbated by many adults solving problems in the following way: 5 × 4 + 3 = ? 5 × 4 = 20 + 3 = 23 Students may also have difficulty understanding statements like 7 = 3 + 4 or 5 = 5, since these do not involve a problem on the left and an answer on the right. Falkner presented the following problem to 6th grade classes: 8 + 4 = [] + 5 All 145 students gave the answer of 12 or 17. It can be assumed that students got 12 since 8 + 4 = 12. The 17 may be from those who continued the problem: 12 + 5 = 17. Students with this misconception may also have difficulty with the idea that adding...

Words: 797 - Pages: 4

Free Essay

Book Report

...Selection statements Selection is used to select which statements are to be performed next based on a condition being true or false. Relational expressions In the solution of many problems, different actions must be taken depending on the value of the data. The if statement in C I used to implement such s decision structure in its simplest form – that of selecting a statement to be executed only if a condition is satisfied. Syntax: if(condtion) statement executed if condition is true When an executing program encounters the if statement, the condition is evaluated to determine its numerical value, which is then interpreted as either true or false. If the condition evaluates to any non-0 value (positive or negative), the condition is considered as a “true” condition and the statement following the if is executed; otherwise this statement is not executed. Relational Operators In C Relational operator | Meaning | Example | < | Less than | age < 30 | > | Greater than | height > 6.2 | <= | Less than or equal to | taxable <= 200000 | >= | Greater than or equal to | temp >= 98.6 | == | Equal to | grade == 100 | != | Not equal to | number !=250 | In creating relational expressions, the relational operators must be typed exactly as given in the above table. Thus, although the following relational expressions are all valid: age > 40 length <= 50 temp >= 98.6 3 < 4 flag == done day != 5 The following are invalid: length =< 50 ...

Words: 1617 - Pages: 7