IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 10, OCTOBER 2013

iLike: Bridging the Semantic Gap in Vertical Image Search by Integrating Text and Visual Features
Yuxin Chen, Student Member, IEEE, Hariprasad Sampathkumar, Student Member, IEEE, Bo Luo, Member, IEEE Computer Society, and Xue-wen Chen, Senior Member, IEEE
Abstract—With the development of the Internet and Web 2.0, large volumes of multimedia content have been made available online. It is highly desirable to provide easy access to such content, i.e., efficient and precise retrieval of images that satisfy users' needs. Toward this goal, content-based image retrieval (CBIR) has been intensively studied in the research community, while text-based search is better adopted in industry. Both approaches have inherent disadvantages and limitations. Therefore, unlike the great success of text search, web image search engines are still premature. In this paper, we present iLike, a vertical image search engine that integrates both textual and visual features to improve retrieval performance. We bridge the semantic gap by capturing the meaning of each text term in the visual feature space, and reweight visual features according to their significance to the query terms. We also bridge the user intention gap because we are able to infer the "visual meanings" behind the textual queries. Last but not least, we provide a visual thesaurus, which is generated from the statistical similarity between the visual space representations of textual terms. Experimental results show that our approach improves both precision and recall, compared with content-based or text-based image retrieval techniques. More importantly, search results from iLike are more consistent with users' perception of the query terms.

Index Terms—CBIR, specialized search, vertical search engine

1 INTRODUCTION

WITH the Internet explosion, a tremendous amount of multimedia information, such as images and videos, has become available on the web. It is highly desirable to retrieve images based on their visual contents. However, unlike the great success of text search, major breakthroughs are still needed to overcome key challenges in content-based multimedia retrieval. First and foremost, visual feature similarities are not necessarily correlated with content similarities. There exists a semantic gap: the gap between low-level visual features and high-level semantics, i.e., the gap between vision and perception. Second, it is difficult to handle the excessive computation required for high-dimensional data. Meanwhile, advances in indexing high-dimensional data are far less mature than text indexing. Last, it is also difficult for users to provide or sketch a good query in the query-by-example scenario.

Very large-scale multimedia repositories (e.g., the Library of Congress Prints and Photographs Catalog) are indexed and retrieved only by manually annotated metadata. Commercial web image search engines still mostly rely on text-based methods, i.e., indexing, retrieving, and ranking images based on surrounding texts or user-generated annotations. With the advances of text-based indexing, such systems demonstrate superior efficiency in handling images over the Internet. However, the search performance (precision) is not always reliable since: 1) it is not always easy to accurately identify "surrounding texts"; 2) surrounding texts do not necessarily describe the image content; and 3) perceptions and descriptions of visual contents are very subjective and inconsistent: search engine users and content creators (narrators) may use different terms, especially for short tags. To remedy the problems of text-only or visual-content-only image retrieval systems, some recent approaches have proposed alternative routes that utilize both textual and visual features in web image search, for example, [2], [3], [4], [5], [6], [7], [8]. They are mostly two-phase hybrid approaches, which first use text retrieval to obtain a candidate result set, and then employ CBIR methods to further process (e.g., cluster or rank) the candidates. In this way, visual and textual features are used separately, instead of being semantically associated. Despite the deficiency of feature integration, the idea of leveraging textual information in a hybrid model shows great potential to reduce the semantic gap in CBIR. This motivates us to learn visual representations for textual terms and obtain semantic interpretations of visual features in an integrated model.

Y. Chen is with the Department of Computer Science, ETH Zurich, CAB F 65.2, Universitaetstrasse 6, 8092 Zurich, Switzerland. E-mail: yuxin.chen@inf.ethz.ch.
H. Sampathkumar and B. Luo are with the Department of Electrical Engineering and Computer Science, The University of Kansas, 2001 Eaton Hall, 1520 West 15th St., Lawrence, KS 66045. E-mail: {hsampath, bluo}@ittc.ku.edu.
X.-w. Chen is with the Department of Computer Science, Wayne State University, 5057 Woodward Ave., Detroit, MI 48202. E-mail: xwchen@wayne.edu.
Manuscript received 2 June 2011; revised 20 Apr. 2012; accepted 19 Sept. 2012; published online 1 Oct. 2012. Recommended for acceptance by K. Chakrabarti. For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2011-06-0316. Digital Object Identifier no. 10.1109/TKDE.2012.192.


In this paper, we present a vertical search engine, namely iLike, which integrates both text and visual features to improve image retrieval performance. In vertical search, we have a better chance to integrate visual and textual features: first, text contexts are better organized; hence, focused crawlers/parsers are able to identify patterns and link text descriptions and images with higher confidence. Moreover, with domain knowledge, we can select image features and similarity measures that are more effective for the domain. Finally, the computation issue becomes less critical for a smaller data set. We have implemented iLike as a vertical product search engine for apparel shopping, where textual and visual contents coexist and correlate. In iLike, we discover the relationships between textual features extracted from product descriptions and image features extracted from product pictures. We further associate both types of features to build a bridge across the semantic gap. Our technical contributions are threefold: 1) We bridge the semantic gap by integrating textual and visual features and hence significantly improve the precision of content-based image retrieval (CBIR). We also improve the recall by yielding items that would otherwise be missed by searching with either type of feature. 2) We bridge the user intention gap between users' cognitive intentions (information needs) and the textual queries received by the IR system. Our system is able to perceive users' "visual intentions" behind search terms, and apply such intentions to relevance assessment and ranking. 3) By assessing representations of keywords in the visual feature space, we are able to discover the semantic relationships of the terms and automatically generate a thesaurus based on the "visual semantics" of words. Such a visual thesaurus can be further utilized in our system to improve search performance. The rest of the paper is organized as follows: We discuss related works in Section 2, and follow up with an introduction of the iLike architecture in Section 3. We describe the details of our algorithms in Section 4, and present evaluation results and discussions in Section 5. Finally, we conclude the paper in Section 6.

2 RELATED WORKS

2.1 Content-Based Image Retrieval
Early image retrieval systems manually annotated images with metadata, and used text-based retrieval to search on tags. However, manual annotation is extremely time consuming, especially for very large-scale image databases. Also, it is difficult to describe images accurately with a controlled set of keywords. Discrepancies between the query terms and tagging terms lead to poor retrieval results. To tackle such issues, content-based image retrieval (CBIR) systems were developed to use visual features such as color and texture to index and retrieve images. Comprehensive surveys on CBIR can be found in [9], [10], [11]. The primary challenge in CBIR is the semantic gap between the high-level image content and the low-level visual features. CBIR techniques also tend to be computationally intensive due to the high dimensionality of visual features.

2.2 Image Annotation
Automatic image tagging. Automated image tagging techniques automatically add tags or metadata to images. The task has primarily been treated as a classification problem, and hence several supervised learning techniques have been attempted for it. In general, the goal is to build a classifier that identifies the mapping between the low-level image features and the labels used to classify the images in the training set. Once the classifier is trained, it assigns each testing sample to the class with the highest likelihood. Popular techniques such as probabilistic models [20], [21], generative models [22], machine translation [23], and image search [24], [25] have been used for automatic image tagging. Text-image interaction methods [14], [26], [27] that make use of visual information to help annotate images have also been proposed.

Folksonomic tagging. Automatic tagging approaches were shown to be most effective when the keywords occur frequently and have strong visual similarities. However, it is challenging to annotate images with more specific or visually less similar keywords. Meanwhile, manual tagging is also used to help with image retrieval [28], [29]. Google Image Labeler (http://images.google.com/imagelabeler/) and Flickr image tags (http://www.flickr.com/photos/tags/) are some examples of such efforts. The practice is referred to as folksonomic tagging in Web 2.0, which aims to facilitate sharing of user-generated content. To overcome the tedium of manual tagging and to improve the quality of the image tags, automated tag recommendation systems like [30] have been developed. With a growing number of social network sites that allow sharing and tagging of photos, methods like [31] have been used to develop fully automated and folksonomically scalable tag recommendation systems. Such systems leverage the collective vocabulary of a group of users, which is less susceptible to noise than an individual's subjective annotation, resulting in high-quality image tags.

2.3 Image Search on the Web
Text-based image retrieval. Current web image search engines like Google Image Search (http://images.google.com) and Bing (http://www.bing.com/images) primarily rely on textual metadata. They take textual queries and match them with the metadata associated with the images, such as the image file name, URL, and other surrounding text in the webpage containing the image. Since the textual information surrounding an image does not necessarily describe the image content, and it is difficult to describe visual content using text, the retrieval performance of metadata-based searches can still be poor. There are also more aggressive text-based methods [32], [33] to better associate semantic information with the images. Link analysis techniques [34], [35] have also been employed to improve search performance.

Hybrid methods for image search. Several prototypes for content-based image search on the web are available: [36], [37], [38], [39], [40]. Luo et al. [2] introduced a two-stage hybrid approach where a text-based search is first used to generate an intermediate result set with high recall and low precision, which is then refined in the second step by



applying CBIR to cluster or rerank the results. Although this approach suffers from oversimplified image features and clustering methods, the idea of applying CBIR to text search results for clustering or reranking is a viable alternative. More sophisticated clustering algorithms were later proposed [6], [7] to group search results for better presentation to the users. More recently, Bing image search (http://www.bing.com/images/) has started to employ CBIR techniques to rerank search results [3], [4], [5], [8] when users select the "show similar images" option. In particular, IntentSearch [3], [4], [5] infers user intentions from the selected image (e.g., "the user wants scenery images," or "the user wants portraits"), and proposes an adaptive similarity measure to enforce such intentions. Wang et al. [8] learn query-specific visual semantic spaces through keyword expansion, and rerank images in the visual semantic space. On the other hand, Wang et al. [41] proposed a ranking-based distance metric learning method to learn a new distance measure in the visual space, which approximates the distance measure in the textual space. The approach is used to retrieve more semantically relevant images for an unseen query image. Our approach to image retrieval differs significantly from these existing approaches in the way we integrate both textual and visual features.

Domain-specific image search. Some research efforts have proposed to apply CBIR to vertical search, which caters only to specific subdomains of the web. These vertical search engines employ focused crawlers to crawl constrained subsets of the general web, and evaluate user queries against such domain-specific collections of documents. Besides leveraging the benefits of a smaller data set, these engines can also employ domain knowledge to help with feature selection, relevance assessment, and result ranking. Some examples of vertical image search include: photo album search [42], product search (http://www.like.com/, http://www.riya.com/), airplane image search (http://www.airliners.net/), and so on. There are also offline image retrieval systems that work on domain-specific collections of images, such as personal albums [43], [44], leaf image search [45], [46], fine arts image search [47], and so on. These approaches make use of domain-specific knowledge in image preprocessing, feature selection, and similarity measurements. For example, leaf image search may emphasize shape and texture features, while personal album search may employ face recognition methods to improve search performance.

Fig. 1. iLike system architecture: (1) crawling; (2) text processing; (3) image processing; (4) integration of visual and textual features; (5) Reranking.

3 SYSTEM OVERVIEW

3.1 System Architecture
Our goal is to integrate textual and visual features in vertical search; therefore, we select a domain where text and image contents are directly associated and equally important. Online shopping, especially clothing shopping, is a good example of such a domain: 1) users can only issue text queries to start the search, but they focus more on the visual presentations of the results; 2) due to personal tastes, the descriptions of fashionable items are very subjective, hence traditional text-based search may not yield satisfactory results; in particular, the recall can be very low when there is a discrepancy between the user's and the narrator's vocabularies; 3) two dresses that are visually very different may have a similar style in human perception, hence pure CBIR will not yield high recall either. Therefore, apparel shopping is an ideal scenario to demonstrate the power of integrating visual and textual features. Note that our arguments also hold for many other shopping categories; the system could be migrated to other categories with reasonable modification.

The iLike system is comprised of three major components: the Crawler, the (Pre-)Processor, and the Search and UI component. As shown in Fig. 1:
1. The Crawler fetches product pages from retailer websites.
2. A customized parser extracts item descriptions and generates the term dictionary and inverted index.
3. Simultaneously, the image processor extracts visual features from item images.
4. Next, we integrate textual and visual features in a reweighting scheme, and further construct a visual thesaurus for each text term.
5. Finally, the UI component provides the query interface and browsing views of search results.
In iLike, a user starts with a traditional text query (because query-by-example is not practical in this scenario), which yields a ranked list of relevant items (namely the initial set) retrieved by text algorithms. For each result in the initial set, we construct a new query by integrating textual and visual features from the item image. Each new query is evaluated to find more "similar" items. More importantly, a weight vector that represents the "visual intention" behind the text query is enforced during evaluation of the new queries. For instance, with the text query "silky blouse," the weight vector will increase the significance of some texture features and fade out irrelevant features, hence correctly interpreting the visual meaning behind the search term "silky." The overall philosophy of our approach is to infer user intention from the query and enhance the features that are implicitly favored.
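For concreteness, the following is a minimal sketch of this query flow in Python. All data structures and names (ilike_search, inverted_index, term_weights) are illustrative assumptions rather than the actual iLike components.

```python
import numpy as np

def ilike_search(query_terms, inverted_index, item_features, term_weights, top_k=10):
    """query_terms: list of strings; inverted_index: term -> set of item ids;
    item_features: (num_items, M) array; term_weights: term -> (M,) weight vector."""
    # 1) Text retrieval: items whose descriptions contain all query terms (the initial set).
    candidate_sets = [set(inverted_index.get(t, ())) for t in query_terms]
    initial_set = set.intersection(*candidate_sets) if candidate_sets else set()

    # 2) "Visual intention" of the query: combine the terms' offline weight vectors.
    known = [term_weights[t] for t in query_terms if t in term_weights]
    w_q = np.mean(known, axis=0) if known else np.ones(item_features.shape[1])

    # 3) Each seed item becomes a reweighted visual query; merge the ranked results.
    scores = np.zeros(item_features.shape[0])
    norms = np.linalg.norm(item_features, axis=1) + 1e-12
    for seed in initial_set:
        q = item_features[seed] * w_q            # emphasize features the query cares about
        sims = item_features @ q / (norms * (np.linalg.norm(q) + 1e-12))
        scores = np.maximum(scores, sims)        # keep each item's best score over seeds
    return np.argsort(-scores)[:top_k]
```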

3.2 Crawling and Feature Extraction
Data collection. In the iLike prototype, we have crawled 42,292 items from eight online retailers: Banana Republic, Old Navy, Gap, Athleta, Piperlime, Macy's, Bluefly, and


Nordstrom. We use focused crawlers to collect text and images. For each product, we record the name, category, full textual description, and so on. In the dictionary, we have 1.2K "frequent" terms (df > 30). The system could be expanded to more retailers by implementing customized crawlers and parsers.

Visual features. In iLike, a set of 401 popular visual features is extracted from product images. We use the gray-level co-occurrence matrix [13] for texture features: contrast, correlation, energy, and homogeneity of the gray-scale images are calculated, each generating a 4-scale feature vector. Thirteen Haralick texture features are extracted from the gray-level co-occurrence matrices. Image coarseness and direction are captured by three dimensions of Tamura texture features [14]. We apply Gabor wavelet filters in eight directions and five scales, acquiring a vector of 40 texture features. Besides, Fourier descriptors [16] are employed to add nine features. Contours are represented with seven geometric invariant moments [15]. We capture the spatial distribution of edges with five edge strengths generated from an edge histogram descriptor [18]. The edge orientation is represented by phase congruency (PC) features [48] and high-order moments of the characteristic function (CF) [49]: a three-level Daubechies wavelet decomposition is performed before edge detection; at each level, the first four moments of phases (generated by a Sobel edge detector) are obtained, together with the first three moments of the characteristic function. For color distribution [12], we first divide an image into blocks (1×1, 2×2, and 3×3), and then extract the first three moments of all blocks in each of the YCbCr channels, i.e., 90 color moments (CM) features. The color histogram features are generated by a color quantization approach: we map the original image into the HSV color space, and implement color quantization using 72 colors (eight levels for the H channel, three levels for the S channel, and three levels for the V channel).

The chosen features have been shown to work well for image classification in the literature, for example, [17], [12], [18]. Meanwhile, a comparative study [50] has shown that the effectiveness of visual features is task dependent. However, such a specifically optimized system may not be easily migrated to other tasks, because 1) the manual feature selection process is labor-intensive, and 2) parameters fine-tuned for a certain domain may not yield good performance in a different domain. For instance, visual features selected for leaf images [45] may not perform very well for personal albums [44]. On the other hand, in iLike features are automatically weighted based on their significance to the keyword. Therefore, the "quality" of the low-level visual features is not the key factor in our system, because "bad" features are faded out. As a side effect, our method is robust: the ranking quality is less sensitive to the selection of low-level image features. We will further discuss feature quality and correlation in Section 5.

Segmentation. Different retailers have different styles for product images, some of which may introduce difficulties in feature extraction. For instance, the presence of a lingerie model could significantly influence many features. To clean

the product images and minimize such errors, we perform a "YCbCr skin-color model" [51] based image segmentation on selected domains (i.e., categories and shopping sites whose images usually include models) to remove the skin area.

Normalization. Features from different categories are not comparable with each other, because they take values from different domains. Without any normalization, search results would be dominated by the features taking larger values. To tackle the problem, we map the range of each feature $\tilde{x}$ to (0, 1):

$$ y_i = \frac{x_i - \min(\tilde{x})}{\max(\tilde{x}) - \min(\tilde{x})}, \qquad (1) $$

in which i indicates the ith item. After normalization, all the features are comparable.
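As an illustration of two of the steps above, the sketch below computes a 72-bin HSV color histogram (8 H levels × 3 S levels × 3 V levels) and applies the min-max normalization of (1). The quantization boundaries and array layout are assumptions, not the exact iLike implementation.

```python
import numpy as np

def hsv_color_histogram(hsv_image: np.ndarray) -> np.ndarray:
    """72-bin color histogram: 8 levels for H, 3 for S, 3 for V.
    hsv_image is an (H, W, 3) float array with all channels scaled to [0, 1]."""
    h = np.minimum((hsv_image[..., 0] * 8).astype(int), 7)
    s = np.minimum((hsv_image[..., 1] * 3).astype(int), 2)
    v = np.minimum((hsv_image[..., 2] * 3).astype(int), 2)
    bins = (h * 9 + s * 3 + v).ravel()               # joint bin index in [0, 71]
    hist = np.bincount(bins, minlength=72).astype(float)
    return hist / hist.sum()                         # relative frequencies

def min_max_normalize(features: np.ndarray) -> np.ndarray:
    """Eq. (1): map each feature column of an (items x features) matrix to (0, 1)."""
    fmin, fmax = features.min(axis=0), features.max(axis=0)
    return (features - fmin) / (fmax - fmin + 1e-12)
```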

4 THE METHOD

In multimedia information retrieval, the roles of textual feature space and visual feature space are complementary. Textual information better represents the semantic meaning, while visual features play a dominant role at the physical level. They are separated by the semantic gap, which is the major obstacle in content-based image retrieval. In this section, we present an innovative approach to bridge the gap and allow transformation from one space to another.

4.1 Representing Keywords
For online images and their descriptions, the textual description is a projection of the narrator's perception of the image content. However, there are difficulties in using only text features to retrieve mixtures of image/textual contents: perception is subjective, and the same impression could be described with different words. Moreover, calculating text similarity (or distance) is difficult: distance measurements (such as cosine distance in TF/IDF space) do NOT perfectly represent distances in human perception. For instance, from a customer's perspective, "pale" is similar to "white" but different from "gray." However, they are equally different in terms of textual representation (e.g., orthogonal in the vector space model). To make up for the deficiency of pure text search or pure CBIR approaches, we explore the connections between the textual and visual feature subspaces. The text description represents the narrator's perception of the visual features. Therefore, items that share similar descriptions may also share some consistency in selected visual features. Moreover, if the consistency is observed over a significant number of items described by the same keyword, such a set of features and their values may represent the human "visual" perception of the keyword. In addition, if items with different descriptions demonstrate a different value distribution on these selected visual features, we can further confirm the correlation between the terms and these visual features. For instance, let us look at the items with the keyword "dotted" in their descriptions (some examples are shown in Fig. 2). Although they come from different categories and different vendors, they all share very distinctive texture features. On the other hand, they differ a lot in other


Fig. 2. Some items that have the keyword “dotted” in their descriptions.

features, such as color and shape. This indicates that the term "dotted" is particularly used to describe certain texture features. When a user searches with this term, her intention is to find such texture features, not particular colors or shapes. In this way, many terms can be connected with such a "visual meaning." In iLike, the first step is to discover such "visual meanings" automatically.

Base representation. Assume that there are N items with term T in their descriptions, and each item is represented by an M-dimensional visual feature vector $\tilde{X}_k = (x_{k1}, x_{k2}, \ldots, x_{kM})^T$, where $k = 1, \ldots, N$. The mean vector of the N feature vectors is used as a base representation of term T in the visual feature space:

$$ \tilde{\mu} = \left( \frac{1}{N}\sum_{k=1}^{N} x_{k1}, \; \frac{1}{N}\sum_{k=1}^{N} x_{k2}, \; \ldots, \; \frac{1}{N}\sum_{k=1}^{N} x_{kM} \right)^T. $$

When N is large enough, $\tilde{\mu}$ will preserve the common characteristics in the image features and smooth out the variations. In such a manner, the mean vector is rendered as a good representation of the keyword. However, those N feature vectors may not share consistency over all visual features; hence, not all dimensions of the mean vector are meaningful. As shown in the "dotted" example, those items are only similar in some texture features, while they differ a lot in color and shape features. Such consistency/inconsistency of a feature is a better indicator of the significance of the feature toward human perception of the keyword. Therefore, a more important task is to quantify such consistency or inconsistency.
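A minimal sketch of this base representation, assuming the item features and a term-to-item mapping are already available as the (hypothetical) structures below:

```python
import numpy as np
from typing import Dict, List

def term_mean_vectors(features: np.ndarray,                 # shape (num_items, M)
                      term_to_items: Dict[str, List[int]],  # term -> indices of items containing it
                      min_df: int = 30) -> Dict[str, np.ndarray]:
    """Base representation: mean feature vector of the items described by each frequent term."""
    means = {}
    for term, items in term_to_items.items():
        if len(items) > min_df:                   # keep only "frequent" terms (df > 30 in iLike)
            means[term] = features[items].mean(axis=0)
    return means
```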

Fig. 3. Examples of feature distributions: solid: value distribution of positive samples; dashed: value distribution of negative samples.

4.2 Weighting Visual Features
As shown in the "dotted" example, features coherent with the human perception of the keyword tend to have consistent values, while other features are more likely to be diverse. To put it another way, suppose that we have two groups of samples: 1) positive: N1 items that have the keyword in their descriptions, and 2) negative: N2 items that do not contain the keyword. In this way, if the meaning of a keyword is coherent with a visual feature, its N1 values in the positive group should demonstrate a different distribution than the N2 values in the negative group. Moreover, the feature values in the positive group tend to demonstrate a small variance, while values in the negative group are usually diversified. Fig. 3 demonstrates the value distribution of eight different features for the keyword "dotted." In the figure, blue (solid) lines represent distributions of the positive

samples, while red (dashed) lines represent the distributions of negative samples. Note that the sample sets are fitted to normal distributions for better presentation in the figure; however, when we quantitatively compare the two distributions, we do not make such an assumption. For the first four texture features, the distributions of the positive samples are significantly different from those of the negative samples (i.e., items described by the keyword are statistically different from other items in these features). On the contrary, the two distributions are indistinguishable for the other four features (selected from color and shape). As we can see from Fig. 3, there are still overlaps between the distributions of positive and negative samples. This indicates that there are items visually similar to the positive items on those "good" features, but they do not have the particular keyword (e.g., "dotted") in their descriptions. In the experimental results in Section 5, we will show that iLike is able to retrieve such items without getting false hits (e.g., items with similar colors to the positive samples, but without the "dotted" texture).

The difference between two distributions can be quantitatively captured by running the Kolmogorov-Smirnov (K-S) test [52] across each dimension of the feature vectors. The two-sample K-S test is commonly used for comparing two data sets because it is nonparametric and does not make any assumption on the distribution. The null hypothesis is that the two samples are drawn from the same distribution. For n i.i.d. samples $X_1, X_2, \ldots, X_n$ with unknown distribution, an empirical distribution function can be defined as

$$ S_n(x) = \begin{cases} 0, & \text{if } x < X_{(1)} \\ \frac{k}{n}, & \text{if } X_{(k)} \le x < X_{(k+1)}, \; k \in \{1, 2, \ldots, n-1\} \\ 1, & \text{if } x \ge X_{(n)}, \end{cases} $$

where $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ are the sample values in ascending order. The K-S statistic against a given distribution function $S(x)$ is

$$ D_n = \max_x |S_n(x) - S(x)|. $$

The cumulative distribution function K of the Kolmogorov distribution is

$$ K(x) = \frac{\sqrt{2\pi}}{x} \sum_{i=1}^{\infty} e^{-(2i-1)^2 \pi^2 / (8x^2)}. $$


Fig. 4. Weight vectors for terms “pattern,” “orange,” “decorated,” and “cute.”

It can be proved that $\sqrt{n} D_n = \sqrt{n} \max_x |S_n(x) - S(x)|$ converges to the Kolmogorov distribution [52]. Therefore, if $\sqrt{n} D_n > K_\alpha$, where $\Pr(K \le K_\alpha) = 1 - \alpha$, the null hypothesis of the K-S test is rejected at confidence level $\alpha$. Similarly, to determine whether the distributions of two data sets differ significantly, the two-sample K-S statistic is

$$ D_{n,m} = \max_x |S_n(x) - S_m(x)|, $$

and the null hypothesis is rejected at level $\alpha$ if

$$ \sqrt{\frac{nm}{n+m}} \, D_{n,m} > K_\alpha. \qquad (2) $$

The P-value from the K-S test is used to measure the confidence of the comparison results against the null hypothesis. Back to our scenario, for each keyword, a P-value is calculated at each dimension of the feature vector. Features with lower P-values demonstrate a statistically significant difference between the positive and negative groups. For instance, the P-values for the features shown in Fig. 3 row 1 are: 0, $3.901 \times 10^{-319}$, $2.611 \times 10^{-255}$, $5.281 \times 10^{-250}$; and for Fig. 3 row 2: $2.103 \times 10^{-1}$, $1.539 \times 10^{-5}$, $8.693 \times 10^{-4}$, $1.882 \times 10^{-5}$. As we can see, items described by the keyword have significantly different values in the former features, compared with items that are not described by the keyword. Therefore, such features are more likely to be coherent with the visual meaning of the keyword, and hence more important to the human perception of the keyword. On the contrary, items with and without the keyword have statistically indistinguishable values on the other visual features, showing that such features are irrelevant to the keyword. In this way, we can use the inverted P-value of the K-S test as the weight of each visual feature for each keyword. Note that P-values are usually extremely small, so it is necessary to map them to a reasonable scale before using them as weights. Ideally, the mapping function should satisfy the following requirements: 1) it should be a monotone decreasing function: lower P-values should give higher weights; 2) when the variable decreases below a threshold (conceptually, small enough to be determined as "statistically significant"), the function value should change more slowly. Therefore, we apply two steps of normalization. First, we design a mapping function:

$$ f(x) = \frac{\arctan(-\log(x) - C) + \arctan(C)}{\pi}, $$

where $C = (\max(x) - \min(x))/2$. It is then followed by a linear scaling to map the data range to (0, 1), yielding the weight vector of the keyword. By reweighting visual features for each keyword, we amplify the features that are significant for the keyword, while fading out the others. As an example, Fig. 4 shows the normalized weight vectors computed for the keywords "pattern," "orange," "decorated," and "cute." In the figure, the X-axis represents the visual features (as introduced in Section 3): dimensions (1-32) are texture features: contrast, correlation, homogeneity, coarseness, direction, moment invariants, and so on; (33-112) are texture features from the frequency domain: Gabor texture, Fourier descriptors, and so on; (113-239) are shape features: shape invariant moments, edge directions, moments of the characteristic function, and phase congruency; and (240-401) are color features: color moments and color histogram. Note that we group the visual features as above just for the convenience of discussion, and these groups of features might overlap with each other. In the figure, a large value (higher weight, lower P-value) is generated by statistically different positive and negative samples, indicating that the feature is more likely to have some kind of association with the human perception of the term. From the figures, we can see that some texture features show more significance in representing the keyword "pattern," while the visual meaning of the keyword "orange" is primarily captured by color features. In this way, when a user queries with the term "pattern," we can infer that she is more interested in texture features, while local color and shape features are of less importance. Most importantly, we can further retrieve items that have a similar visual presentation in such features, but do not have the particular term ("pattern") in their descriptions. On the other hand, it is difficult to imagine or describe the human visual perception of some keywords. Fortunately, our approach is still capable of assessing such perceptions. For instance, Fig. 4 also shows the weight vectors for the terms "decorated" and "cute." It is not easy for a user to summarize the characteristics of "cute" items. However, when we look at the figure, the visual meaning is obvious: "cute" items share some distinctive distributions in the color and shape features, while they are diversified in intensity and high-frequency texture features.
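The weighting procedure described above can be sketched as follows, using scipy.stats.ks_2samp for the two-sample K-S test; the clipping constant and the final scaling details are assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def keyword_weight_vector(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """pos: (N1, M) features of items containing the keyword; neg: (N2, M) features of the rest.
    Returns an (M,) weight vector in (0, 1): low K-S P-value -> high weight."""
    M = pos.shape[1]
    p_values = np.array([ks_2samp(pos[:, i], neg[:, i]).pvalue for i in range(M)])
    x = -np.log(np.clip(p_values, 1e-320, 1.0))    # clip to avoid log(0)
    c = (x.max() - x.min()) / 2.0
    f = np.arctan(x - c) + np.arctan(c)            # the arctan mapping of Section 4.2
    return (f - f.min()) / (f.max() - f.min() + 1e-12)   # linear scaling to (0, 1)
```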

4.3 Visual Thesaurus
Thesauri are widely used in information retrieval, especially in linguistic preprocessing and query expansion. Although manually generated thesauri have higher quality, the development process is very labor-intensive. Meanwhile, we can automatically generate thesauri using statistical


analysis of textual corpora, based on co-occurrence or grammatical relations of terms. In iLike, we generate a different type of thesaurus, a visual thesaurus, based on the term distributions in the visual space, i.e., the statistical similarities of the visual representations of the terms. In iLike, two terms are similar in terms of "visual semantics" if they are used to describe visually similar items. Since each term is used to describe many items, the similarity is assessed statistically across all the items described by both terms. In particular, the visual representations (mean vectors) and weight vectors of two terms t1 and t2 are denoted as M-dimensional vectors $\tilde{\mu}_{t_1}$, $\tilde{\mu}_{t_2}$, $\tilde{\omega}_{t_1}$, $\tilde{\omega}_{t_2}$, respectively. The similarity between t1 and t2 is defined as the cosine similarity of the two weighted mean vectors:

$$ \mathrm{sim}(t_1, t_2) = \frac{\sum_{i=1}^{M} (\mu_{t_1,i}\,\omega_{t_1,i}) \cdot (\mu_{t_2,i}\,\omega_{t_2,i})}{\sqrt{\sum_{i=1}^{M} (\mu_{t_1,i}\,\omega_{t_1,i})^2} \cdot \sqrt{\sum_{i=1}^{M} (\mu_{t_2,i}\,\omega_{t_2,i})^2}}. \qquad (3) $$

In this formula, each term vector (in the visual feature space) is weighted by its weight vector, so that only the values of statistically meaningful components are preserved. In this way, we are able to compute the semantic similarities between text terms, and such semantic similarities are coherent with human visual perception in this particular application domain. We also observe that some nonadjective terms demonstrate moderate similarity with many other terms; we eliminate such high-frequency terms through postprocessing. We are also able to compute antonyms, which are terms having a similar set of significant feature components but carrying consistently opposite values on such features, i.e., their weight vectors are similar, but their weighted mean vectors are different. Examples of synonyms and antonyms are shown in Fig. 5. As we can see, the weight vectors of the terms "pale," "white," and "gray" are quite similar, indicating that they are related to a similar set of visual features in human perception (in this case, mostly color features). Meanwhile, the weighted mean vectors of "pale" and "white" are similar, while that of "gray" is very different. We calculate the term-wise similarity across the dictionary to generate a domain-specific "visual thesaurus" or "visual WordNet". Some examples are shown in Table 1. This thesaurus could be used for query expansion in existing text-based product search engines, or in many other information retrieval applications.

Fig. 5. Weight vectors (left) and weighted mean vectors (right) for terms "pale," "white," and "gray" (from top to bottom): weight vectors are very similar.
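A sketch of the similarity in (3) and of building the thesaurus from it; the dictionaries of mean vectors and weight vectors are assumed to come from the earlier steps.

```python
import numpy as np

def visual_similarity(mu1, w1, mu2, w2) -> float:
    """Eq. (3): cosine similarity of the weighted mean vectors of two terms."""
    a, b = mu1 * w1, mu2 * w2
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def visual_thesaurus(means: dict, weights: dict, top_k: int = 5) -> dict:
    """For each term, list its top-k most visually similar terms."""
    terms = sorted(means)
    thesaurus = {}
    for t in terms:
        scored = [(visual_similarity(means[t], weights[t], means[u], weights[u]), u)
                  for u in terms if u != t]
        thesaurus[t] = [u for _, u in sorted(scored, reverse=True)[:top_k]]
    return thesaurus
```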

4.4 Weight Vector Optimization
As we have introduced, product descriptions can be very subjective due to personal tastes. Different narrators/retailers may use different words to tag similar objects. Due to the existence of synonyms, we observe false negatives in the negative sets. A false negative is an item that: 1) is actually relevant to the term, 2) demonstrates visual features similar to the positive items, and 3) is described by a synonym of the term, not the term itself, and hence is categorized in the negative set of the term. As shown by the "good" features in Fig. 3, we still observe overlaps in the feature value distributions of negative and positive samples. Such overlaps will reduce the weight of the corresponding feature toward any of the synonym terms, and possibly decrease search performance. The domain-specific visual thesaurus can help us find both synonyms and antonyms. By merging items described by synonyms, we can decrease the number of false negatives caused by those synonyms; hence, we can observe higher consistency on significant features, and obtain higher weights from them. In iLike, we first generate an initial visual thesaurus for all the terms in the dictionary. Next, for each term, we add the items described by its top synonyms into its positive set. A high threshold is enforced in determining

TABLE 1 Visual Thesaurus


Fig. 6. An example of feature distributions of sets identified by terms “pale” and “cream” (synonyms), and their combination.

Fig. 7. Feature quality: (left): entropy of feature weights across all terms; (upper): a high-quality feature; (lower): a low-quality feature.

the top synonyms, so that we do not introduce false positives into the positive set. We then recalculate the weight vector according to the updated positive/negative sets. An example of the value distributions (normalized) of a color feature of the positive and negative sets identified by the terms "pale" and "cream" is shown in Fig. 6 (dashed lines). The distributions of the positive and negative sets from the combined set are also shown. For demonstration, we normalized the distributions so that the area under each curve is constant. We can see that the feature distribution of the combined positive set is cleaner and narrower. By iteratively combining similar keywords in the visual thesaurus, we can improve the quality of the weight vectors. Our experiments have shown that the number of synonyms to be merged decreases significantly after each iteration.
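One round of this optimization might look as follows, reusing the keyword_weight_vector and visual_similarity helpers sketched in the previous sections; the similarity threshold is an assumption.

```python
def refine_weights(features, term_to_items, means, weights, threshold=0.95):
    """One iteration: merge items described by a term's top synonyms into its positive set,
    then recompute the weight vector with keyword_weight_vector (sketched in Section 4.2)."""
    new_weights = {}
    all_items = set(range(features.shape[0]))
    for term, items in term_to_items.items():
        positive = set(items)
        for other, other_items in term_to_items.items():
            if other != term and visual_similarity(means[term], weights[term],
                                                   means[other], weights[other]) >= threshold:
                positive |= set(other_items)       # add items described by the synonym
        negative = sorted(all_items - positive)
        new_weights[term] = keyword_weight_vector(features[sorted(positive)],
                                                  features[negative])
    return new_weights
```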

4.5 Feature Quality and Correlation
In CBIR, the entropy of low-level visual features is widely used for feature selection [53] and image annotation [54]. In iLike, we revisit this problem by utilizing the entropy of feature weights across all keywords. Intuitively, if a feature produces low weights for all terms in the dictionary, it is "useless" because it will always have a very low value in weighted queries. On the contrary, if a feature produces high weights for all terms, it is not a good feature either, since it does not represent any distinctive semantic meaning. In practice, we do not find any feature that is significant for (almost) all keywords. In Section 4.2, we have generated a weight vector for each keyword. For each visual feature, we collect the weight values across all keywords (i.e., the ith component of all weight vectors). The entropy of each collection of weights is used as a quality assessment of the particular feature [55], [56]. Entropy-based feature quality assessment is somewhat empirical, in that "good" features are identified relative to "bad" features. A good feature will produce high weights for some terms, and low weights for the others. In other words, we are able to observe the semantic meaning of the features with higher entropy. The feature-quality curve is shown in Fig. 7a. On the other hand, Figs. 7b and 7c demonstrate the weight histograms for two different features. As we can see, the feature shown in Fig. 7b has higher weights for some terms, while the feature in Fig. 7c has low weights for all terms. That is to say, the first feature is able to

distinguish the positive and negative sets for some terms, while the other feature does not work well for any term. The first feature is certainly better than the other one. Fig. 7 also shows that most of the selected features demonstrate good quality, except for a few color features (e.g., those with much lower entropy in Fig. 7a). This is consistent with the CBIR literature. On the other hand, features may be correlated. In iLike, if two features are significant for a similar set of keywords, and insignificant for the others, they are somewhat correlated. To quantitatively study the correlations among the selected visual features, we calculated the pairwise Pearson product-moment correlation coefficient (PMCC) for all the features, and the results are shown in Fig. 8, in which black denotes maximum correlation, and white denotes no correlation. We can see that the features are mostly independent, with moderate correlations among features of the same type. We observe stronger correlations among CF and PC features. Such correlations introduce some computational overhead in iLike, but the impact on search precision is very limited.
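The two diagnostics above can be sketched as follows; the histogram binning used for the entropy estimate is an assumption.

```python
import numpy as np

def feature_quality(weight_matrix: np.ndarray, bins: int = 20) -> np.ndarray:
    """weight_matrix: (num_keywords, M) stack of per-keyword weight vectors.
    Returns the entropy of each feature's weights across all keywords."""
    M = weight_matrix.shape[1]
    entropies = np.empty(M)
    for i in range(M):
        hist, _ = np.histogram(weight_matrix[:, i], bins=bins, range=(0.0, 1.0))
        p = hist / hist.sum()
        p = p[p > 0]
        entropies[i] = -(p * np.log2(p)).sum()
    return entropies

def feature_correlation(feature_matrix: np.ndarray) -> np.ndarray:
    """Pairwise Pearson correlation (PMCC) between the feature columns of an (items, M) matrix."""
    return np.corrcoef(feature_matrix, rowvar=False)
```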

4.6 Query Expansion and Search
As we have introduced, in iLike, we first employ classic text-based search to obtain an initial set (since users could only provide text queries). For each keyword in the user query, the system loads its corresponding weight vector,

Fig. 8. Feature correlation: features are mostly independent; some CF and PC features are correlated.


Fig. 9. Search results for query “printed”: (a) user selection from the initial set; (b) iLike query vector and the top 2 results; (c) Baseline (CBIR) query vector and the top 2 results.

Fig. 10. Search examples: text queries, top items in the initial sets, and corresponding baseline and iLike results.

TABLE 2 Name of Similar Items Returned by iLike with Keyword “Printed”

which is generated offline. The weight vectors of the query terms are combined to construct the query weight vector $\tilde{\omega}_Q$, which represents the user intention in the visual feature space. For each item in the initial set, we use its visual features to construct a base query $\tilde{q}_i$. We also obtain an expanded weight vector $\tilde{\omega}_E$ from its textual description. Therefore, given a query q, the new query corresponding to the ith item in the initial set is

$$ \tilde{q}'(\mathrm{Item}_i, \mathrm{Query}) = \tilde{q}_i \times (\alpha \cdot \tilde{\omega}_Q + \beta \cdot \tilde{\omega}_E), \qquad (4) $$

where $\times$ indicates component-wise multiplication. Practically, $\beta$ is set to a much smaller value than $\alpha$, to highlight the intention from the user. In the new query, features that are insignificant to the search terms carry very small values. Hence, the new query can be used to search the item database on the basis of cosine similarities (or Euclidean distances), without further enforcing the weights.
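A sketch of the expanded query in (4) and the subsequent ranking; the α/β defaults follow the values used in Section 5.1, and the function names are illustrative.

```python
import numpy as np

def expanded_query(q_i, w_query, w_desc, alpha=0.9, beta=0.1):
    """Eq. (4): q' = q_i x (alpha * w_Q + beta * w_E), component-wise."""
    return q_i * (alpha * w_query + beta * w_desc)

def rank_items(query, item_features, top_k=10):
    """Rank database items by cosine similarity to the expanded query."""
    sims = item_features @ query / (
        np.linalg.norm(item_features, axis=1) * np.linalg.norm(query) + 1e-12)
    return np.argsort(-sims)[:top_k]
```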


5 EXPERIMENTAL RESULTS

5.1 Settings
We use the database collected in Section 3.2 for evaluation. For a text query Q, iLike first retrieves an initial result set $\{I_i\}$, as introduced in Section 4.6. For each $I_i$, a reweighted query $\tilde{q}'_i$ is generated as in (4) ($\alpha = 0.9$, $\beta = 0.1$). The results are displayed in columns, with $I_i$ in the title row. We also use traditional CBIR as a baseline: it skips query expansion, and uses the original feature vector $\tilde{q}_i$ (extracted from $I_i$) as the query to find visually similar items from the database.

5.2 Search Examples
Fig. 9 shows an example of the iLike and baseline results for the query "printed." As shown in Fig. 9b, the iLike query highlights the features that are coherent with "printed," and fades out features that are insignificant to the term. The iLike results share some local texture features ("printed" patterns). Meanwhile, although the CBIR results are visually similar to the initial selection, they do not exhibit any relevance to "printed"; instead, color and shape features dominate the visual similarities. Fig. 10 shows more examples: for queries like "black boat shoes" and "yellow tote," the color features are identified as more important. We can see that iLike understands the intention behind the text terms, and is able to select relevant visual features that are consistent with human perceptions. Meanwhile, compared with text-based search, iLike significantly increases recall by yielding items that do not contain the query terms. Table 2 shows three groups of iLike results for the query "printed." Except for the initial set (retrieved by text search), there are only three items that contain "printed"


Fig. 11. (a) iLike search results for query “ruffle shirt”; (b) Baseline search results for query “ruffle shirt.”

in their titles or descriptions. All other items are only retrievable by visual features. Finally, Fig. 11 compares the iLike and baseline search results for the query "ruffle shirt." To sum up, we have observed that: 1) pure text-based retrieval will miss many relevant items that do not have the term in their descriptions; 2) if we only use visual features from the initial set ($\tilde{q}_i$), the results will drift away from the user intention. iLike is able to infer the implicit intention behind the queries, pick a smaller subset of visual features that are significant to such intention, and yield better results.

5.3 User-Based Evaluation
We have designed a user-based evaluation to compare iLike with baseline. First, 100 distinct Q in the form of "adjective+noun" (e.g., "pattern shirts") are randomly selected. The frequencies of the adjectives range from 69 to 3,561 (Fig. 12). Next, five items from the initial set of each Q are randomly selected as seed images ($\{I_1, \ldots, I_5\}$), to generate 500 queries in total. For each query, the top 10 results from iLike and baseline (they could overlap), together with 20 randomly selected items from the same category as $I_i$, are prepared for user evaluation. Thirty

participants from the University of Kansas and ETH Zurich were invited for the evaluation. All participants have experience with search engines and online shopping. For each query, the participant is provided with Q and the prepared items (permuted), and asked to mark the items that he/she determines to be relevant to Q. We received results for 76 Qs and 201 queries. Table 3 shows the statistics comparing iLike with the baseline. In the table, a true positive (TP) is a retrieved item (from iLike or CBIR) that is confirmed (marked as relevant) by the evaluator; a false positive is a retrieved item that is marked as irrelevant. Fig. 13 shows the average Precision-Recall curves of iLike and CBIR. In the evaluation, 1,821 items were marked as relevant, of which 1,161 are captured by iLike and 969 by the baseline.
TABLE 3 Overall Performance of iLike and Baseline

Fig. 12. Term frequencies of evaluated keywords.

Fig. 13. Average Precision-Recall curve.


Fig. 14. (a) Precision-rank curve, (b) recall-rank curve.

The overall precision and recall of iLike outperform CBIR by 20 percent. Figs. 14a and 14b show the average precision and recall rates. We then compare iLike with the baseline approach across different queries. Fig. 15 shows the R-Precision histograms for all the distinct queries. An R-Precision histogram presents the differences between the precision of iLike and the baseline at rank R. A positive bar means that iLike outperforms the baseline. We can see that iLike achieves better precision for the majority of queries. These results are in agreement with Fig. 14a. However, there are some queries where iLike performs worse than CBIR in both 5-precision and 10-precision, e.g., query 24 ("voile"), 30 ("crinkle"), 47 ("metallic"), 59 ("polka"), and so on. Those terms all have sophisticated "visual" meanings, which are difficult for both users and iLike to interpret.

Fig. 15. Precision histograms $RP_{A/B}(i) = RP_A(i) - RP_B(i)$; upper: R = 5, lower: R = 10.
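For reference, a minimal sketch of how the per-query precision differences plotted in Fig. 15 can be computed; the run and relevance-judgment data structures are assumptions.

```python
def precision_at_r(ranked_items, relevant, r=10):
    """Fraction of the top-r retrieved items judged relevant."""
    return sum(1 for item in ranked_items[:r] if item in relevant) / float(r)

def r_precision_differences(runs_a, runs_b, judgments, r=10):
    """Per-query RP_A(i) - RP_B(i); a positive value means system A wins on that query."""
    return [precision_at_r(runs_a[q], judgments[q], r) -
            precision_at_r(runs_b[q], judgments[q], r)
            for q in judgments]
```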

5.4 Discussions
The first prototype of iLike [1] collected 20K product items and extracted 263 image features. We would like to qualitatively compare with the conference report, in terms of performance and scalability issues:

Precision has improved, primarily due to the increased number of positive samples for each term, and the newly added visual features. We have also observed more false negatives in the negative sets, mostly caused by synonyms. The problem is handled by the weight vector optimization approach in Section 4.4.

Computation for preprocessing has increased. All preprocessing (block 2 in Fig. 1) is conducted offline; hence, computation is not a major concern. The computational complexity of text-based indexing is O(n log n), where n denotes the total number of terms. Computation for visual feature extraction highly depends on the feature, and increases linearly with the number of images. The K-S test can be calculated in linear time [57], and the computation for different features could be easily parallelized, if needed.

Computation for retrieving the initial set by text-based retrieval has slightly increased. It is known that the computation for text-based search in an inverted index is O(log n), where n denotes the size of the index.

Computation for querying with expanded queries is increased. To compute the pairwise similarity between the weighted query and every item, the computation would be O(n), where n denotes the total number of items. However,

computation could be improved by creating (offline) an index, based on the observation that generally similar items are more likely to be similar in the weighted space. For each item, we store a ranked list of "similar items" based on cosine similarities between the original feature vectors. In querying, we follow the index to compute the distances between the weighted query and the top items in the index.

Parallelization. Moreover, the most computationally expensive steps (feature extraction, the K-S test, and matching the weighted query against every item) could be easily parallelized to improve system performance.

Keyword "quality." In the evaluation, iLike demonstrates better performance for some keywords. From a linguistic perspective, not all keywords are equally meaningful in the visual feature space. For instance, most of the words in the sentence "Our pique polos have a special performance finish that helps them maintain color and shape, reducing shrinkage and pilling" mean nothing visually. By further looking into the weight vectors corresponding to the keywords in the user evaluation, we have two rough observations. First, iLike performs better for keywords with higher frequency: the weight vector is more reliable when we have more positive samples. Meanwhile, iLike performs better if 1) the weight vector appears to be "clean," i.e., a relatively small number of weight values are high and the others are very low, and 2) the high weight values are somewhat clustered, so that we can clearly observe some nonrandom visual meaning in the weight vectors.

We have compared iLike with conventional text-based retrieval and CBIR. There are other image retrieval approaches that we would like to discuss:

Automatic tagging. Textual tags could be automatically generated from visual features and exploited in text-based search. However, as a classification problem, automatic tagging requires well-tagged training sets with limited tags (e.g., [22], [58], [21]), or very large-scale training sets (e.g., [26]). In our application, it is expected that existing automatic tagging algorithms may not generate optimal


results, as none of the above conditions is satisfied. Moreover, searching on text tags does not allow weighted terms (e.g., 0.8-pattern and 0.2-stripe), nor does it yield "similar terms."

Two-stage hybrid approaches use text-based search to generate an initial set, and then utilize visual features to rerank images in the initial set [2], [6], [7], [3], [4]. Such methods get additional input through relevance feedback, and analyze visual features from the feedback images. Among these approaches, iLike is most similar to IntentSearch [3], [4], which tries to capture user intentions from the feedback images. In [3], [4], user intention is detected as one of several predefined categories (e.g., general objects, scene, people, portrait, etc.); therefore, the ranking algorithm employs an adaptive similarity measure that is more suitable for the intention category. In iLike, all users are intended for the same category ("apparel shopping"), so we further infer fine-grained (feature-level) user intentions from the text queries, and enforce such intentions in ranking. Due to the essential differences in the definition and utilization of user intentions, it would be difficult to compare such approaches with iLike in a fair way.

Feature space mapping approaches (e.g., [41]) map textual and visual feature spaces using transformations of the feature spaces or distance measurements. In this way, such a mapping becomes a bridge across the semantic gap. In iLike, we have observed that the feature space mapping and similarity measurements are not static; instead, they are query-specific (i.e., user-intention-specific). For instance, given a query image as in Fig. 9a, when the query is "printed," the images in Fig. 9b are more relevant; meanwhile, when the query is "white," the images in Fig. 9c become more relevant. That is, the feature space transformation is dynamic, so that different transformations need to be enforced for different queries (i.e., user intentions). By exploiting the implicit user intentions behind textual queries, iLike is capable of handling such dynamics.

Finally, other methods could be employed in place of the K-S test to derive semantic meanings of low-level visual features. Other supervised or unsupervised learning techniques could be exploited, such as neural networks [59], support vector machines [60], Bayesian methods [61], [62], [63], bootstrapping [64], and so on. However, traditional multiclass classification methods suffer from two distinct characteristics of image annotation: 1) overloading with a high number of categories (keywords), and 2) overlap between classes. Moreover, labels extracted from item descriptions in our data set are not as clean as manual tags. If we consider the training data for all the classes collectively, a multiclass classifier or a feature selection method based on it would not easily find decision boundaries or relevant features. On the other hand, in iLike, K-S based feature weighting is a linear transformation of the visual feature space w.r.t. each term in the dictionary. It provides an intuitive mapping from high-level concepts to low-level features with very low overhead. It is also possible to adapt state-of-the-art machine learning techniques in the weighting scheme; overfitting (caused by imperfect textual features), data skewness, and computational complexity would be the major issues to consider.

6 CONCLUSION AND DISCUSSIONS

In this paper, we present iLike, a vertical search engine for apparel shopping. We aim to integrate textual and visual features for better search performance. We have represented text terms in the visual feature space, and developed a text-guided weighting scheme for visual features. Such a weighting scheme infers user intention from query terms, and enhances the visual features that are significant toward such intention. Experimental results show that iLike is effective and capable of bridging the semantic gap. Through a comprehensive user study, iLike has demonstrated outstanding performance for a large number of descriptive terms. It does not work well for some keywords (mostly nonadjectives); many such words have abstract meanings and are unlikely to be included in queries (e.g., zip, logo). To sum up, by combining textual and visual features, iLike manages to pick "good" features that reflect users' perception, and is therefore effective for vertical search.

ACKNOWLEDGMENTS
This paper is a significantly extended version of a previous conference paper [1]. This work was supported in part by the University of Kansas General Research Fund (GRF 2301677), and in part by US National Science Foundation Grant OIA-1028098.

REFERENCES
[1] Y. Chen, N. Yu, B. Luo, and X.-w. Chen, "iLike: Integrating Visual and Textual Features for Vertical Search," Proc. ACM Int'l Conf. Multimedia, 2010.
[2] B. Luo, X. Wang, and X. Tang, "A World Wide Web Based Image Search Engine Using Text and Image Content Features," Proc. IS&T/SPIE, vol. 5018, pp. 123-130, 2003.
[3] J. Cui, F. Wen, and X. Tang, "Real Time Google and Live Image Search Re-Ranking," Proc. 16th ACM Int'l Conf. Multimedia, 2008.
[4] J. Cui, F. Wen, and X. Tang, "IntentSearch: Interactive On-Line Image Search Re-Ranking," Proc. 16th ACM Int'l Conf. Multimedia, 2008.
[5] X. Tang, K. Liu, J. Cui, F. Wen, and X. Wang, "IntentSearch: Capturing User Intention for One-Click Internet Image Search," IEEE Trans. Pattern Analysis Machine Intelligence, vol. 34, no. 7, pp. 1342-1353, July 2012.
[6] F. Jing, C. Wang, Y. Yao, K. Deng, L. Zhang, and W.-Y. Ma, "IGroup: Web Image Search Results Clustering," Proc. 14th ACM Int'l Conf. Multimedia, 2006.
[7] S. Wang, F. Jing, J. He, Q. Du, and L. Zhang, "IGroup: Presenting Web Image Search Results in Semantic Clusters," Proc. SIGCHI Conf. Human Factors in Computing Systems, 2007.
[8] X. Wang, K. Liu, and X. Tang, "Query-Specific Visual Semantic Spaces for Web Image Re-Ranking," Proc. IEEE Conf. Computer Vision Pattern Recognition (CVPR), June 2011.
[9] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-Based Image Retrieval at the End of the Early Years," IEEE Trans. Pattern Analysis Machine Intelligence, vol. 22, no. 12, pp. 1349-1380, Dec. 2000.
[10] M.S. Lew, N. Sebe, C. Djeraba, and R. Jain, "Content-Based Multimedia Information Retrieval: State of the Art and Challenges," ACM Trans. Multimedia Computing, Comm., and Applications, vol. 2, no. 1, pp. 1-19, 2006.
[11] R. Datta, D. Joshi, J. Li, and J.Z. Wang, "Image Retrieval: Ideas, Influences, and Trends of the New Age," ACM Computing Surveys, vol. 39, article 5, 2006.
[12] M. Stricker and M. Orengo, "Similarity of Color Images," Proc. SPIE, vol. 2420, pp. 381-392, 1995.


[13] R.M. Haralick, K. Shanmugam, and I. Dinstein, "Textural Features for Image Classification," IEEE Trans. Systems Man and Cybernetics, vol. SMC-3, no. 6, pp. 610-621, Nov. 1973.
[14] H. Tamura, S. Mori, and T. Yamawaki, "Textural Features Corresponding to Visual Perception," IEEE Trans. Systems Man and Cybernetics, vol. SMC-8, no. 6, pp. 460-473, June 1978.
[15] S.A. Dudani, K.J. Breeding, and R.B. McGhee, "Aircraft Identification by Moment Invariants," IEEE Trans. Computers, vol. C-26, no. 1, pp. 39-46, Jan. 1977.
[16] A. Vijay and M. Bhattacharya, "Content-Based Medical Image Retrieval Using the Generic Fourier Descriptor with Brightness," Proc. Int'l Conf. Machine Vision, 2009.
[17] W.-Y. Ma and H.-J. Zhang, "Content-Based Image Indexing and Retrieval," Handbook of Multimedia Computing, CRC Press, 1998.
[18] B. Manjunath, J.-R. Ohm, V. Vasudevan, and A. Yamada, "Color and Texture Descriptors," IEEE Trans. Circuits and Systems for Video Technology, vol. 11, no. 6, pp. 703-715, June 2001.
[19] S. Raimondo, S. Simone, C. Claudio, and C. Gianluigi, "Prosemantic Features for Content-Based Image Retrieval," Proc. Seventh Int'l Workshop Adaptive Multimedia Retrieval, 2009.
[20] J. Jeon, V. Lavrenko, and R. Manmatha, "Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models," Proc. ACM SIGIR Conf. Research and Development in Information Retrieval, 2003.
[21] G. Carneiro, A.B. Chan, P.J. Moreno, and N. Vasconcelos, "Supervised Learning of Semantic Classes for Image Annotation and Retrieval," IEEE Trans. Pattern Analysis Machine Intelligence, vol. 29, no. 3, pp. 394-410, Mar. 2007.
[22] J. Li and J.Z. Wang, "Real-Time Computerized Annotation of Pictures," IEEE Trans. Pattern Analysis Machine Intelligence, vol. 30, no. 6, pp. 985-1002, June 2008.
[23] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D.M. Blei, and M.I. Jordan, "Matching Words and Pictures," J. Machine Learning Research, vol. 3, pp. 1107-1135, Mar. 2003.
[24] X.-J. Wang, L. Zhang, F. Jing, and W.-Y. Ma, "AnnoSearch: Image Auto-Annotation by Search," Proc. IEEE CS Conf. Computer Vision Pattern Recognition (CVPR), 2006.
[25] L.S. Kennedy, S.-F. Chang, and I.V. Kozintsev, "To Search or to Label?: Predicting the Performance of Search-Based Automatic Image Classifiers," Proc. ACM Int'l Workshop Multimedia Information Retrieval (MIR), 2006.
[26] X. Li, L. Chen, L. Zhang, F. Lin, and W.-Y. Ma, "Image Annotation by Large-Scale Content-Based Image Retrieval," Proc. 14th Ann. ACM Int'l Conf. Multimedia, 2006.
[27] Z.-H. Zhou and H.-B. Dai, "Exploiting Image Contents in Web Search," Proc. 20th Int'l Joint Conf. Artificial Intelligence (IJCAI), 2007.
[28] H. Lieberman, E. Rozenweig, and P. Singh, "Aria: An Agent for Annotating and Retrieving Images," Computer, vol. 34, no. 7, pp. 57-62, July 2001.
[29] L. von Ahn and L. Dabbish, "Labeling Images with a Computer Game," Proc. ACM SIGCHI Conf. Human Factors in Computing Systems (CHI), 2004.
[30] L. Wu, L. Yang, N. Yu, and X.-S. Hua, "Learning to Tag," Proc. 18th Int'l Conf. World Wide Web (WWW), Apr. 2009.
[31] N. Sawant, R. Datta, J. Li, and J.Z. Wang, "Quest for Relevant Tags Using Local Interaction Networks and Visual Content," Proc. Int'l Conf. Multimedia Information Retrieval (MIR), 2010.
[32] Y.A. Aslandogan, C. Thier, C.T. Yu, J. Zou, and N. Rishe, "Using Semantic Contents and WordNet in Image Retrieval," Proc. ACM SIGIR Conf. Research and Development in Information Retrieval, 1997.
[33] H.T. Shen, B.C. Ooi, and K.-L. Tan, "Giving Meanings to WWW Images," Proc. ACM Eighth Int'l Conf. Multimedia (Multimedia '00), 2000.
[34] R. Lempel and A. Soffer, "PicASHOW: Pictorial Authority Search by Hyperlinks on the Web," Proc. Int'l Conf. World Wide Web (WWW), 2001.
[35] D. Cai, X. He, Z. Li, W.-Y. Ma, and J.-R. Wen, "Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Information," Proc. 12th ACM Int'l Conf. Multimedia, 2004.
[36] I. Kompatsiaris, E. Triantafyllou, and M. Strintzis, "A World Wide Web Region-Based Image Search Engine," Proc. 11th Int'l Conf. Image Analysis and Processing (ICIAP), 2001.
[37] C. Frankel, M.J. Swain, and V. Athitsos, "WebSeer: An Image Search Engine for the World Wide Web," technical report, 1996.
[38] S. Mukherjea, K. Hirata, and Y. Hara, "Amore: A World Wide Web Image Retrieval Engine," J. World Wide Web, vol. 2, no. 3, pp. 115-132, 1999.
[39] S. Sclaroff, L. Taycher, and M.L. Cascia, "ImageRover: A Content-Based Image Browser for the World Wide Web," Proc. IEEE Workshop Content-Based Access of Image and Video Libraries (CAIVL), 1997.
[40] Z. Chen, L. Wenyin, C. Hu, M. Li, and H.-J. Zhang, "iFind: A Web Image Search Engine," Proc. 24th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, 2001.
[41] C. Wang, L. Zhang, and H.-J. Zhang, "Learning to Reduce the Semantic Gap in Web Image Retrieval and Annotation," Proc. 31st Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, 2008.
[42] L. Zhang, L. Chen, F. Jing, K. Deng, and W.-Y. Ma, "EnjoyPhoto: A Vertical Image Search Engine for Enjoying High-Quality Photos," Proc. ACM Int'l Conf. Multimedia, 2006.
[43] L. Zhang, Y. Hu, M. Li, W. Ma, and H. Zhang, "Efficient Propagation for Face Annotation in Family Albums," Proc. 12th ACM Ann. Int'l Conf. Multimedia (Multimedia), 2004.
[44] J. Cui, F. Wen, R. Xiao, Y. Tian, and X. Tang, "EasyAlbum: An Interactive Photo Annotation System Based on Face Clustering and Re-Ranking," Proc. ACM SIGCHI Conf. Human Factors in Computing Systems (CHI), 2007.
[45] Z. Wang, Z. Chi, and D. Feng, "Fuzzy Integral for Leaf Image Retrieval," Proc. IEEE Int'l Conf. Fuzzy Systems (FUZZ), 2002.
[46] J.-X. Du, X.-F. Wang, and G.-J. Zhang, "Leaf Shape Based Plant Species Recognition," Applied Math. and Computation, vol. 185, pp. 883-893, 2007.
[47] K.-P. Yee, K. Swearingen, K. Li, and M. Hearst, "Faceted Metadata for Image Search and Browsing," Proc. ACM SIGCHI Conf. Human Factors in Computing Systems (CHI), 2003.
[48] P. Kovesi, "Image Features from Phase Congruency," J. Computer Vision Research, vol. 1, no. 3, 1999.
[49] M.R. Teague, "Image Analysis via the General Theory of Moments," J. Optical Soc. Am., vol. 70, no. 8, pp. 920-930, Aug. 1980.
[50] T. Deselaers, D. Keysers, and H. Ney, "Features for Image Retrieval - A Quantitative Comparison," Proc. DAGM Symp. Pattern Recognition, 2004.
[51] P. Kakumanu, S. Makrogiannis, and N. Bourbakis, "A Survey of Skin-Color Modeling and Detection Methods," Pattern Recognition, vol. 40, no. 3, pp. 1106-1122, 2007.
[52] W.J. Conover, Practical Nonparametric Statistics. John Wiley & Sons, Dec. 1998.
[53] T.-W. Chang, Y.-P. Huang, and F. Sandnes, "Efficient Entropy-Based Features Selection for Image Retrieval," Proc. IEEE Int'l Conf. Systems, Man and Cybernetics (SMC), pp. 2941-2946, Oct. 2009.
[54] A. Sohail, P. Bhattacharya, S. Mudur, and S. Krishnamurthy, "Selection of Optimal Texture Descriptors for Retrieving Ultrasound Medical Images," Proc. IEEE Int'l Symp. Biomedical Imaging (ISBI), 2011.
[55] M. Dash and H. Liu, "Handling Large Unsupervised Data via Dimensionality Reduction," Proc. ACM SIGMOD Workshop Research Issues in Data Mining (DMKD), 1999.
[56] S. Pal and B. Chakraborty, "Intraclass and Interclass Ambiguities (Fuzziness) in Feature Evaluation," Pattern Recognition Letters, vol. 2, no. 5, pp. 275-279, 1984.
[57] T. Gonzalez, S. Sahni, and W.R. Franta, "An Efficient Algorithm for the Kolmogorov-Smirnov and Lilliefors Tests," ACM Trans. Math. Software, vol. 3, no. 1, pp. 60-64, 1977.
[58] I.K. Sethi and I.L. Coman, "Mining Association Rules Between Low-Level Image Features and High-Level Concepts," Proc. SPIE, vol. 4384, pp. 279-290, 2001.
[59] C. Town and D. Sinclair, "Content Based Image Retrieval Using Semantic Visual Categories," technical report, 2001.
[60] L. Zhang, F. Lin, and B. Zhang, "Support Vector Machine Learning for Image Retrieval," Proc. IEEE Int'l Conf. Image Processing (ICIP), 2001.
[61] A. Vailaya, M.A.T. Figueiredo, A.K. Jain, and H.-J. Zhang, "Image Classification for Content-Based Indexing," IEEE Trans. Image Processing, vol. 10, no. 1, pp. 117-130, Jan. 2001.
[62] D. Cai, X. He, Z. Li, W.-Y. Ma, and J.-R. Wen, "Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Information," Proc. 12th ACM Int'l Conf. Multimedia, 2004.


[63] J. Luo and A. Savakis, "Indoor vs Outdoor Classification of Consumer Photographs Using Low-Level and Semantic Features," Proc. IEEE Int'l Conf. Image Processing (ICIP), vol. 2, pp. 745-748, Oct. 2001.
[64] H. Feng, R. Shi, and T.-S. Chua, "A Bootstrapping Framework for Annotating and Retrieving WWW Images," Proc. 12th ACM Int'l Conf. Multimedia, 2004.

Yuxin Chen received the BE degree from the University of Science and Technology of China, in 2009, the MS degree from the University of Kansas, in 2011, and is currently working toward the PhD degree in the Learning and Adaptive Systems Group at ETH Zurich. His research interests include active learning, adaptive optimization, information privacy and security, and web image retrieval. He is a student member of the IEEE.

Bo Luo received the BE degree from the University of Science and Technology of China in 2001, the MPhil degree from the Chinese University of Hong Kong in 2003, and the PhD degree from The Pennsylvania State University in 2008. He is currently an assistant professor with the Electrical Engineering and Computer Science Department at the University of Kansas. He is interested in information retrieval, information security, and privacy. He is a member of the IEEE Computer Society.

Xue-wen Chen received the PhD degree from Carnegie Mellon University in 2001. He is currently the Department Chair and a professor in the Computer Science Department at Wayne State University. Before joining Wayne State in 2012, he was a professor in the Electrical Engineering and Computer Science Department at the University of Kansas. He served as the conference chair for ACM CIKM 2012. He is the recipient of the US National Science Foundation (NSF) CAREER Award and has published more than 100 journal and conference papers. His research interests include machine learning, data mining, bioinformatics, and systems biology. He is a senior member of the IEEE and the IEEE Computer Society and a member of the ACM.

Hariprasad Sampathkumar is working toward the PhD degree in the Electrical Engineering and Computer Science Department at The University of Kansas. His research interests include machine learning and image retrieval. He is a student member of the IEEE.
