MEDINFO 2001, V. Patel et al. (Eds). Amsterdam: IOS Press. © 2001 IMIA. All rights reserved.

Building a Text Corpus for Representing the Variety of Medical Language
Pierre Zweigenbaum (a), Pierre Jacquemart (a), Natalia Grabar (a), Benoît Habert (b)
(a) DIAM — Service d’Informatique Médicale/DSI, Assistance Publique – Hôpitaux de Paris & Département de Biomathématiques, Université Paris 6
(b) LIMSI-CNRS & Université Paris 10

Abstract
Medical language processing has focused until recently on a few types of textual documents. However, a much larger variety of document types is used in different settings. It has been shown that Natural Language Processing (NLP) tools can exhibit very different behavior on different types of texts. Without better-informed knowledge about the differential performance of NLP tools on a variety of medical text types, it will be difficult to control the extension of their application to different medical documents. We endeavored to provide a basis for such informed assessment: the construction of a large corpus of medical text samples. We propose a framework for designing such a corpus: a set of descriptive dimensions and a standardized encoding of both meta-information (implementing these dimensions) and content. We present a proof-of-concept demonstration by encoding an initial corpus of text samples according to these principles.

Keywords: Natural language processing, text corpus, medical documents, French.

Introduction
Medical language processing has focused until recently on a few types of textual documents. Medical narratives, including discharge summaries and imaging reports, have been the most studied ones [1,2,3,4]. Short problem descriptions, such as signs, symptoms or diseases, have been the subject of much attention too, in relation to standardized vocabularies [5]. Some authors have also examined abstracts of scientific literature [6]. And indeed, web pages are today the most easily available source of medical documents. All these constitute different kinds of documents. They vary both in form and in content; it has even been shown that within a single document, subparts can consistently display very different language styles [7]. The natural language processing (NLP) tools that have been tailored for one document type may therefore be difficult to apply to another type [2]¹. This has consequences for the design and development, or simply for the use, of natural language processing tools for medical information processing. Without better-informed knowledge about the differential performance of natural language processing tools on a variety of medical text types, it will be difficult to control the extension of their application to different medical documents.

We propose here a basis for such informed assessment: the construction of a large corpus of medical text samples. We address this task for French, but we believe the same reasoning and methods, and part of the results, are applicable to other languages too. This text corpus must be useful for testing or training NLP tools. It must provide a variety of medical texts: diversity must be obtained in addition to mere volume, since our specific aim is to represent the many different facets of medical language. We need to characterize this diversity by describing it along appropriate dimensions: origin, genre, domain, etc. These dimensions have to be documented precisely for each text sample. This documentation must be encoded formally, as meta-information included with each document, so that sub-corpora can be extracted as needed to study relevant families of document types. Finally, text contents must also be encoded in a uniform way, independently of the many original formats of documents.

We present here a framework for designing a medical text corpus: a set of descriptive dimensions, inspired in part by previous relevant literature; a standardized encoding of both meta-information (implementing these dimensions) and content, using the TEI XML Corpus Encoding Standard [10]; and an initial set of text samples encoded according to these principles. This work takes place in the context of a larger corpus collection initiative, project CLEF², whose goal is to build a large corpus of French text samples and to distribute it widely to researchers.

¹ The precision of French taggers evaluated within the framework of GRACE [8], measured in relation to a manually tagged reference corpus, similarly shows significant variations depending on the part of the corpus under examination [9].
² www.biomath.jussieu.fr/CLEF/


Background
Nowadays, for “general” language, “mega-corpora” are available, such as the BNC (British National Corpus) [11]: 100 million words (about 1,000 medium-size novels), comprising 10 million words of transcribed spoken English as well as written language. This corpus provides a set of textual data whose production and reception conditions are precisely defined and which is representative of a great variety of communication situations.

The available medical corpora we are aware of are collections of abstracts of scientific literature, e.g., MEDIC, cited in [6]. Medical textbooks and scientific literature have been collected in project LECTICIEL [12] for French for Special Purposes learning. Users could add new texts to the database and compare them with the existing subcorpora. One medical corpus was specifically built for the purpose of linguistic study: MEDICOR [13]. Although its focus is on published texts (articles and books), with no clinical documents, it is an example of the kind of direction that we wish to take. The initial version of the corpus provides limited documentation about the features of each document (intended audience, genre and writer qualification), which is planned to be extended. Very large collections of medical texts indeed exist within hospital information systems, the DIOGENE system being among the earliest ones [14]. The issue here is that of privacy and therefore anonymization, to which we return below.

Beyond bibliographic description, descriptive dimensions for characterizing text corpora have been proposed by Sinclair [15] and Biber [16], among others. A related strand of work is that around the standardization of meta-information for documenting web pages [17]; but this covers more limited information than we shall need. In the medical informatics domain, the standardization efforts of bodies such as HL7 [18] and CEN [19] focus on clinical documents for information interchange: both their aim and coverage are different from ours.

The development of standards for the encoding of textual documents has been the subject of past initiatives in many domains (electronic publishing, aeronautics, etc.), using the SGML formalism, and now its XML subset. The Text Encoding Initiative was a major international effort to design an encoding standard for scholarly texts in the humanities and social sciences, including linguistics and natural language processing. It produced document type definitions (DTDs) and a Corpus Encoding Standard (CES) [10]. The CES DTD is therefore the natural format for encoding a corpus that is targeted at NLP tools.

Material and Methods
We explain in turn each of the main phases of the design of our corpus: (i) assessing document diversity and choosing dimensions to describe this diversity, i.e., a kind of multiaxial terminology for describing textual documents; (ii) implementing them in a standard XML DTD; then (iii) selecting the main classes of documents we want to represent and documenting them with these dimensions. We then explain how to populate the corpus with texts, and illustrate the method on currently integrated documents.

Studying and Representing Diversity
A large palette of medical textual documents is in use in different contexts. Our aim here is to identify the main kinds of medical texts that can be found in computerized form, and to characterize each of them by specifying values for a fixed set of orthogonal dimensions. Informants in a specific domain such as medicine have intuitions about the major relevant registers for the domain, even if they do have difficulties in establishing clear-cut borderlines. [20] relies on folk names of genres (to give a talk / a paper / an address / a lecture / a speech) as an important source of insight into the communicative characteristics of a given community. It has been shown [21] that, while there is no well-established genre palette for Internet materials, it is possible, through interviewing users of the Internet (students and teaching staff in computer science), to define genres that are both reasonably consistent with what users expect and conveniently computable using measures of stylistic variation. So the very first step consists in asking people from the domain which main communicative routines or speech acts they identify. We started from a series of prototypic contexts, and listed the types of texts related to these starting points: medical doctor (in hospital or in town), medical student, patient (consumer); patient care, research; published and unpublished documents.

It is now possible to restate more precisely what we mean by variety: a domain corpus should represent the main communicative acts of the domain. In our opinion, a corpus can only represent some limited subsets of the language, and not the whole of it. No corpus can contain every type of communicative language. In order to gather a corpus, one must explicitly choose the language use(s) one wants to focus on. The resulting variety is twofold: external and internal. External variety refers to the whole range of parameter settings involved in the creation of a document: document producer(s), document user(s), context of production or usage, mode of publication, etc. Internal variety: a communicative routine is often associated with consistent stylistic choices, that is, observable restrictions in the choice of linguistic items: lexical items, syntactic constructions, textual organization, such as the standard four-part organization of experimental studies: Introduction, Methods, Results, Discussion³. Besides, a given cluster of linguistic features can be shared between different communicative routines (for instance between discharge summaries and imaging reports).

³ Each of these parts was shown to have distinct linguistic features [7].

We listed in this way 57 different genres of medical texts. They include various reports (e.g., discharge, radiology), letters (e.g., discharge, referral), teaching material (e.g., lecture notes), publications (e.g., journals, books, articles), reference material (e.g., encyclopedias, classifications, directories), guidelines (e.g., recommendations, protocols), and official documents (e.g., the French Bulletin Officiel, code of deontology). These document types are difficult to classify into non-overlapping groups.


Therefore modelling the corpus with descriptive dimensions is all the more useful. To produce this set of dimensions, we first studied how the dimensions proposed in the literature covered differences in text types, and added to them as needed. Within the TEI standardization group, much attention has been devoted to the definition of headers [22]. A header is a normalized way of documenting electronic texts. It describes the electronic text and its source (bibliographic information, when available); it gives the encoding choices for the text (editorial rationales, sampling policy...), non-bibliographical information that characterizes the text, and a history of updates and changes. In the non-bibliographical part of the header, the text is described according to one or more standard classification schemes, which can mix both free indexes and controlled ones (such as standard subject thesauri in the relevant field). It is then possible to extract sub-corpora following arbitrarily complex constraints stated in these classification schemes. For instance, the interface to the BNC relies on such an approach [23] and makes it possible to restrict queries to sub-corpora (spoken vs written language / publication date / domain / fiction vs non-fiction... and any combination of these dimensions).

Implementing a Corpus Header
We checked whether our corpus model, with all its dimensions, could fit in the standard TEI XML CES model [10]. In the CES model, a corpus consists of a corpus header followed by a collection of documents, each of which is a pair of document header and text (Figure 1). The corpus header caters for documenting the corpus as a whole, whereas each document header contains meta-information for its text. We could find a mapping into the CES header for each dimension of our model, and therefore implemented it in the CES framework. An added advantage is that the CES model provides additional documentation dimensions, e.g., information about the corpus construction process (text conversion, normalization, annotation, etc.).

Giving a Shape to the Corpus: Document Sampling
Several parameters influence the overall contents of the corpus: we focus here on the types and sizes of documents that it will include. There is debate in the corpus linguistics community as to whether a corpus should consist of text extracts of constant size, as has been the case for many pioneering corpora, or of complete documents. The overall strategy of project CLEF is to opt for samples in the order of 2,000 words each. The expected benefits are a more manageable size and less trouble with property rights: it may be more acceptable for a publisher to give away extracts rather than full books or journals, so that text samples should be easier to obtain. The drawback is that textual phenomena with a larger span may not be studied on such samples. We thus plan to be flexible on sample size.

To initiate the construction of our corpus, we selected an initial subset of text types as target population for the corpus. As explained above, we tried to represent the main communicative acts of the domain.

The main text types we aim to represent initially include types from all the groups of genres listed above: hospital reports, letters (discharge), teaching material (tutorials), publications (book chapters, journal articles, dissertations), guidelines (recommendations) and official documents (code of deontology). We cautiously avoided over-representing web documents, which could bias corpus balance because of the immediate ease with which they can be obtained. An additional interesting family of genres would be transcribed speech; but the cost of transcription is too high for this to be feasible.
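The corpus organization described above under “Implementing a Corpus Header”, and depicted in Figure 1, can be sketched in a few lines. The element and attribute names below are simplified placeholders rather than the exact vocabulary of the xcesDoc DTD; the sketch only shows how categories declared once in the corpus header can be referenced from each document header.

    # Minimal sketch of the corpus organization: a corpus header holding a
    # taxonomy of categories, followed by documents that each pair a document
    # header (pointing back to those categories) with their text.
    # Element and attribute names are illustrative, not the exact xcesDoc.dtd set.
    import xml.etree.ElementTree as ET

    corpus = ET.Element("corpus")

    # Corpus header: declares the classification scheme shared by all documents.
    corpus_header = ET.SubElement(corpus, "corpusHeader")
    taxonomy = ET.SubElement(corpus_header, "taxonomy", id="genre")
    ET.SubElement(taxonomy, "category", id="genre.dischargeSummary").text = "discharge summary"
    ET.SubElement(taxonomy, "category", id="genre.teaching").text = "teaching material"

    # One document: its header refers to corpus-level categories, then comes the text.
    doc = ET.SubElement(corpus, "document")
    doc_header = ET.SubElement(doc, "docHeader")
    profile = ET.SubElement(doc_header, "profileDesc")
    ET.SubElement(profile, "catRef", target="genre.dischargeSummary")
    body = ET.SubElement(doc, "text")
    ET.SubElement(body, "p").text = "Patient admitted for coronary angiography ..."

    print(ET.tostring(corpus, encoding="unicode"))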

Figure 1: Overall corpus form: corpus header (upper rectangle), then documents, each containing a document header (lower, inner rectangle) and the actual text.

A generic documentation for each text type was prepared. The rationale for implementation is then to encode a document header template for each text type: this template contains prototypical information for texts of this type. This factorizes documentation work, so that the remaining work needed to derive suitable document headers for individual texts is kept to a minimum. Document templates were implemented for the text types included so far in the corpus.

Populating the Corpus with Document Instances
The addition of documents to the corpus comprises several steps. The documents must first be obtained. This raises issues of property. A standard contract has been established for the project with the help of the European Language Resources Agency (ELRA), by which document providers agree to the distribution of the texts for research purposes. For texts that describe patient data, a second issue is that of privacy. We consulted the French National Council for Informatics and Liberties (CNIL). They accepted that such texts be included provided that all proper names (persons and locations) and dates be masked. The contents of each document are then converted from their original form (HTML, Word) to XML format. Minimal structural markup is added: that corresponding to the TEI CES level 1 DTD. This includes paragraphs (marked automatically) and, optionally, sections.


The document header template for the appropriate document type is then instantiated. For series of similar samples (e.g., a series of discharge summaries), most of this instantiation can be performed automatically.
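A minimal sketch of this ingestion chain is given below, assuming simplified masking rules and an invented per-genre template; the project’s actual anonymization and conversion tools are more elaborate and are not reproduced here.

    # Sketch of the ingestion steps for one patient document: mask dates and
    # listed proper names, wrap paragraphs in <p> elements (level-1 markup),
    # and copy prototypical metadata from a per-genre header template.
    # The masking rules and template fields are hypothetical simplifications.
    import re
    import xml.etree.ElementTree as ET

    DISCHARGE_TEMPLATE = {"genre": "discharge summary", "language": "fr", "mode": "written"}

    def mask(text, proper_names):
        # Replace simple numeric dates and known proper names by placeholders.
        text = re.sub(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", "<DATE>", text)
        for name in proper_names:
            text = text.replace(name, "<NAME>")
        return text

    def to_level1_xml(raw_text, template, proper_names=()):
        doc = ET.Element("document")
        header = ET.SubElement(doc, "docHeader")
        for field, value in template.items():
            ET.SubElement(header, field).text = value   # instantiate the template
        body = ET.SubElement(doc, "text")
        # One <p> per blank-line-separated block; this is the part that can be
        # marked automatically.
        for para in re.split(r"\n\s*\n", mask(raw_text, proper_names)):
            if para.strip():
                ET.SubElement(body, "p").text = para.strip()
        return doc

    sample = "M. Dupont a été hospitalisé le 12/03/1999.\n\nSuites opératoires simples."
    print(ET.tostring(to_level1_xml(sample, DISCHARGE_TEMPLATE, ["Dupont"]), encoding="unicode"))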

Results
The main results in the current state of the project are (i) a model of document description (the dimensions), (ii) an implementation of this model and (iii) the inclusion of a series of documents in this implementation (the current corpus).

We settled on 30 dimensions, partly derived from [15], [7] and [17]. The two main groups of dimensions are “external”: bibliographic reference (e.g., title, author, date; size and localisation of sample) and context of production (e.g., institutional vs private, published or not, mode of production, of transmission, frequency of publication, source, destination). The dimensions of the last group are “internal”: level of language, distance from readership, personalization of the message, factuality, technicity, style. Allowed values are specified for each dimension. One of the dimensions is the domain of the text, here the medical specialties involved. We reused and slightly adapted the list of domains that help to index medical web sites in the CISMEF directory (www.cismef.org).

The implemented model fits as an instance of the XML CES DTD (xcesDoc.dtd) (www.cs.vassar.edu/XCES/). Bibliographical dimensions are explicitly modelled in that DTD within each document header. For dimensions pertaining to the context of production and for internal dimensions, “taxonomies” are defined in the corpus header: they consist of hierarchies of category descriptions. Each document in the corpus is characterized by a set of such categories: this is implemented by referring to these standard categories in the “profile description” section of that document’s header. Figure 2 shows a slice of the implemented corpus.

As a proof of concept, we integrated 374 documents in the corpus: 294 patient discharge summaries from 4 different sites and 2 different medical specialties (cardiology, from project Menelas [4], and haematology), 78 discharge letters, one chapter of a handbook on coronary angiography and one “conference of consensus” on post-operative pain. The total amounts to 143 kwords, with an average of 385 words per document. Many colleagues have kindly declared their intent to contribute documents, so that a few million words should be attainable.

The corpus can be manipulated through standard XML tools. We ran the Xerces Java XML library of the Apache XML project and James Clark’s XT library under Linux, Solaris and HP-UX. The corpus was checked for syntactic well-formedness (“conformance”) and adherence to the xcesDoc DTD (“validity”). We use XSL stylesheets to produce tailored summaries of the corpus contents and to extract sub-corpora.
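The two checks mentioned above (well-formedness and DTD validity) and the selection of sub-corpora by category can be reproduced with any standard XML toolkit; the sketch below uses the Python lxml library rather than the Java tools cited in the text, and the file name and category identifier are placeholders.

    # Sketch: check a corpus file for well-formedness and DTD validity, then
    # select the documents whose header refers to a given category.
    # "corpus.xml", the element names and the category id are placeholders;
    # the real corpus is constrained by the xcesDoc DTD.
    from lxml import etree

    # Parsing fails on ill-formed XML: this is the "conformance" check.
    tree = etree.parse("corpus.xml")

    # "Validity": the document additionally obeys the DTD.
    dtd = etree.DTD("xcesDoc.dtd")
    if not dtd.validate(tree):
        print(dtd.error_log.filter_from_errors())

    # Sub-corpus extraction: every document whose profile refers to the
    # hypothetical category id used in the earlier sketches.
    subcorpus = tree.xpath('//document[.//catRef[@target="genre.dischargeSummary"]]')
    print(len(subcorpus), "documents selected")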

Discussion
Adherence to an existing standard enabled us to implement our corpus model in a principled way with a very reasonable effort. Besides, the general move towards XML observed in recent years facilitates the conversion of existing documents and the subsequent manipulation of the corpus. A few lines of XSL instructions suffice to design extraction methods which are then executed in seconds on the whole corpus.

Adding new documents to the corpus and documenting them requires a varying amount of work depending on the type of document. Patient documents require the most attention because of anonymization. Their actual documentation also raises an issue: a precise documentation would re-introduce information on locations and dates, so that we must here sacrifice documentation for privacy.

A pre-specified model for document description is needed if a corpus is to be used by many different people. The dimensions of our model, implemented as taxonomic “categories”, will probably need some updating with the introduction of the other main types of documents. We expect however that they should quickly stabilize.

The XCES DTD was designed to cope with multilingualism, including for non-Western languages and scripts. It caters for language declarations at every level of granularity. This facilitates the extension of the corpus to multiple languages, or the parallel development of corpora for different languages based on a common model.
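As an illustration of such “a few lines of XSL”, the sketch below embeds a small stylesheet that keeps only the documents referring to one category and applies it with lxml rather than XT; the element and attribute names follow the simplified placeholders used in the earlier sketches, not the actual DTD.

    # Sketch: a short XSLT stylesheet (embedded as a string) that keeps the
    # corpus header and only the documents referring to one category.
    # Element and attribute names are the illustrative placeholders used above.
    from lxml import etree

    XSLT = b"""
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:param name="cat" select="'genre.dischargeSummary'"/>
      <xsl:template match="/corpus">
        <corpus>
          <xsl:copy-of select="corpusHeader"/>
          <xsl:copy-of select="document[.//catRef/@target = $cat]"/>
        </corpus>
      </xsl:template>
    </xsl:stylesheet>
    """

    transform = etree.XSLT(etree.XML(XSLT))
    subcorpus = transform(etree.parse("corpus.xml"))
    print(str(subcorpus)[:200])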

Conclusion and Perspectives
We have proposed a framework for designing a medical text corpus and a proof of concept implementation: a set of descriptive dimensions, a standardized encoding of both meta-information (implementing these dimensions) and content, and a “small”-size corpus of text samples encoded according to these principles. This corpus, once sufficiently extended, will be useful for testing and training NLP tools: taggers, checkers, term extractors, robust parsers, encoders, information retrieval engines, information extraction suites, etc. We plan to distribute it to Medical Informatics and NLP researchers. We believe that the availability of such a resource may be an incentive to attract more generalist NLP researchers to work on medical texts. The corpus will also allow more methodological, differential studies on the medical lexicon, terminology, grammar, etc.: e.g., terminological variation across genres within the same medical specialty, or the correlation of observed variation with documented dimensions, which should teach us more about the features of medical language.

Acknowledgments
We wish to thank the French Ministry for Higher Education, Research and Technology for supporting project CLEF, D Bourigault and P Paroubek of project CLEF’s management board for useful discussions, B Séroussi and J Bouaud for help about the document genres, and the many colleagues who agreed to contribute documents to the corpus.



References
[1] Sager N, Friedman C, and Lyman MS, eds. Medical Information Processing - Computer Management of Narrative Data. Addison Wesley, Reading, Mass, 1987.
[2] Friedman C. Towards a comprehensive medical natural language processing system: Methods and issues. J Am Med Inform Assoc 1997;4(suppl):595–9.
[3] Rassinoux AM. Extraction et Représentation de la Connaissance tirée de Textes Médicaux. Thèse de doctorat ès sciences, Université de Genève, 1994.
[4] Zweigenbaum P and Consortium MENELAS. MENELAS: an access system for medical records using natural language. Comput Methods Programs Biomed 1994;45:117–20.
[5] Tuttle M, Olson N, Keck K, et al. Metaphrase: an aid to the clinical conceptualization and formalization of patient problems in healthcare enterprises. Methods Inf Med November 1998;37(4-5):373–83.
[6] Grefenstette G. Explorations in Automatic Thesaurus Discovery. Kluwer, London, 1994.
[7] Biber D and Finegan E. Intra-textual variation within medical research articles. In: Oostdijk N and de Haan P, eds, Corpus-based research into language, number 12. Rodopi, Amsterdam, 1994:201–22.
[8] Adda G, Mariani J, Paroubek P, Rajman M, and Lecomte J. Métrique et premiers résultats de l’évaluation GRACE des étiqueteurs morphosyntaxiques pour le français. In: Amsili P, ed, Actes de TALN 1999, Cargèse, July 1999:15–24.
[9] Illouz G. Méta-étiqueteur adaptatif : vers une utilisation pragmatique des ressources linguistiques. In: Amsili P, ed, Actes de TALN 1999, Cargèse, July 1999:185–94.
[10] Ide N, Priest-Dorman G, and Véronis J. Corpus encoding standard. Document CES 1, MULTEXT/EAGLES, http://www.lpl.univ-aix.fr/projects/eagles/TR/, 1996.
[11] The British National Corpus. http://info.ox.ac.uk/bnc/, Oxford University Computing Services, 1995.
[12] Lehmann D, de Margerie C, and Pelfrêne A. Lecticiel – rétrospective 1992–1995. Technical report, CREDIF – ENS de Fontenay/Saint-Cloud, Saint-Cloud, 1995.
[13] Vihla M. Medicor: A corpus of contemporary American medical texts. ICAME Journal 1998:73–80.
[14] Scherrer JR, Lovis C, and Borst F. DIOGENE 2, a distributed information system with an emphasis on its medical information content. In: van Bemmel JH and McCray AT, eds, Yearbook of Medical Informatics 95. Schattauer, Stuttgart, 1996.
[15] Sinclair J. Preliminary recommendations on text typology. Technical report, EAGLES (Expert Advisory Group on Language Engineering Standards), June 1996.
[16] Biber D. Representativeness in corpus design. Linguistica Computazionale 1994;IX-X:377–408. Current Issues in Computational Linguistics: in honor of Don Walker.
[17] The Dublin Core element set version 1.1. http://purl.org/dc/documents/, Dublin Core Metadata Initiative, 1999.
[18] Dolin R, Alschuler L, Boyer S, and Beebe C. An update on HL7’s XML-based document representation standards. In: Proc AMIA Symp, 2000:190–4.
[19] Rossi Mori A and Consorti F. Structures of clinical information in patient records. In: Proc AMIA Symp, 1999:132–6.
[20] Wierzbicka A. A semantic metalanguage for a cross-cultural comparison of speech acts and speech genres. Language in Society 1985(14):491–514.
[21] Dewe J, Karlgren J, and Bretan I. Assembling a balanced corpus from the Internet. In: 11th Nordic Conference on Computational Linguistics, Copenhagen, 1998:100–7.
[22] Giordano R. The TEI header and the documentation of electronic texts. Comput Humanities 1995(29):75–85.
[23] Dunlop D. Practical considerations in the use of TEI headers in large corpora. Comput Humanities 1995(29):85–98.

Address for correspondence
Pierre Zweigenbaum, DIAM — SIM/DSI/AP-HP, 91, boulevard de l’Hôpital, 75634 Paris Cedex 13, France
pz@biomath.jussieu.fr, http://www.biomath.jussieu.fr/~pz/

Figure 2: A slice of the implemented corpus: the first lines of document 4 (viewed with Xerces TreeViewer).
