
# Data Analysis

Submitted By b4bhargav
Chapter 2

Getting Data into R

In this chapter we address entering data into R and organising it as scalars (single values), vectors, matrices, data frames, or lists. We also demonstrate importing data from Excel, ASCII files, databases, and other statistical programs.

2.1 First Steps in R

2.1.1 Typing in Small Datasets
We begin by working with an amount of data that is small enough to type into R. We use a dataset (unpublished data, Chris Elphick, University of Connecticut) containing seven body measurements taken from approximately 1100 saltmarsh sharp-tailed sparrows (Ammodramus caudacutus) (e.g., size of the head and wings, tarsus length, weight, etc.). For our purposes we use only four morphometric variables of eight birds (Table 2.1).
Table 2.1 Morphometric measurements of eight birds. The symbol NA stands for a missing value. The measured variables are the lengths of the wing (measured as the wing chord), leg (a standard measure of the tarsus), head (from the bill tip to the back of the skull), and weight.

| Wingcrd | Tarsus | Head | Wt |
|---------|--------|------|------|
| 59 | 22.3 | 31.2 | 9.5 |
| 55 | 19.7 | 30.4 | 13.8 |
| 53.5 | 20.8 | 30.6 | 14.8 |
| 55 | 20.3 | 30.3 | 15.2 |
| 52.5 | 20.8 | 30.3 | 15.5 |
| 57.5 | 21.5 | 30.8 | 15.6 |
| 53 | 20.6 | 32.5 | 15.6 |
| 55 | 21.5 | NA | 15.7 |

The simplest, albeit laborious, method of entering the data into R is to type it in as scalars (variables containing a single value). For the first five observations of wing length, we could type:
A.F. Zuur et al., A Beginner’s Guide to R, Use R, DOI 10.1007/978-0-387-93837-0_2, © Springer Science+Business Media, LLC 2009

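> a <- 59
> b <- 55
> c <- 53.5
> d <- 55
> e <- 52.5

Names such as a to e say nothing about what the values represent, so more descriptive names are preferable:

> Wing1 <- 59
> Wing2 <- 55
> Wing3 <- 53.5
> Wing4 <- 55
> Wing5 <- 52.5

Simple calculations can then be carried out on these variables:

> sqrt(Wing1)
> 2 * Wing1
> Wing1 + Wing2
> Wing1 + Wing2 + Wing3 + Wing4 + Wing5
> (Wing1 + Wing2 + Wing3 + Wing4 + Wing5) / 5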

Although R performs the calculations, it does not store the results. It is perhaps better to define new variables, for example:

> SQ.wing1 <- sqrt(Wing1)

Typing each observation as a separate scalar quickly becomes tedious. The c (concatenate) function stores all eight wing lengths from Table 2.1 in a single vector, Wingcrd, to which functions such as sum can be applied:

> Wingcrd <- c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55)
> sum(Wingcrd)
[1] 440.5

Obviously, we can also store the sum in a new variable:

> S.win <- sum(Wingcrd)
> S.win
[1] 440.5

Again, the dot is part of the variable name. Now, enter the data for the other three variables from Table 2.1 into R. It is laborious, but typing the following code into an editor, then copying and pasting it into R does the job.

> Tarsus <- c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5)
> Head <- c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, NA)
> Wt <- c(9.5, 13.8, 14.8, 15.2, 15.5, 15.6, 15.6, 15.7)

The head measurement of the eighth bird is missing and is entered as NA. Summing the head values gives:

> sum(Head)
[1] NA

You will get the same result with the mean, min, max, and many other functions. To understand why we get NA for the sum of the head values, type ?sum. The following is relevant text from the sum help file.

...
sum(..., na.rm = FALSE)
...
If na.rm is FALSE, an NA value in any of the arguments will cause a value of NA to be returned, otherwise NA values are ignored.
...

The default option "na.rm = FALSE" causes the R function sum to return an NA if there is a missing value in the vector (rm refers to remove). To avoid this, use "na.rm = TRUE":

> sum(Head, na.rm = TRUE)
[1] 216.1

Now, the sum of the seven values is returned. The same can be done for the mean, min, max, and median functions. On most computers, you can also use na.rm = T instead of na.rm = TRUE. However, because we have been confronted with classroom PCs running identical R versions on the same operating system, and a few computers give an error message with the na.rm = T option, we advise using na.rm = TRUE.

You should always read the help file for any function before use to ensure that you know how it deals with missing values. Some functions use na.rm, some use na.action, and yet others use a different syntax. It is nearly impossible to memorise how all functions treat missing values.
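The missing-value behaviour described above can be tried out with a short sketch. It uses only the Head values from Table 2.1 and base R functions; the name res_T and the use of local() are illustrative choices, not part of the original text. The last part shows one possible reason why na.rm = T can misbehave: T is an ordinary variable that can be reassigned, whereas TRUE is a reserved word. This is offered as a plausible explanation, not as the confirmed cause of the classroom errors mentioned above.

```r
# Head measurements from Table 2.1; the eighth bird has a missing value
Head <- c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, NA)

sum(Head)                  # NA, because na.rm defaults to FALSE
sum(Head, na.rm = TRUE)    # 216.1, the sum of the seven observed values
mean(Head, na.rm = TRUE)   # mean of the seven observed values
range(Head, na.rm = TRUE)  # 30.3 32.5 (minimum and maximum)
sum(is.na(Head))           # 1, the number of missing values

# T is an ordinary variable; if it has been reassigned, na.rm = T
# no longer means na.rm = TRUE. local() keeps the reassignment
# from leaking into the rest of the session.
res_T <- local({
  T <- FALSE
  sum(Head, na.rm = T)
})
res_T                      # NA, because na.rm was effectively FALSE
```

Counting the missing values with sum(is.na(Head)) before summarising is a cheap way to see how many observations a result with na.rm = TRUE is actually based on.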
Summarising, we have entered data for four variables, and have applied simple functions such as mean, min, max, and so on. We now discuss methods of combining the data of these four variables: (1) the c, cbind, and rbind functions; (2) the matrix and vector functions; (3) data frames; and (4) lists. Do Exercise 1 in Section 2.4 on the use of the c and sum functions.

2.1.3 Combining Variables with the c, cbind, and rbind Functions
We have four columns of data, each containing observations of eight birds. The variables are labelled Wingcrd, Tarsus, Head, and Wt. The c function was used to concatenate the eight values. In the same way as the eight values were concatenated, so can we concatenate the variables containing the values using:

> BirdData <- c(Wingcrd, Tarsus, Head, Wt)
> BirdData
 [1] 59.0 55.0 53.5 55.0 52.5 57.5 53.0 55.0 22.3
[10] 19.7 20.8 20.3 20.8 21.5 20.6 21.5 31.2 30.4
[19] 30.6 30.3 30.3 30.8 32.5   NA  9.5 13.8 14.8
[28] 15.2 15.5 15.6 15.6 15.7

BirdData is a single vector of length 32 (4 × 8). The numbers [1], [10], [19], and [28] are the index numbers of the first element on a new line. On your computer they may be different due to a different screen size. There is no need to pay any attention to these numbers yet.

R produces all 32 observations, including the missing value, as a single vector, because it does not distinguish values of the different variables (the first 8 observations are of the variable Wingcrd, the second 8 from Tarsus, etc.). To counteract this we can make a vector of length 32, call it Id (for "identity"), and give it the following values.

> Id <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4)
> Id
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[24] 3 4 4 4 4 4 4 4 4

Because R can now put more digits on a line, as compared to in BirdData, only the indices [1] and [24] are produced. These indices are completely irrelevant for the moment. The variable Id can be used to indicate that all observations with the same Id value belong to the same morphometric variable. However, creating such a vector is time consuming for larger datasets, and, fortunately, R has functions to simplify this process. What we need is a function that repeats the values 1 to 4, each eight times:

> Id <- rep(c(1, 2, 3, 4), each = 8)
> Id
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[24] 3 4 4 4 4 4 4 4 4

This produces the same long vector of numbers as above. The rep designation stands for repeat. The command can be further simplified by using:

> Id <- rep(1 : 4, each = 8)
> Id
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[24] 3 4 4 4 4 4 4 4 4

Again, we get the same result. To see what the 1 : 4 command does, type into R:

> 1 : 4

It gives

[1] 1 2 3 4

So the : operator does not indicate division (as is the case with some other packages). You can also use the seq function for this purpose. For example, the command

> a <- seq(from = 1, to = 4, by = 1)
> a

creates the same sequence from 1 to 4,

[1] 1 2 3 4

So for the bird data, we could also use:

> a <- seq(from = 1, to = 4, by = 1)
> rep(a, each = 8)
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[24] 3 4 4 4 4 4 4 4 4

Each of the digits in "a" is repeated eight times by the rep function. At this stage you may well be of the opinion that in considering so many different options we are making things needlessly complicated. However, some functions in R need the data as presented in Table 2.1 (e.g., the multivariate analysis functions for principal component analysis or multidimensional scaling), whereas the organisation of data into a single long vector, with an extra variable to identify the groups of observations (Id in this case), is needed for other functions such as the t-test, one-way ANOVA, linear regression, and also for some graphing tools such as the xyplot in the lattice package (see Chapter 8). Therefore, fluency with the rep function can save a lot of time.
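As a rough illustration of the points above, the following sketch checks that the three ways of building the grouping vector agree, contrasts each with times, and then uses the long vector together with its grouping vector. The names Id_a, Id_b, Id_c, and group_means are hypothetical, and the use of tapply for per-group means is an assumption added here for illustration; the original text does not use it at this point.

```r
# Three ways to build the same grouping vector of length 32
Id_a <- rep(c(1, 2, 3, 4), each = 8)
Id_b <- rep(1:4, each = 8)
Id_c <- rep(seq(from = 1, to = 4, by = 1), each = 8)
all(Id_a == Id_b)    # TRUE
all(Id_b == Id_c)    # TRUE
length(Id_b)         # 32

# each repeats every element in turn; times repeats the whole vector
rep(1:4, each = 2)   # 1 1 2 2 3 3 4 4
rep(1:4, times = 2)  # 1 2 3 4 1 2 3 4

# Long format: the values from Table 2.1 in one vector
Wingcrd <- c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55)
Tarsus  <- c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5)
Head    <- c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, NA)
Wt      <- c(9.5, 13.8, 14.8, 15.2, 15.5, 15.6, 15.6, 15.7)
BirdData <- c(Wingcrd, Tarsus, Head, Wt)

# Mean per morphometric variable, using the grouping vector;
# na.rm = TRUE is passed on to mean because Head contains an NA
group_means <- tapply(BirdData, Id_b, mean, na.rm = TRUE)
group_means
```

The comparisons use all(x == y) rather than identical(), because 1:4 is stored as integers while c(1, 2, 3, 4) is stored as doubles; the values are equal even though the storage types differ.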

So far, we have only concatenated numbers. But suppose we want to create a vector "Id" of length 32 that contains the word "Wingcrd" 8 times, the word "Tarsus" 8 times, and so on. We can create a new variable called VarNames, containing the four morphometric variable designations. Once we have created it, we use the rep function to create the requested vector:

> VarNames <- c("Wingcrd", "Tarsus", "Head", "Wt")
> VarNames
[1] "Wingcrd" "Tarsus" "Head" "Wt"

Note that these are names, not the variables with the data values. Finally, we need:

> Id2 <- rep(VarNames, each = 8)
> Id2
 [1] "Wingcrd" "Wingcrd" "Wingcrd" "Wingcrd"
 [5] "Wingcrd" "Wingcrd" "Wingcrd" "Wingcrd"
 [9] "Tarsus"  "Tarsus"  "Tarsus"  "Tarsus"
[13] "Tarsus"  "Tarsus"  "Tarsus"  "Tarsus"
[17] "Head"    "Head"    "Head"    "Head"
[21] "Head"    "Head"    "Head"    "Head"
[25] "Wt"      "Wt"      "Wt"      "Wt"
[29] "Wt"      "Wt"      "Wt"      "Wt"

Id2 is a string of characters with the names in the requested order. The difference between Id and Id2 is just a matter of labelling. Note that you should not forget the "each =" notation. To see what happens if it is omitted, try typing:

> rep(VarNames, 8)
 [1] "Wingcrd" "Tarsus"  "Head"    "Wt"
 [5] "Wingcrd" "Tarsus"  "Head"    "Wt"
 [9] "Wingcrd" "Tarsus"  "Head"    "Wt"
[13] "Wingcrd" "Tarsus"  "Head"    "Wt"
[17] "Wingcrd" "Tarsus"  "Head"    "Wt"
[21] "Wingcrd" "Tarsus"  "Head"    "Wt"
[25] "Wingcrd" "Tarsus"  "Head"    "Wt"
[29] "Wingcrd" "Tarsus"  "Head"    "Wt"

It will produce a repetition of the entire vector VarNames with the four variable names listed eight times, not what we want in this case. The c function is a way of combining data or variables. Another option is the cbind function. It combines the variables in such a way that the output contains the original variables in columns. For example, the output of the cbind function below is stored in Z. If we type Z and press enter, it shows the values in columns:

> Z <- cbind(Wingcrd, Tarsus, Head, Wt)
> Z
     Wingcrd Tarsus Head   Wt
[1,]    59.0   22.3 31.2  9.5
[2,]    55.0   19.7 30.4 13.8
[3,]    53.5   20.8 30.6 14.8
[4,]    55.0   20.3 30.3 15.2
[5,]    52.5   20.8 30.3 15.5
[6,]    57.5   21.5 30.8 15.6
[7,]    53.0   20.6 32.5 15.6
[8,]    55.0   21.5   NA 15.7

The data must be in this format if we are to apply, for example, principal component analysis. Suppose you want to access some elements of Z, for instance, the data in the first column. This is done with the command Z[, 1]:

> Z[, 1]
[1] 59.0 55.0 53.5 55.0 52.5 57.5 53.0 55.0

Alternatively, use

> Z[1 : 8, 1]
[1] 59.0 55.0 53.5 55.0 52.5 57.5 53.0 55.0

It gives the same result. The second row is given by Z[2, ]:

> Z[2, ]
Wingcrd  Tarsus    Head      Wt
   55.0    19.7    30.4    13.8

Alternatively, you can use:

> Z[2, 1:4]
Wingcrd  Tarsus    Head      Wt
   55.0    19.7    30.4    13.8

The following commands are all valid.

> Z[1, 1]
> Z[, 2 : 3]
...
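The column binding and indexing steps above can be collected into one runnable sketch. It uses the Table 2.1 values and base R only; indexing by column name and the rbind call (with the hypothetical name Z2) are additions for illustration, chosen because rbind is named in the section heading but not shown in the text above.

```r
# Data from Table 2.1
Wingcrd <- c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55)
Tarsus  <- c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5)
Head    <- c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, NA)
Wt      <- c(9.5, 13.8, 14.8, 15.2, 15.5, 15.6, 15.6, 15.7)

# cbind places each variable in its own column
Z <- cbind(Wingcrd, Tarsus, Head, Wt)
dim(Z)        # 8 4 (rows, columns)
colnames(Z)   # "Wingcrd" "Tarsus" "Head" "Wt"

Z[, 1]        # first column: all eight wing lengths
Z[2, ]        # second row: the four measurements of bird 2
Z[1, 1]       # a single element, the value 59
Z[, 2:3]      # columns 2 and 3 (Tarsus and Head) as an 8 x 2 matrix
Z[, "Head"]   # a column can also be selected by name

# rbind places each variable in its own row instead
Z2 <- rbind(Wingcrd, Tarsus, Head, Wt)
dim(Z2)       # 4 8
```

Leaving an index empty, as in Z[, 1] or Z[2, ], selects everything along that dimension, which is why Z[, 1] and Z[1:8, 1] return the same values.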
