CAP 5768 Data Visualization with R Project
Preliminary instructions
All analyses must be performed in R using the tidyverse and glmnet packages discussed in class. Fill in all your solutions in the appropriate spaces provided in this Word document, and then upload a PDF copy of your solutions to Canvas. Only PDF copies will be graded.
Brief overview of assignment
In this assignment you will be using the dataset GlobalAncestry.csv, which is available on Canvas. You will be analyzing genetic data from 242 humans sampled across the world from six ancestries. The first column of the dataset, labeled ancestry, takes the following values:
African          San and Yoruban individuals from sub-Saharan Africa
European         Italian and Russian individuals from Europe
EastAsian        Chinese and Japanese individuals from East Asia
Oceanian         Melanesian and Papuan individuals from Oceania
NativeAmerican   Pima and Mayan individuals from the Americas
Mexican          Mexican individuals from the Americas
Unknown1         Unknown ancestry
Unknown2         Unknown ancestry
Unknown3         Unknown ancestry
Unknown4         Unknown ancestry
Unknown5         Unknown ancestry
GlobalAncestry.csv is a large dataset with genetic data for 242 individuals at 8916 genomic locations. As we discussed in our introductory lecture for this course, each individual will have a value of 0, 1, or 2 at each of these genomic locations, indicating the “genotype” that the individual has at this location.
Training a lasso penalized multinomial regression classifier
The goal is to train a multinomial regression classifier to predict K=5 ancestries (African, European, EastAsian, Oceanian, and NativeAmerican). The training dataset will consist only of individuals with these five ancestries, and the best classifier will be determined by lasso-penalized multinomial regression with 10-fold cross-validation. You will consider 500 values of the tuning parameter λ, evenly spaced on a base-10 logarithmic scale between 0.001 and 1000, as we have highlighted several times in class. You will then choose the simplest classifier whose cross-validation error is within 1 standard error of that of the best classifier.
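As a rough illustration of the tuning procedure described above, the following sketch builds the 500-value λ grid and runs 10-fold cross-validation with cv.glmnet(). It uses small simulated genotypes rather than GlobalAncestry.csv: the class labels, sample sizes, number of loci, and allele frequencies are all made up for the example, and the default cross-validation measure (deviance) is an assumption, since the measure is not specified here.

```r
library(glmnet)

set.seed(1)

# Hypothetical stand-in for the training data (NOT the real dataset):
# 3 made-up classes, 50 individuals each, 40 loci coded 0/1/2.
n_per_class <- 50
n_loci <- 40
classes <- c("A", "B", "C")
y <- factor(rep(classes, each = n_per_class))

# Each class gets its own allele frequency per locus;
# a genotype is then a Binomial(2, frequency) draw.
freqs <- matrix(runif(length(classes) * n_loci, 0.1, 0.9),
                nrow = length(classes))
x <- t(sapply(as.integer(y),
              function(k) rbinom(n_loci, size = 2, prob = freqs[k, ])))

# 500 lambda values between 10^-3 and 10^3, evenly spaced on a log10 scale.
# Supplied in decreasing order, which is the order glmnet fits them in.
lambda_grid <- 10^seq(3, -3, length.out = 500)

# Lasso penalty (alpha = 1), multinomial response, 10-fold cross-validation.
# The factor y is passed as is, with no numeric recoding.
cv_fit <- cv.glmnet(x, y,
                    family = "multinomial",
                    alpha = 1,
                    lambda = lambda_grid,
                    nfolds = 10)

cv_fit$lambda.min  # lambda with the lowest cross-validation error
cv_fit$lambda.1se  # largest lambda within 1 standard error of that minimum
```

Because a larger λ shrinks more coefficients to zero, lambda.1se corresponds to the simplest classifier within 1 standard error of the best one.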
Predicting ancestry of individuals with unknown ancestry
You will then use this classifier to predict the ancestries of the five unknown individuals (Unknown1, Unknown2, Unknown3, Unknown4, and Unknown5) based on their genetics.
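One possible way to organize this step is sketched below on a simulated toy table laid out like the dataset described earlier (an ancestry column followed by 0/1/2 genotype columns). The population names PopA, PopB, and PopC, the locus column names, and all sizes are hypothetical. For brevity the sketch lets glmnet choose its own λ sequence instead of the 500-value grid.

```r
library(tidyverse)
library(glmnet)

set.seed(2)

# Hypothetical toy data: 3 made-up populations of 40 individuals each,
# plus 2 rows labeled as unknown, genotyped at 30 loci.
n_loci <- 30
labels <- c(rep(c("PopA", "PopB", "PopC"), each = 40), "Unknown1", "Unknown2")
sim_pop <- c(rep(1:3, each = 40), 1, 3)  # population each row is simulated from
freqs <- matrix(runif(3 * n_loci, 0.05, 0.95), nrow = 3)
geno <- t(sapply(sim_pop,
                 function(k) rbinom(n_loci, size = 2, prob = freqs[k, ])))
colnames(geno) <- paste0("locus", seq_len(n_loci))
toy <- bind_cols(tibble(ancestry = labels), as_tibble(geno))

# Split rows into labeled training individuals and unknown individuals.
train   <- toy %>% filter(!str_detect(ancestry, "^Unknown"))
unknown <- toy %>% filter(str_detect(ancestry, "^Unknown"))

# glmnet expects a numeric matrix of predictors; the response stays a factor.
x_train   <- train %>% select(-ancestry) %>% as.matrix()
y_train   <- factor(train$ancestry)
x_unknown <- unknown %>% select(-ancestry) %>% as.matrix()

cv_fit <- cv.glmnet(x_train, y_train,
                    family = "multinomial", alpha = 1, nfolds = 10)

# Predicted labels and class probabilities at the 1-standard-error lambda.
pred_class <- predict(cv_fit, newx = x_unknown,
                      s = "lambda.1se", type = "class")
# type = "response" returns an individuals x classes x lambda array;
# [, , 1] keeps the single lambda requested.
pred_prob <- predict(cv_fit, newx = x_unknown,
                     s = "lambda.1se", type = "response")[, , 1]

tibble(individual = unknown$ancestry, predicted = as.vector(pred_class))
round(pred_prob, 3)
```

Inspecting the class probabilities alongside the predicted labels shows how confident the classifier is in each assignment.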
Predicting ancestry proportions of individuals with Mexican ancestry
You will also use predicted class probabilities to estimate the fraction of ancestry that each individual of Mexican descent has from each of the five continental ancestries used to train the classifier. You will then use violin plots to visualize the distributions of these probabilities across the set of individuals of Mexican ancestry, and hypothesize about the historical reasons for the ancestry distributions you observe.
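The plotting step could look roughly like the sketch below. The probability matrix here is a randomly generated placeholder, not output from any fitted classifier, and the number of individuals (20) is arbitrary. In practice the matrix would come from the classifier's predicted class probabilities, with one row per individual and one column per training ancestry.

```r
library(tidyverse)

set.seed(3)

# Hypothetical placeholder probabilities (random, for layout only):
# one row per individual, one column per training ancestry.
ancestries <- c("African", "European", "EastAsian", "Oceanian", "NativeAmerican")
n_ind <- 20
raw <- matrix(rexp(n_ind * length(ancestries)),
              nrow = n_ind,
              dimnames = list(NULL, ancestries))
probs <- raw / rowSums(raw)  # normalize so each row sums to 1

# Reshape to long format: one row per (individual, ancestry) pair.
probs_long <- as_tibble(probs) %>%
  mutate(individual = row_number()) %>%
  pivot_longer(-individual,
               names_to = "ancestry",
               values_to = "probability")

# One violin per ancestry, showing how that ancestry's predicted
# probability is distributed across individuals.
violin_plot <- ggplot(probs_long, aes(x = ancestry, y = probability)) +
  geom_violin() +
  labs(x = "Training ancestry", y = "Predicted class probability")

print(violin_plot)
```

The long format matters here because ggplot2 maps one column to the x-axis grouping and another to the plotted values.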
Instructions for loading GlobalAncestry dataset into your RStudio Cloud environment
Recall that to upload a file to RStudio Cloud, you first must download the GlobalAncestry.csv file to your computer. Once the file is downloaded, within the “Files” panel of the RStudio Cloud environment, click “Upload” and browse to the appropriate directory on your computer to upload the GlobalAncestry.csv file.
The GlobalAncestry.csv file can be loaded using the read_csv() function of the readr package that comes loaded with tidyverse, and assigned to an object called GlobalAncestry as
library(tidyverse)
GlobalAncestry <- read_csv("GlobalAncestry.csv")
If you are having trouble loading the file, then refer back to the video lecture on Linear Regression where this was demonstrated in class.
Note about using glmnet for classification
When using glmnet, you will not need to recode classes as values 1, 2, 3, etc. We only performed this recoding in class to illustrate the connection with using linear regression applied to a response with values 0 and 1, as linear regression requires a quantitative response. Therefore, do not recode the ancestry values in the dataset, and simply use the values as is.
Questions and problems