CAP 5768 Data Visualization with R Project



Preliminary instructions

All analyses must be performed in R using the tidyverse and glmnet packages discussed in class. Fill in all your solutions in the appropriate spaces provided in this Word document, and then upload a PDF copy of your solutions to Canvas. Only PDF copies will be graded.

Brief overview of assignment

In this assignment you will use the dataset GlobalAncestry.csv, which is available on Canvas. You will analyze genetic data from 242 humans sampled across the world from six ancestries. The first column of the dataset, labeled ancestry, takes the following values:

African           San and Yoruban individuals from sub-Saharan Africa
European          Italian and Russian individuals from Europe
EastAsian         Chinese and Japanese individuals from East Asia
Oceanian          Melanesian and Papuan individuals from Oceania
NativeAmerican    Pima and Mayan individuals from the Americas
Mexican           Mexican individuals from the Americas
Unknown1          Unknown ancestry
Unknown2          Unknown ancestry
Unknown3          Unknown ancestry
Unknown4          Unknown ancestry
Unknown5          Unknown ancestry

GlobalAncestry.csv is a large dataset with genetic data for 242 individuals at 8916 genomic locations. As we discussed in our introductory lecture for this course, each individual has a value of 0, 1, or 2 at each of these genomic locations, indicating the “genotype” that the individual has at that location.

Training a lasso penalized multinomial regression classifier

The goal is to train a multinomial regression classifier to predict K=5 ancestries (African, European, EastAsian, Oceanian, and NativeAmerican). The training dataset will consist only of individuals with these five ancestries, and the best classifier will be determined by lasso-penalized multinomial regression with 10-fold cross-validation. You will consider 500 values of the tuning parameter λ, spaced evenly on a base-10 logarithmic scale between 0.001 and 1000, as we have highlighted several times in class. You will then choose the simplest classifier whose cross-validation error is within 1 standard error of that of the best classifier.
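The tuning-parameter grid and the one-standard-error rule described above can be sketched as follows. This is a minimal illustration on simulated genotypes, not the assignment solution: the sample sizes, the number of sites, and the way the informative sites are generated are all assumptions made so that the example runs without GlobalAncestry.csv.

```r
library(glmnet)

set.seed(1)  # reproducible simulation and fold assignment

# Simulated stand-in for the real data (hypothetical sizes: 200 people, 20 sites)
ancestries <- c("African", "European", "EastAsian", "Oceanian", "NativeAmerican")
n_per_class <- 40
p <- 20
y <- factor(rep(ancestries, each = n_per_class))
x <- matrix(rbinom(length(y) * p, size = 2, prob = 0.3),
            nrow = length(y), ncol = p)

# Make five sites informative: each class carries more copies at one site
for (k in seq_along(levels(y))) {
  rows <- as.integer(y) == k
  x[rows, k] <- rbinom(sum(rows), size = 2, prob = 0.8)
}

# 500 lambda values from 10^3 down to 10^-3, evenly spaced on the log10 scale
lambda_grid <- 10^seq(3, -3, length.out = 500)

# alpha = 1 selects the lasso penalty; nfolds = 10 gives 10-fold cross-validation
cv_fit <- cv.glmnet(x, y, family = "multinomial", alpha = 1,
                    lambda = lambda_grid, nfolds = 10)

cv_fit$lambda.min  # lambda with the lowest cross-validated error
cv_fit$lambda.1se  # largest lambda within one standard error of that minimum
```

Because a larger λ shrinks more coefficients to zero, lambda.1se corresponds to the simplest classifier within one standard error of the best one. With the real data, x would be the matrix of 8916 genotype columns and y the ancestry column of the training data frame.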

Predicting ancestry of individuals with unknown ancestry

You will then use this classifier to predict the ancestries of the five unknown individuals (Unknown1, Unknown2, Unknown3, Unknown4, and Unknown5) based on their genetics.
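As a rough sketch of this prediction step, the example below trains a classifier on simulated genotypes and predicts the class of five held-out individuals. The helper simulate_genotypes() and all sizes are hypothetical and exist only to make the example self-contained; with the real data, newx would instead be something like as.matrix(select(test, -ancestry)), assuming ancestry is the only non-genotype column.

```r
library(glmnet)

set.seed(2)

ancestries <- c("African", "European", "EastAsian", "Oceanian", "NativeAmerican")
p <- 20  # hypothetical number of genomic locations

# Hypothetical helper: simulate genotypes (0/1/2) in which the k-th ancestry
# carries extra copies at site k
simulate_genotypes <- function(labels) {
  x <- matrix(rbinom(length(labels) * p, size = 2, prob = 0.3),
              nrow = length(labels), ncol = p)
  site <- match(labels, ancestries)
  x[cbind(seq_along(labels), site)] <- rbinom(length(labels), size = 2, prob = 0.8)
  x
}

y_train <- factor(rep(ancestries, each = 40))
x_train <- simulate_genotypes(as.character(y_train))

cv_fit <- cv.glmnet(x_train, y_train, family = "multinomial", alpha = 1,
                    lambda = 10^seq(3, -3, length.out = 500), nfolds = 10)

# Five new individuals whose labels are withheld from the classifier
x_unknown <- simulate_genotypes(sample(ancestries, 5))
rownames(x_unknown) <- paste0("Unknown", 1:5)

# type = "class" returns the most probable ancestry for each row of newx
predicted <- predict(cv_fit, newx = x_unknown, s = "lambda.1se", type = "class")
predicted
```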

Predicting ancestry proportions of individuals with Mexican ancestry

You will also use predicted class probabilities to estimate the fraction of ancestry that each individual of Mexican descent has from each of the five continental ancestries used to train the classifier. You will then use violin plots to visualize the distributions of these probabilities across the set of individuals of Mexican ancestry, and hypothesize about the historical reasons for the ancestry distributions you observe.
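One possible way to turn predicted class probabilities into violin plots is sketched below. The data are simulated, and the "admixed" individuals are an assumption of this example (they carry the signal of two simulated ancestries); the helper simulate_genotypes() is hypothetical. The pattern of interest is predict() with type = "response", followed by reshaping to long format for ggplot2.

```r
library(glmnet)
library(tidyverse)

set.seed(3)

ancestries <- c("African", "European", "EastAsian", "Oceanian", "NativeAmerican")
p <- 20  # hypothetical number of genomic locations

# Hypothetical helper: the k-th ancestry carries extra copies at site k
simulate_genotypes <- function(labels) {
  x <- matrix(rbinom(length(labels) * p, size = 2, prob = 0.3),
              nrow = length(labels), ncol = p)
  site <- match(labels, ancestries)
  x[cbind(seq_along(labels), site)] <- rbinom(length(labels), size = 2, prob = 0.8)
  x
}

y_train <- factor(rep(ancestries, each = 40))
x_train <- simulate_genotypes(as.character(y_train))
cv_fit <- cv.glmnet(x_train, y_train, family = "multinomial", alpha = 1,
                    lambda = 10^seq(3, -3, length.out = 500), nfolds = 10)

# Simulated admixed individuals: signal at two ancestries' sites (an assumption)
x_mix <- simulate_genotypes(rep("European", 30))
x_mix[, match("NativeAmerican", ancestries)] <- rbinom(30, size = 2, prob = 0.8)

# type = "response" returns an n x K x 1 array of class probabilities;
# dropping the third dimension leaves one row per individual
probs <- predict(cv_fit, newx = x_mix, s = "lambda.1se", type = "response")[, , 1]

# Long format: one row per (individual, ancestry) pair
prob_long <- as_tibble(probs) %>%
  pivot_longer(everything(), names_to = "ancestry", values_to = "probability")

violins <- ggplot(prob_long, aes(x = ancestry, y = probability)) +
  geom_violin() +
  labs(x = "Ancestry", y = "Predicted class probability")
print(violins)
```

Each row of probs sums to 1, which is what allows the probabilities to be read as estimated ancestry fractions for an individual.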

Instructions for loading GlobalAncestry dataset into your RStudio Cloud environment

Recall that to upload a file to RStudio Cloud, you first must download the GlobalAncestry.csv file to your computer. Once the file is downloaded, within the “Files” panel of the RStudio Cloud environment, click “Upload” and browse to the appropriate directory on your computer to upload the GlobalAncestry.csv file.

The GlobalAncestry.csv file can be loaded using the read_csv() function of the readr package that comes loaded with tidyverse, and assigned to an object called GlobalAncestry as follows:

GlobalAncestry <- read_csv("GlobalAncestry.csv")

If you are having trouble loading the file, then refer back to the video lecture on Linear Regression where this was demonstrated in class.

Note about using glmnet for classification

When using glmnet, you will not need to recode classes as values 1, 2, 3, etc. We only performed this recoding in class to illustrate the connection with using linear regression applied to a response with values 0 and 1, as linear regression requires a quantitative response. Therefore, do not recode the ancestry values in the dataset, and simply use the values as is.

Questions and problems

  1. [10%] Load the GlobalAncestry.csv dataset, and split it into three separate datasets: a training dataset, a test dataset of unknown ancestries, and a test dataset of Mexican ancestry. That is, create the following three data frames:
     1. Training data frame called train, which only includes observations with ancestry values African, European, EastAsian, Oceanian, and NativeAmerican.
        train <- filter(GlobalAncestry, ancestry %in% c("African", "European", "EastAsian", "Oceanian", "NativeAmerican"))
     2. Test data frame called test, which only includes observations with ancestry values Unknown1, Unknown2, Unknown3, Unknown4, and Unknown5.
        test <- filter(GlobalAncestry, ancestry %in% c("Unknown1", "Unknown2", "Unknown3", "Unknown4", "Unknown5"))
     3. Test data frame called testmex, which only includes observations with ancestry value Mexican.
        testmex <- filter(GlobalAncestry, ancestry == "Mexican")
  2. [20%] Apply glmnet to the training dataset train from Question 1 to train a multinomial regression classifier with a lasso penalty across 500 values of the tuning parameter λ, spaced evenly on a base-10 logarithmic scale between 0.001 and 1000. The response will be ancestry, and the input features will be the values at the set of 8916 genomic locations. Plot the regression coefficients for each of the K=5 classes as a function of log(λ). Based on these results, does it appear that regularization and feature selection are working? Explain your answer.
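The fitting and plotting pattern that Question 2 asks for might look like the sketch below. It uses simulated genotypes with hypothetical sizes rather than the train data frame, so the resulting plots say nothing about the real data; it only shows how glmnet() and its plot method fit together for a multinomial response.

```r
library(glmnet)

set.seed(4)

# Simulated stand-in for train (hypothetical sizes: 200 people, 20 sites)
ancestries <- c("African", "European", "EastAsian", "Oceanian", "NativeAmerican")
p <- 20
y <- factor(rep(ancestries, each = 40))
x <- matrix(rbinom(length(y) * p, size = 2, prob = 0.3),
            nrow = length(y), ncol = p)
for (k in seq_along(levels(y))) {
  rows <- as.integer(y) == k
  x[rows, k] <- rbinom(sum(rows), size = 2, prob = 0.8)  # informative site
}

# 500 lambda values between 0.001 and 1000, evenly spaced on the log10 scale
lambda_grid <- 10^seq(3, -3, length.out = 500)

# alpha = 1 is the lasso penalty; the factor response is used as is
fit <- glmnet(x, y, family = "multinomial", alpha = 1, lambda = lambda_grid)

# One coefficient-path panel per class, with lambda on a log scale
par(mfrow = c(2, 3))
plot(fit, xvar = "lambda")

# Number of features with a nonzero coefficient (in any class)
# at the largest and at the smallest lambda
fit$df[c(1, length(fit$df))]
```

If the penalty is working as intended, every coefficient is zero at the largest λ, and features enter the model as λ decreases.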
