7089CEM Introduction to Statistical Methods for Data Science.

Published by: Admin
Last Updated: 10-Jul-23

Modelling and analysis of gene expression data

Learning Outcome 1: Demonstrate knowledge of underlying concepts in probability and statistics used in Data Science.

Learning Outcome 2: Select and apply appropriate statistical methods or techniques to solve problems or analyse data sets.

Learning Outcome 3: Use modern software to solve real world problems and analyse large data sets.

Learning Outcome 4: Interpret the results of their analyses and communicate those results accurately.

Coursework Description:

The aim of this assignment is to fit a non-linear time series model to the gene expression data set. Gene expression is one of the most important biological processes, in which information from a gene is used to synthesize a functional gene product, such as a protein. The expression of a gene can be controlled (or regulated) by one or several other genes, through a gene product (protein) called a transcription factor. Understanding how genes regulate each other, i.e. gene regulation, is important for investigating complex diseases and how cells respond to environmental stimuli.

Data:
The simulated time-series expression data for 5 genes are given in the CSV file (gene_data.csv). The first column contains the sampling time in minutes; the remaining 5 columns are the time-course expression data of the 5 genes {x1, x2, x3, x4, x5}, respectively. All 5 genes are subject to additive noise, assumed to be independent and identically distributed (i.i.d.) Gaussian with zero mean and unknown variance.
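
The layout described above (one time column followed by five gene columns, each corrupted by additive zero-mean Gaussian noise) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the column labels time, x1 to x5, the presence of a header row, the 2-minute sampling interval, the noise standard deviation, and the sinusoidal placeholder signals. The helper names make_synthetic_gene_csv and load_gene_csv are hypothetical. To read the real file, pass the path gene_data.csv to load_gene_csv and adjust skip_header if the file has no header row.

```python
import io

import numpy as np


def make_synthetic_gene_csv(n_samples=50, noise_sd=0.1, seed=0):
    """Build a CSV string laid out like the described file: time, then 5 genes.

    The signals are arbitrary sinusoids plus i.i.d. zero-mean Gaussian noise.
    They are placeholders, not the coursework data.
    """
    rng = np.random.default_rng(seed)
    time = np.arange(n_samples) * 2.0  # sampling time in minutes (assumed spacing)
    clean = np.column_stack([np.sin(0.05 * time + k) for k in range(5)])
    noisy = clean + rng.normal(0.0, noise_sd, size=clean.shape)
    header = "time,x1,x2,x3,x4,x5"  # assumed column labels
    table = np.column_stack([time, noisy])
    rows = [",".join(f"{v:.6f}" for v in row) for row in table]
    return "\n".join([header] + rows)


def load_gene_csv(source):
    """Read a CSV with one header row; return (time, genes) as arrays."""
    data = np.genfromtxt(source, delimiter=",", skip_header=1)
    return data[:, 0], data[:, 1:]


# Stand-in for the real file; replace the StringIO with "gene_data.csv".
time, genes = load_gene_csv(io.StringIO(make_synthetic_gene_csv()))
print(time.shape, genes.shape)
```

Keeping the time vector separate from the gene matrix makes later steps (plots against time, pairwise comparisons between columns) straightforward.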

Part 1: Preliminary data analysis
You should first perform an initial exploratory data analysis, by investigating:

Time series plots.

Distribution for each gene.

Correlation and scatter plots (between each pair of genes) to examine their dependencies.
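
The three exploratory steps listed above can be sketched in Python as follows. This is one possible approach under stated assumptions, not a prescribed solution: the data are synthetic placeholders (x2 is deliberately built to track x1 so that one pair is visibly correlated), the gene labels x1 to x5 are assumed, and the function names synthetic_data, describe, correlation_matrix and plot_eda are hypothetical. The plotting function imports matplotlib only when called, so the numerical summaries run without it.

```python
from itertools import combinations

import numpy as np

GENE_NAMES = ["x1", "x2", "x3", "x4", "x5"]  # assumed labels


def synthetic_data(n_samples=200, noise_sd=0.1, seed=1):
    """Placeholder data: five noisy signals sampled once per minute."""
    rng = np.random.default_rng(seed)
    time = np.arange(n_samples, dtype=float)
    base = np.sin(0.05 * time)
    # x2 is a scaled copy of x1's clean signal, so this pair is correlated.
    clean = np.column_stack([
        base,
        0.8 * base,
        np.cos(0.05 * time),
        np.sin(0.11 * time),
        np.cos(0.02 * time),
    ])
    return time, clean + rng.normal(0.0, noise_sd, size=clean.shape)


def describe(genes):
    """Per-gene mean, sample standard deviation, minimum and maximum."""
    return {
        "mean": genes.mean(axis=0),
        "std": genes.std(axis=0, ddof=1),
        "min": genes.min(axis=0),
        "max": genes.max(axis=0),
    }


def correlation_matrix(genes):
    """Pearson correlation between every pair of gene columns."""
    return np.corrcoef(genes, rowvar=False)


def plot_eda(time, genes, names=GENE_NAMES):
    """Time series, histograms and pairwise scatter plots (needs matplotlib)."""
    import matplotlib.pyplot as plt

    fig, (ax_ts, ax_hist) = plt.subplots(2, 1, figsize=(8, 6))
    for k, name in enumerate(names):
        ax_ts.plot(time, genes[:, k], label=name)
        ax_hist.hist(genes[:, k], bins=20, alpha=0.5, label=name)
    ax_ts.set_xlabel("time (min)")
    ax_ts.set_ylabel("expression")
    ax_ts.legend()
    ax_hist.set_xlabel("expression")
    ax_hist.set_ylabel("count")

    pairs = list(combinations(range(len(names)), 2))  # 10 pairs for 5 genes
    fig_sc, axes = plt.subplots(2, 5, figsize=(15, 6))
    for ax, (i, j) in zip(axes.ravel(), pairs):
        ax.scatter(genes[:, i], genes[:, j], s=8)
        ax.set_xlabel(names[i])
        ax.set_ylabel(names[j])
    fig_sc.tight_layout()
    return fig, fig_sc


time, genes = synthetic_data()
stats = describe(genes)
corr = correlation_matrix(genes)
print(np.round(corr, 2))
```

Calling plot_eda(time, genes) followed by matplotlib's show() displays the figures. Pearson correlation only captures linear dependence, so the scatter plots are worth inspecting even for pairs whose correlation is close to zero.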