MATH20811 Regression and Goodness

Posted 2020-11-26 blackni


Coursework 3 – Regression and Goodness of fit
MATH20811 Practical Statistics : Coursework 3
The marks awarded for this coursework constitute 40% of the total assessment for the module.
Your solution to the coursework should be a concise report (max 10 pages) and it should take, on
average, about 20 hours to complete.
The submission deadline is 10am on Monday 23 December 2019.
Late Submission of Work: Any student’s work that is submitted after the given deadline will
be classed as late, unless an extension has already been agreed via mitigating circumstances or
a DASS extension. The following rules for the application of penalties for late submission are
quoted from the University guidance on late submission document, version 1.3 (dated July 2019):
“Any work submitted at any time within the first 24 hours following the published submission
deadline will receive a penalty of 10% of the maximum amount of marks available. Any work
submitted at any time between 24 hours and up to 48 hours late will receive a deduction of 20% of
the marks available, and so on, at the rate of an additional 10% of available marks deducted per
24 hours, until the assignment is submitted or no marks remain.”
Your submitted solutions should all be in one document. This must be prepared using LaTeX.
For each part of the question you should provide explanations as to how you completed what is
required, show your workings and also comment on computational results, where applicable.
When you include a plot, be sure to give it a title and label the axes correctly.
When you have written or used R code to answer any of the parts, then you should list this R code
after the particular written answer to which it applies. This may be the R code for a function you
have written and/or code you have used to produce numerical results, plots and tables. R code
should also be clearly annotated.
Avoid using screenshots of R code/output. Instead, to include R code use the verbatim environment
and summarise R output in tables using the table environment, as demonstrated in the solution of
Example Sheet 2.
Your file should be submitted through the module site on Blackboard to the Turnitin assessment
in the Coursework folder entitled “MATH20811 CW3” by the above time and date. The work
will be marked anonymously on Blackboard so please ensure that your filename is clear but that
it does not contain your name and student id number. Similarly, do not include your name and
id number in the document itself.
Turnitin will generate a similarity report for your submitted document and indicate matches to
other sources, including billions of internet documents (both live and archived), a subscription
repository of periodicals, journals and publications, as well as submissions from other students.
Please ensure that the document you upload represents your own work and is written in your own
words. The Turnitin report will be available for you to see shortly after the due date.
This coursework should hopefully help to reinforce some of the methodology you have been studying,
as well as the skills in R you have been developing in the module. Correct interpretation and
meaningful discussion of the results (i.e. attempt to put the results into context) are as important
as correct calculation of the results, in order to achieve a high mark for the coursework.
1. The data for this question were taken from the public records. They comprise the prices of
a random sample of 1728 homes in Saratoga County, New York, USA in 2006, together with
the values of 15 other variables which may be influential in explaining the price of a house
in this area.
To obtain the data, firstly install the R package mosaicData using the install command
in the Packages menu in R. Then run the following R code to load the data into your R
workspace:
library(mosaicData)
data(SaratogaHouses)
names(SaratogaHouses)
Read the data into R using the above commands.
The first part of the coursework asks you to fit multiple regression models to the data. The
house prices in the SaratogaHouses data object are in the variable price, while the 15
covariates are:
lotSize, age, landValue, livingArea, pctCollege, bedrooms, fireplaces, bathrooms,
rooms, heating, fuel, sewer, waterfront, newConstruction, centralAir.
However, we propose to use the following variables instead of the original variables price,
lotSize, age and landValue in regression models:
tprice=price/1000
tlotSize=sqrt(lotSize)
tage=sqrt(age)
tlandValue=landValue/1000
Please note that no other transformations to any of the variables should be used.
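As a rough illustration of these definitions, the sketch below applies them to a small made-up data frame. The name toy and all of its values are hypothetical; it only stands in for SaratogaHouses, which has columns of the same names.

```r
# Hypothetical stand-in for SaratogaHouses with the four relevant columns.
toy <- data.frame(
  price     = c(150000, 200000, 120000),
  lotSize   = c(0.25, 1.00, 0.16),
  age       = c(9, 0, 100),
  landValue = c(30000, 45000, 10000)
)

# Add the transformed variables defined above; the originals are left in place.
toy <- transform(
  toy,
  tprice     = price / 1000,      # price divided by 1000
  tlotSize   = sqrt(lotSize),
  tage       = sqrt(age),
  tlandValue = landValue / 1000   # land value divided by 1000
)
print(toy)
```

Storing the transformed columns in the data frame (rather than as loose vectors) is one possible design; it keeps every variable available to a later lm() call through the data argument.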
(a) Fit a regression model (call this lm1) to predict the transformed house prices (tprice)
using the 12 original predictors and the 3 transformed predictors which are defined
above.
Using the relevant results from the summary of the model, informally comment (with
justifications) on which variables appear to be significant in predicting the response
(you may assume a 5% significance level).
Note: For categorical variables, if at least one of the categories is significantly different
to the reference category, then the variable should be retained in the model. [3]
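The kind of output referred to in (a) can be sketched on simulated data. Everything below (the data frame sim, its columns x1, x2, g and y, and the chosen effect sizes) is an assumption made for illustration and is unrelated to the housing data.

```r
set.seed(1)
n <- 200
# Simulated data (hypothetical): x1 matters, x2 does not, g is a 3-level factor.
sim <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  g  = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
sim$y <- 2 + 3 * sim$x1 + ifelse(sim$g == "c", 1.5, 0) + rnorm(n)

fit   <- lm(y ~ ., data = sim)        # regress y on every other column
coefs <- summary(fit)$coefficients    # estimate, std. error, t value, p-value

# Flag coefficients whose t-test p-value is below 0.05.
signif5 <- coefs[, "Pr(>|t|)"] < 0.05
print(round(coefs, 4))
print(signif5)
```

Note that the factor g contributes one row per non-reference level (gb and gc here), which is why the note above treats a categorical variable as a whole.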
(b) Remove the variables which you identified as non-significant in (a) and refit the model
(call this lm2).
Using dummy variables to represent the categorical variables, write down the model in
algebraic form, explaining your notation. [4]
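One way to see how R represents a categorical variable with dummy variables is to inspect the design matrix. The sketch below uses a made-up factor fuel_toy and numeric vector x (both hypothetical); it is not the coursework model.

```r
# Hypothetical factor with three levels; "gas" is the reference (first) level.
fuel_toy <- factor(c("gas", "oil", "electric", "gas", "oil"),
                   levels = c("gas", "electric", "oil"))
x <- c(1.2, 0.7, 3.1, 2.4, 1.9)

# model.matrix() shows the design matrix lm() would build: an intercept column,
# one 0/1 dummy per non-reference level, and the numeric predictor.
X <- model.matrix(~ fuel_toy + x)
print(X)
```

Each dummy column corresponds to one indicator term in the algebraic form of the model, and its coefficient measures the difference from the reference level.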
(c) Define the hypotheses for an appropriate test to compare models lm1 and lm2, in terms
of $\beta_d$, where $\beta_d$ is a vector containing the coefficients of the variables not included in
lm2 ($\beta_d$ needs to be defined explicitly). Carry out the test (you may use an R built-in
function) and interpret the results. [4]
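A comparison of two nested linear models of this kind is commonly carried out with a partial F-test. The sketch below, on simulated data with hypothetical names (sim, full, reduced), shows the built-in anova() call alongside the same statistic computed from the residual sums of squares.

```r
set.seed(2)
n <- 150
# Simulated data (hypothetical): only x1 affects y; x2 and x3 are noise.
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3, data = sim)
reduced <- lm(y ~ x1, data = sim)

# Partial F-test of H0: the coefficients of x2 and x3 are both zero.
cmp <- anova(reduced, full)
print(cmp)

# The same F statistic from the residual sums of squares.
rss0 <- sum(resid(reduced)^2)
rss1 <- sum(resid(full)^2)
q    <- 2                        # number of coefficients set to zero under H0
Fobs <- ((rss0 - rss1) / q) / (rss1 / df.residual(full))
pval <- pf(Fobs, q, df.residual(full), lower.tail = FALSE)
print(c(F = Fobs, p = pval))
```

A large p-value here would mean the data give no evidence against the reduced model.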
(d) Based on summary(lm2), explain how each variable affects the response and discuss
any other findings about the model. [5]
(e) Using diagnostic plots of the residuals against the fitted values and the various predictors,
assess whether any of the model assumptions appear to be violated. [3]
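To illustrate what such diagnostic plots can reveal, the sketch below fits a model to simulated data in which the error spread deliberately grows with the predictor. The data and the names x, y and fit are hypothetical.

```r
set.seed(3)
n <- 200
# Simulated data (hypothetical) whose error spread grows with x, so the
# constant-variance assumption is deliberately violated.
x <- runif(n, 1, 10)
y <- 5 + 2 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

par(mfrow = c(1, 2))
plot(fitted(fit), resid(fit),
     main = "Residuals vs fitted values",
     xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)            # reference line at zero
plot(x, resid(fit),
     main = "Residuals vs predictor x",
     xlab = "x", ylab = "Residual")
abline(h = 0, lty = 2)
```

A funnel shape widening to the right, as produced here, suggests non-constant variance; a curved pattern would instead point to a misspecified mean.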
In the next parts we will examine the assumption that the errors in our model lm2 are
Normally distributed. It can be shown theoretically that the estimated errors from a
fitted model (the residuals) have differing variances which depend on the values of the
covariates. Consequently, we will work with the standardised residuals and check if
their underlying distribution is consistent with a N(0, 1) probability model. To obtain
the standardised residuals from model lm2 we can use the following code which performs
the standardisation in a special way that recognises the unequal variances.
std.res=rstandard(lm2)
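For reference, the residual e_i from a linear model has variance sigma^2 (1 - h_ii), where h_ii is the i-th leverage, and rstandard() divides each residual by the estimate of its standard deviation. The sketch below checks this on simulated data (the names x, y and fit are hypothetical).

```r
set.seed(4)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)          # simulated data (hypothetical)
fit <- lm(y ~ x)

e <- resid(fit)                    # raw residuals
h <- hatvalues(fit)                # leverages h_ii
s <- summary(fit)$sigma            # residual standard error

# Standardised residual: e_i / (s * sqrt(1 - h_ii)).
manual <- e / (s * sqrt(1 - h))
print(all.equal(manual, rstandard(fit)))
```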
(f) Manually construct (rather than using an existing R function for this purpose) a
quantile-quantile plot of the standardised residuals and superimpose a suitable reference
line to help gauge Normality. Comment on the form of your plot and say whether
you think that Normality was a tenable assumption or not. [4]
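A minimal sketch of a manually built quantile-quantile plot follows. The vector z is a simulated, hypothetical stand-in for the standardised residuals, and the plotting positions (i - 0.5)/n are one common choice among several.

```r
set.seed(5)
z <- rnorm(200)                    # hypothetical stand-in for std.res

n    <- length(z)
p    <- (seq_len(n) - 0.5) / n     # plotting positions (one common choice)
theo <- qnorm(p)                   # N(0, 1) quantiles
samp <- sort(z)                    # ordered sample values

plot(theo, samp,
     main = "Normal Q-Q plot (constructed manually)",
     xlab = "Theoretical N(0, 1) quantiles",
     ylab = "Ordered standardised residuals")
abline(0, 1, col = "red")          # y = x: where points fall if z ~ N(0, 1)
```

Because the reference distribution is N(0, 1) exactly, the line y = x is a natural reference; systematic departures in the tails would indicate heavier or lighter tails than Normal.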
(g) We now wish to carry out a Kolmogorov-Smirnov test to assess whether the distribution
of the standardised residuals is N(0, 1). Write code in R to calculate (manually) the
value of the KS test statistic and find the standardised residual value where the difference
between the empirical and N(0, 1) cdfs is a maximum. [4]
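The calculation can be sketched as follows, under the assumption that the sample has no ties. The vector z is simulated and only stands in for the standardised residuals; ks.test() is used in the last line purely as a check.

```r
set.seed(6)
z <- rnorm(200)                    # hypothetical stand-in for std.res

n  <- length(z)
zs <- sort(z)
F0 <- pnorm(zs)                    # N(0, 1) cdf at the ordered values

# The empirical cdf jumps from (i - 1)/n to i/n at the i-th ordered value,
# so the largest gap must occur just before or at one of these jumps.
d_plus  <- seq_len(n) / n - F0
d_minus <- F0 - (seq_len(n) - 1) / n
D       <- max(d_plus, d_minus)

# Ordered value at which the maximum gap is attained.
i_max <- which.max(pmax(d_plus, d_minus))
z_max <- zs[i_max]

print(c(D = D, z_at_max = z_max))
print(all.equal(D, unname(ks.test(z, "pnorm")$statistic)))   # check only
```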
(h) Produce a plot containing the empirical cdf of the standardised residuals and the N(0, 1)
cdf and indicate the point at which the maximum difference between the curves occurs.
[4]
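One possible layout for such a plot is sketched below, again with a simulated vector z as a hypothetical stand-in for the standardised residuals. The empirical cdf is drawn with stepfun() rather than ecdf(); that choice is an assumption about what counts as "manual".

```r
set.seed(7)
z  <- rnorm(100)                   # hypothetical stand-in for std.res
n  <- length(z)
zs <- sort(z)
F0 <- pnorm(zs)

# Locate the largest gap between the empirical and N(0, 1) cdfs.
gap_hi <- seq_len(n) / n - F0          # ecdf at the jump minus F0
gap_lo <- F0 - (seq_len(n) - 1) / n    # F0 minus ecdf just before the jump
i_max  <- which.max(pmax(gap_hi, gap_lo))
ecdf_y <- if (gap_hi[i_max] >= gap_lo[i_max]) i_max / n else (i_max - 1) / n

plot(stepfun(zs, c(0, seq_len(n) / n)), do.points = FALSE,
     main = "Empirical cdf and N(0, 1) cdf",
     xlab = "Standardised residual", ylab = "Cumulative probability")
curve(pnorm(x), add = TRUE, col = "blue")
# Vertical segment marking the maximum difference.
segments(zs[i_max], F0[i_max], zs[i_max], ecdf_y, col = "red", lwd = 2)
legend("topleft", legend = c("Empirical cdf", "N(0, 1) cdf", "Max difference"),
       col = c("black", "blue", "red"), lty = 1, bty = "n")
```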
(i) Write a function in R to simulate the sampling distribution of the Kolmogorov-Smirnov
test statistic when the N(0, 1) null distribution is true. [4]
Run your function and use the results to plot a histogram of the estimated sampling
distribution. [2]
Use your simulated test statistic values to obtain an estimated 5% critical value for
your test. Compare your observed value with this and report your conclusions. [2]
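A Monte Carlo approach to (i) might look like the sketch below. The function names ks_stat and sim_ks_null, the sample size n = 200 and the number of replications are all hypothetical choices; in practice n would be the number of standardised residuals.

```r
# KS statistic against N(0, 1) for a sample without ties.
ks_stat <- function(z) {
  n  <- length(z)
  F0 <- pnorm(sort(z))
  max(seq_len(n) / n - F0, F0 - (seq_len(n) - 1) / n)
}

# Simulate the statistic n_sim times under the N(0, 1) null.
sim_ks_null <- function(n, n_sim = 2000) {
  replicate(n_sim, ks_stat(rnorm(n)))
}

set.seed(8)
d_sim <- sim_ks_null(n = 200)      # n = 200 is an arbitrary illustrative size

hist(d_sim, breaks = 40,
     main = "Simulated null distribution of the KS statistic",
     xlab = "KS statistic")
crit <- quantile(d_sim, 0.95)      # estimated 5% critical value (upper tail)
abline(v = crit, col = "red", lty = 2)
print(crit)
```

The null hypothesis would be rejected at the 5% level when the observed statistic exceeds crit. More replications give a more stable estimate of the critical value at the cost of run time.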
(j) If you found significant evidence against the assumption of Normally distributed errors
in (i), how does this impact your conclusions about the significance of the predictors
in (a)? [1]
