Saturday, October 7, 2023

LIS4273 - Module 7 Assignment

 For this assignment, I will be answering the following questions:

Question 1:

In this assignment's segment, we will use the following regression equation:

Y = a + bX + e

Where:

Y is the value of the Dependent variable (Y), what is being predicted or explained

a or Alpha, a constant; equals the value of Y when the value of X = 0

b or Beta, the coefficient of X; the slope of the regression line; how much Y changes for each one-unit change in X.

X is the value of the Independent variable (X), what is predicting or explaining the value of Y.

e is the error term; the error in predicting the value of Y, given the value of X (it is not displayed in most regression equations).

A reminder about the lm() function:

lm([target variable] ~ [predictor variables], data = [data source])


The data in this assignment:

1.1 Define the relationship model between the predictor and the response variable

To define the relationship model, we can load the two variables into R and use the lm function, with x as the predictor and y as the response. Analyzing the output via the summary function, we can conclude that there is a positive linear relationship between x and y. The single asterisk (*) on the x row indicates that the predictor variable x is statistically significant at the 0.05 significance level. All in all, the predictor variable (x) has a statistically significant effect on the response variable (y).

1.2 Calculate the coefficients

To calculate the coefficients, we can use the coefficients function to pull the intercept and slope directly from the fitted model.
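As a minimal sketch of steps 1.1 and 1.2, the code below fits the model and extracts the coefficients. The x and y values here are synthetic stand-ins generated for illustration, not the assignment's data, and the names example_data and model are my own choices.

```r
# Synthetic stand-in data for illustration only; NOT the assignment's values.
set.seed(42)
x <- 1:20
y <- 5 + 2 * x + rnorm(20, sd = 3)
example_data <- data.frame(x = x, y = y)

# 1.1 Define the relationship model Y = a + bX + e
model <- lm(y ~ x, data = example_data)

# Estimates, standard errors, p-values, and significance stars
summary(model)

# 1.2 Pull out only the intercept (a) and the slope (b)
coefficients(model)
```

With the real data, only the two vectors would need to be swapped out; the lm, summary, and coefficients calls stay the same.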

Question 2:

The following question is posted by Chi Lau...

Apply the simple linear regression model (see the above formula) to the data set called "visit" (see below) and estimate the discharge duration if the waiting time since the last eruption has been 80 minutes.

Employ the following formula: (discharge ~ waiting and data = visit)

Please see the given coding solution in R:

2.1 Define the relationship model between the predictor and the response variable

When we look at waiting, our predictor variable, its very low p-value (<2e-16) shows that it is highly statistically significant. In other words, waiting is a meaningful predictor of the response variable “discharge”.

2.2 Extract the parameters of the estimated regression equation with the coefficients function.

2.3 Determine the fit of the eruption duration using the estimated regression equation.

To understand the fit of the eruption duration, we can look at the summary of the model. Just looking at the Residuals section of the output, the minimum and maximum values suggest that some outliers may be present, which could point to issues with the model. When we examine the multiple R-squared, we see that it is 0.8115, or 81.15%. This means that about 81% of the variability in discharge is explained by the model; the higher the value, the better the fit. The adjusted R-squared, which penalizes the model for the number of predictors, is only slightly lower at 0.8108. All in all, the estimated regression equation provides a good fit of the eruption duration.
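The steps in 2.1 through 2.3 can be sketched as follows. The R-squared values reported above match those of R's built-in faithful data set, so this sketch assumes that visit holds the same values with the eruptions column renamed to discharge. That is an assumption on my part, as are the names fit and new_visit.

```r
# Assumption: `visit` mirrors the built-in `faithful` data set, with the
# `eruptions` column renamed to `discharge`.
visit <- data.frame(discharge = faithful$eruptions,
                    waiting   = faithful$waiting)

# 2.1 Define the relationship model
fit <- lm(discharge ~ waiting, data = visit)
summary(fit)

# 2.2 Extract the intercept and slope of the estimated regression equation
coefficients(fit)

# 2.3 Estimate the discharge duration for a waiting time of 80 minutes
new_visit <- data.frame(waiting = 80)
predict(fit, newdata = new_visit)

# Goodness of fit: multiple and adjusted R-squared
summary(fit)$r.squared
summary(fit)$adj.r.squared
```

Using predict with a one-row data frame avoids typing the intercept and slope by hand, which keeps the estimate consistent with the fitted model.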

Question 3: Multiple Regression

We will use a very famous dataset in R called mtcars. This dataset was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

This data frame contains 32 observations on 11 (numeric) variables.

To call mtcars data in R

R comes with several built-in data sets, which are generally used as demo data for playing with R functions. One of the datasets built into R is mtcars.

In this question, we will use four of the variables found in mtcars by using the following function:

R will display

3.1 Examine the multiple regression model as stated above and its coefficients using 4 different variables from mtcars (mpg, disp, hp and wt). Report on the result and explain what the multiple regression model and its coefficients tell us about the data.

Looking at the following output via summary:

As we analyze the results, we can immediately note from the coefficients section of the output that the estimates for disp, hp, and wt are all negative. What does this mean? Each coefficient describes how mpg changes for a one-unit increase in that variable while the other two are held constant. Starting with disp, the coefficient is very close to zero (-0.000937), which suggests almost no linear relationship between disp and mpg once hp and wt are accounted for. As for hp, the coefficient is -0.031157, so as hp increases by one unit, mpg decreases by about 0.03, which is a fairly reasonable observation. Lastly, the coefficient for wt (-3.800891) shows that as wt increases by one unit (1,000 lbs), mpg decreases by about 3.8.
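For reference, a minimal sketch of how this model can be fit with the built-in mtcars data. The object names input and model are my own choices for illustration.

```r
# Keep only the four variables used in this question
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
head(input)

# Multiple regression: mpg explained by disp, hp, and wt
model <- lm(mpg ~ disp + hp + wt, data = input)

# Estimates, standard errors, p-values, and R-squared
summary(model)

# Only the intercept and the three slopes
round(coefficients(model), 6)
```

Reading the Pr(>|t|) column of the summary alongside the estimates helps separate coefficients that are merely negative from those that are also statistically significant.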

Question 4:

From our textbook, p. 110, Exercise 5.1

With the rmr data set, plot metabolic rate versus body weight. Fit a linear regression to the relation. According to the fitted model, what is the predicted metabolic rate for a body weight of 70 kg?

The rmr data set is available in R through the book's package, ISwR, so make sure to install it first. After installing the ISwR package, here is a simple illustration of how the problem is set up.

Looking at the data points on the scatter plot, it appears to suggest a strong positive linear relationship. For a body weight of 70 kg, we can see that it corresponds to a predicted metabolic rate of around 1300.



To get the exact predicted metabolic rate, we execute the following code (please see lines 19 through 26 and their corresponding output).
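A minimal sketch of that calculation, assuming the ISwR package is already installed. The names fit and predicted are my own choices and may differ from the original script.

```r
# install.packages("ISwR")  # run once if the package is not installed
library(ISwR)

# Scatter plot of metabolic rate versus body weight
plot(metabolic.rate ~ body.weight, data = rmr,
     xlab = "Body weight (kg)", ylab = "Metabolic rate")

# Fit the linear regression and add the fitted line to the plot
fit <- lm(metabolic.rate ~ body.weight, data = rmr)
abline(fit)

# Predicted metabolic rate for a body weight of 70 kg
predicted <- predict(fit, newdata = data.frame(body.weight = 70))
predicted
```

The newdata column must be named body.weight, exactly as in rmr, or predict will not be able to match it to the model's predictor.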

~ Katie
