Wednesday, November 15, 2023

LIS4273 - UFO Sightings Final Project

Important Background Information:

For this project, I decided to analyze a dataset containing UFO sightings in the United States, Mexico, Canada, and certain places in Europe. However, to keep it simple I focused solely on the data records from the United States.

Here is the link to the original dataset: https://www.kaggle.com/datasets/NUFORC/ufo-sightings/data

Note: when you go on to download the dataset, you will actually download two datasets: scrubbed.csv and complete.csv

For this analysis, I worked from scrubbed.csv, which had fewer entries. After some data munging, I was able to get it down to 63,561 rows and 11 variables, which serve as the base of my analysis.

Establishing the Hypothesis (Problem Description):

From this dataset, I intended to determine the following:

Is there any difference in the duration (seconds) of the sightings between the north and south regions of the United States?

Following the flowchart provided in the project instructions, I worked with two samples (North and South) and developed the null and alternative hypotheses.

Null Hypothesis (H0): The average duration of UFO sightings is the same in the North and South regions, and there is no significant difference in the variability of durations between the two regions.

Alternative Hypothesis (H1): The average duration of UFO sightings differs between the North and South regions, and there is a significant difference in the variability of durations between the two regions.

By conducting the Levene test, I determined that the variances could be treated as equal, so a two-sample t-test would be suitable for this analysis.

Also, it is important to note how I determined which states could be classified as "North" or "South". I created a new variable called region, which uses an if-else statement to classify a state as "North" or "South" depending on whether its latitude lies above or below my selected benchmark latitude. Given that I am from Maryland, this latitude value is 39.718457 and is based on the Mason-Dixon Line, which I have crossed on multiple occasions. Fun times!
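For reference, a minimal sketch of how this region variable could be created in R is shown below. The column names (country, latitude) and the "us" country code are assumptions based on the Kaggle file, and the subsetting mirrors the munging described above.

ufo <- read.csv("scrubbed.csv", stringsAsFactors = FALSE)
ufo_us <- subset(ufo, country == "us")            # assumed column and level names
ufo_us$latitude <- as.numeric(ufo_us$latitude)    # latitude arrives as text in places
mason_dixon <- 39.718457
ufo_us$region <- ifelse(ufo_us$latitude > mason_dixon, "North", "South")
table(ufo_us$region)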

Related Work:

As for how my problem and chosen method of analysis relate to what was learned during the semester, they trace back to Module #6: Random Variables and Probability Distributions and One-Sample and Two-Sample Tests. I used the table provided on the Module #6 One-sample and Two-sample tests page as a guide when forming my hypotheses.

Solution:

To solve the problem, I used various R functions to conduct my analysis. To determine whether the variances are equal, I used the Levene test, and my primary methodology was a two-sample t-test.

Why a two sample t-test, you might ask? Well, this test was determined to be the best choice by following the guidelines in the Final Project flowchart.

Given that I had two samples (North and South), the first question I had to ask myself was whether the data was paired. In this case, it was not: North had only 28,423 records while South had 35,020. Paired data requires matched observations on the same subjects, which also implies the two groups are of equal length.

Therefore, I followed the "No" arrow and proceeded to the "Equal Variances?" prompt. This part was a little tricky: the Levene test p-value was 0.05573, just barely above the 0.05 significance level, so it was natural to check "Assume Equal Variances", but it could almost as easily be read as unequal variances. Because I was concerned about a potential violation, I performed two t-tests, one with var.equal set to TRUE and one with var.equal set to FALSE, to see if there was any difference. The results confirmed that there was no meaningful difference. Thankfully, whether I selected "Yes" or "No" at the Equal Variances prompt, it led to the same two-sample t-test.
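A sketch of the tests described here, assuming the ufo_us data frame from the earlier sketch and a numeric duration column (the CSV's "duration (seconds)" field becomes duration..seconds. after read.csv); leveneTest() comes from the car package.

library(car)                                                  # provides leveneTest()
ufo_us$duration <- as.numeric(ufo_us$duration..seconds.)      # assumed column name
leveneTest(duration ~ factor(region), data = ufo_us)          # equality-of-variance check
t.test(duration ~ factor(region), data = ufo_us, var.equal = TRUE)   # pooled two-sample t-test
t.test(duration ~ factor(region), data = ufo_us, var.equal = FALSE)  # Welch two-sample t-test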

To understand why a two-sample t-test was preferred over other analysis methods, I realized that I needed a test that could handle two samples, so I ruled out a one-sample t-test right away. I also did not consider one-way ANOVA, which is typically used when there are three or more independent groups. And because my data was not paired, I did not run a paired t-test. From these considerations, I concluded that the two-sample t-test was the best fit for the data given the choices in the flowchart.

Findings and Conclusions:

Upon executing my first t-test with var.equal set to TRUE, I became aware that I was in marginal violation of the equal variance assumption, since the Levene test p-value (0.05573) was only slightly above the 0.05 significance level. With this minor issue in mind, I ran the t-test twice, with var.equal set to TRUE and then FALSE, to see whether there was any major difference in output. The differences were minor and pointed to the same findings.

Thus, we can infer the following.

Seeing that the p-value is just above the 0.05 significance level, there is not enough evidence to reject the null hypothesis. In other words, there is no significant difference in the average duration of UFO sightings between the north and south regions of the United States. While the sample means for duration do differ between North and South, the difference is not statistically significant.

~ Katie

Thursday, November 9, 2023

LIS4273 - Module 12 Assignment

 For this assignment, I will be answering the following questions:

The table below represents charges for a student credit card.

a. Construct a time series plot using R

b. Employ Exponential Smoothing Model as outlined in Avril Voghlan's notes and report the statistical outcome.

For these two parts, please see the following R code and its associated output:
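The gist itself is not reproduced here, but a minimal sketch of the approach follows. The charges vector (the monthly values from the assignment table) and the series start date are assumptions.

# charges: monthly credit card charges from the assignment table (values not reproduced here)
charges_ts <- ts(charges, frequency = 12, start = c(2021, 1))   # assumed start date
plot.ts(charges_ts, ylab = "Charges", main = "Student credit card charges")
fit <- HoltWinters(charges_ts, beta = FALSE, gamma = FALSE)     # simple exponential smoothing
fit            # reports alpha and the final level coefficient a
fit$SSE        # sum of squared one-step-ahead errors
plot(fit)      # observed series with the fitted (smoothed) values overlaid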

c. Provide a discussion on time series and Exponential Smoothing Model result you led to.

Looking first at the time series plot, it clearly shows the fluctuations in charges across the months and years. Charges appear to peak around the middle of each year, suggesting a recurring seasonal spending pattern.

When we review the exponential smoothing model, the first thing to note is that both parameters (beta and gamma) are set to FALSE, meaning the model includes neither a trend nor a seasonal component and instead emphasizes the short-term variation in the data (simple exponential smoothing).

The high alpha value (0.8232442) indicates that the forecasts are influenced primarily by recent observations. The coefficient 'a' (62.44453) is the final smoothed level, so the model expects future values to sit around this value. When the fitted model is plotted over the original series, the fitted values align closely with the observed data, more so than was apparent from the first plot. Moving on to the sum of squared errors (SSE), the reported value of 835.38 suggests that a model including trend and seasonality might better capture the underlying patterns in the data. Thus, it is key to strike a good balance between model refinement and the goals of the analysis to improve accuracy and reliability.

~ Katie


Tuesday, October 31, 2023

LIS4273 - Module 11 Assignment

 For this assignment, I will be answering the following questions:

10.1

From our textbook: pp. 188, Question 10.1

Set up an additive model for the ashina data, as part of the ISwR package

This data contains additive effects on subjects, period, and treatment. Compare the results with those obtained from t tests.

R code:
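The code itself lives in the embedded gist; below is a sketch of one common way to set up the additive model and the comparison t-tests, assuming the ashina variables documented in ISwR (vas.active, vas.plac, grp).

library(ISwR)
data(ashina)
attach(ashina)
vas     <- c(vas.active, vas.plac)                        # stack the two measurements
subject <- factor(rep(seq_len(nrow(ashina)), 2))
treat   <- factor(rep(c("active", "placebo"), each = nrow(ashina)))
period  <- factor(c(grp, 3 - grp))                        # grp records which treatment came first
fit <- lm(vas ~ subject + period + treat)                 # additive model
summary(fit)
t.test(vas.active, vas.plac, paired = TRUE)               # treatment effect
t.test((vas.active - vas.plac)[grp == 1],                 # period effect in the crossover design
       (vas.active - vas.plac)[grp == 2])
detach(ashina)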

Result Interpretation:

Analyzing the results of the additive model, we first see that the coefficient for treat is -42.87, suggesting that the treatment group experienced a 42.87-unit decrease in vas compared to the control group. The coefficient for period is 80.50, meaning period 2 saw an 80.50-unit increase in vas compared to period 1. Looking at the subjects, the significance codes stand out: the intercept, treat, subject3, subject5, subject7, subject8, and subject10 are all significant at the 0.01 level (**), so their effects are unlikely to be due to random chance.

As for the residuals, they range from -48.94 to 48.94. Ideally these values should be randomly scattered around zero; a pattern like this hints that the model is not capturing some aspect of the data. The R-squared value is 0.7566, meaning 75.66% of the variability in vas scores is accounted for by the model (a good fit is a value close to 1). Lastly, with an F-statistic of 2.914 and an associated p-value of 0.02229, the model as a whole is statistically significant.

Moving on to the t-test for treatment, there is a significant difference between the treatment and control groups (p-value 0.02099). The t-test for period is only marginally significant (p-value 0.0672).

10.3 Consider the following definitions


Note:

The rnorm() is a built-in R function that generates a vector of normally distributed random numbers. The rnorm() method takes a sample size as input and generates that many numbers.

Your assignment:

Generate the model matrices for models z ~ a*b, z ~ a:b, etc. In your blog posting discuss the implications. Carry out the model fits and notice which models contain singularities.

Hint

We are looking for...

model.matrix(~ a:b); lm(z ~ a:b)

R code:
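The embedded gist is not shown here; a sketch follows, assuming the textbook's definitions of a, b, and z for this exercise.

a <- gl(2, 2, 8)                 # factor: 1 1 2 2 1 1 2 2
b <- gl(2, 4, 8)                 # factor: 1 1 1 1 2 2 2 2
z <- rnorm(8)                    # 8 normally distributed random numbers
model.matrix(~ a * b)            # main effects plus interaction columns
model.matrix(~ a:b)              # interaction-only coding
model1 <- lm(z ~ a * b)
model2 <- lm(z ~ a:b)
summary(model2)                  # one coefficient comes back NA: the design is singular
any(is.na(coef(model1)))         # FALSE: no aliased terms
any(is.na(coef(model2)))         # TRUE: perfect collinearity among the a:b dummies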

Result Interpretation:

After executing the code, the only model that contained singularities was model2, which used the expression a:b. The other models returned FALSE when checked for singularities. What are the implications of this? First, it means model2 has perfect collinearity: one predictor can be exactly predicted from the other, which can lead to numerical instability when estimating the coefficients. Singularity also means there is an infinite number of solutions to the model, so R cannot uniquely determine the coefficients of a and b. Further, because a and b cannot be separated due to the collinearity, the coefficient estimates will be unreliable.

As for the other models, which returned FALSE when checked for singularities, we know that their coefficient estimates are reliable. Seeing that only one model had perfect collinearity and the rest did not, it would be best to focus on the non-singular models in future analysis.

~ Katie

Monday, October 23, 2023

LIS4273 - Module 10 Assignment

 For this assignment, I will be answering the following questions:

Question 1:

From our textbook, Introductory Statistics with R pp. 159 Exercises, 9.1 and 9.2

9.1 I revised this question, so please follow my description only. Conduct ANOVA (analysis of variance) and Regression coefficients to the data from cystfibr (> data("cystfibr")) database. You can choose any variable you like. In your report, you need to state the result of Coefficients (intercept) to any variables you like both under ANOVA and multivariate analysis. I am specifically looking at your interpretation of R results.

Extra clue:

The model code:

In R:
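The gist is embedded in the original post; the sketch below shows the kind of model the discussion underneath appears to be based on (pemax regressed on age, weight, bmp, and fev1).

library(ISwR)
data(cystfibr)
fit <- lm(pemax ~ age + weight + bmp + fev1, data = cystfibr)
summary(fit)     # regression coefficients, standard errors, and p-values
anova(fit)       # sequential ANOVA table for the same model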

Interpretation of the results:

Reviewing the output of the model, in terms of significance the intercept is highly significant, while weight, bmp, and fev1 are all significant. Reading further into the results, we can take away the following ideas:

With age, pemax (maximum expiratory pressure) decreases by approximately 3.4181 units for every one-unit increase in age. However, the R results do not deem this variable significant, so we must move on.

As for weight, pemax (maximum expiratory pressure) will increase by 2.6882 units for every one unit increase in weight. R views this variable as statistically significant so this variable may be quite helpful in future models.

For bmp, which is body mass (as a percentage of normal), pemax decreases by 2.0657 units for every one-unit increase in bmp. This variable is also statistically significant and will definitely be helpful in future analyses.

Lastly, with fev1, which refers to forced expiratory volume, pemax will increase by 1.0882 units for every one unit increase in fev1.

Taking into account the p-values for each of the variables, only fev1 comes close to the 0.05 significance level from below, at around 0.04695. The other variables have p-values lower than fev1's, with the exception of age, whose p-value is not significant at all.

Moving on to the ANOVA table, there are some surprising results. Looking first at the sums of squares, a higher sum of squares corresponds to greater variability explained; the variable age has by far the highest value at 10098.5, and the rest of the variables do not come close. Reviewing the p-values and their significance codes, age is highly significant at the three-star (***) level, while bmp and fev1 are only marginally significant and weight receives no stars at all. Hence, age is a highly significant predictor, with bmp and fev1 not far behind, which suggests they contribute to the variability in pemax.

Question 2:

9.2 The secher data (> data("secher")) are best analyzed after log-transforming birth weight as well as the abdominal and biparietal diameters. Fit a prediction equation for birth weight.

How much is gained by using both diameters in a prediction equation?

The sum of the two regression coefficients is almost identical and equal to 3.

Can this be given a nice interpretation to our analysis?

Please provide a step by step on your analysis and code you use to find out the result.

Extra clue:

In R:
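Again the code sits in the embedded gist; here is a sketch of both the single-diameter model discussed below and the two-diameter model the exercise asks about.

library(ISwR)
data(secher)
fit_ad <- lm(log(bwt) ~ log(ad), data = secher)                # abdominal diameter only
summary(fit_ad)
fit_both <- lm(log(bwt) ~ log(ad) + log(bpd), data = secher)   # both diameters
summary(fit_both)
sum(coef(fit_both)[c("log(ad)", "log(bpd)")])                  # the two slopes sum to roughly 3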

Interpretation of the results:

Analyzing the output of this model, when log(bwt) (birth weight) is modeled on log(ad) (abdominal diameter), it is clear that log(ad) contributes to the prediction of log(bwt): in the coefficients section, each one-unit increase in log(ad) corresponds to an estimated 2.2365-unit change in log(bwt). As for the sum of the two regression coefficients in the two-diameter model being almost exactly 3, a common interpretation is that birth weight scales roughly with volume, i.e. with the cube of a linear dimension, which is why the exponents on the two log-diameters add up to about 3. (In an additive model such as this, the effect of one predictor on the response is independent of the values of the other predictors.) Moving on to the R-squared value, it is 0.7959, meaning 79.59% of the variability in log(bwt) is explained by the model. And given the large F-statistic and its low p-value, the model is statistically significant.

~ Katie

Friday, October 20, 2023

LIS 4273 - Module 9 Assignment

For this assignment, I will be answering the following questions:

Question 1:

Your data frame is

 


Generate a simple table in R that consists of four rows: Country, age, salary, Purchased
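The original data frame is shown as an image in the post; the sketch below uses placeholder values purely for illustration, not the assignment's actual table.

df <- data.frame(
  Country   = c("France", "Spain", "Germany", "Spain"),    # placeholder values
  Age       = c(44, 27, 30, 38),
  Salary    = c(72000, 48000, 54000, 61000),
  Purchased = c("No", "Yes", "No", "No")
)
df        # a simple table with the four requested columns
str(df)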


Question 2

Generate a contingency table, also known as an r x c table, using the mtcars dataset
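One way to build such a table from mtcars, as a sketch (the gist may use different variables):

data(mtcars)
cont_tab <- table(mtcars$cyl, mtcars$gear)   # counts of cars by cylinders and gears
cont_tab
addmargins(cont_tab)                         # add row and column totals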

Note: the following sections of this question can be found within the gist:

~ Katie

Tuesday, October 10, 2023

LIS4273 - Module 8 Assignment

For this post, I will be answering the following questions.

Question 1:

A researcher is interested in the effects of a drug on stress reaction. She gives a reaction time test to three different groups of subjects: one group that is under a great deal of stress, one group under a moderate amount of stress, and a third group that is under almost no stress. The subjects of the study were instructed to take the drug during their next stress episode and to report their stress on a scale of 1 to 10 (10 being the highest).



Report on the drug and stress level by using R. Provide a full summary report on the result of ANOVA testing and what does it mean? More specifically, report using the following R functions:
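The R functions and output are in the embedded gist; a sketch of the general approach appears below. The stress scores here are placeholders, not the assignment's actual values.

stress_df <- data.frame(
  stressLevel = c(10, 9, 8, 9, 10,  8, 10, 6, 7, 8,  4, 6, 6, 4, 2),   # placeholder scores
  stressGroup = factor(rep(c("high", "moderate", "low"), each = 5))
)
fit <- aov(stressLevel ~ stressGroup, data = stress_df)
summary(fit)     # Df, Sum Sq, Mean Sq, F value, and Pr(>F) for stressGroup and Residuals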


After running summary() on the ANOVA fit, we first see that the degrees of freedom for stressGroup is 2, reflecting the three stress groups. The Sum Sq value of 82.11 is the sum of squares associated with the stressGroup variable; given this relatively large value, a substantial amount of the variability in stressLevel is being explained by stressGroup. The Mean Sq for stressGroup is 41.06, and since it is larger than the residual mean square, the stress group explains more variability in stressLevel than would be expected by random chance. Finally, the F value of 21.36 is large and is coupled with a very small p-value (4.08e-05). Thus, the stress group variable has a statistically significant effect on stress levels.

Question 2:

From our Textbook: Introductory Statistics with R, Chapter 6, Exercises 6.1, pp. 127

The zelazo data (taken from the textbook's R package ISwR) are in the form of a list of vectors, one for each of the four groups. Convert the data to a form suitable for the use of lm, and calculate the relevant test. Consider t tests comparing selected subgroups or obtained by combining groups.

2.1 Consider ANOVA test (one-way or two-way) to this dataset (zelazo)

Recommendations



After reflecting on the zelazo dataset, the one-way ANOVA test is more appropriate, given that the data set contains only one independent variable (type of training) with four levels: "active", "passive", "none", and "ctr.8w" (the control group). Since one-way ANOVA is designed for a single categorical independent variable (factor) with more than two levels (groups), it is a better fit for the problem than a two-way ANOVA: we have four groups and one factor. A sketch of how the data could be reshaped and the test run appears below.
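library(ISwR)
data(zelazo)
walk  <- unlist(zelazo)                                      # age at walking, all groups stacked
group <- factor(rep(names(zelazo), sapply(zelazo, length)))  # group label for each observation
summary(aov(walk ~ group))                                   # one-way ANOVA across the four groups
anova(lm(walk ~ group))                                      # equivalent fit via lm()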

~ Katie

Saturday, October 7, 2023

LIS4273 - Module 7 Assignment

 For this assignment, I will be answering the following questions:

Question 1:

In this assignment's segment, we will use the following regression equation:

Y = a + bX + e

Where:

Y is the value of the Dependent variable (Y), what is being predicted or explained

a or Alpha, a constant; equal the value of Y when the value of X = 0

b or Beta, the coefficient of X; the slope of the regression line; how much Y changes for each one-unit change in X.

X is the value of the Independent variable (X), what is predicting or explaining the value of Y.

e is the error term; the error in predicting the value of Y, given the value of X (it is not displayed in most regression equations).

A reminder about the lm() function:

lm([target variable] ~ [predictor variables], data = [data source])

1.1

The data in this assignment:

1.1 Define the relationship model between the predictor and the response variable

To define the relationship model between the predictor and response variables, we can simply load the two vectors into R and use the lm function. Analyzing the output via the summary function, we can conclude that there is a positive linear relationship between x and y. Noting the asterisk (*) on the x row, the predictor x is statistically significant at the 0.05 significance level. All in all, the predictor variable (x) has a statistically significant impact on the response variable (y).

1.2 Calculate the coefficients

To calculate the coefficients, we can use the coefficients function to pull them directly from the fitted model.
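The assignment's x and y vectors are shown only in the embedded gist, so the sketch below uses placeholder values; the workflow for parts 1.1 and 1.2 is the same either way.

x <- c(1, 2, 3, 4, 5, 6, 7)                      # placeholder predictor values
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8)     # placeholder response values
model <- lm(y ~ x)        # 1.1: define the relationship model
summary(model)            # slope, intercept, and significance of x
coefficients(model)       # 1.2: extract the intercept (a) and slope (b)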

Question 2:

The following question is posted by Chi Lau...

Apply the simple linear regression model (see the above formula) for the data set called "visit" (see below) and estimate the discharge duration if the waiting time since the last eruption has been 80 minutes.

Employ the following formula: (discharge ~ waiting and data = visit)

Please see the given coding solution in R:
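A sketch of the solution, assuming visit is a data frame with waiting and discharge columns as in the formula given above:

model <- lm(discharge ~ waiting, data = visit)
summary(model)                                         # 2.1: relationship model
coefficients(model)                                    # 2.2: intercept and slope
predict(model, newdata = data.frame(waiting = 80))     # estimated discharge at 80 minutes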

2.1 Define the relationship model between the predictor and the response variable

Interestingly, when we look at waiting, our predictor variable, it is highly significant, as indicated by the very small p-value (<2e-16). This tells us that waiting is indeed a meaningful predictor of the response variable, discharge.

2.2 Extract the parameters of the estimated regression equation with the coefficients function.

2.3 Determine the fit of the eruption duration using the estimated regression equation.

To understand the fit of the eruption duration, we can look at the summary of the model. The Residuals section of the output suggests there are some extreme outliers present, which could point to issues within the model. Examining the multiple R-squared, we see that it is 0.8115, or 81.15%; this is the share of the variability in discharge explained by the model, and the higher the value, the better the fit. The adjusted R-squared (0.8108) applies only a small penalty. Overall, the estimated regression equation provides a good fit for the eruption duration.

Question 3: Multiple Regression

We will use a very famous dataset in R called mtcars. This dataset was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

This data frame contains 32 observations on 11 (numeric) variables.

To call mtcars data in R

R comes with several built-in data sets, which are generally used as demo data for playing with R functions. One of those datasets built into R is mtcars.

In the question, we will use four of the variables found in mtcars by using the following function:

R will display

3.1 Examine the relationship Multi Regression Model as stated above and its Coefficients using 4 different variables from mtcars (mpg, disp, hp and wt). Report on the result and explanation what does the multi regression model and coefficients tells about the data?

Looking at the following output via summary:
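For reference, a sketch of the multiple regression the summary output comes from:

data(mtcars)
multi_model <- lm(mpg ~ disp + hp + wt, data = mtcars)
summary(multi_model)     # coefficients for disp, hp, and wt plus overall fit statistics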

As we analyze the results, we can immediately note from the coefficients section that disp, hp, and wt all have negative values. What does this mean? Starting with disp, its coefficient is very close to zero (-0.000937), so disp has almost no linear effect on mpg once the other variables are accounted for. As for hp, the value is -0.031157, meaning that as hp increases, mpg decreases, which is a fairly reasonable observation. Lastly, the coefficient for wt (-3.800891) shows that as wt increases, mpg decreases.

Question 4:

From our textbook pp. 110 Exercises # 5.1

With the rmr data set, plot metabolic rate versus body weight. Fit a linear regression to the relation. According to the fitted model, what is the predicted metabolic rate for a body weight of 70 kg?

The data set rmr is in R; make sure to install the book's R package, ISwR. After installing the ISwR package, here is a simple illustration of the setup of the problem.

Looking at the data points on the scatter plot, it appears to suggest a strong positive linear relationship. For a body weight of 70 kg, we can see that it corresponds to a predicted metabolic rate of around 1300.



To see the exact value, we execute the following code to get the exact metabolic rate (please see lines 19 through 26 and its corresponding output).
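The exact code is in the gist; a minimal sketch using the rmr variable names from ISwR (body.weight, metabolic.rate) looks like this:

library(ISwR)
data(rmr)
plot(metabolic.rate ~ body.weight, data = rmr)             # scatter plot of the relation
fit <- lm(metabolic.rate ~ body.weight, data = rmr)
abline(fit)                                                # add the fitted regression line
predict(fit, newdata = data.frame(body.weight = 70))       # predicted metabolic rate at 70 kg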

~ Katie

Saturday, September 30, 2023

LIS4273 - Module 6 Assignment

 For this assignment, I will be answering the following questions:

A.

Consider a population consisting of the following values, which represents the number of ice cream purchases during the academic year for each of the five housemates. 8, 14, 16, 10, 11

For the four parts of the question, I will embed each of the sub questions within the following R code.
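The sub-questions themselves live in the embedded code, but a sketch of the basic quantities for this small population is below.

purchases <- c(8, 14, 16, 10, 11)                    # ice cream purchases for the five housemates
n <- length(purchases)
pop_mean <- mean(purchases)                          # population mean, 11.8
pop_sd   <- sqrt(mean((purchases - pop_mean)^2))     # population standard deviation
std_error <- pop_sd / sqrt(n)                        # standard error of the sample mean
pop_mean; pop_sd; std_error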

B.

Suppose that the sample size n = 100 and the population proportion p = 0.95.

1. Does the sample proportion p have approximately a normal distribution? Explain.

When determining whether the sample proportion p has an approximately normal distribution, it is important to refer to the Central Limit Theorem, which tells us that the larger the sample size, the closer the sampling distribution of the sample proportion gets to a normal distribution. Additionally, if the population proportion is very close to 0 or 1, a larger sample size is needed before the distribution becomes approximately normal.

Given that the sample size n is 100 and the population proportion p is 0.95, this scenario might seem suitable for a normal approximation. However, to better satisfy the normality conditions, a larger sample size would be needed to achieve an approximately normal distribution.

Let's break this idea down mathematically:

Using the above values, let's solve for np and nq and determine whether each is greater than 10, as required for the distribution to be considered approximately normal.

np = 100 * 0.95 = 95

nq = 100 * (1 – 0.95) = 5

Only np is greater than 10, so we can conclude that this is not well approximated by a normal distribution. While these circumstances may not be completely unreasonable, if precision is prioritized then a larger sample size is needed.

2. What is the smallest value of n for which the sampling distribution of p is approximately normal?

To answer this question, I must preface that there is no one-size-fits-all answer when it comes to achieving the smallest possible value of n for which the sampling distribution is approximately normal. Typically, the smallest value of n depends on the population proportion and the desired level of approximation. A common rule of thumb is that both np and n(1 - p) should be at least 10; with p = 0.95, that requires n(0.05) >= 10, i.e. n >= 200, so 200 is the smallest sample size under this guideline. The best value still varies depending on the population distribution and how close p is to 0 or 1.

The sample mean from a group of observations is an estimate of the population mean μ . Given a sample of size n, consider n independent random variables X1, X2, ..., Xn, each corresponding to one randomly selected observation. Each of these variables has the distribution of the population, with mean μ and standard deviation σ.

A. Population mean = (8 + 14 + 16 + 10 + 11) / 5 (5 represents the number of values in the set)

B. Sample of size n = 5

C. Mean of sample distribution = 11.8

We can put together some samples using some R code:

And standard error: σM = σ / √n = 4.4 / √5

D. I am looking for a table with the following variables: x, x − μ, and (x − μ)²

Here's a little hint

The sample size n =100 and the population proportion p = 0.95

Does the sample proportion p have approximately a normal distribution? The distribution is expected to be normal if both np and nq are greater ....... (Your Turn)

Since p = .95, q = .05.

p * n = .95 * 100 = 95

q * n = .05 * 100 = 5

It is often said that there must be some benchmark to determine normality. For a reasonably good normal approximation, the classic Central Limit Theorem guideline is that np and nq should both be greater than or equal to 10. Seeing that nq is not greater than 10, the sample proportion does not have an approximately normal distribution. The takeaway is that more information, based on the required precision and context, would be needed to treat this as a reasonable normal approximation.

C.

From our textbook, Chapter 2 Probability Exercises, #2.4: Simulated coin tossing is probably better done using the function rbinom than using the function sample. Explain.

Comparing rbinom and sample, we can see from the parameters rbinom takes that it is better equipped to handle something like a simulated coin toss. As covered in the textbook and this week's lecture slides, the binomial distribution assumes a constant probability of success on each trial and independent trials, which is exactly the coin-tossing setup. rbinom returns the number of successes in a set of independent trials directly, whereas sample draws individual outcomes that must then be tallied. rbinom also scales naturally to scenarios with many repeated experiments or different success probabilities, while sample is a more general-purpose tool.
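A small illustration of the difference (illustrative only):

set.seed(42)                                                 # for reproducible output
rbinom(1, size = 10, prob = 0.5)                             # heads in 10 fair tosses, in one call
sum(sample(c("H", "T"), size = 10, replace = TRUE) == "H")   # same idea with sample(), tallied by hand
rbinom(5, size = 10, prob = 0.5)                             # five repeated experiments at once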

~ Katie

Thursday, September 21, 2023

LIS4273 - Module 5 Assignment

 For this post, I will be answering the following questions.

Question 1:

The director of manufacturing at a cookie factory needs to determine whether a new machine is producing a particular type of cookie according to the manufacturer's specifications, which indicate that cookies should have a mean breaking strength of 70 pounds and a standard deviation of 3.5 pounds. A sample of 49 cookies reveals a sample mean breaking strength of 69.1 pounds.

A. State the null and alternative hypothesis

B. Is there evidence that the machine is not meeting the manufacturer's specifications for average strength? Use a 0.05 level of significance.

C. Compute the p value and interpret its meaning.

D. What would be your answer in (B) if the standard deviation was specified at 1.75 pounds?

E. What would be your answer in (B) if the sample mean was 69 pounds and the standard deviation was 3.5 pounds?

Answer:

A. Null hypothesis: The new machine is making a particular type of cookie according to the manufacturer's specifications with the mean breaking strength of 70 pounds.

    Alternative hypothesis: The new machine is NOT making a particular type of cookie according to the manufacturer's specifications with the mean breaking strength of 70 pounds.

We can write the null and alternative hypotheses as follows:

H0: μ = 70 pounds
H1: μ ≠ 70 pounds

These hypotheses correspond to a two-tailed test.

B. Using the values that were given to us, we must compute the test statistic to determine if there is evidence that the machine is not meeting manufacturer’s specifications.

The formula for the test statistic is as follows:

z = (x̄ − μ) / (σ / √n)

Here x̄ is the sample mean (69.1), μ is the hypothesized mean (70), σ is the standard deviation (3.5), and n is the sample size of 49 cookies. The calculation works out to (69.1 − 70) / (3.5 / sqrt(49)) = -1.8.

We must now determine the critical values at the 0.05 level of significance and we can use R to calculate this now that we have the test statistic.
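A sketch of the calculation in R (the variable names are my own):

xbar <- 69.1; mu <- 70; sigma <- 3.5; n <- 49
z <- (xbar - mu) / (sigma / sqrt(n))     # test statistic, -1.8
qnorm(c(0.025, 0.975))                   # two-tailed critical values, about -1.96 and 1.96
p_value <- 2 * pnorm(-abs(z))            # two-tailed p-value, 0.07186064
p_value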

The output tells us that the p-value is 0.07186064. Because this is greater than 0.05 (equivalently, the test statistic of -1.8 falls inside the critical values of ±1.96), we fail to reject the null hypothesis: there is not sufficient evidence that the machine is failing to meet the manufacturer's specifications.

C. Compute the p value and interpret its meaning.

We can calculate the p_value using this code:

The p-value is 0.07186064, and since it is greater than the significance level (alpha) of 0.05, we fail to reject the null hypothesis.

D. What would be your answer in (B) if the standard deviation were specified as 1.75 pounds?

Given different results for the test statistic (-3.6) and the p-value (0.0003182172), we would reject the null hypothesis.

E. What would be your answer in (B) if the sample mean were specified as 69 pounds and the standard deviation is 3.5 pounds?

Given different results for the test statistic (-2) and the p-value (0.04550026), we would reject the null hypothesis.

Question 2:

If x̅ = 85, σ = standard deviation = 8, and n=64, set up 95% confidence interval estimate of the population mean μ.

Answer:

For a 95% confidence interval, the critical z-value is 1.96. The standard error is σ / √n = 8 / √64 = 1, so the margin of error is 1.96 × 1 = 1.96. With these values in mind, we can plug them into the interval x̄ ± z · σ / √n to determine the range within which the true population mean falls at this level of confidence.


At the 95% confidence interval, the population mean μ lies within the range of 83.04 to 86.96.
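The same interval can be computed in R as a quick check (a sketch, with my own variable names):

xbar <- 85; sigma <- 8; n <- 64
se <- sigma / sqrt(n)                        # standard error = 1
xbar + c(-1, 1) * qnorm(0.975) * se          # 95% CI: 83.04 to 86.96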

Question 3:

Use dataset found in week 5 folder.

The accompanying data are: x = girls, y = boys (goals, time spent on assignment)

A. Calculate the correlation coefficient for this dataset

Using the code provided in the question, we generate a plot of the dataset, but we also generate a correlation matrix to determine the coefficient values.

Correlation matrix:

When we compare girls and boys in terms of goals and time spent, we immediately see extremely high correlation. Boys' goals and girls' goals are practically 1 to 1, and time spent is nearly as high, with a value of 0.9991175.

B. Pearson correlation coefficient

To calculate the Pearson correlation coefficient, we can use the following code:
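A sketch, assuming the week 5 data has been read into a data frame with columns for girls' and boys' goals (the names below are assumptions):

cor(week5$girl_goals, week5$boy_goals, method = "pearson")   # single pair of variables
cor(week5, method = "pearson")                               # full correlation matrix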

Looking at the output, we can say the Pearson correlation coefficient is 1

C. Create plot of the correlation

Executing this code, brings up the following plot that shows us where these values are from a numeric perspective.


If we swap out panel.cor for panel.shade, we can see the high-correlation shading in action:

~ Katie


Wednesday, September 13, 2023

LIS4273 - Module 4 Assignment

 For this week's post, I will be answering the following questions.

Question 1:

Based on Table 1, what is the probability of:

        B    B1
A      10    20
A1     20    40

A1: Event A

Answer: P(A) = matching outcomes / total outcomes => 30 / 90 => 1 / 3

A2: Event B

Answer: P(B) = matching outcomes / total outcomes => 30 / 90 => 1 / 3 

A3: Event A or B

Answer: 

Using the Addition Rule: P(A OR B) = P(A) + P(B) - P(A AND B) = 1/3 + 1/3 - P(A AND B) = 2/3 - P(A AND B)

Using the Independent Events Rule: P(A AND B) = P(A) * P(B) = 1/3 * 1/3 = 1/9

P(A OR B) = 2/3 - 1/9 = 5/9

A4: P(A OR B) = P(A) + P(B)

Answer: The above statement is false. Under this context, 5/9 does not equal 2/3.


Question 2:

B. Applying Bayes' Theorem

Jane is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year. Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. 

What is the probability that it will rain on the day of Jane's wedding?

Solution: The sample space is defined by two mutually-exclusive events - it rains or it does not rain. Additionally, a third event occurs when the weatherman predicts rain. Notation for these events appears below. 

Event A1: It rains on Jane's wedding.

Event A2: It does not rain on Jane's wedding.

Event B. The weatherman predicts rain.

In terms of probabilities, we know the following:

P(A1) = 5/365 = 0.0136985 [It rains 5 days out of the year.]

P(A2) = 360/365 = 0.9863014 [It does not rain 360 days out of the year.]

P(B|A1) = 0.9 [When it rains, the weatherman predicts rain 90% of the time.]

P(B|A2) = 0.1 [When it does not rain, the weatherman predicts rain 10% of the time.]

We want to know P(A1|B), the probability it will rain on the day of Jane's wedding, given a forecast of rain by the weatherman. The answer can be determined from Bayes' theorem, as shown below.

P(A1|B) = P(A1) P(B|A1) / [ P(A1) P(B|A1) + P(A2) P(B|A2) ]

P(A1|B) = (0.014)(0.9) / [ (0.014)(0.9) + (0.986)(0.1) ]

P(A1|B) = 0.111

Note the somewhat unintuitive result: even when the weatherman predicts rain, it rains only about 11% of the time. Despite the weatherman's gloomy prediction, there is a good chance that Jane will not get rained on at her wedding.

B1. Is this answer true or false?

Answer: true

B2. Please explain why?

Tracing back to the formula used to determine the probability of rain on Jane's wedding day given the weatherman's prediction, one only needs to review the calculations above to see that rain is quite unlikely: the chance of rain is only about 11%, which leaves roughly an 89% chance of no rain on Jane's wedding day.

To break this down further, let's look at the variables used in Bayes' theorem, which can be expressed as follows:

P(A|B) = P(B|A) P(A) / P(B)
While the given variables do look a bit different than the provided example, the theorem still applies. So, let’s gather up the necessary variables.

P(A) => P(A1) => 5/365 => 0.0136985

P(B|A) => P(B|A1) => 0.9

P(B) contains a few more variables than usual so let’s break this down. The formula for P(B) as provided is:

P(B) = [P(A1)P(B|A1) + P(A2)P(B|A2)]

P(B) = [(0.0136985)(0.9) + (0.9863014)(0.1)]

P(B) = 0.1109589041

Bringing it all together:

(0.0136985)(0.9) / 0.1109589041

0.01232865 / 0.1109589041

= 0.1111100556
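The same arithmetic can be checked with a few lines of R (illustrative variable names):

p_rain <- 5 / 365                 # prior probability of rain
p_dry  <- 360 / 365
p_pred_given_rain <- 0.9          # forecaster accuracy when it rains
p_pred_given_dry  <- 0.1          # false alarm rate when it does not rain
p_pred <- p_rain * p_pred_given_rain + p_dry * p_pred_given_dry
p_rain * p_pred_given_rain / p_pred      # posterior probability of rain, about 0.111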

In conclusion, there is only about an 11% probability of rain on Jane's wedding day; while there is a small chance of rain, it is far more likely NOT to rain.

Question 3:

C. For a disease known to have a postoperative complication frequency of 20%, a surgeon suggests a new procedure. She/he tests it on 10 patients and finds that there were no complications. What is the probability of operating on 10 patients successfully with the traditional method?

A hint, use dbinom function - it is part of R functions that count Density, distribution function, quantile function, and random generation for the binomial distribution with parameters size and prob.

You will answer the question with

dbinom(XXX, size=XXX, prob=XXX)

Answer:

dbinom(0, size = 10, prob = 0.2)

The result is 0.1073742

~ Katie Burkhart

Tuesday, September 5, 2023

LIS4273 - Module 3 Assignment

 For this assignment, I will be examining the following two sets of data that each consist of 7 observations.

Set #1: 10, 2, 3, 2, 4, 2, 5

Set #2: 20, 12, 13, 12, 14, 12, 15

For these sets, I will compute the mean, median, and mode under Central Tendency as well as compute the range, interquartile, variance, and standard deviation under Variation. Lastly, I will compare the results between Set #1 and Set #2 by discussing the differences between the two sets.

Question 1:

Compute the mean, median, and mode under Central Tendency for both sets.

Set #1

Set #2

Question 2:

Compute the range, interquartile, variance, and standard deviation under Variation for both sets.

Set #1

Set #2
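The outputs referenced below come from code along these lines (a sketch; note that base R's mode() returns the storage mode, not the statistical mode):

set1 <- c(10, 2, 3, 2, 4, 2, 5)
set2 <- c(20, 12, 13, 12, 14, 12, 15)
mean(set1); mean(set2)                 # central tendency: means of 4 and 14
median(set1); median(set2)             # medians of 3 and 13
mode(set1); mode(set2)                 # both return "numeric" (storage mode)
range(set1); range(set2)               # min and max of each set
diff(range(set1)); diff(range(set2))   # both spans equal 8
IQR(set1); IQR(set2)                   # identical interquartile ranges
var(set1); var(set2)                   # identical variances
sd(set1); sd(set2)                     # identical standard deviations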

Question 3:

Compare the results between Set #1 and Set #2 by discussing the differences between the two sets.

To begin this discussion of the differences between Set #1 and Set #2, differences first appear in the mean and median. Naturally, the two sets give different results here because every value in Set #2 is exactly 10 larger than the corresponding value in Set #1, so its mean and median are each 10 higher.

Moving on to the mode, R reports both Set #1 and Set #2 as "numeric" because the built-in mode() function returns the storage mode of an object rather than the statistical mode; by inspection, the most frequent value is 2 in Set #1 and 12 in Set #2.

As for the range, Set #1 and Set #2 produce different endpoints because the two datasets contain different values: the largest value in Set #1 is 10 while the largest in Set #2 is 20. However, when we subtract the minimum from the maximum in each set, both datasets yield a range of 8.

Turning to the interquartile range and variance, the outputs of both sets are identical, and the same holds for the standard deviation. This is because Set #2 is simply Set #1 shifted up by a constant of 10, and adding a constant does not change measures of spread.

~ Katie

Monday, August 28, 2023

LIS4273 Module 2 Assignment (New)

For this assignment, I will evaluate the function myMean and the variable assignment which contains a vector of numeric values. Due to some inconsistencies between the Module 2 assignment text and Module 2 example code, I will evaluate the code described in the assignment text as well as the example code and their associated outputs.

Assignment text input:

# A vector of numeric values assigned to assignment

assignment <- c(6, 18, 14, 22, 27, 17, 19, 22, 20, 22)

myMean <- function(assignment2){

  return(sum(assignment2) / length(assignment2))

}

Output:

Assignment example code input:

# Missing the value 19 from the vector and called assignment2 rather than assignment

 assignment2 <- c(6, 18, 14, 22, 27, 17, 22, 20, 22)

myMean <- function(assignment2) {return(sum(assignment2)/length(assignment2))}

Output:

To describe what assignment2 does: it is not itself a function, but the formal argument of myMean. When we call the myMean function and pass in the variable assignment or assignment2, it returns the simple mean of that vector: the function takes a vector as input and returns the sum of its values divided by its length.

Below are the following outputs of assignment text and assignment example code if we were to call the function and insert a variable to perform a calculation. 

Assignment text input:

myMean(assignment)

Output: [1] 18.7

Assignment example code input:

myMean(assignment2)

Output: [1] 18.66667

As you can see, the outputs differ because the assignment vector holds 10 values including 19, whereas the assignment2 vector holds only 9 values and is missing the 19.

~ Katie 

Sunday, August 20, 2023

LIS 4370 R Programming - sentimentTextAnalyzer2 Final Project

For this class's major final project, I set out to make the process of analyzing textual files and URL links for sentiment insights much...