Thursday, February 29, 2024

LIS 4317 Visual Analytics - Module 8 Assignment

This week in Visual Analytics, we are asked to generate a visual from the mtcars dataset that is based on either correlation or regression analysis.

In this case, I will be conducting correlation analysis on the variables number of cylinders (cyl) and horsepower (hp). That is, I want to know whether these variables are positively or negatively correlated.

Let's begin by plotting the relationship as a scatterplot with ggplot2. I will also draw in a line of best fit using method = "lm".

Here's my code:
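A minimal sketch of such a plot, assuming ggplot2 is installed, that the line of best fit comes from geom_smooth(method = "lm"), and that hp goes on the y-axis (the axis labels here are illustrative):

```r
library(ggplot2)

# Scatterplot of horsepower against number of cylinders (built-in mtcars data)
p <- ggplot(mtcars, aes(x = cyl, y = hp)) +
  geom_point() +
  # Line of best fit from a simple linear model, with its confidence band
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Number of cylinders (cyl)", y = "Horsepower (hp)")

print(p)
```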

The resulting output:

Interestingly, there appears to be a rather strong positive correlation between the number of cylinders and horsepower. This suggests that cars with a greater number of cylinders tend to have higher horsepower. Let's take a look at the correlation coefficient.

With a value of 0.832, that's a pretty high correlation!
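As a quick check, the coefficient can be computed with base R's cor(), which uses Pearson's method by default. The variable name r is just for illustration:

```r
# Pearson correlation between number of cylinders and horsepower
r <- cor(mtcars$cyl, mtcars$hp)

round(r, 3)  # 0.832
```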

Lastly, let's finish up by creating a linear regression model of these variables:

Looking at the significance codes, cyl is highly significant: it is marked with three asterisks (***), meaning its p-value is below 0.001. However, the model explains only about 68% of the variability, so the linear relationship is moderately strong.
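A minimal sketch of such a model, assuming horsepower is the response and number of cylinders is the predictor (the names model and fit are illustrative):

```r
# Simple linear regression: horsepower as a function of number of cylinders
model <- lm(hp ~ cyl, data = mtcars)

# The summary prints coefficients, significance codes, and R-squared values
fit <- summary(model)
print(fit)

# Pull out the pieces discussed above
p_value_cyl <- fit$coefficients["cyl", "Pr(>|t|)"]  # p-value for cyl
adj_r2      <- fit$adj.r.squared                    # adjusted R-squared
```

With a single predictor, the multiple R-squared equals the squared correlation coefficient, so the regression and the correlation analysis tell the same story.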

Reflecting on Few's design remarks from the textbook, my visual does follow a few of his best practices. To begin, I feature some grid lines to make the graph more readable, so one can easily determine where the points fall in terms of cyl and hp. I can also see how grid lines would be helpful if multiple scatter plots were used in the analysis. In addition, my visual has a line of best fit so that we can easily see the correlation's linear shape and positive slope.

I appreciate Few's recommendations when it comes to visualizing correlations between variables. Although my visual does not include all of his suggestions, I know that they will be helpful for other datasets.

~ Katie


LIS 4370 R Programming - Module 8 Assignment

For this week's assignment in R Programming, we are to do the following:

Step 1: Import the txt file into R. This file is called Assignment 6 Dataset.txt.

Step 2: Use ddply from the plyr package to generate the mean of the Age and Grade variables, split by the Sex variable. The result is saved to a new variable, Grade.Average.

Step 3: Create a new txt file containing the new variable Grade.Average. This new txt file will be called Sorted.Average.

Step 4: Separate the values in Sorted.Average by commas using the sep argument, and save the result back to Sorted.Average.

Step 5: Using the original txt file, filter for the names in the list that contain the letter "i". Then, save the result to a new file called DataSubset, with the values separated by commas.
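The steps above can be sketched roughly as follows. This is a sketch under stated assumptions, not the assignment's actual code: the small data frame is a made-up stand-in for Assignment 6 Dataset.txt, the Name column and the output column names are assumed, and the files are written to a temporary directory.

```r
library(plyr)

# Step 1 (stand-in): hypothetical rows in place of read.table() on the real file
students <- data.frame(
  Name  = c("Raul", "Booker", "Lauri", "Leonie"),
  Age   = c(25, 18, 21, 21),
  Sex   = c("Male", "Male", "Female", "Female"),
  Grade = c(80, 83, 90, 91)
)

# Step 2: mean of Age and Grade, split by Sex
Grade.Average <- ddply(students, "Sex", summarise,
                       Age.Mean   = mean(Age),
                       Grade.Mean = mean(Grade))

# Steps 3 and 4: write the averages to a file, separated by commas
sorted_path <- file.path(tempdir(), "Sorted.Average.txt")
write.table(Grade.Average, file = sorted_path, sep = ",")

# Step 5: keep rows whose name contains the letter "i" (either case)
DataSubset <- subset(students, grepl("i", Name, ignore.case = TRUE))
subset_path <- file.path(tempdir(), "DataSubset.txt")
write.table(DataSubset, file = subset_path, sep = ",")
```

By default, write.table() wraps character values and column names in double quotes, so the quoting in the output files needs no extra arguments.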

Let's take a look at our new files:

Sorted.Average:

DataSubset:

As you can see from the resulting output, the values are separated by commas, and the variable names stand out because they are surrounded by double quotes.

See the code on GitHub: Module 8 Code

~ Katie

LIS 4370 R Programming - sentimentTextAnalyzer2 Final Project

For this class's major final project, I set out to make the process of analyzing textual files and URL links for sentiment insights much...