Thursday, April 25, 2024

LIS 4370 R Programming - sentimentTextAnalyzer2 Final Project

For this class's major final project, I set out to make the process of analyzing text files and URL links for sentiment insights much easier. By automating most of the code required to perform such an analysis, the package lets a user execute just a few lines of code to get the sentiment results they are seeking.

In sentimentTextAnalyzer2, users can...

  • See their files read and their HTML links parsed and made ready for analysis.
  • Have their file or URL go through the necessary preprocessing steps, such as:
    • Removing punctuation
    • Removing numbers
    • Converting text to lowercase
    • Removing English stopwords (and, but, then, the, etc.)
  • Get fast results on the frequency of positive and negative words, courtesy of the Bing lexicon.
  • Visualize their findings with word clouds.

Installing this package in RStudio:
Please follow this link to the sentimentTextAnalyzer2 repository: sentimentTextAnalyzer2 Repository

There you will find various files documenting the package as well as instructions for installing it in your own RStudio. Please download the file named sentimentTextAnalyzer2_0.1.0.tar.gz, place it in your preferred R directory, and install it. The installation steps should look something like this.
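If you would rather run the installation from the console, a call along these lines works once the file is in your working directory (adjust the path as needed):

# install the downloaded source package from a local file
install.packages("sentimentTextAnalyzer2_0.1.0.tar.gz",
                 repos = NULL, type = "source")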


Package Installation Option 2:
You can also run these lines of code to install the package directly from GitHub:
# If you haven't already installed devtools...
# install.packages("devtools")
library(devtools)
devtools::install_github("ProfessorHatsuneMiku/sentimentTextAnalyzer2")

Quick Start Guide:
Also included in the repository is a demo guide to using the functions with a working example. I highly recommend you try it out to get a feel for the functions: Quick Start Demo

~ Katie

Sunday, April 14, 2024

LIS 4317 Visual Analytics - Final Project

For this project, I want to answer the following research question:

How do the employment and unemployment rates of recent 4-year college graduates change over time in the state of Florida? Furthermore, how does their median earning income change over time?

Project Objectives:

  • Perform trend analysis on employment and unemployment rates of graduates from 4-year public universities in Florida.
  • Analyze changes in the median earning income of recent graduates from 4-year public universities in Florida over time.
  • Address data gaps resulting from inconsistent reporting by certain universities.
  • Focus the analysis specifically on public universities in Florida.

To answer this question, I will be using an extensive dataset called scorecard from the Vincent Arel Bundock Dataset Archive. This dataset contains information relevant to my research question for the years 2007 to 2016. In addition, the dataset is not limited to schools in Florida; it contains data from schools across the United States, including trade schools and community colleges.

Problem Description:

Based on this information, the goal of this project is to perform trend analysis to understand how the employment and unemployment rates of graduates from 4-year public universities in Florida have changed over time. In addition, I would like to see how the median earning income of graduates has changed over time as well.

It is important to note that some of the listed schools do not provide data for every year, so there may be gaps in the time series. Upon analyzing the data from the Florida schools, it appears that they reported these values every other year. Given the extent of the data, it may not be feasible to show every school in Florida, including community colleges and trade schools, so I will filter the data to show only public universities in the state of Florida.

Related Work:

Reflecting on concepts applicable to this problem, it relates to time series analysis, as I want to determine whether there are notable patterns across this stretch of time. As mentioned in chapter 4 of Nathan Yau’s book, Visualize This, looking at the big picture, that is, the full recorded length of time, allows us to analyze irregularities, spikes, and dips in the data. However, it may also surface outliers that do not add much to the overall visualization, so it is important to consider whether the full time period should be graphed or only part of it. For exploratory purposes, I intend to graph the full time period to determine whether there are any spikes or dips I should be concerned with. Then, if I see a particular stretch of time that is noteworthy, I will graph that subset of the data.

Secondly, there is the concept of discrete points in time. As Yau points out, it is preferred that the recorded values come from specific points or blocks of time and that there is a finite number of possible values. Thankfully, the observations in my selected dataset are finite, from the count of college graduates working or not working after completing their degree to the median value of their earnings upon graduation.

Reviewing Yau’s time series analysis visuals, I found this one to be the most interesting for its simplicity while also being quite informative to readers regarding when a new record was made:

Touching on Stephen Few's ideas about time series analysis in Now You See It, the concept of trend will play a big role in understanding earnings as well as how employment and unemployment rates rise and fall over time.

Lastly, here is a line graph that I located from the R-Graph Gallery website which served as inspiration regarding line graph design:

Solution:

Before I began my visualization, I had to preprocess my data to identify the schools in Florida. Thankfully, trade schools, community colleges, and four-year institutions can all be identified through the variable pred_degree_awarded_ipeds, which uses the following numbering system to indicate what type of school is listed.

1 - Trade Schools

2 - Community Colleges

3 - Four Year Institutions

Then, to prevent having too many schools in the visual, I focused solely on schools that were assigned 3 and filtered my data to include only public institutions rather than private ones.
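A sketch of that filtering step with dplyr (pred_degree_awarded_ipeds comes from the dataset itself; the state and control column names are assumptions):

library(dplyr)

# keep four-year, public institutions located in Florida
fl_public_4yr <- scorecard %>%
  filter(pred_degree_awarded_ipeds == 3,  # four-year institutions only
         state == "FL",                   # Florida schools
         control == "Public")             # public rather than private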

After some filtering, I was able to prepare my data for visualization:

After some consideration, the line graph emerged as the preferred method of visualization. Given that I am dealing with time series data that involves some spikes and dips in the values, it was clearly the most suitable choice.

I decided to create three line graphs using the ggplot2 package, but to give myself more design options, I also loaded the packages “hrbrthemes” and “viridis”.

When it came to plotting the lines, I knew I had to go beyond the standard ggplot2 design defaults. Although I do enjoy the default color scheme, I needed to make sure that each line was easily distinguishable from the others. After searching the internet for palettes, I found the "Paired" palette to be the most aesthetically pleasing of the options. As for the design of the graph itself, I used theme_ipsum from the hrbrthemes package to give my graphs a more mature look.
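A sketch of that plotting setup, continuing from the filtered data above (the year, earnings, and institution-name columns are assumptions):

library(ggplot2)
library(hrbrthemes)
library(viridis)   # loaded for extra palette options, as mentioned above

ggplot(fl_public_4yr, aes(x = year, y = earnings_med, color = inst_name)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Paired") +  # keeps each school's line distinguishable
  theme_ipsum() +                           # the hrbrthemes look
  labs(title = "Median Earning Income of College Graduates",
       x = "Year", y = "Median earnings ($)", color = "Institution")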

Below are the graphs:

Median incomes over time

Unemployment after graduation over time

Employment after graduation over time

Key takeaways:

Note: It is important to understand that the employment and unemployment values seen in visuals 2 and 3 are simply counts of how many individuals reported being employed or not. In this case, individuals were more likely to report being employed after graduation than unemployed. Additionally, it seems that USF alumni might be overrepresented in the dataset.

Viewing the first line graph, Median Earning Income of College Graduates, we can immediately see that the University of Florida has turned out undergraduates who earn a high income upon graduating. Not far behind are Florida International University and Florida State University. What is notable about the graph is the sudden dip in incomes during 2012 followed by an immediate spike back up. From this graph, we can conclude that median incomes initially trended downward and then quickly reversed into an upward trend. More data is needed to further understand this pattern.

Moving on to the second graph, Number of Unemployed Students After Graduation, this graph shows a somewhat concerning upward trend, with University of South Florida-Main Campus leading the way. Not good! That said, it is hard to say whether unemployment here is necessarily a bad thing. For instance, it could simply mean that graduates went on immediately to pursue an advanced degree such as a master's. However, more data is needed to further understand this situation.

As for the third graph, Number of Employed Students After Graduation, USF redeems itself with a gradual increase in the number of students employed after graduation, but I am surprised to see the University of Florida in fourth place behind the University of Central Florida and Florida State University. As I mentioned previously, it could be that these numbers are lower because students have gone on to pursue advanced degrees elsewhere.

All in all, these graphs are informative, but I still have some concerns about using raw counts of employment and unemployment, as counts are not the best way to show variation across institutions. For example, the researchers could have received a very small number of responses from some institutions, so the data could be skewed and fail to reflect the actual trend.

To address this issue, I calculated the percentage of both employment and unemployment to determine whether these trends are really as significant as the counts make them appear:

Unemployment Percentage:

Employment Percentage:
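Continuing from the sketch above, the percentages can be derived from the raw counts along these lines (count_working and count_not_working are assumed column names):

library(dplyr)

fl_public_4yr <- fl_public_4yr %>%
  mutate(total          = count_working + count_not_working,
         pct_employed   = 100 * count_working / total,
         pct_unemployed = 100 * count_not_working / total)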

Looking at both of these graphs, although there are clear spikes and dips throughout the years, the trends themselves tell a slightly different story. First, in the Unemployment Percentage graph, there appears to be a downward trend, and the percentages themselves are very low, with the highest recorded value being 15 percent of students at Florida Gulf Coast University reporting unemployment after graduation in 2007. As for the Employment Percentage graph, it follows more of a flat trend. Yes, there are highs and lows, but there is no clear trend in the employment percentage. This shows that although the count of employed students has increased, it does not make much difference to the overall employment trend.

Although the universities appear to be going in the right direction, it is also critical to determine how a line of best fit impacts how one reads the graph:

Unemployment Percentage:

Employment Percentage:
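The fitted lines can be layered on with geom_smooth(); a sketch, continuing with the same assumed column names:

library(ggplot2)
library(hrbrthemes)

ggplot(fl_public_4yr, aes(x = year, y = pct_unemployed, color = inst_name)) +
  geom_line(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +  # straight-line fit per school
  scale_color_brewer(palette = "Paired") +
  theme_ipsum() +
  labs(x = "Year", y = "% unemployed", color = "Institution")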

The lines of best fit deviate from the story told by the previous plots. When the line of best fit for unemployment (not working) is flat or slightly increasing, and the line of best fit for employment (working) is flat or slightly decreasing, it suggests that there has been little to no change, or a slight deterioration, in employment and unemployment rates over the years for these institutions.

Furthermore, we must also look towards outside factors like:

Stagnation in Employment and Unemployment: The flat or slightly changing trend lines indicate that there hasn't been significant improvement or worsening in employment and unemployment rates over the years. This could imply a stagnant job market or consistent labor force dynamics within these institutions.

Economic Stability or Stagnation: It may suggest overall economic stability or stagnation in the regions where these institutions operate. Stable economic conditions might lead to steady employment rates, while stagnant conditions might result in little change in both employment and unemployment rates.

Structural Factors: There could be underlying structural factors within these institutions or industries that contribute to the observed trends. For example, if these institutions operate in sectors with slow growth or high job security, it could lead to relatively stable employment and unemployment rates over time.

Overall, interpreting the implications of these trends requires considering broader economic context, institutional factors, and potential limitations of the data. Further analysis or contextual information may help provide a clearer understanding of the observed patterns.

~ Katie

Monday, April 8, 2024

LIS 4317 Visual Analytics - Module 13 Assignment

In this week of Visual Analytics, we are asked to create a simple animation using R and describe it. So, I decided to use a dataset that shows the steady progression of Artificial Intelligence test scores on a variety of subjects slowly meeting the benchmark of human performance. The benchmark is represented by that horizontal gray line.

The dataset can be found here: AI Test Scores

To begin, let's look at just the static visual:

Now, the animation:

What I find interesting about this visual is that it perfectly encapsulates how AI progressed slowly in the early 2000s (handwriting recognition, speech recognition, image recognition), but it wasn't until the mid-2010s to early 2020s that AI began meeting and surpassing the human performance benchmark. There are still some newer AI capabilities that have yet to meet the benchmark, like math problem solving, code generation, and general knowledge tests, but it is only a matter of time before AI surpasses it there as well.

Code:
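For readers who want to try something similar, a minimal gganimate sketch follows; the data frame ai_scores and its columns year, score, and capability are hypothetical stand-ins for the real dataset, and the benchmark value of 0 is an assumption:

library(ggplot2)
library(gganimate)

p <- ggplot(ai_scores, aes(x = year, y = score, color = capability)) +
  geom_line() +
  geom_hline(yintercept = 0, color = "grey60") +  # human-performance benchmark (assumed at 0)
  labs(x = "Year", y = "Test score relative to human performance") +
  transition_reveal(year)                          # reveal the lines year by year

animate(p, fps = 10, end_pause = 10)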

~ Katie

Monday, April 1, 2024

LIS 4317 Visual Analytics - Module 12 Assignment

For this week's assignment, we were tasked with creating a social network visualization. To create mine, I used RStudio and included the following R packages:

As for the dataset selection, I must admit it was a bit tricky at first but then I located this interesting resource on GitHub which linked me to a plethora of different social network analysis datasets.

Here's the resource: awesome-network-analysis

From this list, I decided to check out this website called Moviegalaxies, which dedicates itself to documenting character interactions in popular movies. Although they do provide their own analysis, individuals are welcome to download their .json data files for further analysis. After perusing their movies, I decided to go with their Toy Story dataset and was able to visualize the following.

Code:

Visual:

Success or Failure:

I would say that this visual was a success. Immediately we see that Buzz has the most interactions, based on all of his links and his centrality in the visual. However, there are some data discrepancies; for example, Sid's Mom and Mom are the same person. Thinking about how I would make this visual better, I would like to place the labels above the nodes instead of directly on top of them. Additionally, it would be great to adjust the node placement a bit so that the labels are a little easier to read.
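On the label-placement point, igraph's plot() has a vertex.label.dist argument that pushes labels off the nodes; a tiny sketch with made-up edges (not the Moviegalaxies data) illustrates the idea:

library(igraph)

# made-up edge list standing in for the character-interaction data
edges <- data.frame(
  from = c("Buzz", "Buzz", "Woody", "Woody", "Andy"),
  to   = c("Woody", "Andy", "Andy", "Rex", "Mom")
)
g <- graph_from_data_frame(edges, directed = FALSE)

plot(g,
     vertex.size       = degree(g) * 5,  # scale node size by number of links
     vertex.label.dist = 1.5)            # nudge labels away from the nodes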

~ Katie

Thursday, March 28, 2024

LIS 4370 R Programming - Module 12 Assignment

In the following link, please see the main functions I have created for the sentimentTextAnalyzer package. Each of the functions is fully functional, but I still have some testing to do to make sure they can handle a variety of text, URL, and file types.

Link to RMD file on GitHub: RMD File

easyRead:

The first function I created is called easyRead, and its main purpose is to do any preprocessing before the file or link is properly cleaned by easyClean, which is called within easyRead. The input is the user's selected file or link, and the output is a ready-to-use matrix, the appropriate format for analysis.

easyClean:

easyClean takes the preprocessed text from easyRead and cleans it by making the words lowercase, removing punctuation, removing numbers, and removing common English stopwords. The input is the preprocessed text and the output is a matrix.

easyFrequency:

easyFrequency takes the previously created matrix and outputs the frequency of words found within the text. By reading in positive and negative lexicons, the function then determines the frequency of those types of words in the text. The inputs are the word_matrix and the positive and negative lexicons, and the output is a list of the frequency results.

easyWordCloud:

This function takes in a dataframe and returns a default wordcloud. At this time, users must create a dataframe from the easyFrequency results for this function to work properly. 
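Putting these together, a hypothetical end-to-end run might look like the following; the argument names are assumptions based on the descriptions above:

# read and clean a file (easyRead calls easyClean internally)
word_matrix <- easyRead("article.txt")

# count positive and negative words using user-supplied lexicons
freqs <- easyFrequency(word_matrix, positive_lexicon, negative_lexicon)

# easyWordCloud currently expects a data frame built from the results
freq_df <- as.data.frame(freqs)
easyWordCloud(freq_df)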

Quick Demo:



Insights, Challenges, Improvements:

For the most part, I am satisfied with easyRead and easyClean, but easyFrequency and easyWordCloud could use some polish. At the moment, users have to input their own lexicons, which I understand is not feasible for everyone. Thus, I will have to figure out how to include a few more ready-to-use lexicons. Additionally, I think I will try to change the output to be a dataframe rather than a list, as individuals currently have to do a bit of coding to get the results ready for visualization. As for easyWordCloud, it works, but it could be better. I would like to include some style options for the user to choose from and provide more control over the number of words shown on the wordcloud.

~ Katie

Tuesday, March 26, 2024

LIS 4317 Visual Analytics - Module 11 Assignment

After reviewing the many visualizations Dr. Piwek made on his website,

Tufte and Minard Post

I decided to replicate the following visuals:

Density Plot Code:

Visual:

Box Plot Code:

Visual:

Reflection:

Going through Dr. Piwek's post on graphing visuals inspired by Tufte and Minard was quite interesting. There were many complex visualizations included, and I found the ones with added interactivity through the highcharter package to be particularly fascinating. Going forward, I will refer back to the post when I need a refresher on style.

~ Katie

Wednesday, March 20, 2024

LIS 4317 Visual Analytics - Module 10 Assignment

For this week's assignment, we are asked to make improvements to any of the given data visualizations from the Yau textbook or the economics dataset visualizations.

Right off the bat, I must admit that many of these visuals already looked perfect but I attempted to make some improvements.

Starting off with visual 2 of the hotdog dataset from the Yau textbook, the first thing that came to mind was how it could use an annotation to point out the highest number of hot dogs eaten on record as well as the name of the record holder. To do that, I simply added to the current graph and placed a point so that viewers can pick it out more easily.

Moving on to the economics dataset, I was intrigued by the line chart of visual 4 that showed the unemployed population over time. Looking at the other variables in the dataset, I wanted to see if I could do a direct comparison of the unemployed population over time versus the median duration of unemployment, or uempmed. To do that, I was able to generate two line charts of these variables and then place one on top of the other in a single visual.
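One way to do that stacking is with the patchwork package (gridExtra::grid.arrange is another option); a quick sketch with the economics data:

library(ggplot2)
library(patchwork)

p_unemploy <- ggplot(economics, aes(date, unemploy)) +
  geom_line() +
  labs(title = "Unemployed population over time", y = "Unemployed (thousands)")

p_duration <- ggplot(economics, aes(date, uempmed)) +
  geom_line() +
  labs(title = "Median duration of unemployment", y = "Weeks")

# place one chart on top of the other in a single visual
p_unemploy / p_duration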

All in all, it was interesting working with time series data and I hope to come across more datasets that specifically pertain to time series.

~ Katie

LIS 4370 R Programming - Module 11 Assignment

In this week's assignment, we are asked to locate a bug that was deliberately placed inside a function.

The objective:

Find the bug and fix the code and discuss your debugging procedure.

Buggy code:

Fixed code:

Debugging Procedure:

To begin the debugging process, the first step was to run the code to look for any glaring errors that pop up. Upon running the code, I got the following syntax error message:

This error appears to come from the second for loop in the function, more specifically, the placement of the return statement. Currently, it sits right after one of the closing curly braces, when it should be placed outside the loop on its own line. With this information in hand, all you need to do is move the return statement outside the loop. For better readability, it is also a good idea to put each closing curly brace on its own line.
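As a generic illustration of that kind of misplacement (not the assignment's actual function), compare the two shapes below:

# Buggy shape: the return() is glued onto the loop's closing brace, which
# the parser rejects with an "unexpected symbol" error:
#
#   f <- function(x) {
#     out <- numeric(length(x))
#     for (i in seq_along(x)) {
#       out[i] <- x[i]^2
#     } return(out)
#   }
#
# Fixed shape: each closing brace gets its own line and return() sits
# after the loop, inside the function body
f <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[i] <- x[i]^2
  }
  return(out)
}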

Check out the code here: Module 11 Code

~ Katie

Saturday, March 16, 2024

LIS 4370 R Programming - Module 10 Assignment

sentimentTextAnalyzer R Package Proposal

Introduction:

The sentimentTextAnalyzer package aims to provide a comprehensive tool for analyzing textual data to extract sentiment insights. With the increasing volume of text data generated on various platforms, understanding sentiment is crucial for businesses and researchers alike. sentimentTextAnalyzer offers a robust solution for sentiment analysis, enabling users to extract positive and negative sentiment signals from diverse sources such as URLs and flat text files.

Objectives:

  • Develop a versatile R package, sentimentTextAnalyzer, capable of analyzing text data for positive and negative sentiment.
  • Implement algorithms to parse text efficiently and extract sentiment insights.
  • Enable the package to generate word clouds highlighting the most frequent words in the text data.

Key Features:
  • Text Parsing: Implement algorithms to parse text from various sources, including URLs and flat text files.
  • Sentiment Analysis: Develop algorithms to identify positive and negative sentiment words and calculate their frequency in the text.
  • Word Cloud Customization: Enable the package to generate word clouds depicting the most common words in the text data.
  • Customization: Allow users to customize sentiment analysis parameters and word cloud generation options.

Methodology:
  • Text Parsing: Utilize natural language processing (NLP) techniques to preprocess and tokenize text data. 
  • Sentiment Analysis: Implement sentiment lexicons and algorithms to identify positive and negative sentiment words.
  • Word Cloud Generation: Utilize packages such as wordcloud2 to generate visually appealing word clouds based on word frequency. 
  • Package Development: Utilize R programming language and relevant packages (e.g., tidyverse, text mining) to develop the sentimentTextAnalyzer package. 
  • Testing and Validation: Conduct thorough testing and validation to ensure the accuracy and reliability of sentiment analysis results.

Following this link will take you to the package's description file which provides a few details regarding licensing, potential dependencies, and the current version of the package: DESCRIPTION

~ Katie

Tuesday, March 5, 2024

LIS 4370 R Programming - Module 9 Assignment

In this week in R Programming, we are asked to select a dataset from the Vincent Arel Bundock dataset list and create visualizations from that data.

Link to the list: Vincent Arel Bundock Datasets

I decided to work from the pizzaplace.csv dataset which contains sales, pizza type, and size data over an entire year. 

In the instructions, it is mentioned that there are three ways to make visualizations in R: Base R, Lattice package, and ggplot2 package. Thus, I will generate visualizations from these listed methods.

In base R, let's make a "pie" chart that shows the number of occurrences of each of the 4 types of pizza sold:

As for the second visual, let's use the lattice package to explore the relationship between prices and pizza size:

Moving on to the third visual using the ggplot2 package, the data was separated by facet to make it easier to compare trends between the four types of pizza and their associated sales over time:
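For reference, the three approaches can be sketched roughly like this (the pizzaplace column names type, size, price, and date are assumptions about the version of the file I have in mind):

library(lattice)
library(ggplot2)

pizza <- read.csv("pizzaplace.csv")

# 1. Base R: pie chart of the four pizza types sold
pie(table(pizza$type), main = "Pizzas Sold by Type")

# 2. Lattice: price by pizza size (a box-and-whisker plot is one option)
bwplot(price ~ factor(size), data = pizza, main = "Price by Pizza Size")

# 3. ggplot2: daily sales over time, faceted by pizza type
daily <- aggregate(price ~ date + type, data = pizza, FUN = sum)
ggplot(daily, aes(x = as.Date(date), y = price)) +
  geom_line() +
  facet_wrap(~ type) +
  labs(x = "Date", y = "Sales ($)")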

After creating the visuals using the three different methods, I must admit that it is interesting to see how each method has its pros and cons. For example, I do like using the base R method, but things can get complicated fast when you have to call each individual function. To make the first visual better, I should include percentages for each of the four types of pizza sold. Moving on to the second visual, I do not have much experience using the lattice package, but I think the visual came out well in telling a story with the data. For instance, it still weirds me out that someone bought a super expensive small pizza that surpassed the price of a large pizza. Lastly, the ggplot2 visual really puts into perspective which pizza type is the most expensive in terms of sales, with classic going above 30.

Check out the code here: Module 9 Code

~ Katie 

LIS 4317 Visual Analytics - Module 9 Assignment

In this week of Visual Analytics, we are asked to create a multi-variate visualization graph with a dataset of our choice.

In this case, I decided to work with a dataset called nyc_squirrels.csv and basically it contains very detailed observations from squirrel watching in New York City's Central Park. From what they were doing, what sound they made, to even the exact geo coordinates of where the squirrel watching event occurred, it is all noted down in the data. 

Link to where I found the data: NYC Squirrels Data

From this data, I decided that I wanted to better understand the spatial distribution of squirrel sightings and see if there is any difference in sightings that occurred in the AM or PM.

To begin my analysis, I made a point to clean my data by removing entries with NA values and deleting variables that were not conducive to the analysis.

With the data ready, I used the ggplot2 package in R to graph the points:
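A sketch along those lines (the column names long, lat, and shift are assumptions about this version of the data):

library(ggplot2)

squirrels <- read.csv("nyc_squirrels.csv")

# warm color for AM sightings, cool color for PM sightings
ggplot(squirrels, aes(x = long, y = lat, color = shift)) +
  geom_point(alpha = 0.5, size = 1) +
  scale_color_manual(values = c("AM" = "darkorange", "PM" = "navy")) +
  labs(title = "Squirrel Sightings in Central Park",
       x = "Longitude", y = "Latitude", color = "Shift")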

Here is the visual: 

As you can see, when the geo coordinate points are plotted, they actually trace a rough outline of Central Park. The big empty gap you see represents the Jacqueline Kennedy Onassis Reservoir, so it makes sense that no squirrels were spotted there. For the most part, I do not see any particular difference in where sightings occurred in the AM versus the PM, but there do appear to be more sightings in the PM than in the AM.

For fun, let's see what this plot looks like in Tableau with a map underneath the points:

See the map up close here: NYC Squirrel Sightings

Wrapping up, visualizing multiple variables can be very helpful for understanding the subtle relationships between them. It is definitely interesting to be able to compare AM sightings to PM sightings and where they occurred in Central Park, and doing so allows one to better understand the dataset.

As for applying the 5 principles of design, alignment is used for the axis labels, legend, and title for better readability. With repetition, shape style, color, font size, and type are kept consistent. To highlight the difference between day and night, I opted to use cool colors most often associated with night for PM and warm colors for AM, which checks off the contrast requirement. Moving on to proximity, visual elements like the legend are clearly placed together to promote connection. Lastly, with balance, I must admit that the Tableau visual is not as balanced as the previous ggplot visual. It has a very small legend, which gives it uneven visual weight. To prevent this, I should think about adding more data elements to make the visual more balanced.

~ Katie

Thursday, February 29, 2024

LIS 4317 Visual Analytics - Module 8 Assignment

In this week of Visual Analytics, we are asked to generate a visual from the mtcars dataset based on either correlation or regression analysis.

In this case, I will be conducting correlation analysis on the variables number of cylinders (cyl) and horsepower (hp). That is, I want to know whether these variables have a positive or a negative correlation.

Let's begin by plotting the relationship via ggplot2 in a scatterplot. I will also draw in a line of best fit using the lm argument.

Here's my code:

The resulting output:

Interestingly, there appears to be a rather strong positive correlation between the number of cylinders and horsepower. Perhaps this suggests that cars with a greater number of cylinders have higher horsepower. Let's take a look at the correlation coefficient.

With a value of .832, that's a pretty high correlation value!

Lastly, let's finish up by creating a linear regression model of these variables:

Looking at the significance codes, cyl is highly significant at the (***) level of 0.001. However, the model captures only about 68% of the variability, so it represents a moderately strong relationship.
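For reference, the steps above can be reproduced with a few lines like these (a sketch, not necessarily the exact code used):

library(ggplot2)

# scatterplot of cylinders vs. horsepower with a fitted line
ggplot(mtcars, aes(x = cyl, y = hp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# correlation coefficient (about 0.83)
cor(mtcars$cyl, mtcars$hp)

# simple linear regression; summary() reports an adjusted R-squared of about 0.68
summary(lm(hp ~ cyl, data = mtcars))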

Reflecting on Few's design remarks from the textbook, my visual does follow a few of his best practices. To begin, I feature some grid lines for greater readability, so one can easily see where the points fall in terms of cyl and hp. I can also see how grid lines become even more helpful when multiple scatter plots are used in an analysis. In addition, my visual has a line of best fit so that we can easily see the correlation's linear shape and positive slope.

I appreciate Few's recommendations when it comes to visualizing correlations between variables. Although my visual does not include all of his suggestions, I know they will be helpful for other datasets.

~ Katie


LIS4370 R Programming - Module 8 Assignment

For this week's assignment in R Programming, we are to do the following:

Step 1: Import txt file in R. This file is called Assignment 6 Dataset.txt

Step 2: Use ddply from the plyr package to generate the mean of the Age and Grade variables split by the Sex variable; this will be saved to a new variable, Grade.Average

Step 3: Create a new txt file containing the new variable Grade.Average. This new txt file will be called Sorted.Average

Step 4: Separate the values in Sorted.Average by commas using the sep argument, and save the result back to Sorted.Average

Step 5: Using the original txt file, filter the names in the list that contain the letter (i). Then, save the result to a new file called DataSubset with the values separated by commas.
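Put together, these steps can be sketched roughly as follows (the separator of the original txt file and the exact column names, such as Name, are assumptions):

library(plyr)

# Step 1: import the dataset (adjust sep if the file is not whitespace-delimited)
dat <- read.table("Assignment 6 Dataset.txt", header = TRUE)

# Step 2: mean Age and Grade split by Sex
Grade.Average <- ddply(dat, "Sex", summarise,
                       Age = mean(Age), Grade = mean(Grade))

# Steps 3-4: write the summary out with values separated by commas
write.table(Grade.Average, "Sorted.Average.txt", sep = ",", row.names = FALSE)

# Step 5: keep rows whose Name contains the letter "i", also comma-separated
DataSubset <- dat[grepl("i", dat$Name), ]
write.table(DataSubset, "DataSubset.txt", sep = ",", row.names = FALSE)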

Let's take a look at our new files:

Sorted.Average:

DataSubset:

As you can see from the resulting output, the values are separated by commas and the character values are surrounded by double quotes.

See the code on GitHub: Module 8 Code

~ Katie

Thursday, February 22, 2024

LIS 4317 Visual Analytics - Module 7 Assignment

For this week's assignment, we are tasked with creating visual analytics based on distribution analysis. I will be working with the mtcars dataset to understand the distribution of horsepower (hp).

A quick note: I generated a few visuals of the horsepower (hp) distribution:

Scatter plot:

Boxplot:

Line Graph:

Histogram:
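As one example, the histogram can be generated with a few lines like these (a sketch, not the exact code behind the visuals above):

library(ggplot2)

ggplot(mtcars, aes(x = hp)) +
  geom_histogram(binwidth = 25) +
  geom_vline(xintercept = mean(mtcars$hp), linetype = "dashed") +  # mark the mean
  labs(title = "Distribution of Horsepower (hp)", x = "hp", y = "Count")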

Reflecting on Few's recommendations in testing and best practices when it comes to conducting distribution analysis, each of my graphs have strengths and weaknesses. To begin, Few notes that there are three main characteristics when it comes to describing distributions. These are...

Spread: A simple measure of dispersion, or how spread out the values are; essentially the full range of values from lowest to highest.

Center: An estimate of the middle of a set of values, often represented by either the mean or the median.

Shape: Where the values are located throughout the spread. 

For the most part, my visuals do a good job of showing spread, except perhaps the boxplot, as it simplifies the values shown on the y-axis tick marks; the full spread is still visible, albeit slightly downplayed. As for center, visuals 1, 3, and 4 provide horsepower's mean and median and show where they fall on the chart. The second visual, the boxplot, only provides the median. Moving on to shape, visuals 1, 3, and 4 appear slightly skewed to the right. In the histogram, one can also see a brief gap near the 300 tick mark and an outlier where hp equals 325.

As for whether these visuals correspond to Few's distribution analysis best practices, I believe my visuals do a fairly good job when it comes to interval consistency but fall short when it comes to outlier resistance. As one can tell from the visuals, there is a clear outlier where hp equals 325. The mean calculation can be heavily affected by outliers and, as a result, can be shifted in the direction of that outlier, which we can clearly see happening here. Therefore, it might be a good idea to remove that outlier from the dataset before conducting visual analysis.

All in all, Few's recommendations are incredibly helpful when it comes to deciphering data when it is visualized.

~ Katie

Wednesday, February 21, 2024

LIS 4370 R Programming - Module 7 Assignment

For this week's assignment, I will start out by examining the iris dataset and then transition to my own dataset when it comes to creating two examples of S3 and S4.

Question 1: Determine if a generic function can be applied to your dataset

To begin, I used the following functions on the iris dataset and came up with the following output:

Based on this output, I can confirm that a generic function can be applied to my chosen dataset.

Question 2: How do you tell what OO system (S3 vs. S4) an object is associated with?

In the pryr library, the function otype() can be used to determine which OO system an object is associated with.

In this case, the iris dataset is associated with S3.
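For example, running otype() on the iris data frame confirms this:

library(pryr)
otype(iris)
# [1] "S3"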

Question 3: How do you determine what the base type of an object is?

Using the typeof() function can help in determining an object's base type. Continuing with the iris dataset, we can check the object type of each of the variables within the dataset:
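For example, checking the data frame itself and a couple of its columns:

typeof(iris)               # "list" - the base type of a data frame
typeof(iris$Sepal.Length)  # "double"
typeof(iris$Species)       # "integer" - factors are stored as integers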

Question 4: What is a generic function?

A generic function can be defined as a function that performs a common task like printing (print()) or plotting (plot()). Furthermore, generic functions can be thought of as extended function objects because they contain the information used to create and dispatch methods for the function.

Question 5: What are the main differences between S3 and S4?

To put it simply, S3 is considered more convenient while S4 is safer. Additionally, S3 classes are very straightforward to implement, since dispatch is based only on the first argument, but they can let mistakes such as misspelled or missing values slip through without alerting the programmer. On the other hand, S4 classes and methods are far more formal and more closely tied to object-oriented concepts; unlike S3, S4 will complain about such misspellings and other issues, alerting the programmer that the code needs to be fixed.
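To make the contrast concrete, here is a minimal side-by-side illustration (a toy example of my own, not the graded code linked below):

# S3: informal - a list with a class attribute; nothing is validated
dog_s3 <- list(name = "Rex", age = 4)
class(dog_s3) <- "dog"
print.dog <- function(x, ...) cat(x$name, "is", x$age, "years old\n")
print(dog_s3)

# S4: formal - slots are declared and type-checked when an instance is created
setClass("Dog", slots = c(name = "character", age = "numeric"))
setMethod("show", "Dog", function(object) {
  cat(object@name, "is", object@age, "years old\n")
})
dog_s4 <- new("Dog", name = "Rex", age = 4)
dog_s4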

Question 6: Create two examples of S3 and S4. (Code will be linked to GitHub)

S3 Code:

Output:

S4 Code:

Output:

After conducting this brief code experiment, I must admit that I greatly prefer the form of S4 over S3 just for its ease of creating instances of the class. 

Link to GitHub Code: Module 7 Code

~ Katie
