Thursday, April 25, 2024

LIS 4370 R Programming - sentimentTextAnalyzer2 Final Project

For this class's major final project, I set out to make the process of analyzing text files and URL links for sentiment insights much easier. By automating many of the code statements required to perform such an analysis, all a user must do is execute a few lines of code to get the sentiment results they are seeking.

In sentimentTextAnalyzer2, users can...

  • See their files read and their HTML links parsed and ready for analysis.
  • Have their file/URL go through the necessary preprocessing steps, like:
    • Removing punctuation
    • Removing numbers
    • Making text lowercase
    • Removing common English stopwords (and, but, then, the, etc.)
  • Get fast results regarding the frequency of positive and negative words courtesy of the Bing lexicon.
  • Visualize their findings with word clouds.
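The preprocessing steps listed above can be sketched with the tm package, which the package's cleaning workflow resembles (this is a sketch of the general approach, not the package's internal code):

```r
library(tm)

# Build a corpus from raw text and apply the same cleaning steps
text <- c("The movie was GREAT, but the ending... 10/10 would not watch again!")
corpus <- VCorpus(VectorSource(text))

corpus <- tm_map(corpus, content_transformer(tolower))   # make text lowercase
corpus <- tm_map(corpus, removePunctuation)              # strip punctuation
corpus <- tm_map(corpus, removeNumbers)                  # strip numbers
corpus <- tm_map(corpus, removeWords, stopwords("en"))   # drop English stopwords

# Term-document matrix: the matrix format used for frequency analysis
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)
```
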

Installing this package in RStudio:
Please follow this link to the sentimentTextAnalyzer2 repository: sentimentTextAnalyzer2 Repository

There you will find various files relating to the documentation of the package as well as instructions for installing it in your own RStudio. Please download the file named sentimentTextAnalyzer2_0.1.0.tar.gz, place it in your preferred R directory, and install. The installation steps should look something like this.


Package Installation Option 2:
You can also run these lines of code to install the package directly from GitHub and load it into your RStudio session.
# If you haven't already installed devtools...
# install.packages("devtools")
library(devtools)
devtools::install_github("ProfessorHatsuneMiku/sentimentTextAnalyzer2")

Quick Start Guide:
Also included in the repository is a demo guide to using the functions with a working example. I highly recommend you try it out to get a feel for the functions: Quick Start Demo

~ Katie

Sunday, April 14, 2024

LIS 4317 Visual Analytics - Final Project

For this project, I want to answer the following research question:

How do the employment and unemployment rates of recent 4-year college graduates change over time in the state of Florida? Furthermore, how does their median earning income change over time?

Project Objectives:

  • Perform trend analysis on employment and unemployment rates of graduates from 4-year public universities in Florida.
  • Analyze changes in the median earning income of recent graduates from 4-year public universities in Florida over time.
  • Address data gaps resulting from inconsistent reporting by certain universities.
  • Focus the analysis specifically on public universities in Florida.

To answer this question, I will be using an extensive dataset called scorecard from the Vincent Arel-Bundock Dataset Archive. This dataset contains information relevant to my research question for the years 2007 to 2016. Additionally, the dataset is not limited to schools in Florida: it contains data on all schools in the United States, including trade schools and community colleges.

Problem Description:

Based on this information, the goal of this project is to perform trend analysis to understand how the employment and unemployment rates of graduates from 4-year public universities in Florida have changed over time. In addition, I would like to see how the median earning income of graduates has changed over time as well.

It is important to note that some of the listed schools do not provide data for all years, so there may be some gaps in the time series. Upon analyzing the data from the Florida schools, it seems that they reported these values every other year. Given the extensiveness of the data, it may not be feasible to show every school in Florida, including community colleges and trade schools. So, I will filter the data to show only public universities in the state of Florida.

Related Work:

Reflecting on concepts applicable to this problem, it relates to time series analysis, as I want to figure out whether there are notable patterns across this stretch of time. As mentioned in chapter 4 of Nathan Yau’s book, Visualize This, looking at the big picture, that is, the full recorded length of time, allows us to spot irregularities, spikes, and dips in the data. However, it may also surface outliers that add little to the overall visualization, so it is important to consider whether the full time period should be graphed or only part of it. For exploratory purposes, I intend to graph the full time period to determine if there are any spikes or dips I should be concerned with. Then, when I see a particular stretch of time that is noteworthy, I will graph that subset of the data.

Secondly, there is the concept of discrete points in time. As Yau points out, it is preferred that the recorded values come from specific points or blocks of time and that there is a finite number of possible values. Thankfully, the observations in my selected dataset are finite, from the count of college graduates working or not working after completing their degree to the median value of their earnings upon graduation.

Reviewing Yau’s time series analysis visuals, I found this one to be the most interesting for its simplicity while also being quite informative to readers regarding when a new record was made:

Making sure to touch upon the ideas of time series analysis by Stephen Few of Now You See It, the concept of trend will play a big role in understanding earnings as well as how employment and unemployment rates increase and decrease over time.

Lastly, here is a line graph that I located from the R-Graph Gallery website which served as inspiration regarding line graph design:

Solution:

Before I began my visualization, I had to preprocess my data to identify the schools in Florida. Thankfully, trade schools, community colleges, and four-year institutions can all be easily identified through the variable pred_degree_awarded_ipeds, which uses the following numbering system to indicate what type of school is listed.

1 - Trade Schools

2 - Community Colleges

3 - Four Year Institutions

Then, to prevent having too many schools in the visual, I focused solely on schools assigned a 3 and filtered my data to include only public institutions rather than private institutions.
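The filtering step described above can be sketched with dplyr. The pred_degree_awarded_ipeds variable comes from the scorecard dataset itself; the column names `state` and `control` below are assumptions for illustration, so check names(scorecard) for the actual variable names:

```r
library(dplyr)

# Keep four-year (code 3), public institutions in Florida.
# `state` and `control` are hypothetical column names for this sketch.
fl_universities <- scorecard %>%
  filter(pred_degree_awarded_ipeds == 3,   # four-year institutions only
         state == "FL",                    # Florida schools
         control == "Public")              # public, not private
```
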

After some filtering, I was able to prepare my data for visualization:

After some consideration, it became clear that the line graph would be the preferred method of visualization: given that I am dealing with time series data involving spikes and dips in the values, it is the most suitable choice.

I decided to create three line graphs using the ggplot2 package, but to give myself more options when it came to design, I also loaded the packages “hrbrthemes” and “viridis”.

When it came to plotting the lines, I knew that I had to go beyond the standard ggplot2 design defaults. Although I do enjoy the default color scheme, I needed to make sure that each of the lines was easily distinguishable from the others. After exploring the internet in search of palettes, I found the "Paired" palette to be the most aesthetically pleasing among the options I considered. As for the design of the graph itself, I made use of theme_ipsum from the hrbrthemes package to give my graphs a more mature design.
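Putting those design choices together, a line graph in this style might look like the sketch below. The data frame name `fl_data` and its columns (year, institution, median_income) are assumptions for illustration, not the exact names used in my script:

```r
library(ggplot2)
library(hrbrthemes)

# Hypothetical data frame `fl_data` with columns year, institution, median_income
ggplot(fl_data, aes(x = year, y = median_income, color = institution)) +
  geom_line(linewidth = 1) +
  scale_color_brewer(palette = "Paired") +   # easily distinguishable line colors
  theme_ipsum() +                            # hrbrthemes styling
  labs(title = "Median Earning Income of College Graduates",
       x = "Year", y = "Median income")
```
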

Below are the graphs:

Median incomes over time

Unemployment after graduation over time

Employment after graduation over time

Key takeaways:

Note: It is important to understand that the employment and unemployment values seen in visuals 2 and 3 are simply counts of how many individuals reported being employed or unemployed. In this case, individuals were more likely to report being employed after graduation than unemployed. Additionally, it seems that USF alumni might be overrepresented in the dataset.

Viewing the first line graph, Median Earning Income of College Graduates, we can immediately see that the University of Florida turns out undergraduates who earn a high income upon graduating. Not far behind are Florida International University and Florida State University. What is notable about the graph is the sudden dip in incomes during the year 2012 and the immediate spike back up afterward. From this graph, we can conclude that median incomes were initially trending downward but quickly reversed into an upward trend. More data is needed to further understand this pattern.

Moving on to the second graph, Number of Unemployed Students After Graduation, this graph shows a somewhat concerning upward trend, with University of South Florida-Main Campus leading the way. Not good! That said, it is hard to say whether unemployment here is necessarily a bad thing. For instance, it could just mean that the graduates went on immediately to pursue an advanced degree such as a master's. However, more data is needed to further understand this situation.

As for the third graph, Number of Employed Students After Graduation, USF redeems itself as it gradually increases in the number of students employed after graduation, but I am surprised to see the University of Florida in fourth place behind the University of Central Florida and Florida State University. As I mentioned previously, it could be that these numbers are lower because these students have gone on to pursue advanced degrees elsewhere.

All in all, these graphs are informative, but I still have some concerns about using counts of employment and unemployment, as counts are not the best way to show variation across institutions. For example, the researchers could have received a very small number of responses from some institutions, so the data could be skewed and fail to reflect the actual trend.

To address this issue, I have computed the percentage of both employment and unemployment to determine if these trends are really as significant as the counts make them out to be:

Unemployment Percentage:

Employment Percentage:

Looking at both these graphs, although there are clear spikes and dips throughout the years, the trends themselves tell a slightly different story. First, the Unemployment Percentage graph shows a downward trend, and the percentages themselves are very low: the highest recorded is 15 percent of students at Florida Gulf Coast University reporting being unemployed after graduation in 2007. As for the Employment Percentage graph, it follows more of a flat trend. Yes, there are highs and lows, but there really is no trend in the employment percentage. This shows that although the count of students employed has increased, it does not make much of a difference in the overall employment trend.
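The percentage conversion used for these graphs can be sketched with dplyr. The column names count_working and count_not_working are assumptions for illustration and may not match the dataset exactly:

```r
library(dplyr)

# Convert raw counts into employment/unemployment percentages per school-year.
# count_working / count_not_working are hypothetical column names for this sketch.
fl_rates <- fl_universities %>%
  mutate(total          = count_working + count_not_working,
         pct_employed   = 100 * count_working / total,
         pct_unemployed = 100 * count_not_working / total)
```
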

Although the universities appear to be going in the right direction, it is also critical to determine how a line of best fit impacts how one reads the graph:

Unemployment Percentage:

Employment Percentage:

The line of best fit deviates from the story told by the previous plots. When the line of best fit for unemployment (not working) is flat or slightly increasing, and the line of best fit for employment (working) is flat or slightly decreasing, it suggests there may be little to no change, or a slight deterioration, in employment and unemployment rates over the years for these institutions.
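A line of best fit like the one described above can be overlaid in ggplot2 with geom_smooth. The data frame `fl_rates` and column `pct_unemployed` are placeholder names for this sketch:

```r
library(ggplot2)

# Overlay a least-squares trend line on the percentage series
ggplot(fl_rates, aes(x = year, y = pct_unemployed)) +
  geom_line(color = "grey60") +
  geom_smooth(method = "lm", se = FALSE)   # linear line of best fit
```
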

Furthermore, we must also look towards outside factors like:

Stagnation in Employment and Unemployment: The flat or slightly changing trend lines indicate that there hasn't been significant improvement or worsening in employment and unemployment rates over the years. This could imply a stagnant job market or consistent labor force dynamics within these institutions.

Economic Stability or Stagnation: It may suggest overall economic stability or stagnation in the regions where these institutions operate. Stable economic conditions might lead to steady employment rates, while stagnant conditions might result in little change in both employment and unemployment rates.

Structural Factors: There could be underlying structural factors within these institutions or industries that contribute to the observed trends. For example, if these institutions operate in sectors with slow growth or high job security, it could lead to relatively stable employment and unemployment rates over time.

Overall, interpreting the implications of these trends requires considering broader economic context, institutional factors, and potential limitations of the data. Further analysis or contextual information may help provide a clearer understanding of the observed patterns.

~ Katie

Monday, April 8, 2024

LIS 4317 Visual Analytics - Module 13 Assignment

In this week of Visual Analytics, we are asked to create a simple animation using R and describe it. I decided to use a dataset that shows the steady progression of Artificial Intelligence test scores across a variety of subjects, slowly meeting the benchmark of human performance. The benchmark is represented by the horizontal gray line.

The dataset can be found here: AI Test Scores

To begin, let's look at just the static visual:

Now, the animation:

What I find interesting about this visual is that it perfectly encapsulates how AI progressed slowly in the early 2000s (handwriting recognition, speech recognition, image recognition), but it wasn't until the mid-2010s to early 2020s that AI began meeting and surpassing the human performance benchmark. Some newer AI tasks, like math problem solving, code generation, and general knowledge tests, have yet to meet the human performance benchmark, but it is likely only a matter of time before AI surpasses it there as well.
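An animation like this is commonly built with the gganimate package. The following is a minimal sketch of that general approach, not my exact code; the data frame `ai_scores` and its columns (year, domain, score) are hypothetical, with score scaled so the human benchmark sits at 0:

```r
library(ggplot2)
library(gganimate)

# Hypothetical data frame `ai_scores` with columns year, domain, and score
p <- ggplot(ai_scores, aes(x = year, y = score, color = domain)) +
  geom_line() +
  geom_hline(yintercept = 0, color = "grey50") +   # human performance benchmark
  transition_reveal(year)                          # draw the lines over time

animate(p, fps = 10, duration = 8)
```
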

Code:

~ Katie

Monday, April 1, 2024

LIS 4317 Visual Analytics - Module 12 Assignment

For this week's assignment, we were tasked with creating a social network visualization. To create mine, I used RStudio and included the following R packages:

As for the dataset selection, I must admit it was a bit tricky at first but then I located this interesting resource on GitHub which linked me to a plethora of different social network analysis datasets.

Here's the resource: awesome-network-analysis

From this list, I decided to check out this website called Moviegalaxies, which dedicates itself to documenting character interactions in popular movies. Although they do provide their own analysis, individuals are welcome to download their .json data files for further analysis. After perusing their movies, I decided to go with their Toy Story dataset and was able to visualize the following.
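A network plot of this kind can be built by reading the downloaded JSON into igraph. This is a rough sketch of that approach, not my exact code; the file name and the JSON field names (nodes, links, source, target) are assumptions for illustration and will depend on how Moviegalaxies structures its files:

```r
library(jsonlite)
library(igraph)

# Hypothetical file and field names; inspect the real JSON before adapting this
raw <- fromJSON("toy-story.json")
g <- graph_from_data_frame(raw$links[, c("source", "target")],
                           vertices = raw$nodes, directed = FALSE)

plot(g,
     vertex.size  = degree(g),     # size nodes by number of interactions
     vertex.label = V(g)$name,
     edge.color   = "grey70")
```
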

Code:

Visual:

Success or Failure:

I would say that this visual was a success. Immediately, we see that Buzz has the most interactions, based on all his links and his centrality in the visual. However, there are some data discrepancies. For example, Sid's Mom and Mom are the same person. Thinking about how I could improve this visual, I would like to place the labels above the nodes instead of directly on top of them. Additionally, it would be great to adjust the node placement a bit so that the labels are easier to read.

~ Katie

Thursday, March 28, 2024

LIS 4370 R Programming - Module 12 Assignment

In the following link, please see the main functions I have created for the sentimentTextAnalyzer package. Each of the functions is fully functional, but I have some testing to do to make sure they can handle a variety of text, URL, and file types.

Link to RMD file on GitHub: RMD File

easyRead:

The first function I created is called easyRead, and its main purpose is to do any preprocessing before the file or link is properly cleaned by easyClean, which is called within easyRead. The input is the user's selected file or link, and the output is a ready-to-use matrix, the appropriate format for analysis.

easyClean:

easyClean takes the preprocessed text from easyRead and cleans it by making the words lowercase, removing punctuation, removing numbers, and removing common English stopwords. The input is the preprocessed text and the output is a matrix.

easyFrequency:

easyFrequency takes the previously created matrix and outputs the frequency of words found within the text. By reading in positive and negative lexicons, the function determines the frequency of those types of words within the text. The inputs are the word_matrix and the positive and negative lexicons, and the output is a list of the frequency results.

easyWordCloud:

This function takes in a dataframe and returns a default wordcloud. At this time, users must create a dataframe from the easyFrequency results for this function to work properly. 
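Putting the four functions together, a typical workflow might look like the sketch below. The file name, the lexicon objects, and the data-frame conversion step are all illustrative assumptions; see the RMD file linked above for the exact signatures:

```r
library(sentimentTextAnalyzer)

# Read and preprocess a file; easyClean is called inside easyRead
word_matrix <- easyRead("review.txt")

# Tally positive/negative word frequencies using user-supplied lexicons
# (positive_lexicon and negative_lexicon are assumed to be character vectors)
results <- easyFrequency(word_matrix, positive_lexicon, negative_lexicon)

# easyWordCloud currently expects a data frame built from the results;
# the exact conversion depends on the list structure easyFrequency returns
results_df <- as.data.frame(results)
easyWordCloud(results_df)
```
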

Quick Demo:



Insights, Challenges, Improvements:

For the most part, I am satisfied with easyRead and easyClean, but easyFrequency and easyWordCloud could use some polish. At the moment, users have to input their own lexicons, which I understand is not feasible for everyone. Thus, I will have to figure out how to include a few more ready-to-use lexicons. Additionally, I think I will change the output to a dataframe rather than a list, as individuals currently have to do a bit of coding to get the results ready for visualization. As for easyWordCloud, it works, but it could be better. I would like to include some style options for the user to choose from and provide more control over the number of words shown in the wordcloud.

~ Katie

Tuesday, March 26, 2024

LIS 4317 Visual Analytics - Module 11 Assignment

After reviewing the many visualizations Dr. Piwek made on his website,

Tufte and Minard Post

I decided to replicate the following visuals:

Density Plot Code:

Visual:

Box Plot Code:

Visual:

Reflection:

Going through Dr. Piwek's post on graphing visuals inspired by Tufte and Minard was quite interesting. There were many complex visualizations included and I have found the ones with added interactivity through the use of the package highcharter to be particularly fascinating. Going forward, I will have to refer back to the post when I need a refresher on style. 

~ Katie

Wednesday, March 20, 2024

LIS 4317 Visual Analytics - Module 10 Assignment

For this week's assignment, we are asked to make improvements to any of the given data visualizations from the Yau textbook or the economics dataset visualizations.

Right off the bat, I must admit that many of these visuals already looked perfect but I attempted to make some improvements.

Starting with visual 2 of the hotdog dataset from the Yau textbook, the first thing that came to mind was that it could use an annotation pointing out the highest number of hotdogs eaten on record, along with the name of the record holder. To do that, I simply added onto the current graph and placed a point so that viewers have an easier time picking it out.
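Highlighting a record value like this can be sketched with ggplot2's annotate. The data frame `hotdogs` and its columns (Year, Dogs.eaten) are placeholder names for this sketch, not the exact names in Yau's files:

```r
library(ggplot2)

# Hypothetical data frame `hotdogs` with columns Year and Dogs.eaten
record <- hotdogs[which.max(hotdogs$Dogs.eaten), ]

ggplot(hotdogs, aes(Year, Dogs.eaten)) +
  geom_col() +
  geom_point(data = record, color = "red", size = 3) +   # mark the record year
  annotate("text", x = record$Year, y = record$Dogs.eaten + 3,
           label = "Record holder")
```
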

Moving on to the economics dataset, I was intrigued by the line chart in visual 4 that showed the unemployed population over time. Looking at the other variables in the dataset, I wanted to see if I could do a direct comparison of the unemployed population over time versus the median duration of unemployment, or uempmed. To do that, I generated two line charts of these variables and then placed one on top of the other in a single visual.
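Stacking two line charts like this can be done with, for example, gridExtra (one option among several; the economics dataset ships with ggplot2, so this sketch runs as-is):

```r
library(ggplot2)
library(gridExtra)

# Unemployed population over time
p1 <- ggplot(economics, aes(date, unemploy)) +
  geom_line() +
  labs(y = "Unemployed (thousands)")

# Median duration of unemployment over time
p2 <- ggplot(economics, aes(date, uempmed)) +
  geom_line() +
  labs(y = "Median duration of unemployment (weeks)")

# Place one chart on top of the other in a single visual
grid.arrange(p1, p2, ncol = 1)
```
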

All in all, it was interesting working with time series data and I hope to come across more datasets that specifically pertain to time series.

~ Katie
