Important Background Information:
For this project, I decided to analyze a dataset containing UFO sightings in the United States, Mexico, Canada, and certain places in Europe. However, to keep it simple, I focused solely on the records from the United States.
Here is the link to the original dataset: https://www.kaggle.com/datasets/NUFORC/ufo-sightings/data
Note: when you go on to download the dataset, you will actually download two datasets: scrubbed.csv and complete.csv
For this analysis, I worked from scrubbed.csv, which had fewer entries. After some data munging, I got it down to approximately 63,561 rows and 11 variables, which serve as the base of my analysis.
Establishing the Hypothesis (Problem Description):
From this dataset, I intended to determine the following:
Is there any difference in the duration (seconds) of the sightings between the north and south regions of the United States?
Following the flowchart provided in the project instructions, I worked with two samples (North and South) and developed the null and alternative hypotheses.
Null Hypothesis (H0): The average duration of UFO sightings is the same in the North and South regions, and there is no significant difference in the variability of durations between the two regions.
Alternative Hypothesis (H1): The average duration of UFO sightings differs between the North and South regions, and there is a significant difference in the variability of durations between the two regions.
The Levene test did not reject the assumption of equal variances, so a Two-Sample t-Test was suitable for this analysis.
Also important to note: to determine which states could be classified as “North” or “South”, I created a new variable called region, which uses an if-else statement to classify a state as “North” or “South” by checking whether it is above or below my selected latitude benchmark. Given that I am from Maryland, this latitude value is 39.718457 and is based on the Mason-Dixon Line, which I have crossed on multiple occasions. Fun times!
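As a rough illustration of the region variable described above, here is a minimal R sketch. The toy data frame and its column names (state, latitude) are assumptions made for the example and may not match the actual columns in scrubbed.csv; only the cutoff value comes from the text.

```r
# Latitude cutoff taken from the text (approximate Mason-Dixon Line).
mason_dixon_lat <- 39.718457

# Toy data for illustration only; the column names `state` and `latitude`
# are assumptions, not necessarily those of the real dataset.
sightings <- data.frame(
  state    = c("md", "pa", "tx", "wa"),
  latitude = c(39.29, 40.44, 29.76, 47.61)
)

# Label each record by comparing its latitude with the cutoff.
sightings$region <- ifelse(sightings$latitude > mason_dixon_lat, "North", "South")

print(table(sightings$region))
```

Note that the comparison is made on each record's latitude, so a single state that straddles the cutoff could contribute records to both regions.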
Related Work:
My problem and chosen method of analysis trace back to Module #6, Random Variable(s) and Probability Distribution(s) and One-Sample and Two-Sample Tests. I used the table provided on the Module #6 One-sample and Two-sample tests page as a guide when forming my hypotheses.
Solution:
To solve the problem, I used various R functions to conduct my analysis. To check for equal variances, I used the Levene test function, and my primary method was a two-sample t-test.
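For readers who want to see what a Levene-type test does under the hood, here is a minimal base R sketch. It implements the median-centered variant (often called Brown-Forsythe); the helper name levene_median is hypothetical, the data are simulated rather than taken from the UFO records, and this is not necessarily the exact function used in the analysis.

```r
# Median-centred Levene test (Brown-Forsythe variant) in base R.
# `levene_median` is a hypothetical helper written for this example.
# `values` is a numeric vector, `groups` a factor or character vector.
levene_median <- function(values, groups) {
  groups <- factor(groups)
  # Absolute deviation of each observation from its own group median.
  abs_dev <- abs(values - ave(values, groups, FUN = median))
  # A one-way ANOVA on those deviations yields the Levene F statistic.
  fit <- anova(lm(abs_dev ~ groups))
  c(F = fit[["F value"]][1], p_value = fit[["Pr(>F)"]][1])
}

set.seed(42)  # simulated durations only, not the UFO records
duration <- c(rexp(200, rate = 1 / 300), rexp(200, rate = 1 / 320))
region   <- rep(c("North", "South"), each = 200)

result <- levene_median(duration, region)
print(result)
```

A p-value above the chosen significance level means the test did not find evidence of unequal variances; it does not prove the variances are equal.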
Why a two sample t-test, you might ask? Well, this test was determined to be the best choice by following the guidelines in the Final Project flowchart.
Given that I had two samples (North and South), the first question I had to ask myself was whether the data were paired. In this case, they were not: the sightings in one region have no matching counterpart in the other, and the group sizes differ (28,423 records for North and 35,020 for South). Paired data require every observation in one sample to be matched with exactly one observation in the other, which also means the two samples must be the same size.
Therefore, I followed the “No” arrow and proceeded to the “Equal Variances?” prompt. This part was a little tricky, as the Levene test gave a p-value of 0.05573. That is just barely above the 0.05 significance level, so the natural choice was “Assume Equal Variances”, but a result this close to the cutoff could arguably go the other way. Because I was concerned about a potential violation, I ran the t-test twice, once with var.equal set to TRUE and once with it set to FALSE, to see whether the results differed. They did not differ in any meaningful way. Thankfully, whether I selected “Yes” or “No” at the Equal Variances prompt, the flowchart led to the same Two-Sample t-test.
To understand why a two-sample t-test was preferred over other analysis methods, I started from the fact that I needed a test that could handle two samples, so I crossed out the one-sample t-test. Furthermore, I did not consider the one-way ANOVA test, as it is intended for comparing three or more independent samples (with only two groups it reduces to the equal-variance t-test). Also, because my data were neither paired nor otherwise related to each other, I did not run a paired t-test. Through these insights, I was able to conclude that the Two-Sample t-test was the best fit for the data given the choices in the flowchart.
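The comparison of the two t-test settings can be sketched in R as follows. The data here are simulated stand-ins (the vector names north and south and the exponential draws are assumptions for illustration), so the numbers printed are not the results of the actual analysis.

```r
set.seed(1)  # simulated durations, not the UFO records
north <- rexp(300, rate = 1 / 300)
south <- rexp(350, rate = 1 / 320)

# Pooled (Student) t-test: assumes the two groups share one variance.
pooled <- t.test(north, south, var.equal = TRUE)
# Welch t-test: does not assume equal variances.
welch  <- t.test(north, south, var.equal = FALSE)

# Put the key numbers side by side to see how much the choice matters.
comparison <- data.frame(
  test    = c("pooled", "Welch"),
  t_stat  = unname(c(pooled$statistic, welch$statistic)),
  df      = unname(c(pooled$parameter, welch$parameter)),
  p_value = c(pooled$p.value, welch$p.value)
)
print(comparison)
```

When the sample variances and group sizes are similar, the two versions tend to give nearly identical statistics and p-values, which is the kind of agreement described above.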
Findings and Conclusions:
Upon executing my first t-test with var.equal set to TRUE, I was aware that the equal variance assumption held only marginally, as the p-value (0.05573) from the Levene test was only slightly above the significance level of 0.05. With this issue in mind, I ran two t-tests, with var.equal set to TRUE and FALSE respectively, to see whether there was any major difference in output. I quickly saw that the differences were minor and that both tests pointed to the same findings.
Thus, we can infer the following.
Because the p-value is just above the significance level of 0.05, there is not enough evidence to reject the null hypothesis. In other words, there is no statistically significant difference in the average duration of UFO sightings between the north and south regions of the United States. While the sample means for duration do differ between North and South, that difference is not large enough to be statistically significant.
~ Katie