Without altering our dataset, Excel reports 34 columns and 13689 rows.
I filtered on Organism.Code for “WGK”. The new .csv file contains 34 columns and 5241 rows.
I did the filtering steps described above within Excel. The unedited .csv file looks like the example below:
## Lab.ID Isolate.Number Testing.Date Testdate Organism.Name
## 1 1.15e+11 1 4/14/2017 #VALUE! Staph.haemolyticus
## 2 1.15e+11 1 4/14/2017 Ent.cloacae
## 3 1.15e+11 1 4/14/2017 Staph.haemolyticus
## 4 1.15e+11 1 4/14/2017 Staph.haemolyticus
## 5 1.15e+11 1 4/14/2017 Ent.hermanii
## 6 1.15e+11 1 4/14/2017 Staph.haemolyticus
## Organism.Code Bio.Number Percent.Probability Renocyclin PlotMIC
## 1 WGK 7.04e+13 99 0.25 -2
## 2 AMX 1.16e+14 99 <=0,25
## 3 WGK 5.04e+13 95 0.25 -2
## 4 WGK 5.04e+13 97 0.25 -2
## 5 AMM 3.60e+13 97 <=0,25
## 6 WGK 5.04e+13 95 0.25 -2
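The WGK filter was done in Excel, but the same step can be sketched in R. The snippet below is only an illustration: the data frame `raw` and its three rows are made up to mirror a few of the columns shown above, not read from the actual file.

```r
# Hypothetical stand-in for the unedited data (values invented for illustration)
raw <- data.frame(
  Organism.Name = c("Staph.haemolyticus", "Ent.cloacae", "Staph.haemolyticus"),
  Organism.Code = c("WGK", "AMX", "WGK"),
  Renocyclin    = c("0.25", "<=0,25", "0.25"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose organism code is exactly "WGK"
wgk_only <- raw[raw$Organism.Code == "WGK", ]

# Report rows and columns, the same two numbers Excel gave for the real file
dim(wgk_only)
```

On the real file, `dim()` should agree with the row and column counts reported by Excel if the filter matches.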
Now the filtered dataset looks like this:
## Lab.ID Isolate.Number Testing.Date Organism.Name Organism.Code
## 1 1.15e+11 1 4/14/2017 Staph.haemolyticus WGK
## 2 1.15e+11 1 4/14/2017 Staph.haemolyticus WGK
## 3 1.15e+11 1 4/14/2017 Staph.haemolyticus WGK
## 4 1.15e+11 1 4/14/2017 Staph.haemolyticus WGK
## 5 1.15e+11 1 4/14/2017 Staph.haemolyticus WGK
## 6 1.15e+11 1 4/14/2017 Staph.haemolyticus WGK
## Bio.Number Percent.Probability Renocyclin PlotMIC
## 1 7.04e+13 99 0.25 -2
## 2 5.04e+13 95 0.25 -2
## 3 5.04e+13 97 0.25 -2
## 4 5.04e+13 95 0.25 -2
## 5 5.04e+13 99 0.50 -1
## 6 5.04e+13 95 1.00 0
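The PlotMIC values in the table above match the base-2 logarithm of the Renocyclin MIC (0.25 gives -2, 0.50 gives -1, 1.00 gives 0). That column was added in Excel; a minimal R equivalent, using a made-up vector `mic`, might look like this:

```r
# Example MIC values taken from the rows shown above
mic <- c(0.25, 0.50, 1.00)

# log2 turns each doubling of the MIC into a step of 1 on the plot axis
plot_mic <- log2(mic)
print(plot_mic)
```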
I added the PlotMIC column (log2 of the Renocyclin MIC) in the last step before importing the data. To make the Testing.Date column usable, we convert it from text to R's Date type. The lubridate package is loaded here, although the conversion itself uses base R's as.Date().
# install.packages("lubridate")
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
# Create a data frame from "data"
df <- data.frame(data)
# Convert Testing.Date to a date type variable
df$Testing.Date <- as.Date(df$Testing.Date, format = "%m/%d/%Y")
# View the first few rows of the data
head(df)
## Lab.ID Isolate.Number Testing.Date Organism.Name Organism.Code
## 1 1.15e+11 1 2017-04-14 Staph.haemolyticus WGK
## 2 1.15e+11 1 2017-04-14 Staph.haemolyticus WGK
## 3 1.15e+11 1 2017-04-14 Staph.haemolyticus WGK
## 4 1.15e+11 1 2017-04-14 Staph.haemolyticus WGK
## 5 1.15e+11 1 2017-04-14 Staph.haemolyticus WGK
## 6 1.15e+11 1 2017-04-14 Staph.haemolyticus WGK
## Bio.Number Percent.Probability Renocyclin PlotMIC
## 1 7.04e+13 99 0.25 -2
## 2 5.04e+13 95 0.25 -2
## 3 5.04e+13 97 0.25 -2
## 4 5.04e+13 95 0.25 -2
## 5 5.04e+13 99 0.50 -1
## 6 5.04e+13 95 1.00 0
Note that the testing dates now look a little different (year-month-day) and will be easier to manage.
First, I save the data frame to a new .csv file:
# Save the dataset as a new CSV file
write.csv(df, "output_data3.csv", row.names = FALSE)
I then used Excel formulas (SUM and COUNT) to build the monthly summary table shown below:
## SummaryDate MICTotal Count
## 1 4/2017 92.50 150
## 2 5/2017 169.75 347
## 3 6/2017 201.75 336
## 4 7/2017 226.50 352
## 5 8/2017 198.00 223
## 6 9/2017 145.75 251
## 7 10/2017 135.25 238
## 8 11/2017 108.25 236
## 9 12/2017 127.75 275
## 10 1/2018 140.00 245
## 11 2/2018 127.00 249
## 12 3/2018 106.50 240
## 13 4/2018 117.75 135
## 14 5/2018 111.75 195
## 15 6/2018 86.75 188
## 16 7/2018 85.25 121
## 17 8/2018 50.25 160
## 18 9/2018 117.25 244
## 19 10/2018 40.50 141
## 20 11/2018 0.25 127
## 21 12/2018 0.75 152
## 22 1/2019 2.25 119
## 23 2/2019 0.50 166
## 24 3/2019 0.25 224
## 25 4/2019 0.00 126
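The summary table above was built with Excel formulas. As a rough R counterpart, the sketch below sums the MIC values and counts the rows for each calendar month. It assumes that MICTotal is the sum of the Renocyclin column and that Count is the number of rows in that month; the function name `monthly_summary` and the three-row example are hypothetical.

```r
# Hypothetical helper: one row per calendar month with the MIC sum and row count
monthly_summary <- function(df) {
  # "YYYY-MM" keys sort chronologically when sorted as text
  key <- format(df$Testing.Date, "%Y-%m")
  totals <- tapply(df$Renocyclin, key, sum)
  counts <- tapply(df$Renocyclin, key, length)
  months <- as.Date(paste0(names(totals), "-01"))
  data.frame(
    # Build labels such as "4/2017" (no leading zero), as in the Excel table
    SummaryDate = paste0(as.integer(format(months, "%m")), "/", format(months, "%Y")),
    MICTotal    = as.numeric(totals),
    Count       = as.integer(counts)
  )
}

# Made-up example rows, not taken from the real file
example <- data.frame(
  Testing.Date = as.Date(c("2017-04-14", "2017-04-20", "2017-05-02")),
  Renocyclin   = c(0.25, 0.50, 1.00)
)
example_summary <- monthly_summary(example)
print(example_summary)
```

Unlike a hand-built Excel table, this sketch produces no row for a month that has no observations.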
I calculate the 3-month rolling averages for this question with the function calculate_rolling_average().
Now I pass my summary table dataset to it:
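The definition of calculate_rolling_average() does not appear in this version of the document. The sketch below is a reconstruction, not the original code: it assumes each month's value is the sum of MICTotal over the current and up to two preceding months divided by the sum of Count over the same months, with log2 applied afterwards. That assumption reproduces the first rows of the printed results, but the original function may differ in detail.

```r
# Reconstructed sketch (assumption): trailing 3-month pooled average of MIC
calculate_rolling_average <- function(summary_table, window = 3) {
  n <- nrow(summary_table)
  rolling <- numeric(n)
  for (i in seq_len(n)) {
    # Current month plus up to (window - 1) earlier months
    idx <- max(1, i - window + 1):i
    rolling[i] <- sum(summary_table$MICTotal[idx]) / sum(summary_table$Count[idx])
  }
  data.frame(
    # "4/2017" becomes the Date 2017-04-01
    SummaryDate       = as.Date(paste0("1/", summary_table$SummaryDate), format = "%d/%m/%Y"),
    RollingAverage    = rolling,
    LogRollingAverage = log2(rolling)
  )
}

# First four rows of the summary table shown above
demo_summary <- data.frame(
  SummaryDate = c("4/2017", "5/2017", "6/2017", "7/2017"),
  MICTotal    = c(92.50, 169.75, 201.75, 226.50),
  Count       = c(150, 347, 336, 352)
)
demo_result <- calculate_rolling_average(demo_summary)
print(demo_result)
```

A window whose counts sum to zero gives 0/0, which R reports as NaN.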
rolling_average_result <- calculate_rolling_average(tablesummary)
print(rolling_average_result)
## SummaryDate RollingAverage LogRollingAverage
## 1 2017-04-01 0.616666667 -0.6974372
## 2 2017-05-01 0.527665996 -0.9223031
## 3 2017-06-01 0.557022809 -0.8441917
## 4 2017-07-01 0.577777778 -0.7914134
## 5 2017-08-01 0.687431394 -0.5407124
## 6 2017-09-01 0.690375303 -0.5345472
## 7 2017-10-01 0.672752809 -0.5718516
## 8 2017-11-01 0.536896552 -0.8972840
## 9 2017-12-01 0.495660881 -1.0125747
## 10 2018-01-01 0.497354497 -1.0076536
## 11 2018-02-01 0.513328999 -0.9620443
## 12 2018-03-01 0.508855586 -0.9746718
## 13 2018-04-01 0.562900641 -0.8290478
## 14 2018-05-01 0.589473684 -0.7625007
## 15 2018-06-01 0.610521236 -0.7118866
## 16 2018-07-01 0.562996032 -0.8288033
## 17 2018-08-01 0.473880597 -1.0774045
## 18 2018-09-01 0.481428571 -1.0546063
## 19 2018-10-01 0.381651376 -1.3896727
## 20 2018-11-01 0.308593750 -1.6962193
## 21 2018-12-01 0.098809524 -3.3392061
## 22 2019-01-01 0.008165829 -6.9361849
## 23 2019-02-01 0.008009153 -6.9641345
## 24 2019-03-01 0.005893910 -7.4065593
## 25 2019-04-01 0.001453488 -9.4262648
# Set a CRAN mirror programmatically (replace 'mirror_url' with your preferred mirror URL)
mirror_url <- "https://cran.rstudio.com/"
options(repos = mirror_url)
install.packages("ggplot2", dependencies = TRUE)
## Installing package into 'C:/Users/crang/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\crang\AppData\Local\Temp\RtmpKeYwAX\downloaded_packages
To make the scatter plot I first used ggplot:
# Load the necessary libraries
library(ggplot2)
# Assuming you have your filtered Renocyclin dataset as 'data' as before
# Convert 'Testing.Date' to Date type
data$Testing.Date <- as.Date(data$Testing.Date, format = "%m/%d/%Y")
# Create the scatterplot
scatterplot <- ggplot(data, aes(x = Testing.Date, y = PlotMIC)) +
  geom_point(color = "blue") + # Scatterplot points in blue
  labs(
    title = "Renocyclin, All MICs",
    x = "Testing Date",
    y = "MIC" # Label y-axis as "MIC"
  )
# Print the scatterplot
print(scatterplot)
# Load the necessary libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Assuming you have your filtered Renocyclin dataset as 'data'
# Convert 'Testing.Date' to Date type
data$Testing.Date <- as.Date(data$Testing.Date, format = "%m/%d/%Y")
# Check if 'PlotMIC' is character, and if so, convert it to numeric
if (is.character(data$PlotMIC)) {
  data$PlotMIC <- as.numeric(data$PlotMIC)
}
## Warning: NAs introduced by coercion
# Calculate a centered rolling mean over 36 rows (used here as the 3-month rolling average;
# note that k counts observations, not calendar days)
data <- data %>%
  arrange(Testing.Date) %>%
  mutate(RollingAverage = zoo::rollmean(PlotMIC, k = 36, fill = NA))
# Create the scatterplot with points and a line plot
scatterplot <- ggplot(data, aes(x = Testing.Date)) +
  geom_point(aes(y = PlotMIC), color = "blue") + # Scatterplot points in blue
  geom_line(aes(y = RollingAverage), color = "red") + # Rolling average line in red
  labs(
    title = "Renocyclin, All MICs",
    x = "Testing Date",
    y = "MIC" # Label y-axis as "MIC"
  )
# Print the scatterplot
print(scatterplot)
## Warning: Removed 1079 rows containing missing values (`geom_point()`).
## Warning: Removed 1070 rows containing missing values (`geom_line()`).
I did not like this final plot, so I remade it using the base R plot() and lines() functions. This resulted in a far better graph:
# Assuming you have your filtered Renocyclin dataset as 'data'
# Convert 'Testing.Date' to Date type
data$Testing.Date <- as.Date(data$Testing.Date, format = "%m/%d/%Y")
# Check if 'PlotMIC' is character, and if so, convert it to numeric
if (is.character(data$PlotMIC)) {
  data$PlotMIC <- as.numeric(data$PlotMIC)
}
# Calculate the rolling average over the trailing 36 rows, ignoring NA values
# (used here as the 3-month rolling average; width counts observations, not days)
data$RollingAverage <- zoo::rollapply(
  data$PlotMIC, width = 36, FUN = mean, na.rm = TRUE, align = "right", fill = NA
)
# Create a scatterplot with points
plot(data$Testing.Date, data$PlotMIC, type = "p", col = "blue",
     xlab = "Testing Date", ylab = "MIC", main = "Renocyclin, All MICs")
# Add a line plot for the rolling average
lines(data$Testing.Date, data$RollingAverage, col = "red")
# Add a legend (point symbol for the MICs, line for the rolling average)
legend("topright", legend = c("MIC", "3-Month Rolling Avg"), col = c("blue", "red"),
       pch = c(1, NA), lty = c(NA, 1))
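zoo::rollapply() with width = 36 averages the last 36 rows, however many days those rows span. If a window defined in calendar time is wanted instead, one option is to average every observation from the preceding 90 days. The helper below, `trailing_mean_by_days`, is a hypothetical base R sketch of that idea. It is not part of zoo or of the analysis above, and 90 days is an assumed stand-in for three months.

```r
# Hypothetical helper: mean of all values observed in the `days` days up to
# and including each observation's own date
trailing_mean_by_days <- function(dates, values, days = 90) {
  vapply(seq_along(dates), function(i) {
    in_window <- dates > dates[i] - days & dates <= dates[i]
    mean(values[in_window], na.rm = TRUE)
  }, numeric(1))
}

# Made-up example: the third date is more than 90 days after the first two
example_dates  <- as.Date(c("2017-01-01", "2017-02-01", "2017-06-01"))
example_values <- c(1, 3, 5)
example_means  <- trailing_mean_by_days(example_dates, example_values)
print(example_means)
```

This compares every date against every other date, so it is slower than rollapply() on large inputs, but it should remain workable for a few thousand rows.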
Going back to our dataset, I removed all Renocyclin values less than 2 using Excel's filtering features and copied the remaining data to a new .csv file:
## Lab.ID Isolate.Number Testing.Date Organism.Name Organism.Code
## 1 1.15e+11 1 4/16/2017 Staph.haemolyticus WGK
## 2 1.15e+11 1 4/16/2017 Staph.haemolyticus WGK
## 3 1.15e+11 1 4/16/2017 Staph.haemolyticus WGK
## 4 1.16e+11 1 4/18/2017 Staph.haemolyticus WGK
## 5 1.16e+11 1 4/22/2017 Staph.haemolyticus WGK
## 6 1.16e+11 1 4/22/2017 Staph.haemolyticus WGK
## Bio.Number Percent.Probability Renocyclin PlotMIC Month Year MonthYear
## 1 5.04e+13 96 2 1 4 2017 4/2017
## 2 5.04e+13 96 2 1 4 2017 4/2017
## 3 5.04e+13 99 2 1 4 2017 4/2017
## 4 5.04e+13 95 2 1 4 2017 4/2017
## 5 1.04e+13 99 2 1 4 2017 4/2017
## 6 1.04e+13 99 2 1 4 2017 4/2017
## SummaryDate MICTotal Count
## 1 4/2017 38 12
## 2 5/2017 66 33
## 3 6/2017 98 34
## 4 7/2017 124 55
## 5 8/2017 136 46
## 6 9/2017 82 41
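The threshold filter was applied in Excel. A minimal R sketch of the same idea follows; the data frame `wgk` and its values are invented for illustration, including one non-numeric entry to show how such values could be handled.

```r
# Hypothetical stand-in for the WGK data; Renocyclin is stored as text because
# an entry such as "<=0,25" is not a plain number
wgk <- data.frame(
  Testing.Date = c("4/14/2017", "4/16/2017", "4/18/2017"),
  Renocyclin   = c("0.25", "2", "<=0,25"),
  stringsAsFactors = FALSE
)

# Non-numeric entries become NA; the coercion warning is expected, so it is suppressed
mic <- suppressWarnings(as.numeric(wgk$Renocyclin))

# Keep only rows with a numeric MIC of at least 2
q8_rows <- wgk[!is.na(mic) & mic >= 2, ]
print(q8_rows)
```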
Then I separated out the summary table I made within that Excel document:
## SummaryDate MICTotal Count
## 1 4/2017 38 12
## 2 5/2017 66 33
## 3 6/2017 98 34
## 4 7/2017 124 55
## 5 8/2017 136 46
## 6 9/2017 82 41
## 7 10/2017 74 37
## 8 11/2017 38 19
## 9 12/2017 50 25
## 10 1/2018 68 33
## 11 2/2018 50 25
## 12 3/2018 38 18
## 13 4/2018 82 20
## 14 5/2018 66 33
## 15 6/2018 34 17
## 16 7/2018 56 14
## 17 8/2018 10 4
## 18 9/2018 50 18
## 19 10/2018 28 7
## 20 11/2018 0 0
## 21 12/2018 0 0
## 22 1/2019 2 1
## 23 2/2019 0 0
## 24 3/2019 0 0
## 25 4/2019 0 0
Now I pass my q8 summary table dataset into our rolling average function:
rolling_average_result <- calculate_rolling_average(q8summary)
print(rolling_average_result)
## SummaryDate RollingAverage LogRollingAverage
## 1 2017-04-01 3.166667 1.662965
## 2 2017-05-01 2.311111 1.208587
## 3 2017-06-01 2.556962 1.354431
## 4 2017-07-01 2.360656 1.239188
## 5 2017-08-01 2.651852 1.407000
## 6 2017-09-01 2.408451 1.268105
## 7 2017-10-01 2.354839 1.235628
## 8 2017-11-01 2.000000 1.000000
## 9 2017-12-01 2.000000 1.000000
## 10 2018-01-01 2.025974 1.018616
## 11 2018-02-01 2.024096 1.017278
## 12 2018-03-01 2.052632 1.037475
## 13 2018-04-01 2.698413 1.432111
## 14 2018-05-01 2.619718 1.389412
## 15 2018-06-01 2.600000 1.378512
## 16 2018-07-01 2.437500 1.285402
## 17 2018-08-01 2.857143 1.514573
## 18 2018-09-01 3.222222 1.688056
## 19 2018-10-01 3.034483 1.601451
## 20 2018-11-01 3.120000 1.641546
## 21 2018-12-01 4.000000 2.000000
## 22 2019-01-01 2.000000 1.000000
## 23 2019-02-01 2.000000 1.000000
## 24 2019-03-01 2.000000 1.000000
## 25 2019-04-01 NaN NaN
And now to recreate the scatterplots with our data for question 8:
# using q8data this time
# Convert 'Testing.Date' to Date type
q8data$Testing.Date <- as.Date(q8data$Testing.Date, format = "%m/%d/%Y")
# Check if 'PlotMIC' is character, and if so, convert it to numeric
if (is.character(q8data$PlotMIC)) {
  q8data$PlotMIC <- as.numeric(q8data$PlotMIC)
}
# Calculate the rolling average over the trailing 36 rows, ignoring NA values
# (used here as the 3-month rolling average; width counts observations, not days)
q8data$RollingAverage <- zoo::rollapply(
  q8data$PlotMIC, width = 36, FUN = mean, na.rm = TRUE, align = "right", fill = NA
)
# Create a scatterplot with points
plot(q8data$Testing.Date, q8data$PlotMIC, type = "p", col = "blue",
     xlab = "Testing Date", ylab = "MIC", main = "Renocyclin, All MICs")
# Add a line plot for the rolling average
lines(q8data$Testing.Date, q8data$RollingAverage, col = "red")
# Add a legend (point symbol for the MICs, line for the rolling average)
legend("topright", legend = c("MIC", "3-Month Rolling Avg"), col = c("blue", "red"),
       pch = c(1, NA), lty = c(NA, 1))
Finally, ABGEBR would appear as an input value when no growth was observed or no test was administered.
The datasets and a version of this document can be found at: https://comp.umsl.edu/gitlab/cadc9/math4005midterm.git