Question 1

Without altering the dataset, Excel reports 34 columns and 13689 rows.

Question 2

I filtered the data on Organism.Code to keep only the “WGK” rows. The new .csv file contains 34 columns and 5241 rows.
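
The same counts could be checked in R rather than Excel. The following is only a minimal sketch: the toy rows and the names `raw` and `wgk` are assumptions for illustration, and with the real data `raw` would come from `read.csv()` on the original file.

```r
# Toy stand-in for the full dataset (hypothetical rows, for illustration only)
raw <- data.frame(
  Organism.Name = c("Staph.haemolyticus", "Ent.cloacae", "Staph.haemolyticus"),
  Organism.Code = c("WGK", "AMX", "WGK"),
  stringsAsFactors = FALSE
)

# Question 1: number of data rows and columns
dim(raw)

# Question 2: keep only the rows whose Organism.Code is "WGK"
wgk <- raw[raw$Organism.Code == "WGK", ]
dim(wgk)
```

Note that `dim()` counts data rows only, so it can differ by one from an Excel count that includes the header row.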

Question 3

I did the filtering steps described within Excel. The unedited .csv file looks like the example below:

##     Lab.ID Isolate.Number Testing.Date Testdate      Organism.Name
## 1 1.15e+11              1    4/14/2017  #VALUE! Staph.haemolyticus
## 2 1.15e+11              1    4/14/2017                 Ent.cloacae
## 3 1.15e+11              1    4/14/2017          Staph.haemolyticus
## 4 1.15e+11              1    4/14/2017          Staph.haemolyticus
## 5 1.15e+11              1    4/14/2017                Ent.hermanii
## 6 1.15e+11              1    4/14/2017          Staph.haemolyticus
##   Organism.Code Bio.Number Percent.Probability Renocyclin PlotMIC
## 1           WGK   7.04e+13                  99       0.25      -2
## 2           AMX   1.16e+14                  99     <=0,25        
## 3           WGK   5.04e+13                  95       0.25      -2
## 4           WGK   5.04e+13                  97       0.25      -2
## 5           AMM   3.60e+13                  97     <=0,25        
## 6           WGK   5.04e+13                  95       0.25      -2

Now the filtered dataset looks like this:

##     Lab.ID Isolate.Number Testing.Date      Organism.Name Organism.Code
## 1 1.15e+11              1    4/14/2017 Staph.haemolyticus           WGK
## 2 1.15e+11              1    4/14/2017 Staph.haemolyticus           WGK
## 3 1.15e+11              1    4/14/2017 Staph.haemolyticus           WGK
## 4 1.15e+11              1    4/14/2017 Staph.haemolyticus           WGK
## 5 1.15e+11              1    4/14/2017 Staph.haemolyticus           WGK
## 6 1.15e+11              1    4/14/2017 Staph.haemolyticus           WGK
##   Bio.Number Percent.Probability Renocyclin PlotMIC
## 1   7.04e+13                  99       0.25      -2
## 2   5.04e+13                  95       0.25      -2
## 3   5.04e+13                  97       0.25      -2
## 4   5.04e+13                  95       0.25      -2
## 5   5.04e+13                  99       0.50      -1
## 6   5.04e+13                  95       1.00       0
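
The PlotMIC values printed above are consistent with the base-2 logarithm of Renocyclin (0.25 gives -2, 0.50 gives -1, 1.00 gives 0). A minimal sketch of that transformation in R, using made-up values; the vector names are hypothetical, and the treatment of non-numeric entries as NA is an assumption:

```r
# Toy MIC values (hypothetical), including a non-numeric entry like "<=0,25"
renocyclin <- c("0.25", "0.50", "1.00", "2", "<=0,25")

# Non-numeric entries become NA; the coercion warning is suppressed on purpose
mic <- suppressWarnings(as.numeric(renocyclin))

# PlotMIC as the base-2 logarithm of the MIC
plot_mic <- log2(mic)
plot_mic
```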

Question 4

I added the PlotMIC column (log2 of the MIC) as the last step before importing the data. To make the Testing.Date column usable, we convert it from text to a Date type. The lubridate package is loaded below, but the conversion itself only needs base R's as.Date().

# install.packages("lubridate")
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
# Create a data frame from "data"
df <- data.frame(data)

# Convert Testing.Date to a date type variable
df$Testing.Date <- as.Date(df$Testing.Date, format = "%m/%d/%Y")

# View the first few rows of the data
head(df)
##     Lab.ID Isolate.Number Testing.Date      Organism.Name Organism.Code
## 1 1.15e+11              1   2017-04-14 Staph.haemolyticus           WGK
## 2 1.15e+11              1   2017-04-14 Staph.haemolyticus           WGK
## 3 1.15e+11              1   2017-04-14 Staph.haemolyticus           WGK
## 4 1.15e+11              1   2017-04-14 Staph.haemolyticus           WGK
## 5 1.15e+11              1   2017-04-14 Staph.haemolyticus           WGK
## 6 1.15e+11              1   2017-04-14 Staph.haemolyticus           WGK
##   Bio.Number Percent.Probability Renocyclin PlotMIC
## 1   7.04e+13                  99       0.25      -2
## 2   5.04e+13                  95       0.25      -2
## 3   5.04e+13                  97       0.25      -2
## 4   5.04e+13                  95       0.25      -2
## 5   5.04e+13                  99       0.50      -1
## 6   5.04e+13                  95       1.00       0

Note that the testing dates are now shown as YYYY-MM-DD and stored as Date values, which makes them easier to work with.

Question 5

First, I save the data frame to a new .csv file:

# Save the dataset as a new CSV file
write.csv(df, "output_data3.csv", row.names = FALSE)

I then used Excel formulas (Sum and Count) to build the summary table shown below:

##    SummaryDate MICTotal Count
## 1       4/2017    92.50   150
## 2       5/2017   169.75   347
## 3       6/2017   201.75   336
## 4       7/2017   226.50   352
## 5       8/2017   198.00   223
## 6       9/2017   145.75   251
## 7      10/2017   135.25   238
## 8      11/2017   108.25   236
## 9      12/2017   127.75   275
## 10      1/2018   140.00   245
## 11      2/2018   127.00   249
## 12      3/2018   106.50   240
## 13      4/2018   117.75   135
## 14      5/2018   111.75   195
## 15      6/2018    86.75   188
## 16      7/2018    85.25   121
## 17      8/2018    50.25   160
## 18      9/2018   117.25   244
## 19     10/2018    40.50   141
## 20     11/2018     0.25   127
## 21     12/2018     0.75   152
## 22      1/2019     2.25   119
## 23      2/2019     0.50   166
## 24      3/2019     0.25   224
## 25      4/2019     0.00   126
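
The same Sum and Count summary could also be produced in R. This is a minimal sketch with a few made-up rows, assuming MICTotal is the monthly sum of Renocyclin and Count is the number of tests in that month; `toy` and `summary_sketch` are hypothetical names.

```r
# Toy rows (hypothetical) standing in for the filtered dataset
toy <- data.frame(
  Testing.Date = as.Date(c("2017-04-14", "2017-04-16", "2017-05-02")),
  Renocyclin = c(0.25, 0.50, 1.00)
)

# Month key in YYYY-MM form so that groups sort chronologically
toy$Month <- format(toy$Testing.Date, "%Y-%m")

# Sum and count of MIC values per month
mic_total <- tapply(toy$Renocyclin, toy$Month, sum)
mic_count <- tapply(toy$Renocyclin, toy$Month, length)

summary_sketch <- data.frame(
  SummaryDate = names(mic_total),
  MICTotal = as.vector(mic_total),
  Count = as.vector(mic_count)
)
summary_sketch
```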

Question 6

I calculate the 3-month rolling averages for this question with a function, calculate_rolling_average.
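
The definition of that function is not reproduced in this document. One sketch that is consistent with the printed results is a trailing window of up to three months, where the rolling average is the sum of MICTotal divided by the sum of Count, and the log column is log2 of that ratio. The name `rolling_average_sketch` and the `window` argument are assumptions; the four input rows are the first four rows of the summary table.

```r
# Hypothetical reconstruction: trailing 3-month weighted mean of MIC
rolling_average_sketch <- function(summary_table, window = 3) {
  n <- nrow(summary_table)
  rolling <- numeric(n)
  for (i in seq_len(n)) {
    # Use fewer months at the start, where a full window is not available
    start <- max(1, i - window + 1)
    rolling[i] <- sum(summary_table$MICTotal[start:i]) /
      sum(summary_table$Count[start:i])
  }
  data.frame(
    # "4/2017" becomes the first day of that month
    SummaryDate = as.Date(paste0("1/", as.character(summary_table$SummaryDate)),
                          format = "%d/%m/%Y"),
    RollingAverage = rolling,
    LogRollingAverage = log2(rolling)
  )
}

# First four rows of the summary table
first_rows <- data.frame(
  SummaryDate = c("4/2017", "5/2017", "6/2017", "7/2017"),
  MICTotal = c(92.50, 169.75, 201.75, 226.50),
  Count = c(150, 347, 336, 352)
)
sketch_result <- rolling_average_sketch(first_rows)
sketch_result
```

For example, the first value is 92.50 / 150 and the third is (92.50 + 169.75 + 201.75) / (150 + 347 + 336).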

Now I pass my summary table dataset to the function:

rolling_average_result <- calculate_rolling_average(tablesummary)
print(rolling_average_result)
##    SummaryDate RollingAverage LogRollingAverage
## 1   2017-04-01    0.616666667        -0.6974372
## 2   2017-05-01    0.527665996        -0.9223031
## 3   2017-06-01    0.557022809        -0.8441917
## 4   2017-07-01    0.577777778        -0.7914134
## 5   2017-08-01    0.687431394        -0.5407124
## 6   2017-09-01    0.690375303        -0.5345472
## 7   2017-10-01    0.672752809        -0.5718516
## 8   2017-11-01    0.536896552        -0.8972840
## 9   2017-12-01    0.495660881        -1.0125747
## 10  2018-01-01    0.497354497        -1.0076536
## 11  2018-02-01    0.513328999        -0.9620443
## 12  2018-03-01    0.508855586        -0.9746718
## 13  2018-04-01    0.562900641        -0.8290478
## 14  2018-05-01    0.589473684        -0.7625007
## 15  2018-06-01    0.610521236        -0.7118866
## 16  2018-07-01    0.562996032        -0.8288033
## 17  2018-08-01    0.473880597        -1.0774045
## 18  2018-09-01    0.481428571        -1.0546063
## 19  2018-10-01    0.381651376        -1.3896727
## 20  2018-11-01    0.308593750        -1.6962193
## 21  2018-12-01    0.098809524        -3.3392061
## 22  2019-01-01    0.008165829        -6.9361849
## 23  2019-02-01    0.008009153        -6.9641345
## 24  2019-03-01    0.005893910        -7.4065593
## 25  2019-04-01    0.001453488        -9.4262648

Question 7

# Set a CRAN mirror programmatically (replace 'mirror_url' with your preferred mirror URL)
mirror_url <- "https://cran.rstudio.com/"
options(repos = mirror_url)
install.packages("ggplot2", dependencies = TRUE)
## Installing package into 'C:/Users/crang/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)
## package 'ggplot2' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\crang\AppData\Local\Temp\RtmpKeYwAX\downloaded_packages

To make the scatter plot I first used ggplot:

# Load the necessary libraries
library(ggplot2)

# Assuming you have your filtered Renocyclin dataset as 'data' as before
# Convert 'Testing.Date' to Date type
data$Testing.Date <- as.Date(data$Testing.Date, format = "%m/%d/%Y")

# Create the scatterplot
scatterplot <- ggplot(data, aes(x = Testing.Date, y = PlotMIC)) +
  geom_point(color = "blue") +  # Scatterplot points in blue
  labs(
    title = "Renocyclin, All MICs",
    x = "Testing Date",
    y = "MIC"  # Label y-axis as "MIC"
  )

# Print the scatterplot
print(scatterplot)

# Load the necessary libraries
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Assuming you have your filtered Renocyclin dataset as 'data'
# Convert 'Testing.Date' to Date type
data$Testing.Date <- as.Date(data$Testing.Date, format = "%m/%d/%Y")

# Check if 'PlotMIC' is character, and if so, convert it to numeric
if (is.character(data$PlotMIC)) {
  data$PlotMIC <- as.numeric(data$PlotMIC)
}
## Warning: NAs introduced by coercion
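# Approximate the 3-month rolling average with a rolling mean over 36 consecutive
# observations (a row-based window, not a calendar-based one)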
data <- data %>%
  arrange(Testing.Date) %>%
  mutate(RollingAverage = zoo::rollmean(PlotMIC, k = 36, fill = NA))

# Create the scatterplot with points and a line plot
scatterplot <- ggplot(data, aes(x = Testing.Date)) +
  geom_point(aes(y = PlotMIC), color = "blue") +  # Scatterplot points in blue
  geom_line(aes(y = RollingAverage), color = "red") +  # 3-month rolling average line in red
  labs(
    title = "Renocyclin, All MICs",
    x = "Testing Date",
    y = "MIC"  # Label y-axis as "MIC"
  )

# Print the scatterplot
print(scatterplot)
## Warning: Removed 1079 rows containing missing values (`geom_point()`).
## Warning: Removed 1070 rows containing missing values (`geom_line()`).

I did not like this final plot, so I remade it using the base R plot() and lines() functions. This resulted in a far superior graph:

# Assuming you have your filtered Renocyclin dataset as 'data'
# Convert 'Testing.Date' to Date type
data$Testing.Date <- as.Date(data$Testing.Date, format = "%m/%d/%Y")

# Check if 'PlotMIC' is character, and if so, convert it to numeric
if (is.character(data$PlotMIC)) {
  data$PlotMIC <- as.numeric(data$PlotMIC)
}

# Approximate the 3-month rolling average with a trailing mean over 36 observations (ignoring NA values)
data$RollingAverage <- zoo::rollapply(data$PlotMIC, width = 36, FUN = mean, na.rm = TRUE, align = "right", fill = NA)

# Create a scatterplot with points
plot(data$Testing.Date, data$PlotMIC, type = "p", col = "blue", xlab = "Testing Date", ylab = "MIC", main = "Renocyclin, All MICs")

# Add a line plot for the 3-month rolling average 
lines(data$Testing.Date, data$RollingAverage, col = "red")

# Add a legend
legend("topright", legend = c("MIC", "3-Month Rolling Avg"), col = c("blue", "red"), pch = c(1, NA), lty = c(NA, 1))
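
The rollapply() call above averages over a fixed number of rows (width = 36), which corresponds to three months only if the tests are evenly spaced in time. As a hedged alternative, a calendar-based trailing window could be sketched as follows; the helper name `rolling_mean_by_date` and the 90-day window are assumptions for illustration, not part of the original analysis.

```r
# Trailing mean of `values` over the `days` days up to and including each date
rolling_mean_by_date <- function(dates, values, days = 90) {
  vapply(seq_along(dates), function(i) {
    in_window <- dates > (dates[i] - days) & dates <= dates[i]
    mean(values[in_window], na.rm = TRUE)
  }, numeric(1))
}

# Toy example (hypothetical dates and values)
toy_dates <- as.Date(c("2017-01-01", "2017-02-01", "2017-06-01"))
toy_mic <- c(1, 3, 5)
toy_roll <- rolling_mean_by_date(toy_dates, toy_mic)
toy_roll
```

The result could be drawn with lines() in the same way as RollingAverage. This version compares every date with every other date, so it is slower than rollapply() on large datasets.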

Question 8

Going back to our dataset, I filtered out all Renocyclin values less than 2 using Excel's filtering features and copied the remaining data to a new .csv file:

##     Lab.ID Isolate.Number Testing.Date      Organism.Name Organism.Code
## 1 1.15e+11              1    4/16/2017 Staph.haemolyticus           WGK
## 2 1.15e+11              1    4/16/2017 Staph.haemolyticus           WGK
## 3 1.15e+11              1    4/16/2017 Staph.haemolyticus           WGK
## 4 1.16e+11              1    4/18/2017 Staph.haemolyticus           WGK
## 5 1.16e+11              1    4/22/2017 Staph.haemolyticus           WGK
## 6 1.16e+11              1    4/22/2017 Staph.haemolyticus           WGK
##   Bio.Number Percent.Probability Renocyclin PlotMIC Month Year MonthYear
## 1   5.04e+13                  96          2       1     4 2017    4/2017
## 2   5.04e+13                  96          2       1     4 2017    4/2017
## 3   5.04e+13                  99          2       1     4 2017    4/2017
## 4   5.04e+13                  95          2       1     4 2017    4/2017
## 5   1.04e+13                  99          2       1     4 2017    4/2017
## 6   1.04e+13                  99          2       1     4 2017    4/2017
##   SummaryDate MICTotal Count
## 1      4/2017       38    12
## 2      5/2017       66    33
## 3      6/2017       98    34
## 4      7/2017      124    55
## 5      8/2017      136    46
## 6      9/2017       82    41
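
The same filter could be reproduced in R instead of Excel. A minimal sketch with made-up rows, assuming that non-numeric entries such as "<=0,25" should be dropped along with values below 2; `wgk_toy` and `q8_sketch` are hypothetical names.

```r
# Toy rows (hypothetical) standing in for the WGK dataset
wgk_toy <- data.frame(
  Testing.Date = c("4/14/2017", "4/16/2017", "4/18/2017", "4/22/2017"),
  Renocyclin = c("0.25", "2", "<=0,25", "4"),
  stringsAsFactors = FALSE
)

# Coerce MICs to numeric; entries such as "<=0,25" become NA
wgk_toy$Renocyclin <- suppressWarnings(as.numeric(wgk_toy$Renocyclin))

# Keep rows with a numeric MIC of at least 2 (NA rows are dropped)
q8_sketch <- wgk_toy[!is.na(wgk_toy$Renocyclin) & wgk_toy$Renocyclin >= 2, ]
q8_sketch
```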

Then I separated out the summary table I made within that Excel document:

##    SummaryDate MICTotal Count
## 1       4/2017       38    12
## 2       5/2017       66    33
## 3       6/2017       98    34
## 4       7/2017      124    55
## 5       8/2017      136    46
## 6       9/2017       82    41
## 7      10/2017       74    37
## 8      11/2017       38    19
## 9      12/2017       50    25
## 10      1/2018       68    33
## 11      2/2018       50    25
## 12      3/2018       38    18
## 13      4/2018       82    20
## 14      5/2018       66    33
## 15      6/2018       34    17
## 16      7/2018       56    14
## 17      8/2018       10     4
## 18      9/2018       50    18
## 19     10/2018       28     7
## 20     11/2018        0     0
## 21     12/2018        0     0
## 22      1/2019        2     1
## 23      2/2019        0     0
## 24      3/2019        0     0
## 25      4/2019        0     0

Now I pass my q8 summary table dataset to our rolling average function:

rolling_average_result <- calculate_rolling_average(q8summary)
print(rolling_average_result)
##    SummaryDate RollingAverage LogRollingAverage
## 1   2017-04-01       3.166667          1.662965
## 2   2017-05-01       2.311111          1.208587
## 3   2017-06-01       2.556962          1.354431
## 4   2017-07-01       2.360656          1.239188
## 5   2017-08-01       2.651852          1.407000
## 6   2017-09-01       2.408451          1.268105
## 7   2017-10-01       2.354839          1.235628
## 8   2017-11-01       2.000000          1.000000
## 9   2017-12-01       2.000000          1.000000
## 10  2018-01-01       2.025974          1.018616
## 11  2018-02-01       2.024096          1.017278
## 12  2018-03-01       2.052632          1.037475
## 13  2018-04-01       2.698413          1.432111
## 14  2018-05-01       2.619718          1.389412
## 15  2018-06-01       2.600000          1.378512
## 16  2018-07-01       2.437500          1.285402
## 17  2018-08-01       2.857143          1.514573
## 18  2018-09-01       3.222222          1.688056
## 19  2018-10-01       3.034483          1.601451
## 20  2018-11-01       3.120000          1.641546
## 21  2018-12-01       4.000000          2.000000
## 22  2019-01-01       2.000000          1.000000
## 23  2019-02-01       2.000000          1.000000
## 24  2019-03-01       2.000000          1.000000
## 25  2019-04-01            NaN               NaN

Now I recreate the scatterplot with our data for Question 8:

# using q8data this time
# Convert 'Testing.Date' to Date type
q8data$Testing.Date <- as.Date(q8data$Testing.Date, format = "%m/%d/%Y")

# Check if 'PlotMIC' is character, and if so, convert it to numeric
if (is.character(q8data$PlotMIC)) {
  q8data$PlotMIC <- as.numeric(q8data$PlotMIC)
}

# Approximate the 3-month rolling average with a trailing mean over 36 observations, ignoring NA values
q8data$RollingAverage <- zoo::rollapply(q8data$PlotMIC, width = 36, FUN = mean, na.rm = TRUE, align = "right", fill = NA)

# Create a scatterplot with points
plot(q8data$Testing.Date, q8data$PlotMIC, type = "p", col = "blue", xlab = "Testing Date", ylab = "MIC", main = "Renocyclin, MICs >= 2")

# Add a line plot for the 3-month rolling average 
lines(q8data$Testing.Date, q8data$RollingAverage, col = "red")

# Add a legend
legend("topright", legend = c("MIC", "3-Month Rolling Avg"), col = c("blue", "red"), pch = c(1, NA), lty = c(NA, 1))

Finally, ABGEBR would appear as an input value when no growth was observed or no test was administered.

The datasets and a version of this document can be found at: https://comp.umsl.edu/gitlab/cadc9/math4005midterm.git