I want to compute hour-by-hour means and standard deviations on different subsets that partition a dataset.
With a small dataset this is simple: just run code like the example I have below.
With a large dataset my method is not efficient: creating a huge number of variables named A to Z (and exhausting the alphabet, …) to store all the partitions of my data, and repeatedly typing the criteria for the subset() function, is slow.
I'm trying to find a way to automate what I'm doing using purrr or other packages.
I have looked at the purrr package, but I do not know how to use it.
I could use the same approach to calculate averages on subsets of data.
Here is another reproducible example that needs no external links.
I cannot link a and b because the months observed in treatment group 0 and in treatment group 1 are not the same. I could, however, copy and paste the data frames a and b into Excel for my needs.
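One possible way to link two summaries whose months do not line up is a full join, which keeps every month found in either table and fills the gaps with NA. This is only a sketch: the column names `month`, `m0` and `m1` and all the values are made up for illustration and are not the real contents of a and b.

```r
library(dplyr)

# Hypothetical per-group summaries; names and values are invented.
a <- tibble(month = c(1, 2, 4), m0 = c(2.1, 2.4, 2.0))  # treatment group 0
b <- tibble(month = c(1, 3, 4), m1 = c(3.0, 3.3, 2.9))  # treatment group 1

# full_join() keeps every month seen in either table;
# a month missing from one side gets NA in that side's column.
linked <- full_join(a, b, by = "month") %>%
  arrange(month)

print(linked)
```

Months 2 and 3 each appear in only one group, so each ends up with an NA in the other group's column rather than being dropped.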
Example 2: a case where I can link the partitioned data frames.
Reproducible example using the cd4 dataset:
#use the cd4 dataset
a<-subset(r,group01==0 & age<30)%>%
a1<-subset(r,group01==0 & age>=30)%>%
#c is the finished data frame I wanted to make that I'll import into Excel
In both examples, I have to produce something like this:
# A tibble: 679 x 3
    week     m     sd
 1  0     2.71  1.05
 2  3.57  2.71 NA
 3  4.14  2.71 NA
 4  4.71  1.79 NA
 5  6.57  3.22 NA
 6  6.86  2.30 NA
 7  7     3.37 NA
 8  7.29  3.76  0.560
 9  7.43  3.71  1.42
10  7.57  1.47  1.05
# … with 669 more rows
where the first partition is stacked on top of the others.
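The stacking itself can probably be done with `dplyr::bind_rows()`, which also records which partition each row came from when given a named list and an `.id` column. A minimal sketch, assuming each partition has already been summarised into a table with columns week, m and sd; the list names and the numbers below are invented:

```r
library(dplyr)

# Hypothetical summaries of two partitions (values made up).
a  <- tibble(week = c(0, 3.57), m = c(2.71, 2.71), sd = c(1.05, NA))
a1 <- tibble(week = c(0, 4.14), m = c(3.10, 2.50), sd = c(0.80, NA))

# The list names become the values of the new "partition" column.
stacked <- bind_rows(
  list(group0_under30 = a, group0_30plus = a1),
  .id = "partition"
)

print(stacked)
```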
I would like to do the same thing, maybe using purrr, without creating many variables (a, a1, b, b1) and without typing the conditions of the subset() function, such as group01 == 1, age < 30 or age >= 30, several times.
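A sketch of what this could look like: write the per-partition summary once as a function, split the data into all partitions in one step, and map the function over the pieces. The data frame below is a toy stand-in for r, and `value` is a hypothetical name for the measured column; the `group_by(week)` step stays as it is.

```r
library(dplyr)
library(purrr)

set.seed(1)
# Toy stand-in for r; `value` is a hypothetical outcome column.
r <- tibble(
  week    = rep(c(0, 4, 8), times = 8),
  group01 = rep(c(0, 1), each = 12),
  age     = rep(c(25, 35), each = 6, times = 2),
  value   = rnorm(24, mean = 3)
)

# Per-partition summary, written once; group_by(week) is unchanged.
summarise_by_week <- function(d) {
  d %>%
    group_by(week) %>%
    summarise(m = mean(value), sd = sd(value), .groups = "drop")
}

# Derive the age band once instead of retyping age < 30 / age >= 30.
r2 <- r %>% mutate(age_band = if_else(age < 30, "under30", "30plus"))

# One list element per (group01, age_band) combination, e.g. "0.under30".
parts <- split(r2, list(r2$group01, r2$age_band), drop = TRUE)

# Summarise every piece and stack the results, keeping the partition label.
result <- bind_rows(map(parts, summarise_by_week), .id = "partition")

print(result)
```

No a, a1, b, b1 variables are needed: the partitions live in the list `parts`, and adding a treatment group or an age band only changes the data, not the code.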
If I were using a larger dataset, with more variables in addition to age, with four or more treatment groups instead of two, and with further subdivisions by sex, size, marital status, province, political affiliation and political party, I would want the approach to work there too. Doing that by hand with dplyr is slow, tedious and inefficient, especially as the number of subset criteria or the dimensionality of the dataset increases.
As you can see, having an age variable makes the process much more difficult in Example 2.
I'm trying to find a more efficient way to do this, especially if the cd4 dataset contained more information. I do not know how to use Python.
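With many partitioning variables, one option might be to keep their names in a character vector and let dplyr form the partitions, while the weekly summary function stays the same. The sketch below assumes dplyr 1.0 or later (for `across()` inside `group_by()`); the toy data, the `sex` and `age_band` columns and the `value` column are all invented for illustration.

```r
library(dplyr)

# Toy data: 2 weeks x 2 replicates x 2 groups x 2 sexes x 2 age bands.
r <- expand.grid(
  week = c(0, 4), rep = 1:2, group01 = c(0, 1),
  sex = c("F", "M"), age_band = c("under30", "30plus"),
  stringsAsFactors = FALSE
)
r$value <- seq_len(nrow(r))  # hypothetical outcome column

# Same weekly summary as for a single partition.
summarise_by_week <- function(d) {
  d %>%
    group_by(week) %>%
    summarise(m = mean(value), sd = sd(value), .groups = "drop")
}

# Adding a partitioning variable means adding one name here.
part_vars <- c("group01", "sex", "age_band")

result <- r %>%
  group_by(across(all_of(part_vars))) %>%
  group_modify(function(d, key) summarise_by_week(d)) %>%
  ungroup()

print(result)
```

The number of partitions still grows multiplicatively with each added variable, but the amount of code does not.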
Similar question, but without a reproducible example:
I think the difficulty of this task is related to the curse of dimensionality.
I cannot change the group_by condition.