How to find the mean of each variable by a factor variable using dplyr while ignoring NA values in R?


If a data set contains NA values in several numerical variables, finding the mean (or any other statistic) of each variable within the levels of a grouping variable would require calling the mean function with na.rm = TRUE once per variable. The summarise_all function of the dplyr package avoids this repetition: it applies the function to every non-grouping variable at once, so the group means can be obtained in just two lines of code.
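The built-in data sets used in the examples contain no missing values, so the effect of na.rm = TRUE is easier to see on a small data frame that does. The following is a minimal sketch; the data frame scores and its columns group, x and y are hypothetical and not part of the original examples.

```r
library(dplyr)

# Hypothetical toy data: one factor and two numeric columns with missing values
scores <- data.frame(
  group = factor(c("a", "a", "a", "b", "b", "b")),
  x = c(1, NA, 3, 4, 5, NA),
  y = c(NA, 10, 20, 30, NA, 50)
)

# Without na.rm = TRUE, a group that contains an NA gets an NA mean
means_with_na <- scores %>% group_by(group) %>% summarise_all(mean)

# Extra arguments to summarise_all are passed on to mean(),
# so na.rm = TRUE drops the NA values within each group
means_no_na <- scores %>% group_by(group) %>% summarise_all(mean, na.rm = TRUE)

print(means_with_na)
print(means_no_na)
```

In this toy data every group has at least one NA in each column, so the first result holds only NA values, while the second holds the means of the remaining observations.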

Example

Loading dplyr package −

> library(dplyr)

Consider the ToothGrowth data set in base R −

> str(ToothGrowth)
'data.frame': 60 obs. of 3 variables:
$ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
$ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
> grouping_by_supp <- ToothGrowth %>% group_by(supp)
> grouping_by_supp %>% summarise_all(funs(mean(., na.rm = TRUE)))
# A tibble: 2 x 3
  supp    len  dose
  <fct> <dbl> <dbl>
1 OJ     20.7  1.17
2 VC     17.0  1.17

Consider the mtcars data set in base R, with cyl stored as a factor (levels four, six and eight), as the str output shows −

> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : Factor w/ 3 levels "four","six","eight": 2 2 1 2 3 2 3 1 1 2 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
> grouping_by_cyl <- mtcars %>% group_by(cyl)
> grouping_by_cyl %>% summarise_all(funs(mean(., na.rm = TRUE)))
# A tibble: 3 x 11
  cyl     mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 four   26.7  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
2 six    19.7  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
3 eight  15.1  353. 209.   3.23  4.00  16.8 0     0.143  3.29  3.5
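In the copy of mtcars that ships with R, cyl is a numeric column with the values 4, 6 and 8, so the factor shown in the str output must have been created beforehand. The source does not show that step. The sketch below is one assumed way to reproduce it, with the hypothetical name cars used for the modified copy.

```r
library(dplyr)

# Assumption: recode the numeric cyl column (4, 6, 8) as a labelled factor;
# the original article does not show how its factor was created
cars <- mtcars
cars$cyl <- factor(cars$cyl, levels = c(4, 6, 8),
                   labels = c("four", "six", "eight"))

# Group by the new factor and take the mean of every other column
cyl_means <- cars %>% group_by(cyl) %>% summarise_all(mean, na.rm = TRUE)
print(cyl_means)
```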

Consider the CO2 data set in base R −

> str(CO2)
Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and 'data.frame': 84 obs. of 5 variables:
$ Plant : Ord.factor w/ 12 levels "Qn1"<"Qn2"<"Qn3"<..: 1 1 1 1 1 1 1 2 2 2 ...
$ Type : Factor w/ 2 levels "Quebec","Mississippi": 1 1 1 1 1 1 1 1 1 1 ...
$ Treatment: Factor w/ 2 levels "nonchilled","chilled": 1 1 1 1 1 1 1 1 1 1 ...
$ conc : num 95 175 250 350 500 675 1000 95 175 250 ...
$ uptake : num 16 30.4 34.8 37.2 35.3 39.2 39.7 13.6 27.3 37.1 ...
- attr(*, "formula")=Class 'formula' language uptake ~ conc | Plant
.. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
- attr(*, "outer")=Class 'formula' language ~Treatment * Type
.. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
- attr(*, "labels")=List of 2
..$ x: chr "Ambient carbon dioxide concentration"
..$ y: chr "CO2 uptake rate"
- attr(*, "units")=List of 2
..$ x: chr "(uL/L)"
..$ y: chr "(umol/m^2 s)"
> grouping_by_Type <- CO2 %>% group_by(Type)
> grouping_by_Type %>% summarise_all(funs(mean(., na.rm = TRUE)))
# A tibble: 2 x 5
  Type        Plant Treatment  conc uptake
  <fct>       <dbl>     <dbl> <dbl>  <dbl>
1 Quebec         NA        NA   435   33.5
2 Mississippi    NA        NA   435   20.9

Warning messages:
1: In mean.default(Plant, na.rm = TRUE) :
  argument is not numeric or logical: returning NA
2: In mean.default(Plant, na.rm = TRUE) :
  argument is not numeric or logical: returning NA
3: In mean.default(Treatment, na.rm = TRUE) :
  argument is not numeric or logical: returning NA
4: In mean.default(Treatment, na.rm = TRUE) :
  argument is not numeric or logical: returning NA

Here, we get warning messages because the variables Plant and Treatment are factors, not numerical variables, so mean returns NA for each of them in both groups.
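One way to avoid these warnings is to summarise only the numerical columns. The sketch below uses summarise_if with is.numeric as the column predicate. It is offered as an alternative and is not part of the original example; the name numeric_means is hypothetical.

```r
library(dplyr)

# Summarise only the columns for which is.numeric() is TRUE;
# the factors Plant and Treatment are skipped, so mean() raises no warning
numeric_means <- CO2 %>%
  group_by(Type) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)

print(numeric_means)
```

The result keeps the grouping column Type together with the means of conc and uptake. In dplyr 1.0.0 and later, summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE))) expresses the same idea.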

Updated on: 12-Aug-2020
