Summarize_ functions rapidly survey the depth and breadth of a dataframe. The general use case for this function family is to analyze data for variability from multiple angles in order to bucket values into broader categories.
summarize_variables(mtcars,
                    incl_num_calc = FALSE)
#> # A tibble: 11 x 7
#>    Variable COUNT DISTINCT_COUNT NA_COUNT NA_STR_COUNT BLANK_COUNT
#>    <chr>    <int>          <int>    <int>        <int>       <int>
#>  1 am          32              2        0            0           0
#>  2 carb        32              6        0            0           0
#>  3 cyl         32              3        0            0           0
#>  4 disp        32             27        0            0           0
#>  5 drat        32             22        0            0           0
#>  6 gear        32              3        0            0           0
#>  7 hp          32             22        0            0           0
#>  8 mpg         32             25        0            0           0
#>  9 qsec        32             30        0            0           0
#> 10 vs          32              2        0            0           0
#> 11 wt          32             29        0            0           0
#> # … with 1 more variable: DISTINCT_VALUES <chr>
Summary metrics include:
1. Total and distinct counts
2. Counts for NA and blank values (the NA_COUNT, NA_STR_COUNT, and BLANK_COUNT columns)
3. Unique values found within each variable, as a pipe-separated string (the DISTINCT_VALUES column).
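To make the metrics listed above concrete, here is a minimal base R sketch that computes a few of them for every column. It is not the package's implementation: `sketch_summary()` is a hypothetical helper written for illustration, it only covers the total, distinct, and NA counts plus the pipe-separated unique values, and its rows follow the column order of the input rather than the alphabetical order shown in the output above.

```r
# Hypothetical helper (NOT part of the package): per-column summary in base R.
sketch_summary <- function(data) {
  rows <- lapply(names(data), function(nm) {
    x <- data[[nm]]
    data.frame(
      Variable        = nm,
      COUNT           = length(x),               # total number of values
      DISTINCT_COUNT  = length(unique(x)),       # number of unique values
      NA_COUNT        = sum(is.na(x)),           # number of missing values
      # sort() drops NA by default, so only observed values are listed
      DISTINCT_VALUES = paste(sort(unique(x)), collapse = "|"),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)                           # one row per variable
}

res <- sketch_summary(mtcars)
res[res$Variable == "cyl", ]
```

For `cyl`, the sketch reports 32 values, 3 of them distinct, which agrees with the `summarize_variables()` output shown earlier.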
Additional metrics can be derived from variables that contain numeric data. The summarize_variables() function either takes variables as arguments or selects variables of the numeric, integer, or double classes, then calculates summary statistics both with na.rm = FALSE (the _NA-suffixed outputs) and with na.rm = TRUE.
summarize_variables(data = mtcars,
                    incl_num_calc = TRUE)
#> $SUMMARY
#> # A tibble: 11 x 7
#>    Variable COUNT DISTINCT_COUNT NA_COUNT NA_STR_COUNT BLANK_COUNT
#>    <chr>    <int>          <int>    <int>        <int>       <int>
#>  1 am          32              2        0            0           0
#>  2 carb        32              6        0            0           0
#>  3 cyl         32              3        0            0           0
#>  4 disp        32             27        0            0           0
#>  5 drat        32             22        0            0           0
#>  6 gear        32              3        0            0           0
#>  7 hp          32             22        0            0           0
#>  8 mpg         32             25        0            0           0
#>  9 qsec        32             30        0            0           0
#> 10 vs          32              2        0            0           0
#> 11 wt          32             29        0            0           0
#> # … with 1 more variable: DISTINCT_VALUES <chr>
#>
#> $NUMERIC_CALCULATIONS
#> # A tibble: 11 x 17
#>    Variable    MEAN MEAN_NA MEDIAN MEDIAN_NA      SD   SD_NA    MAX MAX_NA   MIN
#>    <chr>      <dbl>   <dbl>  <dbl>     <dbl>   <dbl>   <dbl>  <dbl>  <dbl> <dbl>
#>  1 am         0.406   0.406   0         0      0.499   0.499   1      1     0
#>  2 carb       2.81    2.81    2         2      1.62    1.62    8      8     1
#>  3 cyl        6.19    6.19    6         6      1.79    1.79    8      8     4
#>  4 disp     231.    231.    196.      196.   124.    124.    472    472    71.1
#>  5 drat       3.60    3.60    3.70      3.70   0.535   0.535   4.93   4.93  2.76
#>  6 gear       3.69    3.69    4         4      0.738   0.738   5      5     3
#>  7 hp       147.    147.    123       123     68.6    68.6   335    335    52
#>  8 mpg       20.1    20.1    19.2      19.2    6.03    6.03   33.9   33.9  10.4
#>  9 qsec      17.8    17.8    17.7      17.7    1.79    1.79   22.9   22.9  14.5
#> 10 vs         0.438   0.438   0         0      0.504   0.504   1      1     0
#> 11 wt         3.22    3.22    3.32      3.32   0.978   0.978   5.42   5.42  1.51
#> # … with 7 more variables: MIN_NA <dbl>, SUM <dbl>, SUM_NA <dbl>,
#> #   DISTINCT_LENGTH <int>, NA_LENGTH <int>, BLANK_LENGTH <int>,
#> #   DISTINCT_STR <chr>
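Because mtcars has no missing values, each statistic and its _NA counterpart are identical in the output above. The difference only shows up once a variable contains NA. The short base R sketch below illustrates this with a made-up vector; the names in `stats` mirror the column naming described above (the _NA suffix standing for na.rm = FALSE), but the code is an illustration and not the package's internals.

```r
# Made-up values with one missing entry (illustration only).
x <- c(10, 20, NA, 40)

stats <- c(
  MEAN    = mean(x, na.rm = TRUE),   # NA dropped before averaging
  MEAN_NA = mean(x, na.rm = FALSE),  # NA propagates, so the result is NA
  SUM     = sum(x, na.rm = TRUE),    # NA dropped before summing
  SUM_NA  = sum(x, na.rm = FALSE)    # NA propagates, so the result is NA
)
stats

# With no missing values, both variants agree, as in the mtcars output.
identical(mean(mtcars$mpg, na.rm = TRUE), mean(mtcars$mpg, na.rm = FALSE))
```

Comparing the two variants side by side is a quick way to spot variables whose statistics are affected by missing data.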
The value_count() function returns the count of every unique value for each variable.
value_count(data = mtcars)
#> # A tibble: 171 x 3
#>    Variable Value     n
#>    <chr>    <chr> <int>
#>  1 am       0        19
#>  2 vs       0        18
#>  3 gear     3        15
#>  4 cyl      8        14
#>  5 vs       1        14
#>  6 am       1        13
#>  7 gear     4        12
#>  8 cyl      4        11
#>  9 carb     2        10
#> 10 carb     4        10
#> # … with 161 more rows
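For readers who want to see what such a long-format count table involves, the following base R sketch produces a comparable result with `table()`. `sketch_value_count()` is a hypothetical name used only here; the sort by descending count is an assumption based on the printed rows above, and the order of tied rows may differ from the package's output.

```r
# Hypothetical helper (NOT part of the package): long-format value counts.
sketch_value_count <- function(data) {
  rows <- lapply(names(data), function(nm) {
    counts <- table(data[[nm]], useNA = "ifany")  # count each unique value
    data.frame(
      Variable = nm,
      Value    = names(counts),                   # values stored as character
      n        = as.integer(counts),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  out <- out[order(-out$n), ]                     # most frequent values first
  rownames(out) <- NULL
  out
}

vc <- sketch_value_count(mtcars)
head(vc, 3)
```

Storing `Value` as character lets values from numeric and non-numeric variables share one column, which matches the `<chr>` type in the output above.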