library(rubix)
data("mtcars")

Introduction

The summarize_ functions rapidly survey the depth and breadth of a dataframe. The general use case for this function family is to analyze data for variability from multiple angles so that values can be bucketed into broader categories.

summarize_variables(mtcars,
                    incl_num_calc = FALSE)
#> # A tibble: 11 x 7
#>    Variable COUNT DISTINCT_COUNT NA_COUNT NA_STR_COUNT BLANK_COUNT
#>    <chr>    <int>          <int>    <int>        <int>       <int>
#>  1 am          32              2        0            0           0
#>  2 carb        32              6        0            0           0
#>  3 cyl         32              3        0            0           0
#>  4 disp        32             27        0            0           0
#>  5 drat        32             22        0            0           0
#>  6 gear        32              3        0            0           0
#>  7 hp          32             22        0            0           0
#>  8 mpg         32             25        0            0           0
#>  9 qsec        32             30        0            0           0
#> 10 vs          32              2        0            0           0
#> 11 wt          32             29        0            0           0
#> # … with 1 more variable: DISTINCT_VALUES <chr>

The summary includes:
1. Total and distinct counts
2. Counts for NA values, NA strings (i.e. “NA”), and blank values, which combined provide an overview of missingness in the dataframe
3. Unique values found within each variable, as a pipe-separated string (the DISTINCT_VALUES column)
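
The three kinds of missingness listed above can be illustrated with a small base R sketch. The helper count_missing() is hypothetical and is not part of rubix; it assumes a blank value means an empty string, which may differ from how the package defines it.

```r
# Hypothetical helper (not a rubix function): count three kinds of
# missingness in a single vector using base R only.
count_missing <- function(x) {
  x_chr <- as.character(x)
  c(
    NA_COUNT     = sum(is.na(x)),                     # true NA values
    NA_STR_COUNT = sum(x_chr == "NA", na.rm = TRUE),  # the literal string "NA"
    BLANK_COUNT  = sum(x_chr == "", na.rm = TRUE)     # empty strings (assumed definition of blank)
  )
}

# Illustrative vector: one true NA, one "NA" string, two empty strings
example <- c("a", NA, "NA", "", "b", "")
missing_counts <- count_missing(example)
missing_counts

# Apply the helper to every column of mtcars (one row per variable)
t(sapply(mtcars, count_missing))

# Distinct values of one variable collapsed into a pipe-separated string
distinct_cyl <- paste(sort(unique(mtcars$cyl)), collapse = "|")
distinct_cyl
```

Keeping the three counts separate makes it possible to tell values that R already treats as missing apart from placeholder text that only looks missing.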

Summarize Numeric Variables

Additional metrics can be derived from variables that contain numeric data. With incl_num_calc = TRUE, the summarize_variables() function either takes variables as arguments or selects variables of the numeric, integer, or double R classes, and calculates summary statistics both with na.rm = FALSE (all _NA suffixed outputs) and with na.rm = TRUE. The result is a list with two elements, SUMMARY and NUMERIC_CALCULATIONS.

summarize_variables(data = mtcars,
                    incl_num_calc = TRUE)
#> $SUMMARY
#> # A tibble: 11 x 7
#>    Variable COUNT DISTINCT_COUNT NA_COUNT NA_STR_COUNT BLANK_COUNT
#>    <chr>    <int>          <int>    <int>        <int>       <int>
#>  1 am          32              2        0            0           0
#>  2 carb        32              6        0            0           0
#>  3 cyl         32              3        0            0           0
#>  4 disp        32             27        0            0           0
#>  5 drat        32             22        0            0           0
#>  6 gear        32              3        0            0           0
#>  7 hp          32             22        0            0           0
#>  8 mpg         32             25        0            0           0
#>  9 qsec        32             30        0            0           0
#> 10 vs          32              2        0            0           0
#> 11 wt          32             29        0            0           0
#> # … with 1 more variable: DISTINCT_VALUES <chr>
#> 
#> $NUMERIC_CALCULATIONS
#> # A tibble: 11 x 17
#>    Variable    MEAN MEAN_NA MEDIAN MEDIAN_NA      SD   SD_NA    MAX MAX_NA   MIN
#>    <chr>      <dbl>   <dbl>  <dbl>     <dbl>   <dbl>   <dbl>  <dbl>  <dbl> <dbl>
#>  1 am         0.406   0.406   0         0      0.499   0.499   1      1     0   
#>  2 carb       2.81    2.81    2         2      1.62    1.62    8      8     1   
#>  3 cyl        6.19    6.19    6         6      1.79    1.79    8      8     4   
#>  4 disp     231.    231.    196.      196.   124.    124.    472    472    71.1 
#>  5 drat       3.60    3.60    3.70      3.70   0.535   0.535   4.93   4.93  2.76
#>  6 gear       3.69    3.69    4         4      0.738   0.738   5      5     3   
#>  7 hp       147.    147.    123       123     68.6    68.6   335    335    52   
#>  8 mpg       20.1    20.1    19.2      19.2    6.03    6.03   33.9   33.9  10.4 
#>  9 qsec      17.8    17.8    17.7      17.7    1.79    1.79   22.9   22.9  14.5 
#> 10 vs         0.438   0.438   0         0      0.504   0.504   1      1     0   
#> 11 wt         3.22    3.22    3.32      3.32   0.978   0.978   5.42   5.42  1.51
#> # … with 7 more variables: MIN_NA <dbl>, SUM <dbl>, SUM_NA <dbl>,
#> #   DISTINCT_LENGTH <int>, NA_LENGTH <int>, BLANK_LENGTH <int>,
#> #   DISTINCT_STR <chr>
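
Because mtcars has no missing values, each statistic above matches its _NA suffixed counterpart. The difference only shows up when NA is present, as in this base R sketch. The helper num_summary() and the sample vector are hypothetical and only mirror the naming used in the output; they are not the rubix implementation.

```r
# Illustrative vector with one missing value (not taken from mtcars)
x <- c(10.4, 21.0, NA, 33.9)

# Hypothetical helper (not a rubix function): compute each statistic twice,
# once dropping NA (na.rm = TRUE) and once keeping it (na.rm = FALSE,
# shown with the _NA suffix as in the output above).
num_summary <- function(x) {
  c(
    MEAN      = mean(x, na.rm = TRUE),
    MEAN_NA   = mean(x, na.rm = FALSE),
    MEDIAN    = median(x, na.rm = TRUE),
    MEDIAN_NA = median(x, na.rm = FALSE),
    SD        = sd(x, na.rm = TRUE),
    SD_NA     = sd(x, na.rm = FALSE)
  )
}

# With an NA present, every _NA statistic is itself NA
res <- num_summary(x)
res

# With complete data, such as mtcars$mpg, both versions agree
res_mpg <- num_summary(mtcars$mpg)
res_mpg
```

Comparing the two versions side by side is a quick way to spot variables where missing values would silently propagate into downstream calculations.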

Summarizing Variable Values

The value_count() function returns the count of each unique value for each variable.

value_count(data = mtcars)
#> # A tibble: 171 x 3
#>    Variable Value     n
#>    <chr>    <chr> <int>
#>  1 am       0        19
#>  2 vs       0        18
#>  3 gear     3        15
#>  4 cyl      8        14
#>  5 vs       1        14
#>  6 am       1        13
#>  7 gear     4        12
#>  8 cyl      4        11
#>  9 carb     2        10
#> 10 carb     4        10
#> # … with 161 more rows
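
For comparison, a similar long table of per-variable value counts can be sketched in base R. This is only an approximation under stated assumptions, not how value_count() is implemented; the object names counts and value_counts are made up for this example, and the descending sort is inferred from the rows shown above.

```r
# For each column, tabulate its unique values and return a long data frame
counts <- lapply(names(mtcars), function(v) {
  tab <- table(mtcars[[v]], useNA = "ifany")  # keep NA as its own value if present
  data.frame(
    Variable = v,
    Value    = names(tab),      # values as character, as in the output above
    n        = as.integer(tab),
    stringsAsFactors = FALSE
  )
})

# Stack the per-column tables and order from most to least frequent
value_counts <- do.call(rbind, counts)
value_counts <- value_counts[order(-value_counts$n), ]

head(value_counts, 10)
nrow(value_counts)
```

The number of rows should equal the sum of the DISTINCT_COUNT column from summarize_variables(), since each unique value of each variable contributes one row.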