Cleaning a common pattern from names

Let’s say that we already have technically clean names - in that they don’t have spaces or punctuation or start with a number. However, let’s say that there is a redundant word (“percent”) that we want to remove or add to multiple columns.

First let’s load the packages we will need. We will show some functions from janitor and the tidyverse:

## Installing package into '/usr/local/lib/R/site-library'
## (as 'lib' is unspecified)
## also installing the dependency 'snakecase'
#install.packages("janitor")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

First let’s make some data:

data_to_clean <- tibble(State = c("Texas", "Utah", "Maryland", "Ohio"),
                        tax_percent = c(10, 20, 60, 40),
                        literacy_percent = c(70, 80, 80, 75),
                        above_poverty_percent = c(60, 70, 50, 60))
data_to_clean
## # A tibble: 4 × 4
##   State    tax_percent literacy_percent above_poverty_percent
##   <chr>          <dbl>            <dbl>                 <dbl>
## 1 Texas             10               70                    60
## 2 Utah              20               80                    70
## 3 Maryland          60               80                    50
## 4 Ohio              40               75                    60

We can use the rename_with function of dplyr and str_remove of stringr to remove the pattern “_percent” from each of the column names.

Here we use the ~ and the . to indicate that we are using str_remove and all the column names. If it finds the pattern it will remove it.

data_to_clean %>% rename_with(~str_remove(., '_percent'))
## # A tibble: 4 × 4
##   State      tax literacy above_poverty
##   <chr>    <dbl>    <dbl>         <dbl>
## 1 Texas       10       70            60
## 2 Utah        20       80            70
## 3 Maryland    60       80            50
## 4 Ohio        40       75            60

Nice! That simplified our names very easily!

Cleaning names with numbers and punctuation

We can use patterns with regex - see this regex cheatsheet for help to remove unwanted characters! We adapted some code from this source.

First we will make some very messy data:

d <- tibble("Year" = 1:5,
       "Info" = 1:5,
       "1. Products" = 1:5,
       "2. Rate" = 1:5,
       "3. Price" = 1:5,
       "29. Other" = 1:5)
d
## # A tibble: 5 × 6
##    Year  Info `1. Products` `2. Rate` `3. Price` `29. Other`
##   <int> <int>         <int>     <int>      <int>       <int>
## 1     1     1             1         1          1           1
## 2     2     2             2         2          2           2
## 3     3     3             3         3          3           3
## 4     4     4             4         4          4           4
## 5     5     5             5         5          5           5

Now we can remove the numbers and punctuation in a similar way as we did before using rename_with and str_remove, but this time we specify a few things:

Here we go:

d %>% 
  rename_with(~str_remove(., "[:digit:]+\\. "))
## # A tibble: 5 × 6
##    Year  Info Products  Rate Price Other
##   <int> <int>    <int> <int> <int> <int>
## 1     1     1        1     1     1     1
## 2     2     2        2     2     2     2
## 3     3     3        3     3     3     3
## 4     4     4        4     4     4     4
## 5     5     5        5     5     5     5

Nice, that is better!

Using values of a specific row for column names

First let’s make some messy data that is missing values in the first row and has possible better column names in the second row. We adapted code from this source.

This can often happen when we read in data.

dirt <- data.frame(X_1 = c(NA, "ID", 1:3),
           X_2 = c(NA, "Value", 4:6))

dirt
##    X_1   X_2
## 1 <NA>  <NA>
## 2   ID Value
## 3    1     4
## 4    2     5
## 5    3     6

The function row_to_names from the janitor package (not part of the tidyverse - so make sure you install and load it!) can be really helpful for this.

We can use the row_number argument of row_to_names to specify that the column names can be found in the second row.

row_to_names(dirt, row_number = 2) # our column names can be found in row 2!
##   ID Value
## 3  1     4
## 4  2     5
## 5  3     6