Let’s say that we already have technically clean names - in that they don’t have spaces or punctuation or start with a number. However, let’s say that there is a redundant word (“percent”) that we want to remove or add to multiple columns.
First let’s load the packages we will need. We will show some functions from janitor
and the tidyverse
:
## Installing package into '/usr/local/lib/R/site-library'
## (as 'lib' is unspecified)
## also installing the dependency 'snakecase'
#install.packages("janitor")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
First let’s make some data:
data_to_clean <- tibble(State = c("Texas", "Utah", "Maryland", "Ohio"),
tax_percent = c(10, 20, 60, 40),
literacy_percent = c(70, 80, 80, 75),
above_poverty_percent = c(60, 70, 50, 60))
data_to_clean
## # A tibble: 4 × 4
## State tax_percent literacy_percent above_poverty_percent
## <chr> <dbl> <dbl> <dbl>
## 1 Texas 10 70 60
## 2 Utah 20 80 70
## 3 Maryland 60 80 50
## 4 Ohio 40 75 60
We can use the rename_with
function of dplyr
and str_remove
of stringr
to remove the pattern “_percent” from each of the column names.
Here we use the ~
and the .
to indicate that we are using str_remove
and all the column names. If it finds the pattern it will remove it.
data_to_clean %>% rename_with(~str_remove(., '_percent'))
## # A tibble: 4 × 4
## State tax literacy above_poverty
## <chr> <dbl> <dbl> <dbl>
## 1 Texas 10 70 60
## 2 Utah 20 80 70
## 3 Maryland 60 80 50
## 4 Ohio 40 75 60
Nice! That simplified our names very easily!
We can use patterns with regex - see this regex cheatsheet for help to remove unwanted characters! We adapted some code from this source.
First we will make some very messy data:
d <- tibble("Year" = 1:5,
"Info" = 1:5,
"1. Products" = 1:5,
"2. Rate" = 1:5,
"3. Price" = 1:5,
"29. Other" = 1:5)
d
## # A tibble: 5 × 6
## Year Info `1. Products` `2. Rate` `3. Price` `29. Other`
## <int> <int> <int> <int> <int> <int>
## 1 1 1 1 1 1 1
## 2 2 2 2 2 2 2
## 3 3 3 3 3 3 3
## 4 4 4 4 4 4 4
## 5 5 5 5 5 5 5
Now we can remove the numbers and punctuation in a similar way as we did before using rename_with
and str_remove
, but this time we specify a few things:
that we want to remove digits with [:digits:]
(based on the regex cheatsheet)
that we want to remove possibly one or more digits with the +
(based on the regex cheatsheet)
that we want to remove a period (which needs two \\
based on the (based on the regex cheatsheet) too!) and a space
Here we go:
d %>%
rename_with(~str_remove(., "[:digit:]+\\. "))
## # A tibble: 5 × 6
## Year Info Products Rate Price Other
## <int> <int> <int> <int> <int> <int>
## 1 1 1 1 1 1 1
## 2 2 2 2 2 2 2
## 3 3 3 3 3 3 3
## 4 4 4 4 4 4 4
## 5 5 5 5 5 5 5
Nice, that is better!
First let’s make some messy data that is missing values in the first row and has possible better column names in the second row. We adapted code from this source.
This can often happen when we read in data.
dirt <- data.frame(X_1 = c(NA, "ID", 1:3),
X_2 = c(NA, "Value", 4:6))
dirt
## X_1 X_2
## 1 <NA> <NA>
## 2 ID Value
## 3 1 4
## 4 2 5
## 5 3 6
The function row_to_names
from the janitor
package (not part of the tidyverse
- so make sure you install and load it!) can be really helpful for this.
We can use the row_number
argument of row_to_names
to specify that the column names can be found in the second row.
row_to_names(dirt, row_number = 2) # our column names can be found in row 2!
## ID Value
## 3 1 4
## 4 2 5
## 5 3 6