Problem Statement
This tutorial shows you how to load multiple files from multiple folders using R.
As a bonus, this article will also show you how to add the folder and file names to the final tibble.
Load Libraries
Tidyverse and readxl are the only libraries required for this tutorial.
library(tidyverse)
library(readxl)
Data Description
In this example, Spotify playlist data was downloaded using Exportify. Each .xlsx file contains information about the playlist including the artist name, track name, album name, and song duration.
If you want to follow along and access the .xlsx files and the folder structure, click here.
A sample of one of the the playlist data files is listed below.
## Rows: 62
## Columns: 9
## $ `<U+FEFF>Spotify URI` <chr> "spotify:track:3hP7U1NmWEyeZRpLA1BhYj", "spo...
## $ `Track Name` <chr> "Prop Me Up Beside the Jukebox (If I Die)", "...
## $ `Artist Name` <chr> "Joe Diffie", "Brooks & Dunn", "Deana Carter"...
## $ `Album Name` <chr> "The Essential Joe Diffie", "#1s ... and then...
## $ `Disc Number` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ `Track Number` <dbl> 7, 15, 1, 6, 2, 5, 2, 11, 9, 2, 1, 7, 2, 1, 2...
## $ `Track Duration (ms)` <dbl> 226973, 198226, 290466, 247626, 223832, 21669...
## $ `Added By` <chr> "spotify:user:", "spotify:user:", "spotify:us...
## $ `Added At` <chr> "2020-03-29T22:18:52Z", "2020-03-29T22:18:52Z...
Folder Structure
Each playlist was previously converted to a .xlsx file and placed into the appropriate genre’s folder. The three (3) genre folders include Country, Hip-Hop, and Rock.
Warning: please make sure to update your paths below to represent your appropriate project paths.
list.dirs(path = "Data_Sources/2020_05_14_Loading_Mult_Files/Genres/",
full.names = FALSE,
recursive = FALSE)
## [1] "Country" "Hip_Hop" "Rock"
File Structure
Each genre folder included multiple playlist files. The example below shows the three (3) playlists that exist inside the Country genre folder.
list.files(path = "Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Country/",
full.names = FALSE,
recursive = FALSE)
## [1] "90s_country.csv" "90s_country.xlsx"
## [3] "chillin_on_a_dirt_road.xlsx" "forever_country.csv"
## [5] "women_of_country.xlsx"
Create Function to List All Files in Each Folder
A custom function called list_files_in_folder() was created that lists the .xlsx files in each genre folder.
list_files_in_folder <- function(genre) {
list.files(path = str_c("Data_Sources/2020_05_14_Loading_Mult_Files/Genres/", genre),
pattern = "*.xlsx",
full.names = TRUE) %>%
tibble::enframe()
}
View Files in Country Genre Folder
For example, if we want to view all the files in the Country genre folder, we can now easily pass “Country” into the function.
list_files_in_folder("Country")
## # A tibble: 3 x 2
## name value
## <int> <chr>
## 1 1 Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Country/90s_country.x~
## 2 2 Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Country/chillin_on_a_~
## 3 3 Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Country/women_of_coun~
Create Genre List
Next, a genre_list() was created to include all three (3) genres. This list allows us to map the list_files_in_folder() function to multiple genre folders later on.
genre_list <- c("Country", "Hip_Hop", "Rock")
View All Files In All Genre Folders
Using the purrr::map_df() function, we can pass the genre_list() into the list_files_in_folder() function. The final result produces a tibble which includes all the playlists from all of the genre folders.
playlist_tbl <- genre_list %>%
purrr::map_df(list_files_in_folder)
As you can see below, the playlist_tbl now includes 9 observations. Each of the three (3) genre folders includes three (3) playlist .xlsx files.
playlist_tbl
## # A tibble: 9 x 2
## name value
## <int> <chr>
## 1 1 Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Country/90s_country.x~
## 2 2 Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Country/chillin_on_a_~
## 3 3 Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Country/women_of_coun~
## 4 1 Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Hip_Hop/i_love_my_90s~
## 5 2 Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Hip_Hop/jazz_rap.xlsx
## 6 3 Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Hip_Hop/workout_twerk~
## 7 1 Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Rock/new_core.xlsx
## 8 2 Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Rock/pulp.xlsx
## 9 3 Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Rock/yacht_rock.xlsx
Extract “value” Column From playlist_tbl
The value column was then selected from the playlist_tbl above and converted into a list.
playlist_list <- playlist_tbl %>%
select(value) %>%
as.list()
playlist_list$value
## [1] "Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Country/90s_country.xlsx"
## [2] "Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Country/chillin_on_a_dirt_road.xlsx"
## [3] "Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Country/women_of_country.xlsx"
## [4] "Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Hip_Hop/i_love_my_90s_hip-hop.xlsx"
## [5] "Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Hip_Hop/jazz_rap.xlsx"
## [6] "Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Hip_Hop/workout_twerkout.xlsx"
## [7] "Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Rock/new_core.xlsx"
## [8] "Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Rock/pulp.xlsx"
## [9] "Data_Sources/2020_05_14_Loading_Mult_Files/Genres/Rock/yacht_rock.xlsx"
Read All Excel Files and Add Folder and File Names
The playlist_list$value was mapped into the function below which read and loaded each .xlsx file. The code below also added the filename and entire path into a new column called filename.
song_tbl <- playlist_list$value %>%
map_df(function(add_file_name)
read_excel(add_file_name) %>%
mutate(playlist_name = gsub(".xlsx",
"",
basename(add_file_name))) %>%
mutate(filename = add_file_name)
)
song_tbl %>%
glimpse()
## Rows: 794
## Columns: 11
## $ `<U+FEFF>Spotify URI` <chr> "spotify:track:3hP7U1NmWEyeZRpLA1BhYj", "spo...
## $ `Track Name` <chr> "Prop Me Up Beside the Jukebox (If I Die)", "...
## $ `Artist Name` <chr> "Joe Diffie", "Brooks & Dunn", "Deana Carter"...
## $ `Album Name` <chr> "The Essential Joe Diffie", "#1s ... and then...
## $ `Disc Number` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ `Track Number` <dbl> 7, 15, 1, 6, 2, 5, 2, 11, 9, 2, 1, 7, 2, 1, 2...
## $ `Track Duration (ms)` <dbl> 226973, 198226, 290466, 247626, 223832, 21669...
## $ `Added By` <chr> "spotify:user:", "spotify:user:", "spotify:us...
## $ `Added At` <chr> "2020-03-29T22:18:52Z", "2020-03-29T22:18:52Z...
## $ playlist_name <chr> "90s_country", "90s_country", "90s_country", ...
## $ filename <chr> "Data_Sources/2020_05_14_Loading_Mult_Files/G...
Separate “filename” Column
The entire path and filename were listed in the new filename column in the song_tbl. Using the separate() function we can split out three (3) folders, Genre, and the Playlist into five (5) separate columns.
song_tidy_tbl <- song_tbl %>%
separate(col = filename,
into = c("Folder_01",
"Folder_02",
"Folder_03",
"Genre",
"Playlist"),
sep = "/",
remove = FALSE)
Final Tidy Tibble
The final tibble is complete and tidy with all 794 songs listed.
song_tidy_tbl %>%
glimpse()
## Rows: 794
## Columns: 16
## $ `<U+FEFF>Spotify URI` <chr> "spotify:track:3hP7U1NmWEyeZRpLA1BhYj", "spo...
## $ `Track Name` <chr> "Prop Me Up Beside the Jukebox (If I Die)", "...
## $ `Artist Name` <chr> "Joe Diffie", "Brooks & Dunn", "Deana Carter"...
## $ `Album Name` <chr> "The Essential Joe Diffie", "#1s ... and then...
## $ `Disc Number` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ `Track Number` <dbl> 7, 15, 1, 6, 2, 5, 2, 11, 9, 2, 1, 7, 2, 1, 2...
## $ `Track Duration (ms)` <dbl> 226973, 198226, 290466, 247626, 223832, 21669...
## $ `Added By` <chr> "spotify:user:", "spotify:user:", "spotify:us...
## $ `Added At` <chr> "2020-03-29T22:18:52Z", "2020-03-29T22:18:52Z...
## $ playlist_name <chr> "90s_country", "90s_country", "90s_country", ...
## $ filename <chr> "Data_Sources/2020_05_14_Loading_Mult_Files/G...
## $ Folder_01 <chr> "Data_Sources", "Data_Sources", "Data_Sources...
## $ Folder_02 <chr> "2020_05_14_Loading_Mult_Files", "2020_05_14_...
## $ Folder_03 <chr> "Genres", "Genres", "Genres", "Genres", "Genr...
## $ Genre <chr> "Country", "Country", "Country", "Country", "...
## $ Playlist <chr> "90s_country.xlsx", "90s_country.xlsx", "90s_...
If you have questions or comments, please message me on LinkedIn.
If you want to check out my other data science articles, please check out my blog.