I'll be using a dataset from Audible that is available on Kaggle.

Importing the Libraries

    library(readr)
    library(dplyr)
    library(tidyr)
    library(ggplot2)
    library(stringr)

We will use readr to read the CSV file and load it into R; dplyr, stringr, and tidyr for data manipulation; and finally ggplot2 to visualize our data.

In this part we just need to load the dataset and assign it to an R variable. Using the read_csv function, we can read a CSV file directly into our workspace. The only required argument is the path to the file; in this case I'm working in the same directory, so I just need to pass the file name.

    # loading the dataset
    data <- read_csv('audible_uncleaned.csv')

Explore Data

The first thing we want to know is our fields or variables: how many we have and which data types they are. For this, we can use the glimpse function:

    glimpse(data)

[Figure: output of the distinct function over two columns]

As we can see, we only have 85,966 unique rows, which points to a discrepancy in our data. We don't know how this data was collected or how the error was introduced, so we should just keep the first row in each duplicated case.

I like to write a list of the problems that need to be addressed before I start cleaning the data:

- Remove the Writtenby and Narratedby prefixes in the author and narrator columns and separate the first name from the last name with a blank space.
- Remove duplicated rows for the name and author columns.
- Standardize the time column in minutes.
- Convert the releasedate column into date type.
- Separate the number of stars and the number of ratings in the stars column.
- Convert the price column into integer type.

For the first problem, we can use the gsub function to substitute the unwanted substring with an empty string and assign the result to the same column.

    # gsub arguments reconstructed from the description above; the original call was lost in extraction
    data$author <- gsub("Writtenby:", "", data$author) %>% str_trim()
    data$narrator <- gsub("Narratedby:", "", data$narrator) %>% str_trim()
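The first two cleaning problems in the walkthrough (stripping the prefixes, spacing the names, and keeping the first row of each duplicate) can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the toy data frame and the `clean_person` helper are hypothetical, and the exact prefix strings (`"Writtenby:"`, `"Narratedby:"`) are assumed rather than taken from the post.

```r
# Toy data frame standing in for the Audible data; the values are made up
# and only mimic the prefixed, unspaced names described in the walkthrough.
data <- data.frame(
  name     = c("Book A", "Book A", "Book B"),
  author   = c("Writtenby:JaneDoe", "Writtenby:JaneDoe", "Writtenby:JohnSmith"),
  narrator = c("Narratedby:AnnLee", "Narratedby:BobRay", "Narratedby:AnnLee"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop a fixed prefix, trim whitespace, then insert a
# space wherever a lowercase letter is directly followed by an uppercase one.
clean_person <- function(x, prefix) {
  x <- trimws(gsub(prefix, "", x, fixed = TRUE))
  gsub("([a-z])([A-Z])", "\\1 \\2", x)
}

data$author   <- clean_person(data$author, "Writtenby:")
data$narrator <- clean_person(data$narrator, "Narratedby:")

# Keep only the first row for each (name, author) pair.
data <- data[!duplicated(data[, c("name", "author")]), ]

print(data)
```

The lowercase-then-uppercase rule is a simplification; names with initials or punctuation would likely need a more careful pattern.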
Introduction

R is a programming language made specifically for data analysis. You can see this in how quickly you can load a dataset and start exploring it, especially in RStudio, which is tightly integrated with R and lets you view your entire dataset in tabular form, like an Excel spreadsheet. There are other options, such as Jupyter Notebooks with Python, but I think it is still a very good idea to have knowledge of and experience with R if you want to work in the data field.

To everyone learning R: Don't save your workspace.

When you exit an R session, you're faced with the question of whether or not to save your workspace. Saving your workspace creates an image of your current variables and functions, and saves them to a file called ".RData". When you re-open R from that working directory, the workspace will be loaded, and all these things will be available to you again. But you don't want that, so don't save your workspace.

Loading a saved workspace turns your R script from a program, where everything happens logically according to the plan that is the code, to something akin to a cardboard box taken down from the attic, full of assorted pages and notebooks that may or may not be what they seem to be. You end up having to put an inordinate trust in your old self. I don't know about your old selves, dear reader, but if they are anything like mine, don't save your workspace.

What should one do instead? One should source the script often, ideally from freshly minted R sessions, to make sure to always be working with a script that runs and does what it's supposed to. Storing a data frame in the workspace can seem comforting, but what happens the day I overwrite it by mistake? Don't save your workspace.

When using any modern computer system, we rely on saved information and saved state all the time. And yes, every time a computation takes too much time to reproduce, one should write it to a file and load it each time. But I think that should be a deliberate choice, worthy of its own save() and load() calls, and certainly not something one does with simple stuff that can be reproduced in the blink of an eye. Put more trust in your script than in your memory, and don't save your workspace.
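The idea of saving a slow result deliberately, with its own save() and load() calls rather than through the workspace, might look something like the sketch below. The function `slow_computation` and the file name are hypothetical placeholders, not anything from the post.

```r
# Hypothetical stand-in for a computation that is slow to reproduce.
slow_computation <- function() {
  Sys.sleep(0.1)
  data.frame(x = 1:3, y = c(2, 4, 6))
}

# Illustrative path; tempdir() is used so the sketch leaves no files behind.
cache_file <- file.path(tempdir(), "slow_result.RData")

if (file.exists(cache_file)) {
  # load() restores the object under the name it was saved with: `result`.
  load(cache_file)
} else {
  result <- slow_computation()
  save(result, file = cache_file)
}

print(result)
```

Because the script itself decides what is cached and where, sourcing it from a fresh session still reproduces everything. Quitting with `q(save = "no")` skips saving the workspace for that session.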