Posts

Showing posts from April, 2022

Convert UCI SMSSpamCollection Dataset to a .csv using bash script
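
As a rough idea of what such a conversion can look like, here is a minimal sketch, not necessarily the script from this post. It assumes each input line is a label (ham or spam), a tab, and then the message. The function name `convert_sms`, the CSV quoting, and the 1/0 encoding of `is_spam` are assumptions made for illustration. It also relies on `cut` using tab as its default delimiter instead of passing `-d` explicitly.

```bash
#!/usr/bin/env bash
# Sketch only: read "label<TAB>message" lines from stdin, write CSV to stdout.
# convert_sms, the quoting, and the 1/0 encoding are illustrative assumptions.
convert_sms() {
  echo "sms,is_spam"                          # header row
  # IFS= and -r keep each line intact; the || clause handles a final
  # line that has no trailing newline.
  while IFS= read -r p || [ -n "$p" ]; do
    label=$(printf '%s\n' "$p" | cut -f1)     # first tab-separated field
    # Everything after the first tab, with embedded double quotes doubled
    # so the message can be wrapped in quotes safely.
    body=$(printf '%s\n' "$p" | cut -f2- | sed 's/"/""/g')
    if [ "$label" = "spam" ]; then is_spam=1; else is_spam=0; fi
    printf '"%s",%s\n' "$body" "$is_spam"
  done
}
```

With the function defined, the whole file could be converted with `convert_sms < SMSSpamCollection > SMSSpamCollection.csv`. Quoting the message matters because SMS texts often contain commas, which would otherwise break the two-column layout.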

Hello folks, how are you? Yesterday I was looking for a spam dataset, and I found this one: SMS Spam Collection Data Set, from the UCI Machine Learning Repository. The data was in a weird text format, but I needed a .csv, so I decided to write a simple script to convert it. Here is the code:

The code is pretty easy. First, we make sure to redirect all the output to the new .csv file: (...) > SMSSpamCollection.csv. In the main body, before looping through each line, we print the header with echo "sms,is_spam"; . Next we need to go through the data. With the construct while read p; do ... done < SMSSpamCollection, we feed our dataset to the loop as input ( < SMSSpamCollection ) and iterate over it line by line ( p ). The main part consists of getting the two pieces we are interested in: the class [ham/spam] and the SMS body. To do that, we pipe ( | ) the line p to a cut command, using the delimiter -d optio