Convert the UCI SMSSpamCollection Dataset to a .csv Using a Bash Script
Hello Folks, how are you?
Yesterday I was looking for a spam dataset, and I found this one: the SMS Spam Collection Data Set, from the UCI Machine Learning Repository. The data was in a weird text format, but I needed a .csv, so I decided to write a simple script to convert it. Here is the code:
The code is pretty easy.
First, we make sure to redirect all the output to the new .csv file: (...) > SMSSpamCollection.csv.
In the main body, before looping through each line, we print the header: echo "sms,is_spam";.
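To make the grouping and redirection idea concrete, here is a minimal sketch. It assumes a temporary file (the hypothetical demo_out) in place of SMSSpamCollection.csv, and the data row is a made-up placeholder, not a line from the dataset.

```shell
#!/usr/bin/env bash
# Sketch: everything printed inside ( ... ) ends up in one file.
# demo_out is a hypothetical temp file standing in for SMSSpamCollection.csv.
demo_out="$(mktemp)"

(
  echo "sms,is_spam"          # header row, printed once before any data
  echo '"example text",ham'   # made-up placeholder for the rows the loop prints
) > "$demo_out"

cat "$demo_out"
```

Redirecting the whole group once means the file is opened a single time, instead of appending to it on every line.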
Now we need to go through the data. Using the construct while read p; do ... done < SMSSpamCollection, we feed our dataset to the loop's input (< SMSSpamCollection) and iterate over it line by line, with each line stored in p.
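Here is a small sketch of that loop on its own. The input is a hypothetical two-line file (demo_in) with made-up messages, and the IFS= and -r additions are my assumption, not part of the construct quoted in this post: they stop read from trimming whitespace or interpreting backslashes.

```shell
#!/usr/bin/env bash
# demo_in is a hypothetical two-line stand-in for the SMSSpamCollection file.
demo_in="$(mktemp)"
printf 'ham\tSee you at 5 then\nspam\tWIN a "free" prize now\n' > "$demo_in"

line_count=0
# IFS= keeps leading/trailing whitespace; -r keeps backslashes literal.
while IFS= read -r p; do
  line_count=$((line_count + 1))
  echo "line $line_count: $p"
done < "$demo_in"
```

Because the file is attached with < rather than piped in, the loop runs in the current shell and variables such as line_count keep their values after done.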
The main part consists of getting the two parts we are interested in: the class [ham/spam] and the SMS body. To do that, we pipe (|) the line p to a cut command, using the delimiter option -d with ' ' (the character separating the class from the body, which is a tab in this file) and the field option -f: 1 to get the first token, and 2- to get everything from the second token onward. We also have to make sure that there are no double quote characters (") within the SMS body, since in the last step we will wrap it in double quotes.
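A quick sketch of the splitting step on a single made-up line. Two assumptions here: the separator is a tab, and the double quotes are simply deleted with tr (the post does not say which tool does the cleanup, so tr is my stand-in).

```shell
#!/usr/bin/env bash
# One made-up line in the class<TAB>body layout.
tab="$(printf '\t')"
p="$(printf 'spam\tWIN a "free" prize now')"

# Field 1 is the class; fields 2- are the body, with double quotes removed.
label="$(printf '%s\n' "$p" | cut -d "$tab" -f1)"
body="$(printf '%s\n' "$p" | cut -d "$tab" -f2- | tr -d '"')"

echo "label=$label"
echo "body=$body"
```

Using 2- instead of 2 keeps the whole message even if the body itself happens to contain the delimiter. Deleting the quotes loses a little information; doubling them ("") is the other common way to keep a quoted CSV field valid.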
Finally, we print the line in the correct format, and that's it.
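Putting the steps described in this post together, the whole conversion could look roughly like the sketch below. It is a reconstruction under assumptions, not a definitive listing: in_file and out_file are hypothetical temp files with made-up sample lines (swap in SMSSpamCollection and SMSSpamCollection.csv for the real thing), the separator is assumed to be a tab, quotes are removed with tr, and the class is written as-is into the is_spam column.

```shell
#!/usr/bin/env bash
# Hypothetical demo files; the real run would read SMSSpamCollection
# and write SMSSpamCollection.csv.
in_file="$(mktemp)"
out_file="$(mktemp)"
printf 'ham\tSee you at 5 then\nspam\tWIN a "free" prize now\n' > "$in_file"

tab="$(printf '\t')"

(
  echo "sms,is_spam"                       # header, printed once
  while IFS= read -r p; do
    label="$(printf '%s\n' "$p" | cut -d "$tab" -f1)"
    body="$(printf '%s\n' "$p" | cut -d "$tab" -f2- | tr -d '"')"
    # Wrap the body in double quotes so commas inside it stay in one column.
    printf '"%s",%s\n' "$body" "$label"
  done < "$in_file"
) > "$out_file"

cat "$out_file"
```

Spawning cut twice per line is slow on large files, but for a dataset of a few thousand messages it keeps the script easy to read.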
Good Job and Bye Bye Folks.