Convert UCI SMSSpamCollection Dataset to a .csv using a bash script

Hello Folks, how are you?

Yesterday I was looking for a spam dataset, and I found this one: the SMS Spam Collection Data Set, from the UCI Machine Learning Repository. The data comes as a plain text file rather than a .csv, and I needed a .csv, so I decided to write a simple script to convert it. Here is the code:

The code is pretty easy. First, we make sure to redirect all the output to the new .csv file: (...) > SMSSpamCollection.csv. In the main body, before looping through each line, we print the header: echo "sms,is_spam";.
Next, we go through the data using the construct while read p; do ... done < SMSSpamCollection, which feeds our dataset to the loop's input (< SMSSpamCollection) and iterates over it line by line (p). The main part consists of extracting the two pieces we are interested in: the class (ham/spam) and the SMS body. To do that, we pipe (|) the line p to a cut command, using the delimiter option -d with ' ' (the separator between the label and the message, which is a tab character in the UCI file) and the field option -f: 1 to get the first token, and 2- to get everything from the second token onward. We also have to make sure that there are no double-quote characters (") within the SMS body, since in the last step we wrap the body in double quotes. Finally, we print the line in the correct format, and that's it.
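
The steps described above can be put together roughly like this. Take it as a minimal sketch reconstructed from the description, not necessarily the exact original script: the function name convert_sms, the 1/0 encoding of the is_spam column, stripping quotes with tr, and relying on cut's default tab delimiter are all my assumptions.

```shell
#!/usr/bin/env bash
# Sketch: convert the SMSSpamCollection text file to a CSV file.
# Assumption: every input line is "<label><TAB><message>", label = ham or spam.

# convert_sms is a hypothetical helper name, used only for this example.
convert_sms() {
  local input="$1" output="$2"
  (
    # Header first, before looping through the data.
    echo "sms,is_spam"
    # Read the dataset line by line; -r keeps backslashes untouched,
    # and the extra test handles a last line without a trailing newline.
    while IFS= read -r p || [ -n "$p" ]; do
      # cut splits on tabs by default: field 1 is the class, 2- is the body.
      label=$(printf '%s\n' "$p" | cut -f1)
      # Drop double quotes from the body, since we wrap it in quotes below.
      body=$(printf '%s\n' "$p" | cut -f2- | tr -d '"')
      # Assumption: is_spam is encoded as 1 for spam and 0 for ham.
      if [ "$label" = "spam" ]; then is_spam=1; else is_spam=0; fi
      printf '"%s",%s\n' "$body" "$is_spam"
    done < "$input"
  ) > "$output"   # everything printed inside (...) lands in the .csv
}

# Run the conversion only if the dataset is in the current directory.
if [ -f SMSSpamCollection ]; then
  convert_sms SMSSpamCollection SMSSpamCollection.csv
fi
```

If your copy of the dataset uses a different separator, pass it to cut explicitly with -d. Calling cut twice per line is not the fastest approach, but for a file of this size it keeps the script close to the explanation.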

Good Job and Bye Bye Folks.
