Saturday, October 5, 2019

Useful commands for data cleaning in Machine Learning (ML) training - part 2

This post gives details on all the commands specified in this post.

grep: If you are a regular CLI user, you probably already know the grep command and its advantages. It prints the lines of a file that match a pattern. You can refer to this and this for more info on grep.

pipe(|): A pipe is used to combine more than one command, i.e. the output of one command is used as the input of the next command. This lets you chain multiple commands in a single line. For example, if you want to find a list of names in a file and sort them while removing duplicates, you can simply run 'grep name file_name | sort -u'.
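Here is a small sketch of that pipeline that you can paste into a terminal. The file people.txt and its contents are made up for illustration; only the 'grep name file_name | sort -u' part comes from the text above.

```shell
# Hypothetical sample data, created in a throwaway directory.
workdir=$(mktemp -d)
printf 'name: carol\nname: alice\nage: 30\nname: carol\nname: bob\n' > "$workdir/people.txt"

# grep keeps only the lines containing "name";
# sort -u sorts those lines and drops the duplicate "name: carol".
unique_names=$(grep name "$workdir/people.txt" | sort -u)
printf '%s\n' "$unique_names"
```

The "age: 30" line never reaches sort, because grep has already filtered it out before the pipe hands the data over.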

cat: cat is the command to display the contents of a file in the console (stdout). Earlier I used vim to look at the content of a file, but now I use cat along with the more command. In my opinion this is the best way to view file contents. And if you want to process the file content, you can use cat along with awk.

awk: This is a very powerful text processing tool. We can do text processing efficiently and effectively using this command. Input data for training should always be separated by a delimiter such as comma(,), pipe(|) or hyphen(-), and is mostly in CSV format. We can use awk to verify whether the input file is formatted properly. You can refer to awk usage here.
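One possible way to do such a format check is to count the fields on every line. This is only a sketch: the file train.csv, its three columns and the expected field count of 3 are assumptions for the example, and a plain comma split like this does not handle quoted fields that contain commas.

```shell
# Hypothetical training file with one broken row ("2,bob" has only two fields).
workdir=$(mktemp -d)
printf 'id,name,label\n1,alice,0\n2,bob\n3,carol,1\n' > "$workdir/train.csv"

# -F ',' sets the field separator to a comma.
# NF is the number of fields on the current line, NR is the line number.
# Print every line that does not have exactly 3 fields.
bad_rows=$(awk -F ',' 'NF != 3 { print NR ": " $0 }' "$workdir/train.csv")
printf '%s\n' "$bad_rows"
```

If nothing is printed, every row has the expected number of columns. For a pipe-separated file you would pass -F '|' instead.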

> (redirection): This helps to save intermediate results in a separate file. For example, if you want to save the sorted names after searching, you can run 'grep name file_name | sort -u > new_filename'. This saves the results in the file new_filename.
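A short sketch of redirection in action. The file names records.txt and names_only.txt are placeholders I picked for the example. One thing worth knowing is that '>' replaces the target file if it already exists, while '>>' appends to it.

```shell
# Hypothetical sample data.
workdir=$(mktemp -d)
printf 'name: bob\nage: 30\nname: alice\nname: bob\n' > "$workdir/records.txt"

# Save the sorted, de-duplicated names in a new file (overwrites if it exists).
grep name "$workdir/records.txt" | sort -u > "$workdir/names_only.txt"

# >> appends to the same file instead of overwriting it.
grep age "$workdir/records.txt" >> "$workdir/names_only.txt"

cat "$workdir/names_only.txt"
```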

wc: This command shows the total number of lines, words and bytes in a file. I knew this command but never used it until I started working on an ML project. If you are working with large files, as in machine learning projects where the required input data is large, it is always good to verify the file size or the number of lines. In that case this command really helps a lot. You can find the total number of unique names after grep by combining this command with grep and sort as 'grep name file_name | sort -u | wc'. If you want to know only the lines, words or characters, use 'wc -l filename' for the total number of lines in filename, '-w' instead of '-l' for words, and '-c' for bytes or '-m' for characters (the two are the same for plain ASCII files).
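As a quick sketch, this compares the total line count of a file with the number of unique lines. The file names.txt and its contents are invented for the example. Reading the file through '<' makes wc print only the number, without the file name after it.

```shell
# Hypothetical sample data with one duplicate line.
workdir=$(mktemp -d)
printf 'name: bob\nname: alice\nname: bob\n' > "$workdir/names.txt"

# Total number of lines in the file.
total_lines=$(wc -l < "$workdir/names.txt")

# Number of unique matching lines: filter, de-duplicate, then count.
unique_lines=$(grep name "$workdir/names.txt" | sort -u | wc -l)

echo "total: $total_lines, unique: $unique_lines"
```

A difference between the two numbers is a cheap hint that the input contains duplicate records.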

more: This command is used to view file content page by page when the content is longer than one page. It really helps in viewing and analyzing the content of a file quickly. We can also use it to page through the results of another command via a pipe. For example, if I want to see the uniquely sorted names after grep, I can use 'grep name file_name | sort -u | more', which shows the results one page at a time.

head/tail: Sometimes it is helpful to see just the first or last lines of a file. As the names say, 'head filename' shows the first 10 lines and 'tail filename' shows the last 10 lines of the file by default. If you want only the first or last three lines you can use 'head -3 filename' or 'tail -3 filename' (the portable spelling is 'head -n 3 filename' and 'tail -n 3 filename'). You can replace '3' with any number you want. You can also use these with other commands through a pipe(|), like 'grep name file_name | sort -u | head' or 'grep name file_name | sort -u | tail'.
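A small sketch with a made-up CSV file (rows.csv is a placeholder name). Besides the first and last lines, it also shows 'tail -n +2', which prints everything from the second line onward and is one handy way to skip a header row before further processing.

```shell
# Hypothetical CSV: one header line followed by five data rows.
workdir=$(mktemp -d)
printf 'id,label\n1,a\n2,b\n3,c\n4,d\n5,e\n' > "$workdir/rows.csv"

# First three lines (header plus two rows).
first_three=$(head -n 3 "$workdir/rows.csv")

# Last two lines.
last_two=$(tail -n 2 "$workdir/rows.csv")

# Everything starting at line 2, i.e. the data without the header.
data_only=$(tail -n +2 "$workdir/rows.csv")

printf '%s\n---\n%s\n---\n%s\n' "$first_three" "$last_two" "$data_only"
```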

file: The 'file file_name' command tells you the format of the file file_name. Sometimes a file name doesn't contain an extension, or the file has an incorrect extension. In that case you can use this command to find out the file format. This is my favorite command, and I use it regularly to check file formats. One personal experience was when I started working in iOS development. I really didn't know about the 'ipa' file format. I learned by using the file command that an ipa is a zip file. You can see the related post here. If you are working on a machine learning project where the files are going to be large, you can't simply open the file. This command helps in that scenario.
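To try this out, the sketch below writes gzip-compressed data to a file without any extension and asks file what it is. The file name dataset is a placeholder, and the example assumes gzip and file are installed. The exact wording of the output differs between versions of file, so treat the description it prints as indicative only.

```shell
# Hypothetical file: gzip-compressed CSV data saved without an extension.
workdir=$(mktemp -d)
printf 'id,label\n1,a\n' | gzip -c > "$workdir/dataset"

# file inspects the content (not the name) to guess the format.
detected=$(file "$workdir/dataset")
printf '%s\n' "$detected"
```

On the systems I am aware of, the printed description mentions gzip compressed data, which tells you to decompress the file before feeding it to anything else.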

ls -h: I know many of you are aware of the ls command, and you might have already used 'ls -ltr' many times. But I am going to tell you about one useful option for reading file sizes quickly. With the '-h' option added to a long listing (for example 'ls -lh'), ls displays the file size in a human readable format such as K, M or G instead of a raw byte count. This is also one of my favorite commands, and it is very helpful when you are working in the CLI.
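A quick sketch to compare the two listings. The file big.bin is a throwaway 2 MiB file created with dd just for this example.

```shell
# Create a hypothetical 2 MiB file filled with zero bytes (2048 blocks of 1024 bytes).
workdir=$(mktemp -d)
dd if=/dev/zero of="$workdir/big.bin" bs=1024 count=2048 2>/dev/null

# Plain long listing: the size column is a raw byte count (2097152).
plain=$(ls -l "$workdir/big.bin")

# With -h the same size is shown with a unit suffix.
listing=$(ls -lh "$workdir/big.bin")

printf '%s\n%s\n' "$plain" "$listing"
```

With many large data files in one directory, the suffixed sizes are much easier to scan than seven or ten digit byte counts.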

Happy Learning!!!
