Sunday, September 22, 2019

Useful commands for data cleaning in Machine Learning (ML) training - Part 1


If you are a CLI user, some of these commands may be familiar to you. The hardest part of machine learning (ML) is providing clean data for training. ML is all about data. A lot of data is available these days, and a huge amount is generated every day in the form of text, videos, images, etc. This data is not structured, and that is one of the biggest problems. How do we turn unstructured data into a usable format? By combining the commands below, we can clean raw data by removing noise and produce structured data that can then be used for training a model.

  • grep
  • pipe(|)
  • cat
  • awk
  • > (redirection)
  • wc -l
  • more
  • head
  • tail
  • ls -lh
  • sed

grep: a command to search for a particular pattern in a given file.
pipe (|): a pipe cannot be used on its own. It joins two or more commands, passing the output of one command as the input of the next.
cat: a command to show the contents of a given file in the console.
awk: a very powerful command-line utility for text processing.
> (redirection): redirects the output of a command from the console to a file.
wc: gives the total number of lines, words, and bytes in a given file. -l gives only the number of lines. This is very useful for finding out how big a data file is.
more: shows the contents of a given file one page at a time when the content is longer than one page. Initially I used vi to view the contents of a file. After learning about more, I started using it instead of vi, and it is very effective.
head: shows sample content (the first 10 lines by default) from a given file. You can use -n to specify how many lines to show on the console.
tail: the same as head, but from the bottom. It shows sample content (the last 10 lines by default) from a given file. You can use -n to specify how many lines to show from the bottom.
ls -lh: many of us know the ls command. The trick here is l and h. -l shows file details such as permissions and owner, and -h shows the file size in human-readable format. For example, if the file size is 1424 bytes, ls -lh shows 1.4K, which is clearly more readable.
sed: a stream editor. I mostly use this command to replace a pattern within a file.
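As a rough illustration of how these commands combine, here is a minimal sketch. The file names raw.txt and clean.txt, the sample review lines, and the choice of "noise" (blank lines, comment lines starting with #, and leftover <br> tags) are all made up for this example and are not taken from any real dataset.

```shell
#!/bin/sh
# Hypothetical example: raw.txt, clean.txt and the sample lines are invented.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create a small raw file containing noise: a comment line, a blank line,
# irregular spacing and a leftover <br> tag.
printf '%s\n' \
  '# exported reviews' \
  'good   GREAT movie' \
  '' \
  'bad  Terrible plot<br>' \
  'good Loved it' > raw.txt

# Inspect the raw file first: size, line count, first and last lines.
ls -lh raw.txt
wc -l raw.txt
head -n 2 raw.txt
tail -n 2 raw.txt

# Clean the data:
#   grep -v drops blank lines and comment lines,
#   sed removes the <br> tags,
#   awk turns "label text..." into "label<TAB>text" (this also collapses
#   repeated spaces, because awk rebuilds the line with single spaces),
#   > saves the result to clean.txt.
cat raw.txt \
  | grep -v -e '^$' -e '^#' \
  | sed 's/<br>//g' \
  | awk '{ label = $1; $1 = ""; sub(/^ /, ""); print label "\t" $0 }' \
  > clean.txt

# Check the result.
wc -l clean.txt
cat clean.txt
```

With this sample input, clean.txt ends up with only the three data lines, each holding a label and its text separated by a tab. For a large file, more clean.txt would be a better way to page through the result than cat.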

I will write the next post on these commands in detail.

Happy Learning!!!
