Saturday, October 5, 2019

Useful commands for data cleaning in Machine Learning (ML) training - Part 2

This post covers, in detail, all the commands listed in Part 1 of this series.

grep: If you are a regular CLI user, you probably already know the grep command and its advantages. You can refer to this and this for more info on grep.

pipe(|): A pipe combines two or more commands: the output of one command becomes the input of the next. This lets you chain multiple commands on a single line. For example, if you want to find a list of names in a file and sort them while removing duplicates, you can simply run 'grep name file_name | sort -u'.
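A minimal sketch of this pipeline, using a hypothetical names.txt with duplicate entries:

```shell
# Hypothetical sample file with a duplicate entry
printf 'name: alice\nname: bob\nname: alice\nage: 30\n' > names.txt

# grep keeps the matching lines; sort -u sorts them and drops duplicates
grep name names.txt | sort -u
```

The pipe means grep's output never touches the disk; it flows straight into sort.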

cat: Cat displays the contents of a file on the console (stdout). I used to open files in vim just to read them, but now I use cat together with the more command, which in my opinion is the best way to view file contents. And if you want to process the file content, you can use cat along with awk.

awk: This is a very powerful text-processing tool that lets you process text efficiently and effectively. Input data for training should always be separated by delimiters such as comma (,), pipe (|), or hyphen (-), and is most often in CSV format. We can use awk to verify whether an input file is formatted properly. You can refer to an awk use case here.
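A quick sketch of the verification idea, assuming a hypothetical comma-separated input.csv: awk prints the field count of each line, and a well-formed file yields the same count everywhere.

```shell
# Hypothetical comma-separated input
printf 'alice,30,london\nbob,25,paris\n' > input.csv

# -F sets the field separator; NF is the number of fields on each line
awk -F',' '{print NF}' input.csv
```

Both lines print 3 here; a stray delimiter on any line would immediately show up as a different number.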

> (redirection): This helps save intermediate results to a separate file. For example, if you want to save the sorted names after searching, you can run 'grep name file_name | sort -u > new_filename'. This saves the results in the file new_filename.

wc: This command shows the total number of lines, words, and characters in a file. I never used it, even though I knew about it, until I started working on an ML project. When you work with large files, as machine learning projects usually require, it is always good to verify the file size or line count, and this command really helps a lot. For example, you can find the total number of unique names after a grep by combining it with sort: 'grep name file_name | sort -u | wc'. If you want only lines, words, or characters, use 'wc -l filename' for the total number of lines, and replace 'l' with 'w' for words or 'c' for characters.
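A small sketch with a hypothetical names.txt, showing plain wc, the -l flag, and wc at the end of a pipeline:

```shell
printf 'alice\nbob\nalice\n' > names.txt

wc names.txt      # lines, words, and characters, plus the file name
wc -l names.txt   # line count only

# Count unique lines matching a pattern
grep alice names.txt | sort -u | wc -l
```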

more: This command shows file content page by page when the content is longer than one screen, which really helps when inspecting and analyzing a file quickly. We can also pipe the results of another command into it to analyze them. For example, to see uniquely sorted names after a grep, run 'grep name file_name | sort -u | more', which shows the results page by page.

head/tail: Sometimes it is helpful to see just the first or last lines of a file. As the names suggest, 'head filename' shows the first ten lines and 'tail filename' shows the last ten lines by default. If you want only the first or last three lines, use 'head -3 filename' or 'tail -3 filename'; you can replace '3' with any number. You can also combine these with other commands using a pipe (|), like 'grep name file_name | sort -u | head' or 'grep name file_name | sort -u | tail'.
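A quick sketch on a generated 20-line file (note that the `-3` form is a legacy shorthand; the portable spelling is `-n 3`):

```shell
seq 1 20 > numbers.txt   # numbers.txt holds the lines 1..20

head -n 3 numbers.txt    # first three lines: 1 2 3
tail -n 3 numbers.txt    # last three lines: 18 19 20
```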

file: 'file file_name' tells you the format of the file file_name. Sometimes a filename has no extension, or the extension is wrong; in that case you can use this command to find the actual format. This is my favorite command and I use it regularly. One personal experience: when I started working in iOS development, I didn't know anything about the 'ipa' file format, and I learned through the file command that an ipa is a zip file. You can see a related post here. In a machine learning project where files grow too large to open, this command is especially helpful.
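A minimal sketch: the command inspects the content, not the name, so an extensionless file is still identified (the exact wording of the output varies by system, but text files are typically reported as some flavor of "text").

```shell
printf 'hello\n' > note   # deliberately no extension

file note                 # reports the detected format, e.g. ASCII text
```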

ls -h: Many of you know this command and have probably used 'ls -ltr' many times. But one useful option for reading file sizes quickly is '-h', which makes ls display the size in human-readable format. This is also one of my favorite options, and it is very helpful when you are working on the CLI.
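A small sketch contrasting the two forms on a file of roughly 1.5 KB (the exact column layout differs between systems, but the size column is what changes):

```shell
# Create a file of about 1500 bytes
head -c 1500 /dev/zero > data.bin

ls -l data.bin    # size in raw bytes, e.g. 1500
ls -lh data.bin   # size human-readable, e.g. 1.5K
```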

Happy Learning!!!

Sunday, September 29, 2019

Key points to better understand Machine Learning (ML) projects!!!

Problem statement: Understanding the problem statement is one of the keys to getting a good machine learning model, and domain knowledge plays a crucial role in that understanding. Defining a problem statement is not easy. All the examples and problems available on the web have clearly defined problem statements, but when you start working on real-world use cases, the problem is often very hard to pin down. Data Scientists and Data Analysts play a major role in this.

Data: Nowadays we have a lot of data in the form of text, images, audio, and video. The problem is that most of it is unstructured and not clean. All the sample data available on the web (Kaggle, for example) is already cleaned, which is helpful for practicing or learning; but when you start working on real-world projects, you won't get clean data. Data Engineers play a major role in cleaning unstructured data, a phase commonly known as the pre-training stage.

Understanding data: Even when cleaned data is available, you need to understand the data samples clearly to train a better model: how the data is distributed, and which features to take from it. If the data is not distributed evenly, the ML model will not work properly, so always make sure the input data is balanced. Data Analysts play a major role in this.

Happy Learning!!

Sunday, September 22, 2019

Useful commands for data cleaning in Machine Learning (ML) training - Part 1

If you are a CLI user, some of these commands may be familiar to you. The hardest part of machine learning (ML) is providing clean data for training. ML is all about data, and there is a lot of it these days: huge amounts are generated every day in the form of text, videos, images, and so on. Most of this data is unstructured, and that is one of the biggest problems. How do we turn unstructured data into a usable format? By combining the commands below, we can clean raw data by removing noise and produce structured data that can then be used to train a model.

  • grep
  • pipe(|)
  • cat
  • awk
  • > (redirection)
  • wc -l
  • more
  • head
  • tail
  • ls -lh
  • sed

grep: a command to find or search for a particular pattern in a given file. You can find more on this here.
pipe(|): a pipe cannot be used on its own; it is used to join two or more commands.
cat: a command to show the contents of a given file on the console.
awk: a very powerful command-line utility for text processing.
> (redirection): a utility to redirect the output of any command from the console to another file.
wc: gives the total lines, words, and characters in a given file. '-l' gives only the total lines. This is very useful for knowing how big a data file is.
more: shows the contents of a given file page by page when the content is longer than one page. Initially I used vi to view file contents; after learning about the more command, I switched to it, and it is very effective.
head: shows sample content (the first 10 lines) of a given file. You can use -n to specify how many lines to show on the console.
tail: same as head, but from the bottom: it shows the last 10 lines of a given file. You can use -n to specify how many lines to show from the bottom.
ls -lh: many of us know the ls command; the trick here is 'l' and 'h'. 'l' shows file details such as permissions and owner, and 'h' shows the file size in human-readable format: if the size is 1424 bytes, 'ls -lh' shows 1.4K, which is clearly easier to read.
sed: a stream editor. I mostly use this command to replace a pattern in place in a file.
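A minimal sketch of that in-place replacement, on a hypothetical words.txt (the `-i` flag shown is the GNU sed form; BSD/macOS sed needs `-i ''` instead):

```shell
printf 'colour and flavour\n' > words.txt

# Replace every occurrence of 'colour' with 'color', editing the file in place
sed -i 's/colour/color/g' words.txt

cat words.txt   # -> color and flavour
```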

I will write about these commands in detail in the next post.

Happy Learning!!!

Monday, September 2, 2019

How to validate that a data file is formatted properly before training an ML model!!

In machine learning (ML), clean training data is the key to a better model. You must remove noise (unwanted data such as special characters) from the input to get clean data. One of the first steps is to make sure the data format is consistent across the entire input file. One of the best ways to check this is the command below: it should return only one value; if it does not, your data file is not formatted properly.

cat file_name | awk -F',' '{print NF}' | sort -u

Let's look at this command in detail. It consists of the following parts:

cat: the command to display the contents of a text file.
awk: a very powerful text-processing tool.
sort: sorts the lines; the -u flag keeps only unique values, removing duplicates.
NF: the number of fields; this is a built-in awk variable.
pipe(|): a very powerful command-line facility that combines two or more commands.

Let's take a sample file containing the name, age, and city of some customers separated by commas, which means each line should contain exactly three comma-separated fields. But if you look at the data below, the second line contains four fields. This is a very common scenario in machine learning training data, and checking the format of a large file by hand is not easy. Using the above command, we can verify it easily.

Chandra, 35,Singapore
Sekhar,Chandra, 26,Guntur

Applying the above command to this text returns the results 3 and 4, because the input data is not consistent: the second line contains one extra comma.

After removing the extra comma from the second line, the command returns only one value, 3. The returned value may vary depending on the number of fields per line, but as long as every line contains the same number of fields, the command returns exactly one value.

cat displays the contents of the file, and the pipe redirects them to awk. awk splits each line on commas, since -F sets the field separator, and NF returns the number of fields after the split. In this sample, the first line yields three fields and the second yields four. sort sorts the results, and -u keeps only the unique values.
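The whole check can be reproduced end to end with the sample data from this post (the customers.csv name is just for illustration):

```shell
# The sample customer file: the second line has an extra comma
printf 'Chandra, 35,Singapore\nSekhar,Chandra, 26,Guntur\n' > customers.csv

# Count fields per line and deduplicate; more than one value in the
# output means the file is inconsistently formatted
cat customers.csv | awk -F',' '{print NF}' | sort -u
```

This prints 3 and 4, flagging the bad line; after fixing it, the output collapses to the single value 3.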

This command works efficiently on very large files as well. I tried it on data files of around one million lines, and it returned results in seconds.

Happy Learning!!!

Tuesday, March 12, 2019

GREP unix command useful tips!!!

Recently I learned some useful grep tips from my manager that help a lot when troubleshooting issues with logs. In this post I will explain the following:

  • How to search for multiple strings in a file at the same time.
  • How to display the found string in color.
  • How to exclude a particular string from the search results.
  • How to display the total number of lines in the grep result.

Search for a string in a file using grep: check this link for the basic usage of the grep command.
Let's look at these tips with an example. Below is the content of a file named demo.txt, which we will use throughout this example.

How to search for multiple strings in a file:
In the screen below, we search for the strings 'demo', 'show', and 'multiple' in the file demo.txt; the result is as follows.
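A runnable sketch of this multi-pattern search, assuming a hypothetical demo.txt (the real file from the screenshot is not shown in this post):

```shell
printf 'this is a demo file\nit will show grep usage\nsearching multiple strings\n' > demo.txt

# Each -e flag adds one pattern; a line matching any pattern is printed
grep -e demo -e show -e multiple demo.txt
```

All three lines are printed, since each one matches one of the patterns.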

How to display the found string in color:
In the screen above, even though grep found the patterns, it is difficult to identify where in each line they appear. For that, you can use grep's 'color' option as shown below.

From the screen above, we can easily identify the found strings because they are highlighted in red. This color option saves a lot of time when you are searching debug logs while troubleshooting.
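A minimal sketch of the option, again on a hypothetical demo.txt:

```shell
printf 'this is a demo file\nit will show grep usage\n' > demo.txt

# --color=always wraps each match in ANSI color escape codes, so matches
# stand out; 'always' keeps the highlighting even when output is piped
grep --color=always demo demo.txt
```

In an interactive shell, `--color=auto` is usually the better default: it colors terminal output but keeps piped output clean.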

How to exclude a particular string from the search results: grep supports an option to exclude a particular pattern from the results, that is, to leave out lines containing a given string.

In the screen above, we initially search for the strings 'demo' and 'show', and from the results we exclude the string 'multiple'.
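One way to sketch this is with `-v` (invert match) in a second grep, assuming the same hypothetical demo.txt content:

```shell
printf 'demo line one\nshow line two\nmultiple demo lines\n' > demo.txt

# Match 'demo' or 'show', then drop any line containing 'multiple'
grep -e demo -e show demo.txt | grep -v multiple
```

The third line matches 'demo' but is filtered out by `grep -v multiple`, leaving two lines.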

How to display the total number of lines in the grep result: We can use the -c option to get the total number of matching lines. If the pattern occurs at most once per line, that count equals the total number of occurrences of the pattern.

In the screen above, searching for the patterns 'demo' and 'show' yields two lines, and the -c option shows the count of two. When grep returns many results, this option is very useful.
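A sketch of the count option on a hypothetical three-line file:

```shell
printf 'demo one\nshow two\nother\n' > demo.txt

# -c prints only the number of matching lines, not the lines themselves
grep -c -e demo -e show demo.txt   # -> 2
```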

Happy Learning!!!

Sunday, February 24, 2019

Uploading a file using Multer!!

Recently I had a requirement to upload a file. We use a NodeJS server, so I explored NPM and found a very simple library called Multer. It is very easy to use, so I am sharing it here.
To upload a file, we need to know the location where it will be stored, so we specify the location for Multer as shown below. One parameter is 'destination' (line 20), where we specify the location, in this case an 'uploads' folder in the current directory. The other parameter is 'filename' (line 23), the name under which the file will be stored in that location; if no filename is specified, a random name is assigned to the file.
After specifying the destination and filename parameters, we need to map these storage fields to Multer, as shown below on line 29.

Now let's see how to upload a file. In the upload POST request, the field name must match the one coming in the request parameters, as shown below on line 48. In this API I am using 'file', so the query parameter should be 'file'; you can use any name, but I used 'file' here. On line 48 we use upload.single('file'), where 'upload' is the Multer-configured storage from line 29 in the image above. If the upload succeeds, we get the file name in the callback, as shown on line 49; if the upload fails, this filename will be undefined.

If you want to validate the request before the upload starts, just add a method to the POST request as shown below on line 48. Here I named the method 'validate', but you can use any function name, or skip validation entirely by removing the method from line 48.

Now let's look at downloading a file. Downloading has no dependency on Multer, but since we are discussing file uploads, let's cover downloading as well.

For the complete code, refer to my GitHub link here.

Happy Learning!!
