Sunday, September 29, 2019

Key points to better understand Machine Learning (ML) projects!!


Problem statement: Understanding the problem statement is one of the keys to building a good machine learning model, and domain knowledge plays a crucial role in it. Defining a problem statement is not easy. All the examples or problems available over the web have clearly defined problem statements, but when you start working on real-world use cases, it is very hard to understand the problem. Data Scientists and Data Analysts play a major role in this.

Data: Nowadays we have a lot of data in the form of text, images, audio and video. The problem is that all this data is in an unstructured format and is not clean. All the sample data available over the web (Kaggle, for example) is already cleaned, which helps for practicing or learning. But when you start working on real-world projects, you won't get cleaned data. Data Engineers play a major role in cleaning unstructured data, a stage commonly known as data pre-processing.

Understanding Data: Even though cleaned data is available, to train a better model you need to understand the data samples clearly: how the data is distributed and which features to take from it. If the data is not distributed equally across classes, the ML model will not work properly, so always make sure the input data is balanced. Data Analysts play a major role in this.
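
As a quick illustration, assuming the class label is the last comma-separated field of a training file (train.csv is a hypothetical name), you can eyeball the class distribution straight from the CLI:

awk -F',' '{print $NF}' train.csv | sort | uniq -c    # row count per class label

If one class dominates the counts, balance the data before training.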

Happy Learning!!

Sunday, September 22, 2019

Useful commands for data cleaning in Machine Learning (ML) training - Part 1


If you are a CLI user, some of these commands may be familiar to you. The hardest part of machine learning (ML) is providing clean data to train on. ML is all about data; there is a lot of data available these days, and huge amounts are generated every day in the form of text, videos, images, etc. Most of this data is not structured, and that is one of the biggest problems. How do we turn unstructured data into a usable format? By combining the commands below, we can clean raw data by removing noise and produce structured data that can then be used for training a model.

  • grep
  • pipe(|)
  • cat
  • awk
  • > (redirection)
  • wc -l
  • more
  • head
  • tail
  • ls -lh
  • sed

grep: a command to find or search for a particular pattern in a given file. There is a lot more to it than is covered here.
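For example, assuming a raw text file called data.txt (a hypothetical name used across these examples), this prints every line containing the word Singapore:
grep 'Singapore' data.txt    # show only the lines matching the pattern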
pipe(|): the pipe cannot be used alone; it is used to join two or more commands, feeding the output of one into the next.
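A minimal sketch, again with the hypothetical data.txt:
cat data.txt | grep 'Singapore'    # cat's output becomes grep's input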
cat: a command to show the contents of a given file in the console.
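For example:
cat data.txt    # print the whole file to the console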
awk: a very powerful command-line utility for text processing.
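For instance, assuming a comma-separated file data.csv (hypothetical), this prints only the second field of each line:
awk -F',' '{print $2}' data.csv    # -F sets the field separator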
> (redirection): a utility to redirect the results of any command from the console into another file.
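For example, this saves grep's matches into a new file (clean_data.txt is a hypothetical name) instead of printing them:
grep 'Singapore' data.txt > clean_data.txt    # console output goes to the file instead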
wc: this command gives the total lines, words and characters in a given file; -l gives only the total lines. This is very useful for knowing how big a data file is.
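For example:
wc -l data.txt    # print only the line count of the file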
more: shows the contents of a given file in paginated form when the content is longer than one page. Initially I used vi to see the contents of a file; after learning about more, I started using it instead of vi, and it is very effective.
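For example:
more data.txt    # press space for the next page, q to quit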
head: shows sample content (the first 10 lines) of a given file. You can use -n to specify how many lines to show on the console.
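For example:
head -n 5 data.txt    # show only the first 5 lines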
tail: same as head, but from the bottom; it shows sample content (the last 10 lines) of a given file. You can use -n to specify how many lines to show from the bottom.
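For example:
tail -n 5 data.txt    # show only the last 5 lines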
ls -lh: Many of us know the ls command; the trick here is l and h. l shows file details like permissions, owner, etc., and h shows the size of the file in a human-readable format. For example, if the file size is 1424 bytes, ls -lh will show 1.4K; you can easily tell the latter is more readable.
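For example:
ls -lh data.txt    # file details with a human-readable size column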
sed: a stream editor. I mostly use this command for replacing a pattern in the same file.
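A minimal sketch of that in-place replacement (GNU sed; BSD/macOS sed needs -i '' instead):

sed -i 's/old_pattern/new_pattern/g' data.txt    # replace every match directly in the file

Putting a few of these together, a typical cleaning pipeline (same hypothetical file names) strips special-character noise, drops blank lines and saves the result:

cat raw_data.txt | sed 's/[^a-zA-Z0-9,. ]//g' | grep -v '^$' > clean_data.txt
wc -l clean_data.txt    # how many lines survived the cleanup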

I will write the next post on these commands in more detail.

Happy Learning!!!

Monday, September 2, 2019

How to validate that a data file is formatted properly before training an ML model!!


In machine learning (ML), clean training data is the key to a better model. We must remove noise (unwanted data like special characters) from the input data to get clean data. To do this, one of the first steps is to make sure the data format is consistent across the entire input file. One of the best ways to check this is the command below: it should return only one value, and if it doesn't, your data file is not properly formatted.

cat filename | awk -F',' '{print NF}' | sort -u

Let's see this command in detail. It contains the sub-parts below:

cat: the command to display the contents of a text file
awk: a very powerful text-processing tool
sort: this command sorts the lines; -u keeps only unique values, removing duplicates
NF: the number of fields; this is a built-in variable used inside awk
pipe (|): a very powerful command-line utility that combines more than one command

Let's take a sample file that contains the name, age and city of some customers separated by commas, which means each line should contain exactly three comma-separated fields. But if you look at the data below, the second line contains four fields. This is a very common scenario in machine learning training data, and checking the format in a large file is not easy. Using the above command we can verify it easily.

Chandra, 35,Singapore
Sekhar,Chandra, 26,Guntur
Sekhar,35,Singapore

Applying the above command to this text returns the results 3 and 4, as shown below, because the input data is not consistent: the second line contains one extra comma.
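
Assuming the three sample lines are saved in a file called customers.csv (a hypothetical name), the run looks like this:

cat customers.csv | awk -F',' '{print NF}' | sort -u
3
4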


After removing the extra comma from the second line, the command returns only one value, 3, as shown below. This return value may vary based on the number of fields per line, but as long as every line contains the same number of fields, the command returns exactly one value.
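
For instance, if the fix is to join the two name fields (turning the second line into Sekhar Chandra, 26,Guntur), the same run prints a single value:

cat customers.csv | awk -F',' '{print NF}' | sort -u
3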



The cat command displays the contents of the file and, because of the pipe, sends them to awk. awk splits each line on commas, since -F sets the field separator, and NF gives the number of fields after the split. In this case we have three lines: after splitting on commas, the first and third lines have three fields and the second line has four. sort sorts the results (with duplicates), and -u shows only the unique values.
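
As a follow-up, a small variation (assuming the same hypothetical customers.csv and an expected field count of 3) makes awk point directly at the offending lines:

awk -F',' 'NF != 3 {print NR": "$0}' customers.csv    # print line number and content of every bad line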

This command works efficiently on very large files as well. I tried it on data files of around one million lines, and it returned results in seconds.


Happy Learning!!!
