20+ useful Linux commands for Data Scientist

Tram Ho

This article presents useful Linux commands for data scientists. It is a summary the author has compiled over the course of learning and working. The list does not include basic Linux commands (cd, pwd, ls, ssh, scp, etc.). Knowing and using the commands in this list will make your text processing significantly faster.


The original article was posted on the author's personal blog: 20+ useful linux commands for Data Scientist

As I learn or discover more useful commands while working, I will keep adding to this list. Now, let's start with the list of useful Linux commands that help us work more efficiently.

List of useful linux commands

I will organize the list by function: for each specific problem that comes up at work, I give the Linux command or combination of commands that solves it. These commands are very terse and may not be easy to remember ^^ (if you use them a lot you will remember them, and if you forget, just come back here).

### 1. Download data from the internet

The two commands wget and curl let you download files from the internet to your machine, and more. Instead of downloading in a browser on your personal computer and then copying the file from there to the server, running one of these two commands on the server downloads the file straight to where you need it.
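The article's original code block did not survive; a minimal wget sketch, where the URL and output filename are placeholders, not from the original article:

```shell
# Download a file straight to the current machine; -O names the local copy.
wget -O data.csv https://example.com/data.csv
```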

The curl command also downloads, but by default it does not save to a file; it prints the downloaded content to the screen unless you specify the --output parameter:
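A sketch of the same download with curl (placeholder URL; --output is the long spelling of -o):

```shell
# Without --output, curl prints the response body to the screen:
curl https://example.com/data.csv

# With --output, it saves to a file instead:
curl --output data.csv https://example.com/data.csv
```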

Of course, the uses of these two commands do not stop there; there is more you can learn. There are some subtle differences between them, detailed in curl vs wget.

### 2. View file content

This set of three commands lets you quickly view part or all of a text file from different angles.

The head command shows you a chunk of text from the top of the file; conversely, tail shows a chunk from the end. The cat command prints the entire file.

If you want to view a file without editing it, one of these three commands is the fastest way, faster than opening it in the vim editor.

With head and tail, you can specify the number of lines to show with the -n parameter. With cat, if the file has too many lines and you want to read it at your own pace, append | less after the cat command (press q to exit, and use the arrow keys or scrolling to move around).
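The original examples are missing; a self-contained sketch with a made-up data.txt:

```shell
# Create a small sample file (hypothetical name data.txt).
printf 'line %s\n' 1 2 3 4 5 > data.txt

head -n 2 data.txt   # the first 2 lines
tail -n 2 data.txt   # the last 2 lines
cat data.txt         # the entire file
cat data.txt | less  # page through it; press q to quit
```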

### 3. Count words and lines in a file

The wc command counts the lines, words, and bytes of a text file. Called without any parameters, it prints a single result line with those three values in that order. If you only want one of them, add the corresponding flag.
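A sketch, assuming a sample file data.txt:

```shell
printf 'one two\nthree four five\n' > data.txt

wc data.txt      # lines, words, and bytes on one line
wc -l data.txt   # lines only
wc -w data.txt   # words only
wc -c data.txt   # bytes only
```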

### 4. Sort and delete duplicates

As the name implies, the sort command sorts the lines of a file in alphabetical order. It has many options, but the one I use most is sorting while deleting duplicates.

The -u parameter deletes duplicate lines (keeping only a single copy when several lines are identical). You can also reverse the sort (descending alphabetical order) with the -r parameter.

If you want the sort to ignore case, add the -f parameter. Just run sort --help to see everything.
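The original examples are missing; a sketch of the three options mentioned above, with a made-up data.txt:

```shell
printf 'banana\nApple\napple\nbanana\n' > data.txt

sort -u data.txt    # sort and drop exact duplicate lines
sort -r data.txt    # reverse (descending) alphabetical order
sort -fu data.txt   # fold case, so Apple and apple count as duplicates
```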

There are several different commands that can delete duplicates. Here is another one that removes duplicates quickly without sorting:
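One common choice (the original command is missing, so this is my assumption) is an awk one-liner that keeps the first occurrence of each line and preserves the original order:

```shell
printf 'a\nb\na\nc\nb\n' > data.txt

# seen[$0]++ is 0 (false) the first time a line appears, so each line prints once.
awk '!seen[$0]++' data.txt
```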


### 5. Count the frequency of occurrence

Suppose you have a text file containing a bunch of words, one per line.

To count the number of occurrences of each line in this file, assuming it is called data.txt, use the following command combination:
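The original sample file and command are missing; a sketch with made-up data. Since uniq -c only counts adjacent duplicates, the file is sorted first:

```shell
printf 'cat\ndog\ncat\nbird\ncat\n' > data.txt

# Count how many times each distinct line occurs:
sort data.txt | uniq -c
```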

Or, for nicer output, tweak it a bit, like this:
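A sketch of the nicer variant: sorting the counts themselves so the most frequent lines come first:

```shell
printf 'cat\ndog\ncat\nbird\ncat\n' > data.txt

# -n compares the leading counts numerically, -r puts the largest first:
sort data.txt | uniq -c | sort -rn
```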

### 6. Check and delete lines matching a condition

Say you are working with a text file and want to check whether its content contains the word "love". You can check very quickly, without opening a text editor:
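A sketch with a made-up data.txt; grep prints every line containing the word:

```shell
printf 'I love Linux\nnothing here\nlove at first grep\n' > data.txt

grep love data.txt
```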

In the snippet above the output is shown as plain code, so you cannot see the highlighting; in a real terminal, the word love would be highlighted in red. This way, you can quickly check for whatever condition you want.

You can also search with a regex, for example:
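The original regex example is gone; a hypothetical stand-in that matches lines containing a four-digit number:

```shell
printf 'year 2021\nno digits here\n' > data.txt

# -E enables extended regular expressions:
grep -E '[0-9]{4}' data.txt
```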

Or, ignoring case, search for lines starting with the word English:
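A sketch, using grep's -i (ignore case) flag and the ^ anchor for the start of a line:

```shell
printf 'English breakfast\nenglish tea\nFrench press\n' > data.txt

grep -i '^english' data.txt
```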

If you want to capture these lines, redirect the output to a file:
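A sketch of the same search redirected into a file (matches.txt is a made-up name):

```shell
printf 'English breakfast\nenglish tea\nFrench press\n' > data.txt

grep -i '^english' data.txt > matches.txt
```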

What if you want to delete the matching lines instead?

  • Use the invert-match option of the grep command above; for example, to delete the lines starting with the word love:

  • Or use the sed command as follows:
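The two bullet points above can be sketched as follows (the sample file and the pattern ^love are assumptions):

```shell
printf 'love one\nkeep me\nlove two\n' > data.txt

# Invert match: print only the lines NOT starting with "love":
grep -v '^love' data.txt

# The sed equivalent; add -i to edit the file in place:
sed '/^love/d' data.txt
```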

### 7. Delete lines longer or shorter than x characters

This is still the sed command we mentioned above. Try a short one-liner instead of sitting down to write a long piece of code.

  • Delete lines in the text file of length less than or equal to x, assuming x = 10:

If you are familiar with regex, this command will make sense immediately. There is also an awk command with the same function:

  • Delete lines in the text file longer than x characters, assuming x = 10:
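The bullets above can be sketched as follows (x = 10; the sample file is made up, and "exactly-10" is exactly ten characters long):

```shell
printf 'short\nexactly-10\na much longer line here\n' > data.txt

# Delete lines of length <= 10, i.e. keep only lines longer than 10 characters:
sed -E '/^.{0,10}$/d' data.txt
awk 'length($0) > 10' data.txt   # same result with awk

# Delete lines longer than 10 characters:
awk 'length($0) <= 10' data.txt
```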

### 8. Check the encoding and convert to UTF-8

When working with text files, we sometimes run into encoding errors for one reason or another; this comes up especially often when working with Vietnamese data.

To check the encoding of a text file, use the following command:
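A sketch using file -i, which reports the MIME type and charset (the original command is missing, so this is my assumption):

```shell
printf 'xin chào\n' > data.txt

# Prints something like: data.txt: text/plain; charset=utf-8
file -i data.txt
```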

If it’s UTF-8, there is no problem. If the result is a different encoding, you can still convert it to UTF-8 as follows (assuming you need to convert from UTF-16 to UTF-8):
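A self-contained sketch: a UTF-16 file is first manufactured with iconv, then converted back (in.txt and out.txt are made-up names):

```shell
# Create a UTF-16 file to work with:
printf 'hello\n' | iconv -f UTF-8 -t UTF-16 > in.txt

# Convert it to UTF-8:
iconv -f UTF-16 -t UTF-8 in.txt > out.txt
```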

To see the encodings that iconv supports, use the command:
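The listing is long, so it is piped through head here:

```shell
iconv -l | head
```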

### 9. Delete the lines of file A that appear in file B

The first way requires both file A and file B to be sorted. If they are not, sort them with the sort command shown in section 4.

Then, to delete the lines of file A that also appear in file B, use the following command:
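A sketch using comm with two small, already-sorted files (the filenames are made up):

```shell
printf 'a\nb\nc\n' > fileA.txt
printf 'b\nd\n' > fileB.txt

# Print the lines that appear only in file A:
comm -23 fileA.txt fileB.txt
```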

The output is the lines of file A that are not in file B. In the -23 parameter, the -2 and -3 flags suppress the lines that appear only in file B and the lines that appear in both files, respectively.

If you want to keep the original line order and still achieve the same result, use the following command:
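A common order-preserving alternative (the original command is missing, so this is my assumption) uses grep with the lines of file B as patterns:

```shell
printf 'c\na\nb\n' > fileA.txt
printf 'b\n' > fileB.txt

# -F fixed strings, -x whole-line matches, -f read patterns from a file,
# -v invert: keep the lines of A that are not lines of B, in original order.
grep -vxFf fileB.txt fileA.txt
```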

### 10. Randomly shuffle the lines

Let’s say you’re training a classification model and want to split the data into train/test sets with the examples distributed randomly. You can randomize the order of the lines in the file.
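A sketch using shuf from GNU coreutils (the filenames are assumptions):

```shell
printf 'line %s\n' 1 2 3 4 5 > data.txt

shuf data.txt                  # print the lines in random order
shuf data.txt > shuffled.txt   # or save the shuffled copy
```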

### 11. Split a file into several smaller files

Sometimes you have to work with files dozens of GB in size and want to process them line by line. To process the data in parallel, you need to split the file into multiple smaller files. There is a Linux command that does this quickly:
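A sketch with split, using a tiny stand-in for the multi-GB file, cut into 2-line pieces (a real case would use something like -l 1000000):

```shell
printf 'line %s\n' 1 2 3 4 5 > big.txt

# -l 2: at most 2 lines per piece; pieces are named part_aa, part_ab, ...
split -l 2 big.txt part_

ls part_*
```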


Above are some useful Linux commands summarized from my experience learning and working, together with results I looked up on Google. If you know other good commands, don’t hesitate to share them with everyone.

Thank you for reading to the end!


Source: Viblo