This article presents useful Linux commands for data scientists. It is the author's own summary of what he has picked up while learning and working. The list does not include basic Linux commands (cd, pwd, ls, ssh, scp, etc.). Knowing and using the commands in this list will make your data processing noticeably faster.
The original article was posted on the author's personal blog: 20+ useful linux commands for Data Scientist.
Whenever I learn or come across another useful command while working, I will keep adding it to this list. Now, let's start with the list of useful Linux commands that help us work more efficiently.
## List of useful Linux commands
For each task, I will give the Linux command or combination of commands that solves that specific problem at work. These commands are very terse and may not be easy to remember ^^ (if you use them often you will remember them; if you forget, just come back here, hehe).
### 1. Download data from the internet

These are two commands that download files from the internet straight to wherever you run them. Instead of downloading with a browser on your personal computer and then copying the file over to the server, either of these commands lets you download it directly to where you need it.
```shell
$ wget https://gist.githubusercontent.com/nguyenvanhieuvn/7d9441c10b3c2739499fc5a4d9ea06fb/raw/06c0f4ed63c0951cf32c32089383aeb345e6743e/teencode.txt
--2020-05-03 11:37:24--  https://gist.githubusercontent.com/nguyenvanhieuvn/7d9441c10b3c2739499fc5a4d9ea06fb/raw/06c0f4ed63c0951cf32c32089383aeb345e6743e/teencode.txt
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 151.101.76.133
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|151.101.76.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1437875 (1,4M) [text/plain]
Saving to: ‘teencode.txt’

teencode.txt  100%[========================================>]  1,37M  1,81MB/s  in 0,8s

2020-05-03 11:37:26 (1,81 MB/s) - ‘teencode.txt’ saved [1437875/1437875]
```
The curl command also downloads, but instead of saving to a file it prints the downloaded content to the screen, unless you specify the --output parameter:
```shell
$ curl https://dumps.wikimedia.org/viwiki/20191201/viwiki-20191201-pages-articles-multistream.xml.bz2 --output viwiki-20191201-pages-articles-multistream.xml.bz2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  694M    0  287k    0     0  45186      0  4:28:30  0:00:06  4:28:24 57884
```
Of course, the uses of these two commands do not stop there; you can explore further on your own. There are some small differences between them, detailed in curl vs wget.
### 2. View file content
This set of three commands lets you quickly view part or all of a text file's content from different angles.
The head command shows you a chunk of text at the top of the file; conversely, tail shows a chunk at the end. The cat command prints the entire file.
If you want to view text without editing it, these three commands are the fastest way to see it, faster than opening an editor like vim.
With head and tail, you can specify the number of lines to show with the -n parameter. With cat, if the file is too long to read at once, append | less after the cat command to page through it slowly (press q to quit; use the arrow keys or scroll to read more).
```shell
$ head -n 5 teencode.txt
created	cờ ri ết
kg	không
ctrai	con trai
khôg	không
bme	bố mẹ
$ tail teencode.txt
ngaay	1
tieuthutaodo	1
nosmay	1
bpđ	1
tuongvs	1
jjvyxmni	1
engày	1
oáh	1
thiix	1
zajjj	1
$ cat teencode.txt | less
...
đág	đáng
nvay	như vậy
nhjeu	nhiều
xg	xuống
zồi	rồi
trag	trang
zữ	dữ
atrai	anh trai
:
```
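As a side note, head and tail can be combined to view an arbitrary range of lines, and sed can do the same in one step. A small sketch, using seq as a stand-in for a real file:

```shell
# Print only lines 3-5 of a 10-line stream: take the first 5, then the last 3 of those
seq 10 | head -n 5 | tail -n 3

# sed does the same in one step: -n suppresses output, '3,5p' prints lines 3 to 5
seq 10 | sed -n '3,5p'
```

Both pipelines print 3, 4, 5.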
### 3. Counting words and lines in a file

The wc command counts the lines, words, and bytes of a text file. Called without any parameters, it prints one result line with those three values in that order. If you only want one of the counts, add the corresponding parameter.
```shell
$ wc teencode.txt
 135042  270198 1437875 teencode.txt

# -c, --bytes   print the byte counts
# -m, --chars   print the character counts
# -l, --lines   print the newline counts
$ wc -w teencode.txt
270198 teencode.txt
$ wc -l teencode.txt
135042 teencode.txt
```
### 4. Sort and delete duplicates
As its name implies, the sort command sorts data alphabetically. It has many options, but the one I use the most is sorting while removing duplicates.
```shell
$ sort -u filename
```
The -u parameter deletes duplicate lines (keeping a single copy when the same line appears multiple times). You can also reverse the sort (descending alphabetical order) with the -r parameter, and sort case-insensitively with the -f parameter. Just run sort --help to see them all.
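A quick sketch of those flags on inline data (LC_ALL=C pins the collation order so the output is deterministic):

```shell
# -u: sort and keep one copy of each duplicate line
printf 'banana\napple\nbanana\ncherry\n' | LC_ALL=C sort -u
# apple, banana, cherry

# -r: reverse (descending) order
printf 'apple\nbanana\ncherry\n' | LC_ALL=C sort -r
# cherry, banana, apple

# -f: fold case when comparing, so "apple" sorts before "Banana"
# (without -f, the C locale puts uppercase letters first)
printf 'apple\nBanana\n' | LC_ALL=C sort -f
```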
Several other commands can remove duplicates as well. Here is one that deletes duplicates quickly without sorting, keeping the first occurrence of each line and preserving the original order:
```shell
$ awk '!seen[$0]++' filename
```
Or use uniq. Note that uniq only collapses *adjacent* duplicates, so the input should be sorted first (and be careful with uniq -u: it does something different, dropping every line that has a duplicate and keeping only lines that occur exactly once):

```shell
$ sort filename | uniq
```
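The difference between the variants is easy to see on a tiny inline example:

```shell
# uniq only collapses *adjacent* duplicates, so unsorted input keeps repeats
printf 'a\nb\na\n' | uniq            # a, b, a

# after sorting, uniq behaves like sort -u
printf 'a\nb\na\n' | sort | uniq     # a, b

# uniq -u keeps only the lines that occur exactly once
printf 'a\na\nb\n' | uniq -u         # b
```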
### 5. Count the frequency of occurrence
Suppose you have a text file containing a bunch of words like this:
```
a
b
c
a
b
c
d
f
a
f
a
```
To count the number of occurrences of each line in this file, assuming it is named data.txt, use the following combination of commands:
```shell
$ sort data.txt | uniq -c
      4 a
      2 b
      2 c
      1 d
      2 f
```
Or, for nicer output, tweak it a bit, like this:
```shell
$ sort data.txt | uniq -c | awk '{print $2, $1}'
a 4
b 2
c 2
d 1
f 2
```
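A common follow-up is ranking lines by frequency: piping through sort -rn (numeric, descending) before the awk step does the trick. A sketch on inline data matching the example above:

```shell
# uniq -c prefixes each line with its count; sort -rn ranks by that count
printf 'a\nb\nc\na\nb\nc\nd\nf\na\nf\na\n' | sort | uniq -c | sort -rn | awk '{print $2, $1}'
# the most frequent line ("a 4") comes out first
```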
### 6. Check and delete lines matching a condition
Say you are working with a text file and want to check whether it contains the word “yêu” (Vietnamese for “love”). You can check very quickly without opening a text editor:
```shell
$ grep "yêu" data.txt
vũ tức sôi máu nhưng chỉ có thể cung kính phục tùng mọi yêu cầu của nó
tôi có thể giúp cô chỉ cần cô không làm ảnh hưởng đến cuộc sống gia đình tôi mọi yêu cầu của cô tôi đều chấp nhận làm theo
ông không tin chuyện ma quỷ ông chỉ nghĩ con trai mình yêu đương vớ vẩn nửa đêm nửa hôm ngang nhiên dẫn con gái về nhà
```
Above I pasted the output as plain text, so there is no highlighting; in reality, the matched word would be highlighted in red. This way, you can quickly check for whatever condition you want.
You can also search with a regex, for example:
```shell
$ grep "y[êế]u" data.txt
giọng nữ yếu ớt thê lương truyền vào tai anh
đứng từ xa chỉ thấy cô ấy là một cái bóng trắng có mái tóc vừa đen vừa dài người yếu bóng vía chắc chắn bị dọa cho chết ngất ngay đến thanh niên trai tráng như vũ mới nhìn qua cũng bị dọa cho giật mình
một luồng khói trắng yếu ớt uốn lượn trong không khí dần dần tụ lại thành hình hài
một cô gái gầy yếu vừa khít với chỗ khoanh vùng thi thể
vũ tức sôi máu nhưng chỉ có thể cung kính phục tùng mọi yêu cầu của nó
```
Or, ignoring case, search for lines starting with the word anh:
```shell
$ grep -i '^anh' data.txt
Anh chương một con đường ma ám
anh học y vì gia cảnh tốt nên ra trường thuận lợi làm bác sĩ ở bệnh viện lớn chuyện ở công ty của ba anh
anh hầu như không quan tâm
anh chăm chú xem mà không để ý ngoài trời đã đổ mưa lớn gió rít từng cơn rợn người
anh nhìn vết máu dài màu đỏ bắt mắt trên cánh tay thở dài để mẹ anh nhìn thấy bà ấy lại mắng ù tai đây
```
If you want to capture these lines, redirect the output to a file:
```shell
$ grep 'pattern' filename > output
```
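Two related grep flags worth knowing: -c counts the matching lines instead of printing them, and -n prefixes each match with its line number. A small sketch on inline data:

```shell
# -c: how many lines match?
printf 'anh a\nb\nanh c\n' | grep -c '^anh'      # 2

# -n: where do they match?
printf 'anh a\nb\nanh c\n' | grep -n '^anh'      # 1:anh a and 3:anh c
```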
What if you want to delete the matching lines instead?
- Use the invert-match option of the grep command above; for example, to delete the lines starting with the word anh:
```shell
$ grep -v '^anh' data.txt > output
```
- Or use the sed command as follows (here deleting lines that start with “yêu”):
```shell
# sed '/pattern to match/d' ./infile
$ sed '/^yêu/d' data.txt > output
```
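If you want sed to modify the file in place instead of writing to a new one, its -i flag does that; -i.bak keeps a backup copy first. A sketch on a throwaway temp file (the file name and contents are just for illustration):

```shell
# create a scratch file with three lines
tmp=$(mktemp)
printf 'yêu một\ngiữ lại\nyêu hai\n' > "$tmp"

# delete lines starting with "yêu" in place; the original is saved as "$tmp.bak"
sed -i.bak '/^yêu/d' "$tmp"

cat "$tmp"                 # only "giữ lại" remains
rm -f "$tmp" "$tmp.bak"    # clean up
```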
### 7. Delete lines longer or shorter than x characters
It is still the sed command we mentioned above. Try a short one-liner instead of sitting down to write a long piece of code.
- Delete lines longer than x characters, say x = 10. The regex `^.{10}.` matches any line with at least 11 characters and `d` deletes it (the -r flag enables extended regex, so `{10}` is treated as a repetition count). The awk variant keeps only lines of length at most 10, which amounts to the same thing:

```shell
$ sed -r '/^.{10}./d' filename > output
# Or
$ awk 'length($0) <= 10' filename > output
```

- Conversely, delete lines of length less than or equal to x:

```shell
$ sed -r '/^.{,10}$/d' filename > output
```

If you are familiar with regex, you will find these commands easy to read.
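If the threshold changes often, passing it to awk as a variable avoids editing the regex each time. A sketch on inline data:

```shell
# keep only lines strictly longer than n characters (here n = 10)
printf 'short\na much longer line\nmid-size\n' | awk -v n=10 'length($0) > n'
# prints only "a much longer line"
```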
### 8. Check the encoding, convert to UTF-8
When working with text files, we sometimes run into encoding problems for one reason or another, and since we often work with Vietnamese data, we have to handle this issue regularly.
To check the encoding of a text file, use the following command:
```shell
$ file teencode.txt
teencode.txt: UTF-8 Unicode text, with very long lines
```
If it is UTF-8, there is no problem. But if the result is a different encoding, you can convert it to UTF-8 as follows (assuming you need to convert from UTF-16 to UTF-8):
```shell
# iconv options -f from-encoding -t to-encoding inputfile(s) -o outputfile
$ iconv -f UTF-16 -t UTF-8 filename -o output
```
To see the encodings that iconv supports, use:
```shell
$ iconv -l
```
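You can sanity-check iconv on stdin without touching any files; this round trip converts a string to UTF-16 and back and should reproduce the original text exactly:

```shell
# UTF-8 → UTF-16 → UTF-8 round trip
printf 'xin chào' | iconv -f UTF-8 -t UTF-16 | iconv -f UTF-16 -t UTF-8
# xin chào
```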
### 9. Delete the lines of file A that appear in file B
The first way requires both file A and file B to be sorted. If they are not, sort them first with the sort command shown in section 4.
Then, to delete the lines of file A that also appear in file B, use the following command:
```shell
$ comm -23 fileA fileB > output

$ cat a.txt
a
b
c
d
$ cat b.txt
a
b
$ comm -23 a.txt b.txt
c
d
```
The output file then contains the lines of file A that are not in file B. The -23 parameter suppresses the lines that appear only in file B (-2) and the lines that appear in both files (-3), leaving only the lines unique to file A.
If you want to keep the original line order and still achieve the same result, use the following command:
```shell
# grep -Fvxf <lines-to-remove> <all-lines>
$ grep -Fvxf fileB fileA > output

$ cat a.txt
a
d
c
b
$ cat b.txt
a
b
$ grep -Fvxf b.txt a.txt
d
c
```
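The four flags each do a separate job: -F treats the patterns as fixed strings (no regex), -x requires whole-line matches, -v inverts the match, and -f reads the patterns from a file. A minimal sketch with a temporary pattern file:

```shell
# pattern file containing the lines to remove
patterns=$(mktemp)
printf 'a\nb\n' > "$patterns"

# keep only the input lines that are not exactly "a" or "b"
printf 'a\nd\nc\nb\n' | grep -Fvxf "$patterns"
# d and c, in their original order
rm -f "$patterns"
```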
### 10. Randomly shuffle the lines
Say you are training a classification model and want to split your data into train/test sets, but you want the split to be random. You can randomly shuffle the positions of the lines in the file:
```shell
$ shuf filename
```
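Putting this together for the train/test use case: shuffle once, then cut at the 80% mark with head and tail. A sketch, assuming your data is in data.txt (all file names here are illustrative):

```shell
# shuffle the whole file once so both splits are random
shuf data.txt > shuffled.txt

# 80% of the lines go to train, the rest to test
total=$(wc -l < shuffled.txt)
train=$((total * 80 / 100))
head -n "$train" shuffled.txt > train.txt
tail -n +"$((train + 1))" shuffled.txt > test.txt
```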
### 11. Split a file into several smaller files
Sometimes you have to work with files of dozens of GB but want to process them line by line. To run the work in multiple threads or processes, you need to split the data into multiple files. There is a Linux command that does this quickly:
```shell
# Check the number of lines in the file
$ wc -l bigfile.txt
# Split by the number of lines per chunk, say 1 million lines per file
$ split -l 1000000 bigfile.txt output_dir/prefix_
```
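The chunk files sort lexically (prefix_aa, prefix_ab, ...), so concatenating them with a glob reassembles the original file exactly. A quick self-check in a temp directory:

```shell
cd "$(mktemp -d)"
seq 100 > bigfile.txt
mkdir output_dir

# 30 lines per chunk → 4 chunks: prefix_aa .. prefix_ad
split -l 30 bigfile.txt output_dir/prefix_

# the glob expands in the same order split wrote the chunks
cat output_dir/prefix_* > rebuilt.txt
cmp bigfile.txt rebuilt.txt && echo identical
```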
### Summary
Above are some useful Linux commands summed up from my own learning and working experience, together with results I have looked up on Google. If you know other good commands, don't hesitate to share them with everyone.
Thank you for reading the whole article!