Text Processing Commands in Linux

Text processing in Linux refers to a set of commands used to manipulate and format text data in the command-line environment. These commands can perform operations such as reading and concatenating the contents of files, searching for specific patterns, sorting and filtering data, and performing various text transformations.

The text processing commands in Linux are widely used for tasks such as data processing, log analysis, and report generation. They can be combined with other Linux commands and pipelines to automate complex text-processing tasks.

Some common text-processing commands in Linux include:

cat

cat is a basic text processing command in Linux used to concatenate and display the contents of files. It is commonly used to view the contents of small text files or to combine multiple files into one. Here are some of its key features:

Syntax: 

cat [OPTION]... [FILE]...

Options:

  • -A, --show-all: equivalent to -vET
  • -b, --number-nonblank: number nonempty output lines, overrides -n
  • -E, --show-ends : display $ at end of each line
  • -n, --number: number all output lines
  • -s, --squeeze-blank: suppress repeated empty output lines
  • -T, --show-tabs: display TAB characters as ^I
  • -v, --show-nonprinting: use ^ and M- notation, except for LFD and TAB

In its simplest usage, cat prints a file's content to the standard output:

username@Technoscience:~$ cat newfile.txt
The text 1
The text 2
The text 3
The text 4
username@Technoscience:~$

You can print the content of multiple files:

username@Technoscience:~$ cat newfile.txt newfile2.txt
The text 1
The text 2
The text 3
The text 4
The text 5
The text 6
The text 7
username@Technoscience:~$

and using the output redirection operator > you can concatenate the content of multiple files into a new file:

username@Technoscience:~$ cat newfile.txt newfile2.txt > newfile3.txt
username@Technoscience:~$ cat newfile3.txt
The text 1
The text 2
The text 3
The text 4
The text 5
The text 6
The text 7
username@Technoscience:~$

Using >> you can append the content of multiple files to another file, creating it if it does not exist:

username@Technoscience:~$ cat newfile.txt newfile2.txt >> newfile4.txt
username@Technoscience:~$ cat newfile4.txt
The text 1
The text 2
The text 3
The text 4
The text 5
The text 6
The text 7
username@Technoscience:~$

When viewing source code files it's helpful to see line numbers, and you can have cat print them using the -n option:

username@Technoscience:~$ cat -n newfile4.txt
     1 The text 1
     2 The text 2
     3 The text 3
     4 The text 4
     5 The text 5
     6 The text 6
     7 The text 7
username@Technoscience:~$

You can number only the non-blank lines using -b, or squeeze repeated empty lines into one using -s. cat is often used in combination with the pipe operator | to feed a file's content as input to another command.
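
For example, here is a minimal sketch that pipes cat's output into wc -l to count lines; the output shown is what you would expect for the four-line newfile.txt from above:

username@Technoscience:~$ cat newfile.txt | wc -l
4
username@Technoscience:~$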


cut

'cut' is a text processing command in Linux that is used to extract specific columns or fields from input data (i.e., from a file or from the standard input). It splits the input into lines and then cuts out specified fields from each line.

Syntax: 

cut [OPTION]... [FILE]...

Options:

  • -b, --bytes: selects the specified byte positions
  • -c, --characters: selects the specified character positions
  • -d, --delimiter=DELIM: specifies the delimiter that separates fields in the input data
  • -f, --fields=LIST: specifies the fields to be extracted (a comma-separated list of field numbers)

Usage:

cut -f1,3 -d: /etc/passwd: Extracts the first and third fields (separated by a colon) from the /etc/passwd file.


Note: By default, the delimiter is a tab character (\t).
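
Here is a quick sketch you can try directly in the shell (the sample string is made up for illustration): -d sets the delimiter to a colon and -f2 extracts the second field:

username@Technoscience:~$ echo 'one:two:three' | cut -d: -f2
two
username@Technoscience:~$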


grep

'grep' is a powerful utility in Linux for searching for a specific pattern of text in a file or input. The basic syntax of the command is:

grep [options] pattern [file_name(s)]
where:
  • 'pattern' is the string of text you want to search for.
  • 'file_name(s)' is the name(s) of the file(s) you want to search. You can specify one or more file names, or use a wildcard pattern to search multiple files.
'grep' has several useful options, including:
  • -v: to invert the match, so that it returns lines that do not contain the pattern.
  • -i: to ignore the case of the pattern when searching.
  • -r: to search recursively through all subdirectories.
  • -c: to count the number of matches in each file.
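
For instance, here is a quick sketch of the -c and -i options; the counts shown are what you would expect for the newfile3.txt created above, whose seven lines all contain the word 'text':

username@Technoscience:~$ grep -c 'text' newfile3.txt
7
username@Technoscience:~$ grep -ci 'TEXT' newfile3.txt
7
username@Technoscience:~$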

Here's how we can find the lines containing 'text' in the newfile3.txt file:

username@Technoscience:~$ grep 'text' newfile3.txt
The text 1
The text 2
The text 3
The text 4
The text 5
The text 6
The text 7
username@Technoscience:~$

Using the -n option will show the line numbers:

username@Technoscience:~$ grep -n 'text 2' newfile3.txt
2:The text 2
username@Technoscience:~$

One very useful thing is to tell grep to print 2 lines before and 2 lines after the matched line, to give us more context. That's done using the -C option, which accepts a number of lines:

username@Technoscience:~$ grep -nC 2 'text 2' newfile3.txt
1-The text 1
2:The text 2
3-The text 3
4-The text 4
username@Technoscience:~$

Search is case-sensitive by default; use the -i flag to make it case-insensitive. As mentioned, you can use grep to filter the output of another command. We can get the same kind of result by piping the file's content into grep:

username@Technoscience:~$ less newfile3.txt | grep -n 'text 5'
5:The text 5
username@Technoscience:~$

The search string can be a regular expression, and this makes grep very powerful.

username@Technoscience:~$ ls -al | grep 'newfile'
-rw-rw-r--  1 username username     33 Feb  2 19:39 newfile2.txt
-rw-rw-r--  1 username username     77 Feb  2 19:42 newfile3.txt
-rw-rw-r--  1 username username     77 Feb  2 19:44 newfile4.txt
-rw-rw-r--  1 username username     44 Feb  2 19:37 newfile.txt
username@Technoscience:~$
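
Here is a quick sketch using an actual regular expression: the bracket expression [13] matches either 'text 1' or 'text 3' in newfile3.txt:

username@Technoscience:~$ grep -n 'text [13]' newfile3.txt
1:The text 1
3:The text 3
username@Technoscience:~$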

Another thing you might find very useful is to invert the result, excluding the lines that match a particular string, using the -v option:

username@Technoscience:~$ ls -al | grep -v 'newfile'

You can use 'grep' in a pipeline to search the output of another command. This makes it possible to search for patterns in the output of any command, making it a very versatile tool for text processing.


sort

sort is a Linux command used to sort the contents of a file or input stream in ascending or descending order. The sorted output can be redirected to a file or displayed on the terminal. By default, sort sorts the input lines in lexicographic order. Options can be provided to sort the input based on different criteria such as numeric value, reverse order, field-based sorting, etc.

Syntax: 

sort [OPTION]... [FILE]...

Suppose you have a text file that contains the names of dogs:

username@Technoscience:~$ cat dogs
golden retriever
labrador retriever
french bulldog
beagle
german shepherd dog
poodle
bulldog
most popular breeds
Great Danes
Miniature schnauzers
Cane corso
Boxers
Havanese
Spaniels
Brittanys
username@Technoscience:~$

This list is unordered. Run the sort command to sort the lines by name:

username@Technoscience:~$ sort dogs
beagle
Boxers
Brittanys
bulldog
Cane corso
french bulldog
german shepherd dog
golden retriever
Great Danes
Havanese
labrador retriever
Miniature schnauzers
most popular breeds
poodle
Spaniels
username@Technoscience:~$

Use the -r option to reverse the order:

username@Technoscience:~$ sort -r dogs
Spaniels
poodle
most popular breeds
Miniature schnauzers
labrador retriever
Havanese
Great Danes
golden retriever
german shepherd dog
french bulldog
Cane corso
bulldog
Brittanys
Boxers
beagle
username@Technoscience:~$

Sorting by default is case-sensitive and alphabetic. Use the --ignore-case option to sort case-insensitively, and the -n option to sort in numeric order:

username@Technoscience:~$ sort --ignore-case -n dogs
beagle
Boxers
Brittanys
bulldog
Cane corso
french bulldog
german shepherd dog
golden retriever
Great Danes
Havanese
labrador retriever
Miniature schnauzers
most popular breeds
poodle
Spaniels
username@Technoscience:~$
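
To see what -n actually changes, here is a small sketch with purely numeric input (the numbers are made up for illustration); without -n the lines are compared character by character, with -n they are compared as numbers:

username@Technoscience:~$ printf '10\n2\n33\n4\n' | sort
10
2
33
4
username@Technoscience:~$ printf '10\n2\n33\n4\n' | sort -n
2
4
10
33
username@Technoscience:~$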

If the file contains duplicate lines, you can use the -u option to remove them:

username@Technoscience:~$ sort -u dogs
beagle
Boxers
Brittanys
bulldog
Cane corso
french bulldog
german shepherd dog
golden retriever
Great Danes
Havanese
labrador retriever
Miniature schnauzers
most popular breeds
poodle
Spaniels
username@Technoscience:~$

sort does not just work on files; like many UNIX commands, it also works with pipes, so you can use it on the output of another command. For example, you can sort the files returned by ls with:

username@Technoscience:~$ ls | sort
Desktop
Documents
dogs
Downloads
Music
newdir
newfile2.txt
newfile3.txt
newfile4.txt
newfile.txt
Pictures
Public
Templates
Videos
username@Technoscience:~$

sort is very powerful and has lots more options, which you can explore by calling man sort.

SORT(1)                             User Commands                            SORT(1)
NAME
       sort - sort lines of text files
SYNOPSIS
       sort [OPTION]... [FILE]...
       sort [OPTION]... --files0-from=F
DESCRIPTION
       Write sorted concatenation of all FILE(s) to standard output.
       With no FILE, or when FILE is -, read standard input.
       Mandatory  arguments  to  long  options  are mandatory for short options too.
       Ordering options:
       -b, --ignore-leading-blanks
              ignore leading blanks
       -d, --dictionary-order
              consider only blanks and alphanumeric characters
<SNIP>


uniq

uniq is a Linux command used to remove duplicate lines from a sorted file or input stream. The output of uniq will contain only unique lines from the input. If multiple adjacent lines are identical, only one of them will be included in the output.

Syntax: 

uniq [OPTION]... [INPUT [OUTPUT]]

You can feed it lines from a file, or pipe in the output of another command:

username@Technoscience:~$ uniq dogs
golden retriever
labrador retriever
french bulldog
<SNIP>
username@Technoscience:~$

You can also run it on the output of another command, such as a directory listing:

username@Technoscience:~$ ls | uniq
Desktop
Documents
dogs
Downloads
<SNIP>
username@Technoscience:~$

You need to consider one key thing: uniq only detects adjacent duplicate lines. This implies that you will most likely use it along with sort:

username@Technoscience:~$ sort dogs | uniq
beagle
Boxers
Brittanys
bulldog
Cane corso
<SNIP>
username@Technoscience:~$

The sort command has its own way to remove duplicates with the -u (unique) option, but uniq gives you more control. By default it removes duplicate lines, just as in the example above.

You can tell it to only display duplicate lines, for example, with the -d option:

username@Technoscience:~$ sort dogs | uniq -d
username@Technoscience:~$

There are no duplicate lines in this file. You can use the -u option to display only the lines that are not duplicated:

username@Technoscience:~$ sort dogs | uniq -u
beagle
Boxers
Brittanys
bulldog
Cane corso
<SNIP>
username@Technoscience:~$

You can count the occurrences of each line with the -c option:

username@Technoscience:~$ sort dogs | uniq -c
      1 beagle
      1 Boxers
      1 Brittanys
      1 bulldog
      1 Cane corso
<SNIP>
      1 Spaniels
username@Technoscience:~$

You can combine uniq -c with sort -nr to order the lines by how often they occur, most frequent first:

username@Technoscience:~$ sort dogs | uniq -c | sort -nr
      1 Spaniels
      1 poodle
      1 most popular breeds
      1 Miniature schnauzers
      1 labrador retriever
      1 Havanese
      1 Great Danes
<SNIP>
username@Technoscience:~$


sed

sed (Stream EDitor) is a powerful Linux command-line utility that performs text transformations on an input file or input stream. sed can perform operations such as text substitution, deletion, insertion, and search and replace. sed operates in a non-interactive manner and performs the transformations in batch mode. The output of sed can be redirected to a file or displayed on the terminal.

Syntax: 

sed [OPTION]... {script} [input-file]...
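
As a minimal sketch, the s (substitute) command below replaces the word 'text' with 'line' in the newfile.txt created earlier and prints the result to standard output; the file itself is left untouched unless you also pass the -i option:

username@Technoscience:~$ sed 's/text/line/' newfile.txt
The line 1
The line 2
The line 3
The line 4
username@Technoscience:~$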


awk

'awk' is a programming language and a Linux command used for text processing and manipulation. It scans input text files line by line, performing actions based on patterns and rules specified by the user. awk is often used to extract information from log files, CSV files, and other types of text data.

'awk' is capable of performing operations such as text substitution, deletion, insertion, and search and replace, and it also supports advanced features such as pattern matching and conditional statements. The output of awk can be redirected to a file or displayed on the terminal.

Syntax: 

awk [OPTION]... 'program' [input-file]...
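
As a minimal sketch, the awk program below splits /etc/passwd on colons (-F:) and prints the first field, i.e. the username, of each line; the exact output depends on the accounts present on your system:

username@Technoscience:~$ awk -F: '{ print $1 }' /etc/passwd
root
daemon
<SNIP>
username@Technoscience:~$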


tr

'tr' is a Linux command used for text translation and manipulation. It translates or deletes specified characters in the input text and writes the result to standard output.

'tr' can be used to perform operations such as character substitution, deletion, and compression. For example, it can be used to convert lowercase characters to uppercase, remove duplicates, or squeeze repeated characters.

Syntax: 

tr [OPTION]... SET1 [SET2]
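
As a small sketch (the input strings are made up for illustration), the first command translates lowercase letters to uppercase and the second squeezes repeated characters with -s:

username@Technoscience:~$ echo 'hello world' | tr 'a-z' 'A-Z'
HELLO WORLD
username@Technoscience:~$ echo 'aaabbbccc' | tr -s 'abc'
abc
username@Technoscience:~$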


tac

'tac' is a Linux command used to concatenate and print files in reverse. It reads each input file and writes its lines to the standard output in reverse order, last line first.

The tac command is useful for seeing the end of a file first, for example the latest entries in a log file.

Syntax: 

tac [FILE]...

Run the following command:

username@Technoscience:~$ tac dogs
Brittanys
Spaniels
Havanese
Boxers
<SNIP>
username@Technoscience:~$


head

'head' is a Linux command that prints the first n lines (10 by default) of a file to the standard output. It can also be used to view the first part of a large file to see what it contains before reading the whole file.

Syntax: 

head [OPTION]... [FILE]...

Options:

  • -n: Specify the number of lines to display.
  • -c: Specify the number of bytes to display.
  • -q: Quiet, never print headers giving file names.

Example: 

To print the first 5 lines of the file 'dogs' use the following command:

username@Technoscience:~$ head -n 5 dogs
golden retriever
labrador retriever
french bulldog
beagle
german shepherd dog
username@Technoscience:~$
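
You can also limit the output by bytes with -c; in this sketch, 36 bytes happens to cover exactly the first two lines of the dogs file:

username@Technoscience:~$ head -c 36 dogs
golden retriever
labrador retriever
username@Technoscience:~$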


tail

"tail" is a Linux command used to display the last N number of lines in a file. By default, it shows the last 10 lines. It can be used to monitor logs, check for updates, etc.

Syntax:

tail [OPTION]... [FILE]...

Options:
  • -f, --follow: Output appended data as the file grows
  • -n, --lines=[+|-]NUM: Number of lines to output; a leading '+' prints lines starting from line NUM
  • -q, --quiet, --silent: Never print headers giving file names when multiple files are given
  • -v, --verbose: Always print headers giving file names when multiple files are given

Example:

The best use case of tail, in my opinion, is calling it with the -f option. It opens the file at the end and watches for changes: any time new content is appended to the file, it is printed to the window. This is great for watching log files, for example:

username@Technoscience:~$ tail -f /var/log/dpkg.log
2023-02-02 15:43:15 status installed htop:amd64 2.1.0-3
2023-02-02 15:43:15 trigproc desktop-file-utils:amd64 0.23-1ubuntu3.18.04.2 <none>
2023-02-02 15:43:15 status half-configured desktop-file-utils:amd64 0.23-1ubuntu3.18.04.2
2023-02-02 15:43:15 status installed desktop-file-utils:amd64 0.23-1ubuntu3.18.04.2
2023-02-02 15:43:15 trigproc man-db:amd64 2.8.3-2ubuntu0.1 <none>
2023-02-02 15:43:15 status half-configured man-db:amd64 2.8.3-2ubuntu0.1
2023-02-02 15:43:17 status installed man-db:amd64 2.8.3-2ubuntu0.1
2023-02-02 15:43:17 trigproc mime-support:all 3.60ubuntu1 <none>
2023-02-02 15:43:17 status half-configured mime-support:all 3.60ubuntu1
2023-02-02 15:43:17 status installed mime-support:all 3.60ubuntu1

To exit, press Ctrl + C.

You can print the last 5 lines in a file:

username@Technoscience:~$ tail -n 5 newfile3.txt
The text 3
The text 4
The text 5
The text 6
The text 7
username@Technoscience:~$

You can print the file's content starting from a specific line by putting + before the line number:

tail -n +10 <filename>

username@Technoscience:~$ tail -n+5 newfile3.txt
The text 5
The text 6
The text 7
username@Technoscience:~$

'tail' can do much more, and as always my advice is to check man tail.

These commands are used in combination with pipes (|) and redirection (> and >>) to create powerful text processing pipelines that can manipulate, filter, and format text data in various ways.
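
For instance, here is a small sketch of such a pipeline using the dogs file from earlier: grep selects the lines containing 'bulldog', sort orders them, and uniq -c counts each unique line:

username@Technoscience:~$ grep -i 'bulldog' dogs | sort | uniq -c
      1 bulldog
      1 french bulldog
username@Technoscience:~$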
