Text processing in Linux refers to a set of commands used to manipulate and format text data in the command-line environment. These commands can perform operations such as reading and concatenating the contents of files, searching for specific patterns, sorting and filtering data, and performing various text transformations.
The text processing commands in Linux are widely used for tasks such as data processing, log analysis, and report generation. They can be combined with other Linux commands and pipelines to automate complex text-processing tasks.
Some common text-processing commands in Linux include:
cat
cat is a basic text processing command in Linux used to concatenate and display the contents of files. It is commonly used to view the contents of small text files or to combine multiple files into one. Here are some of its key features:
Syntax:
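```
cat [OPTION]... [FILE]...
```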
Options:
- -A, --show-all: equivalent to -vET
- -b, --number-nonblank: number nonempty output lines, overrides -n
- -E, --show-ends: display $ at end of each line
- -n, --number: number all output lines
- -s, --squeeze-blank: suppress repeated empty output lines
- -T, --show-tabs: display TAB characters as ^I
- -v, --show-nonprinting: use ^ and M- notation, except for LFD and TAB
In its simplest usage, cat prints a file's content to the standard output:
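For example (the file name and contents here are invented for the demo):

```shell
# work in a scratch directory so we don't clobber anything
cd "$(mktemp -d)"

# create a small sample file
printf 'hello\nworld\n' > file.txt

# print the file's content to the standard output
cat file.txt
```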
You can print the content of multiple files:
and using the output redirection operator > you can concatenate the content of multiple files into a new file:
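A quick sketch of both operations, using made-up file names:

```shell
cd "$(mktemp -d)"
printf 'one\n' > file1.txt
printf 'two\n' > file2.txt

# print the content of both files, one after the other
cat file1.txt file2.txt

# concatenate them into a new file instead of printing
cat file1.txt file2.txt > merged.txt
```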
Using >> you can append the content of multiple files into a new file, creating it if it does not exist:
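Continuing with invented file names:

```shell
cd "$(mktemp -d)"
printf 'one\n' > file1.txt
printf 'two\n' > file2.txt

# append both files to combined.txt, creating it since it does not exist
cat file1.txt file2.txt >> combined.txt

# appending again grows the file instead of overwriting it
cat file1.txt >> combined.txt
```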
When viewing source code files it's great to see the line numbers, and you can have cat print them using the -n option:
You can number only the non-blank lines using -b, or remove runs of empty lines using -s. cat is often used in combination with the pipe operator | to feed a file's content as input to another command.
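The numbering and squeezing options, and a simple pipeline, can be sketched like this (sample file invented for the demo):

```shell
cd "$(mktemp -d)"
printf 'first\n\n\n\nsecond\n' > sample.txt

cat -n sample.txt   # number every line, blank or not
cat -b sample.txt   # number only the non-blank lines
cat -s sample.txt   # squeeze runs of blank lines into a single one

# cat is often the first stage of a pipeline
cat sample.txt | grep second
```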
cut
'cut' is a text processing command in Linux that is used to extract specific columns or fields from input data (i.e., from a file or from the standard input). It splits the input into lines and then cuts out specified fields from each line.
Here are some key features of the cut command:
Options:
- -b, --bytes: selects the specified byte positions
- -c, --characters: selects the specified character positions
- -d, --delimiter=DELIM: specifies the delimiter that separates fields in the input data
- -f, --fields=LIST: specifies the fields to be extracted (a comma-separated list of field numbers)
Usage:
cut -f1,3 -d: /etc/passwd: Extracts the first and third fields (separated by a colon) from the /etc/passwd file.
Note: By default, the delimiter is a tab character (\t).
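A self-contained sketch, using a made-up one-line sample in the style of /etc/passwd instead of the real file:

```shell
cd "$(mktemp -d)"

# one sample record, colon-delimited like /etc/passwd
printf 'root:x:0:0:root:/root:/bin/bash\n' > passwd.sample

cut -f1,3 -d: passwd.sample   # fields 1 and 3 → root:0
cut -c1-4 passwd.sample       # first four characters → root
```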
grep
'grep' is a powerful utility in Linux for searching for a specific pattern of text in a file or input. The basic syntax of the command is:
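```
grep [OPTIONS] 'pattern' file_name(s)
```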
- 'pattern' is the string of text you want to search for.
- 'file_name(s)' is the name(s) of the file(s) you want to search. You can specify one or more file names, or use a wildcard pattern to search multiple files.
- -v: to invert the match, so that it returns lines that do not contain the pattern.
- -i: to ignore the case of the pattern when searching.
- -r: to search recursively through all subdirectories.
- -c: to count the number of matches in each file.
For example, here's how we can find the lines containing the string 'text' in the 'newfile3.txt' file:
Using the -n option will show the line numbers:
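A runnable sketch (the contents of newfile3.txt are invented for the demo):

```shell
cd "$(mktemp -d)"
printf 'some text here\nnothing to see\nmore text\n' > newfile3.txt

grep text newfile3.txt      # print the lines containing "text"
grep -n text newfile3.txt   # same, with the line numbers prefixed
```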
One very useful thing is to tell grep to print 2 lines before and 2 lines after the matched line, to give us more context. That's done using the -C option, which accepts a number of lines:
Search is case-sensitive by default. Use the '-i' flag to make it insensitive. As mentioned, you can use grep to filter the output of another command. We can replicate the same functionality as above using:
The search string can be a regular expression, and this makes grep very powerful.
Another thing you might find very useful is to invert the result, excluding the lines that match a particular string, using the '-v' option:
You can use 'grep' in a pipeline to search the output of another command. This makes it possible to search for patterns in the output of any command, making it a very versatile tool for text processing.
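Here is a sketch of these options together, on an invented sample file:

```shell
cd "$(mktemp -d)"
printf 'one\ntwo\nMATCH here\nfour\nfive\n' > log.txt

grep -C 2 MATCH log.txt   # matched line plus 2 lines of context before and after
grep -i match log.txt     # case-insensitive search
grep -v MATCH log.txt     # invert: everything except the matching lines

# filter the output of another command through a pipe
cat log.txt | grep MATCH
```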
sort
sort is a Linux command used to sort the contents of a file or input stream in ascending or descending order. The sorted output can be redirected to a file or displayed on the terminal. By default, sort sorts the input lines in lexicographic order. Options can be provided to sort the input based on different criteria such as numeric value, reverse order, field-based sorting, etc.
Syntax:
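```
sort [OPTION]... [FILE]...
```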
Suppose you have a text file that contains the names of dogs:
This list is unordered. Run the sort command to sort them by name:
Use the -r option to reverse the order:
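For example, with an invented dogs.txt:

```shell
cd "$(mktemp -d)"
printf 'Rex\nBella\nCharlie\nLuna\n' > dogs.txt

sort dogs.txt      # Bella, Charlie, Luna, Rex
sort -r dogs.txt   # Rex, Luna, Charlie, Bella
```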
Sorting is case-sensitive and alphabetical by default. Use the --ignore-case option to sort case-insensitively, and the -n option to sort in numeric order. If the file contains duplicate lines:
You can use the -u option to remove them:
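A sketch with invented duplicate entries:

```shell
cd "$(mktemp -d)"
printf 'Rex\nBella\nRex\nLuna\n' > dogs.txt

sort -u dogs.txt   # Bella, Luna, Rex — the duplicate Rex is removed
```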
sort does not just work on files; like many UNIX commands, it also works with pipes, so you can use it on the output of another command. For example, you can sort the filenames printed by ls by piping them into sort.
sort is very powerful and has lots more options, which you can explore by calling man sort.
uniq
uniq is a Linux command used to remove duplicate lines from a sorted file or input stream. The output of uniq will contain only unique lines from the input. If multiple adjacent lines are identical, only one of them will be included in the output.
Syntax:
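```
uniq [OPTION]... [INPUT [OUTPUT]]
```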
You can get those lines from a file, or using pipes from the output of another command:
For example, you can look for duplicate names in a directory listing:
You need to consider this key thing: uniq only detects adjacent duplicate lines. This implies that you will most likely use it along with sort:
The sort command has its own way to remove duplicates, with the -u (unique) option, but uniq has more power.
By default it removes duplicate lines:
You can tell it to only display duplicate lines, for example, with the -d option:
Conversely, you can use the -u option to display only the lines that are not duplicated:
You can count the occurrences of each line with the -c option:
You can combine uniq -c with sort -nr to then sort those lines by most frequent:
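All of the above can be sketched on one invented sample:

```shell
cd "$(mktemp -d)"
printf 'apple\nbanana\napple\ncherry\napple\nbanana\n' > fruit.txt

sort fruit.txt | uniq                  # each distinct line once
sort fruit.txt | uniq -d               # only lines that appear more than once
sort fruit.txt | uniq -u               # only lines that appear exactly once
sort fruit.txt | uniq -c               # each line prefixed with its count
sort fruit.txt | uniq -c | sort -nr    # most frequent lines first
```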
sed
'sed' (short for stream editor) is a Linux command used to perform text transformations on an input stream, which can be a file or input from a pipeline. Typical operations include search and replace, deletion, and insertion of text, with the result written to the standard output.
Syntax:
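A minimal, runnable sketch of sed's most common use, substitution (the file and strings are invented for the demo):

```shell
# General form: sed [OPTION]... 'script' [input-file]...
cd "$(mktemp -d)"
printf 'hello world\nhello sed\n' > greet.txt

# replace the first occurrence of "hello" on each line with "goodbye"
sed 's/hello/goodbye/' greet.txt

# the result can be redirected to a new file
sed 's/world/there/' greet.txt > edited.txt
```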
awk
'awk' is a programming language and a Linux command used for text processing and manipulation. It scans input text files line by line, performing actions based on patterns and rules specified by the user. awk is often used to extract information from log files, CSV files, and other types of text data.
'awk' is capable of performing operations such as text substitution, deletion, insertion, and search and replace, and it also supports advanced features such as pattern matching and conditional statements. The output of awk can be redirected to a file or displayed on the terminal.
Syntax:
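A minimal sketch; the data file and field values are invented for the demo:

```shell
# General form: awk 'pattern { action }' file
cd "$(mktemp -d)"
printf 'alice 30\nbob 25\n' > people.txt

# print the first whitespace-separated field of each line
awk '{print $1}' people.txt

# run the action only on lines matching a pattern
awk '/bob/ {print $2}' people.txt   # → 25
```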
tr
'tr' is a Linux command used for text translation and manipulation. It translates or deletes specified characters in the input text and writes the result to standard output.
'tr' can be used to perform operations such as character substitution, deletion, and compression. For example, it can be used to convert lowercase characters to uppercase, remove duplicates, or squeeze repeated characters.
Syntax:
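A few self-contained examples of translation, deletion, and squeezing:

```shell
# General form: tr [OPTION]... SET1 [SET2]

# translate lowercase to uppercase
echo 'hello' | tr 'a-z' 'A-Z'     # → HELLO

# delete all digits from the input
echo 'hello 123' | tr -d '0-9'

# squeeze repeated occurrences of the listed characters
echo 'aabbcc' | tr -s 'ab'        # → abcc
```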
tac
'tac' is a Linux command used to concatenate and print files in reverse order. It reads the input files in reverse order, line by line, and writes the result to the standard output.
The tac command is useful for viewing the last lines of a file, for example, the latest entries in a log file.
Syntax:
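```
tac [OPTION]... [FILE]...
```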
Run the following command:
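For example (tac is part of GNU coreutils; the sample file is invented for the demo):

```shell
cd "$(mktemp -d)"
printf 'first\nsecond\nthird\n' > notes.txt

# print the lines in reverse order: third, second, first
tac notes.txt
```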
head
'head' is a Linux command that prints the first n lines (10 by default) of a file to the standard output. It can also be used to view the first part of a large file to see what it contains before reading the whole file.
Syntax:
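```
head [OPTION]... [FILE]...
```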
Options:
- -n: Specify the number of lines to display.
- -c: Specify the number of bytes to display.
- -q: Quiet, never print headers giving file names.
Example:
To print the first 5 lines of the file 'dogs' use the following command:
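A runnable sketch (the contents of the dogs file are invented):

```shell
cd "$(mktemp -d)"
printf 'Rex\nBella\nCharlie\nLuna\nMax\nDaisy\n' > dogs

head -n 5 dogs   # the first five lines
head -c 3 dogs   # the first three bytes → Rex
```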
tail
'tail' is a Linux command used to display the last N lines of a file. By default, it shows the last 10 lines. It can be used to monitor logs, check for updates, etc.
Syntax:
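```
tail [OPTION]... [FILE]...
```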
Options:
- -f, --follow: Output appended data as the file grows
- -n, --lines=[+|-]NUM: Number of lines to output. A '+' symbol shows lines starting from NUM
- -q, --quiet, --silent: Never print headers giving file names when outputting multiple files
- -v, --verbose: Always print headers giving file names
Example:
The best use case of tail, in my opinion, is when it's called with the -f option. It opens the file at its end and watches for changes: any time there is new content in the file, it is printed to the window. This is great for watching log files, for example:
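For example (the log path is illustrative and varies by system; this command runs until interrupted):

```
tail -f /var/log/syslog
```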
To exit, press Ctrl + C.
You can print the last 5 lines in a file:
You can print the whole file content starting from a specific line using + before the line number:
tail -n +10 <filename>
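Both forms in a runnable sketch, on an invented sample log:

```shell
cd "$(mktemp -d)"
printf 'l1\nl2\nl3\nl4\nl5\nl6\nl7\n' > app.log

tail -n 5 app.log    # the last five lines: l3 through l7
tail -n +3 app.log   # everything from line 3 to the end
```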
'tail' can do much more, and as always my advice is to check man tail.
These commands are used in combination with pipes (|) and redirection (> and >>) to create powerful text processing pipelines that can manipulate, filter, and format text data in various ways.