Linux Text Processing

Posted 2021-01-02 lizitest

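   ● cat: display a file
       ○ Show non-printing characters: cat -A file
       ○ Number every line, blank ones included: cat -n file
       ○ Number only non-blank lines: cat -b file
       ○ Squeeze adjacent blank lines into one (with line numbers): cat -ns file
       ○ Print the lines in reverse order: tac file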
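The difference between the numbering and squeezing options is easiest to see on a tiny file. The sketch below is only an illustration: the sample text is made up, and tac is assumed to be available (it ships with GNU coreutils).

```shell
# Sample text: two non-blank lines separated by two consecutive blank lines.
sample=$(mktemp)
printf 'alpha\n\n\nbeta\n' > "$sample"

# -n numbers every line, so all four lines carry a number.
numbered_all=$(cat -n "$sample" | grep -c '^ *[0-9]')
# -b numbers only the non-blank lines.
numbered_nonblank=$(cat -b "$sample" | grep -c '^ *[0-9]')
# -s squeezes the run of blank lines into one, leaving three lines.
squeezed_lines=$(cat -s "$sample" | wc -l | tr -d ' ')
# tac reverses the line order, so the last line comes out first.
first_reversed=$(tac "$sample" | head -n 1)

echo "$numbered_all $numbered_nonblank $squeezed_lines $first_reversed"
rm -f "$sample"
```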

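   ● more/less: page through a file
       ○ Scroll: Space, PgDn/PgUp
   ● head/tail
       ○ Without options, head prints the first 10 lines and tail the last 10
       ○ head -n 3 file: first 3 lines
       ○ tail -n 3 file: last 3 lines
       ○ Follow a log as it grows: tail -f file (-f = follow)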
   ● cut
       ○ Extract selected columns:
           § cut -d: -f1,3 /etc/passwd (delimiter is a colon; take fields 1 and 3)
           § cut -d: -f1,3-5 /etc/passwd (fields 1 and 3 through 5)
           § df -h | cut -c34-36 (characters 34 through 36 of each line)
           § Get the IP address:
               □ CentOS 7: ifconfig | head -n 2 | tail -n 1 | cut -dt -f2 | cut -d" " -f2
               □ CentOS 6: ifconfig | head -n 2 | tail -n 1 | cut -d: -f2 | cut -d" " -f1
               □ CentOS 7, squeezing spaces first: ifconfig | head -n 2 | tail -n 1 | tr -s " " | cut -d" " -f3
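The field numbering is easy to get wrong when a line starts with the delimiter, because cut counts the empty text before the first delimiter as field 1. The following sketch runs cut on made-up sample data (hypothetical users and a hypothetical CentOS 7 style inet line) instead of the live system.

```shell
# Sample data in /etc/passwd format (hypothetical users).
passwd_sample='root:x:0:0:root:/root:/bin/bash
wang:x:1000:1000:Wang:/home/wang:/bin/bash'

# Fields 1 and 3 (user name and UID); cut keeps the ":" between them.
name_uid=$(printf '%s\n' "$passwd_sample" | cut -d: -f1,3)
echo "$name_uid"

# A CentOS 7 style "inet" line as ifconfig prints it (made-up address).
inet_line='        inet 192.168.10.6  netmask 255.255.255.0  broadcast 192.168.10.255'

# Squeeze runs of spaces, then take field 3: field 1 is the empty text
# before the leading space and field 2 is the word "inet".
ip=$(printf '%s\n' "$inet_line" | tr -s ' ' | cut -d' ' -f3)
echo "$ip"
```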

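   ● SUID (4): the process runs with the privileges of the program file's owner
   ● SGID (2): the process runs with the privileges of the program file's group; on a directory, new files inherit the directory's group
   ● Sticky bit (1): in a directory with this bit set, users can delete only their own files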

   ● Partition usage:
       ○ df | tr -s ' ' '%' | cut -d "%" -f5 (turn each space into "%", squeeze repeated "%", then take field 5)

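   ● paste: merge files side by side, joining lines that share a line number
       ○ paste file1 file2: horizontal merge
       ○ cat file1 file2: vertical merge (concatenation)
       ○ paste -d ":" file1 file2: use ":" as the separator
       ○ paste -s file1 file2: join all lines of each file into a single line (one output line per file)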
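A short sketch of the two paste modes, using two throwaway three-line files (the file names and contents are made up for the example):

```shell
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/file1"
printf '1\n2\n3\n' > "$workdir/file2"

# Side by side: line N of file1 joined to line N of file2 with ":".
side_by_side=$(paste -d ':' "$workdir/file1" "$workdir/file2")
# -s serializes: each input file is folded into one output line.
serial=$(paste -s -d ',' "$workdir/file1" "$workdir/file2")

printf '%s\n' "$side_by_side" "$serial"
rm -rf "$workdir"
```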

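   ● wc (word count): prints line, word, and byte counts followed by the file name
       ○ wc -l file: lines
       ○ wc -w file: words
       ○ wc -c file: bytes
       ○ wc -m file: characters
       ○ wc -L file: length of the longest line
       ○ Number of files in a directory: ls | wc -l
       ○ Number of user accounts: cat /etc/passwd | wc -l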
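As a quick check of the three most common counters, the sketch below feeds wc a two-line sample through a pipe. The tr -d ' ' step is only there because some wc implementations pad the number with spaces.

```shell
text='one two
three'

lines=$(printf '%s\n' "$text" | wc -l | tr -d ' ')   # newline characters
words=$(printf '%s\n' "$text" | wc -w | tr -d ' ')   # whitespace-separated words
bytes=$(printf '%s\n' "$text" | wc -c | tr -d ' ')   # bytes, newlines included

echo "lines=$lines words=$words bytes=$bytes"
```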


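   ● httpd: web service that serves web pages
       ○ Start: service httpd start
       ○ Address: ip a
       ○ Listening ports: ss -ntl
       ○ Firewall rules: iptables -vnL
       ○ Stop the firewall: service iptables stop
       ○ Keep the firewall from starting at boot: chkconfig iptables off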

   ● Sorting
       ○ sort file: sort lines as text (character order)
       ○ sort -t":" -k3 -n -r /etc/passwd: split on colons, key on field 3, numeric, reversed
       ○ cut -d: -f3 /etc/passwd | sort -nr > uid.txt
       ○ Remove duplicates: sort -u file
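The sketch below applies the numeric, reversed, field-based sort to a few made-up passwd-style lines. It writes the key as -k3,3 so that the key stops at field 3 rather than running to the end of the line; that is a small variation on the command above, not something the notes prescribe.

```shell
passwd_sample='root:x:0:0:root:/root:/bin/bash
wang:x:1000:1000:Wang:/home/wang:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin'

# Sort on the UID (field 3) as a number, largest first, then show the names.
by_uid=$(printf '%s\n' "$passwd_sample" | sort -t: -k3,3 -n -r | cut -d: -f1)
echo "$by_uid"

# sort -u sorts and drops duplicate lines in one step.
unique=$(printf 'b\na\nb\n' | sort -u)
echo "$unique"
```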

   ● Generating test data
       ○ echo {1..55}
       ○ echo {1..55} | tr ' ' '\n' (one number per line)
       ○ echo {1..55} | tr ' ' '\n' | sort -R | head -n 1 (random draw)
       ○ seq 55 | sort -R | head -n 1 (seq already prints one number per line)
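The random draw can be wrapped in a couple of lines. This sketch assumes seq and the -R (random order) option of sort are available, as they are with GNU coreutils; the result differs on every run, so only its range can be stated.

```shell
# Shuffle the numbers 1..55 and keep the first one as the winner.
winner=$(seq 55 | sort -R | head -n 1)
echo "winner: $winner"
```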


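   ● Simulating many users hitting the server
   ● Load test: ab -c 100 -n 2000 http://192.168.10.6/index.html
   ● Access log: cat /var/log/httpd/access_log (the first field of each line is the client IP)
   ● Number of requests (PV, page views): wc -l /var/log/httpd/access_log
   ● Which address sends the most requests (possible attack):
       ○ cut -d ' ' -f 1 access_log | sort | uniq -c (requests per address; uniq only merges adjacent lines, so sort first)
       ○ Top three:
           § cut -d ' ' -f 1 access_log | sort | uniq -c | sort -nr | tr -s ' ' | cut -d ' ' -f 3 | head -n 3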
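To see the ranking pipeline end to end without a running web server, the sketch below builds a six-line log with made-up addresses in the usual Apache access log layout and ranks the clients by request count.

```shell
# Made-up access log; field 1 of each line is the client IP.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [02/Jan/2021:10:00:00 +0800] "GET /index.html HTTP/1.0" 200 12
10.0.0.2 - - [02/Jan/2021:10:00:01 +0800] "GET /index.html HTTP/1.0" 200 12
10.0.0.1 - - [02/Jan/2021:10:00:02 +0800] "GET /index.html HTTP/1.0" 200 12
10.0.0.3 - - [02/Jan/2021:10:00:03 +0800] "GET /index.html HTTP/1.0" 200 12
10.0.0.1 - - [02/Jan/2021:10:00:04 +0800] "GET /index.html HTTP/1.0" 200 12
10.0.0.2 - - [02/Jan/2021:10:00:05 +0800] "GET /index.html HTTP/1.0" 200 12
EOF

# sort groups equal addresses so uniq -c can count them; sort -nr ranks by count.
ranking=$(cut -d ' ' -f1 "$log" | sort | uniq -c | sort -nr)
echo "$ranking"

# uniq -c pads the count with spaces: after squeezing, field 1 is empty,
# field 2 is the count and field 3 is the address.
top_ip=$(printf '%s\n' "$ranking" | head -n 1 | tr -s ' ' | cut -d ' ' -f3)
echo "$top_ip"
rm -f "$log"
```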


   ● last: login history
       ○ Logins per account: last | sort | cut -d ' ' -f 1 | uniq -c


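   ● uniq
       ○ Show duplicated lines: sort file | uniq -d
       ○ Show how often each line occurs: sort file | uniq -c
       ○ Lines common to two files: cat file1 file2 | sort | uniq -d
       ○ Lines found in only one of two files: cat file1 file2 | sort | uniq -u
       ○ Both two-file commands assume that no line is repeated within a single file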
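A small sketch of the two-file comparison, with made-up file contents. It relies on each file listing every line only once; otherwise a line duplicated inside one file would be reported as common.

```shell
workdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$workdir/file1"
printf 'banana\ncherry\ndate\n' > "$workdir/file2"

# A line seen twice in the combined, sorted stream is present in both files.
common=$(cat "$workdir/file1" "$workdir/file2" | sort | uniq -d)
# A line seen once is present in only one of the files.
different=$(cat "$workdir/file1" "$workdir/file2" | sort | uniq -u)

printf 'common:\n%s\ndifferent:\n%s\n' "$common" "$different"
rm -rf "$workdir"
```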

   ● Random number:
       ○ echo $((RANDOM % 55 + 1)) (bash; yields a number from 1 to 55)

The three core text-processing tools:
   ● grep: filters text
   ● sed: filters text and edits files
   ● awk: supports loops and branches; prints reports

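   ● grep root /etc/passwd: lines containing "root"
   ● grep `whoami` /etc/passwd: entry for the current user
   ● grep $USER /etc/passwd: entry for the current user
   ● grep -v root /etc/passwd: lines that do not contain "root"
   ● grep -n root /etc/passwd: prefix each match with its line number
   ● grep -c root /etc/passwd: count the matching lines
   ● grep -q root /etc/passwd: quiet, prints nothing; check the exit status in $? (0 means a match was found)
   ● /dev/null: a discard device that swallows anything written to it; do not delete it
   ● grep -n -A 3 root /etc/passwd: also print 3 lines after each match (after)
   ● grep -n -B 3 root /etc/passwd: also print 3 lines before each match (before)
   ● grep -n -C 3 root /etc/passwd: also print 3 lines before and after each match (context)
   ● grep root /etc/passwd | grep wang: AND
   ● grep -e root -e wang /etc/passwd: OR
   ● grep -w root /etc/passwd: lines containing "root" as a whole word
   ● nmap: network scanner
       ○ Scan for all hosts that are powered on: nmap -v -sP 172.16.10.0/24
       ○ Keep only the hosts that are up: nmap -v -sP 172.16.10.0/24 | grep -B1 up | grep scan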
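The grep options listed above (-c, -w, -q, -v) can be tried safely on sample text. The sketch below uses three made-up passwd-style lines, including a hypothetical user "rooter" to show what -w changes.

```shell
passwd_sample='root:x:0:0:root:/root:/bin/bash
wang:x:1000:1000:Wang:/home/wang:/bin/bash
rooter:x:1001:1001::/home/rooter:/bin/bash'

# -c counts matching lines: "root" occurs in the first and third line.
substring_hits=$(printf '%s\n' "$passwd_sample" | grep -c root)
# -w accepts "root" only as a whole word, so "rooter" no longer counts.
word_hits=$(printf '%s\n' "$passwd_sample" | grep -cw root)
# -q prints nothing; the exit status (0 = found) carries the answer.
if printf '%s\n' "$passwd_sample" | grep -q wang; then found=yes; else found=no; fi
# -v inverts the match and keeps the lines without "root".
without_root=$(printf '%s\n' "$passwd_sample" | grep -v root | cut -d: -f1)

echo "$substring_hits $word_hits $found $without_root"
```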

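   ● Regular expressions:
       ○ Basic regular expressions: BRE
       ○ Extended regular expressions: ERE
       ○ grep uses BRE by default; -E switches to ERE
       ○ Regex engine: PCRE (Perl Compatible Regular Expressions)
       ○ Building blocks: character matching, repetition counts, position anchors, grouping
       ○ Character matching:
       . matches any single character
           § grep "r..t" /etc/passwd
       [] matches any single character in the given set
           - grep "[xyz]" /etc/passwd - contains x, y, or z
           - grep "[x-z]" /etc/passwd - contains any character from x to z (case-sensitive)
       [^] matches any single character not in the given set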

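       [:alnum:] letters and digits
       [:alpha:] letters, upper and lower case (A-Z, a-z)
       [:lower:] lowercase letters [:upper:] uppercase letters
       [:blank:] blank characters (space and tab)
       [:space:] horizontal and vertical whitespace (a superset of [:blank:])
       [:cntrl:] non-printing control characters (backspace, delete, bell...)
       [:digit:] decimal digits [:xdigit:] hexadecimal digits
       [:graph:] printable characters other than space
       [:print:] printable characters, including space
       [:punct:] punctuation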
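The class names only work inside a bracket expression, so a digit is written [[:digit:]], not [:digit:]. A minimal sketch on made-up sample lines:

```shell
sample='abc
ABC
123
a b
x_y'

# Lines containing at least one decimal digit.
digits=$(printf '%s\n' "$sample" | grep '[[:digit:]]')
# Lines containing a blank (space or tab).
blanks=$(printf '%s\n' "$sample" | grep '[[:blank:]]')
# Lines containing a character that is neither a letter nor a digit.
other=$(printf '%s\n' "$sample" | grep '[^[:alnum:]]')

printf '%s\n' "$digits" "$blanks" "$other"
```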

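       * matches the preceding character any number of times, including zero
           - grep "go*gle" testfile: any number of o's
       Greedy mode: match as much as possible
       .* any characters, any length
       \? matches the preceding character 0 or 1 times
           - grep "go\?gle" testfile: zero or one o
       \+ matches the preceding character at least once

       \{n\} matches the preceding character exactly n times
       \{m,n\} matches the preceding character at least m and at most n times
           - grep "go\{2,7\}gle" testfile: 2 to 7 o's
       \{,n\} matches the preceding character at most n times
       \{n,\} matches the preceding character at least n times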
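The repetition operators can be compared on a short word list. This sketch uses basic regular expressions, where * is written plainly while ?, + and {} take a backslash; \? and \+ are GNU grep extensions, so GNU grep is assumed. The word list is made up.

```shell
words='ggle
gogle
google
gooogle'

star=$(printf '%s\n' "$words" | grep -c 'go*gle')           # zero or more o: all four words
optional=$(printf '%s\n' "$words" | grep -c 'go\?gle')      # zero or one o: ggle, gogle
plus=$(printf '%s\n' "$words" | grep -c 'go\+gle')          # one or more o: gogle, google, gooogle
bounded=$(printf '%s\n' "$words" | grep -c 'go\{2,7\}gle')  # two to seven o: google, gooogle

echo "$star $optional $plus $bounded"
```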

       ○ Get the IP addresses: ifconfig | grep --color -wo "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}"
       ○ Same, written with a group: ifconfig | grep --color -wo "\([0-9]\{1,3\}\.\)\{3\}[0-9]\{1,3\}"
       ○ Extract 2018-01-12 03:48:46 (tr replaces the "T" with a space):
           - ffmpeg -i "D:\Videos\20180112_114837.mp4" 2>&1 | grep -wo "[0-9]\{4\}[-][0-9]\{2\}[-][0-9]\{2\}[T][0-9]\{2\}[:][0-9]\{2\}[:][0-9]\{2\}" | tr "T" " " | uniq
       ○ Extract 00:00:08:
           - ffmpeg -i "D:\Videos\20180112_114837.mp4" 2>&1 | grep Duration | grep -wo "[0-9]\{2\}[:][0-9]\{2\}[:][0-9]\{2\}"
       ○ Extract 2018-01-12:
           - ffmpeg -i "D:\Videos\20180112_114837.mp4" 2>&1 | grep -wo "[0-9]\{4\}[-][0-9]\{2\}[-][0-9]\{2\}" | uniq
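These patterns can be checked without ifconfig or ffmpeg by feeding grep some captured text. Both samples below are made up: a CentOS 7 style ifconfig excerpt and a metadata line in the style ffmpeg prints. Note that the IP pattern only checks the shape, so it would also accept an impossible address such as 999.999.999.999.

```shell
# Made-up ifconfig excerpt.
ifconfig_sample='eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.10.6  netmask 255.255.255.0  broadcast 192.168.10.255'

# Three groups of "1-3 digits plus a literal dot", then a final 1-3 digits.
addresses=$(printf '%s\n' "$ifconfig_sample" | grep -wo '\([0-9]\{1,3\}\.\)\{3\}[0-9]\{1,3\}')
ip=$(printf '%s\n' "$addresses" | head -n 1)
address_count=$(printf '%s\n' "$addresses" | wc -l | tr -d ' ')

# Made-up metadata line.
meta='    creation_time   : 2018-01-12T03:48:46.000000Z'
stamp=$(printf '%s\n' "$meta" \
    | grep -o '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}T[0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}' \
    | tr 'T' ' ')

echo "$ip ($address_count addresses found), created $stamp"
```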

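   ● Lines starting with root: grep "^root" /etc/passwd
   ● Lines ending with nologin: grep "nologin$" /etc/passwd
   ● Empty lines: grep "^$" file
   ● Hide empty lines: grep -v "^$" file
   ● Hide comment lines and empty lines: grep -v "^#" /etc/fstab | grep -v "^$"
   ● Lines containing a given whole word: grep "\<root\>" /etc/passwd
   ● Lines starting with one or more whitespace characters: grep "^[[:space:]]\+linux16" /boot/grub2/grub.cfg | grep -v rescue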
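A brief sketch of the anchors on made-up text: a four-line fstab-like sample for ^ and $, and two words for the \< \> word boundaries (a GNU grep feature, assumed available here).

```shell
fstab_sample='# sample fstab

/dev/sda1 / xfs defaults 0 0
/dev/sda2 swap swap defaults 0 0'

# Drop comment lines, then count the lines that are not empty.
active_count=$(printf '%s\n' "$fstab_sample" | grep -v '^#' | grep -vc '^$')
# ^ ties the match to the start of the line, so "/root" later on does not count.
starts_root=$(printf 'root:x:0\nbin:x:1:/root\n' | grep -c '^root')
# \< and \> tie the match to word boundaries, so "rooter" is skipped.
whole_word=$(printf 'root\nrooter\n' | grep -c '\<root\>')

echo "$active_count $starts_root $whole_word"
```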

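   ● Grouping: "root" repeated two or more times: echo rootrootroot | grep "\(root\)\{2,\}"
   ● Back-references: a line that starts with root and contains root again later
               □ grep "^\(root\).*\1.*" /etc/passwd
               □ grep "\(r..t\).*\1" file (\1 repeats whatever the group matched; very useful for search and replace)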
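The point of \1 is that it must repeat the exact text the group captured, not merely match the same pattern again. The sample lines below are made up to show the difference.

```shell
lines='root:x:0:0:root:/root:/bin/bash
r00t and r00t again
raft then rift'

# \(r..t\) captures four characters; \1 demands the same four characters again.
# "raft ... rift" fits r..t twice but with different text, so it is not counted.
repeated=$(printf '%s\n' "$lines" | grep -c '\(r..t\).*\1')
echo "$repeated"
```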
   ● Partition usage (raise an alert above 80%):
       ○ df -h | grep "^/dev/sda" | grep -o "[0-9]\{1,3\}%" | grep -o "[0-9]\{1,3\}"
       ○ df -h | grep "^/dev/sda" | grep -o "[[:digit:]]\+%" | grep -o "[[:digit:]]\+"
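One way to turn the extracted percentages into an alert is a small loop, sketched here against made-up df -h output. The variable names and the message format are hypothetical; a real script would read from df itself. The \+ form is a GNU grep extension.

```shell
# Made-up "df -h" output.
df_sample='Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   45G  5.0G  90% /
/dev/sda2       100G   20G   80G  20% /data
tmpfs           1.9G     0  1.9G   0% /dev/shm'

threshold=80   # alert level in percent
alerts=""

# Keep the /dev/sda lines, isolate "NN%", then strip the percent sign.
for percent in $(printf '%s\n' "$df_sample" \
        | grep '^/dev/sda' \
        | grep -o '[[:digit:]]\+%' \
        | grep -o '[[:digit:]]\+'); do
    if [ "$percent" -gt "$threshold" ]; then
        alerts="${alerts}${percent}% "
    fi
done

echo "partitions above ${threshold}%: $alerts"
```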

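   ● Starts with a or b: echo aXX bXX | grep -o "\(a\|b\)XX"

   ● Everything above uses basic regular expressions, which are tedious because ?, +, {}, () and | each need a backslash; extended regular expressions drop the backslashes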

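   ● Extended regular expressions:

       ○ Get the IP addresses: ifconfig | grep -Ewo "([0-9]{1,3}\.){3}[0-9]{1,3}"
       ○ Find which architectures the packages on the system image require:
           - ls *.rpm | grep -Eo "\.(.*)\.rpm" | cut -d "." -f 2 | sort | uniq -c (the match starts at the first dot and greedy .* runs to the end, so field 2 is not the architecture)
           - ls *.rpm | grep -Eo "\.\<[[:alnum:]_]+\>\.rpm$" | cut -d "." -f 2 | sort | uniq -c
           - ls *.rpm | grep -Eo "\.[^.]+\.rpm$" | cut -d "." -f 2 | sort | uniq -c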
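The difference between the greedy first attempt and the [^.]+ version shows up clearly on a few file names. The package names below are made up in the usual name-version-release.arch.rpm layout and are fed to grep directly instead of coming from ls.

```shell
rpm_names='bash-4.2.46-31.el7.x86_64.rpm
zlib-1.2.7-18.el7.x86_64.rpm
tzdata-2018e-3.el7.noarch.rpm'

# Greedy attempt: the match begins at the first dot, so field 2 is a version digit.
greedy=$(printf '%s\n' "$rpm_names" | head -n 1 | grep -Eo '\.(.*)\.rpm' | cut -d '.' -f2)

# [^.]+ cannot cross a dot, so only the last component before ".rpm" is matched.
arch_counts=$(printf '%s\n' "$rpm_names" \
    | grep -Eo '\.[^.]+\.rpm$' \
    | cut -d '.' -f2 \
    | sort | uniq -c | tr -s ' ')

echo "greedy field: $greedy"
echo "$arch_counts"
```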
