Advanced Python 02: Text Processing and a Deeper Look at I/O
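1. A file contains words separated by spaces, semicolons, commas, or periods. Extract all of the words.
Solutions:
Use \w to match and extract words, but this misjudges some words (for example, \w does not match the apostrophe in i'm)
Use str.split to split the string, but this takes several passes, one per separator
Use re.split to split the string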
    In [4]: help(re.split)
    Help on function split in module re:

    split(pattern, string, maxsplit=0, flags=0)
        Split the source string by the occurrences of the pattern,
        returning a list containing the resulting substrings.
    In [23]: text = "i'm xj, i love Python,,Linux; i don't like windows."

    In [24]: fs = re.split(r"(,|\.|;|\s)+\s*", text)

    In [25]: fs
    Out[25]: ["i'm", ' ', 'xj', ' ', 'i', ' ', 'love', ' ', 'Python', ',', 'Linux', ' ', 'i', ' ', "don't", ' ', 'like', ' ', 'windows', '.', '']

    In [26]: fs[::2]   # extract the words
    Out[26]: ["i'm", 'xj', 'i', 'love', 'Python', 'Linux', 'i', "don't", 'like', 'windows', '']

    In [27]: fs[1::2]  # extract the separators
    Out[27]: [' ', ' ', ' ', ' ', ',', ' ', ' ', ' ', ' ', '.']

    In [53]: fs = re.findall(r"[^,\.;\s]+", text)

    In [54]: fs
    Out[54]: ["i'm", 'xj', 'i', 'love', 'Python', 'Linux', 'i', "don't", 'like', 'windows']

    In [55]: fh = re.findall(r'[,\.;\s]', text)

    In [56]: fh
    Out[56]: [' ', ',', ' ', ' ', ' ', ',', ',', ';', ' ', ' ', ' ', ' ', '.']

Because the pattern passed to re.split contains a capturing group, the separators are kept in the result; only the last character matched by the group is captured for each run of separators. The trailing empty string appears because the text ends with a separator.
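The first two approaches listed above are not shown in the session, so here is a minimal sketch of all three side by side. It assumes Python 3; the variable names (`by_word_chars`, `normalised`, `words`) are illustrative and not from the original article.

```python
import re

text = "i'm xj, i love Python,,Linux; i don't like windows."

# \w+ does not match the apostrophe, so "i'm" and "don't" are broken apart.
by_word_chars = re.findall(r"\w+", text)
print(by_word_chars)

# str-only approach: one replace pass per separator, then a whitespace split.
normalised = text
for sep in ",;.":
    normalised = normalised.replace(sep, " ")
print(normalised.split())

# re.split with a character class (no capturing group) returns only the words;
# the filter drops the empty string left by the trailing period.
words = [w for w in re.split(r"[,.;\s]+", text) if w]
print(words)
```

Using a character class instead of a capturing group avoids having to slice the result with `fs[::2]` when the separators themselves are not needed.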
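2. A directory contains a number of files. Find all of the C source files (.c and .h) in it.
Solutions:
Use os.listdir to list the directory
Use str.endswith to test the file suffix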
    In [13]: s = "xj.c"

    In [14]: s.endswith(".c")
    Out[14]: True

    In [15]: s.endswith(".h")
    Out[15]: False

    In [16]: import os

    In [17]: os.listdir("/usr/include/")
    Out[17]:
    ['libmng.h',
     'netipx',
     'ft2build.h',
     'FlexLexer.h',
     'selinux',
     'QtSql',
     'resolv.h',
     'gio-unix-2.0',
     'wctype.h',
     'python2.6',
     'scsi',
     ...
     'QtOpenGL',
     'mysql',
     'byteswap.h',
     'xj.c',
     'mntent.h',
     'semaphore.h',
     'stdio_ext.h',
     'libxml2']

    In [21]: for filename in os.listdir("/usr/include"):
       ....:     if filename.endswith(".c"):
       ....:         print filename
       ....:
    xj.c

    In [22]: for filename in os.listdir("/usr/include"):
       ....:     if filename.endswith((".c", ".h")):  # a tuple means "or": any one suffix may match
       ....:         print filename
       ....:
    libmng.h
    ft2build.h
    FlexLexer.h
    nss.h
    png.h
    utime.h
    ieee754.h
    features.h
    xj.c
    ...
    verto-module.h
    semaphore.h
    stdio_ext.h
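The loop above can be wrapped in a small helper. This is a sketch under stated assumptions: it targets Python 3, `list_c_sources` is a hypothetical name, and the extra `os.path.isfile` check (not in the original session) is added so that a directory whose name happens to end in `.h` is skipped.

```python
import os
import tempfile


def list_c_sources(directory):
    """Return the sorted names of regular files in `directory` ending in .c or .h."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.endswith((".c", ".h"))  # tuple: either suffix matches
        and os.path.isfile(os.path.join(directory, name))
    )


# Demo on a throwaway directory so the sketch runs on any machine.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("xj.c", "stdio_ext.h", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    os.mkdir(os.path.join(tmp, "include.h"))  # a directory, not a source file
    print(list_c_sources(tmp))  # ['stdio_ext.h', 'xj.c']
```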
3. The fnmatch module
It supports the same wildcards as the shell.
    In [24]: help(fnmatch.fnmatch)  # case sensitivity follows the operating system
    Help on function fnmatch in module fnmatch:

    fnmatch(name, pat)
        Test whether FILENAME matches PATTERN.

        Patterns are Unix shell style:

        *       matches everything
        ?       matches any single character
        [seq]   matches any character in seq
        [!seq]  matches any char not in seq

        An initial period in FILENAME is not special.
        Both FILENAME and PATTERN are first case-normalized
        if the operating system requires it.
        If you don't want this, use fnmatchcase(FILENAME, PATTERN).

    In [47]: fnmatch.fnmatch("sba.txt", "*txt")
    Out[47]: True

    In [48]: fnmatch.fnmatch("sba.txt", "*t")
    Out[48]: True

    In [49]: fnmatch.fnmatch("sba.txt", "*b")
    Out[49]: False

    In [50]: fnmatch.fnmatch("sba.txt", "*b*")
    Out[50]: True
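Example: You have a program that processes files. The file names are entered by the user, and you need to support the same wildcards as the shell.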
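    # cat test1.py
    #!/usr/local/bin/python2.7
    #coding: utf-8

    import os
    import sys
    from fnmatch import fnmatch

    # argv[1] is the directory to scan, argv[2] is the wildcard pattern
    ret = [name for name in os.listdir(sys.argv[1]) if fnmatch(name, sys.argv[2])]
    print ret

    # python2.7 test1.py /usr/include/ '*.c'
    ['xj.c']

The pattern is quoted so that the shell passes it to the script unchanged instead of expanding it against the current directory.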
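The help text above mentions `fnmatchcase`, and the module also provides `fnmatch.filter` for matching a whole list at once. The following sketch assumes Python 3 and uses a made-up list of names rather than a real directory listing.

```python
import fnmatch

# Hypothetical file names, standing in for the result of os.listdir().
names = ["xj.c", "README.C", "semaphore.h", "stdio_ext.h", "notes.txt"]

# fnmatch.filter applies one pattern to every name. Like fnmatch.fnmatch,
# it follows the operating system's case rules, so whether "README.C"
# is included depends on the platform.
c_files = fnmatch.filter(names, "*.c")
print(c_files)

# fnmatchcase always compares case-sensitively, whatever the platform.
strict = [n for n in names if fnmatch.fnmatchcase(n, "*.c")]
print(strict)  # ['xj.c']

# A character class works as in the shell: names ending in .c or .h.
c_or_h = [n for n in names if fnmatch.fnmatchcase(n, "*.[ch]")]
print(c_or_h)  # ['xj.c', 'semaphore.h', 'stdio_ext.h']
```

When results must be identical on every platform, `fnmatchcase` is the safer choice; `fnmatch` is the closer match to what a user of the local shell would expect.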
4. re.sub(): text replacement
    In [53]: help(re.sub)
    Help on function sub in module re:

    sub(pattern, repl, string, count=0, flags=0)
        Return the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in string by the
        replacement repl.  repl can be either a string or a callable;
        if a string, backslash escapes in it are processed.  If it is
        a callable, it's passed the match object and must return
        a replacement string to be used.
Example: A text contains dates in %m/%d/%Y format, and you need to convert all of them to %Y-%m-%d.
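    In [55]: text = "Today is 11/08/2016, next class time 11/15/2016"

    In [56]: new_text = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)

    In [57]: new_text
    Out[57]: 'Today is 2016-11-08, next class time 2016-11-15'

The groups are month, day, and year in that order, so the replacement must be \3-\1-\2 (year-month-day). Writing \3-\2-\1 swaps month and day and turns 11/15/2016 into the invalid date 2016-15-11.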
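The help text for re.sub notes that repl may be a callable. One possible use, sketched below, is to validate each date while reformatting it. The sketch assumes Python 3; `DATE_RE` and `to_iso` are illustrative names, and the stricter pattern (one or two digits for month and day, four for the year) is an assumption rather than the article's pattern.

```python
import re
from datetime import datetime

# Stricter than (\d+)/(\d+)/(\d+): 1-2 digit month and day, 4 digit year.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def to_iso(match):
    """Rewrite one %m/%d/%Y match as %Y-%m-%d; leave impossible dates untouched."""
    try:
        parsed = datetime.strptime(match.group(0), "%m/%d/%Y")
    except ValueError:
        return match.group(0)  # e.g. month 13: keep the original text
    return parsed.strftime("%Y-%m-%d")


text = "Today is 11/08/2016, next class time 11/15/2016, not a date: 13/40/2016"
print(DATE_RE.sub(to_iso, text))
# Today is 2016-11-08, next class time 2016-11-15, not a date: 13/40/2016
```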
5. str.format: string formatting
Example: You need to build a small template engine. It needs no control logic, but it must fill a template with variables.
    In [71]: help(str.format)
    Help on method_descriptor:

    format(...)
        S.format(*args, **kwargs) -> string

        Return a formatted version of S, using substitutions from args and kwargs.
        The substitutions are identified by braces ('{' and '}').
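The case above can be sketched with str.format alone. This is a minimal illustration assuming Python 3; `TEMPLATE` and `render` are hypothetical names, not part of the original article.

```python
TEMPLATE = "Hello {name}, your next {course} class is on {date}."


def render(template, **variables):
    """Fill `template` with str.format; a missing variable raises KeyError."""
    return template.format(**variables)


print(render(TEMPLATE, name="xj", course="Python", date="2016-11-15"))
# Hello xj, your next Python class is on 2016-11-15.

# Doubled braces produce literal braces, and a format spec follows the colon.
print(render("{{{key}}} = {value:.2f}", key="pi", value=3.14159))
# {pi} = 3.14
```

Because the template is plain data, it can be loaded from a file and filled in at run time. Since there is no control logic, loops and conditionals have to stay in the calling code.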
This article comes from the "xiexiaojun" blog. Please keep this source link: http://xiexiaojun.blog.51cto.com/2305291/1870832