Convert a fixed-width file from text to CSV

Posted 2023-02-24

【Title】：Convert a fixed width file from text to csv 【Posted】：2015-04-18 03:58:36 【Question】：

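I have a large data file in text format, and I want to convert it to CSV by specifying the length of each column.

Number of columns = 5

Column lengths

[4 2 5 1 1]

Sample observations: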

aasdfh9013512
ajshdj 2445df

Expected output:

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f

【Question comments】：

【Answer 1】：

GNU awk (gawk) supports this directly through FIELDWIDTHS, for example:

gawk '{ $1 = $1 } 1' FIELDWIDTHS='4 2 5 1 1' OFS=, infile

Output:

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f

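【Comments】：

Nice! I didn't know about this feature. Big +1!

Related link: Reading Fixed-width Data

The FIELDWIDTHS parameter only worked for me once I installed and used gawk; this was on Ubuntu 14.04.3.

@Arthur: According to GNU awk's feature history, FIELDWIDTHS has been available since gawk 2.13, i.e. July 2010.

@Thor Yes, I'm sure that's right. But it doesn't matter if gawk isn't installed. At least for me, on Ubuntu 14.04.3, awk was installed but gawk was not.

@Arthur: Yes, this is a GNU awk (gawk) specific answer; I'll state that more clearly. For some reason, many Debian-derived systems use mawk as the default awk alternative, probably because it is faster.

【Answer 2】：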

I would use sed and capture groups of the given lengths:

$ sed -r 's/^(.{4})(.{2})(.{5})(.{1})(.{1})$/\1,\2,\3,\4,\5/' file
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
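
With many columns, writing one group per column by hand gets unwieldy, and sed only offers the back-references \1 to \9. As a rough sketch of the same capture-group idea, the pattern can be built from a list of widths in Python. The names build_pattern and to_csv_line are made up for this illustration and are not part of the answer above.

```python
import re


def build_pattern(widths):
    """Build a regex with one capture group per fixed-width column."""
    return re.compile("".join("(.{%d})" % w for w in widths))


def to_csv_line(line, widths, pattern=None):
    """Join the line's fields with commas, or return None if it is too short."""
    pattern = pattern or build_pattern(widths)
    match = pattern.match(line)
    return ",".join(match.groups()) if match else None


if __name__ == "__main__":
    widths = [4, 2, 5, 1, 1]  # column widths from the question
    for line in ["aasdfh9013512", "ajshdj 2445df"]:
        print(to_csv_line(line, widths))
```

Because the groups are read with match.groups() rather than numbered back-references, the same code should also cope with a list of 80 widths.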

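【Comments】：

First of all, thanks for answering this question. But in the actual file I have to split the data into 80 columns, and the sed command only works for up to 9 columns. Please help.

@AshishKumar Then you will probably have to use Thor's answer with awk.

【Answer 3】：

Here is a solution that works with regular awk (gawk is not required).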

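awk -v OFS=',' '{ print substr($0,1,4), substr($0,5,2), substr($0,7,5), substr($0,12,1), substr($0,13,1) }'

It uses awk's substr function to specify the start position and length of each field. OFS sets the output field separator (a comma in this case).

(Side note: this only works if the source data contains no commas. If it does, you have to escape them to produce proper CSV, which is beyond the scope of this question.)

Demo: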

echo 'aasdfh9013512
ajshdj 2445df' |
awk -v OFS=',' '{ print substr($0,1,4), substr($0,5,2), substr($0,7,5), substr($0,12,1), substr($0,13,1) }'

Output:

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
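
The side note about commas in the data can be handled by a CSV library instead of manual escaping. The following is a minimal sketch using Python's standard csv module, which quotes any field containing the delimiter; the function name fixed_to_csv and the second sample record (with a comma added) are assumptions made for this example only.

```python
import csv
import io


def fixed_to_csv(lines, widths):
    """Slice each line by the given widths and return properly quoted CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.rstrip("\n")
        fields, pos = [], 0
        for width in widths:
            fields.append(line[pos:pos + width])
            pos += width
        # csv.writer quotes fields that contain a comma or a quote character.
        writer.writerow(fields)
    return out.getvalue()


if __name__ == "__main__":
    sample = ["aasdfh9013512\n", "a,shdj 2445df\n"]
    print(fixed_to_csv(sample, [4, 2, 5, 1, 1]), end="")
```

With the default quoting mode, only fields that need it are wrapped in double quotes, so data without commas comes out the same as in the awk version.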

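【Comments】：

【Answer 4】：

Here is a generic way to do this in awk (an alternative to the FIELDWIDTHS option). The substring positions are not hard-coded; the script inserts commas wherever the user-supplied numbers say they are needed. It was written and tested in GNU awk. To use it, set the awk variable colLengh to the column lengths (as in the OP's example), separated by spaces.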

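awk -v colLengh="4 2 5 1 1" '
BEGIN{
  # Split the space-separated column lengths into arr; num is the column count.
  num=split(colLengh,arr," ")
}
{
  j=sum=0
  # A comma goes after every column except the last one.
  while(++j<num){
    if(length($0)>sum){
      sub("^.{"arr[j]+sum"}","&,")
    }
    sum+=arr[j]+1   # +1 accounts for the comma just inserted
  }
}
1
' Input_file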

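Explanation: In short, the awk variable colLengh holds the column lengths that determine where commas are inserted. The BEGIN section then splits it into the array arr, whose elements are those lengths.

In the main block, the variables j and sum are first reset to 0. A while loop then runs from j=1 until j reaches num. On each pass, provided the current line is longer than sum (otherwise a substitution would be pointless, so I added this extra check), sub replaces the characters from the start of the line up to the computed position with the same characters followed by a comma. For example, the pattern passed to sub is ^.{4} on the first pass and ^.{7} on the second, because the next comma has to go after the 7th character once the first comma is counted, and so on. The trailing 1 prints every line, whether it was edited or not.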
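
The running-offset bookkeeping described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a port of the answer: insert_commas is a hypothetical name, shift plays the role of the awk variable sum, and the loop deliberately stops before the last column so that no trailing comma is added.

```python
def insert_commas(line, widths):
    """Insert a comma after each column except the last, working left to right."""
    shift = 0  # characters consumed so far, including commas already inserted
    for width in widths[:-1]:
        cut = shift + width  # 4, 7, 13, 15 for the widths 4 2 5 1 1
        if len(line) <= cut:
            break  # line too short: nothing left to separate
        line = line[:cut] + "," + line[cut:]
        shift = cut + 1  # +1 for the comma that was just inserted
    return line


if __name__ == "__main__":
    for record in ["aasdfh9013512", "ajshdj 2445df"]:
        print(insert_commas(record, [4, 2, 5, 1, 1]))
```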

【Comments】：

【Answer 5】：

In case anyone is still looking for a solution, I have written a small script in Python. It is easy to use as long as you have Python 3.5.

https://github.com/just10minutes/FixedWidthToDelimited/blob/master/FixedWidthToDelimiter.py

"""
This script will convert Fixed width File into Delimiter File, tried on Python 3.5 only
Sample run: (Order of argument doesnt matter)
python ConvertFixedToDelimiter.py -i SrcFile.txt -o TrgFile.txt -c Config.txt -d "|"
Inputs are as follows
1. Input FIle - Mandatory(Argument -i) - File which has fixed Width data in it
2. Config File - Optional (Argument -c, if not provided will look for Config.txt file on same path, if not present script will not run)
    Should have format as
    FieldName,fieldLength
    eg:
    FirstName,10
    SecondName,8
    Address,30
    etc:
3. Output File - Optional (Argument -o, if not provided will be used as InputFIleName plus Delimited.txt)
4. Delimiter - Optional (Argument -d, if not provided default value is "|" (pipe))
"""
from collections import OrderedDict
import argparse
from argparse import ArgumentParser
import os.path
import sys


def slices(s, args):
    position = 0
    for length in args:
        length = int(length)
        yield s[position:position + length]
        position += length

def extant_file(x):
    """
    'Type' for argparse - checks that file exists but does not open.
    """
    if not os.path.exists(x):
        # Argparse uses the ArgumentTypeError to give a rejection message like:
        # error: argument input: x does not exist
        raise argparse.ArgumentTypeError("{0} does not exist".format(x))
    return x





parser = ArgumentParser(description="Please provide your Inputs as -i InputFile -o OutPutFile -c ConfigFile")
parser.add_argument("-i", dest="InputFile", required=True,    help="Provide your Input file name here, if file is on different path than where this script resides then provide full path of the file", metavar="FILE", type=extant_file)
parser.add_argument("-o", dest="OutputFile", required=False,    help="Provide your Output file name here, if file is on different path than where this script resides then provide full path of the file", metavar="FILE")
parser.add_argument("-c", dest="ConfigFile", required=False,   help="Provide your Config file name here,File should have value as fieldName,fieldLength. if file is on different path than where this script resides then provide full path of the file", metavar="FILE",type=extant_file)
parser.add_argument("-d", dest="Delimiter", required=False,   help="Provide the delimiter string you want",metavar="STRING", default="|")

args = parser.parse_args()

#Input file mandatory
InputFile = args.InputFile
#Delimiter by default "|"
DELIMITER = args.Delimiter

#Output file checks
if args.OutputFile is None:
    OutputFile = str(InputFile) + "Delimited.txt"
    print ("Setting Output file as "+ OutputFile)
else:
    OutputFile = args.OutputFile

#Config file check
if args.ConfigFile is None:
    if not os.path.exists("Config.txt"):
        print ("There is no Config File provided exiting the script")
        sys.exit()
    else:
        ConfigFile = "Config.txt"
        print ("Taking Config.txt file on this path as Default Config File")
else:
    ConfigFile = args.ConfigFile

fieldNames = []
fieldLength = []
myvars = OrderedDict()


with open(ConfigFile) as myfile:
    for line in myfile:
        name, var = line.partition(",")[::2]
        myvars[name.strip()] = int(var)
for key,value in myvars.items():
    fieldNames.append(key)
    fieldLength.append(value)

with open(OutputFile, 'w') as f1:
    fieldNames = DELIMITER.join(map(str, fieldNames))
    f1.write(fieldNames + "\n")
    with open(InputFile, 'r') as f:
        for line in f:
            # Strip the newline so a short record cannot carry it into the last field.
            rec = list(slices(line.rstrip("\n"), fieldLength))
            myLine = DELIMITER.join(map(str, rec))
            f1.write(myLine + "\n")
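
To see the core of the script without touching any files, here is a condensed sketch of its two steps: reading the FieldName,fieldLength configuration and slicing each record. The helper names parse_config and convert, and the sample field names and record, are assumptions made for this illustration; only slices mirrors a function from the script above.

```python
def parse_config(lines):
    """Read 'FieldName,fieldLength' lines into parallel lists of names and widths."""
    names, widths = [], []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines in the config
        name, _, length = line.partition(",")
        names.append(name.strip())
        widths.append(int(length))
    return names, widths


def slices(s, widths):
    """Yield consecutive substrings of s with the given widths."""
    position = 0
    for width in widths:
        yield s[position:position + width]
        position += width


def convert(config_lines, data_lines, delimiter="|"):
    """Return a header line followed by one delimited line per input record."""
    names, widths = parse_config(config_lines)
    out = [delimiter.join(names)]
    for line in data_lines:
        out.append(delimiter.join(slices(line.rstrip("\n"), widths)))
    return out


if __name__ == "__main__":
    for row in convert(["FirstName,4", "SecondName,2"], ["JohnDo\n"]):
        print(row)
```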

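【Comments】：

【Answer 6】：

Portable `awk`

Generate an awk script with the appropriate substr commands from a file, cols, that lists one column width per line:

cat cols

Output:

4
2
5
1
1

<cols awk '{ print "substr($0,"p","$1")"; cs+=$1; p=cs+1 }' p=1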

Output:

substr($0,1,4)
substr($0,5,2)
substr($0,7,5)
substr($0,12,1)
substr($0,13,1)

Join the lines and turn them into a valid awk script:

<cols awk '{ print "substr($0,"p","$1")"; cs+=$1; p=cs+1 }' p=1 |
paste -sd, | sed 's/^/{ print /; s/$/ }/'

Output:

{ print substr($0,1,4),substr($0,5,2),substr($0,7,5),substr($0,12,1),substr($0,13,1) }

Redirect the above to a file, e.g. /tmp/t.awk, and run it on the input file:

<infile awk -f /tmp/t.awk

Output:

aasd fh 90135 1 2
ajsh dj  2445 d f

Or with a comma as the output separator:

<infile awk -f /tmp/t.awk OFS=,

Output:

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
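
The generation step boils down to a running sum of the widths that yields each 1-based start position. As a hedged sketch of that arithmetic, the same awk program text can be produced in Python; awk_program is a hypothetical helper name, not something from the answer.

```python
def awk_program(widths):
    """Return awk program text that prints one substr() call per column."""
    calls, start = [], 1  # awk's substr() counts positions from 1
    for width in widths:
        calls.append("substr($0,%d,%d)" % (start, width))
        start += width  # the next column begins right after this one
    return "{ print " + ",".join(calls) + " }"


if __name__ == "__main__":
    print(awk_program([4, 2, 5, 1, 1]))
```

The printed text can be saved to a file and passed to awk -f in the same way as the generated script above.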

【Comments】：
