Parsing malformed JSON using Shell / Python

Posted 2023-02-23

【Title】: Parsing malformed JSON using Shell / Python 【Published】: 2021-10-22 10:29:22 【Question】:

I am trying to parse and flatten a JSON-like file that looks like this:

EventTime_t                : 2021-07-23T23:03:41.711Z
FileName_s                   : \\nb009\dfsroot\admin\usershares\klein\documents\importing ee data v5.0.pdf
FileAttributes_s : [
                                 {
                                   "Access": 70,
                                   "Count": 3,
                                   "FileType": "99c07caa-8fc4-4f94-b313-cb434493f900",
                                   "UniqueCount": null,
                                   "Attachment": null,
                                   "Name": "Employee Details - U.S."
                                 },
                                 {
                                   "Access": 93,
                                   "Count": 11,
                                   "FileType": "a44669fe-0d48-453d-a9b1-2cc83f2cba77",
                                   "UniqueCount": null,
                                   "Attachment": null,
                                   "Name": "Portable stack (BS)"
                                 }
                               ]
FileUpdatedBy_s             : 
FileUpdatedDate_t           : 2009-05-27T20:01:22Z

EventTime_t                : 2021-07-23T23:04:03.862Z
FileName_s                   : \\xdev1900.org\dfsroot\admin\usershares\klein\axn980\test management\bare cards to link.xlsx
FileAttributes_s : [
                                 {
                                   "Access": 85,
                                   "Count": 20,
                                   "FileType": "50842eb7-edc8-4019-85dd-5a5c1f2bb085",
                                   "UniqueCount": null,
                                   "Attachment": null,
                                   "Name": Plan Growth Number"
                                 }
                               ]
FileUpdatedBy_s             : Mike
FileUpdatedDate_t           : 1980-01-02T00:00:00Z

I wrote a bash script, but I don't like the way I wrote it.

#!/bin/bash

echo -n | tee col_1.txt col_2.txt col_3.txt col_4.txt col_5.txt col_6.txt col_7.txt col_8.txt col_9.txt col_10.txt 

cat full.json | grep '\"Access\"' | sed -e 's/  */ /g' -e 's/:/\n/g' | awk '!(/\"Access\"/ && seen[$0]++)' > col_1.txt
cat full.json | grep '\"Count\"' | sed -e 's/  */ /g' -e 's/:/\n/g' | awk '!(/\"Count\"/ && seen[$0]++)' > col_2.txt
cat full.json | grep '\"FileType\"' | sed -e 's/  */ /g' -e 's/:/\n/g' | awk '!(/\"FileType\"/ && seen[$0]++)' > col_3.txt
cat full.json | grep '\"UniqueCount\"' | sed -e 's/  */ /g' -e 's/:/\n/g' | awk '!(/\"UniqueCount\"/ && seen[$0]++)' > col_4.txt
cat full.json | grep '\"Attachment\"' | sed -e 's/  */ /g' -e 's/:/\n/g' | awk '!(/\"Attachment\"/ && seen[$0]++)' > col_5.txt
cat full.json | grep '\"Name\"' | sed -e 's/  */ /g' -e 's/:/\n/g' | awk '!(/\"Name\"/ && seen[$0]++)' > col_6.txt
 
paste -d ',' col_1.txt col_2.txt col_3.txt col_4.txt col_5.txt col_6.txt | pr -t -e20 > output1.txt
sed -i 's/"  *"/" "/g' output1.txt
sed -i 's/  */ /g' output1.txt

cat full.json | grep "EventTime_t" | sed -e 's/  */ /g' -e 's/:/\n/g' | awk '!(/EventTime_t/ && seen[$0]++)' > col_1.txt
cat full.json | grep "FileAttributes_s" | sed -e 's/  */ /g' -e 's/:/\n/g' | awk '!(/FileAttributes_s/ && seen[$0]++)' > col_2.txt
cat full.json | grep "FileUpdatedBy_s" | sed -e 's/  */ /g' -e 's/:/\n/g' | awk '!(/FileUpdatedBy_s/ && seen[$0]++)' > col_3.txt
cat full.json | grep "FileUpdatedDate_t" | sed -e 's/  */ /g' -e 's/:/\n/g' | awk '!(/FileUpdatedDate_t/ && seen[$0]++)' > col_4.txt

paste -d ',' col_6.txt col_7.txt col_8.txt col_9.txt | pr -t -e20 > output2.txt
sed -i 's/"  *"/" "/g' output2.txt
sed -i 's/  */ /g' output2.txt

ln=( $(grep -n "EventTime_t" full.json | cut -d ':' -f 1) )
last_line=`wc -l full.json | cut -d ' ' -f 1`
ln+=($last_line)

cat /dev/null > fnl_output.csv
echo "1|" > col_10.txt
j=1;i=0;
while [ $i -lt ${#ln[*]} ];
do
  if [ -z "${ln[$j]}" ]; then
      paste col_10.txt output2.txt | pr -t -e20 > output3.txt
      sed -e 's/" *"/ /g' -e 's/  */ /g' -e 's/null//g' output3.txt | awk '!seen[$0]++' > output2.txt
      echo  -n | tee output3.txt
      while read line
      do
         count=`echo $line | cut -d'|' -f 1`
         txt=`echo $line | cut -d'|' -f 2`
         i=0
         while  [ $i -lt $count ]; do
            echo $txt >> output3.txt
            i=$(( $i + 1));
         done;
     done < output2.txt;
     paste -d ',' output3.txt output1.txt > fnl_output.csv

     rm -f ./*.txt
     exit 0;
   else
      if [ ${ln[$i]} -lt ${ln[$j]} ]; then
         start=${ln[$i]};
         end=`expr ${ln[$j]} - 1`;
         sed -n "${start},${end}p" full.json > $start.txt
         ln1=( $(grep -n '\"Access\"' $start.txt | cut -d ':' -f 1))
         t=`echo ${#ln1[*]}`
         echo $t"|" >> col_10.txt
         j=$(( $j + 1));
     fi
     i=$(( $i + 1));
  fi
done;

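I know this is a lame way to parse JSON, for many reasons: it handles key-value pairs by formatting each key individually, it may not cope with larger JSON files, it creates a lot of temporary files, and so on.

This JSON may also gain additional attributes that need to be parsed, so every time I find a new attribute I have to go back and update the code.

The output of this shell script should be a set of rows and columns in CSV format.

Can anyone help me achieve the same in Python? (I tried the same in Python using the 'json' and 'pandas' packages, but they do not recognize this data as valid JSON.) Note: the input file is currently about 50 to 100 MB in size, and it may grow in the future.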

Thanks, Lakshminarasu Chenduri

【Comments on the question】:

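This is not a "malformed JSON file"; it is simply a non-JSON format. That said, "How do I write a parser for data that roughly matches this sample?" is too broad to discuss here. "How do I write a parser?" is typically a 300-level computer science course (usually the first half of an introductory compiler design course). Describing how to do it well is the subject of a book. Describing how to do it badly... well, why would anyone teach that? On the other hand... you do have actual JSON data embedded. If you can pick the values apart from the keys, you can use a real JSON parser on the subset of values that actually are JSON. Part of the JSON data may also be broken. I just added an attempt to fix that string.

【Answer 1】: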

You need some custom parsing code. For example, we can first separate the events in the data from one another and then process each field.


STRING = "string"
LIST = "list"

FIELD_TYPE_MAPPING = {
    "EventTime_t" : STRING,
    "FileName_s" : STRING,
    "FileAttributes_s" : LIST,
    "FileUpdatedBy_s": STRING,
    "FileUpdatedDate_t": STRING
}


def process(data, fieldTypeMapping):
    output = {}
    mode = STRING
    linesOfList = []
    savedFieldName = None


    for line in data:
        # If we have : in a line, we may be starting a new field
        if ':' in line:
            components = line.split(':')
            fieldName = components[0].strip()

            if fieldName in fieldTypeMapping:
                if mode == LIST:
                    # Process the collected lines in the list field. (Here we just merge them into one line.)
                    # We should have something like output[savedFieldName] = handleDataForListField(linesOfList)
                    output[savedFieldName] = "".join(linesOfList)

                fieldType = fieldTypeMapping[fieldName]

                if fieldType == STRING:
                    # We could have a customized handling function based on the field name here
                    # Keep everything after the first colon and drop the surrounding padding
                    output[fieldName] = line[len(components[0]) + 1 :].strip()
                    mode = STRING
                elif fieldType == LIST:
                    mode = LIST
                    linesOfList = []
                    savedFieldName = fieldName


        if mode == LIST:
            linesOfList.append(line) # we could process the line here

    return output


# Main Part
isInProgress = False
startPattern = 'EventTime_t'
endPattern = 'FileUpdatedDate_t'

# Process line by line
data = []
events = []

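# Read the whole input; "full.json" is the file name used in the question
with open("full.json", encoding="utf-8") as f:
    text = f.read()

# Assume we process the data line by line
for line in text.split('\n'):
    if line.startswith(startPattern):
        isInProgress = True
        data = []  # start collecting a fresh event

    if isInProgress:
        data.append(line)

    if line.startswith(endPattern):
        isInProgress = False
        events.append(process(data, FIELD_TYPE_MAPPING))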

# Testing Part
for item in events:
    print(item)
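
The comment inside `process` suggests a dedicated handler for list fields. A minimal sketch of such a handler follows, assuming that the collected text is a well-formed JSON array once the `FieldName : ` prefix is dropped (that is, the list items are wrapped in `{ }` as in ordinary JSON). The name `handle_data_for_list_field` is hypothetical, and text that is not valid JSON is reported as `None` rather than repaired.

```python
import json


def handle_data_for_list_field(lines_of_list):
    """Parse the lines collected for a list field into Python objects.

    Assumption: the first collected line may still carry the 'FieldName : '
    prefix, so everything before the first '[' is discarded.
    Returns None when the remaining text is not valid JSON.
    """
    text = "\n".join(lines_of_list)
    start = text.find("[")
    if start == -1:
        return None
    try:
        return json.loads(text[start:])
    except json.JSONDecodeError:
        return None


# Illustrative input shaped like one FileAttributes_s block
sample = [
    'FileAttributes_s : [',
    '  {',
    '    "Access": 85,',
    '    "Count": 20,',
    '    "UniqueCount": null,',
    '    "Name": "Plan Growth Number"',
    '  }',
    ']',
]
attributes = handle_data_for_list_field(sample)
print(attributes)
```

Returning `None` on failure keeps the line-by-line loop running over a large file; the broken entries can then be logged or repaired separately.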

【Comments】:

【Answer 2】:

In fact, the "json" can be "fixed" using sed:

fixedInner=$(
    sed -re '
    s/\\/\\\\/g # escape
    s/^([^ :["]+) +: +"?(.{2,}|$)"?/"\1" : "\2",/ # double quote keys and values
    s/^([^ :["]+)( +): \[$/"\1"\2: [/ # double quote keys followed by [
    s/"EventTime_t.*/{\0/; s/("FileUpdatedDate_t.*),/\1},/ # add braces to object start and end
    # add , for alone ]
    s/ +[]] *$/],/ ' test.txt)
echo "[ ${fixedInner%?} ]" | jq -r '.'

Result:

[
  {
    "EventTime_t": "2021-07-23T23:03:41.711Z",
    "FileName_s": "\\\\nb009\\dfsroot\\admin\\usershares\\klein\\documents\\importing ee data v5.0.pdf",
    "FileAttributes_s": [
      {
        "Access": 70,
        "Count": 3,
        "FileType": "99c07caa-8fc4-4f94-b313-cb434493f900",
        "UniqueCount": null,
        "Attachment": null,
        "Name": "Employee Details - U.S."
      },
      {
        "Access": 93,
        "Count": 11,
        "FileType": "a44669fe-0d48-453d-a9b1-2cc83f2cba77",
        "UniqueCount": null,
        "Attachment": null,
        "Name": "Portable stack (BS)"
      }
    ],
    "FileUpdatedBy_s": "",
    "FileUpdatedDate_t": "2009-05-27T20:01:22Z"
  },
  {
    "EventTime_t": "2021-07-23T23:04:03.862Z",
    "FileName_s": "\\\\xdev1900.org\\dfsroot\\admin\\usershares\\klein\\axn980\\test management\\bare cards to link.xlsx",
    "FileAttributes_s": [
      {
        "Access": 85,
        "Count": 20,
        "FileType": "50842eb7-edc8-4019-85dd-5a5c1f2bb085",
        "UniqueCount": null,
        "Attachment": null,
        "Name": "Plan Growth Number"
      }
    ],
    "FileUpdatedBy_s": "Mike",
    "FileUpdatedDate_t": "1980-01-02T00:00:00Z"
  }
]
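
Since the question asks for CSV output, an array of events in this shape could be flattened so that each entry of `FileAttributes_s` becomes one row that repeats the event-level fields. The following is only a sketch using the standard library; the function names `flatten_events` and `to_csv` and the first-seen column order are assumptions, not part of the answer above.

```python
import csv
import io


def flatten_events(events, list_field="FileAttributes_s"):
    """Yield one flat dict per list entry, repeating the event-level fields."""
    for event in events:
        parent = {k: v for k, v in event.items() if k != list_field}
        # An event with a missing or empty list still produces a single row.
        children = event.get(list_field) or [{}]
        for child in children:
            yield {**parent, **child}


def to_csv(events, list_field="FileAttributes_s"):
    """Return the flattened events as CSV text with a header row."""
    rows = list(flatten_events(events, list_field))
    # Collect column names in first-seen order, so new attributes
    # show up as new columns without any code change.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative data; in practice `events` would come from json.load on the repaired file
events = [
    {
        "EventTime_t": "2021-07-23T23:04:03.862Z",
        "FileAttributes_s": [
            {"Access": 85, "Count": 20, "Name": "Plan Growth Number"},
        ],
        "FileUpdatedBy_s": "Mike",
    }
]
print(to_csv(events))
```

Because the columns are discovered from the data, attributes added to the input later do not require touching the script, which was one of the concerns raised in the question.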

I'll leave this one for you to fix :-) "Name": Plan Growth Number"
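
One way to patch a value like this before handing the text to a JSON parser is a line-level regular expression that adds the missing opening quote. This is only a sketch for this specific defect (a value that ends in `"` but does not start with one); the helper name `fix_missing_open_quote` is hypothetical, and other kinds of damage would need their own rules.

```python
import re

# Matches a line of the form:  "key": value"
# Group 1 = indentation, key and colon; group 2 = the value text that lacks
# its opening quote; group 3 = the closing quote plus an optional comma.
MISSING_OPEN_QUOTE = re.compile(r'^(\s*"[^"]+"\s*:\s*)([^"\s][^"]*)("\s*,?\s*)$')


def fix_missing_open_quote(line):
    """Insert the opening quote when a value only has a closing one."""
    return MISSING_OPEN_QUOTE.sub(r'\1"\2\3', line)


print(fix_missing_open_quote('  "Name": Plan Growth Number"'))
print(fix_missing_open_quote('  "Access": 85,'))  # left unchanged
```

Lines whose values are already quoted, numeric, or `null` do not match the pattern, so the function can be applied to every line of the input.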

【Comments】:
