Using a Batch script, how do I use regex to split up data in a .csv file?
【Posted】: 2019-10-09 22:08:51
【Question】: I have a .csv file (generated by exporting a googleDoc spreadsheet) that I need to extract information from. The information does not use a consistent delimiter.
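I currently use the comma (,) as the delimiter, which works fine when getting information from the first 4 columns.
However, when I want to extract information from the 8th column, I get the wrong data. This is because some cells contain 2 pieces of information separated by a comma.
Cells that contain 2 pieces of information are wrapped in double quotes ("), which gives data like 1,"2,3",4
My splitter cannot tell the difference between 1,2,3,4 and 1,"2,3",4, so the third value comes back as 3 for the first set and 3" for the second set, when it should return 4 for the second set (3 is the expected value for the first set).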
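To make the failure concrete, here is a minimal sketch in Python (used purely for illustration, it is not part of the batch script). The helper name third_value_naive is made up for this example; it splits on every comma the way a plain delims=, tokenizer does, and is compared with Python's quote-aware csv module.

```python
import csv

plain = '1,2,3,4'
quoted = '1,"2,3",4'

def third_value_naive(line):
    """Return the third piece after splitting on every comma, ignoring quotes."""
    return line.split(',')[2]

def third_value_quote_aware(line):
    """Return the third field using a parser that understands quoted fields."""
    return next(csv.reader([line]))[2]

print(third_value_naive(plain))         # 3
print(third_value_naive(quoted))        # 3"
print(third_value_quote_aware(quoted))  # 4
```

The naive version cuts the quoted cell in half, which is exactly the 3" result described above.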
Below is an excerpt of the .csv file I am working with.
A,SCONE,Shen ring,SHEN_RING,"FLOUR, BUTTER","BRONZE,GOLD",BLANK,Blank,,BLANK,
A,STRAWBERRIES_AND_CREAM,Cat1,CAT1,"STRAWBERRY, CREAM","OBSIDIAN,GOLD2",FS,FreeSpin,,FREE_SPIN,
A,WALNUT_TOFFEE,Pyramid,PYRAMID,"BUTTER, SUGAR, WALNUT","GOLD,EMERALD,PERIDOT",1,Champagne,Garnet,GARNET,
A,RASPBERRY_AND_LIME_JELLY,Cuff bracelet,CUFF_BRACELET,"RASPBERRY, JELLY, LIME","ZIRCON,BRONZE2,TOPAZ",2,Cocoa,Lapis lazuli,LAPIS_LAZULI,Blue
A,CHOCOLATE_CHIP_COOKIES,Nekhbet,NEKHBET,"SUGAR, FLOUR, BUTTER, CHOCOLATE_CHIPS, SALT","EMERALD,BRONZE,GOLD,ALEXANDRITE,SILVER",3,GoldLeaf,gold3,GOLD3,yellow
A,BUTTER_CREAM_CUP_CAKE,Sobek,SOBEK,"ICING_SUGAR, FLOUR, BUTTER, BUTTERCREAM","JADE,BRONZE,GOLD,GARNET2",4,Sugar,emerald,EMERALD,green
A,PEANUT_BUTTER_COOKIE,Sekhmet,SEKHMET,"PEANUT_BUTTER, FLOUR, SUGAR, BAKING_POWDER","GARNET1,BRONZE,AMAZONITE,EMERALD",5,IcingSugar,JADE,JADE,green
A,CHOCOLATE_MARSHMALLOWS,Osiris,OSIRIS,"MARSHMALLOW, CHOCOLATE_CHIPS","PLATINUM,ALEXANDRITE",6,Flour,Bronze,BRONZE,yellow
,,,,,,7,Butter,Gold,GOLD,yellow
B,BLUEBERRY_PIE,Ankh,ANKH,"BLUEBERRY, SUGAR, FLOUR, BUTTER","JADEITE,EMERALD,BRONZE,GOLD",8,ChocolateChips,Alexandrite,ALEXANDRITE,
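This is the for loop I currently use to extract the information. The outer FOR loop guards against empty fields so that the same column is always returned. The inner FOR loop stores the values in arrays.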
SET originalCol=8
SET newCol=10
SET startRow=2
SET lastRow=45
SET rowsToSkip=1
SET /a i=0
SET /a totalValues=0
SET /a maxLines=%lastRow%-%startRow%
FOR /f "skip=%rowsToSkip% delims=" %%L in (%fileLocation%) DO (
    set "line=%%L,,,,,,,,"
    set "line=#!line:,=,#!"
    FOR /f "tokens=1,%originalCol%,%newCol% delims=," %%F IN ("!line!") DO (
        set "param1=%%F"
        set "param2=%%G"
        set "param3=%%H"
        set "param1=!param1:~1!"
        set "param2=!param2:~1!"
        set "param3=!param3:~1!"
        IF NOT #!param1!# == ## (
            SET /a lineCounter=!i!+%startRow%
            SET /a totalValues=!i!
            SET originalValuesList[!i!]=!param2!
            SET newValuesList[!i!]=!param3!
            IF !i! == %maxLines% (
                goto :copyingCSVDataComplete
            ) ELSE (
                SET /a i+=1
            )
        )
    )
)
echo. originalValuesList [A] & echo [%originalValuesList[0]%, %originalValuesList[1]%, %originalValuesList[2]%, %originalValuesList[3]%, %originalValuesList[4]%, %originalValuesList[5]%, %originalValuesList[6]%, %originalValuesList[7]%]
echo.
echo. originalValuesList [B] & echo [%originalValuesList[8]%]
echo.
echo. newValuesList [A] & echo [%newValuesList[0]%, %newValuesList[1]%, %newValuesList[2]%, %newValuesList[3]%, %newValuesList[4]%, %newValuesList[5]%, %newValuesList[6]%, %newValuesList[7]%]
echo.
echo. newValuesList [B] & echo [%newValuesList[8]%]
Actual:
originalValuesList [A]
[GOLD", GOLD2", "GOLD, "ZIRCON, CHOCOLATE_CHIPS, BUTTERCREAM", BAKING_POWDER", ALEXANDRITE"]
originalValuesList [B]
[ BUTTER"]
newValuesList [A]
[Blank, FreeSpin, PERIDOT", TOPAZ", "EMERALD, BRONZE, BRONZE, Flour]
newValuesList [B]
[EMERALD]
Expected:
originalValuesList [A]
[Blank, FreeSpin, Champagne, Cocoa, GoldLeaf, Sugar, IcingSugar, Flour]
originalValuesList [B]
[ChocolateChips]
newValuesList [A]
[BLANK, FREE_SPIN, GARNET, LAPIS_LAZULI, GOLD3, EMERALD, JADE, BRONZE]
newValuesList [B]
[ALEXANDRITE]
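So what I want is to use the same code, but instead of splitting on the comma (,) delimiter, split based on a regex. Something like (,"([A-Z]*),") | (,)
Is it possible to use regex in Batch, and if so, how can I use it to split a string?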
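【Comments】:
What happens when you set the variables as set "param1=%%~F" and %%~G etc.?
@GerhardBarnard I'm not sure what you mean. Saving %%F into param1 lets me manipulate the string to remove the #. If you are asking what happens if I use %%~F instead of %%F, then the answer is that nothing changes.
I have now gone through my googleDoc, replaced all commas (,) with dashes (-), and re-exported it to a .csv file. That works around the problem because lines no longer break in the middle of a cell. However, if anyone has an answer to the question above, that would be great, and it may help others in the future :)
Your code has nothing to do with regex. So what is your question? Do you want to know how to fix your code? Or are you looking for a completely new approach that uses regex? Note that regex requires a third-party utility, PowerShell, or CSCRIPT (JScript or VBS).
@dbenham Ah, I thought I had included the regex, I must have forgotten. I will edit the question in a few seconds. I want to use the same code, but instead of splitting on the comma (,) delimiter, split based on a regex. Something like (,"([A-Z]*),") | (,)
Worth noting that I am new to regex, so this may be very wrong, but I hope you see what I mean haha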
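【Answer 1】:
First off, PowerShell has built-in features for parsing and manipulating CSV documents, so it would be a much better option. But I will stick with batch.
Regular expression solution
Regular expressions are not a good fit for a pure native batch solution, for two reasons:
1) It is impossible to alter the FOR /F behavior so that it parses tokens by regular expression. It is what it is: very limited.
2) To parse a file with FOR /F, you would need to manipulate each line before parsing it. Batch does not have any regular expression utility that can alter content. It only has FINDSTR, which can do a very crude regular expression search, but it always returns the original matching line. On top of that, FINDSTR regex is so crippled that I am not sure you could correctly parse a CSV with it anyway.
You could use custom JScript or VBScript via CSCRIPT to preprocess the file with a regular expression search and replace, in such a way that FOR /F can then parse the file. I have already written a hybrid JScript/batch regular expression processing utility called JREPL.BAT that works well for this.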
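A quoted CSV field can contain quote literals, in which case each quote is doubled. The following regex matches any CSV token (not including the comma delimiter): ("(?:""|[^"])*"|[^,"]*)
It looks for either an opening quote, followed by any number of doubled quotes and/or non-quote characters, followed by a closing quote; or any number of characters that are neither a quote nor a comma. But your CSV does not contain any doubled quote literals, so the regex can be simplified to ("[^"]*"|[^,"]*)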
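As a rough illustration of how that regex carves a line into tokens, here is a minimal sketch in Python (chosen only because batch has no regex engine of its own). The function names split_csv_line and unquote are made up for this example. It applies the full pattern, which also accepts doubled quote literals, and then strips the enclosing quotes.

```python
import re

# The token pattern from the text: a quoted field (with "" as a quote
# literal), or a run of characters that are neither a quote nor a comma.
TOKEN = re.compile(r'"(?:""|[^"])*"|[^,"]*')

def split_csv_line(line):
    """Split one CSV line (no embedded newlines) into raw tokens."""
    fields = []
    pos = 0
    while True:
        match = TOKEN.match(line, pos)   # always matches, possibly empty
        fields.append(match.group(0))
        pos = match.end()
        if pos >= len(line):
            return fields
        if line[pos] != ',':
            raise ValueError('malformed CSV near column %d' % pos)
        pos += 1                         # step over the comma delimiter

def unquote(token):
    """Drop enclosing quotes and undouble any quote literals."""
    if token.startswith('"'):
        return token[1:-1].replace('""', '"')
    return token

row = r'A,SCONE,Shen ring,SHEN_RING,"FLOUR, BUTTER","BRONZE,GOLD",blank,"This,""BLANK""",,BLANK,'
fields = [unquote(t) for t in split_csv_line(row)]
print(len(fields))   # 11
print(fields[7])     # This,"BLANK"
```

A trailing comma yields a final empty field, so this row produces 11 fields, two of them empty.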
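CSCRIPT has no mechanism for passing quote literals within arguments, so JREPL has an /XSEQ option that enables extended escape sequence support, including \q to represent ". Another option is to use the standard \x22 sequence. The command
JREPL "(\q[^\q]*\q|[^,\q]*)," "$1;" /XSEQ /F "test.csv"
matches any token (possibly empty) that is followed by a comma delimiter, preserves the token, and replaces the comma with a semicolon.
But that still leaves empty tokens, and FOR /F does not parse empty tokens properly. So I add a bit of JScript to the replacement term that removes any existing quotes and then encloses each token in quotes (except the last token, which does not need it):
JREPL "(\q[^\q]*\q|[^,\q]*)," "$txt='\x22'+$1.replace(/\x22/g,'')+'\x22;'" /JQ /XSEQ /F "test.csv"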
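The effect of that search and replace can be sketched in Python as follows. This is only an illustration of the transformation, not how JREPL is implemented, and the function name requote is made up for the example. Like the JREPL command, it assumes that ; never appears in the data and that there are no doubled quote literals.

```python
import re

# Simplified token pattern followed by the comma delimiter.
SEARCH = re.compile(r'("[^"]*"|[^,"]*),')

def requote(line):
    """Quote every comma-terminated token and turn its comma into a semicolon."""
    return SEARCH.sub(
        lambda m: '"' + m.group(1).replace('"', '') + '";',
        line,
    )

converted = requote('A,"x, y",,Z,last')
print(converted)   # "A";"x, y";"";"Z";last

# Splitting on ; is now safe, and the empty third token survives as "".
tokens = [t.strip('"') for t in converted.split(';')]
print(tokens)      # ['A', 'x, y', '', 'Z', 'last']
```

The last token is left untouched because nothing follows it that could be mistaken for an empty token.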
Here is a demo showing how to use that JREPL command to parse your CSV:
@echo off
for /f "tokens=1-11 delims=;" %%A in (
  'JREPL "(\q[^\q]*\q|[^,\q]*)," "$txt='\x22'+$1.replace(/\x22/g,'')+'\x22;'" /JQ /XSEQ /F test.csv'
) do (
  echo A=%%~A
  echo B=%%~B
  echo C=%%~C
  echo D=%%~D
  echo E=%%~E
  echo F=%%~F
  echo G=%%~G
  echo H=%%~H
  echo I=%%~I
  echo J=%%~J
  echo K=%%~K
  echo(
)
-- OUTPUT --
A=A
B=SCONE
C=Shen ring
D=SHEN_RING
E=FLOUR, BUTTER
F=BRONZE,GOLD
G=blank
H="This
I="BLANK""
J=
K=BLANK
A=A
B=STRAWBERRIES_AND_CREAM
C=Cat1
D=CAT1
E=STRAWBERRY, CREAM
F=OBSIDIAN,GOLD2
G=FS
H=FreeSpin
I=
J=FREE_SPIN
K=
A=A
B=WALNUT_TOFFEE
C=Pyramid
D=PYRAMID
E=BUTTER, SUGAR, WALNUT
F=GOLD,EMERALD,PERIDOT
G=1
H=Champagne
I=Garnet
J=GARNET
K=
A=A
B=RASPBERRY_AND_LIME_JELLY
C=Cuff bracelet
D=CUFF_BRACELET
E=RASPBERRY, JELLY, LIME
F=ZIRCON,BRONZE2,TOPAZ
G=2
H=Cocoa
I=Lapis lazuli
J=LAPIS_LAZULI
K=Blue
A=A
B=CHOCOLATE_CHIP_COOKIES
C=Nekhbet
D=NEKHBET
E=SUGAR, FLOUR, BUTTER, CHOCOLATE_CHIPS, SALT
F=EMERALD,BRONZE,GOLD,ALEXANDRITE,SILVER
G=3
H=GoldLeaf
I=gold3
J=GOLD3
K=yellow
A=A
B=BUTTER_CREAM_CUP_CAKE
C=Sobek
D=SOBEK
E=ICING_SUGAR, FLOUR, BUTTER, BUTTERCREAM
F=JADE,BRONZE,GOLD,GARNET2
G=4
H=Sugar
I=emerald
J=EMERALD
K=green
A=A
B=PEANUT_BUTTER_COOKIE
C=Sekhmet
D=SEKHMET
E=PEANUT_BUTTER, FLOUR, SUGAR, BAKING_POWDER
F=GARNET1,BRONZE,AMAZONITE,EMERALD
G=5
H=IcingSugar
I=JADE
J=JADE
K=green
A=A
B=CHOCOLATE_MARSHMALLOWS
C=Osiris
D=OSIRIS
E=MARSHMALLOW, CHOCOLATE_CHIPS
F=PLATINUM,ALEXANDRITE
G=6
H=Flour
I=Bronze
J=BRONZE
K=yellow
A=
B=
C=
D=
E=
F=
G=7
H=Butter
I=Gold
J=GOLD
K=yellow
A=B
B=BLUEBERRY_PIE
C=Ankh
D=ANKH
E=BLUEBERRY, SUGAR, FLOUR, BUTTER
F=JADEITE,EMERALD,BRONZE,GOLD
G=8
H=ChocolateChips
I=Alexandrite
J=ALEXANDRITE
K=
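But I would not use regular expressions for this. There are other ways.
Pure native batch solution
Believe it or not, it is not hard to manipulate each line using only internal batch commands so that FOR /F can parse all of the tokens.
Two things need to be done for your CSV:
1) Unquoted comma delimiters must be converted to some other character that does not appear in the file, leaving quoted commas alone. I can use a derivative of a technique that jeb developed to tell quoted characters from unquoted ones: when a variable is expanded with percent expansion, an escaped character like ^, is treated differently depending on whether it is quoted. Normally ^, becomes , while "^," stays unchanged. But if you use CALL, then "^," becomes "^^," while ^, stays unchanged. Either way, it is possible to distinguish quoted characters from unquoted ones.
2) FOR /F cannot parse empty tokens, so empty tokens must be enclosed in quotes. The simplest approach is to enclose all tokens in quotes.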
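Those two steps can be sketched in Python to show the intended result of the preprocessing. This is an illustration only: the names prepare_line and for_f_model are hypothetical, and for_f_model is just a rough model of how FOR /F treats consecutive delimiters as one and how %%~A drops enclosing quotes.

```python
def prepare_line(line, new_delim=';'):
    """Step 1: turn unquoted commas into new_delim (quotes are dropped).
    Step 2: enclose every token in quotes."""
    out = []
    in_quotes = False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes      # track state, discard the quote
        elif ch == ',' and not in_quotes:
            out.append(new_delim)          # a real delimiter
        else:
            out.append(ch)                 # data, including quoted commas
    body = ''.join(out)
    return '"' + body.replace(new_delim, '"' + new_delim + '"') + '"'

def for_f_model(line, delim=';'):
    """Rough model of FOR /F: empty tokens vanish, enclosing quotes removed."""
    tokens = [t for t in line.split(delim) if t]
    return [t[1:-1] if len(t) >= 2 and t[0] == t[-1] == '"' else t
            for t in tokens]

print(for_f_model('1;2,3;;4'))                  # ['1', '2,3', '4']
print(prepare_line('1,"2,3",,4'))               # "1";"2,3";"";"4"
print(for_f_model(prepare_line('1,"2,3",,4')))  # ['1', '2,3', '', '4']
```

Without step 2 the empty token disappears and every later column shifts left. The batch script that follows performs the same two steps using only internal commands.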
@echo off
setlocal enableDelayedExpansion
for /f "usebackq delims=" %%A in ("test.csv") do (
  %= Print out the raw line so we can verify the end result =%
  echo %%A
  %= Preprocess the line so it is safe to parse =%
  set "ln=%%A" %= Transfer line to environment variable =%
  %= Artifact of CALL - Convert quoted , to ^^; and unquoted , to ^; =%
  %= Make sure unquoted SET statement does not have any trailing characters =%
  call set ln=%%ln:,=^^;%%
  set "ln=!ln:^^;=,!" %= Convert quoted ^^; back into , =%
  set "ln=!ln:^;=;!" %= Convert unquoted ^; to ; =%
  set "ln=!ln:"=!" %= Strip all quotes so we can safely do next step =%
  set "ln="!ln:;=";"!"" %= Enclose all tokens in quotes to protect empty tokens =%
  %= The line is now ready to parse with another FOR /F =%
  %= I simply print the value of all 11 tokens, 1 per line. =%
  %= Adjust the loop as needed to suit your needs. =%
  for /f "tokens=1-11 delims=;" %%A in ("!ln!") do (
    for %%a in (A B C D E F G H I J K) do call :echoToken %%a
    echo(
  )
)
exit /b

:echoToken Char
for %%. in (.) do echo %1=%%~%1
exit /b
Here is the same code without all the comments:
@echo off
setlocal enableDelayedExpansion
for /f "usebackq delims=" %%A in ("test.csv") do (
  echo %%A
  set "ln=%%A"
  call set ln=%%ln:,=^^;%%
  set "ln=!ln:^^;=,!"
  set "ln=!ln:^;=;!"
  set "ln=!ln:"=!"
  set "ln="!ln:;=";"!""
  for /f "tokens=1-11 delims=;" %%A in ("!ln!") do (
    for %%a in (A B C D E F G H I J K) do call :echoToken %%a
    echo(
  )
)
exit /b

:echoToken Char
for %%. in (.) do echo %1=%%~%1
exit /b
-- OUTPUT --
A,SCONE,Shen ring,SHEN_RING,"FLOUR, BUTTER","BRONZE,GOLD",blank,"This,""BLANK""",,BLANK,
A=A
B=SCONE
C=Shen ring
D=SHEN_RING
E=FLOUR, BUTTER
F=BRONZE,GOLD
G=blank
H=This,BLANK
I=
J=BLANK
K=
A,STRAWBERRIES_AND_CREAM,Cat1,CAT1,"STRAWBERRY, CREAM","OBSIDIAN,GOLD2",FS,FreeSpin,,FREE_SPIN,
A=A
B=STRAWBERRIES_AND_CREAM
C=Cat1
D=CAT1
E=STRAWBERRY, CREAM
F=OBSIDIAN,GOLD2
G=FS
H=FreeSpin
I=
J=FREE_SPIN
K=
A,WALNUT_TOFFEE,Pyramid,PYRAMID,"BUTTER, SUGAR, WALNUT","GOLD,EMERALD,PERIDOT",1,Champagne,Garnet,GARNET,
A=A
B=WALNUT_TOFFEE
C=Pyramid
D=PYRAMID
E=BUTTER, SUGAR, WALNUT
F=GOLD,EMERALD,PERIDOT
G=1
H=Champagne
I=Garnet
J=GARNET
K=
A,RASPBERRY_AND_LIME_JELLY,Cuff bracelet,CUFF_BRACELET,"RASPBERRY, JELLY, LIME","ZIRCON,BRONZE2,TOPAZ",2,Cocoa,Lapis lazuli,LAPIS_LAZULI,Blue
A=A
B=RASPBERRY_AND_LIME_JELLY
C=Cuff bracelet
D=CUFF_BRACELET
E=RASPBERRY, JELLY, LIME
F=ZIRCON,BRONZE2,TOPAZ
G=2
H=Cocoa
I=Lapis lazuli
J=LAPIS_LAZULI
K=Blue
A,CHOCOLATE_CHIP_COOKIES,Nekhbet,NEKHBET,"SUGAR, FLOUR, BUTTER, CHOCOLATE_CHIPS, SALT","EMERALD,BRONZE,GOLD,ALEXANDRITE,SILVER",3,GoldLeaf,gold3,GOLD3,yellow
A=A
B=CHOCOLATE_CHIP_COOKIES
C=Nekhbet
D=NEKHBET
E=SUGAR, FLOUR, BUTTER, CHOCOLATE_CHIPS, SALT
F=EMERALD,BRONZE,GOLD,ALEXANDRITE,SILVER
G=3
H=GoldLeaf
I=gold3
J=GOLD3
K=yellow
A,BUTTER_CREAM_CUP_CAKE,Sobek,SOBEK,"ICING_SUGAR, FLOUR, BUTTER, BUTTERCREAM","JADE,BRONZE,GOLD,GARNET2",4,Sugar,emerald,EMERALD,green
A=A
B=BUTTER_CREAM_CUP_CAKE
C=Sobek
D=SOBEK
E=ICING_SUGAR, FLOUR, BUTTER, BUTTERCREAM
F=JADE,BRONZE,GOLD,GARNET2
G=4
H=Sugar
I=emerald
J=EMERALD
K=green
A,PEANUT_BUTTER_COOKIE,Sekhmet,SEKHMET,"PEANUT_BUTTER, FLOUR, SUGAR, BAKING_POWDER","GARNET1,BRONZE,AMAZONITE,EMERALD",5,IcingSugar,JADE,JADE,green
A=A
B=PEANUT_BUTTER_COOKIE
C=Sekhmet
D=SEKHMET
E=PEANUT_BUTTER, FLOUR, SUGAR, BAKING_POWDER
F=GARNET1,BRONZE,AMAZONITE,EMERALD
G=5
H=IcingSugar
I=JADE
J=JADE
K=green
A,CHOCOLATE_MARSHMALLOWS,Osiris,OSIRIS,"MARSHMALLOW, CHOCOLATE_CHIPS","PLATINUM,ALEXANDRITE",6,Flour,Bronze,BRONZE,yellow
A=A
B=CHOCOLATE_MARSHMALLOWS
C=Osiris
D=OSIRIS
E=MARSHMALLOW, CHOCOLATE_CHIPS
F=PLATINUM,ALEXANDRITE
G=6
H=Flour
I=Bronze
J=BRONZE
K=yellow
,,,,,,7,Butter,Gold,GOLD,yellow
A=
B=
C=
D=
E=
F=
G=7
H=Butter
I=Gold
J=GOLD
K=yellow
B,BLUEBERRY_PIE,Ankh,ANKH,"BLUEBERRY, SUGAR, FLOUR, BUTTER","JADEITE,EMERALD,BRONZE,GOLD",8,ChocolateChips,Alexandrite,ALEXANDRITE,
A=B
B=BLUEBERRY_PIE
C=Ankh
D=ANKH
E=BLUEBERRY, SUGAR, FLOUR, BUTTER
F=JADEITE,EMERALD,BRONZE,GOLD
G=8
H=ChocolateChips
I=Alexandrite
J=ALEXANDRITE
K=
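But there are many potential complications that can make parsing a CSV more difficult:
If delayed expansion is enabled when a FOR variable is expanded, then every unescaped ! is corrupted, and unescaped ^ is also corrupted whenever a ! is present.
The technique requires percent expansion. But if poison characters such as &, |, >, <, or ^ are present, it fails unless they are quoted or escaped.
You may not know the data, in which case you cannot be sure that any character is available as a delimiter that never appears in the data. So delimiter literals within quoted values must be encoded as something else, and then restored after the tokens are parsed.
Quoted CSV fields may contain doubled quotes. The doubled quotes should be undoubled after parsing.
CSV also allows newlines within quoted fields. I am not aware of a pure batch solution that can handle this.
A single FOR /F cannot parse more than 32 tokens in a line. For techniques that get past this limit, see this DosTips thread, especially the following posts: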
https://www.dostips.com/forum/viewtopic.php?f=3&t=7703&start=30#p51604
https://www.dostips.com/forum/viewtopic.php?f=3&t=7703&start=45#p51625
https://www.dostips.com/forum/viewtopic.php?f=3&t=7703&start=45#p51662
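For comparison, two of the complications above (doubled quote literals and newlines inside quoted fields) are handled automatically by a real CSV parser. The sketch below uses Python's standard csv module as a stand-in for such a parser; it is not a batch technique, and the sample data is invented for the example.

```python
import csv
import io

# Invented sample: one field with doubled quotes, one with an embedded newline.
data = 'id,note\r\n1,"He said ""hi"", then left"\r\n2,"line one\nline two"\r\n'

# newline='' leaves line endings alone so the csv module can interpret them.
rows = list(csv.reader(io.StringIO(data, newline='')))

for row in rows:
    print(row)
# ['id', 'note']
# ['1', 'He said "hi", then left']
# ['2', 'line one\nline two']
```

The third record spans two physical lines in the file, which is the case that line-oriented FOR /F parsing cannot express.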
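Here is a robust pure batch solution that can parse nearly any CSV, as long as there are no newlines within fields, no line length approaches the 8191 byte batch limit, and you do not need to parse more than 31 tokens. The code is heavily commented to explain all of the required steps.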
@echo off
setlocal enableDelayedExpansion
:: Must use arcane FOR /F option syntax to disable both EOL and DELIMS.
for /f usebackq^ delims^=^ eol^= %%A in ("test2.csv") do call :processLine
:: I CALL out of the loop to a :subroutine because a single CALL :subroutine
:: is much faster than many CALL SET statements. It also simplifies the
:: management of delayed expansion.
exit /b
:processLine
:: Must disable delayed expansion so percent expansion does not corrupt ! or ^ literals.
setlocal disableDelayedExpansion
:: FOR variables are global - this extra FOR loop exposes %%A that would otherwise be hidden.
for %%. in (.) do set "ln=%%A"
:: Print out raw line so we can diagnose the result.
set ln
:: "Hide" quotes by doubling, making all characters safe for percent expansion when
:: entire string is quoted. Also enclose line within extra set of , delimiters.
set "ln=,%ln:"=""%,"
:: Escape poison characters so all characters are safe for unquoted percent expansion.
set "ln=%ln:^=^^^^%" %= Double escaped to account for enabled delayed expansion later on. =%
set "ln=%ln:&=^&%"
set "ln=%ln:|=^|%"
set "ln=%ln:<=^<%"
set "ln=%ln:>=^>%"
:: Double escape ! so not corrupted by later percent expansion while delayed expansion enabled.
set "ln=%ln:!=^^!%"
:: Double and escape all commas. , -> ^,^,
set "ln=%ln:,=^,^,%"
:: Undouble quotes and unescape (originally) unquoted strings. Note that outer quotes are escaped.
set ^"ln=%ln:""="%^"
:: At this point quoted comma literals are still ^,^, whereas unquoted comma delimiters are ,,
:: Also, all quoted poison characters are still escaped, but unquoted ones are not.
:: Redouble quotes, all characters safe again for quoted percent expansion.
set "ln=%ln:"=""%"
:: Encode @ as @a and quoted comma literals ^,^, as @c
set "ln=%ln:@=@a%"
set "ln=%ln:^,^,=@c%"
:: Restore delayed expansion and undouble quotes, which unescapes (originally) quoted strings.
:: Note that outer quotes are NOT escaped this time. The ENDLOCAL and SET are on the same
:: line so that the percent expansion value is transferred across the ENDLOCAL barrier.
endlocal & set "ln=%ln:""="%" ! %= Trailing ! is ignored except forces all ^^ to become ^ =%
:: At this point no characters are escaped, and all ! and ^ are unprotected against percent or
:: FOR variable expansion while delayed expansion is enabled.
:: Remove enclosing quotes from tokens that are already quoted so we can later safely enclose
:: all tokens in quotes. This is why the extra enclosing , were added at the beginning.
set "ln=!ln:,,"=,,!"
set "ln=!ln:",,=,,!"
:: Remove outer , delimiters that were added at the beginning.
set "ln=!ln:~2,-2!"
:: Must double escape ! and ^ again to protect against delayed expansion within parsing FOR /F loop.
set "ln=!ln:^=^^^^!"
set "ln=%ln:!=^^^!%"
:: Undouble remaining quotes because quote literals are doubled within original CSV.
set "ln=!ln:""="!"
:: Restore doubled ,, delimiters to , and enclose all tokens within quotes to preserves empty tokens.
set "ln="!ln:,,=","!"" !
:: The line is now safe to parse with FOR /F, though @ and , are encoded as @a and @c
:: Parse line into tokens.
for /f "tokens=1-11 delims=," %%A in ("!ln!") do (
  %= Decode the tokens and store result in environment variables =%
  for %%a in (A B C D E F G H I J K) do call :decodeToken %%a
  %= Your processing goes here. Decoded %%A - %%K are now safely in !A! - !K! =%
  %= I will simply echo all the values, one per line =%
  for %%a in (A B C D E F G H I J K) do echo %%a=!%%a!
  echo(
)
exit /b

:decodeToken Char
:: Converts @c and @a back into , and @
for %%. in (.) do set "%1=%%~%1" !
if defined %1 (
  set "%1=!%1:@c=,!"
  set "%1=!%1:@a=@!"
)
exit /b
Here is the same code without all the comments:
@echo off
setlocal enableDelayedExpansion
for /f usebackq^ delims^=^ eol^= %%A in ("test2.csv") do call :processLine
exit /b
:processLine
setlocal disableDelayedExpansion
for %%. in (.) do set "ln=%%A"
set ln
set "ln=,%ln:"=""%,"
set "ln=%ln:^=^^^^%"
set "ln=%ln:&=^&%"
set "ln=%ln:|=^|%"
set "ln=%ln:<=^<%"
set "ln=%ln:>=^>%"
set "ln=%ln:!=^^!%"
set "ln=%ln:,=^,^,%"
set ^"ln=%ln:""="%^"
set "ln=%ln:"=""%"
set "ln=%ln:@=@a%"
set "ln=%ln:^,^,=@c%"
endlocal & set "ln=%ln:""="%" !
set "ln=!ln:,,"=,,!"
set "ln=!ln:",,=,,!"
set "ln=!ln:~2,-2!"
set "ln=!ln:^=^^^^!"
set "ln=%ln:!=^^^!%"
set "ln=!ln:""="!"
set "ln="!ln:,,=","!"" !
for /f "tokens=1-11 delims=," %%A in ("!ln!") do (
  for %%a in (A B C D E F G H I J K) do call :decodeToken %%a
  for %%a in (A B C D E F G H I J K) do echo %%a=!%%a!
  echo(
)
exit /b

:decodeToken Char
for %%. in (.) do set "%1=%%~%1" !
if defined %1 (
  set "%1=!%1:@c=,!"
  set "%1=!%1:@a=@!"
)
exit /b
Here is your sample CSV file, with one added line at the top to test the various complications:
;A!,"B!","C is ""cool""",D @^&|<>,"E @^&|<>","F ,x","G ""@^&|<>""","H ""@^&|<>!""",I,J,K
A,SCONE,Shen ring,SHEN_RING,"FLOUR, BUTTER","BRONZE,GOLD",blank,"This,""BLANK""",,BLANK,
A,STRAWBERRIES_AND_CREAM,Cat1,CAT1,"STRAWBERRY, CREAM","OBSIDIAN,GOLD2",FS,FreeSpin,,FREE_SPIN,
A,WALNUT_TOFFEE,Pyramid,PYRAMID,"BUTTER, SUGAR, WALNUT","GOLD,EMERALD,PERIDOT",1,Champagne,Garnet,GARNET,
A,RASPBERRY_AND_LIME_JELLY,Cuff bracelet,CUFF_BRACELET,"RASPBERRY, JELLY, LIME","ZIRCON,BRONZE2,TOPAZ",2,Cocoa,Lapis lazuli,LAPIS_LAZULI,Blue
A,CHOCOLATE_CHIP_COOKIES,Nekhbet,NEKHBET,"SUGAR, FLOUR, BUTTER, CHOCOLATE_CHIPS, SALT","EMERALD,BRONZE,GOLD,ALEXANDRITE,SILVER",3,GoldLeaf,gold3,GOLD3,yellow
A,BUTTER_CREAM_CUP_CAKE,Sobek,SOBEK,"ICING_SUGAR, FLOUR, BUTTER, BUTTERCREAM","JADE,BRONZE,GOLD,GARNET2",4,Sugar,emerald,EMERALD,green
A,PEANUT_BUTTER_COOKIE,Sekhmet,SEKHMET,"PEANUT_BUTTER, FLOUR, SUGAR, BAKING_POWDER","GARNET1,BRONZE,AMAZONITE,EMERALD",5,IcingSugar,JADE,JADE,green
A,CHOCOLATE_MARSHMALLOWS,Osiris,OSIRIS,"MARSHMALLOW, CHOCOLATE_CHIPS","PLATINUM,ALEXANDRITE",6,Flour,Bronze,BRONZE,yellow
,,,,,,7,Butter,Gold,GOLD,yellow
B,BLUEBERRY_PIE,Ankh,ANKH,"BLUEBERRY, SUGAR, FLOUR, BUTTER","JADEITE,EMERALD,BRONZE,GOLD",8,ChocolateChips,Alexandrite,ALEXANDRITE,
And here is the final output:
ln=;A!,"B!","C is ""cool""",D @^&|<>,"E @^&|<>","F ,x","G ""@^&|<>""","H ""@^&|<>!""",I,J,K
A=;A!
B=B!
C=C is "cool"
D=D @^&|<>
E=E @^&|<>
F=F ,x
G=G "@^&|<>"
H=H "@^&|<>!"
I=I
J=J
K=K
ln=A,SCONE,Shen ring,SHEN_RING,"FLOUR, BUTTER","BRONZE,GOLD",blank,"This,""BLANK""",,BLANK,
A=A
B=SCONE
C=Shen ring
D=SHEN_RING
E=FLOUR, BUTTER
F=BRONZE,GOLD
G=blank
H=This,"BLANK"
I=
J=BLANK
K=
ln=A,STRAWBERRIES_AND_CREAM,Cat1,CAT1,"STRAWBERRY, CREAM","OBSIDIAN,GOLD2",FS,FreeSpin,,FREE_SPIN,
A=A
B=STRAWBERRIES_AND_CREAM
C=Cat1
D=CAT1
E=STRAWBERRY, CREAM
F=OBSIDIAN,GOLD2
G=FS
H=FreeSpin
I=
J=FREE_SPIN
K=
ln=A,WALNUT_TOFFEE,Pyramid,PYRAMID,"BUTTER, SUGAR, WALNUT","GOLD,EMERALD,PERIDOT",1,Champagne,Garnet,GARNET,
A=A
B=WALNUT_TOFFEE
C=Pyramid
D=PYRAMID
E=BUTTER, SUGAR, WALNUT
F=GOLD,EMERALD,PERIDOT
G=1
H=Champagne
I=Garnet
J=GARNET
K=
ln=A,RASPBERRY_AND_LIME_JELLY,Cuff bracelet,CUFF_BRACELET,"RASPBERRY, JELLY, LIME","ZIRCON,BRONZE2,TOPAZ",2,Cocoa,Lapis lazuli,LAPIS_LAZULI,Blue
A=A
B=RASPBERRY_AND_LIME_JELLY
C=Cuff bracelet
D=CUFF_BRACELET
E=RASPBERRY, JELLY, LIME
F=ZIRCON,BRONZE2,TOPAZ
G=2
H=Cocoa
I=Lapis lazuli
J=LAPIS_LAZULI
K=Blue
ln=A,CHOCOLATE_CHIP_COOKIES,Nekhbet,NEKHBET,"SUGAR, FLOUR, BUTTER, CHOCOLATE_CHIPS, SALT","EMERALD,BRONZE,GOLD,ALEXANDRITE,SILVER",3,GoldLeaf,gold3,GOLD3,yellow
A=A
B=CHOCOLATE_CHIP_COOKIES
C=Nekhbet
D=NEKHBET
E=SUGAR, FLOUR, BUTTER, CHOCOLATE_CHIPS, SALT
F=EMERALD,BRONZE,GOLD,ALEXANDRITE,SILVER
G=3
H=GoldLeaf
I=gold3
J=GOLD3
K=yellow
ln=A,BUTTER_CREAM_CUP_CAKE,Sobek,SOBEK,"ICING_SUGAR, FLOUR, BUTTER, BUTTERCREAM","JADE,BRONZE,GOLD,GARNET2",4,Sugar,emerald,EMERALD,green
A=A
B=BUTTER_CREAM_CUP_CAKE
C=Sobek
D=SOBEK
E=ICING_SUGAR, FLOUR, BUTTER, BUTTERCREAM
F=JADE,BRONZE,GOLD,GARNET2
G=4
H=Sugar
I=emerald
J=EMERALD
K=green
ln=A,PEANUT_BUTTER_COOKIE,Sekhmet,SEKHMET,"PEANUT_BUTTER, FLOUR, SUGAR, BAKING_POWDER","GARNET1,BRONZE,AMAZONITE,EMERALD",5,IcingSugar,JADE,JADE,green
A=A
B=PEANUT_BUTTER_COOKIE
C=Sekhmet
D=SEKHMET
E=PEANUT_BUTTER, FLOUR, SUGAR, BAKING_POWDER
F=GARNET1,BRONZE,AMAZONITE,EMERALD
G=5
H=IcingSugar
I=JADE
J=JADE
K=green
ln=A,CHOCOLATE_MARSHMALLOWS,Osiris,OSIRIS,"MARSHMALLOW, CHOCOLATE_CHIPS","PLATINUM,ALEXANDRITE",6,Flour,Bronze,BRONZE,yellow
A=A
B=CHOCOLATE_MARSHMALLOWS
C=Osiris
D=OSIRIS
E=MARSHMALLOW, CHOCOLATE_CHIPS
F=PLATINUM,ALEXANDRITE
G=6
H=Flour
I=Bronze
J=BRONZE
K=yellow
ln=,,,,,,7,Butter,Gold,GOLD,yellow
A=
B=
C=
D=
E=
F=
G=7
H=Butter
I=Gold
J=GOLD
K=yellow
ln=B,BLUEBERRY_PIE,Ankh,ANKH,"BLUEBERRY, SUGAR, FLOUR, BUTTER","JADEITE,EMERALD,BRONZE,GOLD",8,ChocolateChips,Alexandrite,ALEXANDRITE,
A=B
B=BLUEBERRY_PIE
C=Ankh
D=ANKH
E=BLUEBERRY, SUGAR, FLOUR, BUTTER
F=JADEITE,EMERALD,BRONZE,GOLD
G=8
H=ChocolateChips
I=Alexandrite
J=ALEXANDRITE
K=
See this DosTips post for how to extend this technique to parse more than 32 fields.
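The @a / @c encoding used by the robust script can be easier to follow outside of batch. The Python sketch below is only an illustration of the idea under simplifying assumptions (no newlines in fields); the names encode_quoted_commas, decode_token and parse are made up for the example. Every @ becomes @a, and every comma inside quotes becomes @c, so the comma is free to serve as the delimiter.

```python
def encode_quoted_commas(line):
    """Encode @ as @a everywhere and quoted commas as @c."""
    out = []
    in_quotes = False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes   # "" toggles twice, so state is kept
            out.append(ch)
        elif ch == '@':
            out.append('@a')
        elif ch == ',' and in_quotes:
            out.append('@c')
        else:
            out.append(ch)
    return ''.join(out)

def decode_token(token):
    """Restore @c to , first, then @a to @ (same order as :decodeToken)."""
    return token.replace('@c', ',').replace('@a', '@')

def parse(line):
    fields = []
    for tok in encode_quoted_commas(line).split(','):
        if len(tok) >= 2 and tok.startswith('"') and tok.endswith('"'):
            tok = tok[1:-1].replace('""', '"')   # unwrap and undouble quotes
        fields.append(decode_token(tok))
    return fields

print(parse(r';A!,"B!","C is ""cool""",D @^&|<>,"F ,x"'))
# [';A!', 'B!', 'C is "cool"', 'D @^&|<>', 'F ,x']
```

Because every original @ is followed by a after encoding, the sequence @c can only come from an encoded comma, which is why the decode order is safe.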
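Hybrid JScript/batch parseCSV.bat utility
Pure batch requires a lot of arcane code that is difficult to develop on the fly, and it is relatively slow. I have created parseCSV.bat, a hybrid JScript/batch utility that quickly reformats nearly any CSV into something FOR /F can easily parse. It even supports newlines within fields.
Of course parseCSV cannot get around the 8191 line length limit, and extra code is still needed to parse more than 32 tokens.
parseCSV.bat does not use regular expressions.
I will not go into detail about how it works. The utility has complete built-in documentation, available by entering parseCSV /? at the command line. Here is the help output: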
parseCSV [/option]...
Parse stdin as CSV and write it to stdout in a way that can be safely
parsed by FOR /F. All columns will be enclosed by quotes so that empty
columns may be preserved. It also supports delimiters, newlines, and
escaped quotes within quoted values. Two consecutive quotes within a
quoted value are converted into one quote by default.
Available options:
/I:string = Input delimiter. Default is a comma (,)
/O:string = Output delimiter. Default is a comma (,)
The entire option must be quoted if specifying poison character
or whitespace literals as a delimiters for /I or /O.
Examples: pipe = "/I:|"
space = "/I: "
Standard JScript escape sequences can also be used.
Examples: tab = /I:\t or /I:\x09
backslash = /I:\\
/E = Encode output delimiter literal within value as \D
Encode newline within value as \N
Encode backslash within value as \S
/D = escape exclamation point and caret for Delayed expansion
! becomes ^!
^ becomes ^^
/L = treat all input quotes as quote Literals
/Q:QuoteOutputFormat
Controls output of Quotes, where QuoteOutputFormat may be any
one of the following:
L = all columns quoted, quote Literals output as " (Default)
E = all columns quoted, quote literals Escaped as ""
N = No columns quoted, quote literals output as "
The /Q:E and /Q:N options are useful for transforming data for
purposes other than parsing by FOR /F
/U = Write unix style lines with newline (\n) instead of the default
Windows style of carriage return and linefeed (\r\n).
parseCSV /?
Display this help
parseCSV /V
Display the version of parseCSV.bat
parseCSV.bat was written by Dave Benham. Updates are available at the original
posting site: http://www.dostips.com/forum/viewtopic.php?f=3&t=5702
Here is how parseCSV.bat can be used with test2.csv from above.
@echo off
setlocal enableDelayedExpansion
for /f "tokens=1-11 delims=," %%A in (
  'parseCSV /E /D ^<test2.csv'
) do (
  %= Decode Tokens =%
  for %%a in (A B C D E F G H I J K) do call :decodeToken %%a
  %= Show the results =%
  for %%a in (A B C D E F G H I J K) do echo %%a=!%%a!
  echo(
)
exit /b

:decodeToken
for %%. in (.) do set "%1=%%~%1" !
if defined %1 (
  set "%1=!%1:\D=,!"
  set "%1=!%1:\S=\!"
)
exit /b
See this DosTips post for how to extend this technique to parse more than 32 fields.
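The \D, \N and \S codes listed under the /E option in the help text can be decoded in a single pass, as in this Python sketch. It is an illustration only: decode_value and encode_value are hypothetical helpers, and encode_value is an assumption about an encoding consistent with the help text, not parseCSV's actual code.

```python
import re

def decode_value(value, delimiter=','):
    """Decode \\D, \\N and \\S in one left-to-right pass."""
    table = {'\\D': delimiter, '\\N': '\n', '\\S': '\\'}
    return re.sub(r'\\[DNS]', lambda m: table[m.group(0)], value)

def encode_value(value, delimiter=','):
    """Hypothetical inverse: backslash first, so later codes stay unambiguous."""
    return (value.replace('\\', '\\S')
                 .replace(delimiter, '\\D')
                 .replace('\n', '\\N'))

print(decode_value('F \\Dx'))     # F ,x
print(decode_value('a\\SDb'))     # a\Db

original = 'path C:\\Dir, two\nlines'
assert decode_value(encode_value(original)) == original
```

Decoding \S last (or in a single pass, as here) matters: restoring the backslash first would turn \SD into \D and then wrongly into a delimiter. The :decodeToken routine above follows the same order by restoring \D before \S.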
【Discussion】:
Thank you so much! A very detailed and easy to understand explanation. I used the pure native batch solution and it works well.