How best to use regular expressions to convert a hierarchical text file into XML?
Posted: 2011-01-18 11:19:29
Good morning,
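I'm interested in an efficient way to parse the values of a hierarchical text file (i.e., one that has a title => multiple headers => multiple sub-headers => multiple keys => multiple values) into a simple XML document. For simplicity's sake, the answer would use either:
regular expressions (preferably in PHP), or
PHP code (e.g., if looping is more efficient).
Here is a sample of the inventory file I'm working with. Note that Header = FOODS, Sub-Header = TYPE (A, B...), Keys = PRODUCT (or CODE, etc.), and that values may span more than one line.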
**FOODS - TYPE A**
___________________________________
**PRODUCT**
1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;
2) La Fe String Cheese
**CODE**
Sell by date going back to February 1, 2009
**MANUFACTURER**
Quesos Mi Pueblito, LLC, Passaic, NJ.
**VOLUME OF UNITS**
11,000 boxes
**DISTRIBUTION**
NJ, NY, DE, MD, CT, VA
___________________________________
**PRODUCT**
1) Peanut Brittle No Sugar Added;
2) Peanut Brittle Small Grind;
3) Homestyle Peanut Brittle Nuggets/Coconut Oil Coating
**CODE**
1) Lots 7109 - 8350 inclusive;
2) Lots 8198 - 8330 inclusive;
3) Lots 7075 - 9012 inclusive;
4) Lots 7100 - 8057 inclusive;
5) Lots 7152 - 8364 inclusive
**MANUFACTURER**
Star Kay White, Inc., Congers, NY.
**VOLUME OF UNITS**
5,749 units
**DISTRIBUTION**
NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN
**FOODS - TYPE B**
___________________________________
**PRODUCT**
Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice;
**CODE**
990-10/2 10/5
**MANUFACTURER**
San Mar Manufacturing Corp., Catano, PR.
**VOLUME OF UNITS**
384
**DISTRIBUTION**
PR
Here is the desired output (please excuse any XML syntax errors):
<foods>
  <food type="A">
    <product>Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese</product>
    <product>La Fe String Cheese</product>
    <code>Sell by date going back to February 1, 2009</code>
    <manufacturer>Quesos Mi Pueblito, LLC, Passaic, NJ.</manufacturer>
    <volume>11,000 boxes</volume>
    <distribution>NJ, NY, DE, MD, CT, VA</distribution>
  </food>
  <food type="A">
    <product>Peanut Brittle No Sugar Added</product>
    <product>Peanut Brittle Small Grind</product>
    <product>Homestyle Peanut Brittle Nuggets/Coconut Oil Coating</product>
    <code>Lots 7109 - 8350 inclusive</code>
    <code>Lots 8198 - 8330 inclusive</code>
    <code>Lots 7075 - 9012 inclusive</code>
    <code>Lots 7100 - 8057 inclusive</code>
    <code>Lots 7152 - 8364 inclusive</code>
    <manufacturer>Star Kay White, Inc., Congers, NY.</manufacturer>
    <volume>5,749 units</volume>
    <distribution>NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN</distribution>
  </food>
  <food type="B">
    <product>Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice</product>
    <code>990-10/2 10/5</code>
    <manufacturer>San Mar Manufacturing Corp., Catano, PR</manufacturer>
    <volume>384</volume>
    <distribution>PR</distribution>
  </food>
</foods>
<!-- and so forth -->
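So far, my approach (which may be quite inefficient for a large text file) would be one of the following:
Looping with multiple Select/Case statements: the file is loaded into a string buffer and, while iterating over each line, I check whether the line matches one of the header/sub-header/key lines, append the appropriate XML tag to an XML string variable, and then add child nodes to the XML based on IF statements about which key name is the most recent (this seems time-consuming and error-prone, especially if the text changes even slightly); or
Using regex (regular expressions) to find the key fields and replace them with the appropriate XML tags, cleaning the result up with an XML library, and then exporting the XML file. The problem is that I hardly ever use regular expressions, so I would need some example-based help.
Any help or advice would be appreciated.
Thanks.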
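As a rough illustration of the second, regex-based approach described above, the sketch below splits the text with `preg_split()` at three levels (header lines, `_____` delimiter lines, `**KEY**` lines) and lets `DOMDocument` take care of escaping. The function name `inventoryToXml`, the hard-coded `FOODS - TYPE X` header pattern, and the rule that turns `VOLUME OF UNITS` into `volume_of_units` are assumptions made for this example, not something taken from the sample file's specification.

```php
<?php
// Sketch only: inventoryToXml() and the tag-naming rule are illustrative assumptions.
function inventoryToXml(string $text): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;
    $root = $doc->appendChild($doc->createElement('foods'));

    // 1) Split on "**FOODS - TYPE X**" lines and keep the captured type.
    //    Result: [preamble, type1, body1, type2, body2, ...]
    $sections = preg_split('/^\*\*FOODS - TYPE (\w+)\*\*\s*$/m', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    for ($i = 1; $i + 1 < count($sections); $i += 2) {
        $type = $sections[$i];

        // 2) Split the section body into records on the "_____" delimiter lines.
        $records = preg_split('/^_+\s*$/m', $sections[$i + 1], -1, PREG_SPLIT_NO_EMPTY);
        foreach ($records as $record) {
            if (trim($record) === '') {
                continue; // whitespace between header and first delimiter
            }
            $food = $root->appendChild($doc->createElement('food'));
            $food->setAttribute('type', $type);

            // 3) Split the record on "**KEY**" lines: [junk, key1, values1, key2, values2, ...]
            $fields = preg_split('/^\*\*(.+?)\*\*\s*$/m', $record, -1, PREG_SPLIT_DELIM_CAPTURE);
            for ($j = 1; $j + 1 < count($fields); $j += 2) {
                // Assumed naming rule: "VOLUME OF UNITS" becomes "volume_of_units".
                $tag = strtolower(preg_replace('/\W+/', '_', trim($fields[$j])));

                // 4) One element per non-empty value line.
                foreach (preg_split('/\R/', trim($fields[$j + 1]), -1, PREG_SPLIT_NO_EMPTY) as $value) {
                    // Strip a leading "1) " enumerator and a trailing ";".
                    $value = preg_replace('/^\d+\)\s*|;\s*$/', '', trim($value));
                    $element = $food->appendChild($doc->createElement($tag));
                    $element->appendChild($doc->createTextNode($value));
                }
            }
        }
    }
    return $doc->saveXML();
}
```

Calling `inventoryToXml(file_get_contents('data.txt'))` would read the whole file into memory first, so for very large files a line-by-line loop may still be the better fit. The tag names also differ slightly from the desired output (`volume_of_units` rather than `volume`); a lookup table from key to tag name would close that gap.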
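Comments on the question:
"Note that ... values may span more than one line." - Can you give an example?
To Max S. - I have added my approach to the bottom of the question. To VolkerK - please see the two samples that have multiple values under the PRODUCT heading; the XML sample I provided (right or wrong) has multiple corresponding elements as needed.
Answer 1: Another hint for an XSLT 1.0 solution is here: http://bytes.com/topic/net/answers/808619-read-plain-file-xslt-1-0-a
Comments:
I will have to look at this in a future project - thanks.
Answer 2: Instead of regex or PHP, use the XSLT 2.0 unparsed-text() function to read the file (see http://www.biglist.com/lists/xsl-list/archives/200508/msg00085.html)
Answer 3: An example that could serve as a starting point. At least I hope it gives you an idea...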
<?php
define('TYPE_HEADER', 1);
define('TYPE_KEY', 2);
define('TYPE_DELIMETER', 3);
define('TYPE_VALUE', 4);

$datafile = 'data.txt';
$fp = fopen($datafile, 'rb') or die('!fopen');

// stores (the first) header in 'name' and the root SimpleXMLElement in 'element'
$container = array('name'=>null, 'element'=>null);
// stores the name for each item element, the value of the type attribute for subsequent
// item elements and the SimpleXMLElement of the current item element
$item = array('name'=>null, 'type'=>null, 'current_element'=>null);
// the last **key** encountered, used to create new child elements in the current
// item element when a value is encountered
$key = null;

while ( false!==($t=getstruct($fp)) ) {
  switch( $t[0] ) {
    case TYPE_HEADER:
      if ( is_null($container['element']) ) {
        // this is the first time we hit **header - subheader**
        $container['name'] = $t[1][0];
        // ugly hack, < . name . />
        $container['element'] = new SimpleXMLElement('<'.$container['name'].'/>');
        // each subsequent new item gets the new subheader as type attribute
        $item['type'] = $t[1][1];
        // dummy implementation: "deducing" the item name from header/container[name]
        // by cutting off the last character (FOODS -> FOOD)
        $item['name'] = substr($t[1][0], 0, -1);
      }
      else {
        // hitting **header - subheader** the (second, third, nth) time
        /*
        header must be the same as the first time (stored in container['name']).
        Otherwise you need another container element since
        xml documents can only have one root element
        */
        if ( $container['name'] !== $t[1][0] ) {
          echo $container['name'], "!==", $t[1][0], "\n";
          die('format error');
        }
        else {
          // subheader may have changed, store it for future item elements
          $item['type'] = $t[1][1];
        }
      }
      break;
    case TYPE_DELIMETER:
      assert( !is_null($container['element']) );
      assert( !is_null($item['name']) );
      assert( !is_null($item['type']) );
      /* that's maybe not a wise choice.
      You might want to check the complete item before appending it to the document.
      But the example is a hack anyway ...so create a new item element and append it to the container right away
      */
      $item['current_element'] = $container['element']->addChild($item['name']);
      // set the type attribute according to the last **header - subheader** encountered
      $item['current_element']['type'] = $item['type'];
      break;
    case TYPE_KEY:
      $key = $t[1][0];
      break;
    case TYPE_VALUE:
      assert( !is_null($item['current_element']) );
      assert( !is_null($key) );
      // this is a value belonging to the "last" key encountered:
      // create a new "key" element with the value as content
      // and add it to the current item element.
      // note: the key is used as the element name unchanged, so a key that contains
      // spaces (VOLUME OF UNITS) ends up as an element name that is not valid XML
      $tmp = $item['current_element']->addChild($key, $t[1][0]);
      break;
    default:
      die('unknown token');
  }
}

if ( !is_null($container['element']) ) {
  $doc = dom_import_simplexml($container['element']);
  $doc = $doc->ownerDocument;
  $doc->formatOutput = true;
  echo $doc->saveXML();
}
die;

/*
Take a look at gettoken() at http://www.tuxradar.com/practicalphp/21/5/6
It breaks the stream into much simpler pieces.
In the next step the parser would "combine" or structure the simple tokens into more complex things.
This function does both....
@return array(id, array(parameter)), or false at the end of the file
*/
function getstruct($fp) {
  if ( feof($fp) ) {
    return false;
  }
  // shortcut: all we care about "happens" on one line
  // so let php read one line in a single step and then do the pattern matching
  $line = trim(fgets($fp));

  // this matches **key** and **header - subheader**
  if ( preg_match('#^\*\*([^-]+)(?:-(.*))?\*\*$#', $line, $m) ) {
    // only for **header - subheader** $m[2] is set.
    if ( isset($m[2]) ) {
      return array(TYPE_HEADER, array(trim($m[1]), trim($m[2])));
    }
    else {
      return array(TYPE_KEY, array($m[1]));
    }
  }
  // this matches _____________ and means "new item"
  else if ( preg_match('#^_+$#', $line, $m) ) {
    return array(TYPE_DELIMETER, array());
  }
  // any other non-empty line is a single value
  else if ( preg_match('#\S#', $line) ) {
    // you might want to filter the 1),2),3) part out here
    // could also be two different token types
    return array(TYPE_VALUE, array($line));
  }
  else {
    // skip empty lines, would be nicer with tail-recursion...
    return getstruct($fp);
  }
}
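The comment in getstruct() about filtering out the 1), 2), 3) part could be handled by a small helper like the one below. This is only a sketch: `cleanValue` is a hypothetical name that does not appear in the code above, and it assumes that an enumerator is always digits followed by `)` at the start of the line and that a trailing `;` only separates list entries.

```php
<?php
// Hypothetical helper, not part of the code above.
function cleanValue(string $line): string
{
    // Drop a leading "1) ", "2) ", ... enumerator, if present.
    $line = preg_replace('/^\s*\d+\)\s*/', '', $line);
    // Drop a trailing ";" (plus whitespace) that only separates list entries.
    $line = preg_replace('/;\s*$/', '', $line);
    return trim($line);
}
```

In getstruct() the TYPE_VALUE branch could then return `array(TYPE_VALUE, array(cleanValue($line)))`. Values such as `990-10/2 10/5` pass through unchanged, because they do not start with digits followed by a closing parenthesis.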
prints
<?xml version="1.0"?>
<FOODS>
  <FOOD type="TYPE A">
    <PRODUCT>1) Mi Pueblito Queso Fresco Authentic Mexican Style Fresh Cheese;</PRODUCT>
    <PRODUCT>2) La Fe String Cheese</PRODUCT>
    <CODE>Sell by date going back to February 1, 2009</CODE>
    <MANUFACTURER>Quesos Mi Pueblito, LLC, Passaic, NJ.</MANUFACTURER>
    <VOLUME OF UNITS>11,000 boxes</VOLUME OF UNITS>
    <DISTRIBUTION>NJ, NY, DE, MD, CT, VA</DISTRIBUTION>
  </FOOD>
  <FOOD type="TYPE A">
    <PRODUCT>1) Peanut Brittle No Sugar Added;</PRODUCT>
    <PRODUCT>2) Peanut Brittle Small Grind;</PRODUCT>
    <PRODUCT>3) Homestyle Peanut Brittle Nuggets/Coconut Oil Coating</PRODUCT>
    <CODE>1) Lots 7109 - 8350 inclusive;</CODE>
    <CODE>2) Lots 8198 - 8330 inclusive;</CODE>
    <CODE>3) Lots 7075 - 9012 inclusive;</CODE>
    <CODE>4) Lots 7100 - 8057 inclusive;</CODE>
    <CODE>5) Lots 7152 - 8364 inclusive</CODE>
    <MANUFACTURER>Star Kay White, Inc., Congers, NY.</MANUFACTURER>
    <VOLUME OF UNITS>5,749 units</VOLUME OF UNITS>
    <DISTRIBUTION>NY, NJ, MA, PA, OH, FL, TX, UT, CA, IA, NV, MO and IN</DISTRIBUTION>
  </FOOD>
  <FOOD type="TYPE B">
    <PRODUCT>Cool River Bebidas Naturales - West Indian Cherry Fruit Acerola 16% Juice;</PRODUCT>
    <CODE>990-10/2 10/5</CODE>
    <MANUFACTURER>San Mar Manufacturing Corp., Catano, PR.</MANUFACTURER>
    <VOLUME OF UNITS>384</VOLUME OF UNITS>
    <DISTRIBUTION>PR</DISTRIBUTION>
  </FOOD>
</FOODS>
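One thing to watch in this output: `VOLUME OF UNITS` contains spaces, so it is not a valid XML element name and a strict parser would reject the document. A possible workaround, sketched below under assumptions, is to sanitize the key before it becomes an element name and to append the value as a DOM text node so that characters such as `&` are escaped. The helper name `addField` and the choice of `_` as the replacement character are illustrative, not part of the example above.

```php
<?php
// Hypothetical helper: adds <key>value</key> to $item using a sanitized element name.
function addField(SimpleXMLElement $item, string $key, string $value): SimpleXMLElement
{
    // Replace every run of characters outside a conservative ASCII subset of XML name characters.
    $name = preg_replace('/[^A-Za-z0-9_.-]+/', '_', trim($key));
    // An element name must not start with a digit, "." or "-".
    if ($name === '' || !preg_match('/^[A-Za-z_]/', $name)) {
        $name = '_' . $name;
    }
    $child = $item->addChild($name);
    // Append the text through DOM so that "&" and "<" are escaped properly.
    $node = dom_import_simplexml($child);
    $node->appendChild($node->ownerDocument->createTextNode($value));
    return $child;
}
```

With this in place, the TYPE_VALUE case could call `addField($item['current_element'], $key, $t[1][0])` instead of `addChild($key, $t[1][0])`, which would yield `<VOLUME_OF_UNITS>` rather than `<VOLUME OF UNITS>`.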
Unfortunately, the status of ANTLR's PHP target is currently "Runtime is in alpha status.", but it is worth a look anyway...
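Comments:
This is a great example - I can take it from here. Thank you very much for your help!
By the way, am I missing something? Where in the code do you reference ANTLR?
Oh no, I didn't use ANTLR for the example. I was just urging you to take a look at the project, even though PHP is not yet a viable target platform.
Answer 4: See: http://www.tuxradar.com/practicalphp/21/5/6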
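This shows you how to parse a text file into tokens with PHP. Once it is parsed, you can put the result wherever you want.
You need to search the file for specific tokens according to your criteria:
For example: PRODUCT
This gives you the XML tag.
Then 1) can have a special meaning:
1) Peanut Brittle...
This tells you what to put inside the XML tag.
I don't know whether this is the most efficient way to accomplish the task, but it is how a compiler parses a file, and it has the potential to be very accurate.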
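The two steps described here (first break the text into tokens, then assign a hierarchy level from the kind of each token) might look roughly like the sketch below. The function names `tokenize` and `buildTree`, the four token kinds, and the nested-array layout are assumptions chosen for illustration; they are not taken from the linked tutorial.

```php
<?php
// Step 1 (sketch): classify every non-empty line as one of four token kinds.
function tokenize(string $text): array
{
    $tokens = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue;
        }
        if (preg_match('/^\*\*(.+?)\s+-\s+(.+)\*\*$/', $line, $m)) {
            $tokens[] = ['header', $m[2]];      // **FOODS - TYPE A** gives "TYPE A"
        } elseif (preg_match('/^_+$/', $line)) {
            $tokens[] = ['delimiter', null];    // _____ starts a new item
        } elseif (preg_match('/^\*\*(.+)\*\*$/', $line, $m)) {
            $tokens[] = ['key', $m[1]];         // **PRODUCT**
        } else {
            $tokens[] = ['value', $line];       // anything else is a value line
        }
    }
    return $tokens;
}

// Step 2 (sketch): the token kind decides at which level the text is stored.
function buildTree(array $tokens): array
{
    $items = [];
    $type = null;
    $key = null;
    foreach ($tokens as [$kind, $text]) {
        switch ($kind) {
            case 'header':
                $type = $text;
                break;
            case 'delimiter':
                $items[] = ['type' => $type, 'fields' => []];
                $key = null;
                break;
            case 'key':
                $key = $text;
                break;
            case 'value':
                // Values seen before any item or key are ignored.
                if ($items && $key !== null) {
                    $items[count($items) - 1]['fields'][$key][] = $text;
                }
                break;
        }
    }
    return $items;
}
```

The array returned by `buildTree(tokenize($text))` holds one entry per item, each with its `type` and a `fields` map from key to a list of values. Writing that structure out as XML is then a pair of nested loops, independent of how the tokens were recognized.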
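Comments:
Thanks, but this is the actual mechanism for "parsing" - the key piece missing here is how to parse a text file (or string) with its different levels of hierarchy and save it to an XML file.
You parse it, and then you get each piece. Then you assign the hierarchy level based on the details in the token. That is how compilers work. It looks like you can use the numbers 1) ... to assign levels for the XML.
Could you revise your answer to show how to send the different levels of the hierarchy into the XML output?
I mean, this process of "load a line, multiple nested conditionals, separate function calls to add nodes to the XML file, loop" is workable, but is it the best or most efficient way to do this, as opposed to running a small number of regular expressions to search and replace and at least get a rough cut of an XML string done?
Thanks for the answer, Todd. I'm still looking for a more "minimalist" regex-based solution (see the changed title), but I appreciate your time.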