Store html tags in xml
Posted: 2011-08-12 05:56:42
Question: I have an HTML-formatted string that contains various HTML tags. I want to put this string inside an XML tag so that the HTML tags are preserved. For example:
// Imports are inferred from the class names used below
// (Apache HttpClient 4.x, jsoup, and the JDK's DOM/SAX packages).
import java.io.IOException;
import java.io.StringReader;
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XMLfunctions {

    /** Parses an XML string into a DOM Document; returns null on failure. */
    public final static Document XMLfromString(String xml) {
        Document doc = null;
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder db = dbf.newDocumentBuilder();
            InputSource is = new InputSource();
            is.setCharacterStream(new StringReader(xml));
            doc = db.parse(is);
        } catch (ParserConfigurationException e) {
            System.out.println("XML parse error: " + e.getMessage());
            return null;
        } catch (SAXException e) {
            System.out.println("Wrong XML file structure: " + e.getMessage());
            return null;
        } catch (IOException e) {
            System.out.println("I/O exception: " + e.getMessage());
            return null;
        }
        return doc;
    }

    /** Returns element value
     * @param elem element (it is XML tag)
     * @return Element value otherwise empty String
     */
    public final static String getElementValue(Node elem) {
        Node kid;
        if (elem != null) {
            if (elem.hasChildNodes()) {
                for (kid = elem.getFirstChild(); kid != null; kid = kid.getNextSibling()) {
                    // Only plain text nodes are accepted here.
                    if (kid.getNodeType() == Node.TEXT_NODE) {
                        return kid.getNodeValue();
                    }
                }
            }
        }
        return "";
    }

    /*Start Parsing Body */
    public static String getBodyXML(String id) {
        String line = null;
        try {
            DefaultHttpClient httpClient = new DefaultHttpClient();
            HttpPost httpPost = new HttpPost("http://192.168.1.44:9090/solr/core0/select/?q=content_id:"+id+"&version=2.2&start=0&rows=10&indent=on");
            HttpResponse httpResponse = httpClient.execute(httpPost);
            HttpEntity httpEntity = httpResponse.getEntity();
            line = EntityUtils.toString(httpEntity);
        } catch (UnsupportedEncodingException e) {
            line = "<results status=\"error\"><msg>Can't connect to server</msg></results>";
        } catch (MalformedURLException e) {
            line = "<results status=\"error\"><msg>Can't connect to server</msg></results>";
        } catch (IOException e) {
            line = "<results status=\"error\"><msg>Can't connect to server</msg></results>";
        }
        String st = ParseXMLBodyNode(line, "doc");
        return st;
    }

    public static String ParseXMLBodyNode(String str, String node) {
        String xmlRecords = str;
        String results = "";
        String[] result = new String[1];
        StringBuffer sb = new StringBuffer();
        StringBuffer text = new StringBuffer();
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            InputSource is = new InputSource();
            is.setCharacterStream(new StringReader(xmlRecords));
            Document doc = db.parse(is);
            NodeList indiatimes1 = doc.getElementsByTagName(node);
            sb.append("<results count=");
            sb.append("\"1\"");
            sb.append(">\r\n");
            for (int i = 0; i < indiatimes1.getLength(); i++) {
                Node node1 = indiatimes1.item(i);
                if (node1.getNodeType() == Node.ELEMENT_NODE) {
                    Element element = (Element) node1;
                    NodeList nodelist = element.getElementsByTagName("str");
                    Element element1 = (Element) nodelist.item(0);
                    NodeList title = element1.getChildNodes();
                    title.getLength();
                    for (int j = 0; j < title.getLength(); j++) {
                        text.append(title.item(j).getNodeValue());
                    }
                    System.out.print((title.item(0)).getNodeValue());
                    sb.append("<result>\r\n");
                    sb.append("<body>");
                    String tmpText = html2text(text.toString());
                    sb.append("<![CDATA[<body>");
                    sb.append(tmpText);
                    sb.append("</body>]]>");
                    sb.append("</body>\r\n");
                    sb.append("</result>\r\n");
                    result[i] = title.item(0).getNodeValue();
                }
            }
            sb.append("</results>");
        } catch (Exception e) {
            System.out.println("Exception........" + results);
            e.printStackTrace();
        }
        return sb.toString();
    }
    /*End Parsing Body*/

    public static int numResults(Document doc) {
        Node results = doc.getDocumentElement();
        int res = -1;
        try {
            res = Integer.valueOf(results.getAttributes().getNamedItem("count").getNodeValue());
        } catch (Exception e) {
            res = -1;
        }
        return res;
    }

    public static String getValue(Element item, String str) {
        NodeList n = item.getElementsByTagName(str);
        return XMLfunctions.getElementValue(n.item(0));
    }

    /** Sanitizes the HTML with jsoup's basic whitelist. */
    public static String html2text(String html) {
        String pText = Jsoup.clean(html, Whitelist.basic());
        return pText;
    }
}
I call these functions like this:
String xml = XMLfunctions.getBodyXML(id);
Document doc = XMLfunctions.XMLfromString(xml);
I want the font tags to remain in the XML as HTML tags.
Any help would be appreciated!
Answer 1: Wrap your HTML in a CDATA section so that it is treated as plain text rather than as part of the XML:
<result>
<![CDATA[
<body><font size="2px" face="arial">Hello World</font></body>
]]>
</result>
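Because the question builds its XML by string concatenation, the CDATA wrapping can be sketched as a small helper. This is a minimal sketch, not the asker's code: the class CdataWrapDemo and the methods wrapInCdata and rootText are hypothetical names introduced for illustration. It also covers one case the example above does not mention: if the HTML itself contains the sequence "]]>", the section would end early, so the helper splits that sequence across two CDATA sections.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; none of these names come from the question.
final class CdataWrapDemo {

    /** Wraps text in a CDATA section, splitting any "]]>" so it cannot close the section early. */
    static String wrapInCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    /** Parses the XML string and returns the text content of its root element. */
    static String rootText(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));
            // getTextContent() concatenates text and CDATA children alike.
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<body><font size=\"2px\" face=\"arial\">Hello World</font></body>";
        String xml = "<result>" + wrapInCdata(html) + "</result>";
        System.out.println(xml);
        // The HTML should come back unchanged after a parse round trip.
        System.out.println(rootText(xml).equals(html));
    }
}
```

The round trip in main is the useful check: whatever string goes into wrapInCdata should come back unchanged from the parsed document.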
Update
Your problem may be here:
sb.append("<result>\r\n");
sb.append("<body>");
String tmpText = html2text(text.toString());
sb.append("<![CDATA[<body>");
sb.append(tmpText);
sb.append("</body>]]>");
sb.append("</body>\r\n");
sb.append("</result>\r\n");
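Note the sb.append("<body>"); and sb.append("</body>\r\n"); lines surrounding the CDATA section; they may be what prevents the XML from being read correctly. Perhaps you should remove these two lines so that it looks like this: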
sb.append("<result>\r\n");
String tmpText = html2text(text.toString());
sb.append("<![CDATA[<body>");
sb.append(tmpText);
sb.append("</body>]]>");
sb.append("</result>\r\n");
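Reading the value back is a separate step that may also matter. The sketch below assumes a standard JDK DOM parser with default settings, where a CDATA section is reported as a CDATA_SECTION_NODE rather than a TEXT_NODE. Under that assumption, a lookup that accepts only TEXT_NODE children, like getElementValue in the question, returns an empty string for an element whose only child is a CDATA section. The class CdataReadDemo and its methods are hypothetical names used for illustration; parsers on other platforms may behave differently.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; none of these names come from the question.
final class CdataReadDemo {

    /** Parses xml and returns its root element; coalescing=true asks the parser to convert CDATA to Text nodes. */
    static Element parseRoot(String xml, boolean coalescing) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setCoalescing(coalescing);
            return dbf.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(xml)))
                      .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Same check as getElementValue in the question: only TEXT_NODE children count. */
    static String textNodesOnly(Node elem) {
        for (Node kid = elem.getFirstChild(); kid != null; kid = kid.getNextSibling()) {
            if (kid.getNodeType() == Node.TEXT_NODE) {
                return kid.getNodeValue();
            }
        }
        return "";
    }

    /** Concatenates the values of all direct TEXT_NODE and CDATA_SECTION_NODE children. */
    static String textOrCdata(Node elem) {
        StringBuilder out = new StringBuilder();
        for (Node kid = elem.getFirstChild(); kid != null; kid = kid.getNextSibling()) {
            short type = kid.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                out.append(kid.getNodeValue());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<result><![CDATA[<body>Hello World</body>]]></result>";
        Element root = parseRoot(xml, false);
        System.out.println("text nodes only: [" + textNodesOnly(root) + "]");
        System.out.println("text or CDATA:   [" + textOrCdata(root) + "]");
        // With coalescing enabled the parser is asked to hand back CDATA as Text nodes.
        System.out.println("coalescing:      [" + textNodesOnly(parseRoot(xml, true)) + "]");
    }
}
```

Either accepting CDATA_SECTION_NODE in the lookup or calling Element.getTextContent() avoids depending on how the parser classifies the node. DocumentBuilderFactory.setCoalescing(true) is a third option when the reading code cannot be changed.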
Comments:
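@BoltClock: Should it be something like sb.append(""); or something else?
I don't understand. Are you building the XML with a StringBuilder or in some other way?
@BoltClock: First I parse an XML document, then I use the jsoup basic() method to keep all the formatting in a string. I then append that string to a StringBuffer in the way I asked about in my previous comment, build a new XML document from it, and parse that XML again. At the moment I get nothing back, even after using CDATA.
@BoltClock: First String xml = XMLfunctions.getBodyXML(id); is called, which invokes getBodyXML(String id), and that in turn calls ParseXMLBodyNode(String str,String node).
@BoltClock: No, it does not work. The new XML I build is fed back into Document doc = XMLfunctions.XMLfromString(xml); and it returns nothing.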