How to parse HTML content in Java and save it to a database

Posted 2023-04-03 吳名氏

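I. Introduction

I was recently given a task: scrape all of the data for the five-level administrative divisions (roughly 710,000 records). The site to scrape is 行政区划 - 行政区划代码查询 (an administrative division code lookup site). It turned out that this site does not load its data through API requests but returns the HTML directly, so I looked into how Java can parse the content of an HTML page.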

II. Preparation

I chose jsoup to read and parse the HTML. It requires the following dependency:

        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.8.3</version>
        </dependency>

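jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM methods, CSS selectors, and jQuery-like operations. It is released under the MIT license.

The main features of jsoup are:

1. Parse HTML from a URL, a file, or a string;

2. Find and extract data using DOM traversal or CSS selectors;

3. Manipulate HTML elements, attributes, and text.

Example code: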

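// Fetch the page and get its Document object
// (Jsoup.parse(String) would treat the argument as HTML text, not as a URL to download)
Document doc = Jsoup.connect("http://www.dangdang.com").get();
// Get the element with id="content"
Element content = doc.getElementById("content");
// Get the a tags under that element
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    // Get the value of the a tag's href attribute
    String linkHref = link.attr("href");
    // Get the text content of the a tag
    String linkText = link.text();
}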

The Element class provides a set of DOM-like methods for finding elements and for extracting and manipulating their data, as follows:

1. Finding elements

getElementById(String id)

getElementsByTag(String tag)

getElementsByClass(String className)

getElementsByAttribute(String key) (and related methods)

Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()

Graph: parent(), children(), child(int index)

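2. Element data

attr(String key): get an attribute value

attr(String key, String value): set an attribute value

attributes(): get all attributes

id(), className() and classNames()

text(): get the text content

text(String value): set the text content

html(): get the element's inner HTML

html(String value): set the element's inner HTML

outerHtml(): get the element's outer HTML

data(): get the data content (for example, of script and style tags)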

tag() and tagName()

3. Manipulating HTML and text

append(String html), prepend(String html)

appendText(String text), prependText(String text)

appendElement(String tagName), prependElement(String tagName)

html(String value)

III. Scraping the site data

Here is the code:

PositionCrawlerTest.java:

// Named PositionCrawlerTest rather than Test so the class does not clash with JUnit's @Test annotation
@Slf4j
@SpringBootTest
class PositionCrawlerTest {

    @Resource
    private PositionService positionService;

    /**
     * Scrape the province/city/district site
     */
    @Test
    public void test2() throws InterruptedException {
        // Five levels in total
        for (int i = 0; i < 5; i++) {
            if (i == 0) {
                // Level 1 comes from the site's home page
                List<PositionEntity> positionEntities = PositionUtils.reqPosition(PositionUtils.URL_HEAD);
                savePosition(positionEntities, null, i);
                continue;
            }
            // For every saved record at level i, fetch and save its children (level i + 1)
            List<Position> positions = positionService.findListByLevel(i);
            for (Position parentPosition : positions) {
                List<PositionEntity> positionEntities = PositionUtils.reqPosition(String.format("%s%s%s", PositionUtils.URL_HEAD, parentPosition.getSn(), PositionUtils.URL_TAIL));
                savePosition(positionEntities, parentPosition, i);
            }
        }
    }

    /**
     * Save the address records
     */
    private void savePosition(List<PositionEntity> positionEntities, Position parentPosition, int i) {
        for (PositionEntity entity : positionEntities) {
            Position position = new Position();
            position.setSn(entity.getCode());
            position.setFullInitials(PinyinUtils.strFirst2Pinyin((parentPosition != null ? parentPosition.getFullName() : "") + entity.getName()));
            position.setFullName((parentPosition != null ? parentPosition.getFullName() : "") + entity.getName());
            position.setLevel(i + 1);
            position.setName(entity.getName());
            position.setOrderNumber(0);
            position.setPsn(parentPosition != null ? parentPosition.getSn() : 0L);
            // Insert only if this code has not been saved yet
            long count = positionService.countBySn(position.getSn());
            if (count == 0) {
                positionService.savePosition(position);
            }
        }
    }
}
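
The test above walks the hierarchy one level at a time: it saves the top-level page first, then fetches the child page of every record saved at the previous level, skipping codes that are already stored. To make that control flow easier to try without Spring, a database, or network access, here is a minimal sketch that swaps all three for in-memory stand-ins. Everything in it (`LevelCrawlSketch`, `Node`, the `fetch` function, the sample codes) is hypothetical and not part of the original project; only the URL pattern is taken from `PositionUtils`. Unlike `findListByLevel`, it only descends into records saved during the current run.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class LevelCrawlSketch {
    static final String URL_HEAD = "https://xingzhengquhua.bmcx.com/";
    static final String URL_TAIL = "__xingzhengquhua/";

    /** One saved record: code, parent code (0 for the top level), name, level. */
    static final class Node {
        final long sn;
        final long psn;
        final String name;
        final int level;

        Node(long sn, long psn, String name, int level) {
            this.sn = sn;
            this.psn = psn;
            this.name = name;
            this.level = level;
        }
    }

    /** URL of the page that lists the children of the given code. */
    static String childUrl(long sn) {
        return String.format("%s%s%s", URL_HEAD, sn, URL_TAIL);
    }

    /**
     * Crawls up to maxLevel levels. fetch maps a URL to the (code -> name) pairs
     * found on that page; in the article that role is played by PositionUtils.reqPosition.
     */
    static Map<Long, Node> crawl(Function<String, Map<Long, String>> fetch, int maxLevel) {
        Map<Long, Node> saved = new LinkedHashMap<>(); // stands in for the position table
        List<Node> parents = new ArrayList<>();
        for (int level = 1; level <= maxLevel; level++) {
            List<Node> current = new ArrayList<>();
            if (level == 1) {
                save(fetch.apply(URL_HEAD), 0L, level, saved, current);
            } else {
                for (Node parent : parents) {
                    save(fetch.apply(childUrl(parent.sn)), parent.sn, level, saved, current);
                }
            }
            parents = current; // the next level descends from what was just saved
        }
        return saved;
    }

    private static void save(Map<Long, String> page, long psn, int level,
                             Map<Long, Node> saved, List<Node> current) {
        for (Map.Entry<Long, String> e : page.entrySet()) {
            // Same idea as countBySn(...) == 0: skip codes that are already stored.
            if (saved.containsKey(e.getKey())) {
                continue;
            }
            Node node = new Node(e.getKey(), psn, e.getValue(), level);
            saved.put(node.sn, node);
            current.add(node);
        }
    }

    public static void main(String[] args) {
        // Invented two-level "site": one top-level entry with two children.
        Map<String, Map<Long, String>> site = new LinkedHashMap<>();
        site.put(URL_HEAD, Map.of(11L, "A"));
        site.put(childUrl(11L), Map.of(1101L, "A1", 1102L, "A2"));
        Map<Long, Node> saved = crawl(url -> site.getOrDefault(url, Map.of()), 5);
        saved.values().forEach(n -> System.out.println(n.level + " " + n.sn + " " + n.name));
    }
}
```

Passing the fetcher in as a function keeps the traversal separate from HTTP and parsing, so the loop can be checked against fake pages before it is pointed at the real site.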

PositionService.java:

public interface PositionService {

    void savePosition(Position position);

    long countBySn(Long sn);

    List<Position> findListByLevel(Integer level);
}

PositionServiceImpl.java:

@Service
public class PositionServiceImpl extends ServiceImpl<PositionMapper, Position> implements PositionService {

    @Override
    public void savePosition(Position position) {
        baseMapper.insert(position);
    }

    @Override
    public long countBySn(Long sn) {
        return baseMapper.selectCount(new QueryWrapper<Position>().lambda().eq(Position::getSn, sn));
    }

    @Override
    public List<Position> findListByLevel(Integer level) {
        return baseMapper.selectList(new QueryWrapper<Position>().lambda().eq(Position::getLevel, level));
    }
}

PositionMapper.java:

@Repository
public interface PositionMapper extends BaseMapper<Position> {
}

Position.java:

@Data
@TableName("position")
@EqualsAndHashCode()
public class Position implements Serializable {

    @TableId(type = IdType.AUTO)
    private Integer id;

    /**
     * Code
     */
    private Long sn;

    /**
     * Parent address code
     */
    private Long psn;

    /**
     * Name
     */
    private String name;

    /**
     * Short name
     */
    private String shortName;

    /**
     * Level
     */
    private Integer level;

    /**
     * Area code
     */
    private String code;

    /**
     * Postal code
     */
    private String zip;

    /**
     * Pinyin
     */
    private String spell;

    /**
     * Pinyin initials
     */
    private String spellFirst;

    /**
     * Full address name
     */
    private String fullName;

    /**
     * Pinyin initials of the full address name
     */
    private String fullInitials;

    /**
     * Sort order
     */
    private Integer orderNumber;
}

PositionMapper.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.wkf.workrecord.dao.PositionMapper">

</mapper>

PositionUtils.java:

public class PositionUtils {

    public final static String URL_HEAD = "https://xingzhengquhua.bmcx.com/";

    public final static String URL_TAIL = "__xingzhengquhua/";

    // HttpUtils, CheckUtils and PositionEntity are the author's own helpers; their source is not included in this article
    public static List<PositionEntity> reqPosition(String url) throws InterruptedException {
        String htmlStr = HttpUtils.getRequest(url);
        // Parse the string into a Document object
        Document doc = Jsoup.parse(htmlStr);
        // Inside the body, find the table element with bgcolor="#C5D5C5"
        Elements table = doc.body().getElementsByAttributeValue("bgcolor", "#C5D5C5");
        // Get the tbody element
        Elements children;
        children = table.first().children();
        // Get the collection of tr elements
        Elements tr = children.get(0).getElementsByTag("tr");
        List<PositionEntity> result = new ArrayList<>();
        // Iterate over the tr elements, starting at the fourth row, and read the td cells of each
        for (int i = 3; i < tr.size(); i++) {
            Element e1 = tr.get(i);
            Elements td = e1.getElementsByTag("td");
            if (td.size() < 2) {
                break;
            }
            String name = td.get(0).getElementsByTag("td").first().getElementsByTag("a").text();
            String code = td.get(1).getElementsByTag("td").first().getElementsByTag("a").text();
            if (CheckUtils.isEmpty(name) || CheckUtils.isEmpty(code)) {
                continue;
            }
            result.add(new PositionEntity(name, Long.parseLong(code)));
        }
        // Wait 10 seconds after each request so the IP does not get banned
        Thread.sleep(10000);
        return result;
    }
}
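
`HttpUtils.getRequest` is called above but its source is not part of this article. As a rough idea of what such a helper might look like, here is a sketch built on the JDK's `java.net.http.HttpClient` (Java 11 or later). The class name `HttpFetchSketch`, the User-Agent string, and the timeout values are my own assumptions for illustration, not the original helper.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class HttpFetchSketch {
    // One shared client; the timeout and redirect policy are illustrative choices.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    /** Builds a GET request for the given URL; header and timeout values are assumptions. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "Mozilla/5.0 (compatible; position-crawler-sketch)")
                .timeout(Duration.ofSeconds(30))
                .GET()
                .build();
    }

    /** Downloads the page and returns its body as a string; fails on non-200 responses. */
    static String getRequest(String url) throws IOException, InterruptedException {
        HttpResponse<String> response =
                CLIENT.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }
}
```

The returned string is what `Jsoup.parse(htmlStr)` would receive in `reqPosition`. Note that this version throws `IOException`, which `reqPosition` as written does not declare, so the caller would have to catch or declare it.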

PinyinUtils.java:

public class PinyinUtils {
	private final static HanyuPinyinOutputFormat format = new HanyuPinyinOutputFormat();
	static {
		format.setCaseType(HanyuPinyinCaseType.LOWERCASE);
		format.setToneType(HanyuPinyinToneType.WITHOUT_TONE);
		format.setVCharType(HanyuPinyinVCharType.WITH_V);
	}

	/**
	 * Convert a string to pinyin
	 * 
	 * @param str
	 *            Chinese string
	 * @param fill
	 *            separator
	 * @return the pinyin of the Chinese string
	 */
	public static String str2Pinyin(String str, String fill) {
		if (null == str) {
			return null;
		}
		try {
			StringBuilder sb = new StringBuilder();
			if (fill == null)
				fill = "";
			boolean isCn = true;
			for (int i = 0; i < str.length(); i++) {
				char c = str.charAt(i);
				if (i > 0 && isCn) {
					sb.append(fill);
				}
				if (c == ' ') {
					sb.append(fill);
				}
				// 1. Check whether c is a Chinese character
				if (c >= '\u4e00' && c <= '\u9fa5') {
					isCn = true;
					String[] piyins = PinyinHelper.toHanyuPinyinStringArray(c, format);
					if (null == piyins || 0 >= piyins.length) {
						continue;
					}
					sb.append(piyins[0]);
				} else {
					// Not Chinese: lower-case ASCII capitals, keep everything else
					if (c >= 'A' && c <= 'Z') {
						sb.append((char)(c + 32));
					} else {
						sb.append(c);
					}
					isCn = false;
				}
			}
			return sb.toString();
		} catch (BadHanyuPinyinOutputFormatCombination e) {
			e.printStackTrace();
		}
		return null;
	}

	/**
	 * Pinyin initials
	 * 
	 * @param str
	 *            Chinese string
	 * @return the pinyin initials of the Chinese string
	 */
	public static String strFirst2Pinyin(String str) {
		if (null == str) {
			return null;
		}
		try {
			StringBuilder sb = new StringBuilder();
			for (int i = 0; i < str.length(); i++) {
				char c = str.charAt(i);
				// 1. Check whether c is a Chinese character
				if (c >= '\u4e00' && c <= '\u9fa5') {
					String[] piyins = PinyinHelper.toHanyuPinyinStringArray(c, format);
					if (null == piyins || 0 >= piyins.length) {
						continue;
					}
					sb.append(piyins[0].charAt(0));
				} else {
					// Not Chinese: lower-case ASCII capitals, keep everything else
					if (c >= 'A' && c <= 'Z') {
						sb.append((char)(c + 32));
					} else {
						sb.append(c);
					}
				}
			}
			return sb.toString();
		} catch (BadHanyuPinyinOutputFormatCombination e) {
			e.printStackTrace();
		}
		return null;
	}
}
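
Both methods decide whether a character is Chinese by testing the range U+4E00 to U+9FA5, and they lower-case ASCII capitals by adding 32. The sketch below isolates those two steps using only the JDK, so they can be tried without pinyin4j. `CharCheckSketch` and its method names are made up for this illustration. It also shows `Character.UnicodeScript`, a JDK alternative that recognises Han characters outside that range as well.

```java
class CharCheckSketch {
    /** Same range test as PinyinUtils: U+4E00 through U+9FA5. */
    static boolean isBasicChinese(char c) {
        return c >= '\u4e00' && c <= '\u9fa5';
    }

    /** Broader test based on the Unicode script property of the code point. */
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    /** Lower-cases A-Z the same way (char)(c + 32) does; other characters are returned unchanged. */
    static char lowerAscii(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + 32) : c;
    }

    public static void main(String[] args) {
        System.out.println(isBasicChinese('\u4e2d')); // true: U+4E2D lies inside the range
        System.out.println(isBasicChinese('A'));      // false
        System.out.println(isBasicChinese('\u3400')); // false: CJK Extension A is outside the range
        System.out.println(isHan(0x3400));            // true
        System.out.println(lowerAscii('Q'));          // q
    }
}
```

For place names the narrower range is usually enough, but a rare character outside it would fall into the "not Chinese" branch of `PinyinUtils` and be copied through unchanged.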

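Reading PDF content with Java

I want to read the content of a PDF with Java and then store what I read in a table. I used the iText plugin in the past, but I have no code at hand now and have long since forgotten it. Please give me some code for reading a PDF. One more question: if the PDF contains tables, how should that content be saved? Please write me a demo (with the third-party jars included).

Answer A: A simple way to read the data in a PDF file with Java:
Step 1: Download PDFBox-0.7.2.jar. Here is a download link: http://pdfhome.hope.com.cn/Resource.aspx?CID=63844604-5253-4ae1-b023-258c9e324061&RID=20cd8f94-1cee-40b6-a3df-0ef024f8e0d2 After unzipping, put PDFBox-0.7.2.jar and PDFBox-0.7.2-log4j.jar from the lib folder on your classpath. (I have put the source code and the jars in the attachment below for your convenience.)
Step 2: Write a simple program that reads a PDF file (PdfReader.java).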
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.MalformedURLException;
import java.net.URL;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PdfReader {

    public void readFdf(String file) throws Exception {
        // Whether to sort the text by position
        boolean sort = false;
        // Name of the PDF file
        String pdfFile = file;
        // Name of the output text file
        String textFile = null;
        // Encoding
        String encoding = "UTF-8";
        // First page to extract
        int startPage = 1;
        // Last page to extract
        int endPage = Integer.MAX_VALUE;
        // Output writer used to generate the text file
        Writer output = null;
        // The PDF document held in memory
        PDDocument document = null;
        try {
            try {
                // First try to treat the path as a URL; if that throws, load from the local file system instead
                URL url = new URL(pdfFile);
                // Note: unlike earlier versions, the argument is no longer a URL but a file
                document = PDDocument.load(pdfFile);
                // Get the file name of the PDF
                String fileName = url.getFile();
                // Name the generated txt file after the original PDF
                if (fileName.length() > 4) {
                    File outputFile = new File(fileName.substring(0, fileName.length() - 4) + ".txt");
                    textFile = outputFile.getName();
                }
            } catch (MalformedURLException e) {
                // Loading as a URL failed, so load from the file system
                // Note: unlike earlier versions, the argument is no longer a URL but a file
                document = PDDocument.load(pdfFile);
                if (pdfFile.length() > 4) {
                    textFile = pdfFile.substring(0, pdfFile.length() - 4) + ".txt";
                }
            }
            // Output stream that writes to textFile
            output = new OutputStreamWriter(new FileOutputStream(textFile), encoding);
            // PDFTextStripper extracts the text
            PDFTextStripper stripper = null;
            stripper = new PDFTextStripper();
            // Set whether to sort
            stripper.setSortByPosition(sort);
            // Set the first page
            stripper.setStartPage(startPage);
            // Set the last page
            stripper.setEndPage(endPage);
            // Call PDFTextStripper's writeText to extract and write out the text
            stripper.writeText(document, output);
        } finally {
            if (output != null) {
                // Close the output stream
                output.close();
            }
            if (document != null) {
                // Close the PDF document
                document.close();
            }
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        PdfReader pdfReader = new PdfReader();
        try {
            // Read the content of SpringGuide.pdf on drive E:
            pdfReader.readFdf("E:\\SpringGuide.pdf");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
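That is all it takes to read data from a PDF. A txt file with the same name is generated in the directory that contains your PDF file.

Follow-up question:

Where is the attachment? That download link does not work.

Answer B

Reading a PDF file with Java:

Download Spire.Pdf for Java and import the jar. (It can also be installed from the Maven repository.)

Read the text content of the PDF file: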

import com.spire.pdf.PdfDocument;
import com.spire.pdf.PdfPageBase;
import java.io.*;

public class Extract_Text {

    public static void main(String[] args) {

        // Create a PdfDocument instance
        PdfDocument doc = new PdfDocument();

        // Load the PDF file
        doc.loadFromFile("test.pdf");

        StringBuilder sb = new StringBuilder();

        PdfPageBase page;

        // Iterate over the PDF pages and collect their text
        for (int i = 0; i < doc.getPages().getCount(); i++) {
            page = doc.getPages().get(i);
            sb.append(page.extractText(true));
        }

        // Write the text to a text file; try-with-resources closes the writer
        try (FileWriter writer = new FileWriter("ExtractText.txt")) {
            writer.write(sb.toString());
            writer.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }

        doc.close();
    }
}

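Reading images is also supported; you can try that yourself. PDF, however, has no concept of a table: a table is simply drawn onto the page, unlike a table in an Office file. The only way to get at table content is to extract the data from a specified rectangular area of the PDF page.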
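
Once the text inside a table's rectangle has been extracted, it still has to be turned into rows and cells before it can be stored. What that text looks like depends on the PDF and on the library, so the following is only a sketch under a strong assumption: each table row comes out as one line, and cells are separated by two or more whitespace characters. `TableTextSketch` and `parseRows` are hypothetical names, and the sample text is invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TableTextSketch {
    /**
     * Splits the extracted text into rows (one per non-blank line) and each row
     * into cells (split on runs of two or more whitespace characters).
     */
    static List<List<String>> parseRows(String regionText) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : regionText.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank lines between rows
            }
            rows.add(Arrays.asList(trimmed.split("\\s{2,}")));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Invented sample of what an extracted table region might look like.
        String text = "Name   Code\nBeijing   110000\n\nTianjin   120000\n";
        for (List<String> row : parseRows(text)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Each resulting row could then be mapped to a database record. Cells that themselves contain double spaces, or rows that wrap onto several lines, would break this simple split and need handling specific to the document.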
