Extracting URLs and Other Information from a String with Regular Expressions
Posted by wslook
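I. Overview
Background: While syncing JD.com product information recently, I ran into a problem. The synced product details could not be edited in the rich-text editor, and forcing an edit caused the images to stop displaying correctly. Investigation showed that the images in the details are specified as background images in CSS.
Solution: After several attempts, I settled on a custom HTML tag template: extract the image URL and size (the height; the width is fixed at 750) from the background-image:url declarations in the CSS, and substitute them into the template.
Technology: Java, regular expressions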
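The core idea can be sketched on a single CSS rule before looking at the full program. This is a minimal illustration, not the article's exact code: the class name CssBackgroundSketch, the method toImgTag, and the example.com URL are assumptions made for the example, and the pattern uses a negated character class instead of the article's greedy match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CssBackgroundSketch {
    // Group 1 captures the URL inside url(...); group 2 captures the height in px.
    // [^)]+ stops at the closing parenthesis, so the URL cannot run past it.
    private static final Pattern RULE =
            Pattern.compile("background-image:url\\((https?://[^)]+)\\).*?height:(\\d+)px");

    // Same template shape as the article: one <img> per background image.
    private static final String TEMPLATE =
            "<p><img src=%s data-width=750 data-height=%s /></p>";

    // Returns the filled-in template, or an empty string when the rule has no match.
    static String toImgTag(String cssRule) {
        Matcher m = RULE.matcher(cssRule);
        if (!m.find()) {
            return "";
        }
        return String.format(TEMPLATE, m.group(1), m.group(2));
    }

    public static void main(String[] args) {
        // Hypothetical rule, shaped like the ones in the JD stylesheet.
        String rule = ".demo{width:750px; background-image:url(http://example.com/a.jpg); height:300px}";
        System.out.println(toImgTag(rule));
    }
}
```

The sketch assumes that height is declared after background-image on the same line of the rule, which is how the sample stylesheet is laid out.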
II. Code
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is illustrative; the original post shows only the main method.
public class GoodsDescExtractor {

    public static void main(String[] args) {
        StringBuilder stringBuilder = new StringBuilder();
        // Product details (sample data, split into concatenated pieces for readability)
        String goodsDesc = "<div cssurl='//sku-market-gw.jd.com/css/pc/100002519219.css?t=1581586700014'></div><div id='zbViewModulesH' value='4797'></div><input id='zbViewModulesHeight' type='hidden' value='4797'/><div skudesign=\"100010\"></div><div class=\"ssd-module-wrap\" >\n <div id=\"ssd-vc-goods\" class=\"ssd-module ssd-module-goods M15541052686741\" data-id=\"M15541052686741\">\n <ul class=\"ssd-goods-4\">\n"
            + " <li>\n <a >\n <div class=\"ssd-good-item\">\n <div class=\"ssd-good-img\">\n <img src=\"http://img30.360buyimg.com/n1/jfs/t26977/50/1495537803/456791/ca60d3de/5be4e374Nf8e94aa9.jpg\" alt=\"印尼进口 Tango威化饼干 进口威化饼干 零食礼盒 零食大礼包 潘多拉礼盒684g\"/>\n </div>\n <div class=\"ssd-good-info\">\n <p class=\"ssd-good-name\">\n 印尼进口 Tango威化饼干 进口威化饼干 零食礼盒 零食大礼包 潘多拉礼盒684g\n </p>\n </div>\n </div>\n </a>\n </li>\n"
            + " <li>\n <a >\n <div class=\"ssd-good-item\">\n <div class=\"ssd-good-img\">\n <img src=\"http://img30.360buyimg.com/n1/jfs/t23938/203/1285847551/405421/27964aa9/5b57e55eN969c2d3f.jpg\" alt=\"印尼进口 Tango威化饼干 休闲零食 饼干 咔芙尔焦糖威化饼干73.5g\"/>\n </div>\n <div class=\"ssd-good-info\">\n <p class=\"ssd-good-name\">\n 印尼进口 Tango威化饼干 休闲零食 饼干 咔芙尔焦糖威化饼干73.5g\n </p>\n </div>\n </div>\n </a>\n </li>\n"
            + " <li>\n <a >\n <div class=\"ssd-good-item\">\n <div class=\"ssd-good-img\">\n <img src=\"http://img30.360buyimg.com/n1/jfs/t28057/267/707899178/312718/1054f7be/5bfbda66Ne622ae83.jpg\" alt=\"印尼进口 Tango威化饼干 休闲零食 乳酪夹心威化饼干160g\"/>\n </div>\n <div class=\"ssd-good-info\">\n <p class=\"ssd-good-name\">\n 印尼进口 Tango威化饼干 休闲零食 乳酪夹心威化饼干160g\n </p>\n </div>\n </div>\n </a>\n </li>\n"
            + " <li>\n <a >\n <div class=\"ssd-good-item\">\n <div class=\"ssd-good-img\">\n <img src=\"http://img30.360buyimg.com/n1/jfs/t24409/100/1278216587/342196/7f15ac48/5b580b36Nb9007958.jpg\" alt=\"印尼进口 Tango威化饼干 休闲零食 巧克力夹心威化饼干125g\"/>\n </div>\n <div class=\"ssd-good-info\">\n <p class=\"ssd-good-name\">\n 印尼进口 Tango威化饼干 休闲零食 巧克力夹心威化饼干125g\n </p>\n </div>\n </div>\n </a>\n </li>\n"
            + " <li>\n <a >\n <div class=\"ssd-good-item\">\n <div class=\"ssd-good-img\">\n <img src=\"http://img30.360buyimg.com/n1/jfs/t1/26596/26/9557/317836/5c7f4fedE8e6d5730/940a4d2112e62fc3.jpg\" alt=\"印尼进口 Tango威化饼干 休闲零食 巧克力咔咔脆组合装320g(160g*2盒)\"/>\n </div>\n <div class=\"ssd-good-info\">\n <p class=\"ssd-good-name\">\n 印尼进口 Tango威化饼干 休闲零食 巧克力咔咔脆组合装320g(160g*2盒)\n </p>\n </div>\n </div>\n </a>\n </li>\n"
            + " <li>\n <a >\n <div class=\"ssd-good-item\">\n <div class=\"ssd-good-img\">\n <img src=\"http://img30.360buyimg.com/n1/jfs/t1/16950/37/10436/362577/5c8741b5E238f9c4a/ad91f31e0b26302c.jpg\" alt=\"印尼进口 Tango威化饼干 休闲零食 咔咔脆威化饼干 泡泡糖味80g/盒\"/>\n </div>\n <div class=\"ssd-good-info\">\n <p class=\"ssd-good-name\">\n 印尼进口 Tango威化饼干 休闲零食 咔咔脆威化饼干 泡泡糖味80g/盒\n </p>\n </div>\n </div>\n </a>\n </li>\n"
            + " <li>\n <a >\n <div class=\"ssd-good-item\">\n <div class=\"ssd-good-img\">\n <img src=\"http://img30.360buyimg.com/n1/jfs/t20488/87/2361646474/244765/b67e1c77/5b503ba8N075a3501.jpg\" alt=\"印尼进口 Tango威化饼干 休闲零食 咔咔脆威化饼干 牛奶味160g\"/>\n </div>\n <div class=\"ssd-good-info\">\n <p class=\"ssd-good-name\">\n 印尼进口 Tango威化饼干 休闲零食 咔咔脆威化饼干 牛奶味160g\n </p>\n </div>\n </div>\n </a>\n </li>\n"
            + " <li>\n <a >\n <div class=\"ssd-good-item\">\n <div class=\"ssd-good-img\">\n <img src=\"http://img30.360buyimg.com/n1/jfs/t1/12175/32/10619/337857/5c8741e3E45420cc9/b3dab30dd73a7d8a.jpg\" alt=\"印尼进口 Tango威化饼干 休闲零食 咔咔脆威化饼干 草莓味80g/盒\"/>\n </div>\n <div class=\"ssd-good-info\">\n <p class=\"ssd-good-name\">\n 印尼进口 Tango威化饼干 休闲零食 咔咔脆威化饼干 草莓味80g/盒\n </p>\n </div>\n </div>\n </a>\n </li>\n"
            + " </ul>\n</div><div class=\"ssd-module M15518471203811 animate-M15518471203811\" data-id=\"M15518471203811\">\n \n</div>\n<div class=\"ssd-module M15518471298134 animate-M15518471298134\" data-id=\"M15518471298134\">\n \n</div>\n<div class=\"ssd-module M15518471291853 animate-M15518471291853\" data-id=\"M15518471291853\">\n \n</div>\n<div class=\"ssd-module M15518471283932 animate-M15518471283932\" data-id=\"M15518471283932\">\n \n</div>\n\n</div>\n<!-- 2019-07-01 10:02:50 --> \n"
            + "<style>.ssd-module-wrap{position:relative;margin:0 auto;width:750px;text-align:left;background-color:#fff}.ssd-module-wrap .ssd-module,.ssd-module-wrap .ssd-module-heading{width:750px;position:relative;overflow:hidden}.ssd-module-wrap .ssd-module{background-repeat:no-repeat;background-position:left top;background-size:100% 100%}.ssd-module-wrap .ssd-module-heading{background-repeat:no-repeat;background-position:left center;background-size:100% 100%}.ssd-module-wrap .ssd-module-heading .ssd-module-heading-layout{display:inline-block}.ssd-module-wrap .ssd-module-heading .ssd-widget-heading-ch{float:left;display:inline-block;margin:0 6px 0 15px;height:100%}.ssd-module-wrap .ssd-module-heading .ssd-widget-heading-en{float:left;display:inline-block;margin:0 15px 0 6px;height:100%}.ssd-module-wrap .ssd-widget-pic,.ssd-module-wrap .ssd-widget-text,.ssd-module-wrap .ssd-widget-line,.ssd-module-wrap .ssd-widget-rectangle,.ssd-module-wrap .ssd-widget-circle,.ssd-module-wrap .ssd-widget-triangle,.ssd-module-wrap .ssd-widget-table{position:absolute;overflow:hidden}.ssd-module-wrap .ssd-widget-rectangle{box-sizing:border-box;-moz-box-sizing:border-box;-webkit-box-sizing:border-box}.ssd-module-wrap .ssd-widget-table table{width:100%;height:100%}.ssd-module-wrap .ssd-widget-table td{position:relative;white-space:pre-line;word-break:break-all}.ssd-module-wrap .ssd-widget-pic img{display:block;width:100%;height:100%}.ssd-module-wrap .ssd-widget-text{line-height:1.5;word-break:break-all}.ssd-module-wrap .ssd-widget-text span{display:block;overflow:hidden;width:100%;height:100%;padding:0;margin:0;word-break:break-all;word-wrap:break-word;white-space:normal}.ssd-module-wrap .ssd-widget-link{position:absolute;left:0;top:0;width:100%;height:100%;background:transparent;z-index:100}.ssd-module-wrap .ssd-cell-text{position:absolute;top:0;left:0;right:0;width:100%;height:100%;overflow:auto}.ssd-module-wrap .M15541052686741{width:750px; height:492px}\n"
            + ".ssd-module-wrap .M15541052686741 ul {\n padding: 5px;\n line-height: 1.15;\n background: #F3F4F7;\n overflow: hidden;\n}\n\n.ssd-module-wrap .M15541052686741 li {\n list-style-type: none;\n padding: 5px;\n float: left;\n -moz-box-sizing: border-box;\n box-sizing: border-box;\n}\n\n.ssd-module-wrap .M15541052686741 .ssd-goods-1 li {\n width: 100%;\n}\n\n.ssd-module-wrap .M15541052686741 .ssd-goods-2 li {\n width: 50%;\n}\n\n.ssd-module-wrap .M15541052686741 .ssd-goods-3 li {\n width: 33.33%;\n}\n\n.ssd-module-wrap .M15541052686741 .ssd-goods-4 li {\n width: 25%;\n}\n\n.ssd-module-wrap .M15541052686741 a {\n display: block;\n overflow: hidden;\n}\n\n.ssd-module-wrap .M15541052686741 .ssd-good-item {\n background-color: #fff;\n}\n\n.ssd-module-wrap .M15541052686741 .ssd-good-img {\n position: relative;\n padding-top: 100%;\n}\n\n.ssd-module-wrap .M15541052686741 .ssd-good-img img {\n position: absolute;\n top: 0;\n left: 0;\n width: 100%;\n height: 100%;\n}\n\n.ssd-module-wrap .M15541052686741 .ssd-good-info {\n padding: 10px;\n margin: 0;\n}\n\n.ssd-module-wrap .M15541052686741 .ssd-good-name {\n margin: 0;\n height: 36px;\n line-height: 18px;\n font-size: 14px;\n color: #333333;\n display: -webkit-box;\n overflow: hidden;\n text-overflow: ellipsis;\n -webkit-line-clamp: 2;\n -webkit-box-orient: vertical; \n}"
            + ".ssd-module-wrap .M15518471203811{width:750px; background-color:#e9e9e9; background-size:100% 100%; background-image:url(http://img30.360buyimg.com/sku/jfs/t1/31717/2/4671/349535/5c7f4f07E899abe1e/9dd81eaf2aac0863.jpg); height:1083px}\n"
            + ".ssd-module-wrap .M15518471298134{width:750px; background-color:#e9e9e9; background-size:100% 100%; background-image:url(http://img30.360buyimg.com/sku/jfs/t1/14459/14/9500/215997/5c7f4f06E886e02de/9de0bdce8ff65b3c.jpg); height:786px}\n"
            + ".ssd-module-wrap .M15518471291853{width:750px; background-color:#e9e9e9; background-size:100% 100%; background-image:url(http://img30.360buyimg.com/sku/jfs/t1/25970/3/9647/494996/5c7f4f07E79829fc4/41a47699929ca408.jpg); height:1416px}\n"
            + ".ssd-module-wrap .M15518471283932{width:750px; background-color:#e9e9e9; background-size:100% 100%; background-image:url(http://img30.360buyimg.com/sku/jfs/t1/79600/29/3390/325766/5d196888Ee80899ac/6b260d5e4eab426d.jpg); height:1020px}\n</style>";
        // Template for the rebuilt product details (one <img> per background image)
        String goodsDescTemplate = "<p><img src=%s data-width=750 data-height=%s /></p>";
        // Regex that extracts the image URL and the height value; the fields to extract are wrapped in capturing groups ()
        // Note the doubled backslashes: \\( and \\d are required inside a Java string literal
        Pattern pattern = Pattern.compile("background-image:url\\((https?://.*)\\).*height:(\\d+)");
        // After studying the source string, first split it on the size suffix "px}"
        String[] split = goodsDesc.split("px}");
        for (String s : split) {
            // Skip segments that contain no background image
            if (s.contains("background-image:url")) {
                // Create a matcher for this segment
                Matcher matcher = pattern.matcher(s);
                // Search the segment and check whether anything matches
                while (matcher.find()) {
                    // The printed labels mean: matched string, extracted image URL, extracted height value
                    System.out.println("匹配到的字符串:" + matcher.group());
                    System.out.println("提取的图片地址:" + matcher.group(1));
                    System.out.println("提取的height值:" + matcher.group(2));
                    stringBuilder.append(String.format(goodsDescTemplate, matcher.group(1), matcher.group(2)));
                }
            }
        }
        // Label meaning: concatenated string
        System.out.println("拼接的字符串:" + stringBuilder);
    }
}
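As a variation on the split-then-match approach above, the stylesheet can also be scanned in a single pass. The sketch below is an alternative technique, not the author's method: the class name StylesheetScanSketch, the method extract, and the example.com URLs are assumptions for illustration. It assumes, as in the sample data, that height is declared after background-image inside the same rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StylesheetScanSketch {
    // [^)]+ keeps the URL inside its parentheses.
    // [^}]*? keeps the search for height inside the same CSS rule,
    // so no prior split on "px}" is needed.
    private static final Pattern BACKGROUND =
            Pattern.compile("background-image:url\\((https?://[^)]+)\\)[^}]*?height:(\\d+)px");

    // Returns one {url, height} pair per rule that declares a background image.
    static List<String[]> extract(String css) {
        List<String[]> result = new ArrayList<>();
        Matcher m = BACKGROUND.matcher(css);
        while (m.find()) {
            result.add(new String[] {m.group(1), m.group(2)});
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical stylesheet with two background rules and one unrelated rule.
        String css = ".a{width:750px; background-image:url(http://example.com/1.jpg); height:100px}\n"
                + ".b{color:red}\n"
                + ".c{background-image:url(http://example.com/2.jpg); height:200px}";
        for (String[] pair : extract(css)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

One limitation to keep in mind: the pattern looks for the literal text height:, so a rule that places line-height: or min-height: between the background image and the real height would be matched at the wrong declaration.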
III. Printed Output
The program prints its labels in Chinese: 匹配到的字符串 is the matched string, 提取的图片地址 is the extracted image URL, 提取的height值 is the extracted height value, and 拼接的字符串 is the concatenated string.
匹配到的字符串:background-image:url(http://img30.360buyimg.com/sku/jfs/t1/31717/2/4671/349535/5c7f4f07E899abe1e/9dd81eaf2aac0863.jpg); height:1083
提取的图片地址:http://img30.360buyimg.com/sku/jfs/t1/31717/2/4671/349535/5c7f4f07E899abe1e/9dd81eaf2aac0863.jpg
提取的height值:1083
匹配到的字符串:background-image:url(http://img30.360buyimg.com/sku/jfs/t1/14459/14/9500/215997/5c7f4f06E886e02de/9de0bdce8ff65b3c.jpg); height:786
提取的图片地址:http://img30.360buyimg.com/sku/jfs/t1/14459/14/9500/215997/5c7f4f06E886e02de/9de0bdce8ff65b3c.jpg
提取的height值:786
匹配到的字符串:background-image:url(http://img30.360buyimg.com/sku/jfs/t1/25970/3/9647/494996/5c7f4f07E79829fc4/41a47699929ca408.jpg); height:1416
提取的图片地址:http://img30.360buyimg.com/sku/jfs/t1/25970/3/9647/494996/5c7f4f07E79829fc4/41a47699929ca408.jpg
提取的height值:1416
匹配到的字符串:background-image:url(http://img30.360buyimg.com/sku/jfs/t1/79600/29/3390/325766/5d196888Ee80899ac/6b260d5e4eab426d.jpg); height:1020
提取的图片地址:http://img30.360buyimg.com/sku/jfs/t1/79600/29/3390/325766/5d196888Ee80899ac/6b260d5e4eab426d.jpg
提取的height值:1020
拼接的字符串:<p><img src=http://img30.360buyimg.com/sku/jfs/t1/31717/2/4671/349535/5c7f4f07E899abe1e/9dd81eaf2aac0863.jpg data-width=750 data-height=1083 /></p><p><img src=http://img30.360buyimg.com/sku/jfs/t1/14459/14/9500/215997/5c7f4f06E886e02de/9de0bdce8ff65b3c.jpg data-width=750 data-height=786 /></p><p><img src=http://img30.360buyimg.com/sku/jfs/t1/25970/3/9647/494996/5c7f4f07E79829fc4/41a47699929ca408.jpg data-width=750 data-height=1416 /></p><p><img src=http://img30.360buyimg.com/sku/jfs/t1/79600/29/3390/325766/5d196888Ee80899ac/6b260d5e4eab426d.jpg data-width=750 data-height=1020 /></p>