HTML HTML5 base clean - Ben Wellby


<!DOCTYPE html>

<html lang="en">
<head>
    <!-- Meta -->
    <meta charset="utf-8">
    <meta name="description" content="">
    <meta name="keywords" content="">
    <meta name="author" content="Ben Wellby">
    <!-- End Meta -->
    
    <!-- Favicons -->
    <link rel="shortcut icon" href="favicon.ico">
    <link rel="apple-touch-icon" href="youricon.png">
    <!-- End Favicons -->
	
    <!-- Title -->
    <title>Webpage Title</title>
    <!-- End Title -->
    
    <!-- CSS -->
    <link rel="stylesheet" href="css/stylesheet.css" />
    <!-- End CSS -->
</head>

<body>
    <!-- Page Content -->
    <div id="container">
    
    </div>
    <!-- End Page Content -->
    
    <!-- Javascript -->
    <script src="js/script.js"></script>
    <!-- End Javascript -->
</body>
</html>

Python nltk.clean_html not implemented

【Title】: Python nltk.clean_html not implemented 【Posted】: 2014-11-18 01:43:15 【Question】:

I have been trying to use

myNews=urlopen(url).read()    
myNews=nltk.clean_html(myNews)

and I get the following error:

  File "/usr/local/lib/python2.7/dist-packages/nltk-3.0.0-py2.7.egg/nltk/util.py", line 346, in clean_html
    raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")
NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function

When I look at util.py, I can see that the function is not implemented:

def clean_html(html):
    raise NotImplementedError ("To remove HTML markup, use BeautifulSoup's get_text() function")

Shouldn't it be implemented?

【Comments】:

【Answer 1】:

If your code is

raw = nltk.clean_html(html) 
tokens = nltk.word_tokenize(raw)

you can use

from bs4 import BeautifulSoup  # Beautiful Soup 4
raw = BeautifulSoup(html, "html.parser").get_text()  # name the parser explicitly
tokens = nltk.word_tokenize(raw)

instead. See the other answers for details.

【Discussion】:

【Answer 2】:

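As the other answers state, NLTK dropped this feature and suggests: "To remove HTML markup, use BeautifulSoup's get_text() function." If you are extracting text from specific elements, Beautiful Soup is probably the way to go, but if you want the text of the whole page, IMHO the old nltk function works better. Here is a comparison of the two approaches: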

import mechanize
import nltk
from bs4 import BeautifulSoup
from html2text import html2text 
import re


def clean_html(html):
    """
    Copied from NLTK package.
    Remove HTML markup from the given string.

    :param html: the HTML string to be cleaned
    :type html: str
    :rtype: str
    """

    # First we remove inline JavaScript/CSS:
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Then we remove html comments. This has to be done before removing regular
    # tags since comments can contain '>' characters.
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    # Next we can remove the remaining tags:
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    # Finally, we deal with whitespace
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    cleaned = re.sub(r"  ", " ", cleaned)
    return cleaned.strip()

url = "http://www.nytimes.com/2015/08/31/business/challenged-on-left-and-right-the-fed-faces-a-decision-on-rates.html"
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Firefox')]
html = br.open(url).read().decode('utf-8')
cleanhtml = clean_html(html)
text = html2text(cleanhtml)
soup = BeautifulSoup(html)
text2 = soup.get_text()

With the nltk function I get a very clean result (see here; the output exceeded the 30,000-character post limit, so I had to put it in a pastebin). And with Beautiful Soup:

u'\n  \n\n\n\n\nChallenged on Left and Right, the Fed Faces a Decision on Rates - The New York Times\nwindow.NREUM||(NREUM=),__nr_require=function(n,e,t)function r(t)if(!e[t])var o=e[t]=exports:;n[t][0].call(o.exports,function(e)var o=n[t][1][e];return r(o?o:e),o,o.exports)return e[t].exportsif("function"==typeof __nr_require)return __nr_require;for(var o=0;o<t.length;o++)r(t[o]);return r(QJf3ax:[function(n,e)function t(n)function e(e,t,a)n&&n(e,t,a),a||(a=);for(var u=c(e),f=u.length,s=i(a,o,r),p=0;f>p;p++)u[p].apply(s,t);return sfunction a(n,e)f[n]=c(n).concat(e)function c(n)return f[n]||[]function u()return t(e)var f=;returnon:a,emit:e,create:u,listeners:c,_events:ffunction r()returnvar o="nr@context",i=n("gos");e.exports=t(),gos:"7eSDFh"],ee:[function(n,e)e.exports=n("QJf3ax"),],3:[function(n,e)function t(n)return function()r(n,[(new Date).getTime()].concat(i(arguments)))var r=n("handle"),o=n(1),i=n(2);"undefined"==typeof window.newrelic&&(newrelic=window.NREUM);var a=["setPageViewName","addPageAction","setCustomAttribute","finished","addToTrace","inlineHit","noticeError"];o(a,function(n,e)window.NREUM[e]=t("api-"+e)),e.exports=window.NREUM,1:12,2:13,handle:"D5DuLP"],gos:[function(n,e)e.exports=n("7eSDFh"),],"7eSDFh":[function(n,e)function t(n,e,t)if(r.call(n,e))return n[e];var o=t();if(Object.defineProperty&&Object.keys)tryreturn Object.defineProperty(n,e,value:o,writable:!0,enumerable:!1),ocatch(i)return n[e]=o,ovar r=Object.prototype.hasOwnProperty;e.exports=t,],D5DuLP:[function(n,e)function t(n,e,t)return r.listeners(n).length?r.emit(n,e,t):(o[n]||(o[n]=[]),void o[n].push(e))var r=n("ee").create(),o=;e.exports=t,t.ee=r,r.q=o,ee:"QJf3ax"],handle:[function(n,e)e.exports=n("D5DuLP"),],XL7HBI:[function(n,e)function t(n)var e=typeof n;return!n||"object"!==e&&"function"!==e?-1:n===window?0:i(n,o,function()return r++)var 
r=1,o="nr@id",i=n("gos");e.exports=t,gos:"7eSDFh"],id:[function(n,e)e.exports=n("XL7HBI"),],loader:[function(n,e)e.exports=n("G9z0Bl"),],G9z0Bl:[function(n,e)function t()var n=h.info=NREUM.info;if(n&&n.licenseKey&&n.applicationID&&f&&f.body)c(l,function(e,t)e in n||(n[e]=t)),h.proto="https"===d.split(":")[0]||n.sslForHttp?"https://":"http://",a("mark",["onload",i()]);var e=f.createElement("script");e.src=h.proto+n.agent,f.body.appendChild(e)function r()"complete"===f.readyState&&o()function o()a("mark",["domContent",i()])function i()return(new Date).getTime()var a=n("handle"),c=n(1),u=(n(2),window),f=u.document,s="addEventListener",p="attachEvent",d=(""+location).split("?")[0],l=beacon:"bam.nr-data.net",errorBeacon:"bam.nr-data.net",agent:"js-agent.newrelic.com/nr-593.min.js",h=e.exports=offset:i(),origin:d,features:;f[s]?(f[s]("DOMContentLoaded",o,!1),u[s]("load",t,!1)):(f[p]("onreadystatechange",r),u[p]("onload",t)),a("mark",["firstbyte",i()]),1:12,2:3,handle:"D5DuLP"],12:[function(n,e)function t(n,e)var t=[],o="",i=0;for(o in n)r.call(n,o)&&(t[i]=e(o,n[o]),i+=1);return tvar r=Object.prototype.hasOwnProperty;e.exports=t,],13:[function(n,e)function t(n,e,t)e||(e=0),"undefined"==typeof t&&(t=n?n.length:0);for(var r=-1,o=t-e||0,i=Array(0>o?0:o);++r<o;)i[r]=n[e+r];return ie.exports=t,],,["G9z0Bl"]);\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"pageconfig":"ledeMediaSize":"large","keywords":["article-medium","has-embedded-interactive"]\n\n            []    \n\nvar googletag=googletag||;googletag.cmd=googletag.cmd||[],function()var t=document.createElement("script");t.async=!0,t.type="text/javascript";t.src="http://www.googletagservices.com/tag/js/gpt.js";var o=document.getElementsByTagName("script")[0];o.parentNode.insertBefore(t,o)();\n\n\n[\n    \n        "testId": "0012",\n        "testName": "tallWatchingModule",\n  
      "throttle": 1.0,\n        "allocation": 0.9,\n        "variants": 1,\n        "applications": ["homepage"]\n    ,\n    \n        "testId": "0033",\n        "testName": "recommendedLabelTest",\n        "throttle": 1,\n        "allocation": 0.833,\n        "variants": 5,\n        "applications": ["article"]\n    ,\n    \n        "testId": "0036",\n        "testName": "velcroSocialFollow",\n        "throttle": 0.1,\n        "allocation": 0.5,\n        "variants": 1,\n        "applications": ["article", "homepage"]\n    ,\n    \n        "testId": "0050",\n        "testName": "styledMostEmailed",\n        "throttle": 1,\n        "allocation": 0.667,\n        "variants": 2,\n        "applications": ["article"]\n    ,\n    \n        "testId": "0051",\n        "testName": "shuffleRecommendations",\n        "throttle": 1.0,\n        "allocation": 0.667,\n        "variants": 1,\n        "applications": ["article"]\n    ,\n    \n        "testId": "0052",\n        "testName": "paidPostDriver",\n        "throttle": 1.0,\n        "allocation": 0.875,\n        "variants": 7,\n        "applications": ["article"]\n    ,\n    \n        "testId": "0061",\n        "testName": "paidPostFivePack",\n        "throttle": 0,\n        "allocation": 0,\n        "variants": 1,\n        "applications": ["homepage"]\n    \n]\n\n\n\n "meta": ,\n  "data": \n    "id": "0",\n    "name": "",\n    "subscription": ["","_RPV"],\n    "demographics": \n  \n\n\n\nvar require = \n    baseUrl: \'http://a1.nyt.com/assets/\',\n    waitSeconds: 20,\n    paths: \n        \'foundation\': \'article/20150828-192044/js/foundation\',\n        \'shared\': \'article/20150828-192044/js/shared\',\n        \'article\': \'article/20150828-192044/js/article\',\n        \'application\': \'article/20150828-192044/js/article/article\',\n        \'videoFactory\': \'http://static01.nyt.com/js2/build/video/2.0/videofactoryrequire\',\n        \'videoPlaylist\': 
\'http://static01.nyt.com/js2/build/video/players/extended/2.0/appRequire\',\n        \'auth/mtr\': \'http://static01.nyt.com/js/mtr\',\n        \'auth/growl\': \'http://static01.nyt.com/js/auth/growl/default\',\n        \'vhs\': \'http://static01.nyt.com/video/vhs/build/vhs-2.x.min\'\n    ,\n    map: \n        \'*\': \n            \'article/main\': \'article/article/main\'\n        \n    \n;\n\n\n\n\n\n\nwindow.magnum.processFlags(["limitFabrikSave","moreFollowSuggestions","dfpAds","dfpWhitelist","criticsPickAdditionalInfo","restaurantAttributes","theaterAttributes","movieAttributes","followFeature","restaurantReviewAdditionalDetails","theaterReviewAdditionalDetails","restaurantReviewHideInfoBox","theaterReviewHideInfoBox","restaurantReviewShowRestaurantName","restaurantReviewShowGoogleMap","restaurantReviewShowNotes","restaurantReviewShowLastUpdated","styledMostEmailed","videoVHSCover","restaurantReviewShowMenuLink","allTheEmphases","androidDeepLinks","autoPlayVideos","restaurantOpenStatus","standaloneSlideshowPromo","showNewTMagLogo"]);\n\n\nrequire([\'foundation/main\'], function () \n    require([\'auth/mtr\', \'auth/growl\']);\n);\n\n\n\n\n    .lt-ie10 .messenger.suggestions \n        display: block !important;\n        height: 50px;\n    \n\n    .lt-ie10 .messenger.suggestions .message-bed \n        background-color: #f8e9d2;\n        border-bottom: 1px solid #ccc;\n    \n\n    .lt-ie10 .messenger.suggestions .message-container \n        padding: 11px 18px 11px 30px;\n    \n\n    .lt-ie10 .messenger.suggestions .action-link \n        font-family: "nyt-franklin", arial, helvetica, sans-serif;\n        font-size: 10px;\n        font-weight: bold;\n        color: #a81817;\n        text-transform: uppercase;\n    \n\n    .lt-ie10 .messenger.suggestions .alert-icon \n        background: url(\'http://i1.nyt.com/images/icons/icon-alert-12x12-a81817.png\') no-repeat;\n        width: 12px;\n        height: 12px;\n        display: inline-block;\n        margin-top: 
-2px;\n        float: none;\n    \n\n    .lt-ie10 .masthead,\n    .lt-ie10 .navigation,\n    .lt-ie10 .comments-panel \n        margin-top: 50px !important;\n    \n\n    .lt-ie10 .ribbon \n        margin-top: 97px !important;\n    \n\n\n\n\n\n\nNYTimes.com no longer supports Internet Explorer 9 or earlier. Please upgrade your browser.\nLEARN MORE \xbb\n\n\n\n\n\n\n\n\n\nSections\n\nHome\n\nSearch\nSkip to content\nSkip to navigation\nView mobile version\n\n\n\n\nThe New York Times\n\n\nwindow.magnum.writeLogo(\'small\', \'http://a1.nyt.com/assets/article/20150828-192044/images/foundation/logos/\', \'business\', \'masthead-theme-standard\', \'standard\', \'branding-heading-link\');\n\n\nEconomy|Challenged on Left and Right, the Fed Faces a Decision on Rates\n\n\n\nAdvertisement\n\n\n\n\n\n\n\nSearch\n\n\nLog In\n0\nSettings\n\n\n\n\nClose search\n\nsearch sponsored by\n\n\n\n\n\n\nSearch NYTimes.com\n\n\n\nClear this text input\n\n\n\nGo\n\n\n\n\n\n\nhttp://nyti.ms/1VpLa1D\n\n\n\n\nLoading...\n\n\n\n\nSee next articles\n\n\n\n\n\nSee previous articles\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\nAdvertisement\n\n\n\n\n\n\nEconomy \nChallenged on Left and Right, the Fed Faces a Decision on Rates\n\nBy BINYAMIN APPELBAUMAUG. 30, 2015\n\n\nInside\n\n\n\nSupported by\n\n\n\n\n\n\n\n\nPhoto\n\n\n\n\n\n\nJanet L. Yellen, the Federal Reserve chairwoman.\n\nCredit\n            Stephen Crowley/The New York Times        \n\n\n\n\nAdvertisement\n\nContinue reading the main story\n\n\n\n\n\nContinue reading the main story\nShare This Page\n\nContinue reading the main story\n\n\nContinue reading the main story\n\n\n\nJACKSON HOLE, Wyo. 
\u2014  Conservative activists who want the Federal Reserve to raise interest rates distributed chocolate coins in golden wrappers at the local airport last week as Fed officials arrived for their annual policy retreat.Liberal activists in green \u201cWhose Recovery?\u201d T-shirts formed a receiving line at the resort hotel in the heart of Grand Teton National Park where the meeting was held, to personalize their argument that the Fed should wait.Sometime soon \u2014 possibly as early as mid-September and probably no later than the end of the year \u2014 the Fed plans to raise its benchmark interest rate one-quarter of one percentage point, a mathematically minor move that has become a very big deal.Investors, who always pay attention to the Fed, are paying particular attention now. The central bank has held short-term rates near zero since December 2008; the impending end of that era is one cause of recent financial market turmoil. \n\nContinue reading the main story\n\n\n                            Related Coverage\n                    \n\n\n\n\n\n\n\n\n\n\nOptimistic About Inflation, Stanley Fischer Suggests That Fed Will Stick to Plan on RatesAUG. 29, 2015\n\n\n\n\n\n\n\nBut the Fed\u2019s plans have also become the latest point of contention in a broader debate about the government\u2019s management of the American economy, pitting liberals who see a need for more aggressive measures to bolster growth against conservatives concerned that Washington and the Fed are already doing much too much. \nContinue reading the main story\n\n\n\n                When Will the Fed Raise Rates?            \n\n                More than seven years ago the Federal Reserve put its benchmark interest rate close to zero, as a way to bolster the economy. But that policy is about to change.            
\n\n\n\n\n\n\n\n\n\n\n\n\u201cThere shouldn\u2019t be this intense interest in a quarter-point increase, and there shouldn\u2019t be this intense interest in whether it comes in September or December,\u201d said Alan S. Blinder, a Princeton economist and the Fed\u2019s vice chairman in the mid-1990s. \u201cBut the Fed remains the center of the financial universe. People stare at it like they stare at the North Star.\u201dAnd so, as Fed officials conferred with other central bankers and academics, the liberal activists held two days of \u201cFed Up\u201d teach-ins in a room directly below the main conference, while the conservatives convened a \u201cJackson Hole Summit\u201d at a nearby dude ranch.In the decades before the financial crisis, policy makers generally agreed that central banks should focus on moderating inflation. Now, both that goal and the best way to achieve it are subjects of debate. Liberals argue that the Fed should aim more broadly to lower unemployment and encourage rising living standards. Conservatives want to strengthen the focus on inflation by requiring officials to follow rules in making policy.\nAdvertisement\n\nContinue reading the main story\nWith the critics lining up outside, central bankers found no escape inside the main conference, where a series of academics warned policy makers that their view of inflation was oversimplified, and that their policies were less effective as a consequence.\u201cThe conference was more about what we don\u2019t know, about a candid willingness to analyze what we don\u2019t know,\u201d said Lucrezia Reichlin, a professor at London Business School and former director general of research at the European Central Bank. \u201cIt did not really inspire confidence\u201d in monetary policy.The formal program, on \u201cInflation Dynamics and Monetary Policy,\u201d was devoted to the vexing reality that inflation in recent years has not behaved as economists predicted. 
The basic paradigm, known as the Phillips Curve, is that inflation falls as unemployment rises, and rises as unemployment falls. But inflation did not fall as much as expected during the Great Recession, and it has remained surprisingly weak during the recovery.\nAdvertisement\n\nContinue reading the main story\nOver the course of two days, the invited academics argued that the real story was more complicated. One study, for example, presented evidence that prices fall more slowly during recessions because cash-short firms actually tend to increase prices in the face of declining demand for their products.\u201cOnce you integrate all these dynamics, it may turn out that life is not that simple,\u201d said Eric M. Leeper, an economist at Indiana University and co-author of a paper arguing that central banks need better economic models.Central bankers, however, have shown little interest in paradigm shifts. Several said that the basic understanding of inflation, while obviously imperfect, remains more functional than any alternatives.\u201cI don\u2019t think the folks at the Fed are of a mind to redesign monetary policy just because of what happened during the crisis,\u201d said Jon Faust, a professor of economics at Johns Hopkins University and a former adviser to the Fed\u2019s chairwoman, Janet L. Yellen, and her predecessor, Ben S. Bernanke.Indeed, V\xedtor Const\xe2ncio, vice president of the European Central Bank, said the euro area was currently experiencing \u201ca renaissance of the Phillips Curve.\u201dStanley Fischer, vice chairman of the Federal Reserve, painted a somewhat more complicated picture of inflation, arguing that the role of labor market slack is easily overstated, and that exchange rates play an important role.\nContinue reading the main story\nVideo\n\nThe Fed\u2019s Button on the Economy\n\nWhen it comes to raising or lowering interest rates, what the Fed is really trying to do is balance growth and inflation. 
But they have a limited set of tools to accomplish their goal.\n\n                    By Andrew Ross Sorkin, Aaron Byrd and Erica Berenstein on                                                                Publish Date July 29, 2015.\n                                    \n\n                                            Photo by Aaron Byrd/The New York Times.\n                                    \nWatch in Times Video \xbb\n\n\n\nBut his bottom line, too, was that the Fed understands inflation well enough to predict its movements. While domestic inflation has been surprisingly sluggish for years now, Mr. Fischer said on Friday that his confidence in an eventual rebound remained \u201cpretty high.\u201dThe organizers of the fringe conferences acknowledged the odds against their more radical proposals.\u201cFed Up\u201d is mostly funded by the foundation of a Facebook co-founder, Dustin Moskovitz, which said: \u201cOur best guess is that the campaign is unlikely to have an impact on the Fed\u2019s monetary policy, but that if it does, the benefits would be very large.\u201dJim DeMint, president of the Heritage Foundation, spoke at the conservative conference of \u201ca long and difficult battle that we can and must win.\u201dThe Center for Public Democracy, which organized the \u201cFed Up\u201d campaign, wants the Fed to keep rates near zero even as overall unemployment falls, to spur wage gains and help members of minorities, in particular, find jobs. It brought about 50 people to Jackson Hole as part of an effort to engage community groups that generally focus on civil rights or local issues like minimum wage laws.Dawn O\u2019Neal, 48, makes $8.50 an hour as a day care worker in suburban Atlanta; her husband has not found regular construction work in a year. When Ms. 
O\u2019Neal needs a refill on her asthma medication, she cuts back on food, buying hot dogs instead of beef and canned vegetables instead of fresh vegetables.\u201cI don\u2019t feel like anyone at the Fed has ever had to make a decision about whether to eat or get medication, and so when I hear that they\u2019re going to raise interest rates in September, it angers me and it scares me,\u201d Ms. O\u2019Neal said.\nAdvertisement\n\nContinue reading the main story\n\n\nAdvertisement\n\nContinue reading the main story\nThe protesters struck a chord with some officials at the main meeting. Jason Furman, President Obama\u2019s chief economic adviser, went downstairs and delivered an impromptu speech. \u201cWe don\u2019t comment on monetary policy, but what I can say is that monetary policy matters,\u201d he told the activists. The prosperity of the late 1990s, he added, resulted in part from \u201ca set of decisions made by the Federal Reserve that allowed that to happen.\u201dOther officials, however, said the push for low rates was misguided.\u201cThe biggest risk for those that are less fortunate is that we would go back into recession,\u201d said James Bullard, president of the Federal Reserve Bank of St. Louis, who said he leaned toward raising rates in September. \u201cI\u2019m hoping my policy would lengthen out the expansion longer.\u201dThe conservative conference was aligned with efforts by congressional Republicans to impose new restrictions on the Fed\u2019s conduct of monetary policy. A leading proposal would require the Fed to choose a formula for setting rates and stick with it.This view has few fans among the central bankers, who see their own judgment as an essential part of policy making.Mr. Blinder said part of the disconnect between the officials and the activists may reflect that broader concerns motivate liberals and conservatives. 
Conservatives see the Fed as enabling the growth of the federal debt, while liberals see the Fed as contributing to the rise of inequality.Mr. Blinder said the central bank had little power to reverse either trend. \u201cThey overstate the importance and power of the Federal Reserve,\u201d he said. All it can do, he added, is \u201caddress these problems around the edges.\u201d\n\n\nA version of this article appears in print on August 31, 2015, on page A1 of the New York edition with the headline: Left and Right Work to Shift Fed\u2019s Direction.  Order Reprints| Today\'s Paper|Subscribe\n\n\n\n\n\n\n\n\n\n\n\nLoading...\n\n\n\n\n\n\n\n\n\nGo to Home Page \xbb\n\nSite Index\n\nThe New York Times\n\n\nwindow.magnum.writeLogo(\'small\', \'http://a1.nyt.com/assets/article/20150828-192044/images/foundation/logos/\', \'\', \'\', \'standard\', \'site-index-branding-link\');\n\n\n\n\nNews\n\n\nWorld\n\n\nU.S.\n\n\nPolitics\n\n\nN.Y.\n\n\nBusiness\n\n\nTech\n\n\nScience\n\n\nHealth\n\n\nSports\n\n\nEducation\n\n\nObituaries\n\n\nToday\'s Paper\n\n\nCorrections\n\n\n\n\nOpinion\n\n\nToday\'s Opinion\n\n\nOp-Ed Columnists\n\n\nEditorials\n\n\nContributing Writers\n\n\nOp-Ed Contributors\n\n\nOpinionator\n\n\nLetters\n\n\nSunday Review\n\n\nTaking Note\n\n\nRoom for Debate\n\n\nPublic Editor\n\n\nVideo: Opinion\n\n\n\n\nArts\n\n\nToday\'s Arts\n\n\nArt & Design\n\n\nArtsBeat\n\n\nBooks\n\n\nDance\n\n\nMovies\n\n\nMusic\n\n\nN.Y.C. Events Guide\n\n\nTelevision\n\n\nTheater\n\n\nVideo Games\n\n\nVideo: Arts\n\n\n\n\nLiving\n\n\nAutomobiles\n\n\nCrossword\n\n\nFood\n\n\nEducation\n\n\nFashion & Style\n\n\nHealth\n\n\nJobs\n\n\nMagazine\n\n\nN.Y.C. Events Guide\n\n\nReal Estate\n\n\nT Magazine\n\n\nTravel\n\n\nWeddings & Celebrations\n\n\n\n\nListings & More\n\n\nClassifieds\n\n\nTools & Services\n\n\nTimes Topics\n\n\nPublic Editor\n\n\nN.Y.C. 
Events Guide\n\n\nTV Listings\n\n\nBlogs\n\n\nCartoons\n\n\nMultimedia\n\n\nPhotography\n\n\nVideo\n\n\nNYT Store\n\n\nTimes Journeys\n\n\nSubscribe\n\n\nManage My Account\n\n\n\n\nSubscribe\n\nSubscribe\n\n\nTimes Premier\n\n\n\nHome Delivery\n\n\n\nDigital Subscriptions\n\n\n\nNYT Opinion\n\n\n\nCrossword\n\n\n\n\nEmail Newsletters\n\n\nAlerts\n\n\nGift Subscriptions\n\n\nCorporate Subscriptions\n\n\nEducation Rate\n\n\n\n\nMobile Applications\n\n\nReplica Edition\n\n\nInternational New York Times\n\n\n\n\n\n\n\n\n\n\n\n                    \xa9 2015 The New York Times Company\n\n\nHome\nSearch\nContact Us\nWork With Us\nAdvertise\nYour Ad Choices\nPrivacy\nTerms of Service\nTerms of Sale\n\n\n\n\nSite Map\nHelp\nSite Feedback\nSubscriptions\n\n\n\n\n\n\nrequire([\'foundation/main\'], function () \n    require([\'article/main\']);\n    require([\'jquery/nyt\', \'foundation/views/page-manager\'], function ($, pageManager) \n        if (window.location.search.indexOf(\'disable_tagx\') > 0) \n            return;\n        \n        $(document).ready(function () \n            require([\'http://static01.nyt.com/bi/js/tagx/tagx.js\'], function () \n                pageManager.trackingFireEventQueue();\n            );\n        );\n    );\n);\n\n\n\n\n\n\n\n\n\n\nwindow.NREUM||(NREUM=);NREUM.info="beacon":"bam.nr-data.net","licenseKey":"b5bcf2eba4","applicationID":"4491457","transactionName":"YwFXZhRYVhAEVUZcX1pLYEAPFlkTFRhCXUA=","queueTime":0,"applicationTime":305,"ttGuid":"","agentToken":"","userAttributes":"","errorBeacon":"bam.nr-data.net","agent":"js-agent.newrelic.com\\/nr-593.min.js"\n\n'

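If you scroll through it, you will see that the Beautiful Soup version includes a lot of invisible text, namely the contents of the page's inline JavaScript and CSS. Not very pretty.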
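The invisible text comes from the contents of `<script>` and `<style>` elements, so a text extractor that skips those elements avoids it. The following is a minimal sketch, assuming Python 3 and only the standard library's `html.parser`. The names `SAMPLE`, `VisibleTextParser`, and `visible_text` are hypothetical, introduced here for illustration; they are not part of NLTK or Beautiful Soup, and the sample string is a made-up stand-in for a fetched page.

```python
from html.parser import HTMLParser

# Hypothetical sample page: a title, inline JavaScript/CSS, a comment, one paragraph.
SAMPLE = """<html><head><title>Fed Faces a Decision</title>
<script>var a = 1;</script><style>p { color: red; }</style></head>
<body><p>Rates may&nbsp;rise.</p><!-- hidden --></body></html>"""


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments are routed to handle_comment, so they never reach this method.
        if self._skip_depth == 0:
            self.parts.append(data)


def visible_text(html):
    """Return the page text with script/style contents dropped."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Collapse whitespace runs, including the non-breaking space from &nbsp;.
    return " ".join("".join(parser.parts).split())


print(visible_text(SAMPLE))
```

On the sample string, only the title and the paragraph text remain. Real pages are messier than this, so treat the sketch as a starting point rather than a replacement for a full HTML parser.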

【Discussion】:

【Answer 3】:

clean_html() and clean_url() were cute functions in NLTK, but they were dropped because BeautifulSoup does a better job of parsing markup languages; see https://github.com/nltk/nltk/commit/39a303e5ddc4cdb1a0b00a3be426239b1c24c8bb

Here is the BeautifulSoup documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

【Discussion】:
