Can you provide examples of parsing HTML?

Question 1

How can you parse HTML with a variety of languages and parsing libraries?

When answering:

Individual comments will be linked to in answers to questions about parsing HTML with regular expressions, as a way of showing the right way to do things.

For the sake of consistency, I ask that the example parse an HTML file for the href in anchor tags. To make this question easy to search, I ask that you follow this format:

Language: [language name]

Library: [library name]

[example code]

Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:

Purpose: [what the parse does]

Question 2

Language: JavaScript
Library: jQuery

$.each($('a[href]'), function(){
    console.debug(this.href);
});

(Using firebug console.debug for output...)

And to load any HTML page:

$.get('http://stackoverflow.com/', function(page){
     $(page).find('a[href]').each(function(){
        console.debug(this.href);
    });
});

I used the chained .each() method for this one; I think it is cleaner when methods are chained.

Question 3

Language: C#
Library: HtmlAgilityPack

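using System;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");

        // SelectNodes returns null when nothing matches the XPath
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        if (nodes == null) return;

        foreach (var node in nodes)
        {
            // Print the href attribute rather than the link's inner HTML
            Console.WriteLine(node.GetAttributeValue("href", ""));
        }
    }
}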

Question 4

Language: Python
Library: BeautifulSoup

from BeautifulSoup import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True) # find <a> with a defined href attribute
print links

Output:

[<a href="http://foo.com">foo</a>,
 <a href="http://bar.com">bar</a>,
 <a href="http://baz.com">baz</a>]

Also possible:

for link in links:
    print link['href']

Output:

http://foo.com
http://bar.com
http://baz.com

Question 5

Language: Perl
Library: pQuery

use strict;
use warnings;
use pQuery;

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

pQuery( $html )->find( 'a' )->each(
    sub {  
        my $at = $_->getAttribute( 'href' ); 
        print "$at\n" if defined $at;
    }
);

Question 6

Language: Shell
Library: lynx (well, it is not a library, but in the shell every program is a kind of library)

lynx -dump -listonly http://news.google.com/

Question 7

Language: Ruby
Library: Hpricot

#!/usr/bin/ruby

require 'hpricot'

html = '<html><body>'
['foo', 'bar', 'baz'].each {|link| html += "<a href=\"http://#{link}.com\">#{link}</a>" }
html += '</body></html>'

doc = Hpricot(html)
doc.search('//a').each {|elm| puts elm.attributes['href'] }

Question 8

Language: Python
Library: HTMLParser

#!/usr/bin/python

from HTMLParser import HTMLParser

class FindLinks(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print at['href']


find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)

Question 9

Language: Perl
Library: HTML::Parser

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

my $find_links = HTML::Parser->new(
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'a' and exists $attr->{href}) {
                print "$attr->{href}\n";
            }
        }, 
        "tag, attr"
    ]
);

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

$find_links->parse($html);

Question 10

Language: Perl
Library: HTML::LinkExtor

The beauty of Perl is that you have modules for very specific tasks, such as link extraction.

Whole program:

#!/usr/bin/perl -w
use strict;

use HTML::LinkExtor;
use LWP::Simple;

my $url     = 'http://www.google.com/';
my $content = get( $url );

my $p       = HTML::LinkExtor->new( \&process_link, $url, );
$p->parse( $content );

exit;

sub process_link {
    my ( $tag, %attr ) = @_;

    return unless $tag eq 'a';
    return unless defined $attr{ 'href' };

    print "- $attr{'href'}\n";
    return;
}

Explanation:

use strict - turns on "strict" mode - eases potential debugging, not fully relevant to the example
use HTML::LinkExtor - loads the module of interest
use LWP::Simple - just a simple way to fetch some HTML for tests
my $url = 'http://www.google.com/' - the page we will be extracting URLs from
my $content = get( $url ) - fetches the page's HTML
my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates a LinkExtor object, giving it a reference to the function that will be used as a callback for every URL, and $url to use as the BASEURL for relative URLs
$p->parse( $content ) - pretty obvious, I guess
exit - end of the program
sub process_link - start of the function process_link
my ( $tag, %attr ) - gets the arguments, which are the tag name and its attributes
return unless $tag eq 'a' - skips processing if the tag is anything other than <a>
return unless defined $attr{'href'} - skips processing if the <a> tag has no href attribute
print "- $attr{'href'}\n"; - pretty obvious, I guess :)
return; - finishes the function

That's all.
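
The explanation above mentions that $url is passed as the base URL so that relative links can be resolved. As a rough illustration of that idea in another language, here is a minimal Python 3 sketch using only the standard library (html.parser and urllib.parse.urljoin). The names LinkCollector, extract_links, and the sample HTML are hypothetical and are not part of the Perl answer.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Skip anything that is not an <a> tag
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip <a> tags without an href attribute
        if href is not None:
            # Relative URLs are resolved against the base URL
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample input: one relative link, one anchor without href,
# and one absolute link
sample = (
    '<a href="/about">About</a>'
    '<a name="top">Top</a>'
    '<a href="http://foo.com/">foo</a>'
)

for link in extract_links(sample, "http://www.example.com/index.html"):
    print("- " + link)
```

In this sketch the relative link /about is joined with the base URL, the anchor without an href is skipped, and the absolute link is passed through unchanged.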

Question 11

Language: Ruby
Library: Nokogiri

#!/usr/bin/env ruby
require 'nokogiri'
require 'open-uri'

document = Nokogiri::HTML(open("http://google.com"))
document.css("html head title").first.content
=> "Google"
document.xpath("//title").first.content
=> "Google"

Question 12

Language: Common Lisp
Library: Closure Html, Closure Xml, CL-WHO

(Shown using the DOM API, without using the XPATH or STP API.)

(defvar *html*
  (who:with-html-output-to-string (stream)
    (:html
     (:body (loop
               for site in (list "foo" "bar" "baz")
               do (who:htm (:a :href (format nil "http://~A.com/" site))))))))

(defvar *dom*
  (chtml:parse *html* (cxml-dom:make-dom-builder)))

(loop
   for tag across (dom:get-elements-by-tag-name *dom* "a")
   collect (dom:get-attribute tag "href"))
=> 
("http://foo.com/" "http://bar.com/" "http://baz.com/")

Question 13

Language: Clojure
Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)

Selector expression:

(def test-select
     (html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))

Now we can do the following at the REPL (I've added line breaks to the test-select output):

user> test-select
({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
 {:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
 {:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
user> (map #(get-in % [:attrs :href]) test-select)
("http://foo.com/" "http://bar.com/" "http://baz.com/")

You'll need the following to try it out:

Preamble:

(require '[net.cgrand.enlive-html :as html])

Test HTML:

(def test-html
     (apply str (concat ["<html><body>"]
                        (for [link ["foo" "bar" "baz"]]
                          (str "<a href=\"http://" link ".com/\">" link "</a>"))
                        ["</body></html>"])))

Question 14

Language: Perl
Library: XML::Twig

#!/usr/bin/perl
use strict;
use warnings;
use Encode ':all';

use LWP::Simple;
use XML::Twig;

#my $url = 'http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
my $url = 'http://www.google.com';
my $content = get($url);
die "Couldn't fetch!" unless defined $content;

my $twig = XML::Twig->new();
$twig->parse_html($content);

my @hrefs = map {
    $_->att('href');
} $twig->get_xpath('//*[@href]');

print "$_\n" for @hrefs;

Caveat: You can get wide-character errors with pages like this one (changing the URL to the commented-out one will produce this error), but the HTML::Parser solution above does not share this problem.

Question 15

Language: Perl
Library: HTML::Parser
Purpose: How can I remove unused, nested HTML span tags with a Perl regex?

Question 16

Language: Java
Libraries: XOM, TagSoup

I've intentionally included malformed and inconsistent XML in this sample.

import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.Node;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final Parser parser = new Parser();
        parser.setFeature(Parser.namespacesFeature, false);
        final Builder builder = new Builder(parser);
        final Document document = builder.build("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>", null);
        final Element root = document.getRootElement();
        final Nodes links = root.query("//a[@href]");
        for (int linkNumber = 0; linkNumber < links.size(); ++linkNumber) {
            final Node node = links.get(linkNumber);
            System.out.println(((Element) node).getAttributeValue("href"));
        }
    }
}

TagSoup adds an XML namespace referencing XHTML to the document by default. I've chosen to suppress that in this sample. Using the default behavior would require the call to root.query to include a namespace, like so:

root.query("//xhtml:a[@href]", new nu.xom.XPathContext("xhtml", root.getNamespaceURI()));
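
The namespace point can be seen in isolation with a small Python 3 sketch. It uses the standard-library xml.etree.ElementTree on a well-formed XHTML string, which is an assumption made for illustration: unlike TagSoup, ElementTree does not repair malformed markup. The sample document and variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical, well-formed sample document in the XHTML namespace
doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<a href="http://google.com">google</a>'
    '<a name="nothing">nothing</a>'
    '</body></html>'
)
root = ET.fromstring(doc)

# An unqualified query finds nothing, because every element
# lives in the XHTML namespace
unqualified = root.findall(".//a[@href]")

# Binding a prefix to the namespace makes the same query match
qualified = root.findall(".//xhtml:a[@href]", {"xhtml": XHTML_NS})

print(len(unqualified))
print([a.get("href") for a in qualified])
```

The prefix name (xhtml here) is arbitrary; only the namespace URI it is bound to has to match the document.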

Question 17

Language: C#
Library: System.XML (standard .NET)

using System.Collections.Generic;
using System.Xml;

public static void Main(string[] args)
{
    List<string> matches = new List<string>();

    XmlDocument xd = new XmlDocument();
    xd.LoadXml("<html>...</html>");

    FindHrefs(xd.FirstChild, matches);
}

static void FindHrefs(XmlNode xn, List<string> matches)
{
    if (xn.Attributes != null && xn.Attributes["href"] != null)
        matches.Add(xn.Attributes["href"].InnerXml);

    foreach (XmlNode child in xn.ChildNodes)
        FindHrefs(child, matches);
}

Question 18

Language: PHP
Library: SimpleXML (and DOM)

<?php
$page = new DOMDocument();
$page->strictErrorChecking = false;
$page->loadHTMLFile('http://stackoverflow.com/questions/773340');
$xml = simplexml_import_dom($page);

$links = $xml->xpath('//a[@href]');
foreach($links as $link)
    echo $link['href']."\n";

Question 19

Language: JavaScript
Library: DOM

var links = document.links;
for(var i in links){
    var href = links[i].href;
    if(href != null) console.debug(href);
}

(Using firebug console.debug for output...)

Question 20

Language: Racket

Library: (planet ashinn/html-parser:1) and (planet clements/sxml2:1)

(require net/url
         (planet ashinn/html-parser:1)
         (planet clements/sxml2:1))

(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->sxml))
(define links ((sxpath "//a/@href/text()") doc))

The same example, using packages from the new package system (html-parsing and sxml):

(require net/url
         html-parsing
         sxml)

(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->xexp))
(define links ((sxpath "//a/@href/text()") doc))

Note: Install the required packages with 'raco' from a command line using:

raco pkg install html-parsing

and:

raco pkg install sxml

Question 21

Language: Python
Library: lxml.html

import lxml.html

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

tree = lxml.html.document_fromstring(html)
for element, attribute, link, pos in tree.iterlinks():
    if attribute == "href":
        print link

lxml also has a CSS selector class for traversing the DOM, which can make using it very similar to using jQuery:

for a in tree.cssselect('a[href]'):
    print a.get('href')

Question 22

Language: Objective-C
Library: libxml2 + Matt Gallagher's libxml2 wrappers + Ben Copsey's ASIHTTPRequest

ASIHTTPRequest *request = [[ASIHTTPRequest alloc] initWithURL:[NSURL URLWithString:@"http://stackoverflow.com/questions/773340"]];
[request start];
NSError *error = [request error];
if (!error) {
    NSData *response = [request responseData];
    NSLog(@"Data: %@", [[self query:@"//a[@href]" withResponse:response] description]);
    [request release];
}
else 
    @throw [NSException exceptionWithName:@"kMyHTTPRequestFailed" reason:@"Request failed!" userInfo:nil];

...

- (id) query:(NSString *)xpathQuery withResponse:(NSData *)resp {
    NSArray *nodes = PerformHTMLXPathQuery(resp, xpathQuery);
    if (nodes != nil)
        return nodes;
    return nil;
}

Question 23

Language: Perl
Library: HTML::TreeBuilder

use strict;
use HTML::TreeBuilder;
use LWP::Simple;

my $content = get 'http://www.stackoverflow.com';
my $document = HTML::TreeBuilder->new->parse($content)->eof;

for my $a ($document->find('a')) {
    print $a->attr('href'), "\n" if $a->attr('href');
}

Question 24

Language: Python
Library: HTQL

import htql; 

page="<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>";
query="<a>:href,tx";

for url, text in htql.HTQL(page, query): 
    print url, text;

Simple and intuitive.

Question 25

Language: Ruby
Library: Nokogiri

#!/usr/bin/env ruby

require "nokogiri"
require "open-uri"

doc = Nokogiri::HTML(open('http://www.example.com'))
hrefs = doc.search('a').map{ |n| n['href'] }

puts hrefs

Which outputs:

/
/domains/
/numbers/
/protocols/
/about/
/go/rfc2606
/about/
/about/presentations/
/about/performance/
/reports/
/domains/
/domains/root/
/domains/int/
/domains/arpa/
/domains/idn-tables/
/protocols/
/numbers/
/abuse/
http://www.icann.org/
mailto:iana@iana.org?subject=General%20website%20feedback

This is a minor variation on the one above, producing output that can be used in a report. I only return the first and last elements in the list of hrefs:

#!/usr/bin/env ruby

require "nokogiri"
require "open-uri"

doc = Nokogiri::HTML(open('http://nokogiri.org'))
hrefs = doc.search('a[href]').map{ |n| n['href'] }

puts hrefs
  .each_with_index                     # add an array index
  .minmax{ |a,b| a.last <=> b.last }   # find the first and last element
  .map{ |h,i| '%3d %s' % [1 + i, h ] } # format the output

  1 http://github.com/tenderlove/nokogiri
100 http://yokolet.blogspot.com

Question 26

Language: Java
Library: jsoup

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HtmlTest {
    // Jsoup.parse(String) declares no checked exceptions, so no throws clause is needed
    public static void main(final String[] args) {
        final Document document = Jsoup.parse("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>");
        final Elements links = document.select("a[href]");
        for (final Element element : links) {
            System.out.println(element.attr("href"));
        }
    }
}

Question 27

Language: PHP
Library: DOM

<?php
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTMLFile('http://stackoverflow.com/questions/773340');
$xpath = new DOMXpath($doc);

$links = $xpath->query('//a[@href]');
for ($i = 0; $i < $links->length; $i++)
    echo $links->item($i)->getAttribute('href'), "\n";

Sometimes it is useful to put the @ symbol before $doc->loadHTMLFile to suppress warnings about invalid HTML during parsing.

Question 28

Using phantomjs, save this file as extract-links.js:

var page = new WebPage(),
    url = 'http://www.udacity.com';

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var results = page.evaluate(function() {
            var list = document.querySelectorAll('a'), links = [], i;
            for (i = 0; i < list.length; i++) {
                links.push(list[i].href);
            }
            return links;
        });
        console.log(results.join('\n'));
    }
    phantom.exit();
});

Run:

$ ../path/to/bin/phantomjs extract-links.js

Question 29

Language: Coldfusion 9.0.1+

Library: jSoup

<cfscript>
function parseURL(required string url){
var res = [];
var javaLoader = createObject("javaloader.JavaLoader").init([expandPath("./jsoup-1.7.3.jar")]);
var jSoupClass = javaLoader.create("org.jsoup.Jsoup");
//var dom = jSoupClass.parse(html); // if you already have some html to parse.
var dom = jSoupClass.connect( arguments.url ).get();
var links = dom.select("a");
for(var a=1;a LTE arrayLen(links);a++){ // LTE so the last link is not skipped
    var s={};s.href= links[a].attr('href'); s.text= links[a].text(); 
    if(s.href contains "http://" || s.href contains "https://") arrayAppend(res,s); 
}
return res; 
}   

//writeoutput(writedump(parseURL(url)));
</cfscript>
<cfdump var="#parseURL("http://stackoverflow.com/questions/773340/can-you-provide-examples-of-parsing-html")#">

Returns an array of structures. Each struct contains an HREF entry and a TEXT entry.
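
For comparison, the same result shape (a list of records, each with an href and the link text, keeping only links that contain http:// or https://) can be sketched in Python 3 with the standard-library html.parser. This is only an illustration under those assumptions: the names AnchorCollector and parse_links and the sample HTML are hypothetical, and the sketch parses a string instead of fetching a URL.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Builds a list of {"href": ..., "text": ...} dicts, one per <a> tag."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._current = None  # record for the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"href": dict(attrs).get("href") or "", "text": ""}

    def handle_data(self, data):
        # Accumulate text, including text nested in child elements
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            href = self._current["href"]
            # Mirror the filter in the answer: keep http:// and https:// links only
            if "http://" in href or "https://" in href:
                self.results.append(self._current)
            self._current = None


def parse_links(html):
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.results


# Hypothetical sample input
sample = (
    '<a href="http://foo.com">foo</a>'
    '<a href="/local">local</a>'
    '<a href="https://bar.com"><b>bar</b></a>'
)

for record in parse_links(sample):
    print(record["href"], record["text"])
```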

Question 30

Language: JavaScript / Node.js

Library: Request and Cheerio

var request = require('request');
var cheerio = require('cheerio');

var url = "https://news.ycombinator.com/";
request(url, function (error, response, html) {
    if (!error && response.statusCode == 200) {
        var $ = cheerio.load(html);
        var anchorTags = $('a');

        anchorTags.each(function(i,element){
            console.log(element["attribs"]["href"]);
        });
    }
});

The Request library downloads the HTML document, and Cheerio lets you use jQuery-style CSS selectors to target elements in the HTML document.