This module defines a class HTMLParser which serves as the basis for parsing text files formatted in HTML (HyperText Markup Language) and XHTML. Unlike the parser in htmllib, this parser is not based on the SGML parser in sgmllib.
An HTMLParser instance is fed HTML data and calls handler methods when tags begin and end. The HTMLParser class is meant to be subclassed, with the handler methods overridden to provide the desired behavior.
Unlike the parser in htmllib, this parser does not check that end tags match start tags or call the end-tag handler for elements which are closed implicitly by closing an outer element.
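As a rough illustration of the behavior described above, the following sketch subclasses HTMLParser and records which tag handlers fire. The class name TagLogger and the sample markup are hypothetical; the import fallback assumes the module is available as html.parser on Python 3 and as HTMLParser on Python 2.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 location


class TagLogger(HTMLParser):
    """Hypothetical subclass that records start-tag and end-tag events."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = TagLogger()
# Both <p> elements are closed only implicitly by </div>.
parser.feed("<div><p>one<p>two</div>")
parser.close()

for event in parser.events:
    print(event)
```

Neither p element produces an end-tag event here, because the parser reports only the end tags that actually appear in the input.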
HTMLParser instances have the following methods:
The tag argument is the name of the tag converted to lower case. The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag's <> brackets. The name is translated to lower case, and double quotes and backslashes in the value are interpreted. For instance, for the tag <A HREF="http://www.cwi.nl/">, this method would be called as "handle_starttag('a', [('href', 'http://www.cwi.nl/')])".
<a .../>). This method may be overridden by subclasses which require this particular lexical information; the default implementation simply calls handle_starttag() and handle_endtag().
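The sketch below contrasts the default behavior with an override. It assumes the method described here is handle_startendtag(), the name the standard library uses for the empty-element handler; the class names DefaultEmpty and ExplicitEmpty are hypothetical, as is the sample input.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 location


class DefaultEmpty(HTMLParser):
    """Hypothetical subclass that relies on the default empty-tag handling."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class ExplicitEmpty(DefaultEmpty):
    """Hypothetical subclass that treats empty tags as a distinct event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


default_parser = DefaultEmpty()
default_parser.feed("<br/>")
default_parser.close()

explicit_parser = ExplicitEmpty()
explicit_parser.feed("<br/>")
explicit_parser.close()

print(default_parser.events)
print(explicit_parser.events)
```

Without the override, the empty tag is reported as a start tag immediately followed by an end tag; with it, the subclass can tell the XHTML-style form apart.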
'text'. It is intended to be overridden by a derived class; the base class implementation does nothing.
<!...> markup. It is intended to be overridden by a derived class; the base class implementation does nothing.
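The method names for the last two descriptions are missing from the text above. The following sketch assumes they are handle_comment() and handle_decl(), the names the standard library uses for the comment and declaration handlers. The class name MarkupRecorder and the sample input are hypothetical.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 location


class MarkupRecorder(HTMLParser):
    """Hypothetical subclass that records comments and declarations."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []
        self.declarations = []

    def handle_comment(self, data):
        # Receives the comment text without its delimiters.
        self.comments.append(data)

    def handle_decl(self, decl):
        # Receives the contents of the declaration inside <!...>.
        self.declarations.append(decl)


markup = MarkupRecorder()
markup.feed("<!DOCTYPE html><!--text-->")
markup.close()

print(markup.declarations)
print(markup.comments)
```

Because the base class implementations do nothing, comments and declarations are silently dropped unless a subclass overrides these handlers.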