tagsoup-0.7: Parsing and extracting information from (possibly malformed) HTML documents
|
Text.HTML.TagSoup | Portability: portable | Stability: unstable | Maintainer: http://www.cs.york.ac.uk/~ndm/
|
Description |
This module is for extracting information out of unstructured HTML code,
sometimes known as tag-soup. This is for situations where the author of
the HTML is not cooperating with the person trying to extract the information,
but is also not trying to hide the information.
The standard practice is to parse a String into a list of Tags using parseTags,
then operate on that list to extract the necessary information.
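The parse-then-extract workflow described above can be sketched as follows. This is a minimal example rather than a prescribed usage: linkTargets and the sample input are hypothetical names introduced for illustration, and only functions listed on this page are called.

```haskell
import Text.HTML.TagSoup

-- Collect the href attribute of every opening <a> tag.
-- fromAttrib returns "" when an <a> tag has no href attribute.
linkTargets :: String -> [String]
linkTargets html =
  [ fromAttrib "href" tag
  | tag <- parseTags html
  , isTagOpenName "a" tag
  ]

main :: IO ()
main = mapM_ putStrLn (linkTargets sample)
  where
    -- Hypothetical input; the <p> is never closed, which is typical tag-soup.
    sample = "<p>See <a href=\"one.html\">one</a> and <a href=\"two.html\">two</a>"
```

Filtering with isTagOpenName before calling fromAttrib matters, because fromAttrib crashes on anything that is not a TagOpen.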
|
|
Data structures and parsing
|
|
data Tag |
An HTML element; a document is represented as a list of these, [Tag].
There is no requirement for TagOpen and TagClose to match.
Constructors
| TagOpen String [Attribute] | An open tag with Attributes in their original order
| TagClose String | A closing tag
| TagText String | A text node, guaranteed not to be the empty string
| TagComment String | A comment
| TagWarning String | Meta: marks a syntax error in the input file
| TagPosition !Row !Column | Meta: the position of a parsed element
|
|
|
type Attribute = (String, String) |
An HTML attribute, stored as a (name, value) pair: id="name" generates ("id","name")
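As an illustration of the data structures above, the sketch below builds a small document by hand and matches on the constructors. The names doc and describe are hypothetical, and the sample tags are invented for the example. Type signatures for the Tag-valued bindings are left to inference.

```haskell
import Text.HTML.TagSoup (Tag(..))

-- A hand-built document. The <b> is opened but never closed, which the
-- Tag representation permits.
doc = [ TagOpen "p" [("id", "intro"), ("class", "lead")]
      , TagText "Hello "
      , TagOpen "b" []
      , TagText "world"
      , TagClose "p"
      ]

-- Describe one Tag by matching on its constructor.
-- lookup works because an Attribute is a (name, value) pair.
describe (TagOpen name attrs) =
  "open " ++ name ++ maybe "" (" #" ++) (lookup "id" attrs)
describe (TagClose name) = "close " ++ name
describe (TagText txt)   = "text " ++ show txt
describe _               = "other"

main :: IO ()
main = mapM_ (putStrLn . describe) doc
```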
|
|
module Text.HTML.TagSoup.Parser |
|
canonicalizeTags :: [Tag] -> [Tag] |
Turns all tag names and attributes to lower case and
converts DOCTYPE to upper case.
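A small sketch of why canonicalisation is useful: after it, name tests can be written in lower case regardless of how the source HTML was capitalised. The input list shouty is a hypothetical example, not taken from the library.

```haskell
import Text.HTML.TagSoup

-- Hand-built tags with upper-case names (hypothetical input).
shouty = [TagOpen "DIV" [("CLASS", "note")], TagText "Hi", TagClose "DIV"]

main :: IO ()
main = do
  let tidy = canonicalizeTags shouty
  -- Lower-case name tests fail on the raw tags but succeed afterwards.
  print (any (isTagOpenName "div") shouty)
  print (any (isTagOpenName "div") tidy)
  print (any (isTagCloseName "div") tidy)
```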
|
|
Tag identification
|
|
isTagOpen :: Tag -> Bool |
Test if a Tag is a TagOpen
|
|
isTagClose :: Tag -> Bool |
Test if a Tag is a TagClose
|
|
isTagText :: Tag -> Bool |
Test if a Tag is a TagText
|
|
isTagWarning :: Tag -> Bool |
Test if a Tag is a TagWarning
|
|
isTagOpenName :: String -> Tag -> Bool |
Returns True if the Tag is a TagOpen and its name matches the given name
|
|
isTagCloseName :: String -> Tag -> Bool |
Returns True if the Tag is a TagClose and its name matches the given name
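The identification predicates compose naturally with filter. The following sketch counts tags of each kind in a hand-built list; countWhere and the sample tags are hypothetical helpers for the example.

```haskell
import Text.HTML.TagSoup

-- A small hand-built list: <ul><li>one</li><li>two</li></ul>
tags = [ TagOpen "ul" [], TagOpen "li" [], TagText "one", TagClose "li"
       , TagOpen "li" [], TagText "two", TagClose "li", TagClose "ul" ]

-- Count the elements of a list that satisfy a predicate.
countWhere :: (a -> Bool) -> [a] -> Int
countWhere p = length . filter p

main :: IO ()
main = do
  print (countWhere isTagOpen tags)              -- 3
  print (countWhere isTagText tags)              -- 2
  print (countWhere (isTagOpenName "li") tags)   -- 2
  print (countWhere (isTagCloseName "ul") tags)  -- 1
```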
|
|
Extraction
|
|
fromTagText :: Tag -> String |
Extract the string from within a TagText; crashes if the Tag is not a TagText
|
|
fromAttrib :: String -> Tag -> String |
Extract the value of the named attribute; crashes if the Tag is not a TagOpen.
Returns "" if the attribute is not present.
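Because fromAttrib crashes on anything other than a TagOpen, one possible approach is to guard it with isTagOpen. The wrapper safeAttrib below is a hypothetical helper, not part of the library.

```haskell
import Text.HTML.TagSoup

-- Hypothetical wrapper: Nothing for non-TagOpen values instead of a crash.
safeAttrib name tag
  | isTagOpen tag = Just (fromAttrib name tag)
  | otherwise     = Nothing

main :: IO ()
main = do
  let img = TagOpen "img" [("src", "logo.png")]
  print (safeAttrib "src" img)             -- Just "logo.png"
  print (safeAttrib "alt" img)             -- Just "" (attribute absent)
  print (safeAttrib "src" (TagText "hi"))  -- Nothing
```

Note that an absent attribute and an attribute with an empty value both come back as "", so this wrapper cannot tell them apart.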
|
|
maybeTagText :: Tag -> Maybe String |
Extract the string from within a TagText; returns Nothing for any other Tag
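The Maybe-returning variant pairs well with mapMaybe from Data.Maybe, which avoids the crash of fromTagText. A minimal sketch, with textNodes as a hypothetical helper name:

```haskell
import Data.Maybe (mapMaybe)
import Text.HTML.TagSoup

-- Keep only the text nodes; every other constructor yields Nothing
-- and is dropped by mapMaybe.
textNodes tags = mapMaybe maybeTagText tags

main :: IO ()
main = print (textNodes sample)  -- ["a","b"]
  where
    sample = [ TagOpen "p" [], TagText "a", TagComment " note "
             , TagText "b", TagClose "p" ]
```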
|
|
maybeTagWarning :: Tag -> Maybe String |
Extract the string from within a TagWarning; returns Nothing for any other Tag
|
|
innerText :: [Tag] -> String |
Extract all text content from tags (similar to Verbatim found in HaXml)
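For instance, applied to a hand-built fragment, innerText concatenates the text nodes and ignores the markup. The sample list below is hypothetical.

```haskell
import Text.HTML.TagSoup

-- Hand-built equivalent of: <p>Hello <b>tag</b> soup</p>
fragment = [ TagOpen "p" [], TagText "Hello ", TagOpen "b" [], TagText "tag"
           , TagClose "b", TagText " soup", TagClose "p" ]

main :: IO ()
main = putStrLn (innerText fragment)  -- Hello tag soup
```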
|
|
Utility
|
|
sections :: (a -> Bool) -> [a] -> [[a]] |
This function takes a list, and returns all suffixes whose
first item matches the predicate.
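The behaviour described above can be illustrated with a stand-in written using only base. sectionsSketch is a hypothetical name, written from the description on this page; the library's own definition may differ.

```haskell
import Data.List (tails)

-- Hypothetical stand-in for sections: every non-empty suffix of the list
-- whose first element satisfies the predicate.
sectionsSketch :: (a -> Bool) -> [a] -> [[a]]
sectionsSketch p xs = [ s | s@(x : _) <- tails xs, p x ]

main :: IO ()
main = print (sectionsSketch even [1, 2, 3, 4 :: Int])
-- prints [[2,3,4],[4]]
```

Each result runs to the end of the input, so the suffixes overlap: the element 4 appears in both results here.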
|
|
partitions :: (a -> Bool) -> [a] -> [[a]] |
This function is similar to sections, but splits the list
so no element appears in any two partitions.
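A stand-in for the splitting behaviour, again using only base. partitionsSketch is a hypothetical name; the sketch assumes that elements before the first match are dropped, which the description above does not state explicitly.

```haskell
-- Hypothetical stand-in for partitions: each chunk starts at an element
-- satisfying the predicate and runs up to (not including) the next one.
-- Assumption: elements before the first match are discarded.
partitionsSketch :: (a -> Bool) -> [a] -> [[a]]
partitionsSketch p xs = case dropWhile (not . p) xs of
  []       -> []
  (y : ys) -> let (rest, more) = break p ys
              in (y : rest) : partitionsSketch p more

main :: IO ()
main = print (partitionsSketch even [1, 2, 3, 4, 5 :: Int])
-- prints [[2,3],[4,5]]
```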
|
|
Combinators
|
|
class TagRep a |
A class that allows either Strings or Tags to be used as match patterns
|
|
|
class IsChar a |
|
|
(~==) :: TagRep t => Tag -> t -> Bool |
Performs an inexact match. The first argument is the Tag being tested and
the second is the pattern to match it against.
A blank string in the pattern is considered to match anything.
For example:
(TagText "test" ~== TagText "" ) == True
(TagText "test" ~== TagText "test") == True
(TagText "test" ~== TagText "soup") == False
For TagOpen, the pattern on the right may omit attributes that are present on the tag.
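The attribute rule for TagOpen can be illustrated as follows. The tag anchor is a hypothetical example value, and the String pattern relies on the TagRep class described above accepting Strings as well as Tags.

```haskell
import Text.HTML.TagSoup

-- The tag under test carries two attributes.
anchor = TagOpen "a" [("href", "index.html"), ("class", "nav")]

main :: IO ()
main = do
  -- The pattern lists only one of the two attributes: still a match.
  print (anchor ~== TagOpen "a" [("href", "index.html")])  -- True
  -- The pattern asks for an attribute the tag does not have.
  print (anchor ~== TagOpen "a" [("id", "top")])           -- False
  -- A String pattern, converted through TagRep.
  print (anchor ~== "<a>")                                 -- True
  -- The negated operator.
  print (anchor ~/= TagClose "a")                          -- True
```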
|
|
(~/=) :: TagRep t => Tag -> t -> Bool |
Negation of ~==
|
|