by Neil Mitchell
TagSoup is a library for extracting information out of unstructured HTML code, sometimes known as tag-soup. The HTML does not have to be well formed, or render properly within any particular framework. This library is for situations where the author of the HTML is not cooperating with the person trying to extract the information, but is also not trying to hide the information.
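This document gives two particular examples, and two more may be found in the Example file from the darcs repository. The examples we give are:
- Obtaining the Haskell.org hit count
- Obtaining a list of Simon Peyton Jones' recent research papers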
This library was written without knowledge of the Java version of TagSoup. They have made a very different design decision: to ensure default attributes are present and to properly nest parsed tags. We do not do this - tags are merely a list devoid of nesting information.
Thanks to Mike Dodds for persuading me to write this up as a library. Thanks to many people for debugging and code contributions, including: Gleb Alexeev, Ketil Malde, Conrad Parker, Henning Thielemann.
There are two things that may go wrong with these examples:
Our goal is to develop a program that displays the Haskell.org hit count. This example covers all the basics in designing a basic web-scraping application.
We first need to find where the information is displayed, and in what format. Taking a look at the front web page, when not logged in, you may notice that there is no hit count. However, looking at the source shows us:
    <div class="printfooter">
    <p>Retrieved from "<a href="http://www.haskell.org/haskellwiki/Haskell">
    http://www.haskell.org/haskellwiki/Haskell</a>"</p>

    <p>This page has been accessed 615,165 times. This page was last modified 15:44, 15 March 2007. Recent content is available under <a href="/haskellwiki/HaskellWiki:Copyrights" title="HaskellWiki:Copyrights">a simple permissive license</a>.</p>
So we see that the hit count is available, but not shown. This leads us to rule 1:
Scrape from what the page returns, not what a browser renders, or what view-source gives.
Some web servers will serve different content depending on the user agent, some browsers will have scripting modify their displayed HTML, and some pages will display differently depending on your cookies. Before you work out how to scrape, first decide what the input to your program will be. There are three ways to get the page as it will appear to your program.
TagSoup provides a module Text.HTML.Download, which contains openURL.
    import Text.HTML.Download

    main = do
        src <- openURL "http://haskell.org/haskellwiki/Haskell"
        writeFile "temp.htm" src
Now open temp.htm, find the fragment of HTML containing the hit count, and examine it.
TagSoup installs both as a library and a program. The program contains all the examples mentioned on this page, along with a few other useful functions. In order to download a URL to a file:
$ tagsoup grab http://haskell.org/haskellwiki/Haskell > temp.htm
An alternative fragment of text for reading in the original web page, using the network library, can be written as:
    import qualified Data.ByteString.Lazy.Char8 as BS
    import Network.HTTP (rspBody)
    import Network.HTTP.UserAgent as UA

    main :: IO ()
    main = do
        rsp <- UA.get "http://haskell.org/haskellwiki/Haskell"
        let src = BS.unpack $ rspBody rsp
        writeFile "temp.htm" src
Now we examine both the fragment that contains our snippet of information, and the wider page. What does the fragment have that nothing else has? What algorithm would we use to obtain that particular element? How can we still return the element as the content changes? What if the design changes? But wait, before going any further:
Do not be robust to design changes, do not even consider the possibility when writing the code.
If the user changes their website, they will do so in unpredictable ways. They may move the page, they may put the information somewhere else, they may remove the information entirely. If you want something robust, talk to the site owner, or buy the data from someone. If you try to think about design changes, you will complicate your design, and it still won't work. It is better to write an extraction method quickly, and happily rewrite it when things change.
So now, let's consider the fragment from above. It is useful to find a tag which is unique just above your snippet - something with a nice "id" property, or a "class" - something which is unlikely to occur multiple times. In the above example, "printfooter" as the class seems perfect. We decide that to find the snippet, we will start at a "div" tag, with a "class" attribute with the value "printfooter".
    haskellHitCount = do
        tags <- liftM parseTags $ openURL "http://haskell.org/haskellwiki/Haskell"
        let count = fromFooter $ head $ sections (~== "<div class=printfooter>") tags
        putStrLn $ "haskell.org has been hit " ++ show count ++ " times"
Now we start writing the code! The first thing to do is open the required URL, then we parse the code into a list of Tags. We then apply the sections function, which returns all the lists whose first element matches the query. We use the (~==) operator to construct the query - in this case asking for the "div" we mentioned earlier. This (~==) operator is very different from standard equality: it allows additional attributes to be present but does not match them. We write "<div class=printfooter>" as syntactic sugar for TagOpen "div" [("class","printfooter")]. If we just wanted any open tag with the given class we could have written (~== TagOpen "" [("class","printfooter")]) and this would have matched. Any empty strings in the pattern are treated as wildcards.
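To make the matching rules concrete, here is a minimal base-only sketch of the behaviour described above. It does not use TagSoup itself: SimpleTag, sectionsSketch and matchesLoosely are hypothetical stand-ins written for illustration, and the wildcard handling follows the description in the text rather than the library's source.

```haskell
import Data.List (tails)

-- Hypothetical stand-in for a parsed tag (NOT the TagSoup Tag type).
data SimpleTag
  = Open String [(String, String)]
  | Text String
  | Close String
  deriving (Eq, Show)

-- Every suffix of the input whose first element satisfies the predicate.
sectionsSketch :: (a -> Bool) -> [a] -> [[a]]
sectionsSketch p xs = [s | s@(x : _) <- tails xs, p x]

-- Loose match: extra attributes on the tag are ignored, and an empty
-- string in the query acts as a wildcard (assumption based on the text).
matchesLoosely :: SimpleTag -> SimpleTag -> Bool
matchesLoosely (Open name attrs) (Open qName qAttrs) =
  (null qName || qName == name) && all present qAttrs
  where
    present (k, v) =
      any (\(k', v') -> (null k || k == k') && (null v || v == v')) attrs
matchesLoosely tag query = tag == query

main :: IO ()
main = do
  let tags =
        [ Open "body" []
        , Open "div" [("id", "footer"), ("class", "printfooter")]
        , Text "hits"
        , Close "div"
        , Close "body"
        ]
      exact = Open "div" [("class", "printfooter")]
      anyName = Open "" [("class", "printfooter")]
  -- Both queries find the single section that starts at the div.
  print (map length (sectionsSketch (`matchesLoosely` exact) tags))
  print (map length (sectionsSketch (`matchesLoosely` anyName) tags))
  -- Standard equality fails because of the extra "id" attribute.
  print (Open "div" [("id", "footer"), ("class", "printfooter")] == exact)
```

Each result of sectionsSketch is the remainder of the list starting at a matching element, which is why the later traversal can simply index into it.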
Once we have a list of all matching prefixes, we take the head - assuming that only one will match. Then we apply fromFooter which needs to perform the traversal from the "printfooter" attribute onwards to the actual hit count data.
Now we have a stream starting at the right place, we generally mangle the code using standard list operators:
    fromFooter x = read (filter isDigit num) :: Int
        where
            num = ss !! (i - 1)
            Just i = findIndex (== "times.") ss
            ss = words s
            TagText s = sections (~== "<p>") x !! 1 !! 1
This code finds s, the text inside the appropriate paragraph, by knowing that it's the second (!! 1) paragraph, and within that paragraph, it's the second tag - the actual text. We then split up the text using words, find the word "times." that follows the hit count, take the word just before it, and read all the digits we can find - filtering out the comma. I'm pretty sure this could be done better using regular expressions, and I invite a reader to submit improved code. This code may seem slightly messy, and indeed it is - often that is the nature of extracting information from a tag soup.
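As one possible tidy-up of the text-handling step, the sketch below does the same word-based search but returns Nothing instead of failing when the expected wording is missing. It works on plain text and so needs only base; hitCountFromText is a name invented here, not part of TagSoup.

```haskell
import Data.Char (isDigit)
import Text.Read (readMaybe)

-- Hypothetical helper: the number written just before the word "times.".
hitCountFromText :: String -> Maybe Int
hitCountFromText s = do
  let ws = words s
  -- Pair each word with its predecessor, then look up the one before "times.".
  num <- lookup "times." (zip (drop 1 ws) ws)
  -- Drop the thousands separator and parse what is left.
  readMaybe (filter isDigit num)

main :: IO ()
main = do
  print (hitCountFromText "This page has been accessed 615,165 times. This page was last modified 15:44, 15 March 2007.")
  print (hitCountFromText "No counter on this page")
```

The first call prints Just 615165 and the second prints Nothing, so a caller can report a changed page layout rather than crash on a failed pattern match.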
TagSoup is for extracting information where structure has been lost, use more structured information if it is available.
Our next very important task is to extract a list of all Simon Peyton Jones' recent research papers off his home page. The largest change to the previous example is that now we desire a list of papers, rather than just a single result.
As before we first start by writing a simple program that downloads the appropriate page, and look for common patterns. This time we want to look for all patterns which occur every time a paper is mentioned, but nowhere else. The other difference from last time is that previously we grabbed an automatically generated piece of information - this time the information is entered in a more freeform way by a human.
First we spot that the page helpfully has named anchors: there is a current work anchor, and after that is one for Haskell. We can extract all the information between them with a simple take/drop pair:
takeWhile (~/= "<a name=haskell>") $ drop 5 $ dropWhile (~/= "<a name=current>") tags
This code drops until you get to the "current" section, then takes until you get to the "haskell" section, ensuring we only look at the important bit of the page. Next we want to find all hyperlinks within this section:
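The same take/drop idiom can be tried on an ordinary list without downloading anything. This is a base-only sketch: between is a hypothetical helper, the marker strings are invented, and it drops only the start marker itself, whereas the code above uses drop 5 to skip the anchor and the tags that follow it.

```haskell
-- Hypothetical helper: the items strictly between the first element that
-- satisfies isStart and the next element that satisfies isEnd.
between :: (a -> Bool) -> (a -> Bool) -> [a] -> [a]
between isStart isEnd =
  takeWhile (not . isEnd) . drop 1 . dropWhile (not . isStart)

main :: IO ()
main =
  print $
    between (== "<current>") (== "<haskell>")
      ["intro", "<current>", "paper A", "paper B", "<haskell>", "paper C"]
```

This prints ["paper A","paper B"]. If the start marker never appears, dropWhile consumes the whole list and the result is empty.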
map f $ sections (~== "<a>") $ ...
Remember that the function to select all tags with name "a" could have been written as (~== TagOpen "a" []), or alternatively isTagOpenName "a". Afterwards we map each item with an f function. This function needs to take the tags starting at the link, and find the text inside the link.
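The three ways of selecting anchors can be compared offline on a small inline document. This sketch assumes the tagsoup package is installed and uses only names from Text.HTML.TagSoup; the HTML string is made up for the example.

```haskell
import Text.HTML.TagSoup

main :: IO ()
main = do
  -- A made-up fragment containing two links.
  let tags = parseTags "<p>See <a href=\"one.html\">one</a> and <a name=top>two</a></p>"
      viaString  = sections (~== "<a>") tags
      viaPattern = sections (~== TagOpen "a" []) tags
      viaName    = sections (isTagOpenName "a") tags
  -- Each selector should find the same number of sections.
  print (length viaString, length viaPattern, length viaName)
  -- Every section begins with the matching open tag itself.
  mapM_ (print . take 1) viaName
```

All three selectors ignore the attributes on the link, so which one to use is mostly a matter of taste.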
f = dequote . unwords . words . fromTagText . head . filter isTagText
Here the complexity of interfacing to human written markup comes through. Some of the links are in italic, some are not - the filter drops every tag that is not a text node, so head returns the first pure text node. The unwords . words deletes all multiple spaces, replaces tabs and newlines with spaces and trims the front and back - a neat trick when dealing with text which has spacing at the source code but not when displayed. The final thing to take account of is that some papers are given with quotes around the name, some are not - dequote will remove the quotes if they exist.
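Both text clean-up steps can be checked in isolation with base alone. In this sketch normaliseSpaces and dequoteSafe are names invented for illustration; dequoteSafe behaves like the dequote described above, with one extra assumption: a string consisting of a single quote character is returned unchanged rather than inspected with last.

```haskell
-- Collapse runs of whitespace to single spaces and trim both ends.
normaliseSpaces :: String -> String
normaliseSpaces = unwords . words

-- Strip one pair of surrounding double quotes, if both are present.
-- The (_ : _) pattern ensures 'last' is never applied to an empty list.
dequoteSafe :: String -> String
dequoteSafe ('"' : rest@(_ : _))
  | last rest == '"' = init rest
dequoteSafe s = s

main :: IO ()
main = do
  putStrLn (normaliseSpaces "  A   paper\n\ttitle ")
  putStrLn (dequoteSafe "\"Quoted title\"")
  putStrLn (dequoteSafe "Plain title")
```

The three lines printed are "A paper title", "Quoted title" and "Plain title", each without surrounding quotes.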
For completeness, we now present the entire example:
    import Control.Monad (liftM)
    import Text.HTML.Download (openURL)
    import Text.HTML.TagSoup

    spjPapers :: IO ()
    spjPapers = do
            tags <- liftM parseTags $ openURL "http://research.microsoft.com/~simonpj/"
            -- keep only the tags between the "current" and "haskell" anchors
            let links = map f $ sections (~== "<a>") $
                        takeWhile (~/= "<a name=haskell>") $
                        drop 5 $ dropWhile (~/= "<a name=current>") tags
            putStr $ unlines links
        where
            -- first text node of the link, with whitespace normalised
            f :: [Tag] -> String
            f = dequote . unwords . words . fromTagText . head . filter isTagText

            -- remove surrounding quotes, if present
            dequote ('\"':xs) | last xs == '\"' = init xs
            dequote x = x
Several more examples are given in the Example file, including obtaining the (short) list of papers from my site, getting the current time and a basic XML validator. All can be invoked using the tagsoup executable program. All use very much the same style as presented here - writing screen scrapers follows a standard pattern. We present the code from two for enjoyment only.
    ndmPapers :: IO ()
    ndmPapers = do
            tags <- liftM parseTags $ openURL "http://www-users.cs.york.ac.uk/~ndm/downloads/"
            let papers = map f $ sections (~== "<li class=paper>") tags
            putStr $ unlines papers
        where
            f :: [Tag] -> String
            f xs = fromTagText (xs !! 2)

    currentTime :: IO ()
    currentTime = do
            tags <- liftM parseTags $ openURL "http://www.timeanddate.com/worldclock/city.html?n=136"
            let time = fromTagText (dropWhile (~/= "<strong id=ct>") tags !! 1)
            putStrLn time