XPath-Based Information Extraction
Joachim Nielandt
- Extraction of data from web pages
- Usually not easy to do
- Present the data in an orderly fashion
- So it can be used for other purposes
This thesis addresses three problems:
- Can a data extractor be built, using user examples and XPath?
- Is it possible to increase the quality using context?
- Can automation be built into the system?
A web page:
- Is meant to be viewed by human users
- Contains a lot of structure
- Allows for interaction
What does it take to get a web page to a user?
The server typically serves HTML content
- Will be processed by the browser
- Looks like a bunch of code? It is a bunch of code
- Can be very pretty
Just like unicorns are very pretty horses
This piece of the IMDB website...
Looks like this as HTML code.
HTML code is built using tags
- <div> → opens the tag
- </div> → closes the tag
- <div>anything in between is content</div>
- <and attributes="have values">
The browser downloads the HTML and it...
- builds an internal DOM model,
- allows for interaction, modifications, ...
- produces a visual page for the user.
Eventually this is about me, right?
- I love movies
- I want to build my own cinema
- I need to track all the movies I own
- IMDB has the information, but I cannot easily use it
Impractical to get all that information manually
- There’s a lot of IMDB content: 4.1 million titles
- What do you do with new movies?
- What about errors corrected by IMDB?
- Let's make it easier!
- Have a program do the heavy lifting
- Ask the user for minimal input: a couple of examples
Assume: each movie page shows its stars in the same spot
The user gives two examples, Jeff and John...
... and the goal is to automatically find Julianne!
Ideally, this process can then be performed on other movies' pages too
Let's look at the underlying DOM structure
- We need to build a rule: “Get all stars”
- Requires a way to extract elements from DOM
- XPath is a query language
- Can be used to retrieve elements from a structured document
- ... such as the DOM model of a web page
- Makes use of the node names in the DOM (span, a, h4 ...)
An XPath example...
This XPath selects Jeff Bridges from the HTML example:
/span/a/span/text()
XPaths can be very complex, but for now we keep to this structure:
/step1/step2/step3
A step looks like this:
axis::nodetest[predicate]
The full version of the example XPath would be:
/child::span[1]/child::a[1]/child::span[1]/child::text()[1]
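As an aside, such a path can be tried out with Python's standard library. ElementTree supports only a limited XPath subset (no explicit axes like child:: and no text() node tests), so this sketch selects the inner span element and reads its .text instead:

```python
import xml.etree.ElementTree as ET

# The HTML fragment from the example, written as well-formed XML.
fragment = "<span><a><span>Jeff Bridges</span></a></span>"
root = ET.fromstring(fragment)  # root is the outer span

# Equivalent of /span/a/span, relative to the root element;
# ElementTree has no text() node test, so read .text instead.
star = root.find("./a/span")
print(star.text)  # Jeff Bridges
```

A full XPath engine (e.g. the one inside a browser) would accept the expanded child::/text() form directly.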
In detail... the XPath step
axis::nodetest[predicate]
- The axis is a first selection:
- All the children (child), the parent (parent), itself (self) ...
- The nodetest is a second selection:
- Only span elements: span
- All the elements of the axis: *
- Only text elements: text()
- The predicate is a last filter
- Can be very complex
- The simplest: a single number, as in div[1]
- Don’t worry, we’re getting there.
- Remember the examples we have to give to the system?
We can now point to the user's examples with XPaths!
/div[1]/span[1]/a[1]/span[1]/text()
/div[1]/span[2]/a[1]/span[1]/text()
Given n XPaths, create a single one that selects the same nodes: a generalised XPath
- Remove anything that is not similar!
- /div[1]/span[1]/a[1]/span[1]/text()[1]
- /div[1]/span[2]/a[1]/span[1]/text()[1]
- This leaves us with an XPath that selects all actors
- /div[1]/span/a[1]/span[1]/text()[1]
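This merge can be sketched in a few lines of Python, assuming (as in the example above) that all XPaths have the same number of steps; the helper name generalise is hypothetical:

```python
import re

def generalise(paths):
    """Merge XPaths step by step: identical steps are kept as-is;
    steps that differ only in their positional predicate lose it."""
    step_lists = [p.strip("/").split("/") for p in paths]
    merged = []
    for steps in zip(*step_lists):  # assumes equal step counts
        if len(set(steps)) == 1:
            merged.append(steps[0])
        else:
            # Strip the trailing [n] predicate and compare node names.
            names = {re.sub(r"\[\d+\]$", "", s) for s in steps}
            merged.append(names.pop() if len(names) == 1 else "*")
    return "/" + "/".join(merged)

examples = ["/div[1]/span[1]/a[1]/span[1]/text()[1]",
            "/div[1]/span[2]/a[1]/span[1]/text()[1]"]
print(generalise(examples))  # /div[1]/span/a[1]/span[1]/text()[1]
```

The equal-length assumption is exactly what breaks on real pages, which is why alignment is introduced next.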
Okay.
It's not that simple.
- The structure of documents is not always that similar
- We need a nice way of making generalisation always possible
Finding and exploiting similarities has been done with strings
- A string is a list of characters, e.g., "thedude"
- Edit distance between two strings a and b
- = the minimal number of edit operations needed to transform a into b
- Add character
- Remove character
- Replace character
For example, "Thedude" can be transformed into "Teddy" by:
- Removing the h
- Removing the u
- Replacing the final e by y
Leading to an edit distance of 3
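This edit distance (the Levenshtein distance) has a classic dynamic-programming implementation; a minimal sketch:

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimal number of single-character
    additions, removals and replacements turning a into b."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # remove ca
                            curr[j - 1] + 1,      # add cb
                            prev[j - 1] + cost))  # replace ca by cb
        prev = curr
    return prev[-1]

print(edit_distance("Thedude", "Teddy"))  # 3
```

Keeping the full table (instead of only the previous row) also yields the alignment mentioned next, by tracing back which operation produced each cell.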
Using the edit distance, an alignment can be calculated!
This is also possible for multiple strings
We can use this alignment for our XPaths!
Instead of characters in strings...
...we use the XPath /s/t/e/p/s in the XPaths.
Problem 1 Generalise user's examples
Now it's for real, these are some user XPaths!
- /html[1]/body[1]/div[2]/div[1]/table[1]/tr[1]/td[1]
- /html[1]/body[1]/div[2]/div[2]/table[1]/tr[3]/td[1]
- /html[1]/div[2]/div[2]/table[1]/tr[3]/td[1]
A possible alignment:
- /html[1]/body[1]/div[2]/div[1]/table[1]/tr[1]/td[1]
- /html[1]/body[1]/div[2]/div[2]/table[1]/tr[3]/td[1]
- /html[1]/. /div[2]/div[2]/table[1]/tr[3]/td[1]
Always the same number of steps for each!
Merge the steps...
...and create the generalised XPath!
The result: /html[1]//div[2]/div/table[1]/tr/td[1]
- A single generalised XPath
- That retrieves all the original examples
- Possibly even more?
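The align-then-merge idea can be sketched with the standard library's difflib, here pairwise for the first and third example XPaths (merging n XPaths would repeat this); the helper name align_and_merge is an assumption:

```python
import difflib
import re

def align_and_merge(a, b):
    """Align two XPaths step by step and merge them: steps missing on
    one side become a // gap; steps that differ only in their
    positional predicate keep the bare node name."""
    sa, sb = a.strip("/").split("/"), b.strip("/").split("/")
    merged = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=sa, b=sb).get_opcodes():
        if op == "equal":
            merged.extend(sa[i1:i2])
        elif op == "replace":
            for x, y in zip(sa[i1:i2], sb[j1:j2]):
                nx, ny = (re.sub(r"\[\d+\]$", "", s) for s in (x, y))
                merged.append(nx if nx == ny else "*")
        else:  # insert/delete: a gap, bridged by the descendant axis
            merged.append("")  # empty step joins into //
    return "/" + "/".join(merged)

a = "/html[1]/body[1]/div[2]/div[1]/table[1]/tr[1]/td[1]"
b = "/html[1]/div[2]/div[2]/table[1]/tr[3]/td[1]"
print(align_and_merge(a, b))  # /html[1]//div[2]/div/table[1]/tr/td[1]
```

The thesis' alignment is based on the edit-distance machinery above rather than difflib, but the merging rules shown (drop differing predicates, turn gaps into //) are the same.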
Solution 1 Generalised XPath
The generalised XPath solves the first problem:
Given some user examples, find all the relevant items.
Solution:
- Convert all user examples to XPaths
- Align and merge the examples, create generalised XPath
- Execute the generalised XPath on web pages
Problem 2 Improve quality
Generalised XPaths work well for:
- Documents with a lot of structure
- Uniquely identifiable structures
- Not much replication
Problematic situations exist where structure alone is not enough
For example!
Imagine two web pages with a similar structure
But, disaster...
someone made a mistake.
Probably Frank
">
3rd row
2nd row
Based on these examples...
The generalised XPath will be too general.
Solution 2 Exploit context!
- Exploit more than just structure
- Look around for helpful evidence or context
- Exploit text
- Exploit styling
- Exploit common ancestors
Introducing Advanced predicates
XPath predicates already helped us:
We can make them more complex:
- div[child::a] → select a div that has an a child
- div[preceding-sibling::table]
- div[preceding-sibling::text()="Stars"]
Solution 2 Predicate enrichment!
Add predicates to the generalised XPath to make it narrower
- Not allowing undesired results ...
- ... increases data extraction precision
- In this work, 6 types of predicates are proposed
- They are automatically added
- A preparation step is needed
- Build set of indicated nodes
- Build set of overflow nodes
Predicate enrichment Preparation
<body>
<div>
<span>
<b>Some bold statement</b>
</span>
</div>
</body>
The circles are user examples.
Using the examples, build this generalised XPath
/body[1]/div//span[1]/b[1]
...which results in one undesired node
The nodes that we travelled through on the way to an example are indicated nodes, others are overflow nodes
Solution 2 Predicate enrichment
- The generalised XPath targets too much
- Use predicates to narrow the scope
- Find a suitable predicate
- Automatically
- Based on rules and context
- Not too restrictive
- Should still be readable
Example
- The desired b tags all have a span parent and div grandparent
- This can be added as a predicate to the span tags
- The indicated nodes fulfill it
- ... and the overflow nodes don't
- Result: /body[1]/div//span[1][parent::div]/b[1]
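A minimal sketch of how one such predicate could be selected, assuming nodes are represented simply by their parent's tag name; the helper parent_predicate and this node representation are hypothetical, not the thesis' implementation:

```python
def parent_predicate(indicated, overflow):
    """Try each parent name seen among the indicated nodes; keep a
    predicate only if every indicated node satisfies it and no
    overflow node does."""
    for name in {n["parent"] for n in indicated}:
        if all(n["parent"] == name for n in indicated) and \
           not any(n["parent"] == name for n in overflow):
            return f"[parent::{name}]"
    return ""  # no discriminating parent predicate exists

# Indicated span nodes sit under a div; the overflow span's parent
# is assumed different (e.g. body) for illustration.
indicated = [{"parent": "div"}, {"parent": "div"}]
overflow = [{"parent": "body"}]
print(parent_predicate(indicated, overflow))  # [parent::div]
```

The returned predicate is what gets spliced into the generalised XPath, as in /body[1]/div//span[1][parent::div]/b[1].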
Solution 2 Predicate enrichment
In this work, 6 predicates are proposed
- Each fulfills a specific purpose
- Easily expandable
Tests indicate increased precision
- Fewer undesired elements are retrieved
- Positive impact on overall quality
Can everything up until now be somehow automated?
Consider this question in the context of set expansion
- Begin with a set of elements
- Automatically expand that set with relevant elements
Let's say we already know a set of movie titles
Question
If we get a list of 1000 IMDB web pages, can we automatically find new movie titles to extend our set?
Strategy
- Look for the titles we know
- Figure out rules for them
- Generalise XPaths of the found titles
Matches
Titles are found in obvious places...
...and less obvious places.
Found terms?
- For each match, generate its XPath
- If there are multiple on a page
- Cluster the XPaths
- Similar XPaths get grouped together
- Each cluster results in a generalised XPath
Assume that known movie titles are found multiple times on the known web pages
- /body[1]/html[1]/div[2]/div[4]/h1[1]
- /body[1]/html[1]/head[1]/title[1]
- /body[1]/html[1]/div[2]/div[4]/h1[1]
- /body[1]/html[1]/div[2]/div[4]/div[1]/div[1]/a[1]/b[1]
- /body[1]/html[1]/div[2]/div[3]/h1[1]
- /body[1]/html[1]/div[1]/div[3]/div[1]/div[1]/a[1]/b[1]
One cluster:
- Has similar XPaths
- They can be generalised!
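The clustering of the six matches above could be sketched as a greedy grouping by step-list similarity; the 0.7 threshold and the use of difflib's similarity ratio are assumptions for illustration, not the thesis' actual measure:

```python
import difflib

def cluster_xpaths(xpaths, threshold=0.7):
    """Greedy clustering: each XPath joins the first cluster whose
    representative's step list is similar enough, else starts a new one."""
    clusters = []
    for path in xpaths:
        steps = path.strip("/").split("/")
        for cluster in clusters:
            rep = cluster[0].strip("/").split("/")
            if difflib.SequenceMatcher(a=rep, b=steps).ratio() >= threshold:
                cluster.append(path)
                break
        else:
            clusters.append([path])
    return clusters

matches = [
    "/body[1]/html[1]/div[2]/div[4]/h1[1]",
    "/body[1]/html[1]/head[1]/title[1]",
    "/body[1]/html[1]/div[2]/div[4]/h1[1]",
    "/body[1]/html[1]/div[2]/div[4]/div[1]/div[1]/a[1]/b[1]",
    "/body[1]/html[1]/div[2]/div[3]/h1[1]",
    "/body[1]/html[1]/div[1]/div[3]/div[1]/div[1]/a[1]/b[1]",
]
for c in cluster_xpaths(matches):
    print(c)
```

With these inputs the three h1 paths land in one cluster, the title path in a second, and the two a/b paths in a third.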
A cluster is merged...
- /body[1]/html[1]/div[2]/div[4]/h1[1]
- /body[1]/html[1]/div[2]/div[4]/h1[1]
- /body[1]/html[1]/div[2]/div[3]/h1[1]
...into a generalised XPath
- /body[1]/html[1]/div[2]/div/h1[1]
Re-execute the generalised XPath on:
- The known web pages
- Could contain lists of movie titles?
- Maybe multiple pages contain a single title in the same place?
- Web pages on which nothing was found yet
- Maybe we can find new items?
- This expands the original set!
Main focus: Data extraction from (semi-)structured documents.
The following problems were investigated:
Problem 1
Can a data extraction method be constructed using user examples?
Generalised XPaths built from aligned user examples
Problem 2
Can context be used to increase precision?
Predicate enrichment of generalised XPaths
Problem 3
Can set expansion be performed with enriched XPaths?
Automated lookups were investigated using enriched XPaths
Thank you for your attention
Questions are most welcome!