Public PhD defence
of Joachim Nielandt
XPath-Based Information Extraction

Three problems are answered in this thesis

  1. Can a data extractor be built using user examples and XPath?
  2. Is it possible to increase the extraction quality using context?
  3. Can automation be built into the system?

What does it take to get a web page to a user?

Server (www.google.com) → HTML → Browser (Google Chrome) → Visual → User (the internet detective)

The server typically serves HTML content

Just like unicorns are very pretty horses

This piece of the IMDB website...

...looks like this as HTML code.

HTML code is built using tags

<div>  opens the tag
</div>  closes the tag
<div>anything in between is content</div>
<and attributes="have values">
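For readers who want to play along, here is a minimal sketch (Python with lxml; the snippet and its class attribute are made up for illustration) that parses a tiny fragment and reads back the tag, an attribute and the content:

from lxml import html

snippet = '<div class="star"><span>Jeff Bridges</span></div>'
root = html.fromstring(snippet)

print(root.tag)           # div          -> the tag name
print(root.get("class"))  # star         -> an attribute value
print(root[0].text)       # Jeff Bridges -> the content between the <span> tags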

Server (www.google.com) → HTML → Browser (Google Chrome) → Visual → User (the internet detective)

The browser downloads the HTML and it...

Eventually this is about me, right?
  1. I love movies
  2. I want to build my own cinema
  3. I need to track all the movies I own
  4. IMDB has the information, but I cannot easily use it

Extract all the stars

Impractical to get all that information manually

  • There’s a lot of IMDB content: 4.1 million titles
  • What do you do with new movies?
  • What about errors corrected by IMDB?

Assume: each movie page shows its stars in the same spot

The user gives two examples, Jeff and John...

... and the goal is to automatically find Julianne!

Ideally, this process can then be performed on other movies' pages too

Let's look at the underlying DOM structure

An XPath example...

This XPath selects Jeff Bridges from the HTML example:

/span/a/span/text()

XPaths can be very complex, but for now we keep to this structure:

/step1/step2/step3

A step looks like this:

axis::nodetest[predicate]

The full version of the example XPath would be:

/child::span[1]/child::a[1]/child::span[1]/child::text()[1]
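As a quick check (a minimal sketch using Python with lxml on a made-up fragment, not code from the thesis), the short form and the fully spelled-out form select exactly the same text node:

from lxml import etree

fragment = "<span><a><span>Jeff Bridges</span></a></span>"
doc = etree.fromstring(fragment)

short_form = doc.xpath("/span/a/span/text()")
full_form  = doc.xpath("/child::span[1]/child::a[1]/child::span[1]/child::text()[1]")
print(short_form)  # ['Jeff Bridges']
print(full_form)   # ['Jeff Bridges']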

In detail: the XPath step

axis::nodetest[predicate]

We can now point to the user's examples with XPaths!

/div[1]/span[1]/a[1]/span[1]/text()
/div[1]/span[2]/a[1]/span[1]/text()

Considering n XPaths, create a single one that does the same: a generalised XPath

Okay.

It's not that simple.

Finding and exploiting similarities has been done with strings

For example, "Thedude" can be transformed into "Teddy" by deleting 'h', deleting 'u' and substituting the final 'e' with 'y'

This leads to an edit distance of 3

Using the edit distance, an alignment can be calculated!

T h e d u d e
T _ e d _ d y

This is also possible for multiple strings

We can use this alignment for our XPaths!

Instead of characters in strings...

...we use the XPath /s/t/e/p/s in the XPaths.
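To make this concrete, here is a minimal sketch (plain Levenshtein alignment in Python; the thesis may use a different cost model) that aligns two sequences, whether they are the characters of "Thedude" and "Teddy" or the steps of two XPaths:

def align(a, b, gap="_"):
    """Return one optimal alignment of two sequences as (aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete from a
                             dist[i][j - 1] + 1,         # insert into a
                             dist[i - 1][j - 1] + cost)  # match / substitute
    # Trace back through the table to recover the alignment itself.
    out_a, out_b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append(gap); i -= 1
        else:
            out_a.append(gap); out_b.append(b[j - 1]); j -= 1
    return out_a[::-1], out_b[::-1]

print(align("Thedude", "Teddy"))   # characters: T h e d u d e / T _ e d _ d y
xp1 = "/div[1]/span[1]/a[1]/span[1]/text()".strip("/").split("/")
xp2 = "/div[1]/span[2]/a[1]/span[1]/text()".strip("/").split("/")
print(align(xp1, xp2))             # the same routine, now on XPath steps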

Problem 1 Generalise user's examples

Now it's for real: these are some user XPaths!

A possible alignment:

Always the same number of steps for each!

Merge the steps...

...and create the generalised XPath!

The result: /html[1]//div[2]/div/table[1]/tr/td[1]
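For intuition, a toy merge rule could look like the following (a sketch under my own assumptions, not the thesis's exact algorithm): identical aligned steps are kept, steps that differ only in their position index lose the index, and a gap in the alignment turns into a '//':

def merge_steps(step_a, step_b, gap="_"):
    if step_a == step_b:
        return step_a            # e.g. a[1] and a[1]       -> a[1]
    if gap in (step_a, step_b):
        return ""                # an empty step joins as '//'
    name_a, name_b = step_a.split("[")[0], step_b.split("[")[0]
    if name_a == name_b:
        return name_a            # e.g. span[1] and span[2] -> span
    return "*"                   # completely different tags -> wildcard

aligned_a = ["div[1]", "span[1]", "a[1]", "span[1]", "text()"]
aligned_b = ["div[1]", "span[2]", "a[1]", "span[1]", "text()"]
print("/" + "/".join(merge_steps(a, b) for a, b in zip(aligned_a, aligned_b)))
# /div[1]/span/a[1]/span[1]/text()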

Solution 1 Generalised XPath

The generalised XPath solves the first problem:

Given some user examples, find all the relevant items.


Problem 2 Improve quality

Generalised XPaths work well for:

Problematic situations exist where structure alone is not enough

For example!

Imagine two web pages with a similar structure

But, disaster... someone made a mistake.

Probably Frank

On one page the item is in the 3rd row, on the other in the 2nd row.

Based on these examples...

...the generalised XPath will be too general.

Solution 2 Exploit context!
  • Exploit text
  • Exploit styling
  • Exploit common ancestors
Introducing Advanced predicates

XPath predicates already helped us:

We can make them more complex:
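For illustration only (these are hypothetical predicates on a made-up snippet, not necessarily the ones proposed in the thesis), richer predicates can look at text, at styling attributes, or at ancestors:

from lxml import html

page = html.fromstring(
    '<div id="cast"><span class="star">Jeff Bridges</span>'
    '<span class="director">Joel Coen</span></div>')

print(page.xpath("//span[contains(text(), 'Jeff')]/text()"))     # exploit text
print(page.xpath("//span[@class = 'star']/text()"))              # exploit styling
print(page.xpath("//span[ancestor::div[@id = 'cast']]/text()"))  # exploit ancestors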
Solution 2 Predicate enrichment!

Add predicates to the generalised XPath to make it narrower

  • Not allowing undesired results ...
  • ... increases data extraction precision
Predicate enrichment Preparation
<body>
    <div>
        <span>
            <b>Some bold statement</b>
        </span>
    </div>
</body>
The circles are user examples.

Using the examples, build this generalised XPath

/body[1]/div//span[1]/b[1]

...which results in one undesired node


The nodes we travelled through on the way to an example are indicated nodes; the others are overflow nodes

Solution 2 Predicate enrichment
Example
  • The desired b tags all have a span parent and div grandparent
  • This can be added as a predicate to the span tags
    • The indicated nodes fulfill it
    • ... and the overflow nodes don't
  • Result: /body[1]/div//span[1][parent::div]/b[1]
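A minimal sketch of the selection idea (my own simplified reading, applied to a made-up page): a candidate predicate is kept when every indicated node satisfies it and no overflow node does.

from lxml import etree

page = etree.fromstring(
    "<body><div><span><b>Jeff Bridges</b></span></div>"
    "<div><span><b>John Goodman</b></span></div>"
    "<div><p><span><b>Some other text</b></span></p></div></body>")

spans = page.xpath("/body[1]/div//span[1]")   # nodes visited at the span step
indicated = spans[:2]    # lie on a path to a user example
overflow  = spans[2:]    # only lead to the undesired b node

def separates(predicate, indicated, overflow):
    return (all(n.xpath(predicate) for n in indicated)
            and not any(n.xpath(predicate) for n in overflow))

print(separates("parent::div", indicated, overflow))   # True -> keep the predicate
print(page.xpath("/body[1]/div//span[1][parent::div]/b[1]/text()"))
# ['Jeff Bridges', 'John Goodman']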
Solution 2 Predicate enrichment

In this work, 6 predicates are proposed

Tests indicate increased precision

Can everything up until now be somehow automated?

Consider this question in the context of set expansion

Let's say we already know a set of movie titles

Question
If we get a list of 1000 IMDB web pages, can we automatically find new movie titles to extend our set?
Strategy
  • Look for the titles we know
  • Figure out rules for them
  • Generalise XPaths of the found titles
Matches
Titles are found in obvious places...
Matches
...and less obvious places.
Found terms?
  • For each match, generate its XPath
  • If there are multiple matches on a page
    • Cluster the XPaths
    • Similar XPaths get grouped together
  • Each cluster results in a generalised XPath

Assume that known movie titles are found multiple times on the known web pages

Cluster!
One cluster
  • Has similar XPaths
  • They can be generalised!

A cluster is merged...

...into a generalised XPath
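As a rough sketch of the clustering step (assumptions: a plain step-wise edit distance, a greedy grouping with an arbitrary threshold, and made-up XPaths; the thesis may cluster differently):

def step_distance(xp_a, xp_b):
    """Edit distance between two XPaths, counted in steps."""
    a, b = xp_a.strip("/").split("/"), xp_b.strip("/").split("/")
    d = [[max(i, j) if i == 0 or j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[-1][-1]

def cluster(xpaths, threshold=2):
    """Greedily group XPaths that lie close to the first member of a cluster."""
    clusters = []
    for xp in xpaths:
        for c in clusters:
            if step_distance(xp, c[0]) <= threshold:
                c.append(xp)
                break
        else:
            clusters.append([xp])
    return clusters

matches = ["/html[1]/body[1]/div[2]/h1[1]/text()",              # title in a heading
           "/html[1]/body[1]/div[3]/h1[1]/text()",
           "/html[1]/body[1]/div[5]/ul[1]/li[7]/a[1]/text()"]    # title in a link list
for c in cluster(matches):
    print(c)   # each cluster is then merged into one generalised XPath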

Repeat execution of the generalised XPath on the other web pages to find new movie titles

Main focus: Data extraction from (semi-)structured documents.

The following problems were investigated:

Problem 1
Can a data extraction method be constructed using user examples?
Generalised XPaths
Problem 2
Can context be used to increase precision?
Predicate enrichment of generalised XPaths
Problem 3
Can set expansion be performed with enriched XPaths?
Automated lookups were investigated using enriched XPaths

Thank you for your attention

Any questions are most welcome!