1. Project Presentation:
To perform information retrieval on a set of texts and documents, it is very useful to know what topics the material covers. However, we do not always have this information, and instead of browsing manually through the documents, we want to automatically produce a list of the noun phrases contained in the texts, in order to create an index of the subjects addressed by the author(s). This index can then be used as a reference by web search engines, for example. The purpose of this project is to find an efficient way of detecting noun phrases in a given text, and to mark them.
Contents:
2. Definitions
   2.1. Noun phrases
   2.2. Hidden Markov Model
3. Creating a noun phrase marker
   3.1. Collecting data to build the HMM
   3.2. Building the HMM
   3.3. Evaluating and using the model
4. Results and observations
   4.0. Evaluation of usevit on 25 files randomly taken from the PennTree bank
   4.1. Size of training data
   4.2. "esthmm" function
   4.3. Texts marked with "usevit"
5. References
2. Definitions:
2.1. Noun phrases
We call a noun phrase any group of words that ends in a noun. A noun phrase does not contain any verb or preposition. A simple noun phrase is composed of: an optional determiner, zero or more adjectives, and a singular or plural noun. This structure can be represented by the following tag sequence: [DT] + JJ* + { NN | NNS }. However, in practice, noun phrases with various structures can be found; they may contain sequences of nouns or proper nouns, etc.
Example (from a PennTree Bank file):
"These old events have no bearing on our work force today," said Darrell Phillips.
The noun phrases and their tag sequences are:
- "These old events": DT + JJ + NNS
- "no bearing": DT + NN
- "our work force": PRP$ + NN + NN
- "Darrell Phillips": NNP + NNP
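The simple pattern [DT] + JJ* + { NN | NNS } can be checked mechanically over a tag sequence. The following is a minimal sketch using a regular expression; the helper `is_simple_np` is hypothetical, for illustration only, and is not one of the project's programs:

```python
import re

# Simple noun-phrase pattern from above: [DT] + JJ* + { NN | NNS }
# (hypothetical checker for illustration, not one of the project's tools)
SIMPLE_NP = re.compile(r"(DT )?(JJ )*(NN|NNS)")

def is_simple_np(tags):
    """True if the whole tag sequence matches the simple NP pattern."""
    return SIMPLE_NP.fullmatch(" ".join(tags)) is not None

print(is_simple_np(["DT", "JJ", "NNS"]))   # "These old events" -> True
print(is_simple_np(["PRP$", "NN", "NN"]))  # "our work force" -> False (richer structure)
```

As the second call shows, NPs like "our work force" fall outside the simple pattern, which is why the project turns to a statistical model rather than pattern matching.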
To "mark" noun phrases, we adopt the following convention:
- the tag <NP> marks the start of a noun phrase;
- the tag <eNP> marks the end of a noun phrase.
This is how our previous example would look (with the rest of the sentence included):
<NP> These old events <eNP> have <NP> no bearing <eNP> on <NP> our work force <eNP> today said <NP> Darrell Phillips <eNP> <NP> manager <eNP> of <NP> human resources <eNP>
In our problem, we want to find the noun phrase boundaries in a plain text. It is easy to obtain a tagged text with a tagger such as the Brill tagger. Here is a tagged version of our example (sentence boundary tags have been added):
<S> These/DT old/JJ events/NNS have/VBZ no/DT bearing/NN on/IN our/PRP$ work/NN force/NN today/NN said/VBD Darrell/NNP Phillips/NNP manager/NN of/IN human/JJ resources/NNS <eS>
2.2. Hidden Markov Model
To perform noun phrase detection, we are going to build a hidden Markov
model.
A hidden Markov model lambda = (M, N, A, B, pi) is defined by:
- M, the number of output symbols;
- V, the set of symbols, V = {1, 2, ..., M};
- N, the number of hidden states;
- Q, the set of states, Q = {1, 2, ..., N};
- A, the state transition probability matrix:
  a_ij = P(q_(t+1) = j | q_t = i),  1 <= i, j <= N
- B, the observation probability distribution:
  b_j(k) = P(o_t = k | q_t = j),  1 <= j <= N, 1 <= k <= M
- pi, the initial state distribution:
  pi_i = P(q_1 = i),  1 <= i <= N
Therefore, if we can find a way to define the noun phrases as states, we will be able to know, for instance, the probability of starting a sentence with a noun phrase, and the probability of encountering noun phrases further in the text, given the words we encounter. In fact, we will use a five-state model:
1. starting a noun phrase (<NP>)
2. ending a noun phrase (<eNP>)
3. ending a noun phrase and immediately starting a new one (<eNP><NP>)
4. in a noun phrase
5. not in a noun phrase
In the above example, the state sequence is:
<NP> - in NP - in NP - <eNP> - <NP> - in NP - <eNP> - <NP> - in NP - in NP - <eNP> - not in NP - <NP> - in NP - <eNP><NP> - <eNP> - <NP> - in NP - <eNP>
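A state sequence of this kind maps directly back to marked text. As an illustrative sketch (a hypothetical helper, not one of the project's programs), each state simply determines which convention tags, if any, are emitted in front of the word:

```python
# Prefix emitted before each word, per state; "in" / "not in" add nothing.
# (hypothetical helper for illustration; a trailing <eNP> after the last
# word is not handled in this sketch)
PREFIX = {"<NP>": "<NP> ", "<eNP>": "<eNP> ", "<eNP><NP>": "<eNP> <NP> ",
          "in": "", "not in": ""}

def mark(words, states):
    """Rebuild marked text from words and their per-word states."""
    return "".join(PREFIX[s] + w + " " for w, s in zip(words, states)).rstrip()

print(mark(["These", "old", "events", "have"],
           ["<NP>", "in", "in", "<eNP>"]))
# -> <NP> These old events <eNP> have
```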
However, in a text where the noun phrases are not marked, what we actually see is a sequence of tags called the "output". The output takes the form of pairs of tags; in our example, it would be:
<S>-DT, DT-JJ, JJ-NNS, NNS-VBZ, ..., NNS-<eS>.
The problem really is to find out which states the model goes through to produce this output. The states are the unknowns; this is why our model is called "hidden".
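The pair-of-tags output described above can be built in a few lines. A minimal sketch (a hypothetical helper, using the <S>/<eS> sentence boundary markers from section 2.1):

```python
def tag_pairs(tags):
    """Turn a sentence's tag sequence into the pair-of-tags output,
    padded with <S>/<eS> sentence boundary markers (illustrative sketch)."""
    padded = ["<S>"] + tags + ["<eS>"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(tag_pairs(["DT", "JJ", "NNS"]))
# -> ['<S>-DT', 'DT-JJ', 'JJ-NNS', 'NNS-<eS>']
```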
3. Creating a noun phrase marker:
This Markov model can be estimated from training data (texts where the noun phrases are already marked) with the Baum-Welch algorithm. Then, given the HMM and the output sequence corresponding to case data (texts where the noun phrases are unknown), we can find the state sequence with the Viterbi algorithm (the method is given in the appendix), and mark the texts. Both algorithms are implemented in the umdhmm-v1.02 software created at the University of Maryland.
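For reference, the Viterbi decoding step can be sketched in a few lines of plain Python. This is an illustrative implementation, not the umdhmm code; states and symbols are 0-based here, unlike the 1-based convention of section 2.2:

```python
def viterbi(obs, A, B, pi):
    """Most likely state path for output sequence obs, given transition
    matrix A, emission matrix B and initial distribution pi (nested
    lists). Illustrative sketch, not the umdhmm implementation."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]  # best prob. ending in state i
    psi = []                                          # backpointers per step
    for o in obs[1:]:
        step, back = [], []
        for j in range(N):
            best = max(range(N), key=lambda i: delta[i] * A[i][j])
            step.append(delta[best] * A[best][j] * B[j][o])
            back.append(best)
        delta = step
        psi.append(back)
    path = [max(range(N), key=lambda i: delta[i])]    # best final state
    for back in reversed(psi):                        # follow backpointers
        path.append(back[path[-1]])
    return path[::-1]

# Toy 2-state model that deterministically alternates states:
A = [[0.0, 1.0], [1.0, 0.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
print(viterbi([0, 1, 0], A, B, pi=[1.0, 0.0]))  # -> [0, 1, 0]
```

In practice, probabilities are usually multiplied in log space to avoid underflow on long sequences, which is presumably why testvit (section 3.3) also offers a logarithmic method.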
3.1. Collecting data to build the HMM
The training data used to build the hidden Markov model is a set of texts taken from the PennTree bank in which the noun phrases have already been marked and tags associated with each word. After some filtering to keep only the useful information, we obtain a noun-phrase-marked file and a tagged file:
<file p from PennTree bank> > Pennfilter > <p.pn and p.tag>
3.2. Building the HMM
Then, with a program called HMM, we obtain the model, consisting of:
- a .hmm file containing M, N, A, B, and pi;
- a .states file containing the sequence of states encountered while building the model;
- a .sym file containing the M different possible output symbols;
- a .seq file containing the whole coded output sequence.
<p.pn and p.tag> > HMM > <p.hmm p.states p.sym p.seq>
3.3. Evaluating and using the model
We now want to evaluate the model we have created, and to use it to produce marked texts. In both cases (evaluation and production), the raw output <S>-DT, DT-JJ, JJ-NNS, ... of the text we use for evaluation or production has to be converted into a set of numbers corresponding to the convention adopted in the model's .hmm file. A program called tag2sym does this: it uses the .sym file created previously by HMM to convert our tagged file into a .sym2 file usable by the Viterbi function.
<p.sym and text.tag> > tag2sym > <text.sym2>
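The conversion tag2sym performs amounts to a table lookup. The sketch below assumes the .sym file lists one output symbol per line, its position giving its numeric code; this format is an assumption for illustration, not taken from the umdhmm documentation:

```python
def load_symbols(sym_lines):
    """Map each output symbol (e.g. 'DT-JJ') to a 1-based code, assuming
    the .sym file lists one symbol per line (format assumed, not
    documented here)."""
    return {sym: i for i, sym in enumerate(sym_lines, start=1)}

def encode(pairs, symtab):
    """Encode a pair-of-tags sequence as the numeric codes the Viterbi
    function consumes."""
    return [symtab[p] for p in pairs]

symtab = load_symbols(["<S>-DT", "DT-JJ", "JJ-NNS"])
print(encode(["DT-JJ", "JJ-NNS"], symtab))  # -> [2, 3]
```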
"testvit", a function included in the HMM software produced by the University of Maryland, gives us an optimal estimate of the HMM states corresponding to a given output sequence. Testvit actually gives two different estimates, obtained with a logarithmic method and with the regular method; they give very similar results. In our case, each sentence of the text we want to mark produces one sequence, so to get an estimate for the whole text, we need to call testvit repeatedly. The function usevit was created for this purpose.
- In evaluation mode, usevit takes a training .sym2 file, estimates the state sequence with testvit, compares the results with the actual states, and gives us the accuracy for each method (logarithmic and regular).
<p.hmm text.sym2 text.states> > usevit > <accuracy for marking methods>
- In production mode, usevit takes a case .sym2 file, estimates the state sequence with testvit, and produces the resulting noun-phrase-tagged file.
<p.hmm and text.sym2> > usevit > <text.np2>
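The accuracy reported in evaluation mode is, presumably, the fraction of per-word states estimated correctly; the report does not spell out the exact metric, so the comparison is sketched here under that assumption:

```python
def state_accuracy(predicted, actual):
    """Fraction of positions where the estimated state equals the true
    state (an assumed metric -- how usevit scores the comparison is not
    documented in this report)."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

# States numbered 1-5 as in section 2.2; one estimate out of five is wrong:
print(state_accuracy([1, 4, 4, 2, 5], [1, 4, 2, 2, 5]))  # -> 0.8
```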
4. Results and observations:
4.0. Evaluation of usevit on 25 files randomly taken from the PennTree bank:
file         | # of sentences | accuracy
wsj_0003.mrg |  31 | 76.60 %
wsj_0114.mrg |  40 | 90.12 %
wsj_0261.mrg |  16 | 82.40 %
wsj_0265.mrg |  12 | 85.60 %
wsj_0267.mrg | 104 | 82.29 %
wsj_0664.mrg |  82 | 67.96 %
wsj_0664.mrg |  31 | 64.08 %
wsj_0810.mrg |  30 | 38.19 %
wsj_0814.mrg |  82 | 86.09 %
wsj_0817.mrg |  26 | 63.98 %
wsj_0818.mrg |  68 | 76.32 %
wsj_0819.mrg |  36 | 81.89 %
wsj_1480.mrg |  83 | 66.53 %
wsj_1481.mrg |   8 | 90.06 %
wsj_1482.mrg |  26 | 53.56 %
wsj_1790.mrg |  41 | 47.82 %
wsj_1792.mrg |  35 | 87.12 %
wsj_1795.mrg |  63 | 53.52 %
wsj_1797.mrg |  35 | 84.92 %
wsj_1900.mrg |   9 | 85.67 %
wsj_1901.mrg |   8 | 86.77 %
wsj_1902.mrg |   7 | 88.76 %
wsj_1902.mrg |  51 | 71.77 %
wsj_1904.mrg |   4 | 87.66 %
wsj_1905.mrg |  12 | 48.67 %
average accuracy: 73.93 %
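The average reported above is the unweighted mean of the 25 per-file accuracies, which a short script confirms:

```python
# Per-file accuracies from the table above, in order:
accuracies = [76.60, 90.12, 82.40, 85.60, 82.29, 67.96, 64.08, 38.19,
              86.09, 63.98, 76.32, 81.89, 66.53, 90.06, 53.56, 47.82,
              87.12, 53.52, 84.92, 85.67, 86.77, 88.76, 71.77, 87.66,
              48.67]
average = sum(accuracies) / len(accuracies)
print(f"average accuracy: {average:.2f} %")  # -> average accuracy: 73.93 %
```

Note that a sentence-weighted average would differ slightly, since the files vary from 4 to 104 sentences.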
4.1. Size of training data
The PennTree bank is very large; to build the model that produced the results above, I used only the files contained in directories 20 to 24, for storage quota reasons. However, I have tested my programs on smaller models (built with only one of the directories), and the evaluation results are better with the larger model. For example, evaluation on file wsj_0114.mrg gave 77% accuracy with a smaller model (instead of 90%), and evaluation on file wsj_0003.mrg gave 71% accuracy instead of 76%. Therefore, a bigger model would likely give even better results, and reduce the number of texts where the accuracy is under 50%. The size of the training data can be increased until we reach the point of convergence of the matrices, beyond which the accuracy is no longer affected.
4.2. "esthmm" function
The software package from the University of Maryland includes a function called esthmm, which estimates a hidden Markov model with the Baum-Welch algorithm, given N, M, and the output sequence of the training data. I tried this function on small sets of training data and compared its results with those obtained from HMM. I noticed quite significant differences in the matrices A and B, and especially in the vector pi. In fact, as pi is the initial state distribution, only two states can have nonzero initial probability: state 1 (starting a noun phrase) and state 5 (not in a noun phrase). The other states all have probability zero ("ending a noun phrase", "ending a noun phrase and starting a new one", and "in a noun phrase" are impossible if a noun phrase has not already been started). However, esthmm would sometimes give a probability close to 1 to one of these three impossible states. Moreover, when I tried to build a large model with esthmm, the function would run for a very long time with no result at all. These problems may come from a misunderstanding of how to use the function, but I decided to use HMM to build the Markov model for the next steps.
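The constraint on pi can be checked automatically. A sketch (a hypothetical check, assuming the states are numbered 1 to 5 in the order listed in section 2.2):

```python
def pi_is_valid(pi, tol=1e-9):
    """Check that only state 1 (<NP>) and state 5 (not in a noun phrase)
    receive initial probability; states 2-4 must be (numerically) zero.
    State numbering is assumed to follow section 2.2."""
    return all(pi[i] <= tol for i in (1, 2, 3))  # 0-based indices of states 2-4

print(pi_is_valid([0.7, 0.0, 0.0, 0.0, 0.3]))  # -> True
print(pi_is_valid([0.0, 0.9, 0.0, 0.0, 0.1]))  # -> False (mass on state 2)
```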
4.3. Texts marked with "usevit"
I have tried to mark the noun phrases in a few texts using "usevit" in production mode, and I observed that most of the errors come from noun phrases that were not tagged, rather than from parts of the text wrongly tagged as noun phrases. Therefore, if we were to use the noun phrases detected with usevit to build an index, we would get a slightly incomplete one, but it would still be useful for determining which topics are dealt with in the text.
5. References:
- Package UMDHMM version 1.02, produced by T. Kanungo at the University of Maryland, available at: http://www.cfar.umd.edu/~kanungo/software/software.html
- CSC401 lecture notes, slides for topics 6 and 8, by Prof. Hirst at the University of Toronto.
I would like to thank Prof. Hirst for his help throughout this project.