
APM 361 Course Project Report:

Markov Model for noun-phrase detection

1. Project Presentation:

To perform information retrieval on a set of texts and documents, it is very useful to know what topics the material covers. However, we do not always have this information, and instead of browsing through the documents manually, we want to automatically produce a list of the noun phrases contained in the texts, in order to create an index of the subjects addressed by the author(s). This index can then be used as a reference by web search engines, for example. The purpose of this project is to find an efficient way of detecting noun phrases in a given text and marking them.

2. Definitions

2.1. Noun phrases
2.2. Hidden Markov Model

3. Creating a noun phrase marker

3.1 Collecting data to build the hmm
3.2 Building the hmm
3.3. Evaluating and using the model

4. Results and observations

4.0. Evaluation of usevit on 25 files randomly taken from the PennTree bank
4.1. Size of training data
4.2. "esthmm" function.
4.3. Texts marked with "usevit"

5. References

2. Definitions:

2.1. Noun phrases

We call a noun phrase any group of words that ends in a noun. A noun phrase does not contain any verb or preposition. A simple noun phrase is composed of: an optional determiner, zero or more adjectives, and a singular or plural noun. This structure can be represented by the following tag sequence: [DT] + JJ* + { NN | NNS }. In practice, however, noun phrases with various other structures can be found; they may contain sequences of nouns or proper nouns, etc.
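The simple pattern above can equivalently be read as a regular expression over a string of tags. A minimal sketch in Python (illustrative only; the detector built in this project is an HMM, not a regex):

```python
import re

# [DT] + JJ* + { NN | NNS }, written as a regex over a space-separated
# tag string. Only the simple noun-phrase structure is covered here.
SIMPLE_NP = re.compile(r"(DT )?(JJ )*NNS?")

# "DT JJ NNS" (e.g. "the old events") matches the simple pattern:
SIMPLE_NP.fullmatch("DT JJ NNS")
```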

Example (from a PennTree Bank file):
"These old events have no bearing on our work force today, said Darrell Phillips, manager of human resources."
The noun phrases here are: "these old events", "no bearing", "our work force", "Darrell Phillips", "manager", and "human resources".

To "mark" noun phrases, we adopt the following convention:
- The tag <NP> marks the start of a noun phrase
- The tag <eNP> marks the end of a noun phrase

This is how our previous example would look:

<NP> these old events <eNP> have <NP> no bearing <eNP> on <NP> our work force <eNP> today said <NP> darrell phillips <eNP> <NP> manager <eNP> of <NP> human resources <eNP>

In our problem, we want to find the noun phrase boundaries in a plain text. It is easy to obtain a tagged text with a tagger such as the Brill tagger.
Here is a tagged version of our example (sentence boundary tags have been added):

/<S> These/DT old/JJ events/NNS have/VBZ no/DT bearing/NN on/IN our/PRP$ work/NN force/NN today/NN said/VBD Darrell/NNP Phillips/NNP manager/NN of/IN human/JJ resources/NNS /<eS>

2.2. Hidden Markov Model

To perform noun phrase detection, we are going to build a hidden Markov model.
A hidden Markov model lambda = (M, N, A, B, pi) is defined by:
- M, the number of output symbols
- V, the set of symbols V = {1, 2, ..., M}
- N, the number of hidden states
- Q, the set of states Q = {1, 2, ..., N}
- A, the state transition probability matrix
  a_ij = P(q_{t+1} = j | q_t = i), 1 <= i, j <= N
- B, the observation probability distribution
  b_j(k) = P(o_t = k | q_t = j), 1 <= j <= N, 1 <= k <= M
- pi, the initial state distribution
  pi_i = P(q_1 = i), 1 <= i <= N

Therefore, if we can find a way to represent noun-phrase positions as states, we will be able to know, for instance, the probability of starting a sentence with a noun phrase, and the probability of encountering noun phrases further in the text, given the words we encounter. In fact, we will use a five-state model:
- starting a noun phrase (<NP>)
- ending a noun phrase (<eNP>)
- ending a noun phrase and immediately starting a new one (<eNP><NP>)
- in a noun phrase
- not in a noun phrase

In the above example, the state sequence is:

<NP> - in NP - in NP - <eNP> - <NP> - in NP - <eNP> - <NP> - in NP - in NP - <eNP> - not in NP - <NP> - in NP - <eNP><NP> - <eNP> - <NP> - in NP - <eNP>
However, in a text where the noun phrases are not marked, what we actually see is a sequence of tags called the "output". The output takes the form of pairs of tags, and in our example, it would be:
<S>-DT, DT-JJ, JJ-NNS, NNS-VBZ, ..., NNS- <eS>.
The problem really is to find out which states the model goes through to produce this output. The states are the unknowns; this is why our model is called "hidden".
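The observed tag pairs have to be turned into the integer symbols 1, ..., M that the model works with. A minimal sketch of that encoding (function and variable names are my own, not those of the programs used in this project):

```python
def tag_pairs(tags):
    """Turn a sentence's tag sequence into the consecutive pairs the
    model observes, e.g. ["<S>", "DT", "JJ"] -> [("<S>","DT"), ("DT","JJ")]."""
    return list(zip(tags, tags[1:]))

def encode(pairs, symbol_table):
    """Map each tag pair to an integer symbol in 1..M, growing the
    symbol table as new pairs are seen."""
    out = []
    for pair in pairs:
        if pair not in symbol_table:
            symbol_table[pair] = len(symbol_table) + 1
        out.append(symbol_table[pair])
    return out

# The example sentence's opening tags:
tags = ["<S>", "DT", "JJ", "NNS", "VBZ", "DT", "NN", "<eS>"]
table = {}
seq = encode(tag_pairs(tags), table)
```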

3. Creating a noun phrase marker:

This Markov model can be estimated from training data (texts where the noun phrases are already marked) with the Baum-Welch algorithm. Then, given the hmm and the output sequence corresponding to case data (texts where the noun phrases are unknown), we can find the state sequence with the Viterbi algorithm (algorithm given in the appendix), and mark the texts. Both algorithms are implemented in the umdhmm-v1.02 software created at the University of Maryland.
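The Viterbi step can be sketched in Python as follows. This is an illustrative reimplementation under my own naming and 0-indexing, not the umdhmm code; it works in log probabilities, as does the "logarithmic method" mentioned in section 3.3.

```python
import math

def viterbi(pi, A, B, obs):
    """Most likely state sequence for an observation sequence obs,
    given initial distribution pi, transition matrix A, and emission
    matrix B (all 0-indexed). Log probabilities avoid underflow."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")
    N = len(pi)
    # delta[i]: best log-probability of any path ending in state i
    delta = [log(pi[i]) + log(B[i][obs[0]]) for i in range(N)]
    back = []  # backpointers, one list per time step after the first
    for o in obs[1:]:
        prev, ptr, delta = delta, [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: prev[i] + log(A[i][j]))
            delta.append(prev[best_i] + log(A[best_i][j]) + log(B[j][o]))
            ptr.append(best_i)
        back.append(ptr)
    # Follow the backpointers from the best final state.
    state = max(range(N), key=lambda i: delta[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))
```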

3.1. Collecting data to build the hmm.

The training data used to build the Hidden Markov Model is a set of texts taken from the PennTree bank, in which the noun phrases have already been marked and a tag associated with each word. After some filtering to keep only the useful information, we obtain a noun-phrase-marked file and a tagged file.
<file p from Penntree bank> > Pennfilter > <p.pn and p.tag>

3.2 Building the hmm

Then, with a program called HMM, we obtain the hmm model, consisting of:
- A .hmm file containing M, N, A, B, and pi
- A .states file containing the sequence of states encountered while building the model
- A .sym file containing the M different possible output symbols
- A .seq file containing the whole coded output sequence
<p.pn and p.tag> > HMM > <p.hmm p.states p.sym p.seq>
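Because the training text is already marked, the model parameters reduce to relative-frequency counts over the observed state and symbol sequences. A minimal sketch of such supervised estimation (the internals of the HMM program are not shown in this report, so the names and 0-indexing here are my own):

```python
def estimate_hmm(state_seqs, sym_seqs, N, M):
    """Estimate A, B, and pi by counting over training sentences, where
    state_seqs[s][t] in 0..N-1 and sym_seqs[s][t] in 0..M-1."""
    trans = [[0] * N for _ in range(N)]   # state -> next-state counts
    emit = [[0] * M for _ in range(N)]    # state -> symbol counts
    start = [0] * N                       # sentence-initial state counts
    for states, syms in zip(state_seqs, sym_seqs):
        start[states[0]] += 1
        for q, o in zip(states, syms):
            emit[q][o] += 1
        for q, q2 in zip(states, states[1:]):
            trans[q][q2] += 1
    def normalize(row):
        total = sum(row)
        return [c / total if total else 0.0 for c in row]
    A = [normalize(r) for r in trans]
    B = [normalize(r) for r in emit]
    pi = normalize(start)
    return A, B, pi
```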

3.3. Evaluating and using the model

We now want to evaluate the model we have created, and to use it to produce marked texts. In both cases (evaluation and production), the raw output <S>-DT, DT-JJ, JJ-NNS, ... of the text used for evaluation or production has to be converted into a set of numbers corresponding to the convention adopted in the model's .hmm file. A program called tag2sym does this: it uses the .sym file created previously by HMM to convert our tagged file into a .sym2 file usable by the Viterbi function.
<p.sym and text.tag> > tag2sym > <text.sym2>
"testvit", a function included in the hmm software produced by the University of Maryland, gives us an optimal estimate of the hmm states corresponding to a given output sequence. testvit actually gives two different estimates, one obtained with a logarithmic method and one with the regular method; they give very similar results. In our case, each sentence of the text we want to mark produces one sequence, so to get an estimate for the whole text we need to run testvit repeatedly. The function usevit was created for this purpose.

-In evaluation mode, usevit takes a training .sym2 file, estimates the states sequence with testvit, compares the results with the actual states and gives us the accuracy for each method (logarithmic and regular).
<p.hmm text.sym2 text.states> > usevit > <accuracy for marking methods>

-In production mode, usevit takes a case .sym2 file, estimates the states sequence with testvit, and produces the resulting noun phrases tagged file.
<p.hmm and text.sym2> > usevit > <text.np2>
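Once the state sequence has been decoded, producing the marked text amounts to translating states back into <NP>/<eNP> markers. A sketch of that last step, using the five states of section 2.2 (numbered 1-5 in that order); this is illustrative, not the actual usevit program:

```python
# State meanings (from section 2.2): 1 = start a noun phrase,
# 2 = end a noun phrase, 3 = end one and immediately start another,
# 4 = in a noun phrase, 5 = not in a noun phrase.
MARKS = {1: "<NP> ", 2: "<eNP> ", 3: "<eNP> <NP> "}

def mark(words, states):
    """states[t] is the state for words[t]; one extra trailing state
    closes a noun phrase that ends with the sentence."""
    pieces = [MARKS.get(state, "") + word
              for word, state in zip(words, states)]
    if len(states) > len(words) and states[-1] in (2, 3):
        pieces.append("<eNP>")
    return " ".join(pieces)
```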

4. Results and observations :

4.0. Evaluation of usevit on 25 files randomly taken from the PennTree bank:

Accuracy per file:

82.40 %   85.60 %   82.29 %   67.96 %   64.08 %
38.19 %   86.09 %   63.98 %   76.32 %   81.89 %
66.53 %   90.06 %   53.56 %   47.82 %   87.12 %
53.52 %   84.92 %   85.67 %   86.77 %   88.76 %
71.77 %   87.66 %   48.67 %

Average accuracy: 73.93 %

4.1. Size of training data

The PennTree bank is very large; to build the model that produced the above results, I used only the files contained in directories 20 to 24, for storage-quota reasons. However, I have tested my programs on smaller models (each built from a single directory), and the evaluation results are better with the larger model.
For example, evaluation on file wsj_0114.mrg gave 77% accuracy with a smaller model (instead of 90%), and evaluation on file wsj_0003.mrg gave 71% accuracy instead of 76%.
Therefore, a bigger model would likely give even better results, and reduce the number of texts where the accuracy is under 50%.
The size of the training data can be increased until the matrices converge, beyond which point the accuracy is no longer affected.

4.2. "esthmm" function.

In the University of Maryland software package, there is a function called esthmm which estimates a hidden Markov model with the Baum-Welch algorithm, given N, M, and the output sequence of the training data. I tried this function on small sets of training data and compared its results with those obtained from HMM. I noticed quite significant differences in the matrices A and B, and especially in the vector pi. In fact, since pi is the initial state distribution, it can assign nonzero probability to only two states: state 1 (start a noun phrase) and state 5 (not in a noun phrase). The other states all have initial probability zero ("end a noun phrase", "end a noun phrase and start a new one", and "in a noun phrase" are impossible if no noun phrase has been started). However, esthmm would sometimes give a probability close to 1 to one of these three impossible states. Moreover, when I tried to build a large model with esthmm, the function would run for a very long time with no result at all. These problems may stem from a misunderstanding of how to use the function, but I decided to use HMM to build the Markov model used in the subsequent steps.
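This failure mode is easy to detect automatically. A sketch of such a sanity check, assuming pi is a list indexed 0..4 over the five states in the order listed in section 2.2 (the check itself is my own, not part of either package):

```python
def pi_is_plausible(pi, tol=1e-3):
    """Check that the initial distribution puts (almost) no mass on the
    three states that cannot begin a sentence: 'end a noun phrase',
    'end and start a new one', and 'in a noun phrase' (indices 1-3
    when the five states of section 2.2 are 0-indexed)."""
    return pi[1] < tol and pi[2] < tol and pi[3] < tol
```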

4.3. Texts marked with "usevit"

I have tried to mark the noun phrases in a few texts using usevit in production mode, and I observed that most of the errors come from noun phrases that were not marked, rather than from parts of the text wrongly marked as noun phrases. Therefore, if we were to use the noun phrases detected by usevit to build an index, we would get a slightly incomplete one, but it would still be useful for determining which topics the text deals with.

5. References:

-Package UMDHMM version 1.02 produced by T. Kanungo at the University of Maryland, available at the following address:

- CSC401 lecture notes, slides for topics 6 and 8, by Prof. Hirst at the University of Toronto.

I would like to thank Prof. Hirst for his help throughout this project.
April 2001