Project Proposal
Participants:
Cartic Ramakrishnan
Christopher Thomas
Boanerges Aleman-Meza
Dmitri Kolychev
Aim:
Automated taxonomy generation as a first step towards the harder problem of automated Ontology Learning. We propose to develop a completely tunable system that will enable the knowledge engineer to generate a taxonomy from unstructured documents pertaining to a particular domain. This system will be developed for the medical domain.
Design:
We propose to design this system as an experiment. The purpose of this experiment is to determine the set of techniques and approaches (described later) that yield the best taxonomy. In this paper we will define what we mean by the best taxonomy.
The basic idea is to use a combination of Natural Language Processing and Statistical Clustering techniques to extract from unstructured text a taxonomy of concepts. The proposed system architecture is shown below in Figure 1.
As shown in the architecture the dataset that is used for the taxonomy generation process is obtained from a database of indexed medical journal articles called PubMed. PubMed is a database of developed at the National Library of Medicine (NLM). A query interface is available to this database from which articles are retrieved using a keyword search. In addition to PubMed our system will also use the MeSH hierarchy also developed at NLM. This hierarchy consists of a forest of 6 trees in which various terms in the medical domain have been arranged by knowledge engineers. This MeSH hierarchy will serve as the goal standard. We plan to compare the taxonomy we generate with the appropriate portion of the MeSH taxonomy. This will give us a means of evaluating the performance of our system. For our purposes we have delineated a part of MeSH and will use the terms in this part of MeSH as keywords to retrieve documents from PubMed.
This set of documents will then be sent to the Natural Language Parser, PhraseX. PhraseX is a minimal commitment domain specific parser that has the ability extract from unstructured text noun phrases and verbs. These extracted noun phrases will be the features of the documents that will be used to cluster them. Having extracted the noun phrases from all the documents in the dataset we plan to use a document vector generation utility called SMART that generates the document vectors. The next step in the process is to cluster these document vectors based on some similarity measures. We plan to implement a whole library of similarity measures. Since the system will be completely tunable our experiments with the taxonomy extraction process will measure the record the quality of the taxonomy generated with each of the similarity measures. This tool will hence allow the knowledge engineer to adapt the process that suits his/her needs the best.

Figure 1
The clustering process itself is a combination of 2 means and K means being applied iteratively to the document vectors. This process will generate a history of cluster memberships. At each step iteration (iteration is defined as applying 2means and kMeans once) the clustering algorithm will attempt to discover one new cluster. This process will continue until such time as no new clusters are discovered or no document vector jumps from one cluster to another. During the process of clustering the algorithm will compare 2 successive levels to make sure that a vector previously classified as belonging to cluster p at level l does not now belong to a cluster that is not a child of p. If such a situation is detected it is interpreted as a mistake in the clustering at the level l-1.
Once the clustering process stops the human assisted step in the taxonomy extraction process is invoked. In this process for each cluster in the hierarchy of clusters the top k terms contributing to the centroids of that cluster will be extracted and stored as the labels of the “central concept” of each cluster of documents. The knowledge engineer will now be presented with the labels for one cluster at a time starting from the root node in the hierarchy of clusters. If the user feels that a particular child node of the current node being shown is indeed a child concept of the current concept then that node and its labels are retained as a child of the current concept. Each node has a cohesiveness measure associated with it. This measure for the newly selected child node will be use to plot a locus through the hierarchy of clusters and all nodes on that locus will become siblings of the new child node. This process will result in the taxonomy of concepts. Figure 2 shows the process diagrammatically.

Figure 2
The tree in figure 2 shows the taxonomy extraction process being applied to the hierarchy of clusters obtained from the clustering process. Figure 3 shows the result of that process.

Figure 3