Conversational bots/Natural language processing

From Wikibooks, open books for an open world
Jump to navigation Jump to search

The echo bot presented in the previous section processes the text inputted by the user as a whole. There are natural language processing (NLP) tools for English and other languages that tokenise and analyse sentences but the iBot app relieves us from having to use them on the server because it carries out this task on the user's device and sends the results to the server. This is a useful function, if only for the fact that you don't have to burden the server with additional code.

The iBot app delivers to the server a list of tokens (in the tokens attribute). For example, the sentence I don't like vegetables is tokenised as [I,do,n't,like,vegetables]. More importantly, a complete morphological analysis is delivered in the chart attribute. Human languages are ambiguous, that is, a word might have several meanings. For example, the word book can be a noun or a verb, as in I bought a book and I want to book a hotel respectively. The morphological analyser delivers a list of all the possible meanings in the form of a table (so-called chart) to a parser whose task it is to analyse the text syntactically. One of the by-products of syntactic analysis is the morphological disambiguation of the text.

In what follows we'll implement a chart parser aimed at building syntactic structures of sentences. Chart parsing was invented by Alain Colmerauer and later improved by Martin Kay and others. It is an algorithm that makes use of dynamic programming, thanks to which the time complexity of (context-free) parsing is only .

To construct syntactic structures we'll use context-free rules and feature structures. On the left side of a rule is a symbol (so-called nonterminal) that represents a phrase. On the right side is a list of symbols that represent the phrases out of which the bigger phrase is composed. Feature structures are attribute-value matrices (AVM) that contain morphological and syntactic properties of words and phrases. For example, the following rule represents a verb phrase (VP) with an optional noun phrase (NP) serving as its direct object:

The equations beneath the symbols on the right are used to build up the feature structure of the phrase. stands for the feature structure of the mother node and for that of the daughter node. In the rule given above, states that the feature structure of V unifies with that of VP and states that the feature structure of the NP unifies with the dobj (direct object) attribute of the mother node's feature structure. For example, for the verb phrase reading books whose syntax tree is , we would get the following feature structure:

The lexico-semantic information (FUNC) and the additional attributes of the individual words are provided by the lexicon.

A sample grammar[edit | edit source]

You can implement grammars using the parser downloadable (for OS X and Linux) from here. Below is a simple grammar:

// the girl reads a book

0 Det ^det=1 .
1 N	^func=girl ^person?3.
2 V	^func=read ^vform=fin ^tense=pres.
3 Det ^det=0 .
4 N	^func=book ^person?3 .

NP -> Det? A* N .
	^=. ; ^adj+. ; ^=. ;
VP -> V NP? .
	^=. ^voice=act ; ^obj=. ;
S -> NP VP .
	^subj=. ; ^=. ^vform=fin ;

Note that you have to provide the initial chart, that is, there is no built-in morphological analyser. The lines beginning with numbers define the initial chart (analysed words of the input sentence). The other lines define context-free rules with annotations, whereby ^ means and . means . The operator = signifies equality (unification), the operator ? signifies a default value and + signifies set membership (). After saving the code to grammar.txt, run the parser using ./parser grammar.txt.