Home
Products
Download
Contacts
UstLF
UltraGram
News
Latest Releases
UstLF
2.0.34
UltraGram
6.0.62
Working with UltraGram
In this article we will examine some common problems that parser developer can face and the ways to handle them using
UltraGram
.
Brief overview
To create a working parser code with
UltraGram
you need to do the following:
Create a new grammar file
Compile it and remove all design type errors
Open or create some number of test files and insure that your grammar works
Generate the parser source code from your grammar targeting one of the following programming languages:
C++, C#, VB.NET, Java
The
UltraGram
grammar file has the following sections:
%pragma
,
precedence
,
%tokens
and
%production
. In the
%pragma
section different directives are listed that affect the behaviour of the parser. They are related generally to the error handling and conflict resolution (in more details described below). In the
%tokens
section a "terminal symbols" are defined ( also known as a "token type" ) . Each symbol represents a class of syntactically equivalent tokens. You use the symbol in grammar rules to indicate that a token in that class is allowed. Each symbol has the following format:
'expression'
[ identifier, [
'alias'
]] [
%ignore
] ;
Here:
expression
defines the rule by which the token in the input text will be recognized
identifier
is a unique name of the symbol
alias
is an alias of this symbol
%ignore
directive instructs to skip the token
Examples:
'[ \t\r\n\f\b]+'
%ignore
;
'[a-z_A-Z0-9][a-z_A-Z0-9]*'
Id,
'Identifier'
;
'[0-9]+'
Integer,
'Int'
;
'([0-9]+\.[0-9]*)|(\.[0-9]+)'
Real,
'Real'
;
'\"[^\"]*\"'
String,
'Str'
;
You may refer to each symbol or name or by alias enclosed in quotes.
In the optional precedence section declarations for operator precedence allow you to specify when to shift and when to reduce. Use the
%left
,
%right
or
%nonassoc
declaration to declare a token and specify its precedence and associativity, all at once.
In the
%production
section grammar rules are defined. Each rule has the following format:
name : production_items ;
where
name
is the nonterminal symbol that this rule describes and
production_items
is an optional list of terminal or nonterminal symbol related to this rule and separated by a whitespace. Example:
exp : exp
'-'
exp ;
This rule defines that two groupings of type
exp
, with a
'-'
token in between, can be combined into a larger grouping of type
exp
. Multiple rules for the same name can be written separately or can be joined with the vertical-bar character
|
as follows:
name: production_items1 | production_items2... ;
Example:
exp : exp
'-'
exp | exp
'+'
exp ;
If rule
production_items
in a rule is empty, it means that name can match the empty string. Here is how to define a comma-separated sequence of zero or more
exp
groupings:
exp_opt :
// empty rule
| expseq ;
expseq : exp
| expseq
','
exp ;
The entry point ( main rule ) is defined by a name that follows the
%production
keyword. There must always be a rule with the name that corresponds to it.
UltraGram IDE
allows you to compile and extensively test and debug your grammar. You can define a set of test files, run then individually or in a batch mode, set breakpoints, render graphical representation of some errors, create a complete
DFA
graph,...
Let’s begin
Lets skip the detailed
IDE
description (just download and take a look, it’s really simple) and focus on the parsing issues that are common for a general parser developer and solutions that
UltraGram
offers. We will do this review on several examples. All examples are very simple and deliberately constructed to
highlight common problems
a developer may run into. Some knowledge of parsing theory and experience will be helpful.
Token-Token conflicts
Let’s say you plan to create a parser that should dial with a file containing list of employee names separated by semicolon. Like that:
John Smith ;
Allan Brown ;
Bret Voisen ;
Simple grammar for this case will look as following:
// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );
// allow inline token declarations
// ---- token declaration section ( terminal symbols ) ---------------
%tokens
'[ \n\t\r]+'
wspace,
%ignore
;
'[a-z_A-Z][a-z_A-Z]*'
FirstName;
'[a-z_A-Z][a-z_A-Z]*'
LastName;
// ---- production section -------------------------------------------
%production
Start
Start : ItemList;
ItemList : Item | ItemList Item ;
Item : FirstName LastName
';'
;
This grammar works fine for our input file. Now we will complicate things a bit. Let’s say some of the names may contain additional info in front so the test file might look like that:
John Smith ;
Allan Brown ;
Doctor
Bret Voisen ;
Now we have to modify our grammar. The most natural approach will result in the following:
// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );
// allow inline token declarations
// ---- token declaration section ( terminal symbols ) ---------------
%tokens
'[ \n\t\r]+'
wspace,
%ignore
;
'Doctor'
Info;
'[a-z_A-Z][a-z_A-Z]*'
FirstName;
'[a-z_A-Z][a-z_A-Z]*'
LastName;
// ---- production section -------------------------------------------
%production
Start
Start : ItemList;
ItemList : Item | ItemList Item ;
Item : FirstName LastName
';'
| Info FirstName LastName
';'
;
Grammar complies nicely but when you start to parse the test file it fails. This is result of a
Token-Token
conflict. What happens exactly? When parser reaches some state it looks for the lookahead tokens from some set of tokens. If it finds one it takes an appropriate action. Look at the third line of the input file:
Doctor
Bret Voisen ;
The first word (
"Doctor"
) matches the token definitions for
Info
and for the
FirstName
. In
UltraGram
tokens that are declared last have higher priority (other parsers can have this differently). According to this the word
"Doctor"
is treated as a token
FirstName
, after which comes
LastName
and semicolon in the rule:
Item : FirstName LastName
';'
;
But that is not the case! We expect here the rule:
Item : Info FirstName LastName
';'
;
So, what are our options?
The first option is to remove
Token-Token
conflict by not allowing to declare conflicting tokens. Some parsers do exactly this. The error is razed in this case. Is this correct? Well, in ideal world - yes. In reality there are many cases when conflicting tokens are declared and will be declared. Some parsers raise an error in that case. Other parsers allow any token declarations and do not detect token conflicts at all. As result there are chances that your grammar will compile but will not work against some test files and you most likely will get no explanations why.
The second option is to change the priority of the tokens ( we need to have the token
Info
set to the highest priority ). If you realize that you have conflicting tokens you may change declarations order for example like this:
'[a-z_A-Z][a-z_A-Z]*'
FirstName;
'[a-z_A-Z][a-z_A-Z]*'
LastName;
'Doctor'
Info;
This will change token priorities. But will it help? In our sample - no. You will run into the problem again but at different location. Frustrating isn’t it? Especially when your grammar is much more complicated then this primitive sample. Now let’s see how this can be handled by
UltraGram
. First of all
UltraGram
always detects token conflicts, reports them and may optionally stop at the conflicting locations. Also there are several option to dial with the situation. One of them is to enable special processing by using pragma
"dkey( on | off )"
. This stands for “
d
etect
key
words”. When this option is turned on
UltraGram
analyzes token declarations and filters out declarations that look like keywords in a programming language ( for example consist from low or upper case chars and digits, have fixed length and no "wildcards" ). In our sample there is only one token declaration of this kind:
'Doctor'
Info;
The grammar with the above mentioned pragma will look like this:
// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );
// allow inline token declarations
dkey( on );
// detect keywords mode is turned ON
// ---- token declaration section ( terminal symbols ) --------------
%tokens
'[ \n\t\r]+'
wspace,
%ignore
;
'Doctor'
Info;
'[a-z_A-Z][a-z_A-Z]*'
FirstName;
'[a-z_A-Z][a-z_A-Z]*'
LastName;
// ---- production section -------------------------------------------
%production
Start
Start : ItemList;
ItemList : Item | ItemList Item ;
Item : FirstName LastName
';'
| Info FirstName LastName
';'
;
We added only one line:
dkey( on );
. Now when parsing starts tokens are examined and the highest priority is given to the "keyword" tokens. They are searched first. If one of them will be found - appropriate rule will be reduced. If none of them are found then the rest of the lookahead set is examined and appropriate action is taken. The problem is fixed. This is a good reliable solution for many cases.
Now let’s examine some more complicated situations. Let’s say we need to parse the following text:
John Smith ;
Allan Brown ;
Doctor Bret Voisen ;
Engineer Rendy Adams ;
For this case our token
Info
should be more complicated. Let’s make it handle any word. This will result in the following:
// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );
// allow inline token declarations
// ---- token declaration section ( terminal symbols ) --------------
%tokens
'[ \n\t\r]+'
wspace,
%ignore
;
'[a-z_A-Z][a-z_A-Z]*'
Info;
'[a-z_A-Z][a-z_A-Z]*'
FirstName;
'[a-z_A-Z][a-z_A-Z]*'
LastName;
// ---- production section -------------------------------------------
%production
Start
Start : ItemList;
ItemList : Item | ItemList Item ;
Item : FirstName LastName
';'
| Info FirstName LastName
';'
;
Obviously we don’t have "keywords" here and pragma
"dkey"
will be useless.
UltraGram
offers another pragma
"rtc (on|off,<level>)"
( this stands for
r
esolve
t
oken
c
onflicts ). The value
<level>
can be in a range from 1 to 10 . When this pragma is turned on and conflicting tokens are encountered a special analyzing process kicks in that allows to select a correct token. The reliability of the process can be tuned by the
<level>
value. This value to some extend represents how far in the text parser will look to insure whether the token is correct or not. This is more powerful solution then the usage of the
dkey
pragma but requires some extra processing. After this change pragma section of our grammar will look like:
// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );
// allow inline token declarations
rtc( on, 4 );
// resolve token conflicts mode is turned ON
Now test file can be parsed successfully.
Shift-Reduce and Reduce-Reduce conflicts
Let’s go back and see what happens when we try to eliminate conflicting tokens. Let's rewrite our grammar to the following:
// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );
// allow inline token declarations
// ---- token declaration section ( terminal symbols ) --------------
%tokens
'[ \n\t\r]+'
wspace,
%ignore
;
'[a-z_A-Z][a-z_A-Z]*'
Identifier;
// ---- production section -------------------------------------------
%production
Start
Start : ItemList;
ItemList : Item | ItemList Item ;
Item : FirstName LastName
';'
| Info FirstName LastName
';'
;
FirstName : Identifier;
LastName : Identifier;
Info : Identifier;
As you can see we have here only one token
Identifier
. Everything else is built on it. Our change looks natural and simple. Compile grammar ... and get a
Reduce-Reduce
conflict !
Shift-Reduce
and
Reduce-Reduce
conflicts are common. They happen quite often and ideally they should be eliminated. But how? One way is to modify/rewrite your grammar. This can be easily done for our sample. But it is not the goal of this article. Modifying grammar may help in many cases but sometimes changing some grammar rule may result in a message about conflict that happens in the place that seems to be completely unrelated to your modifications. This may be very confusing. Fortunately
UltraGram
gives you a clear graphical representation of events driving this conflict. Click on the conflict message and you will get:
This is a small fragment of the DFA graph. Here parsing starts at Node 0. From this point it is possible to find two identical paths to Node 5 (consuming token
Identifier
). In the Node 5 - if the next lookahead token is
Identifier
we can reduce two rules:
Info : Identifier ;
and
FirstName : Identifier ;
So which one should be reduced? We need only one. This is a
Reduce-Reduce
conflict (in similar way happen
Shift-Reduce
conflicts). This graph is very helpful when grammars are more complicated and the chain of steps that lead to the conflict is longer. Anyway, which path should be chosen? And how? What if all your attempts to resolve conflict by modifying grammar failed and conflict still exist? For this case
UltraGram
offers couple options. First one is to set priority to some conflicting rules using
%prec
directive. This directive is an old one, it has some history. To keep it short we will skip details. Just note, that if for example you modify grammar in the following way
// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );
// allow inline token declarations
// ---- token declaration section ( terminal symbols ) --------------
%tokens
'[ \n\t\r]+'
wspace,
%ignore
;
'[a-z_A-Z][a-z_A-Z]*'
Identifier;
%left
Low;
%left
High;
// ---- production section -------------------------------------------
%production
Start
Start : ItemList;
ItemList : Item | ItemList Item ;
Item : FirstName LastName
';'
| Info FirstName LastName
';'
;
FirstName : Identifier
%prec
Low ;
// has low priority
LastName : Identifier;
Info : Identifier
%prec
High ;
// has high priority
conflict will be gone. Here we simply informed parser that rule
Info
has higher priority then the rule
FirstName
. As result conflict is resolved. In some cases this conflict resolution technique works fine. Unfortunately for our sample this is not good enough. Parsing will fail. Let’s take a look at the second solution. It is called
GLR
. Again – we will skip the theory. In short – parser will choose from all conflicting paths a successful path if any. Let’s look at the grammar:
// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );
// allow inline token declarations
glr( on ) ;
// GLR mode is turned ON
// ---- token declaration section ( terminal symbols ) --------------
%tokens
'[ \n\t\r]+'
wspace,
%ignore
;
'[a-z_A-Z][a-z_A-Z]*'
Identifier;
// ---- production section -------------------------------------------
%production
Start
Start : ItemList;
ItemList : Item | ItemList Item ;
Item : FirstName LastName
';'
| Info FirstName LastName
';'
;
FirstName : Identifier ;
LastName : Identifier ;
Info : Identifier ;
GLR
mode is enabled by pragma
"glr (on|off)"
. Necessary to note that
GLR
requires additional processing. However if you are not designing "real-time" software, your files are not very large and computers are fast ( nowdays ), in many cases this is an acceptable solution and you will not notice any significant differences in parsing time. Just give it a try to understand whether it is acceptable for you.
Parsing Algorithms
Necessary to mention that
GLR
can be applied not only to the
LALR(1)
parsing algorithm, but to
LR(1)
algorithm as well.
UltraGram
supports
LR(1)
. This is about the way parsing tables are calculated. In general
LR(1)
tables are more "robust" comparing to
LALR(1)
but the size of the tables may be enormous. So the common practice is to use
LALR
. The table calculation algorithm can be selected by setting pragma
"alg( lalr | lr1 )"
. Also keep in mind that in some cases
Shift-Reduce
and
Reduce-Reduce
conflicts that exist in
LALR(1)
do not exist in
LR(1)
. So when you run into this issue try different techniques to resolve them.
Parsing unformatted text
Sometimes there are cases when it is necessary to parse only some special sections of a file and do not care or do not know format of other sections. Let’s look at the sample file first:
John Smith ;
1 2 3 4 5 ;
Doctor Bret Voisen ;
Engineer Rendy Adams ;
agent 007 ;
What means "1 2 3 4 5" or "agent 007"? This does not correspond to the grammar. Can file contain other strange entries separated by semicolon? May be… So what are the options? One of the options is to figure out exact format of the file and create grammar for it. But sometimes this can be a complicated and even impossible thing to do.
UltraGram
offers another option – a
%text
token. Think about it as about a token with unknown (or don’t care) content. It makes things simple. Here is our revised grammar:
// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );
// allow inline token declarations
glr( on ) ;
// GLR mode is turned ON
// ---- token declaration section ( terminal symbols ) --------------
%tokens
'[ \n\t\r]+'
wspace,
%ignore
;
'[a-z_A-Z][a-z_A-Z]*'
Identifier;
// ---- production section -------------------------------------------
%production
Start
Start : ItemList;
ItemList : Item | ItemList Item ;
Item : FirstName LastName
';'
| Info FirstName LastName
';'
|
%text
';'
;
FirstName : Identifier ;
LastName : Identifier ;
Info : Identifier ;
After this change all unknown entries of the file will be treated as
%text
tokens and parser will succeed.
Error handling
Files that are parsed very often contain different errors. As result error handling is quite important part of the parser core.
UltraGram
supports two error handling modes:
Typo errors mode
Generic errors mode
These modes are different and may be used together complementing each other.
Typographical error handling
This kind of errors are often created during text typing and include duplication, omission, transposition or substitution of characters. To illustrate how they can be handled lets return back to the following grammar:
// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );
// allow inline token declarations
dkey( on );
// detect keywords mode is turned ON
// ---- token declaration section ( terminal symbols ) --------------
%tokens
'[ \n\t\r]+'
wspace,
%ignore
;
'Doctor'
Info;
'[a-z_A-Z][a-z_A-Z]*'
FirstName;
'[a-z_A-Z][a-z_A-Z]*'
LastName;
// ---- production section -------------------------------------------
%production
Start
Start : ItemList;
ItemList : Item | ItemList Item ;
Item : FirstName LastName
';'
| Info FirstName LastName
';'
;
This grammar perfectly parses the following file:
John Smith ;
Allan Brown ;
Doctor Bret Voisen ;
Now what happens if we add some error entries so our test file will look for example like that:
John Smith ;
Allan Brown ;
Dotcor
Bret Voisen ;
Dotor
David Duffy ;
Dooctor
Jerry Duncan ;
Docto
Jeff Springer ;
These are very common types of errors that mainly happen during manual typing. One way of handling them is to create long and complicated token declarations that will handle all possible error cases. However this approach is not practical, since there may be too many of combinations that should be considered. Fortunately
UltraGram
has a build in mechanism to handle this types of errors. Look at the following modified grammar:
// ---- optional pragma section --------------------------------------
%pragma
intok( on, off );
// allow inline token declarations
dkey( on );
// detect keywords mode is turned ON
// ---- token declaration section ( terminal symbols ) --------------
%tokens
'[ \n\t\r]+'
wspace,
%ignore
;
'[a-zA-Z]'
MyCharSet;
'Doctor'
Info,
%chars
MyCharSet;
'[a-z_A-Z][a-z_A-Z]*'
FirstName;
'[a-z_A-Z][a-z_A-Z]*'
LastName;
// ---- production section -------------------------------------------
%production
Start
Start : ItemList;
ItemList : Item | ItemList Item ;
Item : FirstName LastName
';'
| Info FirstName LastName
';'
;
The difference with the previous grammar is only in two lines:
'[a-zA-Z]'
MyCharSet;
'Doctor'
Info,
%chars
MyCharSet;
The first line is a token declaration that defines a set of characters that might be accidently injected or substitute a character in the input text. The next line is the token declaration that has char set
MyCharSet
attached to it. This tells the parser to enable error correction on this token and use the char set to validate all unexpected characters. Note, this also enables error correction for missing characters or swapped characters ( transposition ). With this change out test file with errors will be parsed successfully. One defined char set may be used with multiple token declarations, however it is possible to have multiple char sets. Also It is possible to enable this type of error correction for inline tokes ( that do not have explicit token declaration ) as well. Last thing to mention is that this type of error correction is applicable only for token that have well defined pattern for which corrections make sense. Token declarations like
FirstName
or
LastName
in the sample above cannot have this type or error correction since it is impossible to predict correct input.
Generic errors
If typo error handling is not enough and some other errors are expected in the input, the
%error
token can be used. In the current release of
UltraGram
the usage and behaviour of this token is similar to the
%text
token. However there are some differences. First of all
%error
token has the lowest priority and its processing will start only after all other options will fail. The second and in fact the major difference is that
%error
token does not consider
%ignore
tokens. This means that in case of
%error
token parser expects the worst scenario ( like damaged file ) and will not make any difference between normal formatted text or some comments residing in it. Parser will just try to skip one char after another ( and assign them to the
%error
token ) until parsing of the input can be successfully resumed. This behaviour allows to handle errors in grammars that already utilize
%text
tokens (however this kind of mix is rare and is used in really complicated cases). Here is the fragment of the above grammar with generic error handling:
Item : FirstName LastName
';'
| Info FirstName LastName
';'
|
%error
';'
;
There are other useful features in
UltraGram
including full
DFA
graph with lookaheads,
UNICODE
support, batch file parsing,… Give it a try.