|
A formal grammar for parsing GEDCOM 5.5x files
The following LALR(1) grammar can be used to parse GEDCOM 5.5x files
generated by a wide variety of programs. The grammar returns pointer strings
with their delimiters intact, so the prefix and suffix characters ('@') must
be trimmed manually wherever pointers are used. Likewise, escape sequences
embedded in value strings must be extracted manually.
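As a rough illustration of this manual post-processing, the Python sketch below trims the '@' delimiters from a pointer string and pulls escape sequences out of a value string. The helper names `trim_pointer` and `extract_escapes` are hypothetical. The sketch assumes escape sequences have the form `@#...@` and that `@@` stands for a literal '@'; adjust the pattern if the files at hand follow another convention.

```python
import re
from typing import List, Tuple

# Assumption: an escape sequence looks like "@#TEXT@" (optionally followed
# by one space), and "@@" encodes a literal '@'.
_AT_RE = re.compile(r"@@|@#([^@\n]*)@ ?")


def trim_pointer(pointer: str) -> str:
    """Remove the leading and trailing '@' from a POINTER token."""
    if len(pointer) >= 3 and pointer.startswith("@") and pointer.endswith("@"):
        return pointer[1:-1]
    raise ValueError(f"not a pointer string: {pointer!r}")


def extract_escapes(value: str) -> Tuple[List[str], str]:
    """Return (escape texts, value with escapes removed and '@@' unescaped)."""
    escapes: List[str] = []

    def replace(match: "re.Match[str]") -> str:
        if match.group(0) == "@@":
            return "@"                     # doubled '@' becomes a literal '@'
        escapes.append(match.group(1))     # remember the escape text
        return ""                          # and drop it from the value

    text = _AT_RE.sub(replace, value)
    return escapes, text
```

For example, `trim_pointer("@I1@")` yields `"I1"`, and `extract_escapes("@#DJULIAN@ 1 JAN 1700")` separates the escape text `DJULIAN` from the remaining date string.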
- ALNUM: // an alphanumeric character
- [a-z_A-Z0-9]
-
- WHITE: // any whitespace character
- [ \t\v\b\f\a]
-
- ANY: // any character except newline
- [^\n]
-
- POINTER: // a pointer string
- @ALNUM[^@]*@
-
- BOM8: // a UTF-8 byte order mark
- \xEF\xBB\xBF
-
- UBOM: // a Unicode (UTF-16 little-endian) byte order mark
- \xFF\xFE
-
- NEWLINE:
- \n
-
- SPACE:
- ' ' // single space character (0x20)
-
- TAG:
- ALNUM+
-
- LEVEL:
- 0|([1-9][0-9]*)
-
- VALUE:
- ANY+
-
- %ignore%:
- BOM8 // UTF-8 byte order mark
- UBOM // Unicode byte order mark
- WHITE* // leading whitespace
-
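The effect of the %ignore% rule can be approximated outside a parser generator. The Python sketch below is one possible reading, assuming the ignored tokens are only dropped at the start of a line; the function name `strip_ignored` is hypothetical. It works on raw bytes and only models the rule itself, so a file that really starts with the UTF-16 mark would still need to be decoded as UTF-16 afterwards.

```python
BOM8 = b"\xef\xbb\xbf"      # BOM8 token: UTF-8 byte order mark
UBOM = b"\xff\xfe"          # UBOM token: Unicode byte order mark
WHITE = b" \t\v\b\f\a"      # the bytes listed in the WHITE character class


def strip_ignored(raw_line: bytes) -> bytes:
    """Drop a leading byte order mark, then any leading WHITE bytes."""
    if raw_line.startswith(BOM8):
        raw_line = raw_line[len(BOM8):]
    elif raw_line.startswith(UBOM):
        raw_line = raw_line[len(UBOM):]
    # bytes.lstrip removes any run of the given byte values from the left.
    return raw_line.lstrip(WHITE)
```

Note that WHITE does not contain a carriage return, so lines from files with CRLF line endings keep a trailing `\r` unless it is removed separately.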
- file:
- line
- file line
-
- line:
- LEVEL SPACE TAG
- LEVEL SPACE TAG line_value NEWLINE
- LEVEL SPACE POINTER SPACE TAG
- LEVEL SPACE POINTER SPACE TAG line_value NEWLINE
-
- line_value:
- VALUE
- POINTER
-
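To show what the file, line and line_value rules accept, here is a minimal Python sketch that recognizes the same line shapes with a regular expression. It is a hand-written stand-in, not a generated LALR(1) parser. The names `Line`, `parse_line` and `parse_file` are hypothetical, and the sketch assumes that a single space separating TAG from its value is not part of the value and that empty lines may be skipped.

```python
import re
from typing import List, NamedTuple, Optional

LINE_RE = re.compile(
    r"(?P<level>0|[1-9][0-9]*)"                   # LEVEL
    r" (?:(?P<pointer>@[a-zA-Z_0-9][^@]*@) )?"    # SPACE, optional POINTER SPACE
    r"(?P<tag>[a-zA-Z_0-9]+)"                     # TAG
    r"(?P<value>[^\n]+)?"                         # optional line_value
)


class Line(NamedTuple):
    level: int
    pointer: Optional[str]   # still wrapped in '@', as the grammar delivers it
    tag: str
    value: Optional[str]


def parse_line(text: str) -> Line:
    """Parse one line (without its trailing newline) into a Line record."""
    match = LINE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid GEDCOM line: {text!r}")
    value = match.group("value")
    if value is not None and value.startswith(" "):
        value = value[1:]    # assumption: drop the delimiter space
    return Line(int(match.group("level")), match.group("pointer"),
                match.group("tag"), value)


def parse_file(text: str) -> List[Line]:
    """Apply parse_line to every non-empty line, mirroring 'file: file line'."""
    return [parse_line(raw) for raw in text.split("\n") if raw]
```

Calling `parse_file("0 @I1@ INDI\n1 NAME John /Doe/\n0 TRLR\n")` returns three records; the first carries the pointer `@I1@`, which can then be trimmed by hand as described at the top of this page.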
|