GEDCOM Data Analysis
A random selection of 50 user-contributed GEDCOM files was downloaded from
Ancestry.com and analyzed. The results are summarized below, and the
analysis reports for all files may be downloaded (original data not
included): download (40KB)
File Size
The smallest file contained 402 lines. The largest file contained
7,104,328 lines.
Product
| Name | No. of files |
| Family Tree Maker for Windows | 20 |
| Online Family Tree | 18 |
| Ancestry Family Tree | 4 |
| Personal Ancestral File | 3 |
| Legacy | 1 |
| Family Trees Quick & Easy | 1 |
| Roots Magic | 1 |
| EasyTree | 1 |
| gwb2ged | 1 |
This breakdown does not necessarily represent market share.
The figures may well be skewed because Ancestry.com produces or promotes
the most prevalent applications in this sample.
GEDCOM Version
| Version | No. of files |
| 5.5 | 31 |
| 4.0 | 18 |
| unspecified | 1 |
File Character Encoding
| Encoding | No. of files |
| ANSI | 39 |
| ANSEL | 6 |
| UTF-8 | 3 |
| ASCII | 1 |
| IBM-WINDOWS | 1 |
This result is a bit surprising. The only four
encodings supported by the GEDCOM specification are ANSEL, ASCII, UNICODE,
and UTF-8. Nevertheless, the majority of files (39 of 50) used the non-standard
encoding ANSI (not the same as ASCII). Interestingly, one file specified the
IBM-WINDOWS encoding, which is expressly forbidden by the GEDCOM specification.
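The product, GEDCOM version, and encoding tallied above are all declared in a file's header record. The following Python sketch shows one way such values might be read; it is not the tool used for this analysis. It assumes the usual header layout (`HEAD.SOUR` with an optional `NAME`, `HEAD.GEDC.VERS`, and `HEAD.CHAR`). The names `parse_line`, `read_header`, and `STANDARD_ENCODINGS` are illustrative, and the sample header values are made up.

```python
import re

# One GEDCOM line: level, optional @xref@, tag, optional value.
LINE_RE = re.compile(r"^(\d+)\s+(?:(@[^@]+@)\s+)?(\S+)(?: (.*))?$")

# The four encodings the text above lists as supported by the specification.
STANDARD_ENCODINGS = {"ANSEL", "ASCII", "UNICODE", "UTF-8"}


def parse_line(line):
    """Return (level, xref, tag, value), or None if the line is malformed."""
    match = LINE_RE.match(line.lstrip("\ufeff").strip())
    if match is None:
        return None
    level, xref, tag, value = match.groups()
    return int(level), xref, tag, value or ""


def read_header(lines):
    """Collect product, GEDCOM version, and encoding from the HEAD record."""
    info = {"product": None, "gedcom_version": None, "encoding": None}
    path = []          # tags from level 0 down to the current line
    in_head = False
    for raw in lines:
        parsed = parse_line(raw)
        if parsed is None:
            continue
        level, _xref, tag, value = parsed
        if level == 0:
            if in_head:
                break  # the header has ended
            in_head = tag == "HEAD"
            path = [tag]
            continue
        if not in_head:
            continue
        path = path[:level] + [tag]
        key = tuple(path)
        if key in (("HEAD", "SOUR"), ("HEAD", "SOUR", "NAME")):
            info["product"] = value  # NAME, when present, overrides the system ID
        elif key == ("HEAD", "GEDC", "VERS"):
            info["gedcom_version"] = value
        elif key == ("HEAD", "CHAR"):
            info["encoding"] = value
    return info


# Made-up sample header for illustration only.
SAMPLE = [
    "0 HEAD",
    "1 SOUR EXAMPLE_APP",
    "2 VERS 1.0",
    "2 NAME Example Genealogy App",
    "1 GEDC",
    "2 VERS 5.5",
    "1 CHAR ANSI",
    "0 @I1@ INDI",
]

header = read_header(SAMPLE)
print(header)
print("standard encoding:", header["encoding"] in STANDARD_ENCODINGS)
```

Tracking the full tag path matters here because `VERS` appears twice in a header: once under `SOUR` (the product version) and once under `GEDC` (the GEDCOM version).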
Non-Standard (Custom) Tags
Custom tags are permitted but discouraged by the
GEDCOM specification. Even so, a total of 53 custom tags were used in the
sample. Family Tree Maker used the most custom tags, whereas Online Family Tree
used none. Every custom tag began with the underscore character
as required by the spec. The longest custom tag was _ALT_BIRTH.
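Because custom tags are identified by their leading underscore, they are easy to collect mechanically. The sketch below is one possible approach, not the method used for this analysis. The function name `custom_tags` is illustrative, and `_EXAMPLE` is a hypothetical tag; only `_ALT_BIRTH` comes from the sample.

```python
from collections import Counter


def custom_tags(lines):
    """Count occurrences of each tag that begins with an underscore."""
    counts = Counter()
    for raw in lines:
        parts = raw.strip().split(None, 3)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # not a well-formed GEDCOM line
        # A record line has the form "level @xref@ TAG"; others "level TAG value".
        has_xref = parts[1].startswith("@") and len(parts) > 2
        tag = parts[2] if has_xref else parts[1]
        if tag.startswith("_"):
            counts[tag] += 1
    return counts


# Illustrative input; "_EXAMPLE" is a hypothetical custom tag.
SAMPLE = [
    "0 @I1@ INDI",
    "1 NAME John /Doe/",
    "1 _ALT_BIRTH",
    "2 DATE 1 JAN 1900",
    "1 _EXAMPLE first",
    "1 _EXAMPLE second",
]

counts = custom_tags(SAMPLE)
print(len(counts), "distinct custom tags")
print("longest:", max(counts, key=len))
```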
Non-ASCII and Non-Standard Characters
14 of the 50 files contained at least one non-ASCII
(i.e. non-English) character. 34 of the 50 files contained at least one
non-standard (improper?) character. Non-ASCII characters are perfectly valid,
but they give a rough indication of how much non-English text a file contains.
Non-standard characters are values that represent neither ASCII nor
ANSEL characters. One explanation for the existence of non-standard characters
may be the use of ANSI character encoding. The reason for non-standard
characters in ANSEL-encoded files is unknown.
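A minimal sketch of the non-ASCII check, assuming it is run on the raw bytes of each file: any byte value of 0x80 or above falls outside ASCII. The function name `non_ascii_bytes` is illustrative. The test for non-standard characters is not sketched here, because it would require the full ANSEL character table.

```python
def non_ascii_bytes(data: bytes):
    """Return (offset, byte value) for every byte outside the ASCII range."""
    return [(offset, value) for offset, value in enumerate(data) if value >= 0x80]


# Illustrative input: a name with two accented letters, encoded as UTF-8.
sample = "1 NAME José /García/\n".encode("utf-8")
hits = non_ascii_bytes(sample)
print(len(hits), "non-ASCII bytes found")
```

Note that in a UTF-8 file a single accented letter occupies more than one byte, so the number of hits exceeds the number of non-ASCII characters.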
Citations
A common complaint is that user-contributed GEDCOM files do not contain source
citations. Nevertheless, 27 of the 50 files in the sample do contain
citations, although their extent and consistency vary. As a rule, the
citations probably do not meet professional-quality standards, but the results
suggest that many users may be more conscientious than presumed.
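One way to test a file for citations is to look for `SOUR` lines below level 0 that are not part of the header, since `SOUR` in the header names the producing application rather than a source. This is an assumption about how citations could be detected, not a description of how the figures above were obtained, and `count_citations` is an illustrative name.

```python
def count_citations(lines):
    """Count SOUR lines below level 0, ignoring those inside the HEAD record."""
    count = 0
    record = None  # tag of the enclosing level-0 record
    for raw in lines:
        parts = raw.strip().split(None, 3)
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # not a well-formed GEDCOM line
        level = int(parts[0])
        has_xref = parts[1].startswith("@") and len(parts) > 2
        tag = parts[2] if has_xref else parts[1]
        if level == 0:
            record = tag
        elif tag == "SOUR" and record != "HEAD":
            count += 1
    return count


# Made-up sample for illustration only.
SAMPLE = [
    "0 HEAD",
    "1 SOUR EXAMPLE_APP",
    "0 @I1@ INDI",
    "1 BIRT",
    "2 DATE 1 JAN 1900",
    "2 SOUR @S1@",
    "1 SOUR @S1@",
    "0 @S1@ SOUR",
    "1 TITL Parish register",
    "0 TRLR",
]

print(count_citations(SAMPLE), "citations found")
```

A count like this says nothing about citation quality; it only shows whether a file cites sources at all.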