Can We Trust Web Page Metadata?

Anders Ardö

2010

This paper reports a statistical study of embedded metadata in a sample of more than 4 million HTML Web pages. The aim is to determine and quantify the validity of these metadata, and in particular to see whether they are trustworthy enough for determining the topic of a Web page. Datasets are collected by a Web crawler running as both a general and a focused crawler. The metadata fields “title,” “author,” “keywords,” “description,” and “language” are analyzed in detail, together with Dublin Core metadata. The study reveals problems with how metadata are created. Among the 75% of all Web pages that have interesting metadata, the field “language” is the most trustworthy. All other metadata fields show a high degree of duplication, which degrades their usefulness. The strict answer to the title question is no. However, the metadata contain a lot of meaningful and useful information, although it must be interpreted and used with care. The study provides statistics on the usage of metadata today and how it has changed o...
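
As a rough illustration of the kind of analysis the abstract describes, the following Python sketch extracts the named metadata fields from HTML pages and estimates how often a field's value is duplicated across pages. It is not the study's method: the names `MetaExtractor`, `extract_metadata`, and `duplication_rate` are hypothetical, the duplication measure (the share of pages whose value for a field also appears on at least one other page) is an assumption, and so is the choice to read “language” from a `<meta name="language">` element and Dublin Core from names starting with `dc.`.

```python
from collections import Counter
from html.parser import HTMLParser

# Field names taken from the abstract; how the study located them is not stated.
FIELDS = {"author", "keywords", "description", "language"}


class MetaExtractor(HTMLParser):
    """Collects <title> text and selected <meta name=... content=...> values."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = (attrs.get("content") or "").strip()
            # Assumption: Dublin Core fields use a "dc." name prefix.
            if content and (name in FIELDS or name.startswith("dc.")):
                self.meta[name] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.meta["title"] = self.meta.get("title", "") + data.strip()


def extract_metadata(html):
    """Return a dict of metadata field name -> value for one HTML page."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.meta


def duplication_rate(pages, field):
    """Share of pages having `field` whose value also occurs on another page."""
    values = [m[field] for m in map(extract_metadata, pages) if field in m]
    if not values:
        return 0.0
    counts = Counter(values)
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / len(values)
```

A high `duplication_rate` for a field such as “description” would suggest that its values are template text copied across pages rather than page-specific descriptions, which is the kind of problem the abstract reports.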
