XML Data Repository

UW CSE - UW Database Group - Dan Suciu


The XML Data Repository collects publicly available datasets in XML form and provides statistics on them for use in research experiments. Whenever possible, DTDs for the datasets are included and the datasets are validated. Some of the datasets are large, and each is provided in compressed form using gzip and XMILL. Dataset statistics were computed using the XML Toolkit.
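As a rough illustration of working with a gzip-compressed dataset, the sketch below streams an XML file and reports two simple statistics: the number of elements and the maximum nesting depth. It is not the XML Toolkit's method, and it does not handle the XMILL format. The function names `xml_stats` and `gzip_xml_stats` are hypothetical, chosen only for this example, and it assumes a Python standard library environment.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def xml_stats(stream):
    """Return (element_count, max_depth) for an XML byte stream.

    Hypothetical helper for illustration; not part of the XML Toolkit.
    """
    count = 0
    depth = 0
    max_depth = 0
    # iterparse reads the stream incrementally instead of loading it whole.
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            count += 1
            depth += 1
            max_depth = max(max_depth, depth)
        else:
            depth -= 1
            # Drop the finished element's text and children to limit memory use.
            elem.clear()
    return count, max_depth


def gzip_xml_stats(path):
    """Compute the same statistics for a gzip-compressed XML file at `path`."""
    with gzip.open(path, "rb") as f:
        return xml_stats(f)
```

For example, `xml_stats(io.BytesIO(b"<a><b><c/></b><b/></a>"))` returns `(4, 3)`: four elements, with `c` nested three levels deep. For a downloaded dataset, `gzip_xml_stats` would be called with the local path of the `.gz` file.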

Detailed View of Datasets

Repository Contents

Name | Description | Uncompressed Size | Date
Protein Sequence Database | Integrated collection of functionally annotated protein sequences. | 683 MB | Nov 9 2001
SwissProt | SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotations (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases. | 109 MB | 1998
Auction Data | Auction data converted to XML from web sources. | 23 KB | 2001
DBLP Computer Science Bibliography | The DBLP server provides bibliographic information on major computer science journals and proceedings. DBLP stands for Digital Bibliography Library Project. | 127 MB | Oct 2002
University Courses | Course data derived from university websites. | 277 KB | 1999
Nasa | Datasets converted from legacy flat-file format into XML and made available to the public. | 23 MB | 2001
SIGMOD Record | Index of articles from SIGMOD Record. | 467 KB | 2001
TPC-H Relational Database Benchmark | TPC-H Benchmark, 10 MB version, in XML form. Converted to XML by Zack Ives. | 603 KB | 2002
Treebank (partially encrypted) | English sentences, tagged with parts of speech. The text nodes have been encrypted because they are copyrighted text from the Wall Street Journal. Nevertheless, the deep recursive structure of this data makes it an interesting case for experiments. | 82 MB | added Nov 2002
Mondial | World geographic database integrated from the CIA World Factbook, the International Atlas, and the TERRA database, among other sources. | 1 MB | 2002

Other Resources for XML Datasets

The XML Data Repository is maintained by Gerome Miklau. Please send comments, suggestions, or dataset submissions to xmldatasets@cs.washington.edu.
