XML Data Repository
UW CSE - UW Database Group - Dan Suciu
The XML Data Repository collects publicly available datasets in XML form and provides statistics on them for use in research experiments. Whenever possible, DTDs for the datasets are included, and the datasets are validated. Some of the datasets are large, so each is provided in compressed form using gzip and XMill. Dataset statistics were computed using the XML Toolkit.
Name | Description | Uncompressed Size | Date |
Protein Sequence Database | Integrated collection of functionally annotated protein sequences. | 683 MB | Nov 9 2001 |
SwissProt | SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases. | 109 MB | 1998 |
Auction Data | Auction data converted to XML from web sources. | 23 KB | 2001 |
DBLP Computer Science Bibliography | The DBLP server provides bibliographic information on major computer science journals and proceedings. DBLP stands for Digital Bibliography & Library Project. | 127 MB | Oct 2002 |
University Courses | Course data derived from university websites. | 277 KB | 1999 |
NASA | Datasets converted from legacy flat-file format into XML and made available to the public. | 23 MB | 2001 |
SIGMOD Record | Index of articles from SIGMOD Record. | 467 KB | 2001 |
TPC-H Relational Database Benchmark | TPC-H Benchmark, 10 MB version, in XML form. Converted to XML by Zack Ives. | 603 KB | 2002 |
Treebank (partially encrypted) | English sentences, tagged with parts of speech. The text nodes have been encrypted because they are copyrighted text from the Wall Street Journal. Nevertheless, the deep recursive structure of this data makes it an interesting case for experiments. | 82 MB | added Nov 2002 |
Mondial | World geographic database integrated from the CIA World Factbook, the International Atlas, and the TERRA database, among other sources. | 1 MB | 2002 |
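For readers who want to check basic structural statistics on a downloaded dataset themselves, the following is a minimal Python sketch. It is not the XML Toolkit used to compute the repository's statistics; it only assumes a gzip-compressed XML file and uses the standard library's streaming parser. The function name `xml_stats` and the file name in the usage note are hypothetical. Datasets that rely on entities declared in an external DTD may need extra handling before they parse this way.

```python
import gzip
import xml.etree.ElementTree as ET


def xml_stats(path):
    """Stream a gzip-compressed XML file and return simple structural statistics.

    Hypothetical helper for illustration; not part of the XML Toolkit.
    """
    elements = 0
    attributes = 0
    depth = 0
    max_depth = 0
    with gzip.open(path, "rb") as stream:
        # iterparse reports start/end events incrementally while reading.
        for event, elem in ET.iterparse(stream, events=("start", "end")):
            if event == "start":
                elements += 1
                attributes += len(elem.attrib)  # attributes are known at start
                depth += 1
                max_depth = max(max_depth, depth)
            else:
                depth -= 1
                elem.clear()  # free the text and children of a finished element
    return {"elements": elements, "attributes": attributes, "max_depth": max_depth}
```

Calling `xml_stats("dataset.xml.gz")` on a local copy (the file name is a placeholder) returns a dictionary with the element count, attribute count, and maximum nesting depth. The depth figure is the kind of measure that makes a deeply recursive dataset such as Treebank stand out.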
The XML Data Repository is maintained by Gerome Miklau. Please send comments, suggestions, or dataset submissions to xmldatasets@cs.washington.edu.