XML File Corpus

Test Corpus Files (see also the Calgary and Canterbury corpora)

 

The table below is a hyperlinked subset of Tables 2 & 11 of our paper. All files were pre-processed as described in Section 5.1. To avoid potential copyright issues, we link only to the original sources and do not provide our processed versions of the files. A source file that ships with a particular compressor is marked with ‘*’.

 

File Name                                         Description

AB_FR_META, AB_NO_META, AB_TR_META                Weather Data (2004): France, Norway, Turkey
BB_1998STATS                                      Baseball Stats (1998)
CB_CONTENT                                        OpenDocument Sample File
CB_WMS_CAPS                                       GIS Map Server Data
LW_H2385_RH, LW_H3738_IH, LW_H3779_IH             US House of Representatives: Bill Resolutions
LW_ROLL014, LW_ROLL020, LW_ROLL031                US House of Representatives: Roll Call Votes
NT_BOOLEAN, NT_NORMSTRING, NT_POSLONG             NIST XML Data Type Conformance Tests
OD_ALLEN, OD_FORD, OD_MILLER                      Oracle Database: Sample Transactions
PD_CONNOW, PD_CONUSMIL, PD_CONUSNM                DoD Per Diem Data (2003)
PY_AS_YOU, PY_COM_ERR, PY_HAMLET                  Shakespeare (2): As You Like It, Comedy of Errors, Hamlet
RS_AP, RS_CNET_SMALL, RS_CNET_LARGE, RS_REUTERS   RSS “Top Story” Feeds: AP, CNET (1 2), Reuters
WX_29, WX_38, WX_39                               NOAA Weather Forecasts (3 locations, servers 1 & 2)
XB_FACTBOOK                                       *XBIS: CIA World Factbook
XB_PERIODIC                                       *XBIS: Periodic Table of Elements
XG_STUDENT                                        *Student Degree Listing
XM_DBLP                                           *XMill (2): Bibliographic Database
XM_SHAKE                                          *XMill (2): Shakespeare: Antony & Cleopatra
XM_SPROT                                          *XMill (2): DNA Sequences
XM_TPC                                            *XMill (2): Database Benchmarks
XM_TREEBANK                                       *XMill (2): Wall Street Journal Linguistics
XM_WEBLOG                                         *XMill (2): Apache Web Server Log
XX_F21000, XX_F26000, XX_F29000                   FCC Ham Radio Listings, part of XML-Xpress
XZ_UNSPSC                                         UN Product Catalog Codes

Note on XZ_UNSPSC: this file is no longer posted online; however, the raw UN Product Codes (UNSPSC) are posted here under “downloads”.