|
XML File Corpus |
Test Corpus Files (see also the Calgary and Canterbury corpora)
The table below is a hyperlinked subset of Tables 2 & 11 of the paper (pdf); all files were pre-processed as described in Section 5.1. To avoid potential copyright issues, we link only to the original sources, not to our post-processed file versions. A source file that ships with a given compressor is marked with ‘*’.
|
File Name |
Description |
|
AB_FR_META |
Weather Data (2004): |
|
AB_NO_META |
|
|
AB_TR_META |
|
|
BB_1998STATS |
Baseball Stats (1998) |
|
CB_CONTENT |
OpenDocument Sample File |
|
CB_WMS_CAPS |
|
|
LW_H2385_RH |
US House of Representatives: |
|
LW_H3738_IH |
|
|
LW_H3779_IH |
|
|
LW_ROLL014 |
US House of Representatives: |
|
LW_ROLL020 |
|
|
LW_ROLL031 |
|
|
NT_BOOLEAN |
NIST XML Data Type Conformance Tests |
|
NT_NORMSTRING |
|
|
NT_POSLONG |
|
|
OD_ALLEN |
Oracle Database |
|
OD_FORD |
|
|
OD_MILLER |
|
|
PD_CONNOW |
DoD Per Diem Data (2003) |
|
PD_CONUSMIL |
|
|
PD_CONUSNM |
|
|
PY_AS_YOU |
Shakespeare (2): As You Like It, Comedy of Errors, Hamlet |
|
PY_COM_ERR |
|
|
PY_HAMLET |
|
|
RS_AP |
|
|
RS_CNET_SMALL |
|
|
RS_CNET_LARGE |
|
|
RS_REUTERS |
|
|
WX_29 |
NOAA Weather Forecasts |
|
WX_38 |
|
|
WX_39 |
|
|
*XBIS: XB_FACTBOOK, XB_PERIODIC |
CIA World Factbook, Periodic Table of Elements |
|
XG_STUDENT |
|
|
XM_DBLP |
Bibliographic Database |
|
XM_SHAKE |
Shakespeare: Antony & Cleopatra |
|
XM_SPROT |
DNA Sequences |
|
XM_TPC |
Database Benchmarks |
|
XM_TREEBANK |
Wall Street Journal Linguistics |
|
XM_WEBLOG |
Apache Web Server Log |
|
XX_F21000 |
FCC Ham Radio Listings |
|
XX_F26000 |
|
|
XX_F29000 |
|
|
XZ_UNSPSC |
UN Product Catalog Codes. This file is no longer posted online; however, the raw UN Product Codes (UNSPSC) are posted here under “downloads”. |