|
XML File Corpus |
Test Corpus Files (see also the Calgary and Canterbury corpora)
The table below is a hyperlinked subset of Tables 2 & 11 of our paper. All files
were pre-processed as described in Section 5.1. To avoid potential copyright
issues, we do not provide links to our post-processed file versions, only to the
original sources. Source files that ship with a given compressor are marked
with ‘*’ followed by the compressor name.
|
File Name |
Description |
|
AB_FR_META |
Weather Data (2004): |
|
AB_NO_META |
|
|
AB_TR_META |
|
|
BB_1998STATS |
Baseball
Stats (1998) |
|
CB_CONTENT |
OpenDocument
Sample File |
|
CB_WMS_CAPS |
|
|
LW_H2385_RH |
US House Of Representatives: |
|
LW_H3738_IH |
|
|
LW_H3779_IH |
|
|
LW_ROLL014 |
US House Of Representatives: |
|
LW_ROLL020 |
|
|
LW_ROLL031 |
|
|
NT_BOOLEAN |
NIST XML Data Type Conformance Tests |
|
NT_NORMSTRING |
|
|
NT_POSLONG |
|
|
OD_ALLEN |
Oracle Database |
|
OD_FORD |
|
|
OD_MILLER |
|
|
PD_CONNOW |
DoD
Per Diem Data (2003) |
|
PD_CONUSMIL |
|
|
PD_CONUSNM |
|
|
PY_AS_YOU |
Shakespeare (2):
As You Like It, Comedy Of Errors, Hamlet |
|
PY_COM_ERR |
|
|
PY_HAMLET |
|
|
RS_AP |
|
|
RS_CNET_SMALL |
|
|
RS_CNET_LARGE |
|
|
RS_REUTERS |
|
|
WX_29 |
NOAA
Weather Forecasts |
|
WX_38 |
|
|
WX_39 |
|
|
*XBIS XB_FACTBOOK |
CIA World Factbook |
|
*XBIS XB_PERIODIC |
Periodic Table of Elements |
|
XG_STUDENT |
|
|
XM_DBLP |
Bibliographic Database |
|
XM_SHAKE |
Shakespeare: Antony & Cleopatra |
|
XM_SPROT |
DNA Sequences |
|
XM_TPC |
Database Benchmarks |
|
XM_TREEBANK |
Wall Street Journal Linguistics |
|
XM_WEBLOG |
Apache Web Server Log |
|
XX_F21000 |
FCC Ham Radio Listings |
|
XX_F26000 |
|
|
XX_F29000 |
|
|
XZ_UNSPSC |
UN Product Catalog Codes. This file is no longer posted online; however, the
UN Product Codes (UNSPSC) are posted here in raw format under “downloads”. |