LDC Corpora available at UIUC
The Beckman Institute has an LDC membership for the years 1996, 1997,
1999, and 2003-2008. We actually have copies of
the following corpora.
Margaret Fleck has (web-accessible, ask her for details):
- Arabic Treebank (volumes 1-3)
- Arabic Gigaword
- Boston Radio Speech Corpus (1997)
- Buckwalter Arabic morphological analyzer
- CELEX-2
- Chinese Treebank 5.0
- Comlex English lexicon (Pronlex)
- ECI Multilingual Corpus 1 (1994)
- English Gigaword
- Gulf Arabic Conversational Telephone Speech Transcripts
- HUB4 Broadcast News text data (1996)
- Map Task, which is actually the new version
from the HCRC web site because our LDC CDs seem to be permanently lost.
- Switchboard:
  - Switchboard I Release 2 (1997) audio
  - Mississippi State word-level transcriptions
  - ICSI phonetic transcriptions
  - see Treebank-3 for versions with POS, disfluency, and/or syntactic parses
- Egyptian Arabic CALLHOME (transcripts)
- Propbank
- TDT 2 (version 3.2, English text only)
- Treebank-3
Margaret Fleck inherited the following CDs from Richard Sproat,
but hasn't put them onto the web yet:
- Levantine Arabic QT Training Data, Set 3 (Speech)
- Prague Arabic Dependency Treebank 1.0
- ISI Arabic-English Automatically Extracted Parallel Text
- Chinese Gigaword
- Korean Newswire
- ARL Urdu Speech Database Training Data
- 2004 NIST Speaker Recognition Evaluation
- Low density languages version 0.5-Bengali
- Less Commonly Taught Languages (LCTL): Bengali 2005 v0.6, Bengali v1.0,
Thai 2005 v0.11, Thai v1.0, Thai Resource Kit 0.1, Thai 2005 v0.10
- Gale Y1 web text collection
Q1, Q3, Q4, Phase 3 Release 2
- Gale Y1 Web 1T 5-gram Version 1
- Gale phase 2, release 1 and 2, web text
- Gale phase 3, release 1 and 2, web text
- Gale phase 3 devtest Broadcast Audio v1.0
- Gale Y1 Distillation Evaluation Audio
Richard Sproat apparently had the following corpus but didn't
hand it over to Margaret, so we don't know if we still have
a copy.
- SigHan Chinese Treebank Segmentation Evaluation Corpus
Mark Hasegawa-Johnson has so much stuff that it's in a separate file.
Dan Roth has:
- ACE-2
- ACE 2004 Multilingual Training Corpus
- Comlex English lexicon (Pronlex)
- Google n-grams
- Hong Kong Hansards Parallel Text
- MUC-7
- North American News Corpus
- Propbank
- Reuters-21578 (1997) newswire data
- TDT 2 (version 3.2, English text only)
- TREC/AQUAINT
- Treebank-2
ChengXiang Zhai has:
- TREC-1 to TREC-8 disks
- lots more stuff, details coming soon
David Forsyth has:
- TRECVID news video (pre-release version, 2004)
Dave Dubin (GSLIS) has:
- Tipster/TREC volumes 1-5
Someone supposedly has TRAINS (1995), but it isn't showing
up in our LDC online records.