NLTK provides convenient ways to access several of these corpora, and has data packages containing corpora and corpus samples, freely downloadable for use in teaching and research. For information about downloading them, see the NLTK website.
Figure: Cumulative Word Length Distributions. Six translations of the Universal Declaration of Human Rights are processed; the graph shows that words having 5 or fewer letters account for about 80% of Ibibio text, 60% of German text, and 25% of Inuktitut text.
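The percentages quoted for the word length figure are cumulative shares: for each length n, the fraction of all words that have n or fewer letters. A minimal sketch of that calculation follows, using only the standard library. The function name `cumulative_length_share` and the toy token list are assumptions made for illustration; real corpus tokens would first need punctuation filtered out.

```python
from collections import Counter
from itertools import accumulate


def cumulative_length_share(words, max_len=10):
    """Return {n: fraction of words with n or fewer letters} for n = 1..max_len."""
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Running total of words of length 1, 2, ..., max_len.
    running = accumulate(counts.get(n, 0) for n in range(1, max_len + 1))
    return {n: c / total for n, c in enumerate(running, start=1)}


# Toy token list standing in for one translation (hypothetical sample).
sample = "all human beings are born free and equal in dignity and rights".split()
shares = cumulative_length_share(sample)
print(f"{shares[5]:.0%} of the sample words have 5 or fewer letters")
```

Running the same function over each translation and plotting the resulting values against n would give one curve per language, which is the shape of comparison the figure describes.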
Unfortunately, for many languages, substantial corpora are not yet available.
NLTK's small collection of web text includes content from a Firefox discussion forum, conversations overheard in New York, and a movie script. There is also a corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators.
The corpus contains over 10,000 posts, anonymized by replacing usernames with generic names of the form "User NNN", and manually edited to remove any other identifying information.
We can ask for the topics covered by one or more documents, or for the documents included in one or more categories.
For convenience, the corpus methods accept a single fileid or a list of fileids.
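The two-way lookup between documents and categories, and the convention of accepting either one identifier or a list, can be imitated in a few lines. NLTK's categorized corpus readers expose this behaviour through their `categories()` and `fileids()` methods. The sketch below does not call NLTK: the index `DOC_TOPICS`, the document ids, and the helper names `categories_for` and `fileids_for` are all hypothetical, chosen only to show the idea.

```python
# Hypothetical toy index: document id -> list of topic categories.
DOC_TOPICS = {
    "doc/1": ["barley", "corn"],
    "doc/2": ["corn"],
    "doc/3": ["wheat", "barley"],
}


def _as_list(value):
    """Wrap a single string in a list so callers may pass either form."""
    return [value] if isinstance(value, str) else list(value)


def categories_for(fileids):
    """Sorted topics covered by one fileid or by a list of fileids."""
    return sorted({c for f in _as_list(fileids) for c in DOC_TOPICS[f]})


def fileids_for(categories):
    """Sorted documents that belong to one category or to any of several."""
    wanted = set(_as_list(categories))
    return sorted(f for f, cats in DOC_TOPICS.items() if wanted & set(cats))


print(categories_for("doc/1"))
print(categories_for(["doc/1", "doc/3"]))
print(fileids_for("barley"))
print(fileids_for(["corn", "wheat"]))
```

Normalising the argument in one small helper keeps both lookups free of type checks, which is one plausible way to support the single-or-list convention.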
Often there is insufficient government or industrial support for developing language resources, and individual efforts are piecemeal and hard to discover or re-use.