sodict.Z -- the Shorter Oxford Dictionary Wordlist
    [see also sodict.shar.Z, which contains some useful shell scripts]

Contains part-of-speech and other information.

See
    Proc. Fall Joint Computer Conference, 1963, pp119-423
    J. L. Dolby, H. L. Resnikoff & E. MacMurray
    A Tape Dictionary for Linguistic Experiments
for the original paper describing this file.

I have used sed to delete unnecessary whitespace, the original second
field, which contained the reverse of each word, and the trailing field,
which was simply a sequence number.

There do not appear to be any copyright or redistribution constraints on
this file.

Liam Quin, lee@sq.com
June 1990

Format of original on-line dictionary:

         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
|....W.o.r.d........(The word reversed)  CSsssssssssswwwwwwwwwwmmmmmmmmmm

pos:    what:
1       a leading space (deleted)
2-21    the word itself, all uppercase (turned to lower case)
22-41   the reverse of the word (deleted)
42:     A syllable count determined by deleting trailing e's and then
        counting each run of consecutive vowels as a single syllable.
        Useful only because for certain types of word it contains a
        single-character flag instead:
            h -- hyphenated words
            s -- suffixes
            p -- prefixes
            b -- broken words (i.e.
                 words containing spaces)
43:     Status --
            x -- the word appears in the ``Shorter Oxford Dictionary'' [2 vols]
            w -- the word appears in ``Webster's New International Dictionary''
            b -- the word appears in both dictionaries
              -- [empty or blank] the word appears in neither dictionary
                 [see note 1]
44-53   Part of speech and Status from the Shorter Oxford Dictionary
54-63   Part of speech and Status from Webster's New International Dictionary
64-73   Merged part of speech and status information
74-80   a sequence number (deleted)

The codes in these three fields are as follows:
    $ -- Specialised
    a -- Archaic
    c -- Capital
    d -- Dialectical
    e -- Erroneous
    f -- Alien
    h -- Rhetoric
    n -- Nonsense
    o -- Obsolete
    p -- Poetical
    q -- Colloquial
    r -- Rare
    s -- Standard
    w -- Nonce Word
    z -- Substandard [sic]

The column within which the letter appears indicates the part of speech:
     1 -- Noun
     2 -- Adjective
     3 -- Verb
     4 -- Adverb
     5 -- Preposition
     6 -- Conjunction
     7 -- Pronoun
     8 -- Interjection
     9 -- Past
    10 -- Other
(note that trailing blanks have been removed from these fields)

Notes:
[1] 2,490 words were added from a list at Cornell University.
[2] The merged field gives precedence to the standard meaning of each
    word, or to the Shorter Oxford Dictionary.
[3] The researchers indicated (in 1963) that they had ``undertaken to
    punch [sic!] all of the words of Funk and Wagnalls' ``New Practical
    Standard Dictionary'' in the syllabic form given by that source
    together with the accent information there given.''
    Whether or not they ever did this is unknown...

Example:

 ACOOL               LOOCA               2    O                   O      0506190
 ACOP                POCA                2    O                   O      0406200
[...]
 ACORN-SHELL         LLEHS-NROCA         HXS                   S         1106230

Changed By Liam Quin (lee@sq.com):

d-delete/k-keep:
dkkkkkkkkkkkkkkkkkkkkddddddddddddddddddddkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkddddddd
d|....W.o.r.d.......|  (word reversed)  |CSsssssssssswwwwwwwwwwmmmmmmmmmm
Insert-After: | | | | |

Resulting output after stripping trailing blanks in $1 and $0:

acool|2| |   o|
acop|2| |   o|
. . .
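acorn-shell|h|x|s|

Hence acorn-shell is a hyphenated word found in the SOD but not in
Webster's, with a standard meaning as a noun (`s' in column 1), and
`acop' is an obsolete adverb not found in either dictionary, since the
Status field is blank and `o' appears in column 4 of the first field.

Unix shell-script used to change the file from oldsodict.Z to sodict.Z:

#! /bin/sh
# 1st s: split each 80-column record into |-separated fields, dropping
#        column 1, the reversed word (22-41) and the sequence number (74-80);
#        the sixth group (columns 64-73) is matched but not written out
# 2nd s: strip the blanks between a field's contents and the following |
# 3rd s: strip trailing blanks from the line
# y:     fold upper case to lower case
zcat oldsodict.Z | sed -e '
s/.\(....................\)....................\(.\)\(.\)\(..........\)\(..........\)\(..........\)......./\1|\2|\3|\4|\5/
s/\([^| ][^| ]*\) *|/\1|/g
s/[ ]*$//
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
' > sodict
exit $?

If there are any questions about the transformation, feel free to
contact me at lee@sq.com by electronic mail, or at SoftQuad Inc.,
Toronto, (416) 963-8337.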
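
The reading of the converted records given above (a status letter whose
column gives the part of speech) can be sketched as a small shell
function.  This is only an illustration, not part of the original
distribution: the name decode_sodict is made up, and it assumes records
of the form word|flag|status|field4|field5, with field 4 taken to be
the SOD field and field 5 the Webster field.

```shell
#!/bin/sh
# decode_sodict: hypothetical helper, not shipped with sodict.Z.
# Reads converted records on standard input and prints one line for
# every status letter found in fields 4 (assumed SOD) and 5 (assumed
# Webster), naming the part of speech from the letter's column.
decode_sodict() {
    awk -F'|' '
    BEGIN {
        # Column number within a field -> part of speech
        names = "noun adjective verb adverb preposition"
        names = names " conjunction pronoun interjection past other"
        split(names, pos, " ")
        # Status letter -> meaning (letters are lower case after conversion)
        code["$"] = "specialised"; code["a"] = "archaic"
        code["c"] = "capital";     code["d"] = "dialectical"
        code["e"] = "erroneous";   code["f"] = "alien"
        code["h"] = "rhetoric";    code["n"] = "nonsense"
        code["o"] = "obsolete";    code["p"] = "poetical"
        code["q"] = "colloquial";  code["r"] = "rare"
        code["s"] = "standard";    code["w"] = "nonce word"
        code["z"] = "substandard"
    }
    function describe(word, source, field,    i, c, meaning) {
        for (i = 1; i <= length(field); i++) {
            c = substr(field, i, 1)
            if (c == " ") continue          # blank column: nothing recorded
            meaning = (c in code) ? code[c] : "unknown code " c
            printf "%s: %s %s (%s)\n", word, meaning, pos[i], source
        }
    }
    { describe($1, "SOD", $4); describe($1, "Webster", $5) }
    '
}
```

For instance, feeding it the record for `acop' with
    printf '%s\n' 'acop|2| |   o|' | decode_sodict
should print "acop: obsolete adverb (SOD)".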
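
The syllable count in column 42 is described above as deleting trailing
e's and then counting each run of consecutive vowels once.  A rough
sketch of that rule follows.  The name syllable_count is hypothetical,
and the vowel set (a, e, i, o, u, with y excluded) is an assumption:
the description does not say which letters the original program
treated as vowels.

```shell
#!/bin/sh
# syllable_count: hypothetical helper illustrating the column-42 rule.
# Assumption: the vowels are a, e, i, o and u only.
syllable_count() {
    printf '%s\n' "$1" |
        tr '[:upper:]' '[:lower:]' |             # work in lower case
        sed 's/e*$//' |                          # delete trailing e's
        awk '{ print gsub(/[aeiou]+/, "") }'     # count runs of vowels
}
```

Under these assumptions both `acool' and `acop' give 2, which agrees
with the counts shown in the example records.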