Lexical Complexity Analyzer

This code is the lexical complexity analyzer described in:

Lu, Xiaofei (2012). The relationship of lexical richness to the quality of ESL learners' oral narratives. The Modern Language Journal, 96(2), 190-208.

Version 1.1
Released on February 12, 2013

Copyright (C) 2013 Xiaofei Lu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

To download the latest version of this software, follow the appropriate link at http://www.personal.psu.edu/xxl13/download.html

1. About

This tool computes the lexical complexity of English texts using 25 different measures. Information on the measures can be found in Lu (2012). For lexical sophistication analysis, the tool uses frequency lists derived from the British National Corpus (BNC) and the American National Corpus (ANC).

2. Running the tool

2.1 Input files:

All input files must first be POS-tagged and lemmatized and must be in one of the following two formats (see the files in the samples folder for examples). Each file should contain a minimum of 50 words.

    lemma_pos lemma_pos lemma_pos ...

or

    lemma_pos
    lemma_pos
    lemma_pos
    ...

A minimal sketch for checking that a file follows this format is given at the end of this README.

You can use any POS tagger and lemmatizer, as long as the Penn Treebank POS tagset is adopted and the input file is appropriately formatted. In Lu (2012), the following POS tagger and lemmatizer were used:

The Stanford POS tagger: http://nlp.stanford.edu/software/tagger.shtml
MORPHA: http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html

2.2 Analyzing a single file:

To get the lexical complexity of a single file, run the following from this directory. Replace input_file with the actual name of your input file and output_file with the desired name of your output file.

    python lc.py input_file > output_file

e.g.,

    python lc.py samples/1.lem > 1.lex

To use the American National Corpus (ANC) wordlist instead of the BNC wordlist for lexical sophistication analysis, use the lc-anc.py script, e.g.,

    python lc-anc.py samples/1.lem > 1-anc.lex

2.3 Analyzing multiple files:

To get the lexical complexity of two or more files within a single folder, run the following from this directory. Replace path_to_folder with the actual path to the folder that contains your files and output_file with the desired name of your output file. The folder should contain only the files you want to analyze.

    python folder-lc.py path_to_folder > output_file

e.g.,

    python folder-lc.py samples/ > samples.lex

To use the American National Corpus (ANC) wordlist instead of the BNC wordlist for lexical sophistication analysis, use the folder-lc-anc.py script, e.g.,

    python folder-lc-anc.py samples/ > samples-anc.lex

2.4 Using the output:

The output file is comma-delimited and can be loaded directly into Excel or SPSS for analysis.
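It can also be read in Python with the standard csv module. The sketch below assumes the first row of the output is a header line of column names; if your output file is laid out differently, adjust accordingly.

    import csv

    # Read the comma-delimited output produced by, e.g.,
    #     python folder-lc.py samples/ > samples.lex
    # DictReader takes the column names from the first (header) row;
    # this assumes the output begins with such a row.
    with open('samples.lex') as f:
        rows = list(csv.DictReader(f))

    # Each row maps a column name to that file's value for the measure.
    for row in rows:
        print(row)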
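Finally, here is the format-checking sketch referenced in section 2.1. It is a hypothetical helper (call it check_input.py), not part of this package; it only verifies that every token carries a _pos tag and that the file meets the 50-word minimum.

    import sys

    def check(path):
        # Both accepted input formats reduce to whitespace-separated
        # lemma_pos tokens, so a plain split() handles either layout.
        tokens = open(path).read().split()
        bad = [t for t in tokens if '_' not in t]
        if bad:
            print('%d token(s) missing a _pos tag, e.g. %s' % (len(bad), bad[0]))
        if len(tokens) < 50:
            print('only %d words; a minimum of 50 is required' % len(tokens))
        return not bad and len(tokens) >= 50

    if __name__ == '__main__':
        # Usage: python check_input.py samples/1.lem
        sys.exit(0 if check(sys.argv[1]) else 1)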