This code is the lexical complexity analyzer described in

Lu, Xiaofei (2012). The relationship of lexical richness to the quality
of ESL learners' oral narratives. The Modern Language Journal, 96(2), 190-208.
Version 1.1 Released on February 12, 2013
Copyright (C) 2013 Xiaofei Lu

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple
Place, Suite 330, Boston, MA 02111-1307 USA

To download the latest version of this software, follow the appropriate link
at

http://www.personal.psu.edu/xxl13/download.html
1. About | |
This tool computes the lexical complexity of English texts using 25 different | |
measures. Information on the measures can be found in Lu (2012). This | |
tool uses frequency lists derived from the British National Corpus and the | |
American National Corpus. | |
2. Running the tool | |
2.1 Input files: All input files must be POS-tagged and lemmatized first and
must be in the following format (see the files in the samples folder for
examples). Each file should contain a minimum of 50 words.
lemma_pos lemma_pos lemma_pos ... | |
or | |
lemma_pos | |
lemma_pos | |
lemma_pos | |
You can use any POS tagger and lemmatizer, as long as the Penn Treebank POS
tagset is adopted and the input file is appropriately formatted. In Lu
(2012), the following POS tagger and lemmatizer were used (an illustrative
alternative pipeline is sketched after the links below):
The Stanford POS tagger: | |
http://nlp.stanford.edu/software/tagger.shtml | |
MORPHA: | |
http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html | |
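
If the Stanford tagger and MORPHA are not available, the sketch below shows
one way to produce the lemma_pos format with NLTK instead. This is an
illustrative substitution, not the pipeline used in Lu (2012): nltk.pos_tag
uses the Penn Treebank tagset, and the penn_to_wordnet helper is an assumed
mapping for the WordNet lemmatizer.

import nltk
from nltk.stem import WordNetLemmatizer

# Uncomment on first use to fetch the required NLTK data:
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('wordnet')

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to the coarse POS the WordNet lemmatizer expects.
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # treat everything else as a noun

def to_lemma_pos(text):
    # Tokenize, tag, lemmatize, and join the tokens as lemma_pos lemma_pos ...
    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return ' '.join(lemmatizer.lemmatize(tok.lower(), penn_to_wordnet(tag)) + '_' + tag
                    for tok, tag in tagged)

print(to_lemma_pos("The children were playing in the garden."))
# -> the_DT child_NNS be_VBD play_VBG in_IN the_DT garden_NN ._.

Write the resulting string to a file (one text per file) and pass that file
to lc.py as described below.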
2.2 Analyzing a single file: To get the lexical complexity of a single file, | |
run the following from this directory. Replace input_file with the actual | |
name of your input file and output_file with the desired name of your output | |
file. | |
python lc.py input_file > output_file | |
e.g., | |
python lc.py samples/1.lem > 1.lex | |
To use the American National Corpus (ANC) wordlist instead of the BNC wordlist | |
for lexical sophistication analysis, use the lc-anc.py script, e.g., | |
python lc-anc.py samples/1.lem > 1-anc.lex | |
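
To call the analyzer from another Python program rather than the shell, the
same command can be run with the standard subprocess module. A minimal
sketch, using the file names from the example above:

import subprocess

# Equivalent to: python lc.py samples/1.lem > 1.lex
with open('1.lex', 'w') as out:
    subprocess.run(['python', 'lc.py', 'samples/1.lem'], stdout=out, check=True)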
2.3 Analyzing multiple files: To get the lexical complexity of two or more | |
files within a single folder, run the following from this directory. Replace | |
path_to_folder with the actual path to the folder that contains your files | |
and output_file with the desired name of your output file. The folder should | |
only contain the files you want to analyze. | |
python folder-lc.py path_to_folder > output_file | |
e.g., | |
python folder-lc.py samples/ > samples.lex | |
To use the American National Corpus (ANC) wordlist instead of the BNC wordlist | |
for lexical sophistication analysis, use the folder-lc-anc.py script, e.g., | |
python folder-lc-anc.py samples/ > samples-anc.lex | |
2.4 Using the output: The output file is comma-delimited and can be loaded
directly into Excel or SPSS for analysis.
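
The output can also be read back into Python. A minimal sketch using pandas
(an assumption; any CSV reader works), where samples.lex is the output file
produced above and the first row is assumed to hold the column headers:

import pandas as pd

# Load the comma-delimited results; one row per analyzed file.
results = pd.read_csv('samples.lex')
print(results.describe())  # summary statistics for each measure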