This code is the lexical complexity analyzer described in
Lu, Xiaofei (2012). The relationship of lexical richness to the quality
of ESL speakers' oral narratives. The Modern Language Journal, 96(2), 190-208.
Version 1.1 Released on February 12, 2013
Copyright (C) 2013 Xiaofei Lu
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple
Place, Suite 330, Boston, MA 02111-1307 USA
To download the latest version of this software, follow the appropriate link
at
http://www.personal.psu.edu/xxl13/download.html
1. About
This tool computes the lexical complexity of English texts using 25 different
measures. Information on the measures can be found in Lu (2012). This
tool uses frequency lists derived from the British National Corpus and the
American National Corpus.
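Lu (2012) defines each measure precisely; as a rough illustration of the kind
of computation involved, the minimal Python sketch below derives two of the
simplest measures, number of different words (NDW) and type-token ratio (TTR),
from a file in the lemma_pos format described in section 2.1. The sketch is
illustrative only and is not the tool's actual implementation.

import sys

def read_lemmas(path):
    # Read a lemma_pos file and return the list of lemmas (tokens).
    lemmas = []
    with open(path) as f:
        for line in f:
            for item in line.split():
                # Each item looks like "lemma_pos"; keep the lemma part.
                lemma, _, pos = item.rpartition('_')
                if lemma:
                    lemmas.append(lemma.lower())
    return lemmas

if __name__ == '__main__':
    tokens = read_lemmas(sys.argv[1])
    ndw = len(set(tokens))       # number of different words
    ttr = ndw / len(tokens)      # type-token ratio
    print('NDW = %d, TTR = %.4f' % (ndw, ttr))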
2. Running the tool
2.1 Input files: All input files must be POS-tagged and lemmatized first and
must be in the following format (see the files in the samples folder for
examples). Each file should contain a minimum of 50 words.
lemma_pos lemma_pos lemma_pos ...
or
lemma_pos
lemma_pos
lemma_pos
You can use any POS tagger and lemmatizer, as long as the Penn Treebank POS
tagset is adopted and the input file is appropriately formatted. In Lu
(2012), the following POS tagger and lemmatizer were used:
The Stanford POS tagger:
http://nlp.stanford.edu/software/tagger.shtml
MORPHA:
http://www.informatics.susx.ac.uk/research/groups/nlp/carroll/morph.html
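If you use a different tagger or lemmatizer, a quick sanity check along the
following lines (a hypothetical helper, not part of this tool) can confirm
that a file matches the lemma_pos format and meets the 50-word minimum before
you run the analyzer:

import sys

def check_format(path, min_words=50):
    # Verify that every token looks like lemma_pos and that the file
    # contains at least min_words tokens.
    count = 0
    with open(path) as f:
        for n, line in enumerate(f, 1):
            for item in line.split():
                if '_' not in item:
                    print('line %d: malformed token %r' % (n, item))
                    return False
                count += 1
    if count < min_words:
        print('only %d words; at least %d are required' % (count, min_words))
        return False
    return True

if __name__ == '__main__':
    sys.exit(0 if check_format(sys.argv[1]) else 1)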
2.2 Analyzing a single file: To get the lexical complexity of a single file,
run the following from this directory. Replace input_file with the actual
name of your input file and output_file with the desired name of your output
file.
python lc.py input_file > output_file
e.g.,
python lc.py samples/1.lem > 1.lex
To use the American National Corpus (ANC) wordlist instead of the BNC wordlist
for lexical sophistication analysis, use the lc-anc.py script, e.g.,
python lc-anc.py samples/1.lem > 1-anc.lex
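The scripts can also be driven from another Python program instead of the
shell. The sketch below (which assumes lc.py is in the current directory and
that the 'python' on your path is the interpreter these scripts require)
captures the output of a single run rather than redirecting it:

import subprocess

# Run lc.py on one input file and capture its comma-delimited output.
result = subprocess.run(
    ['python', 'lc.py', 'samples/1.lem'],
    capture_output=True, text=True, check=True)
with open('1.lex', 'w') as out:
    out.write(result.stdout)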
2.3 Analyzing multiple files: To get the lexical complexity of two or more
files within a single folder, run the following from this directory. Replace
path_to_folder with the actual path to the folder that contains your files
and output_file with the desired name of your output file. The folder should
only contain the files you want to analyze.
python folder-lc.py path_to_folder > output_file
e.g.,
python folder-lc.py samples/ > samples.lex
To use the American National Corpus (ANC) wordlist instead of the BNC wordlist
for lexical sophistication analysis, use the folder-lc-anc.py script, e.g.,
python folder-lc-anc.py samples/ > samples-anc.lex
2.4 Using the output: The output file is comma-delimited and can be loaded
directly into Excel or SPSS for analysis.
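The output can also be read programmatically; here is a minimal sketch using
Python's standard csv module (the exact column names depend on the header
row the tool writes):

import csv

# Read the comma-delimited output produced by folder-lc.py.
with open('samples.lex', newline='') as f:
    for row in csv.DictReader(f):
        print(row)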