Spaces:
Sleeping
Sleeping
File size: 18,343 Bytes
b028d48 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 |
Tsurgeon v3.3.1 - 2014-01-04 ---------------------------------------------- Copyright (c) 2003-2012 The Board of Trustees of The Leland Stanford Junior University. All Rights Reserved. Original core Tregex code by Roger Levy and Galen Andrew. Original core Tsurgeon code by Roger Levy. GUI by Anna Rafferty Support code, additional features, etc. by Chris Manning This release prepared by John Bauer. This package contains Tregex and Tsurgeon. Tregex is a Tgrep2-style utility for matching patterns in trees. It can be run in a graphical user interface, from the command line using the TregexPattern main method, or used programmatically in java code via the TregexPattern, TregexMatcher and TregexPatternCompiler classes. As of version 1.2, the Tsurgeon tree-transformation utility is bundled together with Tregex. See the file README.tsurgeon for details. Java version 1.6 is required to use Tregex. If you really want to use Tregex under an earlier version of Java, look into RetroWeaver: http://retroweaver.sourceforge.net/ TSURGEON ---------------------------------------------- Tsurgeon is a tool for modifying trees that match a particular Tregex pattern. Further documentation for Tregex and Tregex GUI can be found in README-tregex.txt and README-gui.txt, respectively. ---------------------------------------------- Brief description: Takes some trees, tries to match one or more tregex expressions to each tree, and for each successful match applies some surgical operations to the tree. Pretty-prints each resulting tree (after all successful match/operation sets have applied) to standard output. A simple example: ./tsurgeon.csh -treeFile atree exciseNP renameVerb ----------------------------------------- RUNNING TREGEX ----------------------------------------- Program Command Line Options: -treeFile <filename> specify the name of the file that has the trees you want to transform. -po <matchPattern> <operation> Apply a single operation to every tree using the specified match pattern and the specified operation. -s Prints the output trees one per line, instead of pretty-printed. The arguments are then Tsurgeon scripts. Each argument should be the name of a transformation file that contains a list of pattern and transformation operation list pairs. That is, it is a sequence of pairs of a TregexPattern pattern on one or more lines, then a blank line (empty or whitespace), then a list of transformation operations one per line (as specified by Tsurgeon syntax below to apply when the pattern is matched, and then another blank line (empty or whitespace). Note the need for blank lines: The code crashes if they are not present as separators (although the blank line at the end of the file can be omitted). The script file can include comment lines, either whole comment lines or trailing comments introduced by %, which extend to the end of line. A needed percent mark in partterns or operations can be escaped by a preceding backslash. ----------------------------------------- TSURGEON SYNTAX ----------------------------------------- Legal operation syntax and semantics (see Examples section for further detail): delete <name_1> <name_2> ... <name_m> For each name_i, deletes the node it names and everything below it. prune <name_1> <name_2> ... <name_m> For each name_i, prunes out the node it names. Pruning differs from deletion in that if pruning a node causes its parent to have no children, then the parent is in turn pruned too. excise <name1> <name2> The name1 node should either dominate or be the same as the name2 node. This excises out everything from name1 to name2. All the children of name2 go into the parent of name1, where name1 was. relabel <name> <new-label> Relabels the node to have the new label. There are three possible forms for the new-label: relabel nodeX VP - for changing a node label to an alphanumeric string, relabel nodeX /''/ - for relabeling a node to something that isn't a valid identifier without quoting, and relabel nodeX /^VB(.*)$/verb\/$1/ - for regular expression based relabeling. In the last case, all matches of the regular expression against the node label are replaced with the replacement String. This has the semantics of Java/Perl's replaceAll: you may use capturing groups and put them in replacements with $n. Also, as in the example, you can escape a slash in the middle of the second and third forms with \/ and \\. This last version lets you make a new label that is an arbitrary String function of the original label and additional characters that you supply. insert <name> <position> insert <tree> <position> inserts the named node, or a manually specified tree (see below for syntax), into the position specified. Right now the only ways to specify position are: $+ <name> the left sister of the named node $- <name> the right sister of the named node >i <name> the i_th daughter of the named node. >-i <name> the i_th daughter, counting from the right, of the named node. move <name> <position> moves the named node into the specified position. To be precise, it deletes (*NOT* prunes) the node from the tree, and re-inserts it into the specified position. See above for how to specify position replace <name1> <name2> deletes name1 and inserts a copy of name2 in its place. adjoin <tree> <target-node> adjoins the specified auxiliary tree (see below for syntax) into the target node specified. The daughters of the target node will become the daughters of the foot of the auxiliary tree. adjoinH <tree> <target-node> similar to adjoin, but preserves the target node and makes it the root of <tree>. (It is still accessible as <code>name</code>. The root of the auxiliary tree is ignored.) adjoinF <tree> <target-node> similar to adjoin, but preserves the target node and makes it the foot of <tree>. (It is still accessible as <code>name</code>, and retains its status as parent of its children. The foot of the auxiliary tree is ignored.) coindex <name_1> <name_2> ... <name_m> Puts a (Penn Treebank style) coindexation suffix of the form "-N" on each of nodes name_1 through name_m. The value of N will be automatically generated in reference to the existing coindexations in the tree, so that there is never an accidental clash of indices across things that are not meant to be coindexed. ----------------------------------------- Syntax for trees to be inserted or adjoined: A tree to be adjoined in can be specified with LISP-like parenthetical-bracketing tree syntax such as those used for the Penn Treebank. For example, for the NP "the dog" to be inserted you might use the syntax (NP (Det the) (N dog)) That's all that there is for a tree to be inserted. Auxiliary trees (a la Tree Adjoining Grammar) must also have exactly one frontier node ending in the character "@", which marks it as the "foot" node for adjunction. Final instances of the character "@" in terminal node labels will be removed from the actual label of the tree. For example, if you wanted to adjoin the adverb "breathlessly" into a VP, you might specify the following auxiliary tree: (VP (Adv breathlessly) VP@ ) All other instances of "@" in terminal nodes must be escaped (i.e., appear as \@); this escaping will be removed by tsurgeon. In addition, any node of a tree can be named (the same way as in tregex), by appending =<name> to the node label. That name can be referred to by subsequent tsurgeon operations triggered by the same match. All other instances of "=" in node labels must be escaped (i.e., appear as \=); this escaping will be removed by tsurgeon. For example, if you want to insert an NP trace somewhere and coindex it with a node named "antecedent" you might say insert (NP (-NONE- *T*=trace)) <node-location> coindex trace antecedent $ ----------------------------------------- Examples of Tsurgeon operations: Tree (used in all examples): (ROOT (S (NP (NNP Maria_Eugenia_Ochoa_Garcia)) (VP (VBD was) (VP (VBN arrested) (PP (IN in) (NP (NNP May))))) (. .))) Apply delete: VP < PP=prep delete prep Result: (ROOT (S (NP (NNP Maria_Eugenia_Ochoa_Garcia)) (VP (VBD was) (VP (VBN arrested) (. .))) The PP node directly dominated by a VP is removed, as is everything under it. Apply prune: S < (NP < NNP=noun) prune noun Result: (ROOT (S (VP (VBD was) (VP (VBN arrested) (PP (IN in) (NP (NNP May))))) (. .))) The NNP node is removed, and since this results in the NP above it having no terminal children, the NP node is deleted as well. Note: This is different from delete in which the NP above the NNP would remain. Apply excise: VP < PP=prep excise prep prep Result: (ROOT (S (NP (NNP Maria_Eugenia_Ochoa_Garcia)) (VP (VBD was) (VP (VBN arrested) (IN in) (NP (NNP May))))) (. .))) The PP node is removed, and all of its children are added in the place it was previously located. Excise removes all the nodes from the first named node to the second named node, and the children of the second node are added as children of the parent of the first node. Thus, for another example: VP=verb < PP=prep excise verb prep Result: (ROOT (S (NP (NNP Maria_Eugenia_Ochoa_Garcia)) (VP (VBD was) (IN in) (NP (NNP May))) (. .))) Apply relabel: VP=v < PP=prep relabel prep verbPrep Result: (ROOT (S (NP (NNP Maria_Eugenia_Ochoa_Garcia)) (VP (VBD was) (VP (VBN arrested) (verbPrep (IN in) (NP (NNP May))))) (. .))) The label for the node called prep (PP) is changed to verbPrep. The other form of relabel uses regular expressions; consider the following operation: /^VB.+/=v relabel v /^VB(.*)$/ #1 Result: (ROOT (S (NP (NNP Maria_Eugenia_Ochoa_Garcia)) (VP (D was) (VP (N arrested) (PP (IN in) (NP (NNP May))))) (. .))) The Tregex pattern matches all nodes that begin "VB" and have at least one more character. The Tsurgeon operation then matches the node label to the regular expression "^VB(.*)$" and selects the text matching the first part that is not completely specified in the pattern. In this case, that is the part matching the wildcard (.*), which matches all characters after the VB. The node is then relabeled with that part of the text, causing, for example, "VBD" to be relabeled "D". The "#1" specifies that the name of the node should be the first group in the regex. Apply insert (shown here with inserting a node, but could also be a tree): S < (NP < (NNP=name !$- DET)) insert (DET Ms.) $+ name Result: (ROOT (S (NP (DET Ms.) (NNP Maria_Eugenia_Ochoa_Garcia)) (VP (VBD was) (VP (VBN arrested) (PP (IN in) (NP (NNP May))))) (. .))) The pattern matches the NNP node that is directly dominated by an NP (which is directly dominated by an S) and is not a direct right sister of a DET. Thus, the (DET Ms.) node is inserted immediately to the left of that NNP node, as specified by "$+ name". "$+" is the location and "name" describes what node the location is with respect to. Note: Tsurgeon will re-search for matches after each run of the script; thus, cycles may occur, causing the program to not terminate. The key is to write patterns that match prior to the changes you would like to make but that do not match afterwards. If the clause "!$- DET" had been left out in this example, Tsurgeon would have matched the pattern after every insert operation, causing an infinite number of DETs to be added. Apply move: VP=verb < PP=prep move prep $- verb Result: (ROOT (S (NP (NNP Maria_Eugenia_Ochoa_Garcia)) (VP (VBD was) (VP (VBN arrested))) (PP (IN in) (NP (NNP May))) (. .))) The PP is moved out of the VP that dominates it and added as a direct right sister of the VP. As for insert, "$-" specifies the location for prep while "verb" specifies what that location is relative to. Note: "move" is a macro operation that deletes the given node and then inserts it. "move" does not use prune, and thus any branches that now lack terminals will remain rather than being removed. Apply replace: S < (NP=name < NNP) replace name (NP (DET A) (NN woman)) Result: (ROOT (S (NP (DET A) (NN woman)) (VP (VBD was) (VP (VBN arrested) (PP (IN in) (NP (NNP May))))) (. .))) "name" is matched to an NP that is dominated by an S and dominates an NNP, and a new subtree ("(NP (DET A) (NN woman))") is added in the place where "name" was. Note: This operation is vulnerable to falling into an infinite loop. See the note concerning the "insert" operation and how patterns are matched. Apply adjoin: S < (NP=name < NNP) adjoin (NP (DET A) (NN woman) NP@) name Result: (ROOT (S (NP (DET A) (NN woman) (NP (NNP Maria_Eugenia_Ochoa_Garcia))) (VP (VBD was) (VP (VBN arrested) (PP (IN in) (NP (NNP May))))) (. .))) First, the NP is matched to the NP dominating the NNP tag. Then, the specified tree ("(NP (DET A) (NN woman) NP@)") is placed in that location. The "@" symbol specifies that the children of the original NP node ("name") are to be placed as children of a new NP node that is directly to the right of (NN woman). If the specified tree were "(NP (DET A) (NN woman) VP@)" then the child (NNP Maria_Eugenia_Ochoa_Garcia) would appear under a VP. Exactly one "@" node must appear in the specified tree in order to indicate where to place the node from the original tree. Apply adjoinH: S < (NP=name < NNP) adjoinH ((NP (DET A) (NN woman) NP@)) name Result: (ROOT (S (NP (NP (DET A) (NN woman) (NP (NNP Maria_Eugenia_Ochoa_Garcia)))) (VP (VBD was) (VP (VBN arrested) (PP (IN in) (NP (NNP May))))) (. .))) This operation differs from adjoin in that it retains the named node (in this case, "name"). The named node is made the root of the specified tree, resulting in two NP nodes dominating the DET in this example whereas only one was present in the previous example. Note that the specified tree is wrapped in an extra pair of parentheses in order to show the syntax for retaining the named node. If the extra parentheses were not there and the specified tree was, for example, (VP (DET A) (NN woman) NP@), the VP would be ignored in order to retain an NP as the root. Thus, in this case, "adjoinH (VP (DET A) (NN woman) NP@) name" and "adjoinH ((DET A) (NN woman) NP@) name" both produce the same tree: (ROOT (S (NP (DET A) (NN woman) (NP (NNP Maria_Eugenia_Ochoa_Garcia))) (VP (VBD was) (VP (VBN arrested) (PP (IN in) (NP (NNP May))))) (. .))) Apply adjoinF: S < (NP=name < NNP) adjoinF (NP(DET A) (NN woman) @) name Result: (ROOT (S (NP (DET A) (NN woman) (NP (NNP Maria_Eugenia_Ochoa_Garcia))) (VP (VBD was) (VP (VBN arrested) (PP (IN in) (NP (NNP May))))) (. .))) This operation is very similar to adjoin and adjoinH, but this time the original named node ("name" in this case) is maintained as the root of the subtree that is adjoined. Thus, no node label needs to be given in front of the "@" and if one is given, it will be ignored. For instance, "adjoinF (NP(DET A) (NN woman) VP@) name" would still produce the same tree as above, despite the VP preceding the @. Apply coindex: NP=node < NNP=name coindex node name Result: (ROOT (S (NP-1 (NNP-1 Maria_Eugenia_Ochoa_Garcia)) (VP (VBD was) (VP (VBN arrested) (PP (IN in) (NP-2 (NNP-2 May))))) (. .))) This causes the named nodes to be numbered such that all nodes that are part of the same match have the same number and all matches have distinct new names. We had two instances of an NP dominating an NNP in this example, and they were renamed such that NP-i < NNP-i for each match, with 1 <= i <= number of matches. ----------------------------------------- TSURGEON SCRIPTS ----------------------------------------- Script format: Tsurgeon scripts are a combination of a Tregex pattern to match and a series of Tsurgeon operations to perform on that match. The first line of a Tsurgeon script should be the Tregex pattern. This should be followed by a blank line, and then each subsequent line may contain one Tsurgeon operation. Tsurgeon operations should not be separated by blank lines. The following is an example of correctly formatted script: S < NP=node < NNP=name relabel node NP_NAME coindex node name Comments: The character % introduces a comment that extends to the end of the line. All other intended uses of % must be escaped as \% . ----------------------------------------- CONTACT ----------------------------------------- For questions about this distribution, please contact Stanford's JavaNLP group at [email protected]. We provide assistance on a best-effort basis. ----------------------------------------- LICENSE ----------------------------------------- Tregex, Tsurgeon, and Interactive Tregex Copyright (c) 2003-2011 The Board of Trustees of The Leland Stanford Junior University. All Rights Reserved. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. For more information, bug reports, fixes, contact: Christopher Manning Dept of Computer Science, Gates 1A Stanford CA 94305-9010 USA [email protected] http://www-nlp.stanford.edu/software/tregex.shtml |