GETTING STARTED WITH UCS/Perl
  This introduction is intended to make you familiar with UCS/Perl, which is
  the core of the UCS system.  The UCS/Perl libraries and tools allow you to
  create, manipulate, filter, sort, and print cooccurrence data sets.  A
  typical application of such cooccurrence data is to serve as raw material
  for collocation identification.  For this purpose, the pair types of a data
  set are ranked according to statistical association measures.  UCS/Perl can
  be used both for the annotation of association scores and for the ranking
  process.  A graphical evaluation against a gold standard of true
  collocations can then be performed in the UCS/R part.

  If you only want to use the UCS/R evaluation functions, you can turn
  directly to the UCS/R tutorial script.  Change to the "System/R/" directory
  and follow the instructions in the "README" file there.

 Preparing for the Tutorial
  The remainder of this section is a walk-through of the UCS/Perl command-line
  tools.  Most of their functionality (and some additional features) is also
  available through a programmer interface in the form of a set of Perl
  modules.  If you want to write your own UCS/Perl programs, you will have to
  find your own way through the comprehensive documentation.  The UCS/Perl
  command-line tools and several additional example scripts provide a good
  starting point for your own work.  Note that you can easily configure your
  scripts (so that they have access to the UCS/Perl libraries) with the help
  of the ucs-config program.

  This tutorial assumes that you have already configured the UCS system and
  installed the command-line utilities in your search path, as described in
  the main README file.  In this case, you can skip the remainder of this
  section.

  Otherwise, you will have to specify full paths to the tools in each of the
  examples below.  For this purpose, it is convenient to define a shell
  variable $UCS pointing to the System directory of the UCS installation. 
  Execute one of the following lines, depending on whether your shell is
  "bash" or "tcsh" (if you don't know, type "echo $SHELL", or simply try both
  commands).

    export UCS=`ucs-config --base-dir`  # in sh or bash

    setenv UCS `ucs-config --base-dir`  # in tcsh

  Having set this shell variable, you can just type "$UCS/bin/ucs-add" instead
  of "ucs-add" to invoke the ucs-add program in the examples below, and
  similarly for all other command-line programs.
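
  If you work on several machines, some with the tools in the search path
  and some without, a small wrapper function can hide the difference.  The
  following is only a sketch: the name "ucs_run" is made up for this tutorial
  and is not part of the UCS distribution.  It relies on nothing but the $UCS
  variable defined above.

```shell
# ucs_run is a hypothetical convenience wrapper, not part of UCS.
# Usage: ucs_run <tool> [arguments...]
ucs_run() {
    tool=$1
    shift
    if [ -n "${UCS:-}" ] && [ -x "$UCS/bin/$tool" ]; then
        # $UCS is set and the tool exists there: call it by its full path
        "$UCS/bin/$tool" "$@"
    else
        # otherwise rely on the search path
        "$tool" "$@"
    fi
}
```

  With this definition, "ucs_run ucs-add ..." can be typed wherever the
  examples below say "ucs-add ...".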

 Tutorial Introduction to UCS/Perl
  You should now change to a scratch directory (e.g. in your home directory or
  in the "/tmp" directory) where we can put the data files created by the
  examples in the tutorial.  These files can be deleted after you have stepped
  through the examples.

  UCS/Perl comes with fairly comprehensive documentation embedded into the
  modules and programs in POD format.  The "ucsdoc" program provides a
  convenient interface to this documentation.  Simply type

    ucsdoc <ProgramName>

  or

    ucsdoc <ModuleName>

  to read the respective manual page.  The starting point for all UCS/Perl
  documentation is the ucsintro document:

    ucsdoc ucsintro

  If you have installed Perl/Tk and the Tk::Pod module, you can also view
  the manpages in a GUI window:

    ucsdoc -tk ucsintro

  Of course, "ucsdoc ucsdoc" will tell you more about the "ucsdoc" program and
  its options.  If you prefer paper documentation, you can print the entire
  UCS/Perl documentation, using one of the additional UCS/Perl scripts
  provided in the "contrib/" directory.  Such "contributed" scripts can easily
  be invoked with the ucs-tool program:

    ucs-tool print-documentation --collate UCS-Perl-Doc

  This command will create a PostScript file "UCS-Perl-Doc.ps" in the current
  directory, which you may delete after printing.  In case of any problems you
  should omit "--collate", so that the individual manpages will be saved to
  separate files "UCS-Perl-Doc-001.ps", "UCS-Perl-Doc-002.ps", etc. (You can
  also convert documentation into LaTeX format with the "--latex" option.)

  First of all, you need to understand the UCS data set file format.  You
  should read the ucsfile manpage carefully now ("ucsdoc ucsfile").  The UCS
  distribution includes the following example data sets for your first
  experiments:

  "dickens.ds.gz"
      adjective + noun cooccurrences from a corpus of novels by Charles
      Dickens (3.4 million words)

  "fr-pnv.ds.gz"
      German PP+verb cooccurrences from the Frankfurter Rundschau corpus (40
      million words)

  "glaw.ds.gz"
      German adjective+noun cooccurrences from a small corpus of freely
      available law texts (< 1 million words), with manual annotation of
      "usual combinations"

  You will find these data sets in the "DataSet/Distrib/" directory.  UCS data
  set files have the form of statistical tables, with rows corresponding to
  pair types and columns to variables.  They are stored in a simple text
  format which is compatible with the R environment.  Data set files are
  usually compressed with "gzip" to save space and carry the filename
  extension ".ds.gz".  Direct viewing of data set files (e.g. with "zmore") is
  inconvenient.  For this purpose, UCS/Perl provides the "ucs-info" and
  "ucs-print" programs.

  "ucs-info" displays information from the header of a data set file.  Try:

    ucs-info fr-pnv.ds.gz

    ucs-info glaw.ds.gz

  Because these data sets are stored in the global data set directory (or,
  more precisely, in one of its subdirectories), it is sufficient to enter the
  name of the data set file without a full path.  If no file with the
  specified name is found in the current directory, the UCS/Perl programs will
  automatically search the global data set directory for a matching filename. 
  If the data set header does not show its size (i.e. the number of rows in
  the table) or you do not trust it, you can check the actual size of the data
  set with the "-s" option.

    ucs-info -v -s fr-pnv.ds.gz

  (The "-v" option keeps you entertained while the data set is being read.) 
  You can also display a list of all variables defined in the data set with
  the "-l" option.

    ucs-info -l fr-pnv.ds.gz

    ucs-info -l glaw.ds.gz

  Compare these listings with the documentation in ucsfile.  Also note how an
  explanatory comment is displayed with the user-defined variable "n.accept"
  in "glaw.ds.gz".

  "ucs-print" formats a data set file as an ASCII table suitable for viewing
  and printing.  It is most useful with the "-i" option, which sends the
  formatted table to a pager for interactive viewing (you should install the
  Term::ReadKey module for optimal results).

    ucs-print -i dickens.ds.gz

    ucs-print -i glaw.ds.gz

  You should now be able to page through the data set file by pressing SPACE
  (one page forward) and BACKSPACE (one page backward).  The "ucs-print"
  utility has several other options.  Like all other UCS/Perl programs, it
  will display a short usage reminder when called with the "-h" option:

    ucs-print -h

  Enter "ucsdoc ucs-print" to see the full manual page.

  The "ucs-summarize" program computes statistical summaries for numerical
  variables, e.g. for the cooccurrence frequency "f":

    ucs-summarize -v f FROM dickens.ds.gz

  or simply leave out the variable name(s) to compute summaries for all data
  set variables.

    ucs-summarize -v dickens.ds.gz

  Again, check the manual page for additional options and detailed
  information.

  Now that you are familiar with the data set file format, let us manipulate
  the data sets.  The "ucs-sort" utility changes the order of the rows in a
  data set by sorting on one or more variables.

    ucs-sort -v dickens.ds.gz BY f- INTO sorted.ds.gz

  This sorts the Dickens data set by cooccurrence frequency (decreasing) and
  creates a new data set file "sorted.ds.gz" in the current directory.  The
  "-" character after the variable name "f" selects decreasing sort order. 
  Without an explicit "+" or "-", the sort order is automatically chosen. 
  When you display the sorted data set, you will notice that there are many
  ties, i.e. pair types with the same cooccurrence frequency.

    ucs-print -i sorted.ds.gz

  You can break such ties randomly with the "-r" option

    ucs-sort -v -r dickens.ds.gz BY f- INTO sorted.ds.gz
    ucs-print -i sorted.ds.gz

  or alphabetically by specifying additional sort keys.  In this example, we
  sort first on the noun, then the adjective:

    ucs-sort -v dickens.ds.gz BY f- l2+ l1+ INTO sorted.ds.gz
    ucs-print -i sorted.ds.gz
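
  The effect of several sort keys can be reproduced on a toy table with the
  standard "sort" utility.  This is only an analogy on made-up data (three
  tab-separated columns l1, l2 and f, without a header), not a description of
  how "ucs-sort" works internally.  The function name "sort_pairs" is
  hypothetical.

```shell
# sort_pairs: read lines "l1 TAB l2 TAB f" from standard input and sort them
# by f (numeric, decreasing), then by l2 and l1 (alphabetical, increasing).
sort_pairs() {
    tab=$(printf '\t')
    LC_ALL=C sort -t "$tab" -k3,3nr -k2,2 -k1,1
}

# Made-up rows; the two pairs with f = 7 are tied on the first sort key.
printf 'old\tman\t7\nyoung\tlady\t7\ngood\tdeal\t12\n' | sort_pairs
```

  The tie between the two pairs with f = 7 is resolved by the second key, so
  the row with the noun "lady" is printed before the row with "man".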

  When the "INTO" clause is omitted, the resulting data set is printed on
  STDOUT (in the data set file format).  This feature often allows us to
  combine UCS/Perl programs into command pipes without having to save
  intermediate results into files.  Here is a single-line version of the above
  commands:

    ucs-sort dickens.ds.gz BY f- l2+ l1+ | ucs-print -i 

  If you just got a SIGPIPE warning, don't worry.  That is just because you
  quit the pager without going through the entire data set, so some of the
  data printed by "ucs-sort" was discarded.

  The two most important tools are probably "ucs-add" and "ucs-select".  The
  "ucs-add" program allows you to annotate a data set with association scores,
  rankings, and other variables.  Let us add association scores for two
  well-known association measures to the Dickens data set:

    ucs-add -v am.t.score am.log.likelihood TO dickens.ds.gz INTO scores.ds.gz
    ucs-print -i scores.ds.gz

  By the way: if you don't like the uppercase keywords "TO" and "INTO", you
  are also allowed to type them in lowercase ("to", "into") or mixed case
  ("To", "Into").  The default versions are meant to give a better visual
  subdivision of the command line.
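
  For orientation, the two measures can be written down in a few lines of
  awk.  The sketch below uses the textbook definitions of the t-score and of
  the log-likelihood ratio for a 2x2 contingency table; it is an assumption
  that these agree in every detail with "am.t.score" and "am.log.likelihood",
  so check the UCS documentation before relying on it.  The function name
  "assoc_scores" and the input layout (l1, l2, f, f1, f2, N separated by TABs,
  no header) are made up for this example.

```shell
# assoc_scores: read lines "l1 TAB l2 TAB f TAB f1 TAB f2 TAB N" and print
# "l1 TAB l2 TAB t-score TAB log-likelihood".  Assumes f >= 1 on every line.
assoc_scores() {
    awk -F '\t' '
        # o * ln(o / e), with the convention that the term is 0 for o = 0
        function term(o, e) { return (o > 0) ? o * log(o / e) : 0 }
        {
            f = $3; f1 = $4; f2 = $5; n = $6
            e11 = f1 * f2 / n                  # expected joint frequency
            t = (f - e11) / sqrt(f)            # textbook t-score
            # remaining observed cells of the contingency table
            o12 = f1 - f; o21 = f2 - f; o22 = n - f1 - f2 + f
            g2 = 2 * (term(f, e11) \
                    + term(o12, f1 * (n - f2) / n) \
                    + term(o21, (n - f1) * f2 / n) \
                    + term(o22, (n - f1) * (n - f2) / n))
            printf "%s\t%s\t%.3f\t%.3f\n", $1, $2, t, g2
        }'
}

# Made-up frequency signature, for illustration only.
printf 'young\tlady\t10\t100\t50\t10000\n' | assoc_scores
```

  A pair whose observed frequency equals its expected frequency receives a
  score of zero from both formulas.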

  The most "significant" cooccurrences are those with the highest association
  scores.  We will now re-sort the data set to put these at the top:

    ucs-sort scores.ds.gz BY am.t.score | ucs-print -i
    ucs-sort scores.ds.gz BY am.log.likelihood | ucs-print -i

  (The default sort order for association scores is descending, so we do not
  have to put an explicit "-" after the variable name.)  Note how the two
  association measures disagree about which cooccurrences are most
  significant.  The actual differences can be seen more clearly when we add
  ranks according to each of the association scores to the data set:

    ucs-add -v 'r.%' TO scores.ds.gz INTO ranks.ds.gz

  In this example, we have used a UCS wildcard pattern ('r.%') to compute
  rankings for all available association scores without having to type each
  one explicitly.  Have a look at the ucsexp manpage to learn more about such
  patterns.  We can now directly compare the ranks assigned to each pair
  type:

    ucs-sort ranks.ds.gz BY am.t.score | ucs-print -i 'r.%' '*' FROM - 

  Note the use of wildcard patterns to display only some of the variables and
  to re-order the columns.  The special filename "-" can be used to read from
  standard input (e.g. in a command pipe) when the "FROM" clause is mandatory.
  Read the ucs-add manpage to learn about the many other possibilities it
  offers.
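
  Conceptually, a ranking numbers the rows after sorting them by score.  The
  sketch below shows this on made-up data with standard tools.  It is a
  simplification: tied scores simply receive consecutive ranks here, which
  need not match the way UCS treats ties.  The function name "add_rank" is
  hypothetical.

```shell
# add_rank: read "pair TAB score" lines and append a rank column
# (1 = highest score).  Ties get consecutive ranks in this simplified sketch.
add_rank() {
    tab=$(printf '\t')
    LC_ALL=C sort -t "$tab" -k2,2nr | awk -F '\t' '{ print $0 "\t" NR }'
}

# Made-up scores, for illustration only.
printf 'old man\t3.1\ngood deal\t9.5\nyoung lady\t4.2\n' | add_rank
```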

  The "ucs-select" command is used to select rows and/or columns from a data
  set, or to count rows that satisfy a specified condition.  If you are just
  interested in the rankings, you can select the two relevant variables and
  save them to a new data set file or display them directly with "ucs-print".

    ucs-select 'r.%' FROM ranks.ds.gz | ucs-print -i

  This actually has the same effect as

    ucs-print -i 'r.%' FROM ranks.ds.gz

  As the next step, let us count the number of pair types with cooccurrence
  frequency >= 10.  This condition is specified in the form of a UCS
  expression on the command line.

    ucs-select -v --count FROM ranks.ds.gz WHERE '%f% >= 10'

  A UCS expression is simply a snippet of Perl code (which is compiled and
  executed on the fly) with a special syntax to access data set variables.  In
  the example above, "%f%" is set to the respective value of the "f" variable
  as the expression is applied to each row of the data set.  UCS expressions
  are one of the most important elements of UCS/Perl - study the ucsexp
  manpage carefully now.

  Another simple example counts the number of pair types which are among the
  500 highest-scoring pairs according to both measures.

    ucs-select -v --count FROM ranks.ds.gz 
               WHERE 'max(%r.t.score%, %r.log.likelihood%) <= 500'

  (Of course, this command has to be entered as a single line in the shell.)
  The built-in utility function "max()" is automatically available in UCS
  expressions (cf. the UCS::Expression::Func manpage).  We can also save all
  rows that satisfy this condition to a new data set, selecting all columns
  with the "%" wildcard.

    ucs-select -v '%' FROM ranks.ds.gz INTO highest.ds.gz
               WHERE 'max(%r.t.score%, %r.log.likelihood%) <= 500'

    ucs-info -l highest.ds.gz
    ucs-print -i highest.ds.gz
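
  The condition used in the two commands above says that the worse of the two
  ranks must not exceed 500.  The same test can be sketched with awk on a
  plain table of ranks.  The function name "count_in_both" and the three-column
  input (pair, first rank, second rank, separated by TABs) are assumptions
  made for this example only.

```shell
# count_in_both: read "pair TAB rank1 TAB rank2" lines and count the rows
# whose two ranks are both <= limit, i.e. max(rank1, rank2) <= limit.
# Usage: count_in_both <limit>
count_in_both() {
    awk -F '\t' -v limit="$1" '
        { worst = ($2 + 0 > $3 + 0) ? $2 + 0 : $3 + 0 }   # larger of the two ranks
        worst <= limit + 0 { n++ }
        END { print n + 0 }'
}
```

  For instance, a row with ranks 2 and 600 is not counted for a limit of 500,
  because only one of the two measures places it among the top 500.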

  We can now easily work with this new small data set, or re-sort and view it.
  Such small subsets extracted from a data set are also suitable for printing.
  Running "ucs-print" with the "--postscript" (or "-ps") option creates a
  PostScript file that can be sent to an appropriate printer:

    ucs-print -v -ps -l -p 50 -o highest.ps highest.ds.gz

  You can now preview the result with "gv highest.ps".  Check the ucs-print
  manpage for an explanation of the options used in the example above.

  Thanks to the use of UCS expressions, "ucs-select" has the full power of
  Perl, with access to all built-in functions ("perldoc perlfunc") and the
  complete standard library.  For example, it is easy to retrieve all
  collocates of nouns ending in -ness.

    ucs-select -v '*' 'r.%' FROM ranks.ds.gz WHERE '%l2% =~ /ness$/'
               | ucs-sort by l2 l1 | ucs-print -i

  It is often useful to store manual annotations (e.g. variables marking true
  collocations) in separate files.  A data set without frequency information
  (i.e. without the frequency signature f, f1, f2, and N) is called an
  "annotation database" and conventionally has the extension ".adb.gz".  The
  UCS distribution includes an annotation database for German PP+verb pairs,
  which was kindly provided by Brigitte Krenn (FAI, Vienna).

    ucs-info -l pnv.adb.gz
    ucs-print -i pnv.adb.gz

  We can easily find out the number of pair types that were identified as
  collocations with the "ucs-select" command:

    ucs-select -v --count FROM pnv.adb.gz WHERE '%b.figur%'
    ucs-select -v --count FROM pnv.adb.gz WHERE '%b.fvg%'

  In order to use these annotations with cooccurrence data extracted from a
  corpus, the annotation attributes have to be transferred to a data set file.
  This is achieved with the "ucs-join" program.  Simply calling ucs-join with
  the two files as arguments will check the coverage of the annotation
  database:

    ucs-join -v fr-pnv.ds.gz pnv.adb.gz

  We can now copy the "b.figur" and "b.fvg" attributes to the data set
  "fr-pnv.ds.gz", and save the result into a new data set file.

    ucs-join -v fr-pnv.ds.gz WITH b.figur b.fvg FROM pnv.adb.gz 
                INTO fr-annotated.ds.gz

    ucs-info -l fr-annotated.ds.gz

  If any of the pair types are not covered by the annotation database, they
  will be annotated with missing values (NA).
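
  The operation is essentially a left join on the pair (l1, l2).  The awk
  sketch below illustrates the idea, including the NA for uncovered pairs.  It
  works on made-up headerless tables and says nothing about the internals of
  "ucs-join".  The function name "annotate" is hypothetical.

```shell
# annotate: left-join a pair list with an annotation table on (l1, l2).
#   $1             = annotation file with lines "l1 TAB l2 TAB flag"
#   standard input = pairs as lines "l1 TAB l2 TAB f"
# Pairs without an annotation get NA in the new column.
# (The annotation file is assumed to be non-empty.)
annotate() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR { flag[$1 FS $2] = $3; next }    # first file: annotations
        {
            key = $1 FS $2
            print $0, ((key in flag) ? flag[key] : "NA")
        }' "$1" -
}
```

  Rows of the pair list are never dropped.  Only the extra column varies,
  which is why coverage can be judged by counting the NA values afterwards.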

  Once we have added association scores and rankings to the data set, we can
  easily compute the precision and recall of N-best lists (i.e. the N
  highest-ranked pairs according to some association measure).  Note how
  the "-m" option of "ucs-add" allows us to write back the modified data set
  to the same file:

    ucs-add -v -m am.log.likelihood r.log.likelihood 
                  TO fr-annotated.ds.gz INTO fr-annotated.ds.gz

  In the PP+verb annotation database, figurative expressions and support-verb
  constructions are marked separately.  However, we want to accept both as
  true collocations, so the condition for true positives is "%b.figur% or
  %b.fvg%".  It would be convenient to have a single variable marking true
  positives.  We can create such a variable, which we will call "b.TP", by
  evaluating a user-defined UCS expression with the "ucs-add" program.

    ucs-add -v -m 'b.TP := %b.figur% or %b.fvg%' 
                  TO fr-annotated.ds.gz INTO fr-annotated.ds.gz

  Now it is easy to evaluate the N-best lists against all true positives:

    ucs-select -v --count FROM fr-annotated.ds.gz 
                  WHERE '%b.TP% and %r.log.likelihood% <= 500'
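
  The count returned by this command is the number of true positives in the
  500-best list.  Dividing it by 500 gives the precision, and dividing it by
  the total number of true positives gives the recall.  The awk sketch below
  carries out this arithmetic on a made-up two-column table (rank and a 0/1
  true-positive flag).  The function name "nbest_eval" is hypothetical, and
  rows whose flag is NA are simply treated as non-collocations here.

```shell
# nbest_eval: read "rank TAB tp" lines (tp is 1 for a true positive) and
# print precision and recall of the n-best list.  Usage: nbest_eval <n>
# Assumes the table contains at least n rows.
nbest_eval() {
    awk -F '\t' -v n="$1" '
        $2 == 1 { total++ }                       # all true positives
        $2 == 1 && $1 + 0 <= n + 0 { hits++ }     # true positives among the n best
        END {
            if (n + 0 == 0 || total == 0) { print "undefined"; exit }
            printf "precision=%.3f recall=%.3f\n", hits / n, hits / total
        }'
}
```

  With five ranked pairs of which the first, third and fourth are true
  positives, "nbest_eval 2" reports a precision of 1/2 and a recall of 1/3.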

  You can also create your own data sets for relational cooccurrences, with
  the help of the "ucs-make-tables" program.  For relational cooccurrences,
  each pair token (= instance) represents a structural relation between words
  (or other morpho-syntactic units).  Examples are adjectives modifying nouns
  (as in the Dickens and GLAW data sets) or PPs that are P-objects or adjuncts
  of a verb (as in the FR-PNV data set).  Positional cooccurrences (words
  occurring in the same sentences or within a certain distance from each
  other) are more difficult to count properly and you will have to construct
  such data sets on your own.

  "ucs-make-tables" takes its input - which is a stream of pair tokens - from
  an extraction tool that the user has to provide.  Each line of this stream
  represents a pair token and has the format

    <l1> TAB <l2>

  where <l1> is the type (= lexeme) of the first component of the pair token,
  and <l2> is the type of its second component.  The extraction tool should
  print the token stream on standard output so that it can be connected to
  "ucs-make-tables" through a pipe:

    <YourExtractionTool> | ucs-make-tables -v <dataset.ds.gz>

  Type "ucsdoc ucs-make-tables" to learn about the available command-line
  options.
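
  To see what has to be counted, here is a rough awk sketch that turns a
  stream of pair tokens into the frequency signature (f, f1, f2, N) of each
  pair type.  It is meant as an illustration of the input format only: the
  function name "count_pairs" is made up, the output is a bare table rather
  than a UCS data set file, and none of the "ucs-make-tables" options are
  modelled.

```shell
# count_pairs: read pair tokens "l1 TAB l2" from standard input and print one
# line per pair type: l1, l2, f, f1, f2, N (tab-separated), where
#   f  = number of tokens of this pair type
#   f1 = number of tokens whose first component is l1
#   f2 = number of tokens whose second component is l2
#   N  = total number of pair tokens
count_pairs() {
    awk -F '\t' -v OFS='\t' '
        NF == 2 { f[$1 FS $2]++; f1[$1]++; f2[$2]++; n++ }
        END {
            for (key in f) {
                split(key, w, FS)
                print w[1], w[2], f[key], f1[w[1]], f2[w[2]], n
            }
        }' | LC_ALL=C sort
}

# A made-up token stream, as a stand-in for a real extraction tool.
printf 'old\tman\nold\tman\nold\thouse\nyoung\tman\n' | count_pairs
```

  Any program that prints one "l1 TAB l2" line per pair token can take the
  place of the "printf" command in the last line.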

  The UCS/Perl distribution includes example scripts in the
  System/Perl/contrib/ directory tree that extract cooccurrence data from a
  corpus encoded in the IMS Corpus Workbench (CWB).
  [ "http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/" ]

  If you have installed the CWB, the CWB/Perl interface modules, and the
  demonstration corpus provided with the CWB (DICKENS), you can re-create the
  Dickens data set with the following commands:

    ucs-tool adj-n-from-cwb penn DICKENS 
          | ucs-make-tables -v -f 3 my-dickens.ds.gz

    ucs-info -l my-dickens.ds.gz

  You can also import data sets from the Ngram Statistics Package (NSP)
  [ "http://ngram.sourceforge.net/" ].  For instance, if you have a file named
  "bigrams.cnt" that was created with NSP's "count.pl" tool, the following
  command converts it into a UCS data set:

    ucs-tool nsp2ucs -v bigrams.cnt bigrams.ds.gz
    ucs-info -l bigrams.ds.gz

  Note that there are usually no manual pages for such "contributed" scripts. 
  Run them with the option "-h" for a short description of their purpose and
  usage information:

    ucs-tool adj-n-from-cwb -h
    ucs-tool segment-from-cwb -h
    ucs-tool nsp2ucs -h
    ucs-tool make-dummy-ds -h
    ucs-tool count-collocates -h
    ucs-tool dispersion-test -h

  (When a manual page is available, it can be displayed with the "--doc"
  option, e.g. "ucs-tool --doc nsp2ucs".)  You can list all contributed
  scripts with

    ucs-tool --list

  or all scripts that import data sets from external programs with

    ucs-tool --list --category=Import

