NAME
    ucsfile - The UCS data set file format

INTRODUCTION
    UCS data sets are stored in a simple tabular format, similar to that of
    a statistical table. Each row in the table corresponds to a pair type,
    and its individual fields (columns) provide various kinds of information
    about the pair type:

    * a unique ID number (unique within the data set)
    * the component lexemes
    * the pair type's frequency signature
    * [optional] contingency tables of observed and expected frequencies
      computed from the frequency signature
    * [optional] coordinates computed from the frequency signature
    * association scores and rankings for various association measures
    * arbitrary user-defined attributes, especially for the manual
      annotation of *true positives* in an evaluation study

    Following statistical terminology, the table columns are referred to as
    the variables of a data set (each of which assumes a specific value for
    each pair type). Columns are separated by a TAB character ("\t"), and
    the first row lists the variable names as table headings (see the
    section "VARIABLES" below for naming conventions).

    The actual data table may be preceded by an optional header of
    Perl-style comment lines (beginning with a "#" character). Lines with
    the special format

      ##:: <variable> = <value>

    define global variables, which may be interpreted by some of the
    UCS/Perl programs (see the section "GLOBAL VARIABLES" below). The
    variable name (*variable*) may only contain alphanumeric characters
    ("A-Z a-z 0-9") and the period ("."). The *value* may contain arbitrary
    characters, including whitespace (but leading and trailing whitespace
    will be ignored). Variable definitions must not span multiple lines.
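
    The layout described above (comment header, global variable lines,
    TAB-separated table with a heading row) can be read with a few lines
    of code. The following Python sketch is not part of UCS; the function
    name split_ucs_text and the sample data are made up for illustration,
    and tolerating missing blanks around "=" is an assumption.

```python
import re

# "##:: <variable> = <value>"; names may contain A-Z a-z 0-9 and "."
GLOBAL_RE = re.compile(r"^##::\s*([A-Za-z0-9.]+)\s*=(.*)$")

def split_ucs_text(text):
    """Split data set text into (global variables, column names, rows)."""
    global_vars, header, rows = {}, None, []
    for line in text.splitlines():
        # Comment lines are only expected before the table starts.
        if header is None and line.startswith("#"):
            match = GLOBAL_RE.match(line)
            if match:
                # Leading and trailing whitespace of the value is ignored.
                global_vars[match.group(1)] = match.group(2).strip()
            continue
        if not line:
            continue  # assumption: stray empty lines are skipped
        fields = line.split("\t")
        if header is None:
            header = fields          # first table row lists variable names
        else:
            rows.append(dict(zip(header, fields)))
    return global_vars, header, rows

# Illustrative data only, not taken from a real corpus.
sample = (
    "# a hypothetical data set\n"
    "##:: size = 2\n"
    "id\tl1\tl2\tf\tf1\tf2\tN\n"
    "1\tblack\tbox\t12\t200\t150\t10000\n"
    "2\tred\ttape\t7\t90\t40\t10000\n"
)
global_vars, header, rows = split_ucs_text(sample)
```

    All field values are returned as strings here; converting them to
    their data types is a separate step.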

    UCS data set files must have the filename extension .ds. They may be
    compressed with gzip (and they usually are), in which case they carry
    the extension .ds.gz. UCS library functions will automatically recognise
    and uncompress data set files with this extension.

    A special subtype of data sets is the annotation database file, with
    extension .adb (uncompressed) or .adb.gz (compressed). Annotation
    databases omit all frequency information and association scores, listing
    only component lexemes and user-defined attributes. They are used as
    repositories of lexical information (such as manually annotated *true
    positives* for evaluation purposes) that applies to data sets extracted
    from different corpora (or with different methods).
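
    Programs outside the UCS libraries have to handle the filename
    extensions themselves. A minimal Python sketch, assuming UTF-8 input
    by default (the helper open_ucs_file is hypothetical and not a UCS
    function):

```python
import gzip
import os
import tempfile

UCS_EXTENSIONS = (".ds", ".ds.gz", ".adb", ".adb.gz")

def open_ucs_file(path, encoding="utf-8"):
    """Open a data set or annotation database for reading as text."""
    if not path.endswith(UCS_EXTENSIONS):
        raise ValueError(f"unexpected filename extension: {path}")
    if path.endswith(".gz"):
        # Compressed files are decompressed transparently while reading.
        return gzip.open(path, "rt", encoding=encoding)
    return open(path, "r", encoding=encoding)

# Round trip through a temporary compressed file (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.ds.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as out:
        out.write("id\tl1\tl2\n1\tblack\tbox\n")
    with open_ucs_file(demo_path) as fh:
        first_line = fh.readline().rstrip("\n")
```

    Pass encoding="iso-8859-1" for files that use the other supported
    string encoding.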

GLOBAL VARIABLES
      size        number of pair types in a data set

    The only global variable that is currently supported is size, an integer
    specifying the number of pair types in a data set. Availability of the
    data set size in the header may give a slight performance improvement
    when loading data set files into memory. If size is set to an incorrect
    value, the behaviour of UCS/Perl programs and modules is undefined.

    A global variable whose name is identical to that of a variable defined
    in the data set (i.e. a table column) is interpreted as an explanatory
    note. Such notes should typically be given for all user-defined
    variables, and also for user-defined association measures.

    Unsupported variables will simply be ignored and will not raise errors
    or warnings when a data set file is parsed.

DATA TYPES
    The UCS system supports four different data types:

      BOOL      a logical (Boolean) value
      INT       a signed integer value (>= 32 bits)
      DOUBLE    a floating-point value (IEEE double precision)
      STRING    an arbitrary string (ISO-8859-1 or UTF-8)

    Boolean values are represented by 1 (true) and 0 (false). String values
    may contain blanks (but no TAB characters) and are neither quoted nor
    escaped. Full support for Unicode strings (UTF-8) is only available
    within the UCS/Perl subsystem.

    The UCS/R subsystem will interpret Boolean values as logical variables,
    and strings (except for the component lexemes) as *factor* variables
    with a fixed set of levels (which are automatically determined from the
    data).

    User-defined attributes may assume the special value "NA" for missing
    values. (Note that the string "NA" will always be interpreted as a
    missing value rather than a literal character string!) UCS/R has
    built-in support for missing values, whereas UCS/Perl represents them by
    undef entries. Programs that do not support missing values may replace
    them by 0 (BOOL and INT), 0.0 (DOUBLE), or the empty string "" (STRING).
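
    The conversion rules for the four data types, including the treatment
    of "NA", might be sketched in Python as follows. The function names
    are hypothetical, and the sketch accepts "NA" for any variable, which
    is more permissive than the rule for user-defined attributes stated
    above.

```python
def parse_value(text, dtype):
    """Convert one table field to a Python value; "NA" becomes None."""
    if text == "NA":
        return None                      # missing value
    if dtype == "BOOL":
        if text not in ("0", "1"):
            raise ValueError(f"not a Boolean value: {text!r}")
        return text == "1"
    if dtype == "INT":
        return int(text)
    if dtype == "DOUBLE":
        return float(text)
    if dtype == "STRING":
        return text                      # neither quoted nor escaped
    raise ValueError(f"unknown data type: {dtype}")

# Replacements for programs without support for missing values.
DEFAULTS = {"BOOL": False, "INT": 0, "DOUBLE": 0.0, "STRING": ""}

def fill_missing(value, dtype):
    """Replace a missing value by the default for its data type."""
    return DEFAULTS[dtype] if value is None else value
```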

    The data type of a variable is uniquely determined by the variable name,
    as detailed in the section "VARIABLES" below.

VARIABLES
    In order to be compatible with the R language, variable names may only
    contain alphanumeric characters ("A-Z a-z 0-9") and periods ("."), and
    they must begin with a letter. The main function of periods is to
    delimit words in complex variable names, replacing blanks, hyphens, and
    underscores. UCS variable names are case-sensitive.

    Periods are not allowed in Perl variable names, but UCS expressions
    provide a special syntax for direct access to data set variables (see
    the ucsexp and UCS::Expression manpages). In the rare case where plain
    Perl variables are used to store information from a data set, periods
    should be replaced by underscores ("_") in the variable names.
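
    The naming rule and the period-to-underscore replacement can be
    checked mechanically. A small Python sketch (both helper names are
    made up for illustration):

```python
import re

# A letter first, then letters, digits, or periods.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9.]*")

def is_valid_name(name):
    """True if name satisfies the UCS variable naming rule."""
    return NAME_RE.fullmatch(name) is not None

def plain_identifier(name):
    """Replace periods by underscores for use as a plain variable name."""
    return name.replace(".", "_")
```

    For example, is_valid_name rejects names such as "2x" or "am_t",
    while plain_identifier turns "am.t.score" into "am_t_score".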

    There are strict naming conventions for data set variables, which are
    detailed in the following subsections. Apart from a fixed list of core
    variables (whose names do not contain the "." character), all variable
    names begin with a period-separated prefix that determines the data type
    of the variable.

  Core Variables
    Core variables represent the minimal amount of information that must be
    present in a data set file (i.e. evidence for cooccurrences extracted
    from a corpus). All core variables are mandatory, except in the case of
    annotation database files (.adb), which omit frequency signatures ("f f1
    f2 N"). For relational cooccurrences, frequency signatures can be
    computed with the ucs-make-tables utility from a stream of pair tokens
    (cf. the ucs-make-tables manpage).

      INT    id    a numerical ID value (unique within the data set)
      STRING l1    first component type of the pair
      STRING l2    second component type of the pair

      INT    f     cooccurrence frequency of pair type
      INT    f1    marginal frequency of first component
      INT    f2    marginal frequency of second component
      INT    N     sample size (identical for all pair types)

    "id" is a numerical ID value, which must be unique within a data set.
    Its intended uses are to identify pair types in subsets selected from a
    given data set, and to validate line numbers when attributes or
    association scores are computed by an external program and re-integrated
    into the data set file.

    The lexemes "l1" and "l2" are the component (word) types that uniquely
    identify a pair type. Consequently, a data set file must not contain
    multiple rows with identical "l1" and "l2" values. UCS/Perl should
    provide reasonably good support for Unicode strings as lexemes (in UTF-8
    encoding), at least when running on Perl version 5.8.0 or newer.
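
    Both uniqueness constraints ("id" values and "l1"/"l2" combinations)
    are easy to verify after loading. A Python sketch, assuming each row
    is a dictionary keyed by variable name (find_duplicates is a
    hypothetical helper, and the rows are invented):

```python
from collections import Counter

def find_duplicates(rows):
    """Return (duplicate ids, duplicate (l1, l2) pairs) found in rows."""
    id_counts = Counter(row["id"] for row in rows)
    pair_counts = Counter((row["l1"], row["l2"]) for row in rows)
    dup_ids = sorted(k for k, c in id_counts.items() if c > 1)
    dup_pairs = sorted(k for k, c in pair_counts.items() if c > 1)
    return dup_ids, dup_pairs

# Invented rows: id 2 is repeated, and so is the pair (red, tape).
rows = [
    {"id": 1, "l1": "black", "l2": "box"},
    {"id": 2, "l1": "red", "l2": "tape"},
    {"id": 2, "l1": "red", "l2": "tape"},
]
dup_ids, dup_pairs = find_duplicates(rows)
```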

    The quadruple "f f1 f2 N" is called the frequency signature of a pair
    type. It contains all the frequency information used by association
    measures and is equivalent to a contingency table. Note that the sample
    size "N" is identical for all pair types in a data set and is included
    here mainly for convenience (so that association scores can be
    computed from the row data without reference to a global variable). See
    (Evert 2004) for more information on lexemes and frequency signatures.

  Derived Variables
    Derived variables can be computed from the frequency signatures of pair
    types, providing different "views" of the frequency information.
    Normally, they are not annotated explicitly but are accessible through
    UCS expressions, which compute the required values automatically (see
    the ucsexp and UCS::Expression manpages).

      INT    O11   contingency table of observed frequencies
      INT    O12     (computed from frequency signature)
      INT    O21
      INT    O22

      INT    R1    row sums in observed contingency table
      INT    R2
      INT    C1    column sums in observed contingency table
      INT    C2

    The variables "O11 O12 O21 O22" represent the observed contingency table
    of a pair type. Note that their frequency information is equivalent to
    the frequency signature of the pair type. In addition, the row sums ("R1
    R2") and column sums ("C1 C2") of the contingency table are also made
    available.
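
    The equivalence between frequency signature and observed contingency
    table can be made explicit. The Python sketch below assumes the usual
    layout in which rows refer to the first component and columns to the
    second, so that O11 = f, R1 = f1 and C1 = f2; the function name and
    the numbers are illustrative only.

```python
def observed_table(f, f1, f2, N):
    """Observed contingency table and marginals from a frequency signature."""
    return {
        "O11": f,                  # both components present
        "O12": f1 - f,             # first component without the second
        "O21": f2 - f,             # second component without the first
        "O22": N - f1 - f2 + f,    # neither component
        "R1": f1, "R2": N - f1,    # row sums
        "C1": f2, "C2": N - f2,    # column sums
    }

table = observed_table(f=12, f1=200, f2=150, N=10000)
```

    The four cells always add up to N, and the signature can be recovered
    as f = O11, f1 = O11 + O12, f2 = O11 + O21.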

      DOUBLE E11   contingency table of expected frequencies
      DOUBLE E12     under point null hypothesis
      DOUBLE E21     (computed from row and column sums)
      DOUBLE E22

    The variables "E11 E12 E21 E22" represent the contingency table of
    expected frequencies, i.e. the expectations of the multinomial sampling
    distribution under the point null hypothesis of independence. Most
    association measures compare observed frequencies to expected
    frequencies in some way.
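
    Under independence, each expected frequency is the product of the
    corresponding row and column sum divided by the sample size. A Python
    sketch of this computation, assuming R1 = f1 and C1 = f2 (the
    function name and the numbers are illustrative, not part of UCS):

```python
def expected_table(f1, f2, N):
    """Expected frequencies under the null hypothesis of independence."""
    R1, R2 = f1, N - f1    # row sums
    C1, C2 = f2, N - f2    # column sums
    return {
        "E11": R1 * C1 / N,
        "E12": R1 * C2 / N,
        "E21": R2 * C1 / N,
        "E22": R2 * C2 / N,
    }

expected = expected_table(f1=200, f2=150, N=10000)
```

    Note that the cooccurrence frequency f does not enter the
    computation: the expected table depends on the marginals and N only.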

    In a geometric interpretation of a data set, each pair type can be
    interpreted as a point *x* in a three-dimensional coordinate space *P*.
    Since the sample size "N" is a constant parameter within the data set,
    the coordinates of *x* are given by the joint and marginal frequencies
    "f f1 f2".

      DOUBLE lf    logarithmic coordinates 
      DOUBLE lf1     (base 10 logarithm)
      DOUBLE lf2

    Since the coordinates usually have a skewed distribution across several
    orders of magnitude, it is often more convenient to visualise them on a
    logarithmic scale. The variables "lf lf1 lf2" give the base ten
    logarithms of the coordinate triple "f f1 f2".
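
    As a brief illustration, the logarithmic coordinates can be computed
    in Python as follows (the function name is hypothetical, and all
    three frequencies are assumed to be positive):

```python
import math

def log_coordinates(f, f1, f2):
    """Base 10 logarithms (lf, lf1, lf2) of the coordinates f, f1, f2."""
    return math.log10(f), math.log10(f1), math.log10(f2)

lf, lf1, lf2 = log_coordinates(f=10, f1=1000, f2=100)
```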

      DOUBLE e     ebo-coordinates
      DOUBLE b       (expected, balance, observed)
      DOUBLE o

      DOUBLE le    logarithmic ebo-coordinates
      DOUBLE lb      (base 10 logarithm)
      DOUBLE lo

    Theoretical and empirical studies of the properties of association
    measures will often be based on transformed coordinate systems in the
    coordinate space. The most useful system is the ebo-coordinates "e b o"
    (for *expected*, *balance*, *observed*). All three coordinates range
    from 0 to infinity (constrained by the sample size parameter "N"). The
    base 10 logarithms "le lb lo" of the ebo-coordinates are convenient for
    visualisation purposes. "le" and "lb" range from -infinity to +infinity,
    while "lo" ranges from 0 to infinity (all constrained by "N").

    For backward compatibility, a transformation of the coordinate system to
    relative frequencies, which were used in earlier versions of this
    software, is also supported. The relative cooccurrence ("p") and marginal
    ("p1 p2") frequencies are computed from the frequency signature
    according to the equations "p = f/N", "p1 = f1/N", and "p2 = f2/N". Note
    that the logarithmic versions "lp lp1 lp2" are *negative* base 10
    logarithms, ranging from 0 to infinity.
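
    The equations above translate directly into code. A Python sketch
    with a hypothetical function name and made-up frequencies:

```python
import math

def relative_frequencies(f, f1, f2, N):
    """Relative frequencies and their negative base 10 logarithms."""
    p, p1, p2 = f / N, f1 / N, f2 / N
    # The sign is flipped so that the logarithmic values are >= 0.
    lp, lp1, lp2 = (-math.log10(x) for x in (p, p1, p2))
    return {"p": p, "p1": p1, "p2": p2, "lp": lp, "lp1": lp1, "lp2": lp2}

rel = relative_frequencies(f=10, f1=1000, f2=100, N=10000)
```

    Since p, p1 and p2 cannot exceed 1, the negated logarithms are never
    negative; rarer events receive larger values.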

  Association Scores and Rankings
    These variables store association scores and rankings for an arbitrary
    number of association measures. Each association measure is identified
    by a *key*, which is appended to the respective variable name prefix
    (resulting in the names "am.*key*" and "r.*key*"). See the UCS::AM
    manpage (and the manpages of the add-on packages listed there) for a
    wide range of built-in association measures.

      DOUBLE am.*  association scores from measure identified by *
      INT    r.*   ranking for this measure (ties are allowed)

    Rankings are often computed on the fly, but they may also be annotated
    in data set files. Note that the "r.*" variables should *not* break ties
    but report identical ranks (and skip an appropriate number of subsequent
    ranks). The ucs-sort program (cf. the ucs-sort manpage) can be used to
    resolve ties in various ways (using other association scores, lexical
    sort order, or randomisation).
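
    A ranking that reports identical ranks for ties and skips the
    following ranks can be sketched in Python as shown below. The sketch
    assumes that higher association scores indicate stronger association
    and therefore receive better (smaller) ranks; the function name is
    made up and this is not how UCS itself computes rankings.

```python
def tied_ranks(scores):
    """Rank scores in descending order; ties share a rank, later ranks skip."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    previous = None
    for position, index in enumerate(order, start=1):
        if previous is not None and scores[index] == scores[previous]:
            ranks[index] = ranks[previous]   # tie: reuse the earlier rank
        else:
            ranks[index] = position          # skips ranks used up by ties
        previous = index
    return ranks

ranks = tied_ranks([5.0, 7.5, 5.0, 1.0])
```

    Here the two pair types with score 5.0 share rank 2, and the last
    pair type receives rank 4 rather than 3.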

  User-Defined Variables
    User-defined variables may contain arbitrary information, which is
    typically used for filtering data sets and to determine true positives
    in evaluation tasks. However, some special-purpose association measures
    may also base their association scores on their values. In order to
    allow a minimal amount of automatic processing (such as sorting by
    user-defined attributes), the variable name prefix of a user-defined
    variable is used to determine its data type, according to the following
    list.

      BOOL   b.*   user-defined Boolean variable
      INT    n.*   user-defined integer variable (n=number)
      DOUBLE x.*   user-defined floating-point variable
      STRING f.*   user-defined string variable (f=factor)

    User-defined variables with the additional prefix "ucs" (corresponding
    to variable names "b.ucs.*", "n.ucs.*", "x.ucs.*", and "f.ucs.*") are
    reserved for internal use by UCS modules and programs.
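
    Because the data type follows from the variable name, a lookup needs
    only the core variable list and the prefix before the first period.
    The Python sketch below covers core variables, association scores,
    rankings, and user-defined variables; derived variables are left out
    for brevity, and both function names are hypothetical.

```python
CORE_TYPES = {
    "id": "INT", "l1": "STRING", "l2": "STRING",
    "f": "INT", "f1": "INT", "f2": "INT", "N": "INT",
}
PREFIX_TYPES = {
    "am": "DOUBLE", "r": "INT",                             # scores, rankings
    "b": "BOOL", "n": "INT", "x": "DOUBLE", "f": "STRING",  # user-defined
}
USER_PREFIXES = ("b", "n", "x", "f")

def data_type(name):
    """Return the data type implied by a variable name."""
    if name in CORE_TYPES:
        return CORE_TYPES[name]
    prefix, dot, key = name.partition(".")
    if dot and key and prefix in PREFIX_TYPES:
        return PREFIX_TYPES[prefix]
    raise ValueError(f"cannot determine data type of {name!r}")

def is_reserved(name):
    """True for user-defined names reserved for UCS (e.g. "b.ucs.*")."""
    parts = name.split(".")
    return len(parts) >= 3 and parts[0] in USER_PREFIXES and parts[1] == "ucs"
```

    Note that "f" on its own is the core cooccurrence frequency (INT),
    whereas a name such as "f.pos" denotes a user-defined string
    variable.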

REFERENCES
    Evert, Stefan (2004). *The Statistics of Word Cooccurrences: Word Pairs
    and Collocations.* PhD Thesis, University of Stuttgart, Germany.

COPYRIGHT
    Copyright (C) 2004 by Stefan Evert.

    This software is provided AS IS and the author makes no warranty as to
    its use and performance. You may use the software, redistribute and
    modify it under the same terms as Perl itself.

