Notes on Audio File Headers                           P. Kabal  Sept. 18, 1995

Information Records in AFsp Audio Files

AFsp audio files use the Sun audio file header, but add compatible information
in an extensible part of the header.  Sun audio file headers allow for an
arbitrary length information field following a fixed format portion of the
header.  The purpose of this note is to suggest a standard format for this
information.  Files with this standard header information encoding will be
referred to as AFsp audio files.  The goal is to provide a simple standard
mechanism for adding information to the header, and more importantly provide
for very easy extraction of relevant information.  Furthermore, AFsp audio
files are upward compatible with Sun/NeXT audio files, which use the same
encoding for the fixed part of the header.

The information that will be emphasized here is that which can be used to track
processing of speech and audio files.  The suggested format, though, has no
inherent limitation on the information that can be used.  There have been other
attempts to standardize file headers.  However, in the speech coding community
much of the processing is done with headerless files, perhaps because of the
lack of standardization, or perhaps because some of the header formats are
considered to be too complicated to decode dynamically.  If they are not
decoded dynamically, then there is little purpose in leaving the header in
place.

NIST SPHERE audio file headers:
This is a general format - but does not require any particular records to be
present.  For instance, the sampling rate is one record that is important for
playback and processing, but that information is not required to be present.
The format of the sample data is encoded in three different records
(sample_n_bytes, sample_byte_format and sample_sig_bits).  The format of the
header makes for complicated decoding of record information.  There are three
types of data supported: strings, integers and reals.  String values are
allowed to consist of any characters, including the newline character that
normally marks the end of a record.  This means that in scanning the header
for a particular named record, all preceding records must be fully parsed to
determine record boundaries.  The NIST SPHERE header is fixed length - easy to
skip, but perhaps also wasteful.  Routines are available to manipulate the
header, but because they verify correct syntax for the header records, these
are a bit on the heavy-weight side.

ESPS:
This is again a general format.  However it is proprietary.  It does have fixed
information at the beginning of the header, followed by record oriented data.
The record oriented data uses a complicated record structure which requires
full parsing of preceding records to retrieve a particular record.  One piece
of information that is missing from the fixed format part of the header is the
sampling rate.  This information is only available as one of the records in
the variable part of the header.

Sun audio files:
Under this category are Sun and NeXT audio files.  They share the same basic
format for the file header.  The file header consists of 6 fixed locations
followed by a variable length segment.  One of the fixed locations gives the
total header length, making skipping over the header relatively easy.  The
remaining fixed information encodes the data format, the number of samples,
number of channels, and the sampling rate (integer value).  With much of the
essential information in the fixed part of the header, and guaranteed to be
present, the remaining information is less critical.  This extra information
can be very useful for identification purposes.
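
As an illustration of how little work the fixed part requires, the following is
a minimal Python sketch of decoding it.  It assumes the usual Sun/NeXT
convention of six big-endian 32-bit fields beginning with the characters
".snd"; the function name and the dictionary keys are hypothetical and are not
taken from this note.

```python
import struct

# Assumed layout: six big-endian 32-bit fields (24 bytes in total).
FIXED_FORMAT = ">4sIIIII"
FIXED_SIZE = struct.calcsize(FIXED_FORMAT)

def parse_fixed_header(raw):
    """Decode the fixed part of a Sun-style header from the leading bytes."""
    if len(raw) < FIXED_SIZE:
        raise ValueError("fewer than 24 bytes available")
    magic, header_len, data_len, encoding, rate, channels = struct.unpack(
        FIXED_FORMAT, raw[:FIXED_SIZE])
    if magic != b".snd":
        raise ValueError("not a Sun/NeXT audio file")
    return {
        "header_len": header_len,   # total header length, used to skip it
        "data_len": data_len,       # amount of audio data, as stored
        "encoding": encoding,       # data format code
        "sample_rate": rate,        # integer sampling rate
        "channels": channels,
    }
```

The audio data then begins header_len bytes from the start of the file,
whatever the variable length segment contains.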

Standardized Header Information Records for Sun Audio Files:

The basic proposal is to adopt Sun audio files as a base.  Sun audio files,
particularly for mu-law coded data, are widely used for playback in computer
workstations.  The header provides for a variety of data formats, integer as
well as floating point.  Furthermore the header has provision for both
essential data in the fixed part of the header and additional data in the
extensible part of the header.  To define the extensible part of the header, a
prime concern is to provide for easy decoding, without limiting the scope of
information that can be stored.

AFsp Audio Files:

Audio files which use the basic Sun header but add structured fields to the
extensible part of the header will be referred to as AFsp audio files.
It is proposed that records be separated by null characters, with the null
character prohibited from appearing elsewhere in a record.  This provision
makes for easy retrieval of a record with a particular name without decoding
intervening records.  Also, the banning of nulls in records allows header
processing routines to return null-terminated string values, a format that is
very convenient for C-language routines.

Definition of AFsp Information records:

<record><\0>,   where <record> is any sequence of characters except null

Header information records are delimited by null characters.  Standard records
contain a name part and a value part.  It is proposed that the record retrieval
method be simply to match a "name" string to the first part of a record.  If a
match is found, any characters following the name represent the value field of
that named record.
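
The retrieval method just described can be sketched in a few lines of Python.
This is only an illustration under the rules stated above; the function name
get_record is hypothetical, and the information part of the header is assumed
to be available as a byte string.

```python
def get_record(info, name):
    """Return the value of the first record that starts with name, or None.

    info is the information part of the header (bytes).  name includes its
    trailing ':' (for example b"date:").
    """
    # Records are delimited by null characters, so a split finds the record
    # boundaries without interpreting the contents of any record.
    for record in info.split(b"\0"):
        if record.startswith(name):
            return record[len(name):]   # everything after the name
    return None
```

For example, with info set to b"date:1993/01/20 18:12:20 UTC\0user:kabal\0",
the call get_record(info, b"user:") returns b"kabal".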

Record names:
To extract specific information from the header, named records can be used for
identifying the records.  For instance, a date record would give the file
creation date and time.  Standard names shall have a trailing ':' character
which can be thought of as a separator between the name and the value, but is
really part of the name itself.  For example a standard date record would
appear as

"date:1993/01/20 18:12:20 UTC"

The name of this record is "date:".  The value field, in this case a string,
is "1993/01/20 18:12:20 UTC".

Values:
The value part of the record can be interpreted as desired, but two clear
alternatives arise: string values and numeric values.  A value can always be
interpreted as a string (possibly empty).  A numeric value requires that the
characters in the value field represent a valid numeric value.
The numeric value shall be a character representation of a decimal number
expressed either as an integer, a simple floating point number, or a floating
point number with a power of 10 exponent.  Thus a "sample_rate:" record could
specify a sampling rate of 8000 Hz as

"sample_rate:8e3"  or "sample_rate:8000.0" or "sample_rate:8000"

Numeric values should be treated as floating point, and if appropriate
converted by the user to integer.  For instance, information that is clearly
integer in nature is just a special case of a floating point value with no
fractional part.
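
A possible way to decode such a value is sketched below in Python.  The
regular expression is one reading of the three allowed forms (integer, simple
floating point, floating point with a power of 10 exponent); it is an
assumption, as is the function name numeric_value.

```python
import re

# Optional sign, digits with an optional fraction, optional power of 10.
_NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")

def numeric_value(value):
    """Interpret the value field of a record (a str) as a floating point number."""
    text = value.strip()
    if not _NUMBER.fullmatch(text):
        raise ValueError("not a valid numeric value: %r" % value)
    return float(text)

# The three forms of the sampling rate shown above decode to the same number.
rates = [numeric_value(v) for v in ("8e3", "8000.0", "8000")]
```

A caller that needs an integer can convert the result itself, for instance
with int(numeric_value("8000")).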

Identifier:
Files using this type of encoding must have the characters "AFsp" as the
first 4 characters (bytes 24-27) following the fixed part of the header.  The
standardized information records start immediately after this 4 character
sequence (at byte offset 28).
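
The check for the identifier can be sketched as follows in Python.  The
function name afsp_info is hypothetical; header is assumed to hold the
complete header (fixed part plus variable part) as a byte string.

```python
FIXED_SIZE = 24        # size of the fixed part of the header in bytes
AFSP_ID = b"AFsp"      # identifier at bytes 24-27

def afsp_info(header):
    """Return the bytes holding the information records, or None.

    None means the identifier is absent, in which case the variable part of
    the header should not be interpreted as AFsp records.
    """
    if header[FIXED_SIZE:FIXED_SIZE + len(AFSP_ID)] != AFSP_ID:
        return None
    return header[FIXED_SIZE + len(AFSP_ID):]   # records start at offset 28
```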

Padding:
The header should be padded out with null characters to a length which is a
multiple of 4 bytes.  This is not strictly necessary, but is useful if the AFsp
file is accessed by other programs.  For instance, an audio file playback
program which assumes headerless files, but which provides for playback to
start at an arbitrary sample number, can be used to skip the header if the
header has a size which is equal to the size of an integral number of samples.
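
As a sketch of how a writer might assemble the variable part of the header with
this padding, consider the following Python fragment.  The function name
build_info is hypothetical, and terminating every record with a null before
padding is a choice made for this example.

```python
FIXED_SIZE = 24   # size of the fixed part of the header in bytes

def build_info(records):
    """Return the "AFsp" identifier, the records, and null padding as bytes."""
    for record in records:
        if b"\0" in record:
            raise ValueError("a record must not contain a null character")
    body = b"AFsp" + b"".join(record + b"\0" for record in records)
    # Pad so that the total header length is a multiple of 4 bytes.
    pad = -(FIXED_SIZE + len(body)) % 4
    return body + b"\0" * pad
```

With the single record b"date:x", the result is b"AFspdate:x\0\0", giving a
total header length of 36 bytes.  A multiple of 4 covers samples of 1, 2 or 4
bytes.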

Standardized Records:

It is suggested that a number of information fields can be standardized.  The
"date:" record should always be present.  The others are optional.  Standard
programs should provide at least the following header information.
  date:1994/01/25 19:19:39 UTC    date
  sample_rate:8012.5              sampling frequency (only if non-integer)
  user:kabal@k2.EE.McGill.CA      user
  program:CopyAudio               program name

Audio files, serving as part of a data base of recordings, should have
information identifying the database, the recording conditions and a
description of the material or spoken text.

(1) "date:"; recording date or processing date.
    This record remains invariant even when the file is copied and the
    creation date of the file as kept by the file system gets mangled.
    sample value format, "date:1993/01/16 18:13:56 UTC".  This suggested format
    is compact, language independent, and easily generated on many systems.
    The inclusion of a time-zone, here UTC, for the date is recommended.  If
    such a format cannot be generated on a particular machine, the default
    system time format can be used.
(2) "user:"; user and hostname for the user that created the file
    sample value format, "user:kabal@k2.EE.McGill.CA".
(3) "program:"; program that created the file
    sample value format, "CopyAudio".  As shown the program name is stripped
    of the pathname of the program.
(4) "text:"; transcription of the text for recorded spoken material
    sample value format, "Cats and dogs each hate the other"
(5) "speaker:"; speaker identification
    sample value format, "speaker:AMK female"
(6) "recording_conditions:"; information as to how the recording was made
    sample value format, "recording_conditions:original recorded at 20 kHz,
    15-bit D/A, digitally filtered and resampled to 8 kHz".
(7) "database:"; database identification
(8) ":"; comments
    sample value format, "converted to float from 16-bit integer".  The comment
    record is meant to supply extra information that is not appropriate for
    other records.
(9) "sample_rate:"; sample rate for each channel
    sample value format, "41.1e3".  This value complements information that is
    in the fixed part of the header.  However, this version of the sample rate
    is a floating point number and so can express fractional sampling rates.
    The value in this record must be within 0.5 of the integer value in the
    fixed part of the header.
(10) "description:"; description of the contents of the audio file
    sample value format, "Opening musical score in the film "2001: A Space
    Odyssey""
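
The consistency rule for the "sample_rate:" record in item (9) can be checked
as in the Python sketch below.  The function name is hypothetical, and treating
an inconsistent record as an error (rather than ignoring it) is an assumption
of this example, not part of the proposal.

```python
def effective_sample_rate(fixed_rate, record_value=None):
    """Choose the sampling rate to use for a file.

    fixed_rate is the integer value from the fixed part of the header.
    record_value is the value of the "sample_rate:" record as a string, or
    None if the record is absent.
    """
    if record_value is None:
        return float(fixed_rate)
    rate = float(record_value)
    # The record must be within 0.5 of the integer value in the fixed part.
    if abs(rate - fixed_rate) > 0.5:
        raise ValueError("sample_rate record disagrees with the fixed header")
    return rate
```

For a file whose fixed header gives 8013 and whose record gives "8012.5", the
function returns 8012.5.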

Notes:
(1) Newline characters can be used in long text strings to help in formatting
    the strings.
(2) The last record in the header need not have a null termination.
(3) Empty records are permitted.  They should be ignored.  Note that null
    padding at the end of the header appears as empty records.
(4) There is a question of how much baggage an audio file header should carry.
    One extreme is a headerless file.  Perhaps the other extreme is represented
    by ESPS files.  ESPS utility programs seem to pass along a great deal of
    information in the output file header, including embedding the headers from
    the input files.  I have seen one file header nearly 5 kb long which
    contained several generations of file headers representing the processing
    history of that file.
      The approach taken with the AFsp read/write routines is to automatically
    insert information that the routines can glean without being privy to the
    specifics of the processing of the data.  The user is of course free to add
    information to the header.  A standalone program CopyAudio can also be used
    to add information records.  This is an appropriate approach for creating
    self-documenting database files.  Also, the AFsp file open routines were
    designed to print information about the audio files as they are being
    opened.  This information can be stored in a log file which can serve to
    record the processing history.
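
Notes (1) to (3) have a simple consequence for a program that lists all of the
records, for instance to write them to a log file.  A minimal Python sketch,
with the hypothetical function name list_records, is:

```python
def list_records(info):
    """Return the non-empty records found in the information bytes.

    Splitting on nulls handles a final record with no null termination, and
    dropping empty strings discards both empty records and null padding.
    Newline characters inside a record are left untouched.
    """
    return [record for record in info.split(b"\0") if record]
```

For example, list_records(b"date:x\0\0user:y") returns the two records
b"date:x" and b"user:y".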

=============
Peter Kabal
Department of Electrical Engineering    McGill University
+1 514 398-7130   +1 514 398-4470 Fax
kabal@TSP.EE.McGill.CA

$Id: AFHeaders.txt,v 1.4 1995/09/18 AFsp-V2R1 $