This note is an edited summary of the ideas underlying the design of the
Time2 Library.
With time playing a rather important role in this world,
sequences of things ordered by time are present in many kinds of systems.
Because the idea of time is intuitively familiar, it is tempting to choose
a simplistic design when modeling a system or even not to think about it at all.
Ironically, simplistic designs can lead to needlessly complex implementations,
as annoying issues are addressed one after the other.
Stated informally, a time series is a set of elements uniquely identified
by a discrete point in time or by a time interval.
The Time Series Framework requires a more precise definition:
A time series is characterized by a value type and a time domain.
All elements of a time series have a value of the same type, the value type,
or can be recognized as missing. Any element can be uniquely identified
by a point in the time domain of the series.
Note that the definition does not explicitly allow for time ranges identifying
values. This is generally not a problem because of the nature of time domains.
The terms used in the definition will be explained shortly, but before
that a few typical problems will be discussed. These problems should
illustrate why a framework for time series is useful.
Problems and design goals
Let us look at a series alternating between two
constant values every working day: Friday=5, Monday=10, Tuesday=5,
Wednesday=10, Thursday=5, Friday=10, Monday=5, Tuesday=10, and so
on. The series has never any data on weekends. Consider
this straightforward chart plotting the values on the
vertical axis against their dates, on the horizontal axis:
This looks a bit funny, doesn't it? The chart does not express very well the
perfect regularity of the series. Compare it to
this
designed solution:
The visual pattern is now a faithful expression of the regularity of
the data. The chart does it by excluding weekends. The
time
domain is in fact the only difference between the two charts:
everyday dates in one case versus workweek-only dates in the other.
Tel Aviv, Cairo, New York
Beyond this contrived example, there are many situations with
no data on weekends. A familiar case is provided by stock markets.
These are also a good illustration of the annoying details hiding in
seemingly simple problems: weekends are not the same in Tel Aviv,
Cairo, and New York. When designing a database for global stock market
data, or when drawing charts to compare the prices of some American,
Egyptian, or Israeli stocks, such issues must be dealt with.
Getting a good grip on the time domain is the first important design
goal of the Time Series Framework.
Missing values
The following table lists the first eleven Olympic Games.
1896 | Athens |
1900 | Paris |
1904 | Saint-Louis |
1908 | London |
1912 | Stockholm |
1916 | |
1920 | Antwerp |
1924 | Paris |
1928 | Amsterdam |
1932 | Los Angeles |
1936 | Berlin |
Eleven? It is true that there are ten Games, but it is also true that
the list has eleven elements. Is something wrong here? No. Games had
been scheduled in Berlin in 1916 but were canceled because of the
war. Conceptually, the Games of 1916 are a missing value, and
there are many real world phenomena modeled with time series where it
is common to have some values missing. A system dealing with time
series must be capable of dealing with such cases gracefully and in a
useful way. You don't want your software to return the nine first
Games when you asked for ten, or to crash on the Games of 1916. Or you
don't want your portfolio evaluation software to give up when a quote
is missing. And as a software developer, you don't want to invent ad-hoc
solutions all the time.
Detecting missing values and handling them intelligently is the
second important design goal of the Time Series Framework.
The pieces of the puzzle
Time domain
A time domain identifies a point in time with an offset from a
base time. The offset is a long integer (64 bit signed integer).
The base time is January 1 0000, corresponding to the index
value 0.
A pseudo Gregorian calendar with the usual definition of leap
years is used. It is not the true thing because it applies from year 0, and
dates before the Gregorian cutover of October 15 1582 do not correspond to
historical dates. Years can be as a large as the size of the numbers used in
the implementation allow.
The resolution of the time series defines the time unit
corresponding to an offset increment. Available units are microsecond, second,
minute, hour, day, month, year. Week is omitted by design.
The origin is a non-negative number of units defining an
offset from the base time. It plays an important role in the time patterns
introduced below. It can also be used to avoid using large integers for indexes
in applications where the range size of a series always fits within a standard
integer. This is especially useful if the programming language does not
support long integers for indexing arrays, like 64 bit Java.
The time pattern defines the time points for which elements
exist. One such pattern is a repeating cycle. A weekly series has a resolution
of day units with a cycle of one day with an element, followed by six days
without elements. A workweek series has day units with a cycle of five days
with elements, followed by two days without elements. The Olympics have year
units with a four-year cycle of one year with games, followed by three years
without. By default, a series has no cycle, meaning elements exist for all
times. The cycle starts with the origin of the time domain. It is possible
for two series with the same resolution and cycle, but not the same origin,
to have no time point in common. The summer and winter Olympics are such a
case.
Time index
A time index is in a time domain and carries a discrete offset which defines
a point in time. Two time indexes in the same time domain can be compared,
with a larger index corresponding to a later point in time. Adjacent points
in time are represented by adjacent offsets.
Because time indexes are expected to be used as keys it is important to
implement them as immutable objects.
Time range
A time range is a pair of time indexes in the same time domain, called
the lower and upper bound. If the lower bound is larger than the upper bound
the range is said to be empty.
Observation, value type, missing value
An observation has a time index and a value of some type. The value type
must allow the definition of a special value representing missing values,
without restricting the set of useful values; null (nil) can only be used
as this special value if it has no other meaning in the relevant context.
Time series
A time series maps a set of time indexes to values. All time indexes are
in the same time domain and all values are of the same type. For this
reason we talk of the time domain and the value type of a time series.
A time series can be defined alternatively as a set of observations,
with each element of the set uniquely identified by its time index.
A time series has a range defined by the smallest and largest time indexes
of the included observations.
A time series maintains the abstraction of missing values, which correspond
to time indexes in the range for which no observation exist. This definition
implies that a time series can never have missing values at its boundaries.
This is also true for a subset of a time series, because it is also a
time series.
Time series storage
Although an implementation detail, the storage of time series presents an
interesting design question. That a series maps time indexes to values does
not mean that it has to be stored in a dictionary keyed by time indexes. The
Flyweight pattern provides a better solution, making use of
the fact that for a given time domain, time indexes can be represented by
integer values, the offsets.
Two cases must be considered, depending on the frequency of missing
values. In a regular time series, missing values are
exceptional. In a sparse time series, they are the rule.
Sparse series are useful for managing irregular events. They are typically
implemented as dictionaries. Missing values are not stored.
Instead of time indexes, only offsets are used as keys. The time domain
is stored only once. Time indexes are reconstructed if and when needed.
The storage of regular series is straightforward and efficient.
All values, missing or not, are stored into an array. With missing values the
exception, the overhead is reasonable, and because they are self-signaling,
missing values are always detectable. In addition to the array, the series
stores also the time domain and the offset of the time index of the value in
the first array element. A time index can be reconstructed from the time domain
and the sum of the stored offset and the array index of the value.
Time Series Framework Design
by Jean-Paul Vetterli is licensed under a Creative Commons Attribution 3.0 Unported License.