Time Series Software: July 2012

Tuesday, July 17, 2012

CrNiCKL 1.1.0 released with all delete methods renamed

There were quite a few methods named delete in the CrNiCKL software. This was a bad idea because such methods are useless when called from JavaScript. Indeed, delete is a reserved word in js.

All such methods have been renamed and a new version of the software has been released to the project website, to SourceForge and the GitHub repositories. The following scheme has been used for renaming. Parameterless delete methods are now named destroy. The others are now typically named deleteFoo where Foo is the type of (one of) the parameter(s).

I decided to completely eliminate the old methods rather than let them live a bit longer as deprecated. I did it because the software is not yet used out there. In case I misread the stats, please accept my apologies. In any case the previous version remains available on all the websites mentioned.

Friday, July 13, 2012

Source code sharing

For a very long time I have wanted to switch to the git version control system. It's now done. Compared to what I've used in the past (cvs, sccs, rcs, vss, sclm, Librarian, and maybe others I have forgotten), I like git better. I don't know why I waited so many years...

Although the source of the projects I discuss in this blog was already available for downloading via the project websites, it can now be browsed more comfortably at GitHub.

The relevant repositories are:

time2lib: for the Time2 Library
crnickl: for the CrNiCKL base system
crnickl-jdbc: for the JDBC support of CrNiCKL
crnickl-demo: for the CrNiCKL demos

The game of the name

The ChronoDB project has been renamed CrNiCKL. I noticed a naming conflict with an unrelated product and decided to change the name while it is still harmless.

Chronicles are the most important objects in the ~~ChronoDB~~ CrNiCKL database. So I started from them to find a new name. I settled on CrNiCKL because it can be pronounced like "chronicle" and there are practically no hits for it in Google. I also decided to write the name with most of the letters in uppercase, so it stands out in text.

The software remains available under its old name, for a while at least, but won't be updated. Blog posts related to ChronoDB will not be updated but will get a notice about the rename.

The new project website is http://agent.ch/timeseries/crnickl/.

Wednesday, July 11, 2012

Time2 Library project home page updated

The marketing department has just updated the home page of the Time2 Library project with the following quote of the day:

If you need speed, compactness, and flexibility, then the Time2 Library is for you. There is no trade off. Even with the wildest time domains (calendars if you prefer) a time point is just a number. In many cases it is even less than that: it is an array index. The flexibility comes into play only if and when needed, for example when printing or scanning a date.

Wow.

Monday, July 9, 2012

ChronoDB and geographical coordinates

Important notice. On July 13, 2012, the ChronoDB project was renamed CrNiCKL, which is pronounced like "chronicle". All packages, demos included, have been renamed. The new project website is at http://agent.ch/timeseries/crnickl/. The old project remains accessible for a while at http://agent.ch/timeseries/chronodb/. The remainder of this article remains valid mutatis mutandis.

I've just released another ChronoDB demo to show how to set up a database with time series of geographical coordinates. The demo can be found in the package ch.agent.chronodb.demo.geocoord in archive chronodb-demo-1.1.0.jar at the ChronoDB project website or on SourceForge.

GeoCoord is a toy Java interface for geographical coordinates, with three methods:

    boolean isNear(GeoCoord coord);
    boolean isNear(GeoCoord coord, double distance);
    double distanceTo(GeoCoord coord);

In the demo, GeoCoords are implemented as cartesian coordinates. The glue with ChronoDB is provided by a simple value scanner GeoCoordValueScanner and a less simple implementation of ValueAccessMethods<GeoCoord> named AccessMethodsForGeoCoord. The Database class itself is so small it can be listed completely here:

    package ch.agent.chronodb.demo.geocoord;

    import ch.agent.chronodb.jdbc.JDBCDatabase;

    public class GeoCoordDatabase extends JDBCDatabase {
        public GeoCoordDatabase(String name) {
            super(name);
            setAccessMethods(GeoCoordValueScanner.class.getName(), new AccessMethodsForGeoCoord());
        }
    }
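For readers who want to see what a cartesian implementation of the GeoCoord interface might look like, here is a minimal sketch. The interface is the one listed above; the class CartesianGeoCoord, its fields, and the default "near" threshold are assumptions made for illustration and are not taken from the demo source.

```java
// Sketch only: the interface is copied from the post, everything else is hypothetical.
interface GeoCoord {
    boolean isNear(GeoCoord coord);
    boolean isNear(GeoCoord coord, double distance);
    double distanceTo(GeoCoord coord);
}

final class CartesianGeoCoord implements GeoCoord {
    // Assumed default threshold (in km) used by the one-argument isNear.
    static final double DEFAULT_NEAR_KM = 100.0;

    private final double x;
    private final double y;
    private final double z;

    CartesianGeoCoord(double x, double y, double z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    @Override
    public double distanceTo(GeoCoord coord) {
        if (!(coord instanceof CartesianGeoCoord)) {
            throw new IllegalArgumentException("cannot compare with " + coord);
        }
        CartesianGeoCoord other = (CartesianGeoCoord) coord;
        double dx = x - other.x;
        double dy = y - other.y;
        double dz = z - other.z;
        // Straight-line (Euclidean) distance between the two points.
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    @Override
    public boolean isNear(GeoCoord coord, double distance) {
        return distanceTo(coord) <= distance;
    }

    @Override
    public boolean isNear(GeoCoord coord) {
        return isNear(coord, DEFAULT_NEAR_KM);
    }

    public static void main(String[] args) {
        GeoCoord a = new CartesianGeoCoord(0, 0, 0);
        GeoCoord b = new CartesianGeoCoord(3, 4, 0);
        System.out.println(a.distanceTo(b)); // prints 5.0
    }
}
```

Keeping the coordinate immutable makes it safe to store as a time series value and to share between series.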

To make things interesting the demo uses a special time domain, with time points at 7:00 AM, 9:00 AM, 3:11 PM, and 9:33:20 PM every Monday, Tuesday, Friday, and Saturday. For a refresher on time domains, please have a look at my previous post.
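To get a feel for how irregular this time domain is, the time points can be enumerated with plain java.time. This is only a sketch: it does not use the Time2 Library, and the class and method names (DemoTimePoints, pointsBetween) are made up for the illustration.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class DemoTimePoints {
    // Days of the week on which the domain has time points.
    static final Set<DayOfWeek> DAYS = EnumSet.of(
            DayOfWeek.MONDAY, DayOfWeek.TUESDAY, DayOfWeek.FRIDAY, DayOfWeek.SATURDAY);

    // Times of day of the four daily time points.
    static final List<LocalTime> TIMES = Arrays.asList(
            LocalTime.of(7, 0), LocalTime.of(9, 0),
            LocalTime.of(15, 11), LocalTime.of(21, 33, 20));

    // Lists all time points of the domain between two dates, both included.
    static List<LocalDateTime> pointsBetween(LocalDate first, LocalDate last) {
        List<LocalDateTime> points = new ArrayList<>();
        for (LocalDate d = first; !d.isAfter(last); d = d.plusDays(1)) {
            if (DAYS.contains(d.getDayOfWeek())) {
                for (LocalTime t : TIMES) {
                    points.add(LocalDateTime.of(d, t));
                }
            }
        }
        return points;
    }

    public static void main(String[] args) {
        // Monday and Tuesday of the week of July 9, 2012.
        for (LocalDateTime p : pointsBetween(LocalDate.of(2012, 7, 9), LocalDate.of(2012, 7, 10))) {
            System.out.println(p);
        }
    }
}
```

A full week contains sixteen time points: four per day on four days.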

Here is what running the demo looks like (commands are on a single line and output has been truncated):

    $ jar xf chronodb-demo-1.1.0.jar lib
    $ java -cp chronodb-demo-1.1.0.jar \
        ch.agent.chronodb.demo.geocoord.GeoCoordDemo Resources/geocoord.parm
    2012-04-09 09:00:00 8151km (machin)
    2012-04-09 15:11:00 7000km (machin)
    --- clip clip ---
    2012-07-09 07:00:00 7053km (truc)
    2012-07-09 09:00:00 8367km (bidule)
    2012-07-09 15:11:00 9582km (bidule)
    2012-07-09 21:33:20 6889km (machin)
    $

The exact figures vary with each run because the data is random.

Friday, July 6, 2012

Time Series Framework Design

This note is an edited summary of the ideas underlying the design of the Time2 Library.

With time playing a rather important role in this world, sequences of things ordered by time are present in many kinds of systems. Because the idea of time is intuitively familiar, it is tempting to choose a simplistic design when modeling a system or even not to think about it at all. Ironically, simplistic designs can lead to needlessly complex implementations, as annoying issues are addressed one after the other.

Stated informally, a time series is a set of elements uniquely identified by a discrete point in time or by a time interval. The Time Series Framework requires a more precise definition:

A time series is characterized by a value type and a time domain. All elements of a time series have a value of the same type, the value type, or can be recognized as missing. Any element can be uniquely identified by a point in the time domain of the series.

Note that the definition does not explicitly allow for time ranges identifying values. This is generally not a problem because of the nature of time domains.

The terms used in the definition will be explained shortly, but before that a few typical problems will be discussed. These problems should illustrate why a framework for time series is useful.

Problems and design goals

Let us look at a series alternating between two constant values every working day: Friday=5, Monday=10, Tuesday=5, Wednesday=10, Thursday=5, Friday=10, Monday=5, Tuesday=10, and so on. The series never has any data on weekends. Consider this straightforward chart plotting the values on the vertical axis against their dates on the horizontal axis:

This looks a bit funny, doesn't it? The chart does not express very well the perfect regularity of the series. Compare it to this designed solution:

The visual pattern is now a faithful expression of the regularity of the data. The chart does it by excluding weekends. The time domain is in fact the only difference between the two charts: everyday dates in one case versus workweek-only dates in the other.

Tel Aviv, Cairo, New York

Beyond this contrived example, there are many situations with no data on weekends. A familiar case is provided by stock markets. These are also a good illustration of the annoying details hiding in seemingly simple problems: weekends are not the same in Tel Aviv, Cairo, and New York. When designing a database for global stock market data, or when drawing charts to compare the prices of some American, Egyptian, or Israeli stocks, such issues must be dealt with.

Getting a good grip on the time domain is the first important design goal of the Time Series Framework.

Missing values

The following table lists the first eleven Olympic Games.

1896	Athens
1900	Paris
1904	Saint-Louis
1908	London
1912	Stockholm
1916
1920	Antwerp
1924	Paris
1928	Amsterdam
1932	Los Angeles
1936	Berlin

Eleven? It is true that there are ten Games, but it is also true that the list has eleven elements. Is something wrong here? No. Games had been scheduled in Berlin in 1916 but were canceled because of the war. Conceptually, the Games of 1916 are a missing value, and there are many real world phenomena modeled with time series where it is common to have some values missing. A system dealing with time series must be capable of dealing with such cases gracefully and in a useful way. You don't want your software to return the first nine Games when you asked for ten, or to crash on the Games of 1916. Nor do you want your portfolio evaluation software to give up when a quote is missing. And as a software developer, you don't want to invent ad-hoc solutions all the time.

Detecting missing values and handling them intelligently is the second important design goal of the Time Series Framework.
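The Olympic example can be turned into a small sketch. It assumes that null is an acceptable marker for a missing host city; the class OlympicGames and its method firstGames are hypothetical names, not part of the Time2 Library.

```java
import java.util.ArrayList;
import java.util.List;

final class OlympicGames {
    static final int FIRST_YEAR = 1896;

    // One element every four years; null marks the missing Games of 1916.
    static final String[] HOSTS = {
        "Athens", "Paris", "Saint-Louis", "London", "Stockholm", null,
        "Antwerp", "Paris", "Amsterdam", "Los Angeles", "Berlin"
    };

    // Returns the first n Games actually held, skipping missing values.
    static List<String> firstGames(int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < HOSTS.length && result.size() < n; i++) {
            if (HOSTS[i] != null) {
                result.add((FIRST_YEAR + 4 * i) + " " + HOSTS[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Ten Games are returned even though the list has eleven elements.
        for (String games : firstGames(10)) {
            System.out.println(games);
        }
    }
}
```

Asking for ten Games scans eleven elements and neither stops short nor fails on 1916.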

The pieces of the puzzle

Time domain

A time domain identifies a point in time with an offset from a base time. The offset is a long integer (64 bit signed integer).

The base time is January 1 0000, corresponding to the index value 0.

A pseudo Gregorian calendar with the usual definition of leap years is used. It is not the true thing because it applies from year 0, and dates before the Gregorian cutover of October 15 1582 do not correspond to historical dates. Years can be as large as the size of the numbers used in the implementation allows.

The resolution of the time series defines the time unit corresponding to an offset increment. Available units are microsecond, second, minute, hour, day, month, year. Week is omitted by design.

The origin is a non-negative number of units defining an offset from the base time. It plays an important role in the time patterns introduced below. It can also be used to avoid using large integers for indexes in applications where the range size of a series always fits within a standard integer. This is especially useful if the programming language does not support long integers for indexing arrays, as is the case with Java, even on 64-bit platforms.

The time pattern defines the time points for which elements exist. One such pattern is a repeating cycle. A weekly series has a resolution of day units with a cycle of one day with an element, followed by six days without elements. A workweek series has day units with a cycle of five days with elements, followed by two days without elements. The Olympics have year units with a four-year cycle of one year with games, followed by three years without. By default, a series has no cycle, meaning elements exist for all times. The cycle starts with the origin of the time domain. It is possible for two series with the same resolution and cycle, but not the same origin, to have no time point in common. The summer and winter Olympics are such a case.
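The role of the origin in a repeating cycle can be sketched in a few lines. This is an illustration under assumptions, not the Time2 Library implementation: the class CyclePattern is hypothetical, offsets are plain year numbers, and the origins 1896 and 1994 are chosen here to stand for the summer and the modern winter Games.

```java
final class CyclePattern {
    private final long origin;
    private final boolean[] cycle;

    // origin: offset at which the cycle starts; cycle: true where an element exists.
    CyclePattern(long origin, boolean... cycle) {
        if (origin < 0 || cycle.length == 0) {
            throw new IllegalArgumentException("origin must be >= 0 and cycle must not be empty");
        }
        this.origin = origin;
        this.cycle = cycle.clone();
    }

    // Tells whether the pattern has an element at the given offset.
    // In this sketch the pattern repeats in both directions from the origin.
    boolean hasElementAt(long offset) {
        long position = Math.floorMod(offset - origin, (long) cycle.length);
        return cycle[(int) position];
    }

    public static void main(String[] args) {
        // Year units, one year with Games followed by three without.
        CyclePattern summer = new CyclePattern(1896, true, false, false, false);
        CyclePattern winter = new CyclePattern(1994, true, false, false, false);
        for (long year = 2010; year <= 2016; year++) {
            System.out.println(year + " summer=" + summer.hasElementAt(year)
                    + " winter=" + winter.hasElementAt(year));
        }
    }
}
```

The two patterns have the same resolution and the same cycle, but because their origins differ by a number of years that is not a multiple of four, they never share a time point.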

Time index

A time index is in a time domain and carries a discrete offset which defines a point in time. Two time indexes in the same time domain can be compared, with a larger index corresponding to a later point in time. Adjacent points in time are represented by adjacent offsets.
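The statement that adjacent points in time have adjacent offsets is what makes a workweek domain behave nicely: Friday and the following Monday are neighbours. Here is a minimal sketch of such a numbering. It is an assumption-laden illustration: the class WorkweekIndex is hypothetical, and it counts from a Monday close to the java.time epoch rather than from the base time of January 1 0000 used by the framework.

```java
import java.time.LocalDate;

final class WorkweekIndex {
    // java.time epoch day 0 is Thursday 1970-01-01; adding 3 counts days from Monday 1969-12-29.
    private static final long MONDAY_SHIFT = 3;

    // Maps a working day to an offset in which weekends simply do not exist.
    static long indexOf(LocalDate date) {
        long days = date.toEpochDay() + MONDAY_SHIFT;
        long week = Math.floorDiv(days, 7L);
        long dayOfWeek = Math.floorMod(days, 7L); // 0 = Monday ... 6 = Sunday
        if (dayOfWeek > 4) {
            throw new IllegalArgumentException(date + " is not a working day");
        }
        return week * 5 + dayOfWeek;
    }

    public static void main(String[] args) {
        long friday = indexOf(LocalDate.of(2012, 7, 6));
        long monday = indexOf(LocalDate.of(2012, 7, 9));
        System.out.println(monday - friday); // prints 1
    }
}
```

With this numbering a chart plotted against the offset shows the alternating series as a perfectly regular zigzag, with no gap for the weekend.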

Because time indexes are expected to be used as keys it is important to implement them as immutable objects.

Time range

A time range is a pair of time indexes in the same time domain, called the lower and upper bound. If the lower bound is larger than the upper bound the range is said to be empty.
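The definitions of time index and time range translate almost directly into two small immutable classes. The sketch below is hypothetical and deliberately simplified: the time domain is reduced to a string label, and the class names and methods are chosen for the illustration only.

```java
import java.util.Objects;

final class TimeIndex implements Comparable<TimeIndex> {
    private final String domain; // stand-in label for a real time domain object
    private final long offset;

    TimeIndex(String domain, long offset) {
        this.domain = Objects.requireNonNull(domain);
        this.offset = offset;
    }

    String domain() { return domain; }
    long offset() { return offset; }

    // A larger index is a later point in time; indexes from different domains cannot be compared.
    @Override
    public int compareTo(TimeIndex other) {
        if (!domain.equals(other.domain)) {
            throw new IllegalArgumentException("different time domains");
        }
        return Long.compare(offset, other.offset);
    }

    // equals and hashCode make the index usable as a key.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TimeIndex)) return false;
        TimeIndex other = (TimeIndex) o;
        return offset == other.offset && domain.equals(other.domain);
    }

    @Override
    public int hashCode() {
        return Objects.hash(domain, offset);
    }
}

final class TimeRange {
    private final TimeIndex lower;
    private final TimeIndex upper;

    TimeRange(TimeIndex lower, TimeIndex upper) {
        lower.compareTo(upper); // rejects bounds from different domains
        this.lower = lower;
        this.upper = upper;
    }

    // The range is empty when the lower bound is larger than the upper bound.
    boolean isEmpty() { return lower.compareTo(upper) > 0; }

    // Number of time points in the range, bounds included.
    long size() { return isEmpty() ? 0 : upper.offset() - lower.offset() + 1; }
}
```

All fields are final and there are no setters, so an index used as a key can never change under the feet of the collection holding it.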

Observation, value type, missing value

An observation has a time index and a value of some type. The value type must allow the definition of a special value representing missing values, without restricting the set of useful values; null (nil) can only be used as this special value if it has no other meaning in the relevant context.

Time series

A time series maps a set of time indexes to values. All time indexes are in the same time domain and all values are of the same type. For this reason we talk of the time domain and the value type of a time series.

A time series can be defined alternatively as a set of observations, with each element of the set uniquely identified by its time index.

A time series has a range defined by the smallest and largest time indexes of the included observations.

A time series maintains the abstraction of missing values, which correspond to time indexes in the range for which no observation exists. This definition implies that a time series can never have missing values at its boundaries. This is also true for a subset of a time series, because it is also a time series.

Time series storage

Although an implementation detail, the storage of time series presents an interesting design question. That a series maps time indexes to values does not mean that it has to be stored in a dictionary keyed by time indexes. The Flyweight pattern provides a better solution, making use of the fact that for a given time domain, time indexes can be represented by integer values, the offsets.

Two cases must be considered, depending on the frequency of missing values. In a regular time series, missing values are exceptional. In a sparse time series, they are the rule.

Sparse series are useful for managing irregular events. They are typically implemented as dictionaries. Missing values are not stored. Instead of time indexes, only offsets are used as keys. The time domain is stored only once. Time indexes are reconstructed if and when needed.

The storage of regular series is straightforward and efficient. All values, missing or not, are stored into an array. With missing values the exception, the overhead is reasonable, and because they are self-signaling, missing values are always detectable. In addition to the array, the series also stores the time domain and the offset of the time index of the value in the first array element. A time index can be reconstructed from the time domain and the sum of the stored offset and the array index of the value.
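The two storage strategies can be sketched side by side. This is not the Time2 Library code: the classes RegularSeries and SparseSeries are hypothetical, values are doubles, Double.NaN is assumed to be the self-signaling missing value, and the time domain (which would be stored once per series) is left out for brevity.

```java
import java.util.TreeMap;

// Regular series: every value, missing or not, sits in an array.
final class RegularSeries {
    private final long firstOffset; // offset of the value in array element 0
    private final double[] values;  // Double.NaN marks a missing value

    RegularSeries(long firstOffset, double... values) {
        this.firstOffset = firstOffset;
        this.values = values.clone();
    }

    // Offsets outside the stored range are reported as missing.
    double get(long offset) {
        long i = offset - firstOffset;
        return (i < 0 || i >= values.length) ? Double.NaN : values[(int) i];
    }

    // The offset of a stored value is the first offset plus the array index.
    long offsetOf(int arrayIndex) {
        return firstOffset + arrayIndex;
    }
}

// Sparse series: only actual values are stored, keyed by offset.
final class SparseSeries {
    private final TreeMap<Long, Double> values = new TreeMap<>();

    void put(long offset, double value) {
        if (Double.isNaN(value)) {
            values.remove(offset); // missing values are not stored
        } else {
            values.put(offset, value);
        }
    }

    double get(long offset) {
        Double v = values.get(offset);
        return v == null ? Double.NaN : v;
    }

    int storedCount() {
        return values.size();
    }
}
```

A fuller version would also trim missing values at both ends of a regular series, since a time series never has missing values at its boundaries.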

Time Series Framework Design by Jean-Paul Vetterli is licensed under a Creative Commons Attribution 3.0 Unported License.

Walking through the ChronoDB demo (2/2)

This post is the second of a two-part commentary on the ChronoDB demo. In the first part I explained the steps for setting up a ChronoDB database before it can be used to perform useful work.

Setting up the schema for the demo happens on the last line of our code snippet:

    StocksAndForexDemo demo = new StocksAndForexDemo(args[0]);
    demo.setUpHyperSQLDatabase();
    demo.setUpSchema();

Let's focus on this method.

    public void setUpSchema() throws Exception {
        StocksAndForexSchema schema = new StocksAndForexSchema(db);
        schema.createSchema();
        // commit all changes
        db.commit();
    }

Whenever the symbol db appears in this post, it stands for the ChronoDB database.

    ch.agent.chronodb.api.Database db;

The demo needs numerical series and textual attributes. So we need to create two value types. This is done in the StocksAndForexSchema constructor:

    db.createValueType("text", false, ValueType.StandardValueType.TEXT.name())
        .applyUpdates();
    db.createValueType("numeric", false, ValueType.StandardValueType.NUMBER.name())
        .applyUpdates();

Here we use built-in support for numbers and texts provided by ChronoDB. In other cases we will provide a custom ValueScanner. But the constructor is not finished with its work. It needs to tell ChronoDB that we are going to have numeric series:

    UpdatableValueType<ValueType> uvtvt = db.getTypeBuiltInProperty()
        .getValueType().typeCheck(ValueType.class).edit();
    uvtvt.addValue(uvtvt.getScanner().scan("numeric"), null);
    uvtvt.applyUpdates();

The method invocation applyUpdates() seen here and there consolidates all pending modifications to an object but does not commit them to permanent storage.

When the constructor is done, things get more specific, as can be seen from the code of createSchema:

    public void createSchema() throws T2DBException {
        createCurrencyValueTypeAndProperty();
        createSeriesUnitValueTypeAndProperty();
        createTickerProperty();
        createStocksSchema();
        createExchangeRatesSchema();
        createTopLevelChronicles();
    }

In the demo we have a class for currencies and we want to use a custom value scanner:

    UpdatableValueType<Currency> uvt = db.createValueType("Currency", true,
        "ch.agent.chronodb.demo.CurrencyValueScanner");

The Currency object in this demo is not very useful. In a real world application it would have more responsibilities, like computing exchange rates. Its purpose here is only to show how we set up ChronoDB to use problem-related classes. Not shown here is how to add currency values and how to create the currency property, as it's straightforward. We also skip the details of creating other value types and properties and turn to the creation of a schema for stocks.

We want to represent a stock using a chronicle with two attributes, ticker, and currency, and with three series, price, volume, and splits. Price and volume have a custom series attribute, unit. The first thing to do is to create a schema. The demo does not use the possibility to inherit from another schema.

    UpdatableSchema schema = db.createSchema("Stocks", null);

Attributes are then added to the schema. It is necessary to provide numbers for attributes and series. At first sight one would ask why these numbers are not hidden by the software. The answer is schema inheritance and the possibility to not only remove and modify attributes and series but also to insert them in precise positions.

    schema.addAttribute(1);
    schema.setAttributeProperty(1, db.getProperty("Ticker", true));
    schema.addAttribute(2);
    schema.setAttributeProperty(2, db.getProperty("Currency", true));

The following piece of code defines the price series as a workweek numeric series, with a currency unit:

    schema.addSeries(1);
    schema.setSeriesName(1, "price");
    schema.setSeriesDescription(1, "close price");
    schema.setSeriesType(1, db.getValueType("numeric"));
    schema.setSeriesTimeDomain(1, Workday.DOMAIN);
    schema.addAttribute(1, 5);
    schema.setAttributeProperty(1, 5, db.getProperty("Unit", true));
    schema.setAttributeDefault(1, 5, "currency");

By default a series does not use sparse time series. The automatic use of sparse time series can be configured in the schema. This is done for the splits series in the demo. Even if not configured, applications still have the possibility to force sparsity when getting data.

The last step in setting up the schema is to create top level chronicles. The demo uses two collections: stocks and exchange rates. The code below shows how to create the top chronicle hosting the exchange rate collection.

    Schema forexSchema = db.getSchemas("Forex").iterator().next();
    UpdatableChronicle forex = db.getTopChronicle().edit()
        .createChronicle("forex", false, "Exchange rate data", null, forexSchema);
    forex.applyUpdates();

To create a chronicle you need to have a parent chronicle. For a top-level chronicle, the parent is the top chronicle, which is virtual and which is named after the database, "demo" in this case.

With top-level chronicles created, the demo can go ahead. Once the required schemas have been set up, applications rarely, if ever, need to do anything about them. Millions of chronicles can be created, and their attributes and series are automatically available without doing anything, except setting specific values. In many cases it is not even necessary to set values of attributes. Take the case of American stocks. It is a simple matter to define a schema for them, inheriting from the "Stocks" schema we made in the demo:

    UpdatableSchema schema = db.createSchema("American stocks", "Stocks");
    Property<Currency> currency = db.getProperty("Currency", true)
        .typeCheck(Currency.class);
    schema.setAttributeDefault(2, currency.scan("USD"));

With this new schema, you can now create a chronicle collection for American stocks (perhaps a nested collection of the stocks collection), and all members will automatically be in USD.
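The way numbered attributes, schema inheritance, and default values play together can be shown with a toy model. It has nothing to do with the actual ChronoDB classes: ToySchema and ToyChronicle are hypothetical, attributes are plain strings, and only the lookup order is of interest.

```java
import java.util.HashMap;
import java.util.Map;

// Toy schema: attribute defaults are found by number, locally first, then in the base schema.
final class ToySchema {
    private final ToySchema base; // null when the schema does not inherit
    private final Map<Integer, String> defaults = new HashMap<>();

    ToySchema(ToySchema base) {
        this.base = base;
    }

    void setAttributeDefault(int number, String value) {
        defaults.put(number, value);
    }

    String attributeDefault(int number) {
        String value = defaults.get(number);
        if (value != null || base == null) {
            return value;
        }
        return base.attributeDefault(number);
    }
}

// Toy chronicle: an explicitly set attribute value overrides the schema default.
final class ToyChronicle {
    private final ToySchema schema;
    private final Map<Integer, String> values = new HashMap<>();

    ToyChronicle(ToySchema schema) {
        this.schema = schema;
    }

    void setAttribute(int number, String value) {
        values.put(number, value);
    }

    String attribute(int number) {
        String value = values.get(number);
        return value != null ? value : schema.attributeDefault(number);
    }
}
```

Because the derived schema refers to attribute number 2, it can change the default of the inherited currency attribute without redefining the attribute itself.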

As a final remark, it is important to note that while the default value of a chronicle attribute can always be overridden (it's named a default value after all), the value of a series attribute cannot. It keeps its default value, as defined in the schema. If you say in the schema that the unit of a series foo is bar then in all collections having this schema, the foo series unit will be bar. The same goes for built-in attributes: name, type, time domain, and sparsity of a series cannot be changed once defined. This does not restrict the modeling freedom. As the demo has shown, you can have price series in different currencies. The price and volume have a unit ("in currency", and "in number of shares"), but the price series unit "in currency" simply tells you to look at the chronicle's currency. So you will have your Toyotas quoted in yen and your Renaults in euros.

Thursday, July 5, 2012

Walking through the ChronoDB demo (1/2)

A demo package is available for downloading from the ChronoDB project website or from SourceForge. In a short series of posts I will comment on a few important details. I hope these explanations will be helpful.

Explanations will focus on the following code snippet from the static main method of StocksAndForexDemo:

    // args[0] is name of parameter file with key-value pairs
    ...
    StocksAndForexDemo demo = new StocksAndForexDemo(args[0]);
    demo.setUpHyperSQLDatabase();
    demo.setUpSchema();
    // etc.

The constructor invocation new StocksAndForexDemo(args[0]) initializes a ChronoDB database, kept in a private member inside the demo:

    private ch.agent.chronodb.api.Database db;

The actual work is done by ch.agent.chronodb.api.SimpleDatabaseManager, which is one of the few non-interface classes in that package. It is provided to make it easier to write test cases (and demos). It sets up the database using parameters provided in a file on the file system or the class path. Here is an extract from such a file:

    db.name=demo
    db.class=ch.agent.chronodb.jdbc.JDBCDatabase
    session.jdbcDriver=org.hsqldb.jdbc.JDBCDriver
    session.jdbcUrl=jdbc:hsqldb:mem:demodb
    session.db=
    session.user=sa
    session.password=
    # etc.

Parameters with names beginning with "db." are the most important: db.name names the database and must be unique within a running system; db.class names the implementation class. Although it is in principle possible to run multiple databases simultaneously, SimpleDatabaseManager currently supports only one database. Parameters beginning with "session." are specific to the JDBC implementation. Implementations more sophisticated than JDBCDatabase used in this simple demo will have more parameters, some of which are named in the interface ch.agent.chronodb.impl.DatabaseBackend.
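To make the split between "db." and "session." parameters concrete, here is a small sketch that reads such a file with java.util.Properties and groups the parameters by prefix. It only illustrates the file format; how SimpleDatabaseManager actually parses the file is not shown in this post, and the class ParameterFile with its methods is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

final class ParameterFile {
    // Parses key-value pairs; lines starting with # are comments.
    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    // Returns the parameters whose names begin with the prefix, with the prefix removed.
    static Map<String, String> withPrefix(Properties properties, String prefix) {
        Map<String, String> result = new TreeMap<>();
        for (String name : properties.stringPropertyNames()) {
            if (name.startsWith(prefix)) {
                result.put(name.substring(prefix.length()), properties.getProperty(name));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "db.name=demo\n"
                + "db.class=ch.agent.chronodb.jdbc.JDBCDatabase\n"
                + "session.jdbcUrl=jdbc:hsqldb:mem:demodb\n"
                + "session.user=sa\n"
                + "session.password=\n"
                + "# etc.\n";
        Properties properties = parse(text);
        System.out.println(withPrefix(properties, "db."));
        System.out.println(withPrefix(properties, "session."));
    }
}
```

An empty value such as the password is read as an empty string, not as a missing parameter.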

At this point, a ChronoDB database object and a JDBC connection are ready for use but there is absolutely nothing in the database yet. In fact, the demo uses an in-memory HyperSQL database. Before the demo starts and after the demo terminates, the database does not exist. So the next step is to create the tables and indexes expected by the JDBC implementation of ChronoDB.

This is done by the method invocation demo.setUpHyperSQLDatabase() which sends SQL data definition language (DDL) statements to the database engine for execution. The DDL is taken from Resources/HyperSQL_DDL_base.sql. This file is in chronodb-jdbc-1.0.0.jar and therefore on the class path. The DDL defines all tables required by the base system, with various indexes and constraints to enforce referential integrity. Browse the SQL file if you need details. Noteworthy are the few non-DDL statements at the end of the file, which initialize the database with the built-in properties needed when defining a series in a schema. These properties require in turn the corresponding built-in value types.

These value types are:

name, a string type enforcing a minimalist naming policy
type, for values defining the type of a series
timedomain, a restricted type for time domains, with some predefined values: daily, datetime, monthly, workweek, and yearly
binary, a boolean type

The properties, with their value type in parentheses, are:

Symbol (name)
Type (type)
Calendar (timedomain)
Sparsity (binary)

These value types and properties could in theory be provided virtually by the software, but in a JDBC implementation they are physically required because of referential integrity. For more information on time domains, please consult the documentation of the Time2 Library project. For more information on the other properties please consult the documentation of the ChronoDB database project.

At this point the database is ready for useful work directly related to the problem at hand: setting up the schema for the demo. This will be the subject of a forthcoming post.