Files Dashboard

From ecology
Revision as of 11:48, 8 May 2012 by Andera (talk | contribs) (Software architecture)

The files dashboard is a web page to get an overview of the raw data files of trackers. For a chosen project and a date interval, it displays a table with a row for every day in the date interval and a column for every tracker that belongs to the project.

Each cell of the table then contains information about the files of that tracker for that day: in principle, one box for each file, and usually no more than one file per day. The purpose is to have the information at a glance, so if there are no files for a tracker on a given day, an empty box is shown.

Three pieces of information are to be shown for each file:

  • Whether a file exists for a given day and tracker
  • How big the file is
  • Whether errors happened during parsing of its contents
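The table layout described above can be sketched roughly as follows. This is only an illustration: the names `DashboardGrid` and `FileCell` are hypothetical and are not taken from the actual code base, and the in-memory maps merely stand in for whatever the web page really renders.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the dashboard table: one row per day, one column per tracker. */
final class DashboardGrid {

    /** One box inside a cell: the pieces of information shown for an existing file. */
    static final class FileCell {
        final String fileName;
        final long sizeInBytes;
        final boolean parsingErrors;

        FileCell(String fileName, long sizeInBytes, boolean parsingErrors) {
            this.fileName = fileName;
            this.sizeInBytes = sizeInBytes;
            this.parsingErrors = parsingErrors;
        }
    }

    // day -> (tracker number -> files for that day); an empty list means an empty box
    private final Map<LocalDate, Map<Integer, List<FileCell>>> rows = new LinkedHashMap<>();

    DashboardGrid(LocalDate from, LocalDate to, List<Integer> trackers) {
        for (LocalDate day = from; !day.isAfter(to); day = day.plusDays(1)) {
            Map<Integer, List<FileCell>> row = new LinkedHashMap<>();
            for (Integer tracker : trackers) {
                row.put(tracker, new ArrayList<FileCell>());
            }
            rows.put(day, row);
        }
    }

    /** Adds a file to its cell; files outside the interval or of other trackers are ignored. */
    void add(LocalDate day, int tracker, FileCell cell) {
        Map<Integer, List<FileCell>> row = rows.get(day);
        if (row != null && row.containsKey(tracker)) {
            row.get(tracker).add(cell);
        }
    }

    /** Returns the boxes of one cell, or an empty list if the cell is not part of the table. */
    List<FileCell> cell(LocalDate day, int tracker) {
        Map<Integer, List<FileCell>> row = rows.get(day);
        if (row == null || !row.containsKey(tracker)) {
            return Collections.emptyList();
        }
        return row.get(tracker);
    }

    int dayCount() {
        return rows.size();
    }
}
```

Creating every cell up front, even when no file exists, is what lets the page show an empty box instead of leaving a hole in the table.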

Files processing

Raw tracking data arrives in the system as plain text files sent through a Dropbox account. Dropbox makes the files available in the server's file system.

File properties

Once accessible through the local file system, the files can be queried, read and parsed to extract the tracking information they contain.

The file name is expected to have the form Log_0533_13042012_xx.txt. This provides:

The tracker number
(in the example, 533)
The reported date
(in the example, April 13th 2012)
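As an illustration, the file name could be decoded along these lines. The class name `TrackingFileName` is hypothetical, and the regular expression encodes assumptions that the text does not confirm: that the tracker number is a run of digits, that the date is always `ddMMyyyy`, and that the trailing `xx` part may be any text.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that extracts the tracker number and reported date from a file name. */
final class TrackingFileName {

    // Assumed layout: Log_<tracker digits>_<ddMMyyyy>_<any suffix>.txt
    private static final Pattern PATTERN =
            Pattern.compile("Log_(\\d+)_(\\d{8})_(.+)\\.txt");
    private static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("ddMMyyyy");

    final int trackerNumber;
    final LocalDate reportedDate;

    private TrackingFileName(int trackerNumber, LocalDate reportedDate) {
        this.trackerNumber = trackerNumber;
        this.reportedDate = reportedDate;
    }

    /** Parses a bare file name (no path); rejects names that do not match the assumed layout. */
    static TrackingFileName parse(String fileName) {
        Matcher m = PATTERN.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        int tracker = Integer.parseInt(m.group(1));        // "0533" becomes 533
        LocalDate date = LocalDate.parse(m.group(2), DATE); // "13042012" becomes 2012-04-13
        return new TrackingFileName(tracker, date);
    }
}
```

For the example name, `TrackingFileName.parse("Log_0533_13042012_xx.txt")` yields tracker 533 and the date 2012-04-13.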


Apart from the name, the file has other attributes that the file system provides. Namely:

Last modification date
when the contents of the file were last modified
Size
how big the file contents are
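In Java, both attributes can be read without opening the file. The following is a minimal sketch using the standard `java.nio.file` API; the class `FileProperties` is a hypothetical name, and the actual module may well use a different API (for example `java.io.File`).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Instant;

/** Hypothetical value object holding what the file system knows about a file. */
final class FileProperties {
    final String fileName;
    final Instant lastModified;
    final long sizeInBytes;

    private FileProperties(String fileName, Instant lastModified, long sizeInBytes) {
        this.fileName = fileName;
        this.lastModified = lastModified;
        this.sizeInBytes = sizeInBytes;
    }

    /** Reads name, last modification date and size; the file contents are never read. */
    static FileProperties read(Path file) throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
        return new FileProperties(
                file.getFileName().toString(),      // name alone, without the path
                attrs.lastModifiedTime().toInstant(),
                attrs.size());
    }
}
```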

File processing lifecycle

The goal is to show the three pieces of information described at the beginning. Given the information that is available in the file name and properties, the system needs to keep track of these. Here is what is looked at, and how.

The file name (alone, without the path) uniquely identifies a file in the system. When a file is discovered in the file system, its properties are first analyzed, as described above. If the file is new, an entry is added to the system, with the properties stored next to the file name.

If an entry already exists for the file name, the last modification date must be checked. If the date of the file differs from the one stored in the entry, the newly found file takes precedence: its last modification date and its size replace the values in the entry, which are discarded.

Throughout this process, the contents of the file are never inspected.
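The lifecycle above boils down to a small decision per file. The sketch below expresses it with an in-memory map standing in for the database table; `FileRegistry`, `Entry` and `Outcome` are hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the table that stores file information. */
final class FileRegistry {

    enum Outcome { NEW, CHANGED, UNCHANGED }

    static final class Entry {
        final long lastModifiedMillis;
        final long sizeInBytes;

        Entry(long lastModifiedMillis, long sizeInBytes) {
            this.lastModifiedMillis = lastModifiedMillis;
            this.sizeInBytes = sizeInBytes;
        }
    }

    // The bare file name is the unique key
    private final Map<String, Entry> entries = new HashMap<>();

    /** Records what was found in the file system and tells what kind of change it was. */
    Outcome report(String fileName, long lastModifiedMillis, long sizeInBytes) {
        Entry known = entries.get(fileName);
        if (known == null) {
            // Unknown file name: add a new entry
            entries.put(fileName, new Entry(lastModifiedMillis, sizeInBytes));
            return Outcome.NEW;
        }
        if (known.lastModifiedMillis != lastModifiedMillis) {
            // Known file with a different date: the new date and size replace the old ones
            entries.put(fileName, new Entry(lastModifiedMillis, sizeInBytes));
            return Outcome.CHANGED;
        }
        return Outcome.UNCHANGED;
    }

    Entry entry(String fileName) {
        return entries.get(fileName);
    }
}
```

Note that only the last modification date triggers an update in this sketch; a file whose size changed while its date stayed the same would be treated as unchanged, which matches the rule stated above.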

Note: At this stage in the development, nothing is tracked while the files' contents are parsed. Therefore, no errors are reported in the dashboard yet.

Functional architecture

The architectural components involved in making the webpage available are the following:

A relational database
one specific table is used to store the files information: gps.uva_trackingfile_parsing
the relations between projects and trackers exist regardless of this file parsing system, but they are also taken into account to find out which trackers to pay attention to when a project is selected
A web front-end
the system offers the information to the user as a web page. This information is taken from the database
A set of daemons
they gather information on their own and feed the database

Therefore, the whole system relies on the database to exchange information. Only what is present there will be shown.

Daemons

Two sets of daemons exist in the system related to the files:

  • One file detection daemon
  • Several file parsing daemons

For the moment they are kept separate, as parsing the files is no easy chore (lots of details need to be taken into account). It has been agreed that the file information be checked once every two hours. Therefore, each daemon is launched independently every 2 hours to carry out its task.

File detection

The files need to be found. Therefore, something needs to scout the incoming files directories, detect changes and inform the system. In computing, a daemon is well suited to this task, as no user interaction is required.

This program (the daemon) exists only to run the following steps periodically:

  1. Traverse the incoming files directory and look for:
    • New files
    • Changes in existing files
  2. Report the detected changes to the system
  3. Sleep for a while and then start again

To keep the system simple, it has been developed as a web application without any web interface rather than as a typical operating system daemon, but this is just a minor implementation detail.

If errors are encountered when trying to find out the files' properties, they are output to log files.
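One detection pass (steps 1 and 2 of the list above, plus the error logging) might look like the following sketch. It assumes a single flat incoming directory and files named `Log_*.txt`; the class `IncomingDirectoryScan` and the `reporter` callback are hypothetical, and `java.util.logging` is used only to keep the example free of dependencies.

```java
import java.io.IOException;
import java.nio.file.DirectoryIteratorException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical single pass over the incoming files directory. */
final class IncomingDirectoryScan {

    private static final Logger LOG = Logger.getLogger(IncomingDirectoryScan.class.getName());

    /**
     * Hands every Log_*.txt file found directly inside dir to the reporter.
     * Problems are logged instead of aborting the pass. Returns the number of files reported.
     */
    static int scanOnce(Path dir, Consumer<Path> reporter) {
        int reported = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "Log_*.txt")) {
            for (Path file : files) {
                try {
                    reporter.accept(file);
                    reported++;
                } catch (RuntimeException e) {
                    // One bad file must not stop the others from being reported
                    LOG.log(Level.WARNING, "Could not process " + file, e);
                }
            }
        } catch (IOException | DirectoryIteratorException e) {
            LOG.log(Level.SEVERE, "Could not traverse " + dir, e);
        }
        return reported;
    }
}
```

Step 3 (sleeping and starting again) is deliberately left out here, since it is a matter of scheduling rather than of traversal.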

File parsing

Parsing deals with extracting the information from the raw data inside the files and transforming it into useful information in the system. This is no easy task: the raw data format is complex and a lot of information needs to be extracted. Because of the way the system gathers the raw data, checks are also needed to ensure data consistency. In addition, different tracker versions provide data in different formats.

All in all, many things can go wrong during parsing and many small details need to fall into place to guarantee that the right information is extracted. Therefore, the current parsers have been left untouched for the moment.

They are implemented as Perl scripts launched by the operating system's cron.

Some conversations are being held already about improving parsing. When the modifications actually take place, it will be a good moment to include reporting in them. Then, this reporting can be fed into the database and, in turn, be displayed on the dashboard.

Software architecture

The part of UvA-BiTS related to the files dashboard is developed in Java and divided into several Maven modules.

  • The Model module is in charge of the persistence model. It is built upon current de facto standard frameworks such as JPA, Spring and Hibernate.
  • The Web module provides a user-friendly web interface to the user. It relies on Spring and Tapestry frameworks.
  • The Daemon module populates the database with basic information about files. It is built with Spring and Tapestry.

Model

The Model module is the basis to access the database. Both the Web and the Daemon modules rely on the services provided by this module to interact with the database.

It provides Java abstractions of the persistent entities, along with operations to manipulate them. It follows the DAO design pattern to abstract low-level persistence of entities, and the Service (or Façade) design pattern to group common operations and provide a more natural interface to the clients of this module.
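To make the two patterns concrete, here is a deliberately small sketch. All names (`TrackingFile`, `TrackingFileDao`, `TrackingFileService`) are hypothetical, and the DAO is backed by a plain map instead of Hibernate/JPA so that the example stays self-contained; it shows the division of responsibilities, not the module's real interfaces.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical persistent entity: the information kept about one file. */
final class TrackingFile {
    final String fileName;
    final int trackerNumber;
    final LocalDate reportedDate;

    TrackingFile(String fileName, int trackerNumber, LocalDate reportedDate) {
        this.fileName = fileName;
        this.trackerNumber = trackerNumber;
        this.reportedDate = reportedDate;
    }
}

/** DAO: hides how entities are stored and retrieved. */
interface TrackingFileDao {
    void save(TrackingFile file);
    List<TrackingFile> findAll();
}

/** In-memory stand-in for a database-backed DAO, for illustration only. */
final class InMemoryTrackingFileDao implements TrackingFileDao {
    private final Map<String, TrackingFile> byName = new LinkedHashMap<>();

    public void save(TrackingFile file) {
        byName.put(file.fileName, file); // the file name is the unique key
    }

    public List<TrackingFile> findAll() {
        return new ArrayList<>(byName.values());
    }
}

/** Service (facade): offers clients an operation phrased in their own terms. */
final class TrackingFileService {
    private final TrackingFileDao dao;

    TrackingFileService(TrackingFileDao dao) {
        this.dao = dao;
    }

    /** Files of the given trackers whose reported date lies in [from, to]. */
    List<TrackingFile> filesFor(Set<Integer> trackers, LocalDate from, LocalDate to) {
        List<TrackingFile> result = new ArrayList<>();
        for (TrackingFile f : dao.findAll()) {
            boolean inInterval = !f.reportedDate.isBefore(from) && !f.reportedDate.isAfter(to);
            if (inInterval && trackers.contains(f.trackerNumber)) {
                result.add(f);
            }
        }
        return result;
    }
}
```

A database-backed DAO would push the filtering into a query rather than loading every row; the filtering is done in the service here only to keep the sketch short.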

Persistence is handled by Hibernate, hidden behind the JPA standard where possible. The Spring framework is used to wire the objects together. Spring is also used to handle ACID transactions declaratively.

Daemon

The Daemon exists to populate the database with file information. Its sole responsibility is to find new files or changes in existing files and to make this information explicit. It delegates all persistence tasks to the Model module.

The Java language provides direct access to the underlying local file system and file properties. The module's objects are wired together by Spring at run time. The Daemon has been devised as a web application, using Tapestry for the sole purpose of loading the Spring context. This context includes the core loop as a Task, along with a TaskScheduler on which the Task is scheduled to launch every 2×60×60×1000 = 7200000 milliseconds (2 hours). The scheduling itself is handled by Spring.
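The scheduling idea can be illustrated without Spring. The sketch below swaps in the JDK's `ScheduledExecutorService` for Spring's TaskScheduler, so it is not the module's actual configuration; `DetectionSchedule` is a hypothetical name, and starting the first pass immediately (initial delay 0) is an assumption.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical fixed-rate scheduling of the detection pass, using only the JDK. */
final class DetectionSchedule {

    private static final Logger LOG = Logger.getLogger(DetectionSchedule.class.getName());

    // 2 hours x 60 minutes x 60 seconds x 1000 milliseconds = 7,200,000 ms
    static final long PERIOD_MILLIS = 2L * 60 * 60 * 1000;

    /** Runs detectionPass now and then every two hours until the scheduler is shut down. */
    static ScheduledFuture<?> start(ScheduledExecutorService scheduler, Runnable detectionPass) {
        Runnable guarded = () -> {
            try {
                detectionPass.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all later runs, so log it instead
                LOG.log(Level.SEVERE, "Detection pass failed", e);
            }
        };
        return scheduler.scheduleAtFixedRate(guarded, 0, PERIOD_MILLIS, TimeUnit.MILLISECONDS);
    }
}
```

A caller would create the scheduler with `Executors.newSingleThreadScheduledExecutor()`, pass it to `start` together with the detection pass, and call `shutdown()` on it when the application stops.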

The beauty of having the Daemon deployed as a web application is that the web application container provides a number of facilities. The main one is a centralized management interface, already developed, that can very easily be used to list, launch, stop, replace and otherwise operate on the deployed applications. So, in one place, one can see everything that is related to the system. Other advantages include centralized logging with automatic log rotation, a confined running space, security features (in case they are needed), thread management, connectors, etc.

Web layer