Difference between revisions of "Files Dashboard"
Revision as of 11:31, 29 June 2012
The files dashboard is a web page to get an overview of the raw data files of trackers. For a chosen project and a date interval, it displays a table with a row for every day in the date interval and a column for every tracker that belongs to the project.
Each cell of the table then contains information about the files of that tracker for that day; in principle, one box for each file, and usually no more than one file per day. The purpose is to have information at a glance, so if there are no files for a tracker on a given day, an empty box is shown.
Three pieces of information are to be shown for each file:
- Whether a file exists for a given day and tracker
- How big the file is
- Whether errors happened during parsing of its contents
User manual
If you have access to the 'services' machine, you should be able to use your login/password to access the URL: https://services.flysafe.sara.nl/uvabits/projectadmin/trackersdashboard
The URL takes three parameters, which you can fill in by hand, for example:
...?projectId=2&startDay=20120401&endDay=20120430
Parameters
- projectId
- is the identifier of the project that you want to query (currently only identifiers 1 to 17 exist)
- startDay
- is the first day you want to request in format yyyyMMdd, where
- yyyy is the 4-digit year (e.g. 2012)
- MM is the 2-digit month within the year (e.g. 05 for May)
- dd is the 2-digit day of the month (e.g. 01 for the first day of the month)
- endDay
- is the last day you want included in the report, with the same format as the startDay
Note: The parameters are separated from the resource URL by the standard URL query separator (a question mark, so '?'), and they are separated from each other by the standard parameter separator (the ampersand, so '&').
This has a number of advantages. Among others, you can make a bookmark of a specific report.
If projectId is missing from the URL, no project is selected and a message stating that no data has been found is shown.
If the dates do not make sense, the current date is selected by default.
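The parameter handling described above can be sketched as follows. This is only an illustration, not the dashboard's actual code: the class DashboardQuery and its methods are hypothetical, and it assumes that each date falls back to the current date on its own when it cannot be parsed. What happens when startDay is later than endDay is not described here, so the sketch leaves that case alone.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that reads the three dashboard parameters from a query string. */
final class DashboardQuery {
    // BASIC_ISO_DATE parses yyyyMMdd strictly (it also tolerates a trailing offset).
    private static final DateTimeFormatter DAY = DateTimeFormatter.BASIC_ISO_DATE;

    final Integer projectId;   // null when missing or not a number
    final LocalDate startDay;
    final LocalDate endDay;

    private DashboardQuery(Integer projectId, LocalDate startDay, LocalDate endDay) {
        this.projectId = projectId;
        this.startDay = startDay;
        this.endDay = endDay;
    }

    /** Parses the text after the '?'; 'today' is the fallback for unusable dates. */
    static DashboardQuery parse(String query, LocalDate today) {
        Map<String, String> params = new HashMap<>();
        if (query != null) {
            for (String pair : query.split("&")) {
                int eq = pair.indexOf('=');
                if (eq > 0) {
                    params.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
        }
        return new DashboardQuery(
                parseId(params.get("projectId")),
                parseDay(params.get("startDay"), today),
                parseDay(params.get("endDay"), today));
    }

    private static Integer parseId(String raw) {
        try {
            return raw == null ? null : Integer.valueOf(raw);
        } catch (NumberFormatException e) {
            return null; // assumption: treated the same as a missing projectId
        }
    }

    private static LocalDate parseDay(String raw, LocalDate fallback) {
        if (raw == null) {
            return fallback;
        }
        try {
            return LocalDate.parse(raw, DAY);
        } catch (DateTimeParseException e) {
            return fallback; // e.g. 20120231 or 2012-04-01
        }
    }
}
```

With the example URL above, DashboardQuery.parse("projectId=2&startDay=20120401&endDay=20120430", LocalDate.now()) yields project 2 and the interval from 1 to 30 April 2012.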
Interaction
To easily select the parameters, the web page shows a bar at the top, with a couple of shortcuts
- for months (namely, the current month as the last in the list, and then 11 months in advance), which set the startDay and endDay parameters appropriately
- for projects (there's a dropdown list of all the project names; the request is sent to change the project by clicking on the submit button to the right of the list).
Files processing
Raw tracking data comes into the system sent as plain text files through a Dropbox account. Dropbox makes the files available in the server file system.
File properties
Once accessible through the local file system, they can be queried, read and parsed to extract the tracking information they contain.
The file name is expected to have the form Log_0533_13042012_xx.txt. This provides:
- The tracker number
- (in the example, 533)
- The reported date
- (in the example, April 13th 2012)
Apart from the name, the file has other attributes that the file system provides. Namely:
- last modification date
- tells when something was modified in the file for the last time.
- Size
- how big the file contents are
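As an illustration of how these properties could be obtained, here is a minimal sketch. The class TrackingFileInfo is hypothetical. The regular expression assumes that the tracker number is a run of digits, that the date is written as ddMMyyyy (as in the example), and that the trailing part before .txt needs no interpretation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for what a file name such as Log_0533_13042012_xx.txt provides. */
final class TrackingFileInfo {
    // Log_<tracker>_<ddMMyyyy>_<suffix>.txt; the suffix is matched but not interpreted.
    private static final Pattern NAME = Pattern.compile("Log_(\\d+)_(\\d{8})_.*\\.txt");
    private static final DateTimeFormatter REPORTED = DateTimeFormatter.ofPattern("ddMMyyyy");

    final int tracker;
    final LocalDate reportedDate;

    private TrackingFileInfo(int tracker, LocalDate reportedDate) {
        this.tracker = tracker;
        this.reportedDate = reportedDate;
    }

    /** Extracts tracker number and reported date from the bare file name (no path). */
    static TrackingFileInfo fromName(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        return new TrackingFileInfo(
                Integer.parseInt(m.group(1)),          // "0533" becomes 533
                LocalDate.parse(m.group(2), REPORTED)); // "13042012" becomes 2012-04-13
    }

    /** Size of the file contents in bytes, as reported by the file system. */
    static long sizeOf(Path file) throws IOException {
        return Files.size(file);
    }

    /** Last modification date, as reported by the file system. */
    static Instant lastModifiedOf(Path file) throws IOException {
        return Files.getLastModifiedTime(file).toInstant();
    }
}
```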
File processing lifecycle
The goal is to show the three pieces of information described at the beginning. Considering what is available in the file name and properties, the system needs to keep track of these. Here is what is looked at and how.
The file name (alone, without the path) uniquely identifies a file in the system. When a file is discovered in the file system, its properties are first analyzed, as described above. If this is a new file, a new entry is added to the system, with the properties stored next to the file name.
If an entry already exists for the file name, the last modification date must be checked. If the date of the file differs from the one in the system entry, the newly found file takes precedence: its last modification date and file size override the values in the system entry, which are discarded.
Throughout this process, the file's contents are never inspected.
Note: At this stage of development, nothing is tracked while the files' contents are parsed. Therefore, no errors are reported in the dashboard yet.
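The decision rule of this lifecycle can be sketched as below. The class FileRegistry is hypothetical and keeps its entries in an in-memory map purely for illustration; in the real system the entries live in the database.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the system's file entries. */
final class FileRegistry {
    enum Outcome { ADDED, UPDATED, UNCHANGED }

    static final class Entry {
        Instant lastModified;
        long size;

        Entry(Instant lastModified, long size) {
            this.lastModified = lastModified;
            this.size = size;
        }
    }

    // Keyed by the bare file name, which identifies the file uniquely.
    private final Map<String, Entry> entries = new HashMap<>();

    /** Applies the lifecycle rule to one discovered file; contents are never read. */
    Outcome register(String fileName, Instant lastModified, long size) {
        Entry known = entries.get(fileName);
        if (known == null) {
            entries.put(fileName, new Entry(lastModified, size));
            return Outcome.ADDED;
        }
        if (!known.lastModified.equals(lastModified)) {
            // The newly found file wins: date and size are both overwritten.
            known.lastModified = lastModified;
            known.size = size;
            return Outcome.UPDATED;
        }
        return Outcome.UNCHANGED;
    }

    Entry lookup(String fileName) {
        return entries.get(fileName);
    }
}
```

Note that, following the text, only the last modification date triggers an update; a size change with an unchanged date goes unnoticed in this sketch.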
Functional architecture
The architectural components involved in making the webpage available are the following:
- A relational database
- one specific table is used to store the files information: gps.uva_trackingfile_parsing
- the relations between projects and trackers exist regardless of this file parsing system, but they are also taken into account to find out which trackers to pay attention to when a project is selected
- A web front-end
- the system offers the information to the user as a web page. This information is taken from the database
- A set of daemons
- they gather information on their own and feed the database
Therefore, the whole system relies on the database to exchange information. Only what is present there will be shown.
Daemons
Two sets of daemons exist in the system related to the files:
- One file detection daemon
- Several file parsing daemons
For the moment they are kept separate because parsing the files is no easy chore (lots of details need to be taken into account). The agreement is that the file information must be checked once every two hours. Therefore, each of the daemons is launched, independently, every 2h to fulfill its task. A crontab expression makes sure that the daemon launches at 10min past every even hour. The end user can then expect the dashboard to display the up-to-date list of files at 15min past every even hour.
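To make the schedule concrete, the following sketch computes the next launch time from a given moment. The class LaunchTimes is hypothetical and is not part of the system. The actual crontab entry is not reproduced on this page; an entry of the form 10 */2 * * * would match the description, but that is an assumption.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper mirroring "10min past every even hour". */
final class LaunchTimes {
    /** Returns the first launch time strictly after 'now'. */
    static LocalDateTime nextLaunch(LocalDateTime now) {
        // Start at 10 past the current hour, then move to an even hour if needed.
        LocalDateTime candidate = now.truncatedTo(ChronoUnit.HOURS).withMinute(10);
        if (candidate.getHour() % 2 != 0) {
            candidate = candidate.plusHours(1);
        }
        // If that moment has already passed, take the next even hour.
        while (!candidate.isAfter(now)) {
            candidate = candidate.plusHours(2);
        }
        return candidate;
    }

    /** Moment at which the dashboard is expected to be up to date for a launch. */
    static LocalDateTime expectedFresh(LocalDateTime launch) {
        return launch.plusMinutes(5);
    }
}
```

For instance, at 11:31 the next launch is at 12:10, and the dashboard can be expected to be current at 12:15.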
File detection
The files need to be found. Therefore, something needs to scout the incoming files directories, detect changes and inform the system. In computing, the concept of a daemon comes in handy for this task, as no user interaction is required here.
The sole purpose of this program (the daemon) is to do the following every once in a while:
- Traverse the incoming files directory and look for:
- New files
- Changes in existing files
- Report the detected changes to the system
- Sleep for a while and then start again
To keep the system simple, it has been developed as a web application without any web interface rather than as a typical operating system daemon, but this is just a minor implementation detail.
If errors are encountered when trying to find out the files' properties, they are output to log files.
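One pass of the detection loop could look like the sketch below. It is only an illustration under stated assumptions: the class IncomingScanner is hypothetical, it looks at a single directory without descending into subdirectories, it only considers *.txt files, and it remembers what it saw in memory instead of reporting to the database.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical scanner: one call to scan() is one pass over the incoming directory. */
final class IncomingScanner {
    // Last modification date seen for each file name in earlier passes.
    private final Map<String, FileTime> seen = new HashMap<>();

    /** Returns the names of files that are new or whose last modification date changed. */
    List<String> scan(Path incomingDir) throws IOException {
        List<String> changed = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(incomingDir, "*.txt")) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                String name = file.getFileName().toString();
                FileTime modified = Files.getLastModifiedTime(file);
                FileTime previous = seen.put(name, modified);
                if (previous == null || !previous.equals(modified)) {
                    changed.add(name); // new file, or existing file that was modified
                }
            }
        }
        return changed;
    }
}
```

The caller would report the returned names to the system and then wait until the next scheduled pass.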
File parsing
Parsing deals with extracting the information out of the raw data inside the files and transforming it into useful information in the system. This is no easy task: the raw data format is complex and lots of information needs to be extracted. Because of the way the system gathers the raw data, there are also checks that need to be performed to ensure data consistency. In addition, different tracker versions provide data in different formats.
All in all, many things can go wrong during parsing and many small details need to fall into place to guarantee that the right information is extracted. Therefore, the current parsers have been left untouched for the moment.
They are implemented as Perl scripts that the operating system cron launches.
Some conversations are being held already about improving parsing. When the modifications actually take place, it will be a good moment to include reporting in them. Then, this reporting can be fed into the database and, in turn, be displayed on the dashboard.
Software architecture
The part of UvA-BiTS related to the file dashboard is developed in Java, divided in different Maven modules.
- The Model module is in charge of the persistence model. It is built upon current, de facto standard frameworks such as JPA, Spring and Hibernate.
- The Web module provides a user-friendly web interface to the user. It relies on Spring and Tapestry frameworks.
- The Daemon module populates the database with basic information about files. It is built with Spring and Tapestry.
Model
The Model module is the basis to access the database. Both the Web and the Daemon modules rely on the services provided by this module to interact with the database.
It provides Java abstractions of the persistent entities, along with operations to manipulate them. It follows the DAO design pattern to abstract low-level persistence of entities, and the Service (or Façade) design pattern to group common operations and provide a more natural interface to the clients of this module.
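The following sketch illustrates how a DAO and a Service can be layered. All names in it (TrackingFile, TrackingFileDao, InMemoryTrackingFileDao, TrackingFileService) are hypothetical, and the DAO is backed by an in-memory map only to keep the example self-contained; the actual module persists its entities through JPA and Hibernate.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Persistent entity, reduced to a few illustrative fields. */
final class TrackingFile {
    final String name;
    final int tracker;
    final LocalDate day;
    final long size;

    TrackingFile(String name, int tracker, LocalDate day, long size) {
        this.name = name;
        this.tracker = tracker;
        this.day = day;
        this.size = size;
    }
}

/** DAO: hides how and where the entities are stored. */
interface TrackingFileDao {
    void save(TrackingFile file);

    List<TrackingFile> findByTrackerAndDay(int tracker, LocalDate day);
}

/** Stand-in implementation; a real one would delegate to the persistence framework. */
final class InMemoryTrackingFileDao implements TrackingFileDao {
    private final Map<String, TrackingFile> byName = new LinkedHashMap<>();

    @Override
    public void save(TrackingFile file) {
        byName.put(file.name, file); // the file name identifies the entry
    }

    @Override
    public List<TrackingFile> findByTrackerAndDay(int tracker, LocalDate day) {
        List<TrackingFile> result = new ArrayList<>();
        for (TrackingFile f : byName.values()) {
            if (f.tracker == tracker && f.day.equals(day)) {
                result.add(f);
            }
        }
        return result;
    }
}

/** Service (facade): the interface that the Web and Daemon modules would talk to. */
final class TrackingFileService {
    private final TrackingFileDao dao;

    TrackingFileService(TrackingFileDao dao) {
        this.dao = dao;
    }

    void record(TrackingFile file) {
        dao.save(file);
    }

    /** Files to show in one dashboard cell (one tracker, one day). */
    List<TrackingFile> cell(int tracker, LocalDate day) {
        return dao.findByTrackerAndDay(tracker, day);
    }
}
```

Because clients only see the service and the DAO interface, the storage behind them can change without affecting the Web or Daemon modules.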
Persistence is handled by Hibernate, hidden behind the JPA standard where possible. The Spring framework is used to wire the objects together. Spring is also used to handle ACID transactions declaratively.
Web layer
The Web module is a web application that acts as View and Controller for user interaction with the system.
The user presentation is structured mainly in two sections (System administration and Project administration), although for the moment only Project administration has any content, namely the trackers dashboard.
It is developed with Tapestry, a component-oriented web framework that eases development and avoids having to deal with the low level Java servlets API. It uses Spring to establish the links among object instances.
The trackers dashboard web page delegates persistence activities to the Model module to read the information to populate the projects list and to search for the required file information.
Daemon
The Daemon exists to populate the database with file information. Its sole responsibility is finding new files or changes in existing files and making this information explicit. It delegates all the persistence tasks to the Model module.
The Java language provides direct access to the underlying local file system and file properties. The module objects are woven with Spring at run time. It has been devised as a web application, using Tapestry for this sole purpose, which simply loads the Spring context. This context includes the core loop as a Task along with a TaskScheduler where the Task is scheduled to launch every 2×60×60×1000 = 7200000 milliseconds (2 hours). This is handled by Spring.
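The scheduling idea can be illustrated without Spring. The sketch below uses the JDK's ScheduledExecutorService as a stand-in for the Spring TaskScheduler, so it is not the module's actual configuration; the class DetectionSchedule is hypothetical, and running at a fixed rate (rather than with a fixed delay between runs) is an assumption.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for scheduling the core loop every two hours. */
final class DetectionSchedule {
    // 2 hours x 60 minutes x 60 seconds x 1000 milliseconds = 7200000 ms
    static final long PERIOD_MS = 2L * 60 * 60 * 1000;

    /** Runs the task immediately and then again every periodMs milliseconds. */
    static ScheduledFuture<?> start(ScheduledExecutorService scheduler, Runnable task, long periodMs) {
        return scheduler.scheduleAtFixedRate(task, 0, periodMs, TimeUnit.MILLISECONDS);
    }
}
```

A caller would create the scheduler with Executors.newSingleThreadScheduledExecutor() and pass DetectionSchedule.PERIOD_MS as the period.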
The beauty of having the Daemon deployed as a web application is that the web application container provides a number of facilities. The main one is a centralized, ready-made management interface that can very easily be used to list, launch, stop, replace and otherwise operate on the deployed applications. So, in one place, one can see everything that is related to the system. Other benefits include centralized logging that is automatically rolled over, a confined running space, security features (in case they are needed), thread management, connectors, etc.
Deployment
The system is running on the Services machine services.flysafe.sara.nl.
The different supporting services are the following:
- PostgreSQL
- A relational database management server where the system database is stored
- Apache
- A web server to provide access to visualizations
- It provides HTTP authentication and authorization
- Tomcat
- A web application container that hosts the web applications (the trackers dashboard and the daemon)
- It is to be reached through the Apache server (so no direct access)
- It relies on authentication and authorization from Apache