Welcome to DEBS.org

This site provides information about upcoming and previous Distributed Event-Based Systems (DEBS) conferences and workshops. It also contains papers from the DEBS PhD workshops.

Following the success of the inaugural DEBS 2007 conference in Toronto, the Second DEBS 2008 conference was held in Rome, Italy. The Third DEBS 2009 took place at the Vanderbilt University Campus, Nashville, TN, USA. The Fourth DEBS 2010 was held in Cambridge, United Kingdom. Since then, DEBS conferences have been held each year in locations spanning Europe, North America, and Asia.

DEBS 2014 Grand Challenge: Smart homes

The ACM DEBS 2014 Grand Challenge is the fourth in a series of challenges which seek to provide a common ground and evaluation criteria for a competition aimed at both research and industrial event-based systems. The goal of the Grand Challenge competition is to implement a solution to a problem provided by the Grand Challenge organizers. The DEBS Grand Challenge series provides problems which are relevant to the industry at large and which allow event-based systems to be evaluated using real-life data and queries.

With this year’s challenge, we seek to demonstrate the applicability of event-based systems to provide scalable, real-time analytics over high volume sensor data. The underlying scenario stems from the smart grid domain and addresses the analysis of energy consumption measurements. Specifically, we target the following problems: (1) short-term load forecasting and (2) load statistics for real-time demand management. The data for the challenge is synthesized based on real-world profiles collected from a number of smart-home installations deployed in Germany.

Data

The Grand Challenge 2014 data set is based on recordings originating from smart plugs deployed in private households. A smart plug acts as a proxy between the wall power outlet and the device connected to it. It is equipped with a range of sensors which measure different power-consumption-related values. For the purpose of the DEBS 2014 Grand Challenge, a number of smart plugs have been deployed in private households, with data being collected roughly every second for each sensor in each smart plug. It has to be noted that the data set was collected in an uncontrolled, real-world environment, which implies the possibility of malformed data as well as missing measurements.

For the DEBS 2014 Grand Challenge we assume a hierarchical structure with a house, identified by a unique house id, being the topmost entity. Every house contains one or more households, identified by a unique household id (within a house). Every household contains one or more smart plugs, each identified by a unique plug id (within a household). Every smart plug contains exactly two sensors:
(1) a load sensor measuring the current load (in Watts) and (2) a work sensor measuring the total accumulated work (in kWh) since the start (or reset) of the sensor.

The schema of the base stream is as follows:

id – a unique identifier of the measurement [32 bit unsigned integer value]
timestamp – timestamp of measurement (number of seconds since January 1, 1970, 00:00:00 GMT) [32 bit unsigned integer value]
value – the measurement [32 bit floating point]
property – type of the measurement: 0 for work or 1 for load [boolean]
plug_id – a unique identifier (within a household) of the smart plug [32 bit unsigned integer value]
household_id – a unique identifier of a household (within a house) where the plug is located [32 bit unsigned integer value]
house_id – a unique identifier of a house where the household with the plug is located [32 bit unsigned integer value]

A comma-separated snapshot of the data:

2967740693,1379879533,82.042,0,1,0,12
2967740694,1379879533,105.303,1,1,0,12
2967740695,1379879533,208.785,0,2,0,12
2967740696,1379879533,217.717,1,2,0,12
2967740697,1379879533,2.207,0,3,0,12
2967740698,1379879533,2.081,1,3,0,12
2967740699,1379879533,0,1,3,1,12
2967740700,1379879533,0.313,0,3,1,12
2967740701,1379879533,0,1,3,2,12

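As a minimal illustration (not part of the challenge specification), one line of this stream could be parsed in Python as follows; the field names mirror the schema above:

from collections import namedtuple

# Field order follows the base stream schema above.
Measurement = namedtuple(
    "Measurement",
    "id timestamp value property plug_id household_id house_id")

def parse_line(line):
    # Parse one comma-separated measurement line.
    f = line.strip().split(",")
    return Measurement(int(f[0]), int(f[1]), float(f[2]),
                       int(f[3]), int(f[4]), int(f[5]), int(f[6]))

m = parse_line("2967740693,1379879533,82.042,0,1,0,12")
assert m.property == 0  # 0 = work, 1 = load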
The synthesized data file contains over 4,055 million measurements for 2125 plugs distributed across 40 houses. The collected measurements cover a period of one month, with the first timestamp equal to 1377986401 (Sept. 1st, 2013, 00:00:00) and the last timestamp equal to 1380578399 (Sept. 30th, 2013, 23:59:59). All events in the data file are sorted by the timestamp value. Events with the same timestamp are ordered randomly with respect to each other.

Work values are cumulative. The resolution of the work values is only 1 kWh, so a change becomes visible only after enough consumption has accumulated. Load values are therefore better suited for capturing smaller amounts of used energy. The figure below shows the load and work charts for a plug during the same time interval. The load chart shows an increase from 200 W to 250 W between 15:36:02 and 15:36:24; this is not visible in the work chart, as its resolution is too coarse to reflect the change. Work values are accumulated even if the data transmission is interrupted. Therefore, the figure shows a jump in the work values after the data gap between 9/20/2013 15:36:31 and 9/21/2013 19:03:40, although no load measurements are available during this period.

Data must be consumed in a streaming fashion. This means that a system currently processing an event with timestamp ts can only work with events whose timestamps are smaller than or equal to ts, i.e., those it has already seen. Specifically, no precomputation based on the whole data set may be made.

A small (approx. 500 MB) test file is available under this link. It contains 100 million events for the first 16 hours. The complete data file is available upon explicit request via an FTP server. In order to obtain the data file, please first register your abstract with EasyChair and subsequently contact us at Zbigniew.Jerzak_at_sap.com to obtain the URI and credentials. The sha1sum of the full data file is: f1cb3fc60604f6c8abde7fbc21a0f59f7a510add

Query 1 – Load Prediction

The goal of this query is to make load forecasts based on the current load measurements and a model that was learned over historical data. Such forecasts can be used in demand side management to proactively influence load and adapt it to the supply situation, e.g., current production of renewable energy sources.

Specifically, the query should provide a forecast of the load for (1) each house, i.e., house-based and (2) for each individual plug, i.e., plug-based. The forecast for each house and plug should be made based on the current load of the connected plugs and a plug specific prediction model.

Note that this query does not target the prediction model itself. Instead, we aim at a sophisticated interplay between modules for model learning that operate on long-term (historic) data and components that use the model on live data. Therefore, we suggest the following slice-based prediction method. We divide the whole period (from t_{start} = September 1st, 2013, 00:00:00 till t_{end} = September 30th, 2013, 23:59:59) covered by the Grand Challenge data set into N slices of equal size |s| seconds. We assume that the first slice s_0 has an index of 0 (zero). The start of the slice s_i is given as a timestamp ts_{s_i}^{start} with second precision equal to t_{start} + i*|s|. The end of the slice s_i is given as a timestamp ts_{s_i}^{end} with second precision equal to t_{start} + ((i+1) * |s|) – 1. Assuming the timestamp ts of the current event belongs to the slice s_i (ts \in [ts_{s_i}^{start}, ts_{s_i}^{end}]), the model determines the prediction of the load value L for the whole slice s_{i+2} as follows:

L(s_{i+2}) = ( avgLoad(s_i) + median( { avgLoad(s_j) } ) ) / 2

where avgLoad(s_i) is the current average load for slice s_i. The value of avgLoad(s_i), in the case of plug-based prediction, is calculated as the average of all load values reported by the given plug with timestamps \in s_i. In the case of house-based prediction, avgLoad(s_i) is calculated as the sum of the average values for each plug within the house. { avgLoad(s_j) } is the set of average load values for all slices s_j such that:

s_j = s_{i + 2 - n*k}

where k is the number of slices in a 24 hour period and n is a natural number with values between 1 and floor((i + 2)/k). The values in { avgLoad(s_j) } are calculated analogously to avgLoad(s_i) for the plug-based and house-based (sum of averages) variants.

Let us consider the following plug-based example which assumes a slice size of 15 minutes. The latest event has a timestamp indicating Thursday 00:25:00 and the first event in the data set has a timestamp of Monday, 00:00:00. For this example we assume that we receive a load event for a given plug every second on the second. Let us further assume that the historical average load values for the given plug for the time between 00:45:00 and 01:00:00 are given as Monday=9, Tuesday=3, and Wednesday=7.

The current average load value for the given plug on Thursday for the slice s_i starting at 00:15:00 (and ending at 00:30:00) equals 8. The load forecast for the slice s_{i+2}, which starts on Thursday at 00:45:00 and ends at 01:00:00, is given by:

L(s_{i+2}) = ( 8 + median( 9, 3, 7 ) ) / 2 = 7.5

The above example assumes that the first event has a timestamp of Monday 00:00:00. Hence the index i of the current slice s_i (Thursday 00:15:00 till 00:30:00) equals 3*96+1 = 289. The index of slice s_{i+2} (Thursday 00:45:00 till 01:00:00) equals 291. The set {s_j} hence equals {s_195, s_99, s_3}, since s_195 = s_(291-1*96), s_99 = s_(291-2*96), and s_3 = s_(291-3*96). s_195 corresponds to Wednesday 00:45:00 till 01:00:00, s_99 to Tuesday 00:45:00 till 01:00:00, and s_3 to Monday 00:45:00 till 01:00:00.
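For illustration only, the prediction rule can be written as a short Python sketch; the avgLoad values are assumed to be supplied by the windowing logic of the system under test:

from statistics import median

def predict_load(avg_load_current, historical_avg_loads):
    # L(s_{i+2}) = (avgLoad(s_i) + median({avgLoad(s_j)})) / 2
    # avg_load_current: avgLoad(s_i) for the slice the current event falls into
    # historical_avg_loads: avgLoad(s_j) for s_j = s_{i+2-n*k}, n = 1..floor((i+2)/k)
    return (avg_load_current + median(historical_avg_loads)) / 2

# Worked example from the text: current slice average 8 and
# historical averages Monday=9, Tuesday=3, Wednesday=7.
assert predict_load(8, [9, 3, 7]) == 7.5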

The load prediction query must provide prediction values for both (1) houses and (2) individual plugs. For each house and plug, predictions using different slice sizes should be calculated. The slice sizes to consider are 1 min, 5 min, 15 min, 60 min, and 120 min. The output streams for house-based prediction values should adhere to the following schema:

ts – timestamp of the starting time of the slice that the prediction is made for
house_id – id of the house for which the prediction is made
predicted_load – the predicted load for the time slice starting at ts

The output streams for plug-based prediction values should contain the following information:

ts – timestamp of the starting time of the slice that the prediction is made for
house_id – id of the house for where the plug is located
household_id – the id of the household where the plug is located
plug_id – the id of the plug for which the prediction is made
predicted_load – the predicted load for the time slice starting at ts

The output streams should be updated every 30 seconds, as measured by the input event timestamps. The purpose of the update is to reflect the latest value of avgLoad(s_i) for the given slice.

Query 2 – Outliers

The goal of the outliers query is to find outliers as far as energy consumption levels are concerned. The query should answer the following question: for each house, calculate the percentage of plugs whose median load during the last hour is greater than the median load of all plugs (in all households of all houses) during the last hour. The load statistics query should generate two output streams: one for a sliding window of one hour (described above) and one for a sliding window of a whole day (24 hours). The size of the window should be determined by the timestamp (ts) contained in every event. The slide of the window is equal to one second. The value of the aggregation should be re-calculated whenever a new event enters or leaves a sliding window.

The output streams should adhere to the following schema:

ts_start – timestamp indicating the start of the window
ts_stop – timestamp indicating the end of the window
house_id – the id of the house for which the calculation was done
percentage – the percentage of plugs in this house having their median load higher than the global median

An output should be produced only when the percentage value changes for any of the houses. Specifically, no new output needs to be produced when the median load of all plugs changes but no plug moves below or above the global median load.
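As an illustrative sketch (assuming the per-plug median loads over the current window have already been computed), the percentage calculation could look as follows in Python:

from statistics import median

def outlier_percentages(plug_medians):
    # plug_medians: {(house_id, household_id, plug_id): median load}
    # over the current window. Returns {house_id: percentage of its
    # plugs whose median load exceeds the global median of all plugs}.
    global_median = median(plug_medians.values())
    per_house = {}
    for (house_id, _, _), m in plug_medians.items():
        total, above = per_house.get(house_id, (0, 0))
        per_house[house_id] = (total + 1, above + (m > global_median))
    return {h: 100.0 * above / total
            for h, (total, above) in per_house.items()}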

Implementation and Evaluation

Each Grand Challenge solution must include an evaluation section. The evaluation section should investigate the following aspects of the proposed solution:
(1) Per query throughput as a function of the workload for 10, 20, and 40 houses (average, 10th and 90th percentile)
(2) Per query latency as a function of the workload for 10, 20, and 40 houses (average, 10th and 90th percentile)
(3) For distributed systems – per query throughput and latency as function of the number of processing nodes

Please note that the evaluation should explicitly focus on providing these relative values, e.g., throughput for 10 houses as compared to the throughput for 20 houses. The absolute throughput and latency values are of minor importance only and will be used only as a sanity check against the baseline.

The original data file can be pre-processed and split into three separate input files containing 10, 20, and 40 houses. The 10-house data file should contain houses with ids from 0 to 9, and the 20-house data file should contain houses with ids from 0 to 19. Authors are furthermore explicitly encouraged to emphasize the following aspects of their solutions: (1) the training and execution of the prediction model and (2) the handling of potential data quality issues.

Submission Information

All DEBS 2014 Grand Challenge submissions must be composed of two parts: (1) a 6 page paper, in ACM SIG proceedings format, outlining the solution, highlighting its innovative aspects and presenting the evaluation and (2) a demonstration of the system – either in the form of a video or a screencast. All submissions are subject to a peer review process. Based on the review process a submission can be either accepted or rejected. All accepted submissions will be divided into two groups. Top accepted submissions (group one) will be published in the conference proceedings and will be presented during a special Grand Challenge session. All accepted submissions (groups one and two) will be invited to present their solutions as a poster/demo during the DEBS 2014 Conference.

Unless explicitly declined by the authors, all solutions will be included in the global ranking with the winner of the Grand Challenge being announced and awarded during the conference banquet. The global ranking is based on the peer review process scores and the non-functional properties of the solution. The considered non-functional properties are: scalability, throughput, and latency of the system under the assumption of result correctness. The submission process is divided into two steps:

Step 1 – until February 1st, 2014 – submission of non-binding intent for participation. The goal of this submission is to initiate the contact between the DEBS Challenge organizers and solution authors. Registered authors will be proactively notified about any questions and resulting changes to the Grand Challenge description. The submission of the intent should be done via EasyChair (abstract submission).
Step 2 – until March 7th, 2014 – submission of the final solution

DEBS Challenge participants are encouraged to contact Challenge organizers with questions and suggestions regarding the problem. The main contact points are: Zbigniew Jerzak (Zbigniew.Jerzak_at_sap.com) and Holger Ziekow (hziekow_at_agtinternational.com). All questions and answers will be made available online on the DEBS Challenge website.

All submissions are to be made via EasyChair system: https://www.easychair.org/conferences/?conf=debs2014 in the Grand Challenge track.

Unlike research track submissions, the DEBS Grand Challenge track submissions do not have to be anonymized.

Changelog

2014.02.18 – Updated the Implementation and Evaluation Section – clarified that the 10, 20 and 40 houses data sets can be preprocessed. Clarified the importance of relative and absolute values in evaluation.

2014.02.10 – Updated the description of Query 1 highlighting the result update strategy.

2014.02.07 – Extended the explanation of the calculation of the slice number s_j in Query 1. Added an explanation to the data section clarifying the consumption mode for the input data file.

2014.02.03 – Added new question (number 3) regarding the update semantics of the load prediction value.

2014.02.02 – Updated (clarified) description of Query 1

2014.01.30 – Updated description of the data file adding the first and last timestamps

2014.01.17 – Updated description of Query 2 – the slide size of time based windows equals one second and the calculation of the aggregate should be triggered by every event

2014.01.16 – Updated submission information clarifying that the GC track submissions need not be anonymized

2014.01.15 – Updated description of Query 2 with precise definition of a slide size and the median load for all plugs

2014.01.14 – Updated description and result schema of Query 2

Q&A

  1. Query 2 – How are both windows (for one hour and for a whole day) computed? More precisely, are both windows sliding or tumbling?
    Both windows are to be considered as sliding windows with the window span determined by the timestamp field (ts) in the input data.
  2. Query 2 – How often should sliding window produce a result?
    Each sliding window should update its result whenever the value of the calculated aggregation changes, i.e., usually upon the reception of every input event.
  3. Query 1 – how should the prediction of the load value be updated?
    Please note that the load prediction equation contains two terms. The current average load value avgLoad(s_i) and the historical value obtained from the model median( { avgLoad(s_j) } ). While the model-based value stays constant the current average changes with every new event entering the slice s_i. Therefore, the load prediction value should be updated with every new event entering the slice s_i.

Grand Challenge co-Chairs

Zbigniew Jerzak, SAP AG
Holger Ziekow, AGT International

Grand Challenge Technical Program Committee

Mauricio Arango – Oracle, USA
Thomas Heinze – SAP AG, Germany
Helge Parzyjegla – University of Rostock, Germany
Leonardo Querzoni – Università degli Studi di Roma “La Sapienza”, Italy
Etienne Rivière – University of Neuchatel, Switzerland
Marco Serafini, QCRI – Qatar Foundation, Qatar
Matthias Weidlich – Imperial College London, UK

DEBS 2016 Grand Challenge: Social networks

Call for Grand Challenge Solutions

The ACM DEBS 2016 Grand Challenge is the sixth in a series of
challenges which seek to provide a common ground and uniform evaluation criteria for a
competition aimed at both research and industrial event-based systems. The goal of the
2016 DEBS Grand Challenge competition is to evaluate event-based systems for real-time
analytics over high volume data streams in the context of graph models.

The underlying scenario addresses the analysis of metrics for a dynamic (evolving) social-network graph. Specifically, the 2016 Grand Challenge targets the following problems: (1) identification of the posts that currently trigger the most activity in the social network, and (2) identification of large communities that are currently involved in a topic. The corresponding queries require continuous analysis of a dynamic graph under the consideration of multiple streams that reflect updates to the graph.

The data for the DEBS 2016 Grand Challenge is based on the dataset provided together with the LDBC Social Network Benchmark. The DEBS 2016 Grand Challenge takes up the general scenario of the 2014 SIGMOD Programming Contest; however, in contrast to the SIGMOD contest, it explicitly focuses on processing streaming data and thus dynamic graphs. Details about the data, the queries for the Grand Challenge, and information about the evaluation are provided below.

Prize

Participants in the 2016 DEBS Grand Challenge will have the chance to win two prizes. The first prize is the “Grand Challenge Award” for the best performing, correct submission. The winner (or winning team) of the “Grand Challenge Award” will additionally receive a monetary reward of $1000. The monetary reward and the whole DEBS 2016 Grand Challenge are sponsored by WSO2.

The second prize is the “Grand Challenge Audience Award” – it is determined based on the audience feedback provided after the presentation of the Grand Challenge solutions during the DEBS 2016 conference. The “Grand Challenge Audience Award”, as opposed to the overall “Grand Challenge Award”, does not entail any additional monetary reward, as it is based purely on the audience's perception of the presentation of a given solution.

Support

We would like to explicitly thank WSO2 for sponsoring the DEBS 2016 Grand Challenge prize and the LDBC Council for their help in preparing the test data set. We would also like to thank the BICC lab at Hochschule Furtwangen University for providing the resources to run the evaluation system.


License

All solutions submitted to the DEBS 2016 Grand Challenge must be made open source under the BSD license: https://opensource.org/licenses/BSD-3-Clause. A solution incorporates concepts, queries, and code developed for the purpose of solving the Grand Challenge. If a solution is developed within the context of, is built on top of, or is using an existing system or solution which is licensed under a different license than BSD, then such an existing solution or system need not have its license changed.

Input Data Streams

The input data is organized in four separate streams, each provided as a text file. Namely, we provide the following input data files:

friendships.dat: <ts, user_id_1, user_id_2>

ts is the friendship’s establishment timestamp
user_id_1 is the id of one of the users
user_id_2 is the id of the other user

posts.dat: <ts, post_id, user_id, post, user>

ts is the post’s timestamp
post_id is the unique id of the post
user_id is the unique id of the user
post is a string containing the actual post content
user is a string containing the actual user name

comments.dat: <ts, comment_id, user_id, comment, user, comment_replied, post_commented>

ts is the comment’s timestamp
comment_id is the unique id of the comment
user_id is the unique id of the user
comment is a string containing the actual comment
user is a string containing the actual user name
comment_replied is the id of the comment being replied to (-1 if the tuple is a reply to a post)
post_commented is the id of the post being commented (-1 if the tuple is a reply to a comment)

likes.dat: <ts, user_id, comment_id>

ts is the like’s timestamp
user_id is the id of the user liking the comment
comment_id is the id of the comment

Please note that, as the contents of the files represent data streams, each file is sorted based on its respective timestamp field.

A small sample set of input data streams can be downloaded from this URL: https://www.dropbox.com/s/vo83ohrgcgfqq27/data.tar.bz2. These files are only meant for development and debugging. A bigger data set can be downloaded from here: https://www.dropbox.com/s/dcy21uc745pv98i/data.zip?dl=0.
Our tests will be based on other files (about the same size as the provided larger data set), but strictly following the same format.
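Since each file is individually sorted by its timestamp, a solution will typically merge the four streams into one logical stream ordered by logical time. A minimal Python sketch using heapq.merge is shown below; the comma separator is inferred from the stream notation above (adjust it to the actual file format), and timestamps are assumed to be uniform fixed-width strings so that lexicographic comparison matches chronological order:

import heapq

def events(path, kind):
    # Yield (ts, kind, fields) tuples from one sorted stream file.
    with open(path) as f:
        for line in f:
            fields = line.rstrip("\n").split(",")  # separator assumed
            yield (fields[0], kind, fields)

# heapq.merge preserves the global timestamp order across the four
# individually sorted streams without loading them into memory.
merged = heapq.merge(
    events("friendships.dat", "friendship"),
    events("posts.dat", "post"),
    events("comments.dat", "comment"),
    events("likes.dat", "like"))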

Query 1

The goal of query 1 is to compute the top-3 scoring active posts, producing an updated result every time they change.

The total score of an active post P is computed as the sum of its own score plus the score of all its related comments. Active posts having the same total score should be ranked based on their timestamps (in descending order), and if their timestamps are also the same, they should be ranked based on the timestamps of their last received related comments (in descending order). A comment C is related to a post P if it is a direct reply to P or if the chain of C’s preceding messages links back to P.

Each new post has an initial own score of 10 which decreases by 1 each time another 24 hours elapse since the post’s creation. Each new comment’s score is also initially set to 10 and decreases by 1 in the same way (every 24 hours since the comment’s creation). Both post and comment scores are non-negative numbers, that is, they cannot drop below zero. A post is considered no longer active (that is, no longer part of the present and future analysis) as soon as its total score reaches zero, even if it receives additional comments in the future.
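For illustration, the decay rule can be expressed as a small Python function; timestamps are assumed here to be in seconds (a sketch, not a prescribed implementation):

def own_score(creation_ts, now_ts):
    # Initial score 10, minus 1 for each full 24 hours elapsed since
    # creation; never negative. Timestamps in seconds.
    return max(0, 10 - int((now_ts - creation_ts) // 86400))

def total_score(post_ts, related_comment_tss, now_ts):
    # Total score of a post: its own score plus the scores of all its
    # related comments. The post becomes inactive when this reaches 0.
    return own_score(post_ts, now_ts) + sum(
        own_score(c, now_ts) for c in related_comment_tss)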

Input Streams: posts, comments

Output specification:

<ts,top1_post_id,top1_post_user,top1_post_score,top1_post_commenters,
top2_post_id,top2_post_user,top2_post_score,top2_post_commenters,
top3_post_id,top3_post_user,top3_post_score,top3_post_commenters>

ts: the timestamp of the tuple event that triggers a change in the top-3 scoring active posts appearing in the rest of the tuple
topX_post_id: the unique id of the top-X post
topX_post_user: the user author of top-X post
topX_post_commenters: the number of commenters (excluding the post author) for the top-X post

Results should be sorted by their timestamp field. The character “-” (a minus sign without the quotations) should be used for each of the fields (post id, post user, post commenters) of any of the top-3 positions that has not been defined. Needless to say, the logical time of the query advances based on the timestamps of the input tuples, not the system clock.

Sample output tuples for the query

2010-09-19 12:33:01.923+0000,25769805561,Karl Fischer,115,10,25769805933,Chong Liu,83,4,-,-,-,-
2010-10-09 21:55:24.943+0000,34359739095,Karl Fischer,58,7,34359740594,Paul Becker,40,2,34359740220,Chong Zhang,10,0
2010-12-27 22:11:54.953+0000,42949673675,Anson Chen,127,12,42949673684,Yahya Abdallahi,69,8,42949674571,Alim Guliyev,10,0

Query 2

This query addresses the change of interests within large communities. It represents a version of query type 2 from the 2014 SIGMOD Programming Contest. Unlike the SIGMOD problem, the version for the DEBS Grand Challenge focuses on the dynamic change of query results over time, i.e., it calls for a continuous evaluation of the results.

Goal of the query:
Given an integer k and a duration d (in seconds), find the k comments with the largest range, where the range of a comment is defined as the size of the largest connected component in the graph defined by persons who (a) have liked that comment (see likes, comments), (b) where the comment was created not more than d seconds ago, and (c) know each other (see friendships).

Parameters: k, d

Input Streams: likes, friendships, comments

Output:
The output includes a single timestamp ts and exactly k strings per line. The timestamp and the strings should be separated by commas. The k strings represent comments, ordered by range from largest to smallest, with ties broken by lexicographical ordering (ascending). The k strings and the corresponding timestamp must be printed only when some input triggers a change of the output, as defined above. If fewer than k strings can be determined, the character “-” (a minus sign without the quotations) should be printed in place of each missing string.

The field ts corresponds to the timestamp of the input data item that triggered an output update. For instance, a new friendship relation may change the size of a community with a shared interest and hence may change the k strings. The timestamp of the event denoting the added friendship relation is then the timestamp ts for that line’s output. Also, the output must be updated when the results change due to the progress of time, e.g., when a comment becomes older than d. Specifically, if the update is triggered by an event leaving a time window at t2 (i.e., t2 = timestamp of the event + window size), the timestamp for the update is t2. As in Query 1, timestamps refer to the logical time of the input data streams, rather than the system clock.

In summary, the output is specified as follows:

ts: the timestamp of the tuple event that triggers a change in the output.
comment_1, …, comment_k: the top k comments ordered by range, starting with the largest range (comment_1).

Sample output tuples for the query with k=3 could look as follows:

2010-10-28T05:01:31.022+0000,I love strawberries,-,-
2010-10-28T05:01:31.024+0000,I love strawberries,what a day!,-
2010-10-28T05:01:31.027+0000,I love st
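To sketch the range computation: given the set of users who liked a comment created within the last d seconds and the current friendship graph, the range is the size of the largest connected component of the subgraph induced by those users. A minimal Python version (graph representation assumed; window maintenance omitted) could be:

from collections import deque

def comment_range(likers, friends):
    # likers: set of user ids who liked the comment within the window.
    # friends: dict mapping user id -> set of friend ids.
    best, unseen = 0, set(likers)
    while unseen:
        queue, size = deque([unseen.pop()]), 0
        while queue:  # BFS over one connected component of likers
            u = queue.popleft()
            size += 1
            for v in friends.get(u, ()):
                if v in unseen:
                    unseen.remove(v)
                    queue.append(v)
        best = max(best, size)
    return best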

Evaluation

The evaluation of the Grand Challenge will be done using VirtualBox virtual machines. Each virtual machine will be set up with 4 cores and 8 GB of RAM. An automated evaluation platform will be used for this year's challenge. The evaluation platform will be made available on the 15th of January. Registered participants will be able to run their code on the evaluation platform and receive the performance results multiple times (detailed instructions about the submission process will be shared with participants starting from the 15th of January once they register for the challenge via EasyChair). The evaluation platform will be closed on the 3rd of April (challenge deadline).

For both queries it is assumed that the result is calculated in a streaming fashion - i.e.: (1) solutions must not make use of any pre-calculated information, such as indices and (2) result streams must be updated continuously.

The ranking of the submissions will be done using the total duration of the execution and the average delay per stream. Specifically, we will rank the solutions (a) by the total execution time as well as (b) by the average latency (combining the average results from both streams in a sum). The average latency should be computed as the average of each output tuple latency, the latter measured as the difference between the system clock taken when logging an output tuple and the system clock taken when the processing of the input tuple triggering the result starts. The final ranking will be determined by a score that is the sum of the rank for execution time and the rank for delay. For solutions with the same overall score we will favor the solution with the lower execution time.

FAQ

In order to access the Frequently Asked Questions please
follow the link to the Google Drive FAQ document.

Test cases

In order to access the test cases please follow the links for query Q1 and query Q2.

Grand Challenge Co-Chairs

Vincenzo Gulisano, Chalmers University of Technology, Sweden

Zbigniew Jerzak, SAP SE, Germany

Spyros Voulgaris, Vrije Universiteit Amsterdam, The Netherlands

DEBS 2015 Grand Challenge: Taxi trips

Call for Grand Challenge Solutions

The ACM DEBS 2015 Grand Challenge is the fifth in a series of challenges which seek to provide a common ground and evaluation criteria for a competition aimed at both research and industrial event-based systems. The goal of the 2015 DEBS Grand Challenge competition is to evaluate event-based systems in the context of real-time analytics over high volume geospatial data streams. The underlying scenario addresses the analysis of taxi trips based on a stream of trip reports from New York City. Specifically, the 2015 Grand Challenge targets following problems: the identification of recent frequent routes and identification of high profit regions. The corresponding queries demand the analysis of taxi location data along with other trip meta-data.

The data for the 2015 Grand Challenge is based on a data set released under the FOIL (The Freedom of Information Law) and made public by Chris Whong (http://chriswhong.com/open-data/foil_nyc_taxi/). Details about the data, the queries for the DEBS 2015 Grand Challenge, and information about how the challenge will be evaluated is provided below.

Data

The provided data consists of reports of taxi trips including the starting point, drop-off point, corresponding timestamps, and information related to the payment. Data is reported at the end of the trip, i.e., upon arrival, in the order of the drop-off timestamps.

The specific attributes are listed below:

medallion – an md5sum of the identifier of the taxi (vehicle bound)
hack_license – an md5sum of the identifier for the taxi license
pickup_datetime – time when the passenger(s) were picked up
dropoff_datetime – time when the passenger(s) were dropped off
trip_time_in_secs – duration of the trip in seconds
trip_distance – trip distance in miles
pickup_longitude – longitude coordinate of the pickup location
pickup_latitude – latitude coordinate of the pickup location
dropoff_longitude – longitude coordinate of the drop-off location
dropoff_latitude – latitude coordinate of the drop-off location
payment_type – the payment method: credit card or cash
fare_amount – fare amount in dollars
surcharge – surcharge in dollars
mta_tax – tax in dollars
tip_amount – tip in dollars
tolls_amount – bridge and tunnel tolls in dollars
total_amount – total paid amount in dollars

Following are the first ten lines from the data file:

07290D3599E7A0D62097A346EFCC1FB5,E7750A37CAB07D0DFF0AF7E3573AC141,2013-01-01 00:00:00,2013-01-01 00:02:00,120,0.44,-73.956528,40.716976,-73.962440,40.715008,CSH,3.50,0.50,0.50,0.00,0.00,4.50

22D70BF00EEB0ADC83BA8177BB861991,3FF2709163DE7036FCAA4E5A3324E4BF,2013-01-01 00:02:00,2013-01-01 00:02:00,0,0.00,0.000000,0.000000,0.000000,0.000000,CSH,27.00,0.00,0.50,0.00,0.00,27.50

0EC22AAF491A8BD91F279350C2B010FD,778C92B26AE78A9EBDF96B49C67E4007,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.71,-73.973145,40.752827,-73.965897,40.760445,CSH,4.00,0.50,0.50,0.00,0.00,5.00

1390FB380189DF6BBFDA4DC847CAD14F,BE317B986700F63C43438482792C8654,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.48,-74.004173,40.720947,-74.003838,40.726189,CSH,4.00,0.50,0.50,0.00,0.00,5.00

3B4129883A1D05BE89F2C929DE136281,7077F9FD5AD649AEACA4746B2537E3FA,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.61,-73.987373,40.724861,-73.983772,40.730995,CRD,4.00,0.50,0.50,0.00,0.00,5.00

5FAA7F69213D26A42FA435CA9511A4FF,00B7691D86D96AEBD21DD9E138F90840,2013-01-01 00:02:00,2013-01-01 00:03:00,60,0.00,0.000000,0.000000,0.000000,0.000000,CRD,2.50,0.50,0.50,0.25,0.00,3.75

DFBFA82ECA8F7059B89C3E8B93DAA377,CF8604E72D83840FBA1978C2D2FC9CDB,2013-01-01 00:02:00,2013-01-01 00:03:00,60,0.39,-73.981544,40.781475,-73.979439,40.784386,CRD,3.00,0.50,0.50,0.70,0.00,4.70

1E5F4C1CAE7AB3D06ABBDDD4D9DE7FA6,E0B2F618053518F24790C7FD0264E302,2013-01-01 00:03:00,2013-01-01 00:04:00,60,0.00,-73.993973,40.751266,0.000000,0.000000,CSH,2.50,0.50,0.50,0.00,0.00,3.50

468244D1361B8A3EB8D206CC394BC9E9,BB899DFEA9CC964B50C540A1D685CCFB,2013-01-01 00:00:00,2013-01-01 00:04:00,240,1.71,-73.955383,40.779728,-73.967758,40.760326,CSH,6.50,0.50,0.50,0.00,0.00,7.50

5F78CC6D4ECD0541B765FECE17075B6F,B7567F5BFD558C665D23B18451FE1FD1,2013-01-01 00:00:00,2013-01-01 00:04:00,240,1.21,-73.973000,40.793140,-73.981453,40.778465,CRD,6.00,0.50,0.50,1.30,0.00,8.30

The data file is sorted chronologically according to dropoff_datetime. Events with the same dropoff_datetime are in random order. Please note that the quality of the data is not perfect: some events may be missing information such as drop-off and pickup coordinates or fare information. Moreover, some information, such as the fare price, might have been entered incorrectly by the taxi drivers, thus introducing additional skew. Handling of data quality issues is outside the scope of this year’s Grand Challenge.

Please follow this link to access the data file comprising the first twenty days (roughly 2 million events) of data. The file is approximately 130 MB in size.

Please follow this link to access the full data file containing data for the whole year 2013. The file is approximately 12 GB in size and contains approximately 173 million events.

Query 1: Frequent Routes

The goal of the query is to find the top 10 most frequent routes during the last 30 minutes. A route is represented by a starting grid cell and an ending grid cell. All routes completed within the last 30 minutes are considered for the query. The output query results must be updated whenever any of the 10 most frequent routes changes. The output format for the result stream is:

pickup_datetime, dropoff_datetime, start_cell_id_1, end_cell_id_1, ... , start_cell_id_10, end_cell_id_10, delay

where pickup_datetime and dropoff_datetime are the timestamps of the trip report that resulted in an update of the result stream, start_cell_id_X is the starting cell of the Xth most frequent route, and end_cell_id_X is the ending cell of the Xth most frequent route. If fewer than 10 routes can be identified within the last 30 min, then NULL is to be output for all routes that lack data.

The attribute “delay” captures the time delay between reading the input event that triggered the output and the time when the output is produced. Participants must determine the delay using the current system time right after reading the input and right before writing the output. This attribute will be used in the evaluation of the submission.

The cells for this query are squares of 500 m x 500 m. The cell grid starts with cell 1.1, located at 41.474937, -74.913585 (in Barryville). The coordinate 41.474937, -74.913585 marks the center of the first cell. Cell numbers increase towards the east and south, with the shift to the east being the first and the shift to the south the second component of the cell, i.e., cell 3.7 is 2 cells east and 6 cells south of cell 1.1. The overall grid extends 150 km south and 150 km east from cell 1.1, with the cell 300.300 being the last cell in the grid. All trips starting or ending outside this area are treated as outliers and must not be considered in the result computation.

Query 2: Profitable Areas

The goal of this query is to identify areas that are currently most profitable for taxi drivers. The profitability of an area is determined by dividing the area profit by the number of empty taxis in that area within the last 15 minutes. The profit that originates from an area is computed by calculating the median fare + tip for trips that started in the area and ended within the last 15 minutes. The number of empty taxis in an area is the number of taxis that had a drop-off location in that area less than 30 minutes ago and have had no following pickup yet.
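As a minimal sketch (window maintenance omitted; see also FAQ Question 3 below on cells with no empty taxis), the profitability of a single cell could be computed as follows:

from statistics import median

def profitability(fares_plus_tips, empty_taxis):
    # fares_plus_tips: fare + tip for trips that started in the cell
    # and ended within the last 15 minutes.
    # empty_taxis: number of currently empty taxis in the cell.
    if empty_taxis == 0:
        return None  # undefined, excluded from the output (FAQ Q3)
    return median(fares_plus_tips) / empty_taxis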

The result stream of the query must provide the 10 most profitable areas in the subsequent format:

pickup_datetime, dropoff_datetime, profitable_cell_id_1, empty_taxies_in_cell_id_1, median_profit_in_cell_id_1, profitability_of_cell_1, ... , profitable_cell_id_10, empty_taxies_in_cell_id_10, median_profit_in_cell_id_10, profitability_of_cell_10, delay

with attribute names containing cell_id_1 corresponding to the most profitable cell and attributes containing cell_id_10 corresponding to the 10th most profitable cell. If fewer than 10 cells can be identified within the last 30 min, then NULL is to be returned for all cells that lack data. Query results must be updated whenever the 10 most profitable areas change. The pickup_datetime, dropoff_datetime in the output are the timestamps of the trip report that triggered the change.

The attribute “delay” captures the time delay between reading the input event that triggered the output and the time when the output is produced. Participants must determine the delay using the current system time right after reading the input and right before writing the output. This attribute will be used in the evaluation of the submission.

Note: We use the same numbering scheme as for Query 1 but with a different resolution. In Query 2 we assume a cell size of 250 m X 250 m, i.e., the area to be considered spans from cell 1.1 to cell 600.600.

Evaluation

The evaluation will be done using virtual machines (VirtualBox). Participants of the challenge must provide a virtual machine that contains and runs their solution. Within the virtual machine they must provide a script that executes their solution by replaying the Grand Challenge data from a file. The script must also measure the duration of the execution and output the execution time on the screen.

For both queries it is assumed that the result is calculated in a streaming fashion – i.e.: (1) solutions must not make use of any pre-calculated information, such as indices and (2) result streams must be updated continuously.

For the evaluation we will run a series of test runs for each solution and take the median execution time as result. For the test runs we will provide a virtual machine with access to 4 Cores and 8 GB of RAM.

The ranking of the submissions will be done using the total duration of the execution and the average delay per stream. Specifically, we will rank the solutions (a) by the total execution time as well as (b) by the average latency (combining the average results from both streams in a sum). The final ranking will be determined by a score that is the sum of the rank for execution time and the rank for delay. For solutions with the same overall score we will favor the solution with the lower execution time.

Example:

Execution time ranks:
Time Rank 1, Solution A, 10min
Time Rank 2, Solution B, 11min
Time Rank 3, Solution C, 30min
Time Rank 4, Solution D, 40min
Time Rank 5, Solution E, 45min

Delay ranks:
Delay Rank 1, Solution B, 0.5sec
Delay Rank 2, Solution A, 0.6sec
Delay Rank 3, Solution D, 1sec
Delay Rank 4, Solution E, 2sec
Delay Rank 5, Solution C, 10sec

Overall Rank:
Rank 1, Time Rank 1 + Delay Rank 2, Solution A
Rank 2, Time Rank 2 + Delay Rank 1, Solution B
Rank 3, Time Rank 4 + Delay Rank 3, Solution D
Rank 4, Time Rank 3 + Delay Rank 5, Solution C
Rank 5, Time Rank 5 + Delay Rank 4, Solution E
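For illustration, the ranking scheme can be reproduced in a few lines of Python, using the measurements from the example above:

def overall_ranking(times, delays):
    # times, delays: dict solution -> measured value (lower is better).
    # Overall score = time rank + delay rank; ties favor lower time.
    time_rank = {s: r for r, s in enumerate(sorted(times, key=times.get), 1)}
    delay_rank = {s: r for r, s in enumerate(sorted(delays, key=delays.get), 1)}
    return sorted(times, key=lambda s: (time_rank[s] + delay_rank[s], times[s]))

times = {"A": 10, "B": 11, "C": 30, "D": 40, "E": 45}   # minutes
delays = {"B": 0.5, "A": 0.6, "D": 1, "E": 2, "C": 10}  # seconds
print(overall_ranking(times, delays))  # ['A', 'B', 'D', 'C', 'E']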

FAQ

Question 1: Is the earth flat, or how should coordinates be mapped to cells?

Answer: For the challenge we allow a simplified flat-earth assumption for mapping coordinates to cells in the queries. You can assume that a distance of 500 meters south corresponds to a change of 0.004491556 degrees in the coordinate system. For moving 500 meters east you can assume a change of 0.005986 degrees in the coordinate system.
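Using these constants, the coordinate-to-cell mapping could be sketched in Python as follows (an illustrative sketch, not a reference implementation; the half-cell offset centers cell 1.1 on the given origin, and the same routine covers the 250 m grid of Query 2):

import math

# Cell 1.1 of both grids is centered at (41.474937, -74.913585).
LAT0, LON0 = 41.474937, -74.913585
DLAT = 0.004491556  # degrees per 500 m going south
DLON = 0.005986     # degrees per 500 m going east

def to_cell(lat, lon, cell_size_m=500):
    # Map a coordinate to (east, south) cell indices; cell 1.1 is the
    # north-western cell. Returns None for outliers outside the grid.
    step_lat = DLAT * cell_size_m / 500.0
    step_lon = DLON * cell_size_m / 500.0
    east = math.floor((lon - (LON0 - step_lon / 2)) / step_lon) + 1
    south = math.floor(((LAT0 + step_lat / 2) - lat) / step_lat) + 1
    limit = round(300 * 500 / cell_size_m)  # 300 for Query 1, 600 for Query 2
    if not (1 <= east <= limit and 1 <= south <= limit):
        return None
    return east, south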

Question 2: What is a change of the top 10 list? Only when there is a new entry or also when the order changes?

Answer: Also when the order changes

Question 3: What happens to cells that do not have empty taxis and cause a division by zero for query 2?

Answer: Results for this situation are undefined and should not be considered for calculating the output.

Question 4: Can you give us a concrete example of what a correctly formatted output event should look like?

Yes:
For Query 1:

2013-01-01 00:01:00,2013-01-01 00:02:00,1.7,2.9,10.3,20.27,40.3,30.27,40.3,30.28,1.1,1.2,120.1,150.2,200.1,150.2,120.1,170.2,10.1,15.2,11.1,15.7,1001

For Query 2:

2013-01-01 00:01:00,2013-01-01 00:02:00,1.1,2,50.10, ,10.1,2,40.10,,101.11,4,50.10,111.1,4,40.00,51.1,7,30.00,,102.1,10,31.10,12.1,10,31.09,1.31,10,30.00,1.122,10,30.00,211.11,11,30.00,1002

Note: the concrete numbers are toy examples to illustrate the formatting

Question 5: Should the queries run simultaneously during the evaluation or one after the other?

Answer: The queries must run simultaneously. However, the output must go to two separate streams (i.e., end up in two different files for evaluation).

Question 6: In query 2 we must factor in the number of empty taxis in an area. This is defined as taxis that recently dropped someone off in that area and have had no following pickup yet. However, since a trip is reported at its end, how should we know about taxis that had a pickup but are still driving with that customer?

Answer: It is correct to not account for trips that have not been reported yet. The system cannot know about these events and hence the queries cannot factor them in.

Question 7: How should we order elements in a list that have the same value for the ordering criterion?

Answer: You should always put the freshest information first. E.g., if routes A and B have the same frequency, put the route with the freshest input information first (i.e., the one which includes the freshest event).

Question 8: There are some negative values for fare_amount and tip_amount. Are these errors? How should we handle these values?

Answer: The negative values are errors. They should not be considered in any calculation. However, you can still use the corresponding events for calculations that do not need the fare_amount and tip_amount.

Question 9: Should we literally print the ASCII word “NULL” or should it be the NULL string (i.e., zero characters)?

Answer: Please write the ASCII word “NULL”.

Question 10: How to update the top 10 list when it is changed due to events that leave the window?

Answer: For updates that are caused by events leaving the window, you should use the timestamp that denotes when the result changes. E.g. in a 15 min window, that would be the timestamp of the relevant leaving event + 15min.

Question 11: At which exact point should the “delay” attribute start “counting”?

(a) Since the moment the line is read from disk?

(b) Since the moment the line *starts* being parsed?

(c) Since the moment the line *has been* parsed?

Answer: (b) is the correct option. We do not want to capture the disk I/O in the delay and want to be agnostic to all buffering that might take place before the actual event processing starts. However, we consider parsing as part of the processing, because it may be done in different ways. Hence, parsing should be captured by the delay measurements. Your system should have a clearly defined “entry point” where it receives the events as they are encoded in the file. The measurement is to be taken once the raw string-encoded event passes that point. Please specify your approach to delay measurements in your corresponding paper submission.

Question 12:

At which exact point should the “delay” attribute stop “counting”?

(a) At the moment the top-10 list has been *computed* (but before being converted to an ASCII string)?

(b) At the moment the top-10 list has been converted to ASCII (but before the actual call to fprintf/write)?

(c) After the fprintf/write has returned?

Answer: Again, it is not our intention to capture disk I/O. However, creating the output in the right format is considered part of the processing. Thus, (b) is the right option.

Question 13: How do you ensure that the delay measurements are implemented in an appropriate way?

Answer: Please describe how you implemented the delay measurements in the corresponding paper.

Question 14: How are you going to test correctness, given that the precise output is very sensitive to minor things, like floating point precision errors in coordinate-to-cell conversion, etc.?

Answer: Our goal is to have solutions that are similar enough to compare their performance. We have not seen the result files yet, thus it is too early to give a final answer. However, it is not our intent to disqualify solutions due to minor deviations that e.g. might be the result of rounding errors. What we want to avoid with the correctness requirement is that solutions use approximations or other ways of simplifying the actual computational problem.

So to be very clear about this: Do not use approximations as a way to speed up your solution.

 

FAQ For Test Cases

About Test Case 1:

The second route appears at 3:22 and remains the second most frequent for 14 minutes. This is until 214 minutes after the start (and not 224 as in the description).

The first route should disappear 30 minutes after the last event for this route. The last event is at 3:20, and therefore the corresponding 30-minute window is empty at 03:50 (not 03:53:00 as in the description).

About Test Case 3:

There have been questions about the cell IDs for Q2. The grids for Q1 and Q2 differ in their resolution. However, the center of the first cell is the same for both grids and the synthetic data uses the cell centers of the Q1 grid as coordinates. Therefore, the events for cell 1.1 are in cell 1.1 for both grids.

Further Questions:

Question 1: Is an event perceived as having caused an update when it leaves a time window and hence changes the result? Does this mean there are two measurements of end-to-end delay for this event (i.e., when it enters the system and when it leaves the window)?

Answer 1: The trip report event triggers a delay measurement only once (i.e., when it enters the system and causes an update). The updates due to timeouts are actually caused by a time event that has no end-to-end delay.

Question 2: The grids for Q1 and Q2 are of different sizes. That is, the coarser-grained grid covers a bigger area than the finer-grained one. What does this mean for treating outliers? Should we only consider events that are in both grids?

Answer 2: You should always use the grid for the given query as reference. That is, everything outside the grid for Q1 is an outlier regarding Q1, and everything outside the grid for Q2 is an outlier for Q2.

Question 3: Our solution manages time “globally” and merges simultaneous outputs into one consolidated output that reflects the query results at the given point in time. Is this ok?

Answer 3: Yes, this is ok. However, consecutive updates according to the order of arrival without consolidation are fine as well.

Question 4: There are some differences (e.g. formatting differences) between the original data sets and the data for the test cases. Specifically, the last line is missing the \n in test case 2, some profit values are pure integers without a decimal part (as opposed to precisely 2 decimal digits), and the reported durations for trips are not the difference between the pickup time and the drop-off time. We tuned our solution to what we found in the original data. What should we do?

Answer 4: The test cases use synthetic data that was created to help debugging certain aspects of the solutions. The original dataset is normative for the challenge, and you do not need to adapt if your solution has issues that are caused by deviations in the synthetic data.

Grand Challenge Co-Chairs

Zbigniew Jerzak (Zbigniew.Jerzak_at_sap.com), SAP SE, Germany

Holger Ziekow (Holger.Ziekow_at_hs-furtwangen.de), Furtwangen University, Germany

DEBS Grand Challenge participants are encouraged to contact Grand Challenge organizers with questions and suggestions regarding the problem. Main contact points for solution authors are: Zbigniew Jerzak (Zbigniew.Jerzak_at_sap.com) and Holger Ziekow (hziekow_at_agtinternational.com). Unless explicitly requested by the question author(s) all answers will be made available online on the DEBS Grand Challenge website.

DEBS 2013 Grand Challenge: Soccer monitoring

The ACM DEBS 2013 Grand Challenge is the third in a series of challenges which seek to provide a common ground and evaluation criteria for a competition aimed at both research and industrial event-based systems. The goal of the Grand Challenge competition is to implement a solution to a problem provided by the Grand Challenge organizers.
The DEBS Grand Challenge series provides problems which are relevant to the industry at large and which allow event-based systems to be evaluated using real-life data and queries.

All DEBS 2013 Grand Challenge submissions must be composed of two parts: (1) a 6 page paper, in ACM SIG proceedings format, outlining the solution, highlighting its innovative aspects and presenting the evaluation and (2) a demonstration of the system – either in
the form of a video or a screencast. All submissions are subject to a peer review process. Authors of all accepted submissions will be invited to present their systems during the DEBS 2013 Conference. Upon the explicit consent from the challenge participants,
authors’ solutions will be included in the global ranking with the winner of the Grand Challenge being announced and awarded during the conference banquet. The global ranking is based on the peer review process scores and the performance of the solution. The performance is based on the throughput and latency of the system under the assumption of result correctness.

We would like to thank Christopher Mutschler and the Fraunhofer IIS for provisioning the sensor data and for their help in preparing this year’s challenge.

Problem Description

With this year’s challenge, we seek to demonstrate the applicability of event-based systems to provide real-time complex analytics over high velocity sensor data, along the example of analyzing a soccer game. The data for the DEBS 2013 Grand Challenge originates from a number of wireless sensors embedded in the shoes and in a ball used during a soccer match, and spans the whole duration of the game. The real-time analytics include the continuous computation of statistics of relevance to spectators (ball possession, shots on goal) as well as to trainers and team managers (running analysis of team members).

Data

The data used in this year’s DEBS Grand Challenge is collected by the Real-Time Locating System deployed on a football field of the Nuremberg Stadium in Germany. Data originates from sensors located near the players’ shoes (1 sensor per leg) and in the ball (1 sensor). The goalkeeper is equipped with two additional sensors, one at each hand. The sensors in the players’ shoes and hands produce data at 200 Hz, while the sensor in the ball produces data at 2,000 Hz. The total data rate reaches roughly 15,000 position events per second. Every position event describes the position of a given sensor in a three-dimensional coordinate system. The center of the playing field is at coordinate (0, 0, 0) – see Figure 1 for the dimensions of the playing field and the coordinates of the kick-off. The event schema is as follows:

sid, ts, x, y, z, |v|, |a|, vx, vy, vz, ax, ay, az
where sid is the id of the sensor which produced the position event; ts is a timestamp in picoseconds, e.g. 10753295594424116 (the value 10753295594424116 designates the start and 14879639146403495 the end of the game); x, y, z describe the position of the sensor in mm (the origin is the middle of a full-size football field); |v| is the absolute velocity (in μm/s) and vx, vy, vz describe its direction by a vector with a norm of 10,000. Hence, the speed of the object in x-direction in SI units (m/s) is calculated by

v’x = |v| * vx * 1e-4 * 1e-6

i.e., vx in m/s is derived by |v| * 1e-10 * vx. Finally, |a| (in μm/s²), ax, ay, az describe the absolute acceleration and its constituents in three dimensions (the acceleration in m/s² is calculated analogously to the velocity). The acceleration does not include gravity, i.e., |a| is zero when the ball is at a fixed position, not 9.81 m/s².
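For illustration, these unit conversions can be expressed as small Python helpers (a sketch; extraction of the fields from an event is omitted):

def velocity_si(v_abs, vx, vy, vz):
    # Convert |v| (in um/s) and the direction vector (norm 10,000)
    # into velocity components in m/s: v'_i = |v| * v_i * 1e-4 * 1e-6.
    scale = v_abs * 1e-10
    return (vx * scale, vy * scale, vz * scale)

def speed_kmh(v_abs):
    # Absolute speed in km/h from |v| in um/s (useful for Query 1).
    return v_abs * 1e-6 * 3.6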


Figure 1: Playing field and its dimensions

In addition to the sensor data we also provide a separate data stream for referee events, which includes the time when a game was paused and the time when a game was resumed. Moreover, referee events contain the time and player_ids for substitutions.

The mapping between player ids and team ids, as well as between sensor ids and player ids, is provided in the metadata file. The game, during which the data was collected, was played on a half-size field with teams of seven players each. The game duration was two halves of thirty minutes each. We assume that the data arrives at the system under test without any delays or omissions.

Raw sensor data for the game can be downloaded from: here (2.6 GB). All data has been aggregated in a single file and is sorted by timestamps. The video recording of the game (vertical view, static camera) can be downloaded from: here (1st half, 1.7 GB) and here (2nd half, 1.7 GB). The metadata file that contains players’ names and associated transmitter ids, detailed field coordinates, etc. can be downloaded from: here (10 kB). Game interruptions, ball possession statistics, and shots on goal statistics can be downloaded from: here (10 kB). These statistics have been created manually and can serve as an aid in validating the respective query results.

Queries

In the following section we identify a number of queries which need to run concurrently and process the position data. Results of all queries must be returned as a stream of data, unless explicitly stated otherwise.

Query 1 – Running Analysis

The goal of this query is to calculate the running performance of each of the players currently participating in the game. The intensities are defined as: standing (0-1 km/h), trot (up to 11 km/h), low speed run (up to 14 km/h), medium speed run (up to 17 km/h), high speed run (up to 24 km/h), and sprint (faster than 24 km/h). Figure 2 shows the possible transitions between the different states which need to be observed for the running analysis.


Figure 2 – Transitions between running intensities for Query 1
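As a toy illustration (not part of the specification), the classification of an averaged speed into the intensity classes defined above could be sketched in Python as follows:

# Upper bounds (km/h) of the intensity classes, per the definitions above.
INTENSITIES = [(1, "standing"), (11, "trot"), (14, "low speed run"),
               (17, "medium speed run"), (24, "high speed run")]

def classify(speed_kmh):
    # Map an averaged speed in km/h to a running intensity.
    for upper, name in INTENSITIES:
        if speed_kmh <= upper:
            return name
    return "sprint"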

In order to accommodate the noise in the raw velocity measurements, the actual speed of the run should be computed from all the individual speed norms of a player's transmitters. Here you can see an example plot of the velocity of the ball:


Figure 3 – Velocity of the ball over time

The running analysis query should return two classes of results: (1) current running statistics and (2) a set of aggregate running statistics. The current running statistics should be returned at a frequency of at most 50Hz and must contain the following information:


ts_start, ts_stop, player_id, intensity, distance, speed
where ts_start represents the start of the measurement, ts_stop represents the end of the measurement, player_id is the identifier of a player, intensity describes the intensity of the run, distance is the length of the run (in the horizontal plane only) between ts_start and ts_stop, and speed is the average speed of the given intensity run.

The aggregate running statistics must contain the following information:


ts, player_id, standing_time, standing_distance, trot_time, trot_distance, low_time, low_distance, medium_time, medium_distance, high_time, high_distance, sprint_time, sprint_distance
where ts represents the latest time stamp which updated the statistics, player_id is the player identifier, xxx_time is the time the player spent in the xxx intensity (in milliseconds), and xxx_distance is the distance covered with the xxx intensity. The aggregate running statistics must be calculated using four different time windows: 1 minute, 5 minutes, 20 minutes, and the whole game duration. Each window must emit events at a frequency of at most 50Hz. The result will be four aggregate running statistics streams returned by the system, one for each of the required window lengths. Moreover, every running intensity which has been active for less than a second must be counted on top of the next intensity with a duration longer than 1s. For example, if a player is in a trot state for a longer time, then in a low speed run state for 0.8 seconds, and then in a medium speed run state for a longer time, the time of the low speed run is to be counted on top of the medium speed run.

Please note that the requirement to count only intensities active for at least one second requires you to delay the output until a reliable measurement has been made.
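
One possible way to implement this merging rule is sketched below; all names are illustrative and timestamps are assumed to be in milliseconds:

// Segments shorter than one second are buffered and their duration is
// credited to the next segment that lasts at least one second.
final class IntensityMerger {
    private Intensity current;   // intensity of the currently open segment
    private long segmentStart;   // ts (ms) when the open segment began
    private long pendingMs;      // time from too-short segments, not yet credited

    // Called whenever the classified intensity changes at time ts (ms).
    void onChange(Intensity next, long ts) {
        if (current != null) {
            long duration = ts - segmentStart;
            if (duration < 1000) {
                pendingMs += duration;                  // too short: defer
            } else {
                credit(current, duration + pendingMs);  // credit deferred time here
                pendingMs = 0;
            }
        }
        current = next;
        segmentStart = ts;
    }

    private void credit(Intensity i, long ms) { /* update xxx_time counters */ }
}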

Query 2 – Ball Possession

The goal of this query is to calculate the ball possession for each of the players as well as for each team. A player (and thereby his team) obtains the ball whenever the ball is in his proximity (less than one meter away – calculated as the distance between the ball sensor and the closest sensor of the player) and he hits it (the ball acceleration peaks). The ball stays in his possession until another player hits it, the ball leaves the field, or the game is stopped. The ball possession is calculated as the time between the first ball contact (hit) and the last ball contact (hit). The ball may leave the player's proximity and still stay in his possession.


Figure 4 – Ball possession states

A ball is hit if its (transmitter's) distance from a player's foot (transmitter) is less than 1 meter and its acceleration reaches a value of at least 55 m/s². This value depends heavily on the fitness of the players – values of up to 100 m/s² are more suitable for professional games. It may be appropriate to apply a mean filter to the acceleration values in order to get better detection performance.
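
A hedged sketch of this hit condition in Java (names, units, and the mean-filter input are our assumptions):

// Returns true if the ball is considered hit by the given player transmitter.
static boolean isHit(double distanceMm, double meanAccel) {
    return distanceMm < 1000.0     // closest player transmitter within 1 m
        && meanAccel >= 55.0;      // acceleration of at least 55 m/s^2
}

// Simple mean filter over the last n acceleration samples (illustrative).
static double mean(double[] lastN) {
    double sum = 0;
    for (double a : lastN) sum += a;
    return sum / lastN.length;
}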

The ball possession query should return two classes of results: (1) a per-player ball possession stream and (2) per-team ball possession. The per-player ball possession stream should contain the following information:


ts, player_id, time, hits
where ts is the latest time stamp of the event which led to the update of the ball possession, player_id is the player identifier, time is the total time of ball possession for a given player, and hits is the total number of ball contacts of a given player.
The per team ball possession result stream must contain the following statistics:


ts, team_id, time, time_percent
where ts is the latest time stamp of the event which led to the update of the team's ball possession, team_id is the team identifier, time is the total time of ball possession for a given team, and time_percent is the percentage of ball possession for a given team w.r.t. the total ball possession time of both teams. The per-team ball possession should be calculated using four different time windows: 1 minute, 5 minutes, 20 minutes, and the whole game duration. This results in four aggregate ball possession statistics result streams being returned by the system, one for each of the required window lengths.

Each statistics stream should be returned with a maximum frequency of 50Hz.

Query 3 – Heat Map

The goal of this query is to calculate statistics for how long each of the players spent in which region of the field. For this purpose we define a grid of equally sized cells with X rows along the x-axis and Y columns along the y-axis. The parameters X and Y should be implemented with the following values: 8 and 13 (a grid of 104 cells), 16 and 25, 32 and 50, and 64 and 100 (a grid of 6,400 cells), respectively. The system should return results for all parameter settings in parallel, but in different result streams.

The system must provide for each cell and each player the percentage of time that the player spent in the respective cell over four different time windows: 1 minute, 5 minutes, 10 minutes and the whole game duration. This results in 16 result streams being returned by the system for each of the required window lengths and parameters for the grid resolution. Each result stream must be updated once per second and contain the following information:


ts, player_id, cell_x1, cell_y1, cell_x2, cell_y2, percent_time_in_time_cell
where ts represents the time stamp of the latest statistics update, cell_x1, cell_y1, cell_x2, cell_y2 are the coordinates of the lower left and upper right corners of the cell, respectively, player_id is the player identifier, and percent_time_in_time_cell is the percentage of time that the player spent in the cell during the period specific to the result stream (0.00%–100.00%).
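
As an illustration, a position can be mapped to a grid cell roughly as follows; this sketch assumes the rectangular field approximation given in Answer 1 of the FAQ below, and all names are ours:

// Map a position (mm) to cell indices for a grid with `rows` cells along x
// and `cols` cells along y.
static int[] cellOf(double x, double y, int rows, int cols) {
    final double minX = 0, maxX = 52483, minY = -33960, maxY = 33965;
    int cx = (int) ((x - minX) / (maxX - minX) * rows);
    int cy = (int) ((y - minY) / (maxY - minY) * cols);
    // Clamp samples that fall exactly on, or marginally outside, the border.
    cx = Math.min(Math.max(cx, 0), rows - 1);
    cy = Math.min(Math.max(cy, 0), cols - 1);
    return new int[] { cx, cy };
}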

Query 4 – Shot on Goal

The aim of this query is to detect when a player shoots the ball in an attempt to score a goal. A shot on the goal is defined as any shot that would hit (or closely miss) the goal of the opposing team. Note that this includes unsuccessful attempts that are, e.g., blocked by a player or saved by the goal keeper.

Below we provide suggestions for the implementation of the shot detection. However, we also allow alternative implementations that yield good results (i.e. closely resemble the result lists provided in referee-events.tar.gz).

Figure 5 gives an overview of suggested states and transitions of the shot detection. A shot is detected if the player with id <player_id> hits the ball with a minimal acceleration of 55 m/s², and the projected movement of the ball would cross the opponents' goal area within 1.5 seconds after the hit. The goal areas are defined as rectangles with the following coordinates:

Goal area team 1: x > 22578.5 and x < 29898.5, y = 33941.0, z < 2440.0
Goal area team 2: x > 22560.0 and x < 29880.0, y = -33968.0, z < 2440.0

Please note that the hit distorts the speed values of the ball. The data are preprocessed by a Kalman filter and stabilize over time. The computation of the projection may take this into account. To allow for corrective measures we only require that the shot is detected, at the latest, when the ball has moved 1m away from the hit location.

We leave it open to which degree the projection considers the physics of a flying ball. A baseline solution that simply extrapolates the motion vector is acceptable. However, more accurate computations of the ball movement (e.g. considering gravity) would be appreciated.
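
A baseline linear extrapolation could look like the following Java sketch (parameter names and the goal-plane test are our simplifications; positions are in mm, velocities in m/s):

// Does a straight-line projection of the ball cross the goal plane at y =
// goalY within 1.5 s, inside the goal rectangle (xMin..xMax, z < zMax)?
static boolean crossesGoal(double x, double y, double z,
                           double vx, double vy, double vz,
                           double goalY, double xMin, double xMax, double zMax) {
    double vyMm = vy * 1000.0;                 // m/s -> mm/s
    if (vyMm == 0.0) return false;             // not moving toward either goal
    double t = (goalY - y) / vyMm;             // time to reach the goal plane
    if (t < 0.0 || t > 1.5) return false;      // not within 1.5 s after the hit
    double xAtGoal = x + vx * 1000.0 * t;
    double zAtGoal = z + vz * 1000.0 * t;
    return xAtGoal > xMin && xAtGoal < xMax && zAtGoal < zMax;
}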

For the duration of the shot (i.e. as long as the state “shot on goal” in Figure 5 is active) the result stream should be updated with motion values of the ball and the ID of the shooting player:


ts, player_id, x, y, z, |v|, vx, vy, vz, |a|, ax, ay, az
The result stream should be updated with the frequency of the sensor data until an exit condition occurs. Exit conditions are (a) the ball leaves the field, or (b) the direction changes such that the goal area would no longer be hit.


Figure 5 – Shot on goal states

Frequently Asked Questions

Question 1: The playing field is not a perfect rectangle. Is it allowed to approximate the field with a perfect rectangle in the queries?

Answer 1: Yes. We suggest using the following coordinates if you opt for an approximation: (0, 33965), (0, -33960), (52483, 33965), (52483, -33960).

Question 2: Each player has a sensor at each foot. How should the speed and location of a player be calculated based on these two sensors?

Answer 2: The speed and position should be computed as the average of the two sensors. (It is therefore OK to update the speed value at a rate of 200 updates per second, i.e. whenever new values for both sensors are available.)

Question 3: I am using the schema sid, ts, x, y, z, |v|, vx, vy, vz, |a|, ax, ay, az for the input stream. However, some values seem to make no sense.

Answer 3: There was an error in the data description for the first few days after the challenge was published. The correct schema for the sensor stream is as follows: sid, ts, x, y, z, |v|, |a|, vx, vy, vz, ax, ay, az.

Question 4: The heat map should be computed with different parameters for the cell size. How exactly should the listed parameters be combined?

Answer 4: The combination of parameters should be the following:

8 × 13 → 104 cells
16 × 25 → 400 cells
32 × 50 → 1600 cells
64 × 100 → 6400 cells

Question 5: What are the specifications of the VM that we are supposed to deliver in the end?

Answer 5: We only need a vmdk file – we will specify the VM parameters ourselves. They will be identical for all participants. The minimum parameters will be: 4 cores (@2.8 GHz), 4 GB RAM, an SSD, and 1 Gbit Ethernet. The VM will be used for testing both throughput and correctness. Please note that the provisioning of the VM is on a voluntary basis and is only necessary for those who want to be included in the final ranking. Specifically, it does not influence your chances of getting accepted into the main event.

Question 6: There are negative values for the z-position. How can this be?

Answer 6: The absolute positioning along the z-axis is error prone. However, the relative movements along this axis are measured with high accuracy. Please use only relative values.

Question 7: What does it mean, “Results of all queries must be returned as a stream of data”? Written to a file? Transmitted via the event system?

Answer 7: It is enough to write results to stdout or to a file.

Question 8: Each sensor event has a timestamp. Is it required to time the data replay according to this timestamp?

Answer 8: It is okay to ignore the timestamp for replaying the data. However, the query results must be in accordance with the timestamp in the data (e.g. length of temporal windows).

Question 9: What are the exact start and end times for the two halves of the game?

Answer 9:
1st Half:
Start of Game: 10753295594424116
End of Game: 12557295594424116
Without Ball: 12398************

2nd Half:
Start of Game: 13086639146403495
End of Game: 14879639146403495

Towards the end of the 1st half we had technical issues with the locating system, so the last 2.5 minutes are without the active ball transmitter (see "Without Ball" above). Hence, the shot on goal and ball possession queries cannot produce valuable information for that time.

Question 10: How often should the result streams be updated?

Answer 10: Unless explicitly stated otherwise, each result stream must be updated for every input event.

Question 11: Should the windows in the heat map span halves?

Answer 11: No, the halves should be split.

Question 12: Are solutions supposed to work on actual real-time data, or can we take advantage of the fact that data is already stored in a CSV file (and thus make an implementation which is not suitable for a real-time system)?

Answer 12: We require all participants to keep the real-time aspect.

Question 13: Ball Possession – is the 1m distance from player to ball from the center or surface of the ball?

Answer 13: For simplicity the distance should be calculated as the distance between the two transmitters (the closest one on the player and the one in the ball).

Question 14: Running Analysis – what happens if we face a chain of intensity changes where the duration of each one is less than 1 second?

Answer 14: You should delay until you get a measurement lasting at least 1 second.

Question 15: The output frequency for the running analysis is 50Hz. However, it is necessary to delay the output in order to correctly process running intensity changes. Are these conflicting requirements?

Answer 15: The defined output rate should limit the output to a *maximum* of 50Hz. Specifically, we do not want to enforce an exact rate of 50Hz. Therefore, it is OK to delay the output if no result can be computed at the moment.

Question 16: For the Query 3 – Heat Map – is it consequently correct that (6400 + 1600 + 400 + 104) * 4 = 34016 events should be emitted by the system for this query per second?

Answer 16: Yes. In order to balance the I/O throughput it is also possible to alter the result format and provide one large event containing data for all cells in a given second for a given grid size.

Question 17: When I run my query for Q4, the time stamps I get are about five seconds ahead of both videos. Is this normal?

Answer 17: Yes, the video is not fully aligned with the sensor data (the sensor recordings start earlier). The actual game starts at timestamp 10753295594424116.

Question 18: I would like to know whether the result schema for Query 4 presents a typo or if it is meant to include two timestamps. The schema is: ts, player_id, ts, x, y, z, |v|, vx, vy, vz, |a|, ax, ay, az. What is the difference between the two “ts” fields?

Answer 18: Yes, this is a typo. A correct schema definition should contain only one timestamp as the first column. We have updated the query description accordingly.

DEBS 2012 Grand Challenge: Manufacturing equipment

The goal of the DEBS 2012 Grand Challenge is to provide a common ground and evaluation criteria for a competition aimed at both research and industrial event-based systems. The goal of the competition is to implement a solution to a problem provided by the DEBS 2012 Grand Challenge organizers. Solutions provided by the challenge participants will be evaluated and rated based on the following criteria: (1) correctness with respect to the problem specification, (2) throughput, and (3) latency.

Authors of all accepted submissions will be invited to present their systems during the DEBS 2012 Conference. All accepted submissions are to be accompanied by a paper, which will be included in the conference proceedings. Challenge papers must be in the ACM format for the conference proceedings and are limited to 3-6 pages. Upon the explicit consent of the challenge participants, their solutions will be included in the global ranking, with the winner of the challenge being announced and awarded during the banquet.

Problem Description

The DEBS 2012 Grand Challenge problem involves monitoring of large hi-tech manufacturing equipment. The overall goal of the challenge is to demonstrate the capability of the event processing system to solve the set of queries for the given data. Therefore, the problem description is divided into two major parts – the description of the available data and the description of the queries which need to be executed on the data in order to solve the problem.

Data

The complete data set for the DEBS Challenge is available via FTP: ftp://ftp.mi.fu-berlin.de/pub/debs2012/ or  ftp://ftp.fu-berlin.de/science/computer/debs2012/ (mirror). The directory contains both the data file and the MD5 sum.

Monitoring data is recorded by the manufacturing equipment itself using an embedded PC and a set of sensors. The data is recorded with a 100Hz frequency. However, it is possible that the time between two consecutive data points differs significantly from 10ms. The data is available as a flat CSV file, with each line representing a single event (data point). For the convenience of the DEBS Grand Challenge participants we have provided a data generator. The data generator is a simple runnable 32-bit JAR file which consumes the CSV data file and puts its contents into events which are subsequently sent via TCP/IP to a specified destination address. The generator JAR file can be downloaded from this link. The generator outputs (serializes) events in the Google Protocol Buffers format. We also provide source code for a very simple server capable of receiving the messages output by the generator. The schema of the output events is the following:

message CDataPoint {
required fixed64 ts     = 1; //time stamp: nanoseconds since 1st Jan 1970
required fixed64 index  = 2; //message index
required fixed32 mf01 = 3; //Electrical Power Main Phase 1
required fixed32 mf02 = 4; //Electrical Power Main Phase 2
required fixed32  mf03 = 5; //Electrical Power Main Phase 3
required fixed32 pc13 = 6; //Anode Current Drop Detection Cell 1
required fixed32 pc14 = 7; //Anode Current Drop Detection Cell 2
required fixed32 pc15 = 8; //Anode Current Drop Detection Cell 3
required uint32 pc25 = 9; //Anode Voltage Drop Detection Cell 1
required uint32 pc26 = 10; //Anode Voltage Drop Detection Cell 2
required uint32 pc27 = 11; //Anode Voltage Drop Detection Cell 3
required uint32 res  = 12;
required bool bm05  = 13; //Chem A Additive Sense
required bool bm06  = 14; //Chem B Additive Sense
required bool bm07  = 15; //Chem C Additive Sense
required bool bm08  = 16; //Chem A Additive Release Valve VL26
required bool bm09  = 17; //Chem B Additive Release Valve VL27
required bool bm10  = 18; //Chem C Additive Release Valve VL28
optional bool pp01  = 19;
optional bool pp02  = 20;
optional bool pp03  = 21;
optional bool pp04  = 22;
optional bool pp05  = 23;
optional bool pp06  = 24;
optional bool pp07  = 25;
optional bool pp08  = 26;
optional bool pp09  = 27;
optional bool pp10  = 28;
optional bool pp11  = 29;
optional bool pp12  = 30;
optional bool pp13  = 31;
optional bool pp14  = 32;
optional bool pp15  = 33;
optional bool pp16  = 34;
optional bool pp17  = 35;
optional bool pp18  = 36;
optional bool pp19  = 37;
optional bool pp20  = 38;
optional bool pp21  = 39;
optional bool pp22  = 40;
optional bool pp23  = 41;
optional bool pp24  = 42;
optional bool pp25  = 43;
optional bool pp26  = 44;
optional bool pp27  = 45;
optional bool pp28  = 46;
optional bool pp29  = 47;
optional bool pp30  = 48;
optional bool pp31  = 49;
optional bool pp32  = 50;
optional bool pp33  = 51;
optional bool pp34  = 52;
optional bool pp35  = 53;
optional bool pp36  = 54;
optional bool pc01  = 55;
optional bool pc02  = 56;
optional bool pc03  = 57;
optional bool pc04  = 58;
optional bool pc05  = 59;
optional bool pc06  = 60;
optional bool pc19  = 61;
optional bool pc20  = 62;
optional bool pc21  = 63;
optional bool pc22  = 64;
optional bool pc23  = 65;
optional bool pc24  = 66;
}

The only difference between the event schema output by the generator and the event schema in the CSV file is that for the generator output the time stamp format is converted from the ISO-like representation into the UNIX-like representation – see the above schema definition. In order to run the generator, the following command has to be issued:

$ java -jar generator.jar "/path/to/data/file.dat" 1.0 localhost 8080

where “/path/to/data/file.dat” is the path to the data file, 1.0 is the speedup factor for the data generation, localhost is the name or the IP address of the destination host, and 8080 is the port number on the destination host. One feature of the generator is that it outputs events preserving their relative occurrence time. This means that the input file is replayed at (almost) the same speed at which the original data was recorded. The replay speed can be influenced by modifying the speedup factor: specifying a value greater than 1.0 implies a replay speed which is higher than the original data rate, while specifying a value below 1.0 implies a replay speed lower than the original data rate.

We are still in the process of gathering the data. However, in order to allow Challenge participants to develop their solutions, we provide a small data file containing 5 minutes' worth of samples. Please use the link to access the file.

Queries

Within this section we describe the queries which are part of the DEBS 2012 Grand Challenge. In our description we assume that all queries operate on the same event schema as defined above. We describe every query using a block data flow diagram with white rectangles representing operators and gray rectangles representing event streams.

Query 1

The goal of the first query is to monitor the behavior of the Chem Additive sensors, which themselves are responsible for the monitoring of the Chem Additive Release valves – see the Figure above. As all input data in this task is boolean, the first operation performed by operators 1 through 6 is to detect the change of state of each input field (bm05 through bm10) and emit it along with the time stamp of the state change occurrence. The second set of operators (7 through 9) correlates the change of state of the sensor and the change of state of the valve by calculating the time difference between the occurrences of the state changes. Whenever the time difference increases by more than 1% within a 24-hour period, an alarm has to be raised. Moreover, constant monitoring of the trend of the time difference, using the least squares method over a period of 24 hours, has to be performed. The trend monitoring can be either visualized or returned as a stream of plot parameters.
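
For the least-squares trend, a minimal linear fit over the buffered (timestamp, time-difference) pairs could be sketched as follows; the names and the representation of the window contents are our assumptions:

// Fit dt = slope * (ts - ts[0]) + intercept over the 24-hour window and
// return { slope, intercept } as plot parameters for the trend line.
static double[] linearTrend(long[] ts, double[] dt) {
    int n = ts.length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        double x = (double) (ts[i] - ts[0]);   // offset for numerical stability
        sx += x; sy += dt[i];
        sxx += x * x; sxy += x * dt[i];
    }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double intercept = (sy - slope * sx) / n;
    return new double[] { slope, intercept };
}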

Query 2

The goal of the second query is to monitor the energy consumption of the manufacturing equipment. The energy consumption is recorded by the sensors mf01, mf02, and mf03. The first set of operators (operators 1 through 3) calculates the average values for each of the sensors (s1.avg-mf01 through s1.avg-mf03) as well as the relative variation (s1.rng-mf01 through s1.rng-mf03) in each of the sensors' readings. Both average and variation values are calculated over a period of 1 second and are output every second.
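
The exact definition of the relative variation is not fixed by the text above; one plausible reading, (max − min) / average over the one-second window, is sketched below with illustrative names:

// Per-second aggregates for one sensor: returns { average, relativeVariation }.
static double[] perSecondAggregates(double[] window) {
    double sum = 0, min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
    for (double r : window) {
        sum += r;
        min = Math.min(min, r);
        max = Math.max(max, r);
    }
    double avg = sum / window.length;
    double relVar = (avg == 0.0) ? 0.0 : (max - min) / avg;  // assumed definition
    return new double[] { avg, relVar };
}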

The relative variation is used to trigger the recording of the raw values of the sensor readings – see operator 4. Whenever the relative variation on any of the energy measuring sensors exceeds the threshold of 30%, the raw data from each of the sensors (mf01-mf03) needs to be recorded, starting 20 seconds before the occurrence of the threshold violation and ending 70 seconds afterwards. If multiple violations occur within the 90-second interval, it needs to be extended so that it always captures 70 seconds of raw data after the occurrence of the last violation and 20 seconds before the occurrence of the first violation.
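
The interval-extension rule can be captured by merging overlapping recording windows, for example (an illustrative sketch; times in seconds):

// Tracks the current raw-data recording window; each violation at time t
// demands the range [t - 20, t + 70], and overlapping ranges are merged.
final class RecordingWindow {
    private double start = Double.NaN;
    private double end = Double.NaN;

    void onViolation(double t) {
        if (Double.isNaN(end) || t - 20 > end) {
            start = t - 20;      // no overlap with the open window: start anew
        }
        end = t + 70;            // always extend to 70 s after the last violation
    }
}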

Finally, operators 5 through 7 record the power consumption of the manufacturing equipment over a period of one minute.

Q&A

Question: In Query 1 – should we assume that the initial values of bm05, …, bm10 are 0, 1, or unknown?
Answer: The initial values are either 0 or 1. However, it is not known a priori whether it is 0 or 1.

Question: How are GPB messages written to the socket?
Answer: Messages are written using an encoder that prepends the Google Protocol Buffers messages with a Base 128 Varint integer length field. Please refer to the sample server source code for detailed insight into the decoding of the messages.
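
If the receiving side is implemented in Java with the generated CDataPoint class, the standard protobuf-java delimited API reads exactly this varint length prefix; a minimal receive loop might look as follows (the port and the printed fields are illustrative):

import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class SimpleReceiver {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(8080);
             Socket client = server.accept();
             InputStream in = client.getInputStream()) {
            CDataPoint msg;
            // parseDelimitedFrom consumes the varint length prefix, then the message
            while ((msg = CDataPoint.parseDelimitedFrom(in)) != null) {
                System.out.println(msg.getIndex() + " @ " + msg.getTs());
            }
        }
    }
}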

Question: Query 1 – is the 24h trend containing just the values of the last 24 hours or is it a prediction for the next 24 hours?
Answer: It contains the last 24 hours.

Question: Query 1 – which kind of function should be used for the method of least squares: a simple linear or a non-linear like quadratic or polynomial?
Answer: Simple linear.

Question: Query 1 – when should the trend plot be updated?
Answer: Whenever the trend values change.

Question: Query 2 – What is the meaning of “now()” – is it the system time like “System.currentTimeMillis ()” in Java or is it the timestamp of the event (sensor reading) that causes the violation?
Answer: It is the timestamp of the sensor reading that caused the violation.

Question: Some lines in the input file seem to be merged (missing “\n”). See lines: 36000, 71999, 719998, 7319680, 18453793, 22725937, 28247589, 28473136, 28880898, 29077791, 29287746, 31847342.
Answer: Yes, this is a bug! We will fix it and upload a new file.

Question: Question 1 – “Whenever the time difference increases by more than 1% within a 24h period, an alarm has to be raised.” How should this 24h period be treated?
Answer: Using the operator 10 as an example – you should collect 24 hours’ worth of s58.dt and check whether the difference between any two values of s58.dt in the 24h window exceeds 1%. The window slides by removing tuples whose s58.ts < max(s58.ts) – 24h.