DEBS 2015 Grand Challenge: Taxi trips

Call for Grand Challenge Solutions

The ACM DEBS 2015 Grand Challenge is the fifth in a series of challenges which seek to provide a common ground and evaluation criteria for a competition aimed at both research and industrial event-based systems. The goal of the 2015 DEBS Grand Challenge competition is to evaluate event-based systems in the context of real-time analytics over high volume geospatial data streams. The underlying scenario addresses the analysis of taxi trips based on a stream of trip reports from New York City. Specifically, the 2015 Grand Challenge targets following problems: the identification of recent frequent routes and identification of high profit regions. The corresponding queries demand the analysis of taxi location data along with other trip meta-data.

The data for the 2015 Grand Challenge is based on a data set released under the FOIL (The Freedom of Information Law) and made public by Chris Whong (http://chriswhong.com/open-data/foil_nyc_taxi/). Details about the data, the queries for the DEBS 2015 Grand Challenge, and information about how the challenge will be evaluated is provided below.

Data

Provided data consists of reports of taxi trips including starting point, drop-off point, corresponding timestamps, and information related to the payment. Data are reported at the end of the trip, i.e., upon arrive in the order of the drop-off timestamps.

The specific attributes are listed below:

medallion an md5sum of the identifier of the taxi – vehicle bound
hack_license an md5sum of the identifier for the taxi license
pickup_datetime time when the passenger(s) were picked up
dropoff_datetime time when the passenger(s) were dropped off
trip_time_in_secs duration of the trip
trip_distance trip distance in miles
pickup_longitude longitude coordinate of the pickup location
pickup_latitude latitude coordinate of the pickup location
dropoff_longitude longitude coordinate of the drop-off location
dropoff_latitude latitude coordinate of the drop-off location
payment_type the payment method – credit card or cash
fare_amount fare amount in dollars
surcharge surcharge in dollars
mta_tax tax in dollars
tip_amount tip in dollars
tolls_amount bridge and tunnel tolls in dollars
total_amount total paid amount in dollars

Following are the first ten lines from the data file:

07290D3599E7A0D62097A346EFCC1FB5,E7750A37CAB07D0DFF0AF7E3573AC141,2013-01-01 00:00:00,2013-01-01 00:02:00,120,0.44,-73.956528,40.716976,-73.962440,40.715008,CSH,3.50,0.50,0.50,0.00,0.00,4.50

22D70BF00EEB0ADC83BA8177BB861991,3FF2709163DE7036FCAA4E5A3324E4BF,2013-01-01 00:02:00,2013-01-01 00:02:00,0,0.00,0.000000,0.000000,0.000000,0.000000,CSH,27.00,0.00,0.50,0.00,0.00,27.50

0EC22AAF491A8BD91F279350C2B010FD,778C92B26AE78A9EBDF96B49C67E4007,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.71,-73.973145,40.752827,-73.965897,40.760445,CSH,4.00,0.50,0.50,0.00,0.00,5.00

1390FB380189DF6BBFDA4DC847CAD14F,BE317B986700F63C43438482792C8654,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.48,-74.004173,40.720947,-74.003838,40.726189,CSH,4.00,0.50,0.50,0.00,0.00,5.00

3B4129883A1D05BE89F2C929DE136281,7077F9FD5AD649AEACA4746B2537E3FA,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.61,-73.987373,40.724861,-73.983772,40.730995,CRD,4.00,0.50,0.50,0.00,0.00,5.00

5FAA7F69213D26A42FA435CA9511A4FF,00B7691D86D96AEBD21DD9E138F90840,2013-01-01 00:02:00,2013-01-01 00:03:00,60,0.00,0.000000,0.000000,0.000000,0.000000,CRD,2.50,0.50,0.50,0.25,0.00,3.75

DFBFA82ECA8F7059B89C3E8B93DAA377,CF8604E72D83840FBA1978C2D2FC9CDB,2013-01-01 00:02:00,2013-01-01 00:03:00,60,0.39,-73.981544,40.781475,-73.979439,40.784386,CRD,3.00,0.50,0.50,0.70,0.00,4.70

1E5F4C1CAE7AB3D06ABBDDD4D9DE7FA6,E0B2F618053518F24790C7FD0264E302,2013-01-01 00:03:00,2013-01-01 00:04:00,60,0.00,-73.993973,40.751266,0.000000,0.000000,CSH,2.50,0.50,0.50,0.00,0.00,3.50

468244D1361B8A3EB8D206CC394BC9E9,BB899DFEA9CC964B50C540A1D685CCFB,2013-01-01 00:00:00,2013-01-01 00:04:00,240,1.71,-73.955383,40.779728,-73.967758,40.760326,CSH,6.50,0.50,0.50,0.00,0.00,7.50

5F78CC6D4ECD0541B765FECE17075B6F,B7567F5BFD558C665D23B18451FE1FD1,2013-01-01 00:00:00,2013-01-01 00:04:00,240,1.21,-73.973000,40.793140,-73.981453,40.778465,CRD,6.00,0.50,0.50,1.30,0.00,8.30

The data file is sorted chronologically according to the dropoff_datetime. Events with the same dropoff_datetime are in random order. Please note that the quality of the data is not perfect. Some events might miss information such as drop off and pickup coordinates or fare information. Moreover, some information, such as, e.g., the fare price might have been entered incorrectly by the taxi drivers thus introducing additional skew. Handling of data quality issues is outside of the scope of this year’s Grand Challenge.

Please follow this link to access the data file comprising the first twenty days (roughly 2 million events) of data. The file is approximately 130 MB in size.

Please follow this link to access the full data file containing data for the whole year 2013. The file is approximately 12 GB in size and contains approximately 173 million events.

Query 1: Frequent Routes

The goal of the query is to find the top 10 most frequent routes during the last 30 minutes. A route is represented by a starting grid cell and an ending grid cell. All routes completed within the last 30 minutes are considered for the query. The output query results must be updated whenever any of the 10 most frequent routes changes. The output format for the result stream is:

pickup_datetime, dropoff_datetime, start_cell_id_1, end_cell_id_1, ... , start_cell_id_10, end_cell_id_10, delay

where pickup_datetime, dropoff_datetime are the timestamps of the trip report that resulted in an update of the result stream, start_cell_id_X the starting cell of the Xth-most frequent route, end_cell_id_X the ending cell of the Xth-most frequent route. If less than 10 routes can be identified within the last 30 min, then NULL is to be output for all routes that lack data.

The attribute “delay” captures the time delay between reading the input event that triggered the output and the time when the output is produced. Participants must determine the delay using the current system time right after reading the input and right before writing the output. This attribute will be used in the evaluation of the submission.

The cells for this query are squares of 500 m X 500 m. The cell grid starts with cell 1.1, located at 41.474937, -74.913585 (in Barryville). The coordinate 41.474937, -74.913585 marks the center of the first cell. Cell numbers increase towards the east and south, with the shift to east being the first and the shift to south the second component of the cell, i.e., cell 3.7 is 2 cells east and 6 cells south of cell 1.1. The overall grid expands 150km south and 150km east from cell 1.1 with the cell 300.300 being the last cell in the grid. All trips starting or ending outside this area are treated as outliers and must not be considered in the result computation.

Query 2: Profitable Areas

The goal of this query is to identify areas that are currently most profitable for taxi drivers. The profitability of an area is determined by dividing the area profit by the number of empty taxis in that area within the last 15 minutes. The profit that originates from an area is computed by calculating the median fare + tip for trips that started in the area and ended within the last 15 minutes. The number of empty taxis in an area is the sum of taxis that had a drop-off location in that area less than 30 minutes ago and had no following pickup yet.

The result stream of the query must provide the 10 most profitable areas in the subsequent format:

pickup_datetime, dropoff_datetime, profitable_cell_id_1, empty_taxies_in_cell_id_1, median_profit_in_cell_id_1, profitability_of_cell_1, ... , profitable_cell_id_10, empty_taxies_in_cell_id_10, median_profit_in_cell_id_10, profitability_of_cell_10, delay

with attribute names containing cell_id_1 corresponding to the most profitable cell and attribute containing cell_id_10 corresponding to the 10th most profitable cell. If less than 10 cell can be identified within the last 30 min, then NULL is to be returned for all cells that lack data. Query results must be updated whenever the 10 most profitable areas change. The pickup_datetime, dropoff_datetime in the output are the timestamps of the trip report that triggered the change.

The attribute “delay” captures the time delay between reading the input event that triggered the output and the time when the output is produced. Participants must determine the delay using the current system time right after reading the input and right before writing the output. This attribute will be used in the evaluation of the submission.

Note: We use the same numbering scheme as for query 1 but with a different resolution. In query two we assume a cell size of 250m X 250m, i.e., the area to be considered spans from cell 1.1 to cell 600.600.

Evaluation

The evaluation will be done using virtual machines (VirtualBox). Participants of the challenge must provide a virtual machine that contains and runs theirs solution. Within the virtual machine they must provide a script that executes their solution by replaying the GC from a file. The script must also measure the duration of the execution and output the execution time on the screen.

For both queries it is assumed that the result is calculated in a streaming fashion – i.e.: (1) solutions must not make use of any pre-calculated information, such as indices and (2) result streams must be updated continuously.

For the evaluation we will run a series of test runs for each solution and take the median execution time as result. For the test runs we will provide a virtual machine with access to 4 Cores and 8 GB of RAM.

The ranking of the submissions will be done using the total duration of the execution and the average delay per stream. Specifically, we will rank the solutions (a) by the total execution time as well as (b) by the average latency (combining the average results from both streams in a sum). The final ranking will be determined by a score that is the sum of the rank for execution time and the rank for delay. For solutions with the same overall score we will favor the solution with the lower execution time.

Example:

Execution time ranks:
Time Rank 1, Solution A, 10min
Time Rank 2, Solution B, 11min
Time Rank 3, Solution C, 30min
Time Rank 4, Solution D, 40min
Time Rank 5, Solution E, 45min

Delay ranks:
Delay Rank 1, Solution B, 0.5sec
Delay Rank 2, Solution A, 0.6sec
Delay Rank 3, Solution D, 1sec
Delay Rank 4, Solution E, 2sec
Delay Rank 5, Solution C, 10sec

Overall Rank:
Rank 1, Time Rank 1 + Delay Rank 2, Solution A
Rank 2, Time Rank 2 + Delay Rank 1, Solution B
Rank 3, Time Rank 4 + Delay Rank 3, Solution D
Rank 4, Time Rank 3 + Delay Rank 5, Solution C
Rank 5, Time Rank 5 + Delay Rank 4, Solution E

FAQ

Question 1: Is the earth flat or how to map coordinates to cells?

Answer: For the challenge we allow a simplified flat earth assumption for mapping coordinates to cells in the queries. You can assume that a distance of 500 meter south corresponds to a change of 0.004491556 degrees in the coordinate system. For moving 500 meter east you can assume a change of 0.005986 degrees in the coordinate system.

Question 2: What is a change of the top 10 list? Only when there is a new entry or also when the order changes?

Answer: Also when the order changes

Question 3: What happens to cells that do not have empty taxies and cause a decision by zero for query 2?

Answer: Results for this situation are undefined and should not be considered for calculating the output.

Question 4: Can you give us a concrete example of how a correctly formatted output event should look like?

Yes:
For Query 1:

2013-01-01 00:01:00,2013-01-01 00:02:00,1.7,2.9,10.3,20.27,40.3,30.27,40.3,30.28,1.1,1.2,120.1,150.2,200.1,150.2,120.1,170.2,10.1,15.2,11.1,15.7,1001

For Query 2:

2013-01-01 00:01:00,2013-01-01 00:02:00,1.1,2,50.10, ,10.1,2,40.10,,101.11,4,50.10,111.1,4,40.00,51.1,7,30.00,,102.1,10,31.10,12.1,10,31.09,1.31,10,30.00,1.122,10,30.00,211.11,11,30.00,1002

Note: the concrete numbers are toy examples to illustrate the formatting

Question 5: Should the queries run simultaneously during the evaluation or one after the other?

Answer: The queries must run simultaneously. However, the output must got two separated streams (i.e. end up in two different files for evaluation)

Question 6: In query 2 we must factor in the number of empty taxis in an area. This is defined as taxies that recently dropped someone off in that area and had no following pickup yet. However, since a trip is reported at its end, how should we know about taxies that had a pickup but are still driving with that customer?

Answer: It is correct to not account for trips that have not been reported yet. The system cannot know about these events and hence the queries cannot factor them in.

Question 7: How should we order elements in a list that have the same value for the ordering criterion?

Answer: You should always put the freshest information first. E.g. if route A and B have the same frequency, put the route with the freshest input information fist (i.e. the one which includes the freshest event).

Question 8: There are some negative values for fare_amount and tip_amount. Are these errors? How should we handle these values?

Answer: The negative values are errors. They should not be considered in any calculation. However, you can still use the corresponding events for calculations that do not need the fare_amount and tip_amount.

Question 9: Should we literally print the ASCII word “NULL” or should it be the NULL string (i.e., zero characters)?

Answer: Please write the ASCII word “NULL”.

Question 10: How to update the top 10 list when it is changed due to events that leave the window?

Answer: For updates that are caused by events leaving the window, you should use the timestamp that denotes when the result changes. E.g. in a 15 min window, that would be the timestamp of the relevant leaving event + 15min.

Question 11: About the delay measurements. Should we At which exact point should the “delay” attribute start “counting”?

(a) Since the moment the line is read from disk?

(b) Since the moment the line *starts* being parsed?

(c) Since the moment the line *has been* parsed?

Answer: (b) is the correct option. We do not want to capture the disk I/O in the delay and want to be agnostic to all buffering that might take place before the actual event processing starts. However, we consider parsing as part of the processing, because it and may be done in different ways. Hence, parsing should be captured by the delay measurements. Your system should have a clearly defined “entry point” where it receives the events as they are encoded in the file. The measurement is to be taken once the raw string encoded event passed that point. Please specify your approach to delay measurements in your corresponding paper submission.

Question 12:

At which exact point should the “delay” attribute stop “counting”?

(a) At the moment the top-10 list has been *computed* (but before being converted to an ASCII string)?

(b) At the moment the top-10 list has been converted to ASCII (but before the actual call to fprintf/write)?

(c) After the fprintf/write has returned?

Answer: Again, it is not our intention to capture disk I/O. However, creating the output in the right format is considered part of the processing. Thus, (b) is the right option.

Question 13: How do you ensure that the delay measurements are implemented in an appropriate way?

Answer: Please describe how you implemented the delay measurements in the corresponding paper.

Question 14: How are you going to test correctness, given that the precise output is very sensitive to minor things, like floating point precision errors in coordinate-to-cell conversion, etc.

Answer: Our goal is to have solutions that are similar enough to compare their performance. We have not seen the result files yet, thus it is too early to give a final answer. However, it is not our intent to disqualify solutions due to minor deviations that e.g. might be the result of rounding errors. What we want to avoid with the correctness requirement is that solutions use approximations or other ways of simplifying the actual computational problem.

So to be very clear about this: Do not use approximations as a way to speed up your solution.

 

FAQ For Test Cases

About Test Case 1:

The second route appears at 3:22 and remains the second most frequent for 14 Minutes. This is until 214 minutes after the start (and not 224 as in the description).

The first route should disappear 30 minutes after the last event for this route. The last event is at 3:20 and therefore the corresponding 30min is empty at 03:50 (not 03:53:00 as in the description).

About Test Case 3:

There have been questions about the cell IDs for Q2. The grids for Q1 and Q2 differ in their resolution. However, the center of the first cell is the same for both grids and the synthetic data uses the cell centers of the Q1 grid as coordinates. Therefore, the events for cell 1.1 are in cell 1.1 for both grids.

Further Questions:

Question 1: Is an event perceived has having caused an update when it leaves a time window and hence changes the result? Does this mean there are two measurements for end-to-end delay for this event (i.e. when in enters the system and when it leaves the window)?

Answer 1: The trip report event triggers a delay measurement only once (i.e. when it enters the system and causes and update). The updates due to timeouts are actually caused by a time event that has no end-to-end delay.

Questions 2: The grids for Q1 and Q2 are of different sizes. That is, the coarser grained grid covers a bigger area than the more fine grained one. What does this mean for treating outliers? Should we only consider events that are in both grids?

Answer 2: You should always use the grid for the given query as reference. That is, everything outside the grid for Q1 is an outlier regarding Q1 and everything outside the grid for Q2 is and outlier for Q2.

Question 3: Our solution manages time “globally” and merges simultaneous outputs into one consolidated output that reflects the query results at the given point in time. Is this ok?

Answer 3: Yes, this is ok. However, consecutive updates according to the order of arrival without consolidation are fine as well.

Question 4: There are some differences (e.g. formatting differences) in the original data sets and the data for the test cases. Specifically, the last line is missing the \n in test case 2, some profit values are pure integers without a decimal part as opposed to precisely 2 decimal digits and reported durations for trips are not the difference between the pickup-time and the dropoff-time. We tuned our solution to what we found in the original data. What should we do?

Answer 4: The test cases use synthetic data that was created to help debugging certain aspects of the solutions. The original dataset is normative for the challenge and you do not need to adapt if your solutions has issues that are caused by deviations in the synthetic data.

Grand Challenge Co-Chairs

Zbigniew Jerzak (Zbigniew.Jerzak_at_sap.com), SAP SE, Germany

Holger Ziekow (Holger.Ziekow_at_hs-furtwangen.de), Furtwangen University, Germany

DEBS Grand Challenge participants are encouraged to contact Grand Challenge organizers with questions and suggestions regarding the problem. Main contact points for solution authors are: Zbigniew Jerzak (Zbigniew.Jerzak_at_sap.com) and Holger Ziekow (hziekow_at_agtinternational.com). Unless explicitly requested by the question author(s) all answers will be made available online on the DEBS Grand Challenge website.

Email webmaster