Supporting airline ticket purchase: Buy now or wait?

Date of analysis: May 28, 2008

Table of Contents

  1. Problem statement
  2. Data collection and processing
    2.1. Data collection
    2.2. Missing value processing
    2.3. Dependent variables and Independent variables
    2.4. Data reduction: factor analysis
    2.5. Final data
  3. Analysis
    3.1. Scatter plot diagram
    3.2. Decision Tree
  4. Summary

Problem statement

Airline ticket prices fluctuate with many conditions, and airlines do not disclose their pricing policies, so prices are difficult to forecast. As a result, passengers on the same plane may pay very different amounts for a seat. Smart buyers want to pay as little as possible, but deciding when to buy a ticket is very difficult.

Online travel companies such as Priceline offer low-fare alert services to help customers purchase airline tickets. A customer who is interested in a flight is notified by email when the price reaches a level specified in advance. However, because these services react to what has already happened, customers may miss the right time to buy if prices keep rising.

To help customers buy airline tickets, I analyzed data on the factors affecting ticket prices and used data mining techniques to find patterns. Based on a prediction of whether the ticket price will rise or fall tomorrow, customers can be advised either to buy the ticket now (Buy) or to wait (Wait).

Data collection and processing

Data collection

Airfare data were collected directly from KAYAK and Priceline between March 1, 2008 and May 28, 2008. Fares fluctuate so much that they are updated about 4 to 7 times a day, so the lowest fare observed each day is used. The collected airfare data are summarized below.

Item             Description
Search dates     March 1, 2008 ~ May 28, 2008 (90 days)
Collection Term  1 day
Airline          Multiple airlines (Non-stop and stopover using different airlines)
Route            LAX (Los Angeles) to HNL (Honolulu)
Grade            Economy
Trip way         Round trip
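
The "lowest fare of the day" reduction described above can be sketched as follows. This is only an illustration: the function name and the (search_date, fare) record layout are assumptions made for the example, and the fares are made-up values, not collected data.

```python
def daily_lowest_fare(observations):
    """Return {search_date: lowest fare observed on that date}.

    `observations` is an iterable of (search_date, fare) pairs. This layout
    is an assumption for the sketch, not the format of the collected data.
    """
    lowest = {}
    for search_date, fare in observations:
        if search_date not in lowest or fare < lowest[search_date]:
            lowest[search_date] = fare
    return lowest


# Hypothetical fares observed several times on two search dates.
sample = [
    ("2008-03-01", 512.0), ("2008-03-01", 498.0), ("2008-03-01", 505.0),
    ("2008-03-02", 520.0), ("2008-03-02", 515.0),
]
lowest_by_day = daily_lowest_fare(sample)
```

Each search date then contributes a single ticket_price value, which matches the one-day collection term in the summary table.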

Data were collected on airfares and on factors that affect ticket prices (e.g., season, day of the week, departure time, and number of stops). I also collected external factors (e.g., oil prices) that can affect ticket prices.

Dependent variables and Independent variables

The following variables were collected for a total of 253 instances.

Dependent Variables (DVs)

  • buy (b) or wait (w)
    • Buy: Today price < Next day price, Wait: Today price >= Next day price
  • ticket_price
    • Lowest price on the search date
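
The buy/wait rule above can be expressed as a small labeling function. This is a minimal sketch, assuming the daily lowest fares for one itinerary are available in chronological order; the function name and the sample prices are hypothetical.

```python
def label_buy_or_wait(prices):
    """Label each day 'b' (buy) or 'w' (wait) from the next day's price.

    prices: daily lowest fares in chronological order for one itinerary.
    The last day has no next-day price, so it receives no label.
    """
    labels = []
    for today, tomorrow in zip(prices, prices[1:]):
        # Buy when today's price is strictly below tomorrow's, otherwise wait.
        labels.append("b" if today < tomorrow else "w")
    return labels


sample_prices = [500.0, 520.0, 520.0, 480.0]  # hypothetical fares
labels = label_buy_or_wait(sample_prices)
```

Note that an unchanged price counts as Wait, because the rule uses a strict inequality for Buy.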

Independent Variables (IVs)

Attribute               Type     Description
remain_days             Numeric  Days remaining until travel date from the search date
travel_duration         Numeric  Days of travel
oil_price               Numeric  iPath S&P GSCI Crude Oil Tot Ret Idx ETN (OILNF)
yesterday_ticket_price  Numeric  The lowest airfare of the day before the search date
yesterday_oil_price     Numeric  Oil price the day before the search date
trip_leave_date         Ordinal  - highly_busy (0): Summer vacation season (3rd week of July ~ 3rd week of August), New Year’s Day, Christmas season (4th week of December)
                                 - busy (1): Easter, Memorial Day, Independence Day, Labor Day, Thanksgiving Day
                                 - moderate (2): 2nd week of April ~ 2nd week of July, 2nd week of October ~ 2nd week of December
                                 - not_busy (3): other dates
trip_leave_week         Ordinal  - most_busy (0): Saturday
                                 - busy (1): Friday
                                 - moderate (2): Sunday
                                 - not_busy (3): other days of the week
trip_leave_time         Nominal  - pre_am: 12 am ~ 6 am
                                 - early_am: 6 am ~ 9 am
                                 - morning: 9 am ~ 12 pm
                                 - afternoon: 12 pm ~ 5 pm
                                 - evening: 5 pm ~ 9 pm
                                 - night: 9 pm ~ 12 am
stop                    Numeric  Number of stopovers to destination
quarter                 Ordinal  - end_quarter (0): 3rd to 4th weeks of March, June, September, December
                                 - not_end_quarter (1): other dates
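
A few of the attributes above can be derived directly from dates and times. The sketch below is illustrative only. The function names are hypothetical, and because the time ranges in the table share their boundary hours, the code assumes that a boundary hour (for example 6 am) belongs to the later range.

```python
from datetime import date, time


def remain_days(search_date, travel_date):
    """Days remaining from the search date until the travel date."""
    return (travel_date - search_date).days


def trip_leave_week(travel_date):
    """Ordinal code for the departure day of the week."""
    weekday = travel_date.weekday()  # Monday = 0 ... Sunday = 6
    if weekday == 5:
        return 0  # most_busy: Saturday
    if weekday == 4:
        return 1  # busy: Friday
    if weekday == 6:
        return 2  # moderate: Sunday
    return 3      # not_busy: other days


def trip_leave_time(departure):
    """Nominal time-of-day bucket; boundary hours go to the later bucket (assumption)."""
    hour = departure.hour
    if hour < 6:
        return "pre_am"
    if hour < 9:
        return "early_am"
    if hour < 12:
        return "morning"
    if hour < 17:
        return "afternoon"
    if hour < 21:
        return "evening"
    return "night"


days_left = remain_days(date(2008, 3, 1), date(2008, 7, 4))
friday_code = trip_leave_week(date(2008, 7, 4))    # July 4, 2008 was a Friday
saturday_code = trip_leave_week(date(2008, 8, 2))  # August 2, 2008 was a Saturday
bucket = trip_leave_time(time(9, 30))
```

The trip_leave_date and quarter codes depend on week-of-month and holiday definitions that the table gives only loosely, so they are left out of this sketch.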

Missing value processing

Ticket prices are also influenced by oil prices. Oil prices from March 1, 2008 to May 28, 2008 were taken from the iPath S&P GSCI Crude Oil Tot Ret Idx ETN (OILNF), as listed in the finance section of the US Yahoo site. Because the ETN is not traded on weekends and holidays, those days have missing values. Each missing day was filled with the price from the most recent trading day. For example, Monday, May 26, 2008 was Memorial Day in the US, so it has no oil price data even though it is a weekday, and it was filled with the price traded on Friday, May 23, 2008.
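
The fill rule described above is a forward fill. A minimal sketch, assuming the quotes are held in a dictionary keyed by date; the function name is hypothetical and the quote values are placeholders, not actual OILNF prices.

```python
from datetime import date, timedelta


def forward_fill(prices_by_day, start, end):
    """Fill every calendar day in [start, end] with the most recent quote.

    Days before the first available quote are left out, because there is
    no earlier price to carry forward.
    """
    filled = {}
    last_price = None
    day = start
    while day <= end:
        if day in prices_by_day:
            last_price = prices_by_day[day]
        if last_price is not None:
            filled[day] = last_price
        day += timedelta(days=1)
    return filled


# Placeholder quotes around Memorial Day 2008 (values are made up).
quotes = {date(2008, 5, 23): 80.0, date(2008, 5, 27): 79.0}
filled = forward_fill(quotes, date(2008, 5, 23), date(2008, 5, 27))
```

Here Saturday May 24, Sunday May 25, and Monday May 26 all inherit the Friday May 23 value.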

Data reduction: factor analysis

Factor analysis was performed to reduce the number of variables, and three variables were removed: stop, trip_leave_time, and quarter. The stop variable was removed because the number of stops did not affect the price significantly; the collected data mix non-stop flights with stopover itineraries that use different airlines. The trip_leave_time variable was collected on the assumption that daytime flights are more crowded and therefore more expensive than night flights, but it was removed because it showed no significant correlation with the price. The quarter variable was collected on the expectation that airlines would sell tickets cheaply at the end of a quarter to increase total revenue, but it was also removed because it showed no correlation with the price.

I did not use principal component analysis for further selection, because there were few variables to begin with and the remaining variables were considered meaningful.

Final data

airfare.csv

Analysis

Scatter plot diagram

Airfare tends to increase as the travel date approaches.
For the same number of days remaining until the travel date, Thursday and Sunday departures start at a lower price, while Friday and Saturday departures start at a higher price.

To investigate the factors affecting ticket prices, I drew scatter plots with an IV on the X-axis and the DV (ticket_price) on the Y-axis.

Scatter plot grouped by trip_leave_week (X-axis: remain_days, Y-axis: ticket_price)
[Figure: research_airfareplot1]
Notably, even when the number of days remaining until the travel date (remain_days) is the same, the price differs by the day of the week of departure (trip_leave_week). For example, with 80 days remaining, Thursday (purple) and Sunday (yellow) start at a lower price, while Friday (blue) and Saturday (green) start at a higher price.

Scatter plot grouped by trip_leave_date (X-axis: remain_days, Y-axis: ticket_price)
[Figure: research_airfareplot3]
Likewise, even when the number of days remaining until the travel date (remain_days) is the same, the price differs by the date of departure (trip_leave_date). For example, with 80 days remaining, 05/29/2008 (blue) and 06/01/2008 (green) start at a lower price, while 07/04/2008 (yellow) and 08/02/2008 (purple) start at a higher price. These differences reflect peak and non-peak seasons: July 4 is Independence Day in the United States, and August 2 falls in the peak summer vacation season.

Decision Tree

K-means clustering
I used K-means clustering to partition the data set and find the characteristics of each cluster. K starts at 2, and clustering is repeated with a larger K until the lift value no longer increases. Clustering was run from K = 2 to K = 8 and stopped at K = 8 because the lift value no longer improved.

  • Metrics
               Predicted yes  Predicted no
    True       a              b
    False      c              d

    - precision = a / (a + c)
    - recall = a / (a + b)
    - lift = (a / (a + c)) / ((a + b) / (a + b + c + d))
  • Data set
    raw data  instances #      lift         precision       recall
              buy  wait total  buy   wait   buy   wait      buy      wait
              78   175  253    1     1      0.3   0.69      1        1
  • cluster K = 2
    cluster # instances #      lift         precision       recall
              buy  wait total  buy   wait   buy   wait      buy      wait
    1         56   109  165    1.1   0.95   0.33  0.66      0.71     0.62
    2         22   66   88     0.81  1.08   0.25  0.75      0.28     0.37
  • cluster K = 3
    cluster # instances #      lift         precision       recall
              buy  wait total  buy   wait   buy   wait      buy      wait
    1         21   55   76     0.9   1.05   0.27  0.72      0.26     0.31
    2         20   61   81     0.8   1.09   0.25  0.75      0.26     0.35
    3         37   59   96     1.25  0.88   0.38  0.65      0.48     0.34
  • cluster K = 4
    cluster # instances #      lift         precision       recall
              buy  wait total  buy   wait   buy   wait      buy      wait
    1         39   82   121    1.05  0.98   0.32  0.677686  0.5      0.4685
    2         15   45   60     0.81  1.08   0.25  0.75      0.1923   0.2571
    3         19   44   63     0.97  1      0.3   0.698413  0.24359  0.2514
    4         5    4    9      1.8   0.6    0.55  0.444444  0.0641   0.0228
  • cluster K = 5
    cluster # instances #      lift         precision       recall
              buy  wait total  buy   wait   buy   wait      buy      wait
    1         15   45   60     0.81  1.08   0.25  0.75      0.1923   0.2571
    2         16   42   58     0.89  1.05   0.27  0.724138  0.2051   0.24
    3         30   36   66     1.47  0.78   0.45  0.545455  0.3846   0.2057
    4         13   48   61     0.69  1.14   0.21  0.786885  0.1666   0.2742
    5         4    4    8      1.62  0.72   0.5   0.5       0.0512   0.0228
  • cluster K = 6
    cluster # instances #      lift         precision       recall
              buy  wait total  buy   wait   buy   wait      buy      wait
    1         21   42   63     1.08  0.96   0.33  0.666667  0.2692   0.24
    2         8    22   30     0.86  1.06   0.26  0.733333  0.1025   0.1257
    3         4    4    8      1.62  0.72   0.5   0.5       0.0512   0.0228
    4         7    21   28     0.81  1.08   0.25  0.75      0.0897   0.12
    5         26   45   71     1.18  0.91   0.36  0.633803  0.3333   0.2571
    6         12   41   53     0.7   1.12   0.22  0.773585  0.1538   0.2342
  • cluster K = 7
    cluster # instances #      lift         precision       recall
              buy  wait total  buy   wait   buy   wait      buy      wait
    1         13   43   56     0.75  1.11   0.23  0.767857  0.1666   0.2457
    2         8    20   28     0.92  1.03   0.28  0.714286  0.1025   0.1142
    3         5    8    13     1.24  0.88   0.38  0.615385  0.0641   0.0457
    4         12   41   53     0.73  1.12   0.22  0.773585  0.1538   0.2342
    5         3    4    7      1.42  0.82   0.42  0.571429  0.038    0.0227
    6         11   26   37     0.96  1.01   0.29  0.702703  0.141    0.1485
    7         26   33   59     1.46  0.81   0.44  0.559322  0.3333   0.188
  • cluster K = 8
    cluster # instances #      lift         precision       recall
              buy  wait total  buy   wait   buy   wait      buy      wait
    1         8    31   39     0.68  1.15   0.2   0.794872  0.1025   0.1771
    2         13   18   31     1.4   0.84   0.41  0.580645  0.1666   0.1028
    3         21   34   55     1.3   0.89   0.38  0.618182  0.2692   0.1942
    4         4    8    12     1.11  0.96   0.33  0.666667  0.0512   0.045
    5         9    39   48     0.63  1.17   0.18  0.8125    0.1153   0.222
    6         8    7    15     1.78  0.67   0.53  0.466667  0.102    0.04
    7         3    4    7      1.42  0.82   0.42  0.571429  0.0384   0.022
    8         12   34   46     0.87  1.07   0.26  0.73913   0.1538   0.1942
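
The precision, recall, and lift formulas listed under Metrics can be checked against the instance counts in the tables. The sketch below assumes one particular reading of the confusion matrix, which is my interpretation rather than something the text states: for the buy class, instances inside a cluster count as "predicted yes", so a is the number of buy instances in the cluster, c the wait instances in the cluster, and b and d the buy and wait instances outside it. The function name is hypothetical.

```python
def cluster_metrics(cluster_buy, cluster_wait, total_buy, total_wait):
    """Precision, recall, and lift of the buy class for one cluster.

    Assumed mapping: a = buy in cluster, c = wait in cluster,
    b = buy outside the cluster, d = wait outside the cluster.
    """
    a = cluster_buy
    c = cluster_wait
    b = total_buy - cluster_buy
    d = total_wait - cluster_wait
    precision = a / (a + c)
    recall = a / (a + b)
    base_rate = (a + b) / (a + b + c + d)  # share of buy in the whole data set
    lift = precision / base_rate
    return precision, recall, lift


# Cluster 1 of K = 2: 56 buy and 109 wait, out of 78 buy and 175 wait overall.
precision, recall, lift = cluster_metrics(56, 109, 78, 175)
```

Under this reading the computed values agree with the tabulated 0.33, 0.71, and 1.1 for that cluster once the decimals are truncated. A lift above 1 means the cluster contains a higher share of buy instances than the data set as a whole.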

CRT (Classification and Regression Tree)

The decision tree shows that the major factors affecting airline ticket prices are oil_price and remain_days.
According to this model, buying the ticket today is advisable when oil_price > 64.62, remain_days > 50, and yesterday_oil_price < 69.57 (both oil prices are OILNF quotes).
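
The single rule quoted above can be written as a small check. This is a sketch of that one rule only: the function name is hypothetical, and because the other branches of the tree appear only in the figure, the function returns None rather than guessing a recommendation when the rule does not apply.

```python
def recommend_from_rule(oil_price, remain_days, yesterday_oil_price):
    """Return 'buy' when the quoted CRT rule matches, otherwise None.

    Only the one rule stated in the text is encoded here. Other branches
    of the tree are not described in the text, so no recommendation is
    made for inputs that fall outside this rule.
    """
    if oil_price > 64.62 and remain_days > 50 and yesterday_oil_price < 69.57:
        return "buy"
    return None


# Hypothetical inputs for illustration.
matching = recommend_from_rule(oil_price=66.0, remain_days=60, yesterday_oil_price=68.0)
not_matching = recommend_from_rule(oil_price=60.0, remain_days=60, yesterday_oil_price=68.0)
```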

  • CRT tree based on the clusters of K = 1, 3, 5
    [Figure: research_airfareCRT]

  • Model summary
    [Figure: research_airfareCRTsummary]

  • Validation (10-Fold Cross Validation)
    [Figure: research_airfareCRTvalidation]

Summary

These results show that airfare tends to increase as the travel date approaches. For the same number of days remaining until the travel date, Thursday and Sunday departures start at a lower price, while Friday and Saturday departures start at a higher price.

A prediction of whether tomorrow's airfare will rise or fall relative to today's can be useful to customers who want to purchase airline tickets. As mentioned above, the optimal time to buy is difficult to predict because airline pricing is complicated and depends on many factors. Much more data, collected over a longer period, is needed to build a more accurate prediction model.