Evaluation from Dependent Samples via Bandit Algorithm: Approach from Standardized Martingales https://arxiv.org/pdf/2006.06982.pdf • Kato and Kaneko (2020), Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy https://arxiv.org/abs/2010.13554 3
Uehara*, and Yasui (*equal contribution). NeurIPS 2020 (spotlight). Off-Policy Evaluation and Learning for External Validity under a Covariate Shift https://arxiv.org/abs/2002.11642 4
Morishita, Fujita, and Shibata. WWW 2020. A Feedback Shift Correction in Predicting Conversion Rates under Delayed Feedback. https://dl.acm.org/doi/10.1145/3366423.3380032 • Kato and Yasui (2020). Learning Classifiers under Delayed Feedback with a Time Window Assumption. https://arxiv.org/abs/2009.13092 5
sample size as possible.
• Parameter of interest: Average Treatment Effect (ATE).
• How can we gain efficiency? → Adjust the assignment probability of treatment, a.k.a. the propensity score, to minimize the asymptotic variance of an estimator of the ATE. Cf. van der Laan (2008) and Hahn, Hirano, and Karlan (2011).
• Such a problem setting is called adaptive experimental design.
• Adaptive experimental design has a close relationship with off-policy evaluation from dependent samples and the multi-armed bandit problem (best-arm identification).
Motivation 7
a period $t$, we observe an individual with a covariate $X_t \in \mathcal{X}$.
• For the individual, we assign a treatment $A_t \in \{0,1\}$ with probability $\pi_t(a \mid X_t)$.
• Then, we immediately observe the outcome $Y_t$. Here, we assume a potential outcome framework; that is, $Y_t = \mathbb{1}[A_t = 1]\,Y_t(1) + \mathbb{1}[A_t = 0]\,Y_t(0)$ for potential outcomes $Y_t(a)$.
• The data-generating process is summarized as $(X_t, A_t, Y_t) \sim p(x)\,\pi_t(a \mid x)\,p(y \mid a, x)$.
• The ATE is defined as $\theta_0 = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]$.
Problem Setting 8
we minimize the efficiency bound by adjusting $\pi(1 \mid x)$.
⇒ Let us regard the efficiency bound as a function of $\pi(1 \mid x)$. The minimizer is given as
$\pi^*(1 \mid x) = \dfrac{\sigma(1 \mid x)}{\sigma(1 \mid x) + \sigma(0 \mid x)}$, where $\sigma(a \mid x) = \sqrt{\mathrm{Var}(Y(a) \mid x)}$.
Adaptive Experimental Design 10
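The efficient probability is simply a ratio of conditional standard deviations. A minimal sketch (the function name `neyman_allocation` is ours, not from the slides):

```python
def neyman_allocation(sigma1: float, sigma0: float) -> float:
    """Efficient treatment-assignment probability pi*(1|x) that minimizes
    the asymptotic variance of the ATE estimator:
    pi*(1|x) = sigma(1|x) / (sigma(1|x) + sigma(0|x))."""
    return sigma1 / (sigma1 + sigma0)

# Equal conditional standard deviations -> balanced assignment.
print(neyman_allocation(1.0, 1.0))  # 0.5
# A noisier treated arm receives a larger assignment probability.
print(neyman_allocation(2.0, 1.0))
```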
a treatment with the efficient probability $\pi^*(1 \mid X_t)$ minimizes the asymptotic variance of an ATE estimator.
• When $\mathrm{Var}(Y(a) \mid x)$ is given, this strategy is feasible. → However, in general, the conditional variance is unknown.
• During an experiment, we estimate the conditional variance and assign a treatment with the estimated efficient probability.
⇒ We adaptively design the experiment.
Adaptive Experimental Design 11
• Both papers first appeared in 2008.
• van der Laan (2008): sequentially estimate the conditional variance and construct an ATE estimator using the martingale property.
• Hahn, Hirano, and Karlan (2011): separate an experiment into two stages: 1. at the first stage, estimate the conditional variance; 2. at the second stage, assign a treatment based on the estimator from the first stage.
Existing Studies 12
Narita (2020) sequentially estimate the conditional variance.
1. At each period $t$, using the samples $\Omega_{t-1} = \{(X_s, A_s, Y_s)\}_{s=1}^{t-1}$, we construct an estimator $\hat{\sigma}_{t-1}(a \mid X_t)$ of the conditional standard deviation $\sigma(a \mid X_t)$.
2. Then, we assign a treatment as $\pi_t(1 \mid X_t, \Omega_{t-1}) = \dfrac{\hat{\sigma}_{t-1}(1 \mid X_t)}{\hat{\sigma}_{t-1}(1 \mid X_t) + \hat{\sigma}_{t-1}(0 \mid X_t)}$.
• Dependency problem: the samples $\{(X_s, A_s, Y_s)\}_{s=1}^{t-1}$ are not i.i.d. How do we estimate the ATE from dependent samples?
Dependency Problem 13
Kato et al. (2020) construct an estimator using the martingale property.
• Define
$h_t = \dfrac{\mathbb{1}[A_t = 1]\left(Y_t - \hat{f}_{t-1}(1, X_t)\right)}{\pi_t(1 \mid X_t, \Omega_{t-1})} - \dfrac{\mathbb{1}[A_t = 0]\left(Y_t - \hat{f}_{t-1}(0, X_t)\right)}{\pi_t(0 \mid X_t, \Omega_{t-1})} + \hat{f}_{t-1}(1, X_t) - \hat{f}_{t-1}(0, X_t)$,
where $\hat{f}_{t-1}(a, X_t)$ is an estimator of $\mathbb{E}[Y_t(a) \mid X_t]$ using only $\Omega_{t-1}$.
• Then, we define an ATE estimator as $\hat{\theta}_T = \frac{1}{T}\sum_{t=1}^{T} h_t$.
• Owing to the martingale property, we can derive asymptotic normality even though the samples are dependent.
Construction of Estimators under Dependency 14
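The sequential procedure can be sketched end to end. This is our illustration, not the authors' implementation: covariates are omitted, and arm-wise running means and standard deviations stand in for $\hat{f}_{t-1}$ and $\hat{\sigma}_{t-1}$.

```python
import random
import statistics

def a2ipw_ate(stream, rng):
    """Sequential A2IPW sketch: at each period, nuisance estimates use
    only past observations; the treatment is drawn from the estimated
    efficient probability; the scores h_t are then averaged."""
    obs = {0: [], 1: []}                      # past outcomes per arm (Omega_{t-1})
    scores = []
    for (y1, y0) in stream:                   # latent potential outcomes
        f_hat = {a: statistics.fmean(obs[a]) if obs[a] else 0.0 for a in (0, 1)}
        sig = {a: statistics.pstdev(obs[a]) if len(obs[a]) > 1 else 1.0
               for a in (0, 1)}
        pi1 = sig[1] / (sig[1] + sig[0])      # estimated efficient probability
        pi1 = min(max(pi1, 0.1), 0.9)         # clip away from 0 and 1
        a = 1 if rng.random() < pi1 else 0
        y = y1 if a == 1 else y0              # only the assigned arm is observed
        h = ((a == 1) * (y - f_hat[1]) / pi1
             - (a == 0) * (y - f_hat[0]) / (1.0 - pi1)
             + f_hat[1] - f_hat[0])
        scores.append(h)
        obs[a].append(y)
    return statistics.fmean(scores)

rng = random.Random(0)
# True ATE = 1.0: Y(1) ~ N(1, 2^2), Y(0) ~ N(0, 1).
stream = [(rng.gauss(1.0, 2.0), rng.gauss(0.0, 1.0)) for _ in range(20000)]
est = a2ipw_ate(stream, random.Random(1))    # typically close to 1.0
```

Because the same $\pi_t$ is used for both assignment and weighting, each score is conditionally unbiased given the past, which is exactly the martingale property exploited above.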
(i) pointwise convergence in probability of $\hat{f}_{t-1}$ and $\pi_t$; that is, for all $a \in \{0, 1\}$ and $x \in \mathcal{X}$, $\hat{f}_{t-1}(a, x) \xrightarrow{p} \mathbb{E}[Y(a) \mid x]$ and $\pi_t(a \mid x, \Omega_{t-1}) \xrightarrow{p} \pi(a \mid x)$, where $\pi$ is a time-invariant function; (ii) boundedness of the random variables. Then, $\sqrt{T}\left(\hat{\theta}_T - \theta_0\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$, where
$\sigma^2 = \mathbb{E}\!\left[\dfrac{\mathrm{Var}(Y(1) \mid X)}{\pi(1 \mid X)} + \dfrac{\mathrm{Var}(Y(0) \mid X)}{1 - \pi(1 \mid X)} + \left(\mathbb{E}[Y(1) \mid X] - \mathbb{E}[Y(0) \mid X] - \theta_0\right)^2\right]$.
• As we explain later, the step-wise estimator $\hat{f}_{t-1}$ also solves the non-Donsker problem.
Asymptotic Normality 15
the asymptotic variance
$\sigma^2 = \mathbb{E}\!\left[\dfrac{\mathrm{Var}(Y(1) \mid X)}{\pi^*(1 \mid X)} + \dfrac{\mathrm{Var}(Y(0) \mid X)}{1 - \pi^*(1 \mid X)} + \left(\mathbb{E}[Y(1) \mid X] - \mathbb{E}[Y(0) \mid X] - \theta_0\right)^2\right]$.
• Hadad et al. (2019) and Kato et al. (2020) derived these results independently, but van der Laan (2008) had already derived a similar result.
• Hadad et al. (2019) proposed adaptive reweighting to stabilize the estimator.
• Kato et al. (2020) derived a non-asymptotic deviation bound using a concentration inequality based on the law of the iterated logarithm.
• Kato (2020a) finds that a doubly robust type estimator works as Hadad's adaptive weight.
Efficient Asymptotic Variance 16
for Efficient Treatment Effect Estimation.
• We also proposed a method for adaptive experimental design.
• As in van der Laan's approach, we sequentially estimate the conditional variance and assign a treatment.
van der Laan (2008) and Kato et al. (2020) 17
• Summarize ATE estimation from dependent samples.
• Derive the asymptotic distribution (asymptotic result). * Partially overlapped with the result of van der Laan (2008).
• Derive a regret bound (non-asymptotic result).
• Show a non-asymptotic deviation bound.
• Based on the concentration inequality, we develop a method for efficient sequential testing.
Our Contributions 18
In sequential testing, we conduct hypothesis testing at every period.
• If we reject the null hypothesis, we stop the experiment. We can expect sequential testing to reduce the sample size when the null hypothesis can be rejected.
• Conventional method: the Bonferroni (BF) method, known to be too conservative. There are other methods for sequential testing, such as the BH method.
• These methods are based on multiple testing. → Recently, concentration-inequality-based methods have been proposed.
Sequential Testing 19
logarithm (LIL)-based concentration inequality for sequential testing.
• LIL-based concentration inequalities have recently gathered interest in the bandit community (Jamieson et al. (2014)).
• We modify the result of Balsubramani and Ramdas (2016) for efficient ATE estimation.
• Compared with other types of sequential testing, such as the Bonferroni method, concentration-inequality-based sequential testing controls the test error more appropriately.
Sequential Testing and Concentration Inequality 20
estimator $\hat{\theta}_t$, which holds for all $t$ with probability at least $1 - \delta$:
$\left|\hat{\theta}_t - \theta_0\right| \le \dfrac{C + \sqrt{2 V_t^* \left(\log\log V_t^* + \log\frac{4}{\delta}\right)}}{t}$,
where $C$ is a constant of order $\log\frac{1}{\delta}$, $V_t^* \approx \sum_{s=1}^{t} \epsilon_s^2$, and $\epsilon_s = h_s - \theta_0$.
• Based on this confidence bound, we can conduct sequential hypothesis testing because the bound holds for all $t$ simultaneously.
Non-asymptotic Bound of an ATE Estimator 21
Control the Type I error at $\alpha$.
1. Set $q_t = 1.1 \sqrt{\log\frac{1}{\alpha} + 2 \sum_{s=1}^{t} \epsilon_s^2 \log\left(\frac{\log \sum_{s=1}^{t} \epsilon_s^2}{\alpha}\right)}$, where $\epsilon_s = h_s$ under the null hypothesis $\theta_0 = 0$.
2. If $t\,|\hat{\theta}_t| = \left|\sum_{s=1}^{t} h_s\right| > q_t$, reject the null hypothesis.
• The coefficient 1.1 comes from the report of Balsubramani and Ramdas (2016).
Procedure of Sequential Testing 22
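A minimal sketch of this stopping rule. The constants follow the display above; `lil_threshold`, the variance guard, and the toy score sequences are our simplifications, so exact constants in the paper may differ.

```python
import math

def lil_threshold(eps_sq_sum: float, alpha: float) -> float:
    """LIL-style rejection threshold q_t for the running sum of scores;
    the 1.1 factor is from Balsubramani and Ramdas (2016)."""
    inner = max(math.log(max(eps_sq_sum, math.e)) / alpha, 1.0)
    return 1.1 * math.sqrt(math.log(1.0 / alpha)
                           + 2.0 * eps_sq_sum * math.log(inner))

def sequential_test(scores, alpha=0.05):
    """Return the first period at which the null (ATE = 0) is rejected,
    or None if it is never rejected."""
    s, v = 0.0, 0.0
    for t, h in enumerate(scores, start=1):
        s += h          # running martingale sum of the scores h_s
        v += h * h      # running sum of squared scores (variance proxy)
        if abs(s) > lil_threshold(v, alpha):
            return t
    return None

# Scores centered far from 0 trigger rejection; pure noise should not.
print(sequential_test([1.0] * 200))   # rejects at some finite period
print(sequential_test([0.0] * 200))   # None
```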
confidence interval.
• Recently, two directions have been proposed for sequential testing:
• Adaptive confidence intervals (Zhao et al. (2016)).
• Always valid $p$-values (Johari, Pekelis, and Walsh (2015)).
• Both generate the same rejection region in hypothesis testing.
Adaptive Confidence Interval and Always Valid p-value 23
sample size. ⇒ The sample size is not a random variable.
• Sequential hypothesis testing: we do not have to determine the sample size in advance. The sample size is also the stopping time, which is a random variable.
• The sample-size decision problem in standard hypothesis testing becomes an optimal stopping problem in sequential hypothesis testing.
• Advantage: when the null hypothesis can be rejected, we can reject it as soon as possible.
• Disadvantage: it is difficult to control the sample size when the null hypothesis cannot be rejected.
Sample Size and Optimal Stopping Time 24
can be rejected.
• The right is the case where the null hypothesis cannot be rejected.
• In "$T = 150$" and "$T = 300$", we conduct the standard test at that period.
• In "LIL" and "BF", we conduct sequential testing by the LIL-based and BF-based methods.
• "Testing" shows the percentage of trials in which the hypothesis is rejected.
• For "LIL" and "BF", we show the stopping time over 1,000 trials.
• The total sample size is 500. When the experiment does not stop, we report 500.
Experimental Results 25
to satisfy some conditions, such as Donsker's condition.
• Recently, Chernozhukov et al. (2018) proposed cross-fitting for semiparametric inference with non-Donsker-type nuisance estimators.
• Similar ideas were proposed by Klaassen (1987) and Zheng and van der Laan (2011).
• van der Laan and Lendle (2014) further established a method for semiparametric inference with non-Donsker-type nuisance estimators in the non-i.i.d. case.
• The idea is also based on martingale properties.
Donsker's Condition 26
not-i.i.d. samples is investigated by van der Laan and Lendle (2014).
• We call the martingale-based method adaptive-fitting.
• In adaptive-fitting, we use the property that the step-wise constructed nuisance estimators can be regarded as constants in the expectation conditioned on past information.
• The ATE estimator can be generalized into off-policy evaluation (OPE) estimators.
• In OPE, we want to estimate the policy value from samples generated by bandit algorithms.
Adaptive-Fitting 27
$h(X_t, A_t, Y_t; f, \pi) = \dfrac{\mathbb{1}[A_t = 1]\left(Y_t - f(1, X_t)\right)}{\pi(1 \mid X_t)} - \dfrac{\mathbb{1}[A_t = 0]\left(Y_t - f(0, X_t)\right)}{\pi(0 \mid X_t)} + f(1, X_t) - f(0, X_t)$.
• Cross-fitting:
• Suppose that the samples are i.i.d.
• Separate the dataset into two subsets, $\mathcal{D}_1$ and $\mathcal{D}_2$.
• Then, for nuisance estimators $\hat{f}_1(a, x)$ and $\hat{\pi}_1(a \mid x)$ constructed from $\mathcal{D}_1$, we define an estimator $\tilde{\theta}_2 = \frac{1}{|\mathcal{D}_2|}\sum_{t \in \mathcal{D}_2} h(X_t, A_t, Y_t; \hat{f}_1, \hat{\pi}_1)$.
• In $\mathbb{E}\left[h(X_t, A_t, Y_t; \hat{f}_1, \hat{\pi}_1) \mid \mathcal{D}_1\right]$, we can treat $\hat{f}_1$ and $\hat{\pi}_1$ as if they were constants.
Example: Doubly Robust Estimator 28
are not i.i.d.
• Then, for nuisance estimators $\hat{f}_{s-1}(a, x)$ and $\hat{\pi}_{s-1}(a \mid x)$ constructed from $\Omega_{s-1}$, we define an estimator $\ddot{\theta}_t = \frac{1}{t}\sum_{s=1}^{t} h(X_s, A_s, Y_s; \hat{f}_{s-1}, \hat{\pi}_{s-1})$.
• In $\mathbb{E}\left[h(X_s, A_s, Y_s; \hat{f}_{s-1}, \hat{\pi}_{s-1}) \mid \Omega_{s-1}\right]$, we can treat $\hat{f}_{s-1}$ and $\hat{\pi}_{s-1}$ as if they were constants.
• A more detailed discussion appears in Kato (2020a) https://arxiv.org/pdf/2010.03792.pdf.
Example: Doubly Robust Estimator 29
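Adaptive-fitting can be sketched with the doubly robust score above. The nuisance models here (arm-wise running means and a smoothed assignment frequency) are our stand-ins for illustration, not the paper's.

```python
import statistics

def dr_score(x, a, y, f_hat, pi_hat):
    """Doubly robust score h(X, A, Y; f, pi) for the ATE."""
    return ((a == 1) * (y - f_hat(1, x)) / pi_hat(1, x)
            - (a == 0) * (y - f_hat(0, x)) / pi_hat(0, x)
            + f_hat(1, x) - f_hat(0, x))

def adaptive_fitting_ate(data):
    """The score at period s uses nuisances built only from the first
    s-1 samples, so they act as constants given past information."""
    past, scores = [], []
    for (x, a, y) in data:
        ys = {arm: [yy for (_, aa, yy) in past if aa == arm] for arm in (0, 1)}
        f_hat = lambda arm, _x: statistics.fmean(ys[arm]) if ys[arm] else 0.0
        p1 = (sum(aa for (_, aa, _) in past) + 1) / (len(past) + 2)
        pi_hat = lambda arm, _x: p1 if arm == 1 else 1.0 - p1
        scores.append(dr_score(x, a, y, f_hat, pi_hat))
        past.append((x, a, y))
    return statistics.fmean(scores)

# Deterministic check: Y(1) = 1 and Y(0) = 0, so the true ATE is 1.
data = [(0.0, t % 2, float(t % 2)) for t in range(200)]
est = adaptive_fitting_ate(data)             # close to 1.0
```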
graph shows cross-fitting; the right graph shows adaptive-fitting.
• The y-axis represents the samples used for constructing the estimator; the x-axis represents the samples used for constructing the nuisance estimators.
Illustration of Cross-fitting and Adaptive-fitting 30
• Summarize various OPE estimators from dependent samples.
• In OPE, we want to estimate $\mathbb{E}\left[\sum_{a \in \mathcal{A}} \pi^{\mathrm{e}}(a \mid X)\,\mathbb{E}[Y(a) \mid X]\right]$, where $\pi^{\mathrm{e}}$ is known as an evaluation policy and $\mathcal{A}$ is an action space.
• We have a dataset $\{(X_t, A_t, Y_t)\}_{t=1}^{T}$ with $A_t \sim \pi_t(a \mid X_t, \Omega_{t-1})$, where $\pi_t(a \mid X_t, \Omega_{t-1})$ is known as a behavior policy.
• The samples are not i.i.d.
• For the following estimators, we assume that the behavior policy $\pi_t(a \mid x, \Omega_{t-1})$ converges to a time-invariant policy $\pi(a \mid x)$ in probability.
Various OPE Estimators 31
$h_t^{\mathrm{IPW}}(a) = \dfrac{\mathbb{1}[A_t = a]\,Y_t}{\pi_t(a \mid X_t, \Omega_{t-1})}$.
• An IPW estimator is defined as $\hat{\theta}_T^{\mathrm{IPW}} = \frac{1}{T}\sum_{t=1}^{T} \sum_{a \in \mathcal{A}} \pi^{\mathrm{e}}(a \mid X_t)\, h_t^{\mathrm{IPW}}(a)$.
• Note that we use the true behavior policy $\pi_t(a \mid X_t, \Omega_{t-1})$.
• It is difficult to replace the true behavior policy with its estimator, unlike in the i.i.d. case, such as Hirano, Imbens, and Ridder (2003).
• In addition, the asymptotic variance is larger than the efficiency bound.
We call this estimator an Adaptive IPW (AdaIPW) estimator.
Inverse Probability Weighting (IPW) Estimator 32
$h_t^{\mathrm{AIPW}}(a) = \dfrac{\mathbb{1}[A_t = a]\left(Y_t - \hat{f}_{t-1}(a, X_t)\right)}{\pi_t(a \mid X_t, \Omega_{t-1})} + \hat{f}_{t-1}(a, X_t)$.
• An AIPW estimator is $\hat{\theta}_T^{\mathrm{AIPW}} = \frac{1}{T}\sum_{t=1}^{T} \sum_{a \in \mathcal{A}} \pi^{\mathrm{e}}(a \mid X_t)\, h_t^{\mathrm{AIPW}}(a)$.
• Unlike a doubly robust estimator, we use the true behavior policy in the IPW component.
• Note that the estimator is unbiased.
We call this estimator an Adaptive AIPW (A2IPW) estimator.
Augmented IPW (AIPW) Estimator 33
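As a small numeric sketch with binary actions and known behavior probabilities (`ope_scores` and the toy log are ours, not from the slides), both estimators recover the policy value:

```python
import statistics

def ope_scores(log, pi_e, f_hat=None):
    """IPW and AIPW-type OPE estimates from logged bandit data.
    log: list of (x, a, y, pi_b) where pi_b maps actions to the TRUE
    behavior probabilities at that period; pi_e(a, x): evaluation
    policy; f_hat(a, x): outcome-regression estimate (0 if None)."""
    f = f_hat or (lambda a, x: 0.0)
    ipw, aipw = [], []
    for (x, a, y, pi_b) in log:
        ipw.append(pi_e(a, x) * y / pi_b[a])
        aipw.append(pi_e(a, x) * (y - f(a, x)) / pi_b[a]
                    + sum(pi_e(ap, x) * f(ap, x) for ap in pi_b))
    return statistics.fmean(ipw), statistics.fmean(aipw)

# Uniform logging, Y = A, evaluation policy always plays action 1,
# so the true policy value is E[Y(1)] = 1.
log = [(0.0, t % 2, float(t % 2), {0: 0.5, 1: 0.5}) for t in range(100)]
pi_e = lambda a, x: 1.0 if a == 1 else 0.0
ipw, aipw = ope_scores(log, pi_e, f_hat=lambda a, x: float(a))
print(ipw, aipw)  # 1.0 1.0
```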
policies and are unbiased.
• Hadad et al. (2019) pointed out that using the true behavior policy makes the estimator unstable because $\pi_t$ can take an extreme value at some period.
• Hadad et al. (2019) proposed using an adaptive weight for stabilization.
• An adaptive-weighted A2IPW estimator is defined as $\hat{\theta}_T^{\mathrm{AW}} = \dfrac{\sum_{t=1}^{T} w(X_t, \Omega_{t-1}) \sum_{a \in \mathcal{A}} \pi^{\mathrm{e}}(a \mid X_t)\, h_t^{\mathrm{AIPW}}(a)}{\sum_{t=1}^{T} w(X_t, \Omega_{t-1})}$, where $w(X_t, \Omega_{t-1})$ is called an adaptive weight.
Adaptive Reweighting AIPW Estimator 34
$h_t^{\mathrm{DR}}(a) = \dfrac{\mathbb{1}[A_t = a]\left(Y_t - \hat{f}_{t-1}(a, X_t)\right)}{\hat{\pi}_{t-1}(a \mid X_t)} + \hat{f}_{t-1}(a, X_t)$, where $\hat{\pi}_{t-1}$ is an estimator of the behavior policy $\pi_t$ constructed from $\Omega_{t-1}$.
• A DR estimator is $\hat{\theta}_T^{\mathrm{DR}} = \frac{1}{T}\sum_{t=1}^{T} \sum_{a \in \mathcal{A}} \pi^{\mathrm{e}}(a \mid X_t)\, h_t^{\mathrm{DR}}(a)$.
We call this estimator an Adaptive DR (ADR) estimator.
Doubly Robust (DR) Estimator 35
convergence in probability of $\pi_t$; that is, for all $a \in \mathcal{A}$ and $x \in \mathcal{X}$, $\pi_t(a \mid x, \Omega_{t-1}) \xrightarrow{p} \pi(a \mid x)$, where $\pi$ is a time-invariant function;
(ii) $\left\|\hat{f}_{t-1} - f\right\|_2 \left\|\hat{\pi}_{t-1} - \pi_t\right\|_2 = o_p\!\left(t^{-1/2}\right)$;
(iii) boundedness of the random variables.
Then, $\sqrt{T}\left(\hat{\theta}_T^{\mathrm{DR}} - \theta_0\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$, where
$\sigma^2 = \mathbb{E}\!\left[\dfrac{\mathrm{Var}(Y(1) \mid X)}{\pi(1 \mid X)} + \dfrac{\mathrm{Var}(Y(0) \mid X)}{1 - \pi(1 \mid X)} + \left(\mathbb{E}[Y(1) \mid X] - \mathbb{E}[Y(0) \mid X] - \theta_0\right)^2\right]$.
Asymptotic Normality of an ADR Estimator 36
and DR; that is, estimators without cross-fitting and adaptive-fitting.
• Denote AIPW and DR estimators with adaptive-fitting as A2IPW and ADR.
• Denote AIPW and DR estimators with cross-fitting as AIPWCF and DRCF.
• Denote IPW estimators with the true behavior policy and the estimated behavior policy as IPW and EIPW, respectively.
• Denote a direct method estimator $\frac{1}{T}\sum_{t=1}^{T} \sum_{a \in \mathcal{A}} \pi^{\mathrm{e}}(a \mid X_t)\, \hat{f}_T(a, X_t)$ as DM.
Theoretical Comparison 37
asymptotic normality.
• IPW does not achieve the efficiency bound. ⇄ A2IPW and ADR achieve the efficiency bound.
• A2IPW does not require a specific convergence rate for the nuisance estimators (van der Laan (2008), Hadad et al. (2019), Kato et al. (2020)).
• ADR requires a specific convergence rate for the nuisance estimators. See van der Laan and Lendle (2014) and Kato (2020a).
• The cross-fitting of Chernozhukov et al. (2018) also assumes $\left\|\hat{f}_{t-1} - f\right\|_2 \left\|\hat{\pi}_{t-1} - \pi_t\right\|_2 = o_p\!\left(t^{-1/2}\right)$.
- If $\hat{\pi}_{t-1} = \pi_t$, this condition holds if $\left\|\hat{f}_{t-1} - f\right\|_2 = o_p(1)$.
• Unlike the i.i.d. case, EIPW does not have asymptotic normality.
Theoretical Comparison 38
A2IPW because the true behavior policy causes instability.
• However, ADR may absorb such instability because it replaces the true behavior policy with its estimator.
• Ex. Suppose that $\pi_1(1 \mid x) = 0.5$, $\pi_2(1 \mid x) = 0.1$, and $\pi_3(1 \mid x) = 0.95$ for $x \in \mathcal{X}$. Although the behavior policy is unstable, the estimators $\hat{\pi}_1, \hat{\pi}_2, \hat{\pi}_3$ may not be so unstable if we estimate them from the past samples $\Omega_{t-1}$.
• This result is similar to findings on an IPW estimator with an estimated propensity score.
Empirical Comparison 39
to a time-invariant policy $\pi(a \mid x)$ in probability?
→ To use the central limit theorem for martingales, which requires the convergence of the variance (see Hall and Heyde, 1980).
• Are there any strategies without assuming the convergence of the policy?
• Standardization of martingales (Luedtke and van der Laan (2016), Kato (2020b)).
• Batch-based approach (Kato and Kaneko (2020)).
• A more detailed discussion appears in Luedtke and van der Laan (2016).
Other Strategies for Asymptotic Normality 41
al. (2020) to best arm identification.
• Kaufmann, Cappé, and Garivier (2016) showed the following:
• Given two Gaussian arms with different means and variances, denote the standard deviations as $\sigma_1$ and $\sigma_2$.
• The optimal strategy for finding the best arm is to pull the arms with ratio $\dfrac{\sigma_1}{\sigma_1 + \sigma_2} : \dfrac{\sigma_2}{\sigma_1 + \sigma_2}$.
• The optimality is defined for a non-asymptotic bandit setting.
• We want to extend the proposed method for (asymptotic) experimental design to (non-asymptotic) best arm identification.
• How can we bridge the asymptotic setting and the non-asymptotic setting?
On-going: Semiparametric Best Arm Identification 42
some evaluation policy $\pi^{\mathrm{e}}$ from the historical data obtained by a logging policy $\pi^{\mathrm{b}}$.
• $X$: State (e.g., age, height, etc. for people in the US), with $X \sim p(x)$.
• $A$: Action (therapy).
• $Y$: Reward (recovery rate from COVID-19).
• In a standard setting, the target of estimation is $\mathbb{E}_{(X, A, Y) \sim p(x)\,\pi^{\mathrm{e}}(a \mid x)\,p(y \mid a, x)}[Y]$.
OPE 44
: State. Ex. age, BMI, etc. for people in the US, with $X \sim p(x)$.
• The target of estimation is $\mathbb{E}_{(X, A, Y) \sim q(x)\,\pi^{\mathrm{e}}(a \mid x)\,p(y \mid a, x)}[Y]$.
• There is a distribution shift between the historical and evaluation data. Such a situation is called covariate shift (Shimodaira (2000)).
• Evaluation data $\{X_i'\}_{i=1}^{n_{\mathrm{eva}}}$ with $X_i' \sim q(x)$. $X_i'$: State. Ex. age, BMI, etc. for people in Japan.
OPE Under a Covariate Shift 45
with $\dfrac{\pi^{\mathrm{e}}}{\pi^{\mathrm{b}}}$.
• We also have to correct the difference in covariate densities with $\dfrac{q(x)}{p(x)}$. → This value is unknown.
• A naïve importance sampling estimator is
$\dfrac{1}{n_{\mathrm{hist}}} \sum_{i=1}^{n_{\mathrm{hist}}} \sum_{a \in \mathcal{A}} \dfrac{q(X_i)}{p(X_i)} \dfrac{\pi^{\mathrm{e}}(a \mid X_i)}{\pi^{\mathrm{b}}(a \mid X_i)}\, \mathbb{1}[A_i = a]\, Y_i$.
• This estimator is consistent. But can we do better?
• The above estimator is mentioned by Hirano, Imbens, and Ridder (2003).
Naïve Estimator 46
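The double reweighting (density ratio times policy ratio) can be sketched directly; `naive_is_value` and the toy distributions below are our illustration:

```python
import statistics

def naive_is_value(hist, pi_e, dens_ratio):
    """Naive importance-sampling estimate of the policy value under a
    covariate shift: each logged sample (x, a, y, pi_b_a) is reweighted
    by the density ratio q(x)/p(x) and the policy ratio
    pi_e(a|x)/pi_b(a|x)."""
    terms = [dens_ratio(x) * pi_e(a, x) / pi_b_a * y
             for (x, a, y, pi_b_a) in hist]
    return statistics.fmean(terms)

# Toy check: p puts mass 0.5/0.5 on x in {0, 1}, q puts 0.25/0.75,
# logging is uniform over two actions, and Y = x regardless of action.
ratio = lambda x: 0.5 if x == 0 else 1.5     # q(x)/p(x)
pi_e = lambda a, x: 1.0 if a == 1 else 0.0   # always play action 1
hist = [(x, a, float(x), 0.5) for x in (0, 1) for a in (0, 1)]
print(naive_is_value(hist, pi_e, ratio))     # 0.75 = E_q[Y(1)] here
```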
Evaluation and Learning for External Validity under a Covariate Shift.
• Contributions:
• Q. What is the lower bound in estimating the policy value? A. We derive the lower bound following semiparametric theory.
• Q. Can we achieve this lower bound? A. The doubly robust estimator can achieve it under mild nonparametric rate conditions.
• Q. Policy learning? A. Use a doubly robust score.
Efficient OPE under Covariate Shift 47
Sugiyama, Suzuki, and Kanamori (2012).
• However, when using the estimated density ratio as an importance weight for covariate shift, the sample average does not have asymptotic normality ($\sqrt{n}$-consistency).
• Ex. For $\hat{\theta} = \hat{\mathbb{E}}_{X \sim p(x)}\!\left[g(X)\,\hat{r}(X)\right]$, where $g(X) \in \mathbb{R}$, $\hat{r}(X)$ is an estimator of the density ratio $\frac{q(X)}{p(X)}$, and $\hat{\mathbb{E}}_{X \sim p(x)}$ is a sample approximation using $X_i \sim p(x)$, it is difficult to show that $\sqrt{n}\left(\hat{\theta} - \mathbb{E}_{X \sim q(x)}[g(X)]\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ when $\hat{r}(X)$ does not satisfy Donsker's condition.
Naïve Covariate Shift 48
• Separate $\{(X_i, A_i, Y_i)\}_{i=1}^{n_{\mathrm{hist}}}$ into $\mathcal{D}_1$ and $\mathcal{D}_2$ with sample sizes $n_1^{\mathrm{hist}}$ and $n_2^{\mathrm{hist}}$.
• Separate $\{X_i'\}_{i=1}^{n_{\mathrm{eva}}}$ into $\mathcal{D}_1'$ and $\mathcal{D}_2'$ with sample sizes $n_1^{\mathrm{eva}}$ and $n_2^{\mathrm{eva}}$.
• Doubly robust estimator under a covariate shift:
$\hat{\theta} = \dfrac{1}{n_1^{\mathrm{hist}}} \sum_{i \in \mathcal{D}_1} \sum_{a \in \mathcal{A}} \hat{w}_2(a \mid X_i)\,\mathbb{1}[A_i = a]\left(Y_i - \hat{f}_2(a, X_i)\right) + \dfrac{1}{n_1^{\mathrm{eva}}} \sum_{i \in \mathcal{D}_1'} \sum_{a \in \mathcal{A}} \pi^{\mathrm{e}}(a \mid X_i')\,\hat{f}_2(a, X_i'),$
where $\hat{w}_2(a \mid X_i)$ and $\hat{f}_2(a, X_i)$ are estimators of $\dfrac{q(X_i)}{p(X_i)} \dfrac{\pi^{\mathrm{e}}(a \mid X_i)}{\pi^{\mathrm{b}}(a \mid X_i)}$ and $\mathbb{E}[Y(a) \mid X_i]$ constructed from $\mathcal{D}_2$ and $\mathcal{D}_2'$.
Proposed Estimator 49
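The two-term structure (a weighted-residual term on historical samples plus a direct term on evaluation covariates) can be sketched as follows; `dr_covariate_shift` and the toy nuisances are our illustration under the assumption that the nuisance estimates are already fitted on the other fold:

```python
import statistics

def dr_covariate_shift(hist, eva, w_hat, f_hat, pi_e, actions=(0, 1)):
    """Doubly robust policy-value sketch under covariate shift.
    w_hat(a, x) should estimate q(x)/p(x) * pi_e(a|x)/pi_b(a|x);
    f_hat(a, x) estimates E[Y(a)|x]; eva holds evaluation covariates."""
    resid = statistics.fmean(
        sum(w_hat(a, x) * (aa == a) * (y - f_hat(a, x)) for a in actions)
        for (x, aa, y) in hist)
    direct = statistics.fmean(
        sum(pi_e(a, xp) * f_hat(a, xp) for a in actions) for xp in eva)
    return resid + direct

# Perfect outcome model: the residual term vanishes and the direct
# term alone gives the policy value on the evaluation covariates.
f = lambda a, x: float(x + a)
hist = [(0, 1, f(1, 0)), (1, 0, f(0, 1))]
eva = [0, 1]
pi_e = lambda a, x: 1.0 if a == 1 else 0.0
val = dr_covariate_shift(hist, eva, w_hat=lambda a, x: 1.0, f_hat=f, pi_e=pi_e)
print(val)  # 1.5
```

This also makes the double robustness plausible: with a perfect `f_hat` the residual term is zero for any `w_hat`, and conversely a correct `w_hat` corrects a misspecified `f_hat` in expectation.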
• The efficiency bound is calculated based on the results of van der Vaart (1998) and Wooldridge (2001). (Here we skip the details.)
• Has double robustness: if either $\hat{w}(a \mid x)$ or $\hat{f}(a, x)$ is consistent, the proposed estimator is also consistent.
• The doubly robust estimator is useful for policy learning.
Theoretical Properties 50
have a dataset $\{(X_t, \tilde{Y}_t)\}_{t=1}^{T}$.
• $X_t$ is a feature, and $\tilde{Y}_t \in \{-1, +1\}$ is a binary label.
• The label $\tilde{Y}_t$ may not be the true one and can be flipped: some time after the observation, the first observed label may flip into the true binary label $Y_t$.
Problem Setting 52
item ($Y_t = +1$) using the user's feature $X_t$ by showing an advertisement.
• At first, we cannot observe a response; therefore, we regard the label as negative, $\tilde{Y}_t = -1$.
→ If the user buys an item from the advertisement some time after the impression, the label turns positive, $\tilde{Y}_t = Y_t = +1$.
Example: Advertisement Optimization 53
that old samples have true labels.
• Parametric approach: survival analysis models (Chapelle (KDD 2014)). The performance is very poor.
• Nonparametric (?) approach: use only old samples. → By additionally assuming stationarity of the time series: importance weighting (Yasui et al. (WWW 2020)); unbiased risk approach (Kato and Yasui (2020)). The convergence rate depends only on the sample size of the old samples.
2. Maximize the worst-case performance under an adversarial attack: Cesa-Bianchi, Gentile, and Mansour (2019).
Two Approaches 54
learning with noisy labels and learning from positive and unlabeled (PU) data.
• As a special case of learning with noisy labels, du Plessis et al. and follow-up work proposed methods for PU learning.
• Therefore, the methods for delayed feedback are similar to those for learning with noisy labels and PU learning.
• Ex. Yasui et al. (2020) is similar to Liu and Tao (2015); Kato and Yasui (2020) is similar to du Plessis et al. (2015).
Machine Learning Perspective 55
ence function of locally asymptotically linear estimators. Annals of Statistics, 1987.
• van der Laan, M. J. The construction and analysis of adaptive group sequential designs. 2008.
• van der Laan, M. J. and Lendle, S. D. Online targeted learning. 2014.
• Luedtke, A. R. and van der Laan, M. J. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics, 2016.
• Zheng, W. and van der Laan, M. J. Cross-validated targeted minimum-loss-based estimation. In Targeted Learning: Causal Inference for Observational and Experimental Data, Springer Series in Statistics. 2011.
• Johari, R., Pekelis, L., and Walsh, D. J. Always valid inference: Bringing sequential analysis to A/B testing. arXiv preprint arXiv:1512.04922, 2015.
• Balsubramani, A. and Ramdas, A. Sequential nonparametric testing with the law of the iterated logarithm. In UAI, 2016.
Reference of Adaptive Experimental Design 57
and Athey, S. Confidence intervals for policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768, 2019.
• Hahn, J., Hirano, K., and Karlan, D. Adaptive experimental design using the propensity score. Journal of Business and Economic Statistics, 29(1):96–108, 2011.
• Zhao, S., Zhou, E., Sabharwal, A., and Ermon, S. Adaptive concentration inequalities for sequential decision problems. In NeurIPS, pp. 1343–1351. Curran Associates, Inc., 2016.
• Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21:C1–C68, 2018.
• Kaufmann, E., Cappé, O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. JMLR, 2016.
Reference of Adaptive Experimental Design 58
KDD, 2014. • Yasui, S., Morishita, G., Komei, F., and Shibata, M. A feedback shift correction in predicting conversion rates under delayed feedback. In WWW, 2020. • du Plessis, M., Niu, G., and Sugiyama, M. Convex formulation for learning from positive and unlabeled data. In ICML, 2015. • Cesa-Bianchi, N., Gentile, C., and Mansour, Y. Delay and cooperation in nonstochastic bandits. JMLR, 2019. Reference of Delayed Feedback 59
weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
• Sugiyama, M., Suzuki, T., and Kanamori, T. Density Ratio Estimation in Machine Learning. 2012.
• Hirano, K., Imbens, G., and Ridder, G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71:1161–1189, 2003.
• Wooldridge, J. M. Asymptotic properties of weighted M-estimators for standard stratified samples. Econometric Theory, 2001.
• van der Vaart, A. W. Asymptotic Statistics. Cambridge University Press, Cambridge, UK, 1998.
Reference of OPE under Covariate Shift 60