Top-down attention plays an important role in guiding human attention in real-world scenarios, yet it has received comparatively little effort in computational modeling of visual attention. In this paper, inspired by the mechanisms of top-down attention in human visual perception, we propose a multi-layer linear model of top-down attention that actively modulates bottom-up saliency maps. The first layer is a linear regression model that combines bottom-up saliency maps computed over various visual features and objects. A context-dependent upper layer adaptively tunes the parameters of the lower-layer model. Finally, a selection-history mask is applied to the fused attention map to bias attention selection towards task-related regions. We derive an efficient learning algorithm for the model. We evaluate the model on a set of natural egocentric videos captured with wearable glasses while the wearer performs different tasks in real-world environments, a setting that is more realistic for studying human attention under a natural view of a dynamic world. Our model outperforms the baseline and state-of-the-art bottom-up saliency models.
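The three stages described above can be illustrated with a minimal sketch. It is not the model's actual formulation: the linear form of the upper layer (a hypothetical matrix `U` applied to a context vector), the multiplicative history mask, the function names, and the toy 2x2 maps are all assumptions made for illustration, and the learning algorithm is not shown.

```python
def context_weights(U, context):
    """Upper layer (assumed linear): one fusion weight per saliency map."""
    return [sum(u * c for u, c in zip(row, context)) for row in U]


def fuse_maps(maps, weights):
    """Lower layer: pixel-wise weighted sum of the bottom-up saliency maps."""
    height, width = len(maps[0]), len(maps[0][0])
    return [
        [sum(w * m[y][x] for w, m in zip(weights, maps)) for x in range(width)]
        for y in range(height)
    ]


def apply_history_mask(fused, mask):
    """Scale the fused map by a selection-history mask (assumed multiplicative)."""
    return [
        [f * m for f, m in zip(f_row, m_row)]
        for f_row, m_row in zip(fused, mask)
    ]


def attended_location(attention):
    """Return the (row, col) position with the highest attention value."""
    positions = (
        (y, x) for y, row in enumerate(attention) for x in range(len(row))
    )
    return max(positions, key=lambda p: attention[p[0]][p[1]])


# Toy data (hypothetical): two 2x2 bottom-up maps, e.g. a colour map and an object map.
colour_map = [[1.0, 0.0], [0.0, 0.0]]
object_map = [[0.0, 0.0], [0.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]        # assumed upper-layer parameters
context = [0.2, 0.8]                # assumed context descriptor
history_mask = [[1.0, 1.0], [1.0, 0.1]]  # down-weights the bottom-right region

weights = context_weights(U, context)
fused = fuse_maps([colour_map, object_map], weights)
attention = apply_history_mask(fused, history_mask)
location = attended_location(attention)
```

In this toy case the context favours the object map, so the fused map peaks at the bottom-right pixel, but the history mask suppresses that region and the selected location moves to the top-left pixel.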