

Cloud Fill Algorithm

Brief description

When a pixel has missing data, we essentially search an expanding zone around that pixel, looking for "good" data. All "good" values in that zone (such as, say, 30 to 40 km away) are averaged together to form the replacement value. What gets entertaining is that there comes a point when it is better to search forward and backward in time (+/- 8 days, for example) than to search farther and farther away. So the short answer is that it is a simple averaging method; the complexity is that the "zone" we look at is not constrained to the current time-slice.
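The spatial half of the idea can be sketched in a few lines. This is a minimal illustration, not the operational code: the grid, the NaN convention for "missing", the 9-km pixel size, and the first three zone boundaries are all assumptions made for the example.

```python
import numpy as np

def fill_pixel(grid, row, col, pixel_km=9.0, zones=((0, 20), (20, 30), (30, 40))):
    """Average the first annular zone (in km) around (row, col) that holds good data."""
    nrows, ncols = grid.shape
    rr, cc = np.mgrid[0:nrows, 0:ncols]
    dist_km = np.hypot(rr - row, cc - col) * pixel_km   # distance of every pixel
    for lo, hi in zones:                                # expand outward, zone by zone
        in_zone = (dist_km > lo) & (dist_km <= hi)
        good = in_zone & ~np.isnan(grid)
        if good.any():
            return float(np.nanmean(grid[good]))        # average of the "good" values
    return np.nan                                       # nothing found in any zone

grid = np.full((11, 11), np.nan)      # everything missing, including the gap at (5, 5)
grid[5, 7] = 2.0                      # good data 2 pixels (18 km) away -> zone 1
grid[5, 3] = 4.0
print(fill_pixel(grid, 5, 5))         # averages the two zone-1 neighbours -> 3.0
```

In the real algorithm the search does not stop at the last spatial zone; as described below, it jumps to neighboring time-slices first.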

Gaps in the data

The Level-3 data that one gets from NASA's Ocean Color site have gaps in the coverage, often due to clouds. The number of gaps depends on the variable being examined. SST4, for example, is among the least disturbed by water vapor, and it has very good coverage compared to, say, Chl. Further, when a gap is present in an 8-day file, it means there has been a persistent inability to get a "good" pixel value in any of the daily files. Often our analyses are better performed if we can use a consistent approach to filling these gaps. This white-paper describes the algorithm used on the ancillary data here at the Ocean Productivity web site, before they are sent into the NPP algorithms. It is a minor detail, but one cannot gap-fill the NPP values directly: that would project "full sun" NPP values into an area with diminished light (because of the clouds), and the estimate would not be correct. The proper way is to gap-fill the ancillary data, and then calculate NPP.

General approach

There are different methods for interpolating data to fill missing coverage, ranging from simple linear interpolation to bicubic splines to more elaborate statistical methods. What we apply here is a simple method for estimating the average value of the pixel: we search through expanding, predefined zones until good values are found, and then average those values to represent the pixel at the center of the search.

What makes this different from most spatial filling problems is that satellite data also provide estimates both forward and backward in time (from the 8-day files before and after the current file of interest, etc.). At some point, the missing value is better estimated by looking at the values from 8 days before and after than by looking farther and farther away from the point. We determined when that happens by computing correlation coefficients for the different zones and different time-slices, and then ordering those coefficients from best to worst (giving us our search order).

The correlation coefficient analysis was performed on chl data to obtain a global representation of how the search order should proceed. Once the search order was determined, we apply the same approach to filling all data types.
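The ranking step can be sketched as follows. The correlation values below are made up purely to illustrate how (time-slice, zone) pairs would be ordered; the actual coefficients came from the chl analysis described above.

```python
# Keys are (time_offset_in_files, zone); values are illustrative correlations
# of each zone/slice with the pixel of interest (invented numbers, not the
# real chl statistics).
corr = {
    (0, 1): 0.95, (0, 2): 0.90, (0, 3): 0.85, (0, 6): 0.70, (0, 7): 0.55,
    (1, 1): 0.65, (1, 2): 0.60, (1, 3): 0.58,   # +/- 8 days, zones 1-3
}

# Rank the (time, zone) pairs from best to worst correlation: this ordered
# list *is* the search order.
search_order = sorted(corr, key=corr.get, reverse=True)
print(search_order)
```

Note how zones 1-3 at +/- 8 days outrank zone 7 at time zero, which is exactly the behavior the real search order (tabulated below) exhibits.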

Expanding spatial zones

The search distance around the pixel runs from zero to 400 km. The first zone spans 0 to 20 km; the second, 20 to 30 km; then 30 to 40 km, and so on in 10-km steps out to 60 km. From 60 to 300 km the step size is 20 km (60 to 80, then 80 to 100, etc.), and from 300 to 400 km it is 50 km. There are 19 search zones in total.

zone   start (km)   stop (km)
  1        0           20
  2       20           30
  3       30           40
  4       40           50
  5       50           60
  6       60           80
  7       80          100
  8      100          120
  9      120          140
 10      140          160
 11      160          180
 12      180          200
 13      200          220
 14      220          240
 15      240          260
 16      260          280
 17      280          300
 18      300          350
 19      350          400
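The zone table above is regular enough to generate in code. The snippet below builds the 19 (start, stop) pairs and adds a small helper mapping a distance to its zone number; the helper name and the half-open interval convention are conveniences for this sketch.

```python
# The 19 zones from the table: 0-20 km, then 10-km steps to 60 km,
# 20-km steps to 300 km, and 50-km steps to 400 km.
ZONES = [(0, 20), (20, 30), (30, 40), (40, 50), (50, 60)] \
      + [(s, s + 20) for s in range(60, 300, 20)] \
      + [(300, 350), (350, 400)]

def zone_of(dist_km):
    """Return the 1-based zone containing dist_km, or None beyond 400 km."""
    for i, (lo, hi) in enumerate(ZONES, start=1):
        if lo <= dist_km < hi:
            return i
    return None

print(len(ZONES))                          # -> 19
print(zone_of(18), zone_of(90), zone_of(360))  # -> 1 7 19
```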

Expanding temporal slices

Along with the spatial coverage around the "gap", we also have the same spatial coverage at different snapshots in time (i.e., observations 8 days before and after the pixel of interest, etc.). The 19 search zones applied to the current 8-day composite hdf file are also applied to the hdf files before and after that file. Later, if we are still searching for "good" data, we look at two files before/after the file of interest, and so on. For example, zones one through six (0 to 20 km through 60 to 80 km) are searched first on the 8-day file that contains the gap. However, the data in zones one, two, and three of the 8-day files before and after are more highly correlated with the pixel of interest than zone seven of the original file, so zones one, two, and three in those files are searched before zone seven at time zero.

Search order

Here's the search order for the expanding spatial zones and expanding temporal slices. You can see the search switches from time 0 (first column) to time +/- 8 days after spatial zone 6, searching spatial zones 1, 2, and 3, and then goes back to time 0, zone seven, etc. This search pattern, applied to the 8-day files, is the key to the fill algorithm. The first data found as "good" (after a complete search of the files and zones defined by the search key) are averaged and inserted into the gap.

zone   time zero   +/- 8days   +/- 16days   +/- 24days   +/- 32days   +/- 40days   +/- 48days   +/- 56days
  1        1           7           15           33           45           56           78           99
  2        2           8           16           34           46           58           80          100
  3        3           9           18           35           47           59           81          101
  4        4          11           19           36           48           60           82          103
  5        5          12           20           38           49           61           83          104
  6        6          13           22           39           50           62           84          105
  7       10          17           24           40           51           66           85          107
  8       14          23           27           44           57           69           89          113
  9       21          26           30           54           70           75           96          117
 10       25          29           37           63           74           86          109          124
 11       28          31           41           68           79           90          112          134
 12       32          42           52           73           92           97          118          141
 13       43          53           64           87          106          111          130          146
 14       55          65           71           94          114          115          138          161
 15       67          72           77          108          120          121          142          165
 16       76          88           93          116          137          136          149          170
 17       91          98          110          131          145          144          162          179
 18       95         102          119          140          159          150          171          187
 19      122         133          139          148          167          163          184          193
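To drive the search, a table like this can be inverted into a list of (zone, time-offset) pairs ordered by rank. The sketch below transcribes only the first two columns for brevity; the variable names are conveniences for this example.

```python
# (zone, time_offset_in_files) -> search rank, from the first two columns
# of the table (time offset 1 means the files +/- 8 days away).
rank = {
    (1, 0): 1, (2, 0): 2, (3, 0): 3, (4, 0): 4, (5, 0): 5, (6, 0): 6,
    (7, 0): 10, (1, 1): 7, (2, 1): 8, (3, 1): 9, (4, 1): 11, (5, 1): 12,
}

# Sorting the keys by rank recovers the order in which zones are visited.
search_order = sorted(rank, key=rank.get)
print(search_order)
# zones 1-6 at time 0, then zones 1-3 at +/- 8 days, then zone 7 at time 0, ...
```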

Application to data

In practice, I do not use more than +/- seven 8-day files surrounding the hdf file of interest to help determine a missing value. I started this a few years back when there were persistent gaps near the Arabian Sea that weren't being filled. In general, such an expansive search is not needed. In the simplest case of a big gap that has data on the 8-day files before and after, the pixel of interest simply takes a value very similar to a linear interpolation of the values before and after.
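That simplest case is easy to see numerically (the values here are made up): averaging the two temporal neighbors is the midpoint, which is what linear interpolation in time would give at the central 8-day slice.

```python
# Good values on the files 8 days before and after the gap (invented numbers).
before, after = 1.0, 3.0

# Averaging the two slices is the midpoint in time.
fill = (before + after) / 2.0
print(fill)   # -> 2.0, the same value linear interpolation gives at the center
```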

General comments

This approach gap-fills data in a fashion based on global statistics. As long as the gaps are not too long-lasting, the fill results give estimates that appear visually "reasonable". However, this method does not take into account changing correlation distances, either spatially or temporally. A full-blown kriging approach, applied with data both forward and backward in time and solved for each missing pixel at every unique location (for every 8-day time slice), would be the ideal case. At the moment, I can't do that, but if you come across an approach that generalizes to what has just been described, please do let me know. Until then, this is my best guess for the global solutions that we work with.



Last modified: 12 July 2012
by:  Robert O'Malley
