/* * =============================================================================== * Copyright (c) 2004-2006 ATI Technologies Inc. * =============================================================================== * * dxtc_v11_compress.c : A high-performance, reasonable quality DXTC compressor * * PREFACE: * * All DXTC compressors have a big issue in trading off performance versus image quality. A * very high quality image is typically obtained by very many stepwise refinements, * looking for the output that contains the lowest error metric. These type of compressors * suffer from two major performance issues: first, the need to process each block very many * times - in some implementations, the entire block computation is repeated on each iteration - * is inherently slow; and secondly, that the performance is not necessarily predictable (in * worst-case images, both poor quality and slow results will be obtained. * * * * TOWARDS A BETTER SOLUTION: * * This compressor is an attempt to improve performance by generating a block based on a (mostly) * mathematical method (and a pretty simple one at that). The worst-case performance bound of this * algorithm is not much more than a factor of 2 worse than the best case (given that it does not hit * inherent slowdown conditions such as a lot of denormal operands to the FPU, which is not a major * issue). * * Stepwise refinement is a part of this method, used in a controlled manner to solve a * particularly difficult analytical problem (that of the correct mapping of the endpoints to * best represent the line). By reducing the refinement to a 1D process this makes it more * amenable to rapid implementation (at the cost of a more limited ability to match the source). * * * * COMPRESSION ALGORITHM * * The DXTC compression problem - for compressors avoiding an exhaustive search - is largely * divided into two main issues * - Find the best axis for representing the pixels in the block (the compressed block's * pixels must obviously all lie along this axis) * - Map the points in the block onto this axis in the most efficient manner. * * Different compressors choose to attack the problem in different ways. In this compressor the * two problems are tackled largely totally separately. * * * Finding the axis is performed by a simplistic line-fitting method (using the average and the * absolute centred average to find two points and calling this the axis). This provides good * results in many circumstances, although it can end up producing poor results where the axis * ends up not representative of any of the individual pixels in the block. * * The pixels are then mapped onto the axis by a nearest-point-of-approach method. The critical * value stored is the distance along the axis. * * In order to correct for cases when the axis has been poorly mapped, the points are clustered * (as they would be for the final image) and the average axis mapping error for each cluster * is compared. If the error looks bad (typically, in these cases it is large and highly asymmetrical * between the two clusters), then the axis is recomputed using the average axis between * the two end clusters. This process could be further repeated but this appears reasonably * unnecessary - frankly, in the cases where refinement helps it seems unlikely that any very rapid * compressor can provide sufficient quality for the image to be definitively usable. In fact, * the 'guess' axis has proved sufficiently good in most cases that the limited form I implement * this refinement in hurts quality as often as not, and the refinement costs 10-20% in performance * so is not worthwhile. * * * Several methods were experimented with for determining the axis endpoints. The best general * initial guess was found to be to set the extreme points on the axis to be the resultant block * colours and represent those precisely. Although this is not the general 'best solution' it is * a very important best solution from a human eye point of view. * * There is then a stepwise refinement performed that significantly reduces noise by moving these * inital endpoints (inwards) and finding the value of minimum error. The compressor can experiment * with other modes of this movement, but this doesn't in general provide much higher image quality * and does significantly increase compression time. * * * Finally, the RGB565 colour values are computed and the final output is generated. Generating the * RGB involves a rounding process that must match up to the inverse unrounding process performed * in the decompressor. This also forces the block to be opaque (in the current implementation only * 4-colour blocks are generated). * * * * OUTPUT BLOCK QUALITY DETERMINATION * * In order to compute if the compression was sucessful, three error metrics are computed: * * 1. the axis mapping error produced when mapping from the arbitrary points to the selected * axis. This is representative of how poor the colour distortion inherent from the compression is. * Since the axis generation bug was fixed, this is actually usually small. Seeing a large axis * mapping error almost invariably means that the image is extremely hard to compress. * 2. the cluster mapping error, which is the error produced when mapping the points on the axis * into the final clusters. This frequently overreacts on highly noisy images. * 3. antialiased images fail to trip cluster error, but frequently come out 'lumpy'. This is * detected by analysing the clusters on a geometric basis; images that have many blocks with * clusters appearing in geometrically ordered patterns tend to be antialiased images. * * * * CONCLUSIONS * * This compressor rapidly produces results that vary from very good on most images highly amenable * to compression, to adequate/poor on images not suited to compression. It is never as good as an * extremely high quality offline decoder, and should not be considered a substitute for such (although * it may be superior to the average offline decoder). * */ // Further work: // Add transparent block support // Add special cases for 1, 2 and some 3 colour blocks (mostly for speed reasons) // Possibly test if 3 colour blocks would generate better results in some cases (done, not // apparent in most test images) #include #include #include #include "dxtc_v11_compress.h" static void DXTCDecompressBlock(DWORD block_32[16], DWORD block_dxtc[2]) { DWORD c0, c1, c2, c3; DWORD r0, g0, b0, r1, g1, b1; int i; c0 = block_dxtc[0] & 0xffff; c1 = block_dxtc[0]>>16; if (c0 > c1) { r0 = ((block_dxtc[0]&0xf800) >> 8); g0 = ((block_dxtc[0]&0x07e0) >> 3); b0 = ((block_dxtc[0]&0x001f) << 3); r1 = ((block_dxtc[0]&0xf8000000) >> 24); g1 = ((block_dxtc[0]&0x07e00000) >> 19); b1 = ((block_dxtc[0]&0x001f0000) >> 13); // Apply the lower bit replication to give full dynamic range r0 += (r0>>5); r1 += (r1>>5); g0 += (g0>>6); g1 += (g1>>6); b0 += (b0>>5); b1 += (b1>>5); c0 = (r0<<16) | (g0<<8) | b0; c1 = (r1<<16) | (g1<<8) | b1; c2 = (((2*r0+r1)/3)<<16) | (((2*g0+g1)/3)<<8) | (((2*b0+b1)/3)); c3 = (((2*r1+r0)/3)<<16) | (((2*g1+g0)/3)<<8) | (((2*b1+b0)/3)); for(i=0; i<16; i++) { switch((block_dxtc[1]>>(2*i)) & 3) { case 0: block_32[i] = c0; break; case 1: block_32[i] = c1; break; case 2: block_32[i] = c2; break; case 3: block_32[i] = c3; break; } } } else { // Dont support transparent decode, but have to handle the case when they're both the same { r0 = ((block_dxtc[0]&0xf800) >> 8); g0 = ((block_dxtc[0]&0x07e0) >> 3); b0 = ((block_dxtc[0]&0x001f) << 3); r1 = ((block_dxtc[0]&0xf8000000) >> 24); g1 = ((block_dxtc[0]&0x07e00000) >> 19); b1 = ((block_dxtc[0]&0x001f0000) >> 13); // Apply the lower bit replication to give full dynamic range r0 += (r0>>5); r1 += (r1>>5); g0 += (g0>>6); g1 += (g1>>6); b0 += (b0>>5); b1 += (b1>>5); c0 = (r0<<16) | (g0<<8) | b0; c1 = (r1<<16) | (g1<<8) | b1; c2 = (((r0+r1)/2)<<16) | (((g0+g1)/2)<<8) | (((b0+b1)/2)); for(i=0; i<16; i++) { switch((block_dxtc[1]>>(2*i)) & 3) { case 0: block_32[i] = c0; break; case 1: block_32[i] = c1; break; case 2: block_32[i] = c2; break; case 3: block_32[i] = 0xff00ff; break; } } } } } static void DXTCDecompressAlphaBlock(BYTE block_8[16], DWORD block_dxtc[2]) { BYTE v[8]; DWORD t; int i; v[0] = (BYTE) (block_dxtc[0] & 0xff); v[1] = (BYTE) ((block_dxtc[0]>>8) & 0xff); if (v[0] > v[1]) { // 8-colour block for(i=1; i<7; i++) { t = ((7-i)*v[0] + i*v[1] + 3) / 7; v[i+1] = (BYTE) t; } for(i=0; i<16; i++) { if (i > 5) t = (block_dxtc[1]>>((3*(i-6))+2) & 7); else if (i == 5) t = ((block_dxtc[1] & 3)<<1) + ((block_dxtc[0] >> 31)&1); else t = (block_dxtc[0]>>((3*i)+16) & 7); block_8[i] = v[t]; } } else if (v[0] == v[1]) { for(i=0; i<16; i++) block_8[i] = v[0]; } else { assert(0); } } //#define TRY_3_COLOR #define MARK_BLOCK { /*v_r = v_b = v_g = 0.0f;*/ average_r = 255.0f; /*average_g = 128.0f; average_b = 0.0f;*/ } //#define AVERAGE_UNIQUE_PIXELS_ONLY // Even with this the performance increase is minor //#define AXIS_RGB 1 //#define AXIS_YCbCr 1 // Pretty poor in most cases, probably shouldn't be used //#define AXIS_Y_ONLY 1 // Use for testing; generates greyscale output #define AXIS_MUNGE 1 // Raises priority of G at expense of B - seems slightly better than no munging (needs exhaustive testing) //#define AXIS_MUNGE2 1 // Raises priority of G further at expense of R and B - maybe slightly better again... #if AXIS_RGB #define CS_RED(r, g, b) (r) #define CS_GREEN(r, g, b) (g) #define CS_BLUE(r, g, b) (b) #define DCS_RED(r, g, b) (r) #define DCS_GREEN(r, g, b) (g) #define DCS_BLUE(r, g, b) (b) #elif AXIS_YCbCr /* * Y = 0.29900 * R + 0.58700 * G + 0.11400 * B * Cb = -0.16874 * R - 0.33126 * G + 0.50000 * B + CENTER * Cr = 0.50000 * R - 0.41869 * G - 0.08131 * B + CENTER * * R = Y + 1.40200 * Cr * G = Y - 0.34414 * Cb - 0.71414 * Cr * B = Y + 1.77200 * Cb */ #define CS_GREEN(r, g, b) (0.299f*(r) + 0.587f*(g) + 0.114f*(b)) #define CS_RED(r, g, b) (0.5f*(r) - 0.41869f*(g) - 0.08131f*(b) + 128.0f) #define CS_BLUE(r, g, b) (-0.16874f*(r) - 0.33126f*(g) + 0.5f*(b) + 128.0f) #define DCS_RED(r, g, b) ((g) + 1.402f*((r)-128.0f)) #define DCS_GREEN(r, g, b) ((g) - 0.34414f*((b) - 128.0f) - 0.71414f*((r)-128.0f)) #define DCS_BLUE(r, g, b) ((g) + 1.772f*((b) - 128.0f)) #elif AXIS_Y_ONLY #define CS_GREEN(r, g, b) (0.299f*(r) + 0.587f*(g) + 0.114f*(b)) #define CS_RED(r, g, b) (128.0f) #define CS_BLUE(r, g, b) (128.0f) #define DCS_RED(r, g, b) ((g) + 1.402f*((r)-128.0f)) #define DCS_GREEN(r, g, b) ((g) - 0.34414f*((b) - 128.0f) - 0.71414f*((r)-128.0f)) #define DCS_BLUE(r, g, b) ((g) + 1.772f*((b) - 128.0f)) #elif AXIS_MUNGE #define CS_RED(r, g, b) (r) #define CS_GREEN(r, g, b) (g) #define CS_BLUE(r, g, b) ((b+g)*0.5f) #define DCS_RED(r, g, b) (r) #define DCS_GREEN(r, g, b) (g) #define DCS_BLUE(r, g, b) ((2.0f*b)-g) #elif AXIS_MUNGE2 #define CS_RED(r, g, b) ((r+g)*0.5f) #define CS_GREEN(r, g, b) (g) #define CS_BLUE(r, g, b) ((b+3.0f*g)*0.25f) #define DCS_RED(r, g, b) ((2.0f*r)-g) #define DCS_GREEN(r, g, b) (g) #define DCS_BLUE(r, g, b) ((4.0f*b)-(3.0f*g)) #else #error No axis type defined #endif #define ROUND_AND_CLAMP(v, shift) \ {\ if (v < 0) v = 0;\ else if (v > 255) v = 255;\ else v += (0x80>>shift) - (v>>shift);\ } void DXTCV11CompressExplicitAlphaBlock(BYTE block_8[16], DWORD block_dxtc[2]) { int i; block_dxtc[0] = block_dxtc[1] = 0; for (i = 0; i<16; i++) { int v = block_8[i]; v = (v + 7 - (v >> 4)); v >>= 4; if (v<0) v = 0; if (v>0xf) v = 0xf; if (i<8) block_dxtc[0] |= v << (4 * i); else block_dxtc[1] |= v << (4 * (i - 8)); } } /************** #if !defined(_WIN64) && defined(_WIN32) // This compressor can only create opaque, 4-colour blocks void DXTCV11CompressBlock(DWORD block_32[16], DWORD block_dxtc[2]) { int i, k; float r, g, b; float uniques[16][3]; // The list of unique colours int unique_pixels; // The number of unique pixels float unique_recip; // Reciprocal of the above for fast multiplication int index_map[16]; // The map of source pixels to unique indices int pixel_count[16]; // The number of occurrences of each unique float average_r, average_g, average_b; // The centrepoint of the axis float v_r, v_g, v_b; // The axis float pos_on_axis[16]; // The distance each unique falls along the compression axis float dist_from_axis[16]; // The distance each unique falls from the compression axis float left=0, right=0, centre=0; // The extremities and centre (average of left/right) of uniques along the compression axis float axis_mapping_error=0; // The total computed error in mapping pixels to the axis int retry; // Retry count for axis mapping int swap; // Indicator if the RGB values need swapping to generate an opaque result int blocktype=0; // ------------------------------------------------------------------------------------- // Find the array of unique pixel values and sum them to find their average position // ------------------------------------------------------------------------------------- { // Find the array of unique pixel values and sum them to find their average position float *up = (float *)uniques; int current_pixel, firstdiff; current_pixel = unique_pixels = 0; average_r = average_g = average_b = 0; firstdiff = -1; for(i=0; i<16; i++) { for(k=0; k> 16) & 0xff); g = (float)((block_32[i] >> 8) & 0xff); b = (float)((block_32[i] >> 0) & 0xff); tr = CS_RED(r, g, b); tg = CS_GREEN(r, g, b); tb = CS_BLUE(r, g, b); *up++ = tr; *up++ = tg; *up++ = tb; if (k == i) { unique_pixels++; if ((i != 0) && (firstdiff < 0)) firstdiff = i; } average_r += tr; average_g += tg; average_b += tb; } #ifdef AVERAGE_UNIQUE_PIXELS_ONLY else { index_map[i] = index_map[k]; pixel_count[k]++; } #endif } #ifndef AVERAGE_UNIQUE_PIXELS_ONLY unique_pixels = 16; #endif // Compute average of the uniques unique_recip = 1.0f / (float) unique_pixels; average_r *= unique_recip; average_g *= unique_recip; average_b *= unique_recip; } // ------------------------------------------------------------------------------------- // For each component, reflect points about the average so all lie on the same side // of the average, and compute the new average - this gives a second point that defines the axis // To compute the sign of the axis sum the positive differences of G for each of R and B (the // G axis is always positive in this implementation // ------------------------------------------------------------------------------------- // An interesting situation occurs if the G axis contains no information, in which case the RB // axis is also compared. I am not entirely sure if this is the correct implementation - should // the priority axis be determined by magnitude? { float rg_pos, bg_pos, rb_pos; v_r = v_g = v_b = 0; rg_pos = bg_pos = rb_pos = 0; for(i=0; i 0) { rg_pos += g; rb_pos += b; } if (b > 0) bg_pos += g; } v_r *= unique_recip; v_g *= unique_recip; v_b *= unique_recip; if (rg_pos < 0) v_r = -v_r; if (bg_pos < 0) v_b = -v_b; if ((rg_pos == bg_pos) && (rg_pos == 0)) if (rb_pos < 0) v_b = -v_b; } #if 1 // Experimental code for infinite partition compression { // Decide which partition to use // We perform the partition based on a plane in the 3D colourspace. The plane is defined by // a point and a normal. // Finding this plane is, unsurprisingly for me, based off a hand-waving argument that seems to // solve all the simple cases and is usually bafflingly found to do OK on the difficult ones. // We shall say that the average of all the points is the point. // The normal is the old compression axis // This gives the direction of the normal vector but not the signs of each component. There // are 8 possibilities; for each we calculate the sum of the distances from the plane of all // considered points. There are two cases of interest, that for which the sum of distances is // maximum, and that for which the sum of distances is minimum. // These two possibilities define two different situations. If there are two groups of points, // widely spaced, that each compress reasonably we should pick the axis with the maximum // sum of distances. If all the points together are close to a single line, splitting by // this method will just generate two nearly identical lines. Instead, we should pick the axis // with the minimum sum of distances, which will create two close but not identical lines, // probably parallel in the colourspace. // Which of these cases is in use can be determined by the ratio of minimum to maximum. If the // minimum and maximum are 'close' then we are in a case where the maximum plane is the best used. If // the minimum is only a small fraction of the maximum then the minimum plane is best used. Intermediate // cases imply that neither is a particularly good representation of the block (this indicates // a possible compression metric for the new system). // It is not known what ratio will work for this, it will probably be determined by trial-and-error. // It is possible the ratio of choice will vary depending on the number of pixels in the // groups (for example, one highly different pixel compared to a reasonably uniform line should // likely be represented by maximum plane). // The minimum plane can be ignored initially. It is never erroneous to only use the // maximum plane, but using the minimum plane gives the ability to improve quality on those // blocks on which DXTC is already very good. (It seems likely that Andy's exhaustive search // algorithm is finding these cases already, and as such attempts to compare this algorithm with // exhaustive search will likely not do too well until the minimum case is also handled.) // Compute point - no need, we already have it (average_X) // Compute plane normal - no need, we already have it (v_X) // Test each of the possible 8 compression planes float maxsum, minsum, sum, dist, ratio; DWORD maxvector=0, minvector=0, vector=0; assert(unique_pixels == 16); // vector is confusing to interpret if this isn't true... maxsum = 0; minsum = 1000000000; for(i=0; i<8; i++) { float n_r, n_g, n_b; n_r = (i&1) ? v_r : -v_r; n_g = (i&2) ? v_g : -v_g; n_b = (i&4) ? v_b : -v_b; sum = 0; vector = 0; for(k=0; k maxsum) { maxsum = sum; maxvector = vector; } if (sum < minsum) { minsum = sum; minvector = vector; } } // Decide whether to split based on the max or the min ratio = minsum / maxsum; if (ratio > 0.20) vector = maxvector; else vector = minvector; // A refinement might be to check if there are any points which have ended up just on one side // of the partition when they would be better on the other. I think this is unlikely to be a // major issue } #endif // Note:warning C4702: unreachable code // Axis projection and remapping for(retry = 0; retry<2; retry++) { float v2_recip; // Normalise the axis for simplicity of future calculation v2_recip = (v_r*v_r + v_g*v_g + v_b*v_b); if (v2_recip > 0) v2_recip = 1.0f / (float)sqrt(v2_recip); else v2_recip = 1.0f; v_r *= v2_recip; v_g *= v2_recip; v_b *= v2_recip; // the line joining (and extended on either side of) average and axis // defines the axis onto which the points will be projected // Project all the points onto the axis, calculate the distance along // the axis from the centre of the axis (average) // From Foley & Van Dam: Closest point of approach of a line (P + v) to a point (R) is // P + ((R-P).v) / (v.v))v // The distance along v is therefore (R-P).v / (v.v) // (v.v) is 1 if v is a unit vector. // // Calculate the extremities at the same time - these need to be reasonably accurately // represented in all cases // // In this first calculation, also find the error of mapping the points to the axis - this // is our major indicator of whether or not the block has compressed well - if the points // map well onto the axis then most of the noise introduced is high-frequency noise left = 10000.0f; right = -10000.0f; axis_mapping_error = 0; for(i=0; i right) right = pos_on_axis[i]; } //#define REFINE_AXIS // Latest enhancements / bug fixes seem to have rendered this redundant #ifdef REFINE_AXIS if (retry == 0) { // Check if our axis is actually a reasonable one, by clustering the two end // point groups (of the four that would be present in the final map). If there // is a wildly differing axis distance average between the two clusters then we // haven't got a good axis float division; float cluster_r[4], cluster_g[4], cluster_b[4]; float cluster_error[4]; int cluster_points[4]; int cluster; float left_recip, right_recip; float error; int new_left, new_right; centre = (left+right)/2; if (metrics->r300_mode) division = right*5.0f/8.0f; else division = right*2.0f/3.0f; for(cluster=0; cluster<4; cluster++) { cluster_error[cluster] = 0; cluster_points[cluster] = 0; cluster_r[cluster] = 0; cluster_g[cluster] = 0; cluster_b[cluster] = 0; } for(i=0; i average) are 0 and 1, while // interpolants are 2 and 3 if (fabs(b) < division) cluster = 2; else cluster = 0; // Positive is in the latter half of the block if (b >= centre) cluster++; // Sum up the clusters cluster_error[cluster] += dist_from_axis[i]; cluster_points[cluster]++; cluster_r[cluster] += uniques[i][0]; cluster_g[cluster] += uniques[i][1]; cluster_b[cluster] += uniques[i][2]; } if (!cluster_points[0] || !cluster_points[1]) { // Oh yeah, this is a BAD axis, we failed to map // any points into at least one of the end clusters // Select the furthest in distance of the two remaining // clusters if (cluster_points[0]) new_left = 0; else if (cluster_points[2]) new_left = 2; else if (cluster_points[3]) new_left = 3; else new_left = 1; if (cluster_points[1]) new_right = 1; else if (cluster_points[3]) new_right = 3; else if (cluster_points[2]) new_right = 2; else new_right = 0; } else { new_left = 0; new_right = 1; } // If we completely failed to find a new axis we can't use this // method - in which case we should find something else to do of // course... if (new_left != new_right) { // Check the error bound between these two clusters left_recip = (1.0f / (float)cluster_points[new_left]); right_recip = (1.0f / (float)cluster_points[new_right]); cluster_error[new_left] *= left_recip; cluster_error[new_right] *= right_recip; // Now need a metric for if these errors are 'reasonable' // Say the square of the difference? error = (float) (fabs(cluster_error[new_left]) - fabs(cluster_error[new_right])); error *= error; if (error > 50000.0f) // Trial-and-error value... { // It's not good - find a new axis cluster_r[new_right] *= right_recip; cluster_g[new_right] *= right_recip; cluster_b[new_right] *= right_recip; cluster_r[new_left] *= left_recip; cluster_g[new_left] *= left_recip; cluster_b[new_left] *= left_recip; average_r = cluster_r[new_left]; average_g = cluster_g[new_left]; average_b = cluster_b[new_left]; v_r = cluster_r[new_right] - cluster_r[new_left]; v_g = cluster_g[new_right] - cluster_g[new_left]; v_b = cluster_b[new_right] - cluster_b[new_left]; } else // It's not bad, so keep the first guess break; } } #else break; #endif } // ------------------------------------------------------------------------------------- // Now we have a good axis and the basic information about how the points are mapped // to it // Our initial guess is to represent the endpoints accurately, by moving the average // to the centre and recalculating the point positions along the line // ------------------------------------------------------------------------------------- { centre = (left + right) / 2; average_r += centre*v_r; average_g += centre*v_g; average_b += centre*v_b; for(i=0; i p[i+1]) { pt[i] = p[i+1]; rt[i] = ref[i+1]; pt[i+1] = p[i]; rt[i+1] = ref[i]; } else { pt[i] = p[i]; rt[i] = ref[i]; pt[i+1] = p[i+1]; rt[i+1] = ref[i+1]; } } // Sort/interleave pairs to quads for(i=0; i<16; i+=4) { a = 0; b = 2; for(j=0; j<4; j++) { if ((b >= 4) || ((a < 2) && (pt[i+a] < pt[i+b]))) { p[i+j] = pt[i+a]; ref[i+j] = rt[i+(a++)]; } else { p[i+j] = pt[i+b]; ref[i+j] = rt[i+(b++)]; } } } // Sort/interleave quads to 8s for(i=0; i<16; i+=8) { a = 0; b = 4; for(j=0; j<8; j++) { if ((b >= 8) || ((a < 4) && (p[i+a] < p[i+b]))) { pt[i+j] = p[i+a]; rt[i+j] = ref[i+(a++)]; } else { pt[i+j] = p[i+b]; rt[i+j] = ref[i+(b++)]; } } } // Sort/interleave 8s to 16s a = 0; b = 8; for(j=0; j<16; j++) { if ((b >= 16) || ((a < 8) && (pt[i+a] < pt[i+b]))) { p[j] = pt[a]; ref[j] = rt[(a++)]; } else { p[j] = pt[b]; ref[j] = rt[(b++)]; } } } // Sorting done. After that, the algorithm is much easier #define epsilon (0.01f) for(i=0; i<16; i++) { // If this cannot be clamped to a valid configuration, dump it straight away. // (Actually not sure if this is necessary. The maxima/minima rules may automatically // eliminate these cases) if ((i != 0) && (i != 16) && (p[i] == p[i-1])) continue; G = average_p * min_error_point_table[i]; // Clamp to interval split_point = G*(2.0f/3.0f); if ((i != 0) && (p[i-1] > split_point)) G = (3.0f/2.0f)*(p[i-1]+epsilon); else if ((i != 16) && (p[i] < split_point)) G = (3.0f/2.0f)*(p[i]-epsilon); // Calculate the actual error e = average_p_squared - 2*t1_table[i]*average_p*G + t2_table[i]*G*G; if (e < -0.1f) __asm int 3; if (e < min_e) { min_G = G; min_e = e; min_i = i; } } left = -min_G; right = min_G; } #endif #define PROGRESSIVE_REFINEMENT #ifdef PROGRESSIVE_REFINEMENT { // Attempt a (simple) progressive refinement step to reduce noise in the // output image by trying to find a better overall match for the endpoints // than the first-guess solution found so far (which is just to take the ends. // The method is to move the endpoints inwards until a local minima is found. // This provides quite a significant improvement in image quality. // In fact there are six degrees of freedom that could be tested here - inc/dec // left, inc/dec right, move both in, move both out). Move both out will never // need to be tested as we already start at this worst case (both as far out as // possible). Similarly, if every 'either in, both in' case is tested individually // then that also eliminates all 'move out' cases where a single step is all that // is needed // Checking the additional modes beyond the first does not provide much enhancement // Changing the stepsize to be additive instead of multiplicative makes visual image // quality very marginally worse, for about the same performance. float error4, maxerror, v[5]; float stepsize = 0.95f; // Compromise value, seems to produce good results in most cases while only being a small speed hit DWORD bit4; float division4; float oldleft, oldright; int mode, bestmode; int first; #ifdef TRY_3_COLOR float division3, error3; DWORD bit3; #endif oldleft = left; oldright = right; maxerror = 10000000.0f; first = 1; do { for(bestmode=-1,mode=0; mode<1; mode++) // In my experience modes other than mode 0 hardly ever get a hit, and it costs nearly 30% performance to test the other two { if (!first) { switch(mode) { case 0: left = oldleft*stepsize; right = oldright*stepsize; break; case 1: right = oldright*stepsize; break; case 2: left = oldleft*stepsize; break; } if (left > right) break; } centre = (left+right)/2; division4 = (right-centre)*2.0f/3.0f; v[0] = left; v[2] = centre-((right-centre)/3.0f); v[3] = centre+((right-centre)/3.0f); v[1] = right; error4 = 0; #ifdef TRY_3_COLOR division3 = (right-centre)/3; error3 = 0; v[4] = centre; #endif for(i=0; i average) are 0 and 1, while // interpolants are 2 and 3 if (fabs(b) >= division4) bit4 = 0; else bit4 = 2; // Positive is in the latter half of the block if (b >= centre) bit4 += 1; #ifdef TRY_3_COLOR if (fabs(b) < division3) bit3 = 5; else if (b >= centre) bit3 = 1; else bit3 = 0; #endif #elif 0 { float d; d = ((float)fabs(b) - division4); bit4 = (*(DWORD *)&d>>30) & 2; d = (b - centre); bit4 |= (*(DWORD *)&d>>31)^1; } #else // Almost exactly the same as the above in performance __asm { mov eax, [b] mov ecx, eax and eax, 0x7fffffff shr ecx, 1 sub eax, [division4] and eax, 0x80000000 and ecx, 0x40000000 add eax, ecx shr eax, 30 xor eax, 1 mov [bit4], eax } #endif r = (b - v[bit4]); error4 += r*r; #ifdef TRY_3_COLOR r = (b - v[bit3]); error3 += r*r; #endif } if (error4 < maxerror) { maxerror = error4; bestmode = mode; blocktype = 4; } #ifdef TRY_3_COLOR if (error3 < maxerror) { maxerror = error3; bestmode = mode; blocktype = 3; } #endif if (bestmode != -1); maxerror -= 5.0f; // Errors smaller than a certain epsilon should be ignored - this prevents unnecessary and infinite loops if (first) break; } if (!first) switch(bestmode) { default: bestmode = -1; break; case -1: break; case 0: oldleft *= stepsize; oldright *= stepsize; break; case 1: oldright *= stepsize; break; case 2: oldleft *= stepsize; break; } first = 0; } while(bestmode != -1); left = oldleft; right = oldright; } #else blocktype = 4; #endif // ------------------------------------------------------------------------------------- // Calculate the high and low output colour values // Involved in this is a rounding procedure which is undoubtedly slightly twitchy. A // straight rounded average is not correct, as the decompressor 'unrounds' by replicating // the top bits to the bottom. // In order to take account of this process, we don't just apply a straight rounding correction, // but base our rounding on the input value (a straight rounding is actually pretty good in terms of // error measure, but creates a visual colour and/or brightness shift relative to the original image) // The method used here is to apply a centre-biased rounding dependent on the input value, which was // (mostly by experiment) found to give minimum MSE while preserving the visual characteristics of // the image. // rgb = (average_rgb + (left|right)*v_rgb); // ------------------------------------------------------------------------------------- { DWORD c0, c1, t; int rd, gd, bd; r = (average_r + left*v_r); g = (average_g + left*v_g); b = (average_b + left*v_b); rd = (int) DCS_RED(r, g, b); gd = (int) DCS_GREEN(r, g, b); bd = (int) DCS_BLUE(r, g, b); ROUND_AND_CLAMP(rd, 5); ROUND_AND_CLAMP(gd, 6); ROUND_AND_CLAMP(bd, 5); c0 = ((rd&0xf8)<<8) + ((gd&0xfc)<<3) + ((bd&0xf8)>>3); r = (average_r + right*v_r); g = (average_g + right*v_g); b = (average_b + right*v_b); rd = (int) DCS_RED(r, g, b); gd = (int) DCS_GREEN(r, g, b); bd = (int) DCS_BLUE(r, g, b); ROUND_AND_CLAMP(rd, 5); ROUND_AND_CLAMP(gd, 6); ROUND_AND_CLAMP(bd, 5); c1 = (((rd&0xf8)<<8) + ((gd&0xfc)<<3) + ((bd&0xf8)>>3)); // Force to be a 4-colour opaque block - in which case, c0 is greater than c1 if (blocktype == 4) { if (c0 < c1) { t = c0; c0 = c1; c1 = t; swap = 1; } else if (c0 == c1) { // This block will always be encoded in 3-colour mode // Need to ensure that only one of the two points gets used, // avoiding accidentally setting some transparent pixels into the block for(i=0; i average) are 0 and 1, while // interpolants are 2 and 3 if (fabs(b) >= division) bit = 0; else bit = 2; // Positive is in the latter half of the block if (b >= centre) bit += 1; // Set the output, taking swapping into account block_dxtc[1] |= (bit^swap)<<(2*i); // Average the X and Y locations for each cluster cluster_x[bit] += (float)(i&3); cluster_y[bit] += (float)(i>>2); cluster_count[bit]++; } for(i=0; i<4; i++) { float cr; if (cluster_count[i]) { cr = 1.0f / cluster_count[i]; cluster_x[i] *= cr; cluster_y[i] *= cr; } else { cluster_x[i] = cluster_y[i] = -1; } } // SEVERAL METHODS OF ANTIALIAS BLOCK DETECTION INCLUDED HERE // Mostly rejected. // patterns in axis position detection // (same algorithm as used in the SSE version) if ((block_dxtc[0]&0xffff) != (block_dxtc[0]>>16)) { DWORD i1, k1; DWORD x=0, y=0; int xstep=0, ystep=0; // Find a corner to search from #define POS(x,y) (pos_on_axis[(x)+(y)*4]) for(k1=0; k1<4; k1++) { switch(k1) { case 0: x = 0; y = 0; xstep = 1; ystep = 1; break; case 1: x = 0; y = 3; xstep = 1; ystep = -1; break; case 2: x = 3; y = 0; xstep = -1; ystep = 1; break; case 3: x = 3; y = 3; xstep = -1; ystep = -1; break; } for(i1=0; i1<4; i1++) { if ((POS(x , y+ystep*i1) < POS(x+ xstep, y+ystep*i1)) || (POS(x+ xstep , y+ystep*i1) < POS(x+2*xstep, y+ystep*i1)) || (POS(x+2*xstep , y+ystep*i1) < POS(x+3*xstep, y+ystep*i1)) ) break; if ((POS(x+xstep*i1, y) < POS(x+xstep*i1, y+ ystep)) || (POS(x+xstep*i1, y+ystep) < POS(x+xstep*i1, y+2*ystep)) || (POS(x+xstep*i1, y+2*ystep) < POS(x+xstep*i1, y+3*ystep)) ) break; } if (i1 == 4) break; } } } else // if (blocktype == 4) { block_dxtc[1] = 0; division = right/3; centre = (left+right)/2; // Actually, this code only works if centre is 0 or approximately so for(i=0; i<16; i++) { b = pos_on_axis[index_map[i]]; // Endpoints (indicated by block > average) are 0 and 1, while // interpolants are 2 if (fabs(b) < division) bit = 2; else if (b >= centre) bit = 1^swap; else bit = swap; // Set the output, taking swapping into account block_dxtc[1] |= bit<<(2*i); } } } // done } // This compressor can only create opaque, 4-colour blocks void DXTCV11CompressBlockMinimal(DWORD block_32[16], DWORD block_dxtc[2]) { int i, k; float r, g, b; float uniques[16][3]; // The list of unique colours int unique_pixels; // The number of unique pixels float unique_recip; // Reciprocal of the above for fast multiplication int index_map[16]; // The map of source pixels to unique indices int pixel_count[16]; // The number of occurrences of each unique float average_r, average_g, average_b; // The centrepoint of the axis float v_r, v_g, v_b; // The axis float pos_on_axis[16]; // The distance each unique falls along the compression axis float dist_from_axis[16]; // The distance each unique falls from the compression axis float left = 0, right = 0, centre = 0; // The extremities and centre (average of left/right) of uniques along the compression axis float axis_mapping_error = 0; // The total computed error in mapping pixels to the axis int swap; // Indicator if the RGB values need swapping to generate an opaque result // ------------------------------------------------------------------------------------- // (3) Find the array of unique pixel values and sum them to find their average position // ------------------------------------------------------------------------------------- { // Find the array of unique pixel values and sum them to find their average position float *up = (float *)uniques; int current_pixel, firstdiff; current_pixel = unique_pixels = 0; average_r = average_g = average_b = 0; firstdiff = -1; for (i = 0; i<16; i++) { for (k = 0; k> 16) & 0xff); g = (float)((block_32[i] >> 8) & 0xff); b = (float)((block_32[i] >> 0) & 0xff); tr = CS_RED(r, g, b); tg = CS_GREEN(r, g, b); tb = CS_BLUE(r, g, b); *up++ = tr; *up++ = tg; *up++ = tb; if (k == i) { unique_pixels++; if ((i != 0) && (firstdiff < 0)) firstdiff = i; } average_r += tr; average_g += tg; average_b += tb; } #ifdef AVERAGE_UNIQUE_PIXELS_ONLY else { index_map[i] = index_map[k]; pixel_count[k]++; } #endif } #ifndef AVERAGE_UNIQUE_PIXELS_ONLY unique_pixels = 16; #endif // Compute average of the uniques unique_recip = 1.0f / (float)unique_pixels; average_r *= unique_recip; average_g *= unique_recip; average_b *= unique_recip; } // ------------------------------------------------------------------------------------- // (4) For each component, reflect points about the average so all lie on the same side // of the average, and compute the new average - this gives a second point that defines the axis // To compute the sign of the axis sum the positive differences of G for each of R and B (the // G axis is always positive in this implementation // ------------------------------------------------------------------------------------- // An interesting situation occurs if the G axis contains no information, in which case the RB // axis is also compared. I am not entirely sure if this is the correct implementation - should // the priority axis be determined by magnitude? { float rg_pos, bg_pos, rb_pos; v_r = v_g = v_b = 0; rg_pos = bg_pos = rb_pos = 0; for (i = 0; i 0) { rg_pos += g; rb_pos += b; } if (b > 0) bg_pos += g; } v_r *= unique_recip; v_g *= unique_recip; v_b *= unique_recip; if (rg_pos < 0) v_r = -v_r; if (bg_pos < 0) v_b = -v_b; if ((rg_pos == bg_pos) && (rg_pos == 0)) if (rb_pos < 0) v_b = -v_b; } // ------------------------------------------------------------------------------------- // (5) Axis projection and remapping // ------------------------------------------------------------------------------------- { float v2_recip; // Normalise the axis for simplicity of future calculation v2_recip = (v_r*v_r + v_g*v_g + v_b*v_b); if (v2_recip > 0) v2_recip = 1.0f / (float)sqrt(v2_recip); else v2_recip = 1.0f; v_r *= v2_recip; v_g *= v2_recip; v_b *= v2_recip; } // ------------------------------------------------------------------------------------- // (6) Map the axis // ------------------------------------------------------------------------------------- // the line joining (and extended on either side of) average and axis // defines the axis onto which the points will be projected // Project all the points onto the axis, calculate the distance along // the axis from the centre of the axis (average) // From Foley & Van Dam: Closest point of approach of a line (P + v) to a point (R) is // P + ((R-P).v) / (v.v))v // The distance along v is therefore (R-P).v / (v.v) // (v.v) is 1 if v is a unit vector. // // Calculate the extremities at the same time - these need to be reasonably accurately // represented in all cases // // In this first calculation, also find the error of mapping the points to the axis - this // is our major indicator of whether or not the block has compressed well - if the points // map well onto the axis then most of the noise introduced is high-frequency noise { left = 10000.0f; right = -10000.0f; axis_mapping_error = 0; for (i = 0; i < unique_pixels; i++) { // Compute the distance along the axis of the point of closest approach pos_on_axis[i] = ((uniques[i][0] - average_r) * v_r) + ((uniques[i][1] - average_g) * v_g) + ((uniques[i][2] - average_b) * v_b); // Compute the actual point and thence the mapping error r = uniques[i][0] - (average_r + pos_on_axis[i] * v_r); g = uniques[i][1] - (average_g + pos_on_axis[i] * v_g); b = uniques[i][2] - (average_b + pos_on_axis[i] * v_b); dist_from_axis[i] = r*r + g*g + b*b; axis_mapping_error += dist_from_axis[i]; // Work out the extremities if (pos_on_axis[i] < left) left = pos_on_axis[i]; if (pos_on_axis[i] > right) right = pos_on_axis[i]; } } // ------------------------------------------------------------------------------------- // (7) Now we have a good axis and the basic information about how the points are mapped // to it // Our initial guess is to represent the endpoints accurately, by moving the average // to the centre and recalculating the point positions along the line // ------------------------------------------------------------------------------------- { centre = (left + right) / 2; average_r += centre*v_r; average_g += centre*v_g; average_b += centre*v_b; for (i = 0; i> 3); r = (average_r + right*v_r); g = (average_g + right*v_g); b = (average_b + right*v_b); rd = (int)DCS_RED(r, g, b); gd = (int)DCS_GREEN(r, g, b); bd = (int)DCS_BLUE(r, g, b); ROUND_AND_CLAMP(rd, 5); ROUND_AND_CLAMP(gd, 6); ROUND_AND_CLAMP(bd, 5); c1 = (((rd & 0xf8) << 8) + ((gd & 0xfc) << 3) + ((bd & 0xf8) >> 3)); // Force to be a 4-colour opaque block - in which case, c0 is greater than c1 // blocktype == 4 { if (c0 < c1) { t = c0; c0 = c1; c1 = t; swap = 1; } else if (c0 == c1) { // This block will always be encoded in 3-colour mode // Need to ensure that only one of the two points gets used, // avoiding accidentally setting some transparent pixels into the block for (i = 0; i average) are 0 and 1, while // interpolants are 2 and 3 if (fabs(b) >= division) bit = 0; else bit = 2; // Positive is in the latter half of the block if (b >= centre) bit += 1; // Set the output, taking swapping into account block_dxtc[1] |= (bit^swap) << (2 * i); // Average the X and Y locations for each cluster cluster_x[bit] += (float)(i & 3); cluster_y[bit] += (float)(i >> 2); cluster_count[bit]++; } for (i = 0; i<4; i++) { float cr; if (cluster_count[i]) { cr = 1.0f / cluster_count[i]; cluster_x[i] *= cr; cluster_y[i] *= cr; } else { cluster_x[i] = cluster_y[i] = -1; } } // patterns in axis position detection // (same algorithm as used in the SSE version) if ((block_dxtc[0] & 0xffff) != (block_dxtc[0] >> 16)) { DWORD i1, k1; DWORD x = 0, y = 0; int xstep = 0, ystep = 0; // Find a corner to search from #define POS(x,y) (pos_on_axis[(x)+(y)*4]) for (k1 = 0; k1<4; k1++) { switch (k1) { case 0: x = 0; y = 0; xstep = 1; ystep = 1; break; case 1: x = 0; y = 3; xstep = 1; ystep = -1; break; case 2: x = 3; y = 0; xstep = -1; ystep = 1; break; case 3: x = 3; y = 3; xstep = -1; ystep = -1; break; } for (i1 = 0; i1<4; i1++) { if ((POS(x, y + ystep*i1) < POS(x + xstep, y + ystep*i1)) || (POS(x + xstep, y + ystep*i1) < POS(x + 2 * xstep, y + ystep*i1)) || (POS(x + 2 * xstep, y + ystep*i1) < POS(x + 3 * xstep, y + ystep*i1)) ) break; if ((POS(x + xstep*i1, y) < POS(x + xstep*i1, y + ystep)) || (POS(x + xstep*i1, y + ystep) < POS(x + xstep*i1, y + 2 * ystep)) || (POS(x + xstep*i1, y + 2 * ystep) < POS(x + xstep*i1, y + 3 * ystep)) ) break; } if (i1 == 4) break; } } } } // done } #endif ***************/ #ifndef _WIN64 void DXTCV11CompressAlphaBlock(BYTE block_8[16], DWORD block_dxtc[2]) { int i; float pos[16]; // The list of colours int blocktype; float b; DWORD n, bit; float step, rstep, offset; int count_0, count_255; float average_inc, average_ex; float max_ex, min_ex; max_ex = 0; min_ex = 255; average_inc = average_ex = 0; count_0 = count_255 = 0; for(i=0; i<16; i++) { int ex = 0; if (block_8[i] == 0) { count_0++; ex = 1; } else if (block_8[i] == 255) { count_255++; ex = 1; } pos[i] = (float)block_8[i]; average_inc += pos[i]; if (!ex) { average_ex += pos[i]; if (pos[i] > max_ex) max_ex = pos[i]; if (pos[i] < min_ex) min_ex = pos[i]; } } // Make assumptions if (!count_0 && !count_255) blocktype = 8; else { // There are 0 or 255 blocks and we need to represent them blocktype = 8; // Actually should probably try both here if (count_0) min_ex = 0; if (count_255) max_ex = 255; } // Start out assuming our endpoints are the min and max values we've determined // If the minimum is 0, it must stay 0, otherwise we shall move inwards cf. the colour compressor // Progressive refinement makes very little difference averaged across a whole image, but in certain // tricky areas can be noticeably better. //#define ALPHA_PROGRESSIVE_REFINEMENT #ifdef ALPHA_PROGRESSIVE_REFINEMENT { // Attempt a (simple) progressive refinement step to reduce noise in the // output image by trying to find a better overall match for the endpoints // than the first-guess solution found so far (which is just to take the ends. float error, maxerror; float oldmin, oldmax; int mode, bestmode; int first; float r, v; maxerror = 10000000.0f; oldmin = min_ex; oldmax = max_ex; first = 1; do { for(bestmode=-1,mode=0; mode<6; mode++) // Other modes seem more important for alpha block compression, 4-6 seem broken atm { if (!first) { switch(mode) { case 0: min_ex = oldmin+1.0f; max_ex = oldmax-1.0f; break; case 1: max_ex = oldmax-1.0f; break; case 2: min_ex = oldmin+1.0f; break; case 3: min_ex = oldmin-1.0f; max_ex = oldmax+1.0f; break; case 4: max_ex = oldmax+1.0f; break; case 5: min_ex = oldmin-1.0f; break; } if ((min_ex + 1.0f) > max_ex) break; if ((min_ex < 0.0f) || (max_ex > 255.0f)) continue; } error = 0; step = (max_ex - min_ex) / (float) (blocktype-1); rstep = 1.0f / step; offset = min_ex - (step*0.5f); for(i=0; i<16; i++) { b = pos[i]; if ((blocktype == 6) && ((b == 0) || (b == 255))) continue; // Work out which value in the block this selects n = (int)((b - offset) * rstep); if (n < 0) n = 0; if (n > 7) n = 7; // Compute the interpolated value v = ((float)n * step) + offset; // And accumulate the error r = (b - v); error += r*r; } if (error < maxerror) { maxerror = error; bestmode = mode; } if (first) break; } if (!first) switch(bestmode) { default: bestmode = -1; break; case -1: break; case 0: oldmin += 1.0f; oldmax -= 1.0f; break; case 1: oldmax -= 1.0f; break; case 2: oldmin += 1.0f; break; case 3: oldmin -= 1.0f; oldmax += 1.0f; break; case 4: oldmax += 1.0f; break; case 5: oldmin -= 1.0f; break; } first = 0; } while(bestmode != -1); min_ex = oldmin; max_ex = oldmax; metrics->sum_cluster_errors += maxerror; if (maxerror > 2000.0f) metrics->high_cluster_error_blocks++; } #endif // ALPHA_PROGRESSIVE_REFINEMENT // Generating the rounded values is slightly arcane. if (blocktype == 6) block_dxtc[0] = ((int)(min_ex + 0.5f)) | (((int)(max_ex + 0.5f))<<8); else block_dxtc[0] = ((int)(max_ex + 0.5f)) | (((int)(min_ex + 0.5f))<<8); step = (max_ex - min_ex) / (float) (blocktype-1); rstep = 1.0f / step; offset = min_ex - (step*0.5f); block_dxtc[1] = 0; for(i=0; i<16; i++) { b = pos[i]; if (blocktype == 6) { if ((b == 0.0f) || (b == 255.0f)) { if (b == 0) bit = 6; else bit = 7; } else { // Work out which value in the block this selects n = (int)((b - offset) * rstep); if (n <= 0) bit = 0; else if (n >= 5) bit = 1; else bit = n+1; } } else { // Blocktype 8 // Work out which value in the block this selects n = (int)((b - offset) * rstep); if (n <= 0) bit = 1; else if (n >= 7) bit = 0; else bit = 8 - n; } if (i == 5) { block_dxtc[1] |= (bit>>1); block_dxtc[0] |= (bit&1)<<31; } else if (i < 5) block_dxtc[0] |= bit << (3*i+16); else block_dxtc[1] |= bit << (3*(i-6)+2); } // done } #endif // !_WIN64