摘要
Data chunking algorithms divide data into several small data chunks in a certain way, thus transforming the operation of data into the one of multiple small data chunks. Data chunking algorithms have been widely used in duplicate data detection, parallel computing and other fields, but it is seldom used in data incremental synchronization. Aiming at the characteristics of incremental data synchronization, this paper proposes a novel data chunking algorithm. By dividing two data that need synchronization into small data chunks, comparing the contents of these small data chunks, different ones are the incremental data that need to be found. The new algorithm determines to set a cut-point based on the number of 1 contained in the binary format of all bytes in an interval. Thus it improves the resistance against the byte shifting problem at the expense of the chunk size stability, which makes it more suitable for the incremental data synchronization. Comparing this algorithm with several known classical or state of art algorithms, experiments show that the incremental data found by this algorithm can be reduced by 32 & x0025;& x007E;57 & x0025; compared to the others with same changes between two data. The experimental results based on real-world datasets show that PCI improves the calculation speed of classic Rsync algorithm up to 70 & x0025;, however, with a drawback of increasing the Transmission compression rate up to 11.8 & x0025;.