Fastest way to find difference of very large PHP arrays

You have 2 arrays and you want to find the differences. How do you do that?

Using array_diff. Obviously using array_diff. But array_diff works well on small arrays. How about large arrays. Say arrays with elements more than 2 millions! If you ever have to find array difference in php for large array you’d find array_diff is painfully slow. Today I had to do the same thing.

I had 2 arrays of integers. In fact they are mobile numbers. I needed to find the difference of these 2 lists. Each list contained almost 2.5 millions of element. When I invoke the array_dif solution it took more than 20 minutes. And then I had to press Ctrl+C. As I am working with large arrays, I was equipped with huge memory. I started php with 4 gigabytes of memory. So the following option.

[code language=”bash”]
php -d memory_limit=4G
[/code]

So I thought to improve it a little. I flip both arrays and find differneces using array_diff_key. It took about 10 seconds. What a surprise.  Why it took so little time? Because flipping an array made its values as keys. So earlier when I was searching for values, it becomes a search for keys. And keys are hash now. Searching hash is not a big deal. So the speed boost. Here is the code.

[code language=”php”]
function flip_array_diff_key($b, $a) {
$at = array_flip($a);
$bt = array_flip($b);
$d = array_diff_key($bt, $at);
return array_keys($d);
}
[/code]

Its a small function. Only problem is I had to call array_* functions too many times.  So I modified it a little. I flipped both arrays as I had to use array_diff_keys function. If I hadn’t used array_diff_keys  may be I’d do this,

[code language=”php”]
function flip_isset_diff($b, $a) {
$at = array_flip($a);
$d = array();
foreach ($b as $i)
if (!isset($at[$i]))
$d[] = $i;
return $d;
}
[/code]

See, Its just one call to array_flip.

And here is a complete custom way to achieve this.

[code language=”php”]
function large_array_diff($b, $a) {
$at = array();
foreach ($a as $i)
$at[$i] = 1;

$d = array();

foreach ($b as $i)
if (!isset($at[$i]))
$d[] = $i;

return $d;
}
[/code]

I create a benchmark script to compare all these methods. When I run it the result was a surprise.

----------------------------------------------------
Name                            Execution Time (sec)
----------------------------------------------------
----------------------------------------------------
Complete Custom                 7.122
----------------------------------------------------
Flip one and isset              5.450
----------------------------------------------------
Flip both and array_diff_key    9.915
----------------------------------------------------

I thought using array_* would be the fastest as I was using all PHPs native extensions. But no Its the slowest. The solution with one single call to array_filp seems the fastest. And surprisingly the complete custom solution is in the middle while It seems it’d be slowest.

In the conclusion I’d say, try to use as much as language construct as possible. But when It comes to a time consuming function use a native extension.