Fastest way to find difference of very large PHP arrays

You have 2 arrays and you want to find the differences. How do you do that?

Using array_diff. Obviously using array_diff. But array_diff works well on small arrays. How about large arrays. Say arrays with elements more than 2 millions! If you ever have to find array difference in php for large array you’d find array_diff is painfully slow. Today I had to do the same thing.

I had 2 arrays of integers. In fact they are mobile numbers. I needed to find the difference of these 2 lists. Each list contained almost 2.5 millions of element. When I invoke the array_dif solution it took more than 20 minutes. And then I had to press Ctrl+C. As I am working with large arrays, I was equipped with huge memory. I started php with 4 gigabytes of memory. So the following option.

 php -d memory_limit=4G

So I thought to improve it a little. I flip both arrays and find differneces using array_diff_key. It took about 10 seconds. What a surprise.  Why it took so little time? Because flipping an array made its values as keys. So earlier when I was searching for values, it becomes a search for keys. And keys are hash now. Searching hash is not a big deal. So the speed boost. Here is the code.

function flip_array_diff_key($b, $a) {
$at = array_flip($a);
$bt = array_flip($b);
$d = array_diff_key($bt, $at);
return array_keys($d);
}

Its a small function. Only problem is I had to call array_* functions too many times.  So I modified it a little. I flipped both arrays as I had to use array_diff_keys function. If I hadn’t used array_diff_keys  may be I’d do this,

function flip_isset_diff($b, $a) {
$at = array_flip($a);
$d = array();
foreach ($b as $i)
if (!isset($at[$i]))
$d[] = $i;
return $d;
}

See, Its just one call to array_flip.

And here is a complete custom way to achieve this.

function large_array_diff($b, $a) {
$at = array();
foreach ($a as $i)
$at[$i] = 1;

$d = array();

foreach ($b as $i)
if (!isset($at[$i]))
$d[] = $i;

return $d;
}

I create a benchmark script to compare all these methods. When I run it the result was a surprise.

----------------------------------------------------
Name                            Execution Time (sec)
----------------------------------------------------
----------------------------------------------------
Complete Custom                 7.122
----------------------------------------------------
Flip one and isset              5.450
----------------------------------------------------
Flip both and array_diff_key    9.915
----------------------------------------------------

I thought using array_* would be the fastest as I was using all PHPs native extensions. But no Its the slowest. The solution with one single call to array_filp seems the fastest. And surprisingly the complete custom solution is in the middle while It seems it’d be slowest.

In the conclusion I’d say, try to use as much as language construct as possible. But when It comes to a time consuming function use a native extension.

Iterate through each valid calendar days

If you develop software like me, you may face a scenario where you need to iterate through calendar days.  These are specially needed in event management application or any other application that has some sorts of daily tracker. The main point is you want to iterate through dates.

For example, If the current date is “February 28th 2001″, next date will be  “March 1st 2001″. Obviously not “February 29th 2001″. If the current date is “February 28th 2004″, next date will be “February 29th 2004″, Not “March 1st 2004″.  Yes, that’s quite tricky.  So how do you achieve it in programming?

The solution is to use time function families found in standard C header file time.h. The main idea is to first create the time you want to show by mktime(). This is the initial time. To advance to the next day/hour/min/sec/month just convert the resultant value of mktime() to struct tm structure by localtime() or gmtime(). Then add 1 day/hour/min/sec/month to struct tm structure. Use this structure to create time again by mktime().

The above strategy works because of the following lines from manual.

     In addition to computing the calendar time, mktime() normal-
     izes  the  supplied tm structure. The original values of the
     tm_wday and tm_yday components of the structure are ignored,
     and the original values of the other components are not res-
     tricted to the ranges indicated in  the  definition  of  the
     structure

Here is a sample code that iterates through the first 200 dates from epoch.


#include <time.h>;
#include <stdio.h>;

int main(){
    struct tm *lt;
    int i=200;
    time_t t=1;  // first second on epoch

    while(i--){
        lt = localtime(&t);
        printf("%s\n", asctime(lt));
        lt->tm_hour+=24; // adding 1 day = 24 hours
        t = mktime(lt);
    }
    return 0;
}

If you are working with other language you may wonder if this will apply to your language. The good news is most language has wrapper for C routines.  May be the name is different. But  most probably you have it. Just see the manual.  Same technique will work on other language too.

Data extraction from external url via php made easy

How many times you extracted data from other website to include in your site? I guess its many times. In this web 2.0 era we all mix up different type of contents to make our own. Sometimes you grab data from youtube.com, sometimes amazon.com and then you create a mesh up. All these things need little bit of data parsing knowledge. Also you need to interact with an http resource. Many PHP coders does it by curl, DOMDocument etc extensions. This job is quite tedious. To resolve this problem I created a class long time ago. Now I put them on http://github.com/shiplu. The classes I am talking about can be found on http://github.com/shiplu/dxtool.

“dxtool” stands for data extraction tools. Its very easy to use. Here I dump the README file from github.

Requirement

  • php5
  • php5-curl extension
  • php5-json extension (already included with php5)

Features

  • Extract Data from any http resource
  • Use simple regular expression to extract data
  • Hassle free http transaction
  • Supports cookie (via curl)
  • Can cache http response

Here is an example on how to use it.

<?php
require 'DataExtractor.php';
require 'WebGet.php';
$google_new_feed = 'http://news.google.com/news?pz=1&cf=all&ned=in&hl=en&output=rss';
$w = new WebGet();
$content = $w->requestContent($google_new_feed);
$dx = new DataExtractor($content);
$dx->titles = '|title>([^<]+)</title|a';
$dx->rsstitle = '|title>([^<]+)</title|';
$data = $dx->extractArray();
print_r($data);
?>

If you run it you’ll see this output. Output will be different as google news rss will change over time

Array
(
    [titles] => Array
        (
            [0] => Top Stories - Google News
            [1] => Top Stories - Google News
            [2] => Draft of Lokpal Bill discussed informally at cabinet meet - Hindustan Times
            [3] => cabinet clears Food Security Bill - Hindustan Times
            [4] => Obama hails Havel's 'moral leadership' and 'dignity' - AFP
            [5] => Philippines struggles to cope after storm leaves 650 dead - Telegraph.co.uk
            [6] => Civil nuclear liability rules balanced, India to Russia - Hindustan Times
            [7] => 'Include Muslims, expand OBC quota' - Hindustan Times
            [8] => Sadhbhavana's success answer to Gujarat detractors: Narendra Modi - Daily News & Analysis
            [9] => Female protestor's beating sparks Egypt outrage - Telegraph.co.uk
            [10] => Romney says US withdrawal from Iraq 'precipitous' - AFP
            [11] => Come clean on Chidambaram's alleged favours to ex-client: BJP to Centre - The Hindu
        )

    [rsstitle] => Top Stories - Google News
)

HOWTO: Convert 6 digit css color code to 3 digit

Okay, So you want to convert 6 digit css color codes to 3 digit. They are almost same. But 3 digit helps to remember it.  Before jumping into code let me explain what does 3 digit represents. A css color #abc means #aabbcc. No its NOT  #a0b0c0. It may appear that  the second one is more appropriate. But the fact is Its not. You can test it.   See the following table.

#aabbcc #abc #a0b0c0
     

So to convert any 6 digit css color to a 3 digit color needs some calculation. Thats why I have written a php function.

See the code bellow

function convert_color($color){
 preg_match("|#([\da-h]{2})([\da-h]{2})([\da-h]{2})|", $color, $match);
 $n=array();
 array_shift($match);
 foreach($match as $m)
  array_push($n, reduce_digit($m));
 return "#". implode("", $n);
}

function reduce_digit($hex){
 $n = hexdec($hex);
 $r = $n%17 ;
 $d = intval($n/17)+ (($r<8)?0:1);
 return dechex($d);
}

Just call the convert_color function with 6 digit css color. For example. “#bcd465″. Here # is necessary. This will return the 3 digit css color.

If you want to test it, run the following code.

$params= array(
array("#aabbcc","#abc"),
array("#112233","#123"),
array("#456789","#468"),
array("#1234fa","#13f"),
array("#000000","#000")
);

foreach($params as $param){
    $f = convert_color($param[0]);
    $e = $param[1];
    echo "Passed={$param[0]}, Expected=", $e, ", Actual=";
    echo $f, ", Status=", (($e==$f)?"SUCCESS":"FAILED"), PHP_EOL;
}

I have run it it the result is good.

Passed=#aabbcc, Expected=#abc, Actual=#abc, Status=SUCCESS
Passed=#112233, Expected=#123, Actual=#123, Status=SUCCESS
Passed=#456789, Expected=#468, Actual=#468, Status=SUCCESS
Passed=#1234fa, Expected=#13f, Actual=#13f, Status=SUCCESS
Passed=#000000, Expected=#000, Actual=#000, Status=SUCCESS