PHP & MySQL Search Engine

Do you hate sites that do not have a search feature? I do. I think it defeats the purpose of cramming information endlessly in cyberspace.

So we want to give a search engine to our users. This sounds really simple, we could try:
copyraw
$search_term_esc = AddSlashes($search_term);
$sql = "SELECT * FROM Content WHERE content_body LIKE '%$search_term_esc%'";
  1.  $search_term_esc = AddSlashes($search_term)
  2.  $sql = "SELECT * FROM Content WHERE content_body LIKE '%$search_term_esc%'"
Great! Few problems though, multiple terms are not supported; quotation marks and apostrophes may be an issue; Full PHP:
copyraw
$sanitized_search=(isset($_GET['search'])&&($sanitized_search==""))?trim(strip_tags(addslashes($_GET['search']))):"";

		# 2 functions to split up search terms and accepting double-quotes as phrase makers.
		function search_transform_term($term){
			$term = preg_replace("/(\s)/e", "'{WHITESPACE-'.ord('\$1').'}'", $term);
			$term = preg_replace("/,/", "{COMMA}", $term);
			return $term;
		}
		function search_split_terms($terms){
			$terms = preg_replace("/\"(.*?)\"/e", "search_transform_term('\$1')", $terms);
			$terms = preg_split("/\s+|,/", $terms);
			$out = array();
			foreach($terms as $term){
				$term = preg_replace("/\{WHITESPACE-([0-9]+)\}/e", "chr(\$1)", $term);
				$term = preg_replace("/\{COMMA\}/", ",", $term);
				$out[] = $term;
			}
			return $out;
		}

		# prepare for MySQL regular expressions
		function search_escape_rlike($string){
			return preg_replace("/([.\[\]*^\$])/", '\\\$1', $string);
		}

		# function to list terms into list of regular expressions
		# eg. search for 'foo' matches 'a foo a' but not 'a food a'.
		function search_db_escape_terms($terms){
			$out = array();
			foreach($terms as $term){
				$out[] = '[[::]]';
			}
			return $out;
		}
		# function to organise ranking
		function search_rx_escape_terms($terms){
			$out = array();
			foreach($terms as $term){
				$out[] = '\b'.preg_quote($term, '/').'\b';
			}
			return $out;
		}

		# function to sort results by rank.
		# NOTE: does not seem to be working properly
		function search_sort_results($a, $b){
			$ax = $a["Score"];
			$bx = $b["Score"];
			if ($ax == $bx){ return 0; }
			return ($ax > $bx) ? -1 : 1;
		}

		# get the words we're going to display
		function search_return_keywords($str) {
			$search = array (
				'@]*?>.*?@si',	// Strip out javascript
				'@]*?>.*?@si',		// Strip out Inline Stylesheets
				'@[{].*?[}]@si',					// Strip out Internal Joomla Curly Brackets commands?
		        '@@si',			// Strip out HTML tags
		        '@([\r\n])[\s]+@',					// Strip out white space
		        '@&#(\d+);@e'
			);
			$str = preg_replace($search, ' ', $str);
			$str = strip_tags(str_replace(array('[',']'), array(''), $str));	// remove BB Code and HTML tags
			$str = html_entity_decode( $str, ENT_QUOTES, "utf-8" );					// converts html entities into characters (symbols)
			$str = preg_replace("/&.{0,}?;/",'', $str);								// removes numeric entities?
			$str = preg_replace('//i', '', $str);	// removes MS Office style html comments (and everything in between)
			$str = str_replace(array("\r\n", "\r", "\n", "\t"), ' ', $str);			// remove carriage returns, line feeds, tabs and double-spaces

			# for mediawiki
			$str = str_replace('=', ' ', $str);									// replace any equal signs with a single-space
			$str = str_replace('"', ' ', $str);									// replace any double-quote signs with a single-space
			$str = str_replace("'", ' ', $str);									// replace any apostrophe signs with a single-space
			$str = str_replace("#", ' ', $str);									// replace any hash signs with a single-space
			$str = str_replace("_", ' ', $str);									// replace any hash signs with a single-space
			$str = str_replace("*", ' ', $str);									// replace any asterisk signs with a single-space
			$str = str_replace("(", ' ', $str);									// replace any open parenthesis with a single-space
			$str = str_replace(")", ' ', $str);									// replace any closing parenthesis with a single-space
			$str = str_replace("\\", ' ', $str);									// replace any closing parenthesis with a single-space

			# to clean up
			$str = str_replace('  ', ' ', $str);									// replace any double-spaces with a single-space
			$str = str_replace('  ', ' ', $str);									// replace any double-spaces with a single-space

			return $str;
		}

		# highlight terms within search results
		function search_highlight($text, $terms_rx){
			$start = '(^|)';
			$end   = '($|)';
			return preg_replace(
				"/$start(.*?)$end/se",
				"StripSlashes('\\1').".
				"search_highlight_inner(StripSlashes('\\2'), \$terms_rx).".
				"StripSlashes('\\3')",
			$text );
		}

		# highlight terms within search results
		function search_highlight_inner($text, $terms_rx){
			$colors = search_get_highlight_colors();
			foreach($terms_rx as $term_rx){
				$color = array_shift($colors);
				$text = preg_replace(
					"/($term_rx)/ise",
					"search_highlight_do(StripSlashes('\\1'), \$color)",
				$text );
			}
			return $text;
		}

		# determine colors to use
		function search_get_highlight_colors(){
			return array(
				array('#ffffcc','#CA6C17'),
				array('#EEE823','#000000'),
				array('#2EBDB6','#000000'),
				array('#7570B3','#000000'),
				array('#FFD400','#000000'),
				array('#ED017C','#ffffff'),
				array('#7A003F','#ffffff'),
				array('#7C6A55','#ffffff'),
				array('#0084B6','#ffffff'),
				array('#260357','#ffffff'),
			);
		}

		function search_highlight_do($fragment, $color){
			return "".
				"$fragment";
		}

		function search_pretty_terms_highlighted($terms_html){
			$colors = search_get_highlight_colors();
			$temp = array();
			foreach($terms_html as $term_html){
				$color = array_shift($colors);
				$temp[] = search_highlight_do($term_html, $color);
			}
			return search_pretty_terms($temp);
		}

		# separate function for dealing with apostrophes (injection?)
		function search_final_clean($terms) {
			$out = array();
			foreach($terms as $term){
				$out[] = addslashes($term);
			}
			return $out;
		}


		$terms = search_split_terms($terms);

			$out=array();
			foreach($terms as $term) {

				# remove terms that are not long enough (so no searches for the letter "e")
				# remove terms that are in the omit word database
				if ((!in_array($term, $omit_search_words))&&(strlen($term)>=$MinimimumStringLength)) {
					$omitted_words_array[]=$term;
					$out[]=$term;
				}
			}
			$terms=array();
			$terms=$out;


		# start creating the sql query
		$terms_db = search_db_escape_terms($terms);
		$terms_db_sql = search_final_clean($terms);
		$parts = array();
		foreach($terms_db as $term_db){
			$parts[] = "PageTitle RLIKE '$term_db'";
		}
		$parts1 = implode(' '.strtoupper($select_joiner).' ', $parts);
		$parts = array();
		foreach($terms_db as $term_db){
			$parts[] = "PageContent RLIKE '$term_db'";
		}
		$parts2 = implode(' '.strtoupper($select_joiner).' ', $parts);
		$parts = $parts1 . " OR " . $parts2;

		$scoreparts = array();
		foreach($terms_db_sql as $term){
			$scoreparts[] = "(((SELECT LENGTH(t.PageTitle)-length(replace(LOWER(t.PageTitle),LOWER('$term'),'')))/LENGTH('$term'))*10)";
		}
		foreach($terms_db_sql as $term){
			$scoreparts[] = "((SELECT LENGTH(t.PageContent)-length(replace(LOWER(t.PageContent),LOWER('$term'),'')))/LENGTH('$term'))";
		}
		if (count($terms)==0) {
			$terms=array();
			$terms[]=strtolower($sanitized_search);
			$terms_db = search_db_escape_terms($terms);
			$terms_db_sql = bukb_apostrophes($terms);
			$term_db=$terms_db_sql[0];

			$scorepartstring = "(((SELECT LENGTH(t.PageTitle)-length(replace(LOWER(t.PageTitle),LOWER('$term'),'')))/LENGTH('$term'))*10)";
			$parts="(PageTitle RLIKE '[[::]]')".$parts."(PageContent RLIKE '[[::]]')";
		} else {
			$scorepartstring = implode(' + ', $scoreparts);
		}

		// Generate the SQL query for your search engine
		$filter_sql = "
		SELECT
			t.PageID,
			t.PageTitle,
			t.PageContent,
			t.Source,
			".$scorepartstring."
			AS Score

		FROM (".$live_sub_query.") AS t
		WHERE ".$parts."
		ORDER BY Score DESC
		LIMIT 0, ".$maximum_number_of_results_from_db;

		$terms_rx = search_rx_escape_terms($terms);
		$results_rows = array();
		$result = mysql_query($filter_sql);
		if ($result) {
			while($row=mysql_fetch_assoc($result)) {
				$row["Score"] = 0;
				foreach($terms_rx as $term_rx){
					$row["Score"] += preg_match_all("/$term_rx/i", $row["PageTitle"], $null)*10;
					$row["Score"] += preg_match_all("/$term_rx/i", $row["PageContent"], $null);
				}
				$results_rows[] = $row;
			}
		}

		uasort($results_rows, 'search_sort_results');
		$terms_rx = search_rx_escape_terms($terms);
		$display_excerpt=search_highlight($result_Content, $terms_rx)."...";
  1.  $sanitized_search=(isset($_GET['search'])&&($sanitized_search==""))?trim(strip_tags(addslashes($_GET['search']))):""
  2.   
  3.          # 2 functions to split up search terms and accepting double-quotes as phrase makers. 
  4.          function search_transform_term($term){ 
  5.              $term = preg_replace("/(\s)/e", "'{WHITESPACE-'.ord('\$1').'}'", $term)
  6.              $term = preg_replace("/,/", "{COMMA}", $term)
  7.              return $term
  8.          } 
  9.          function search_split_terms($terms){ 
  10.              $terms = preg_replace("/\"(.*?)\"/e", "search_transform_term('\$1')", $terms)
  11.              $terms = preg_split("/\s+|,/", $terms)
  12.              $out = array()
  13.              foreach($terms as $term){ 
  14.                  $term = preg_replace("/\{WHITESPACE-([0-9]+)\}/e", "chr(\$1)", $term)
  15.                  $term = preg_replace("/\{COMMA\}/", ",", $term)
  16.                  $out[] = $term
  17.              } 
  18.              return $out
  19.          } 
  20.   
  21.          # prepare for MySQL regular expressions 
  22.          function search_escape_rlike($string){ 
  23.              return preg_replace("/([.\[\]*^\$])/", '\\\$1', $string)
  24.          } 
  25.   
  26.          # function to list terms into list of regular expressions 
  27.          # eg. search for 'foo' matches 'a foo a' but not 'a food a'. 
  28.          function search_db_escape_terms($terms){ 
  29.              $out = array()
  30.              foreach($terms as $term){ 
  31.                  $out[] = '[[::]]'
  32.              } 
  33.              return $out
  34.          } 
  35.          # function to organise ranking 
  36.          function search_rx_escape_terms($terms){ 
  37.              $out = array()
  38.              foreach($terms as $term){ 
  39.                  $out[] = '\b'.preg_quote($term, '/').'\b'
  40.              } 
  41.              return $out
  42.          } 
  43.   
  44.          # function to sort results by rank. 
  45.          # NOTE: does not seem to be working properly 
  46.          function search_sort_results($a, $b){ 
  47.              $ax = $a["Score"]
  48.              $bx = $b["Score"]
  49.              if ($ax == $bx){ return 0} 
  50.              return ($ax > $bx) ? -1 : 1
  51.          } 
  52.   
  53.          # get the words we're going to display 
  54.          function search_return_keywords($str) { 
  55.              $search = array ( 
  56.                  '@]*?>.*?@si',    // Strip out javascript 
  57.                  '@]*?>.*?@si',        // Strip out Inline Stylesheets 
  58.                  '@[{].*?[}]@si',                    // Strip out Internal Joomla Curly Brackets commands? 
  59.                  '@@si',            // Strip out HTML tags 
  60.                  '@([\r\n])[\s]+@',                    // Strip out white space 
  61.                  '@&#(\d+);@e' 
  62.              )
  63.              $str = preg_replace($search, ' ', $str)
  64.              $str = strip_tags(str_replace(array('[',']'), array(''), $str));    // remove BB Code and HTML tags 
  65.              $str = html_entity_decode( $str, ENT_QUOTES, "utf-8" );                    // converts html entities into characters (symbols) 
  66.              $str = preg_replace("/&.{0,}?;/",'', $str);                                // removes numeric entities? 
  67.              $str = preg_replace('//i', '', $str);    // removes MS Office style html comments (and everything in between) 
  68.              $str = str_replace(array("\r\n", "\r", "\n", "\t"), ' ', $str);            // remove carriage returns, line feeds, tabs and double-spaces 
  69.   
  70.              # for mediawiki 
  71.              $str = str_replace('=', ' ', $str);                                    // replace any equal signs with a single-space 
  72.              $str = str_replace('"', ' ', $str);                                    // replace any double-quote signs with a single-space 
  73.              $str = str_replace("'", ' ', $str);                                    // replace any apostrophe signs with a single-space 
  74.              $str = str_replace("#", ' ', $str);                                    // replace any hash signs with a single-space 
  75.              $str = str_replace("_", ' ', $str);                                    // replace any hash signs with a single-space 
  76.              $str = str_replace("*", ' ', $str);                                    // replace any asterisk signs with a single-space 
  77.              $str = str_replace("(", ' ', $str);                                    // replace any open parenthesis with a single-space 
  78.              $str = str_replace(")", ' ', $str);                                    // replace any closing parenthesis with a single-space 
  79.              $str = str_replace("\\", ' ', $str);                                    // replace any closing parenthesis with a single-space 
  80.   
  81.              # to clean up 
  82.              $str = str_replace('  ', ' ', $str);                                    // replace any double-spaces with a single-space 
  83.              $str = str_replace('  ', ' ', $str);                                    // replace any double-spaces with a single-space 
  84.   
  85.              return $str
  86.          } 
  87.   
  88.          # highlight terms within search results 
  89.          function search_highlight($text, $terms_rx){ 
  90.              $start = '(^|)'
  91.              $end   = '($|)'
  92.              return preg_replace( 
  93.                  "/$start(.*?)$end/se", 
  94.                  "StripSlashes('\\1')."
  95.                  "search_highlight_inner(StripSlashes('\\2'), \$terms_rx)."
  96.                  "StripSlashes('\\3')", 
  97.              $text )
  98.          } 
  99.   
  100.          # highlight terms within search results 
  101.          function search_highlight_inner($text, $terms_rx){ 
  102.              $colors = search_get_highlight_colors()
  103.              foreach($terms_rx as $term_rx){ 
  104.                  $color = array_shift($colors)
  105.                  $text = preg_replace( 
  106.                      "/($term_rx)/ise", 
  107.                      "search_highlight_do(StripSlashes('\\1'), \$color)", 
  108.                  $text )
  109.              } 
  110.              return $text
  111.          } 
  112.   
  113.          # determine colors to use 
  114.          function search_get_highlight_colors(){ 
  115.              return array( 
  116.                  array('#ffffcc','#CA6C17'), 
  117.                  array('#EEE823','#000000'), 
  118.                  array('#2EBDB6','#000000'), 
  119.                  array('#7570B3','#000000'), 
  120.                  array('#FFD400','#000000'), 
  121.                  array('#ED017C','#ffffff'), 
  122.                  array('#7A003F','#ffffff'), 
  123.                  array('#7C6A55','#ffffff'), 
  124.                  array('#0084B6','#ffffff'), 
  125.                  array('#260357','#ffffff'), 
  126.              )
  127.          } 
  128.   
  129.          function search_highlight_do($fragment, $color){ 
  130.              return ""
  131.                  "$fragment"
  132.          } 
  133.   
  134.          function search_pretty_terms_highlighted($terms_html){ 
  135.              $colors = search_get_highlight_colors()
  136.              $temp = array()
  137.              foreach($terms_html as $term_html){ 
  138.                  $color = array_shift($colors)
  139.                  $temp[] = search_highlight_do($term_html, $color)
  140.              } 
  141.              return search_pretty_terms($temp)
  142.          } 
  143.   
  144.          # separate function for dealing with apostrophes (injection?) 
  145.          function search_final_clean($terms) { 
  146.              $out = array()
  147.              foreach($terms as $term){ 
  148.                  $out[] = addslashes($term)
  149.              } 
  150.              return $out
  151.          } 
  152.   
  153.   
  154.          $terms = search_split_terms($terms)
  155.   
  156.              $out=array()
  157.              foreach($terms as $term) { 
  158.   
  159.                  # remove terms that are not long enough (so no searches for the letter "e") 
  160.                  # remove terms that are in the omit word database 
  161.                  if ((!in_array($term, $omit_search_words))&&(strlen($term)>=$MinimimumStringLength)) { 
  162.                      $omitted_words_array[]=$term
  163.                      $out[]=$term
  164.                  } 
  165.              } 
  166.              $terms=array()
  167.              $terms=$out
  168.   
  169.   
  170.          # start creating the sql query 
  171.          $terms_db = search_db_escape_terms($terms)
  172.          $terms_db_sql = search_final_clean($terms)
  173.          $parts = array()
  174.          foreach($terms_db as $term_db){ 
  175.              $parts[] = "PageTitle RLIKE '$term_db'"
  176.          } 
  177.          $parts1 = implode(' '.strtoupper($select_joiner).' ', $parts)
  178.          $parts = array()
  179.          foreach($terms_db as $term_db){ 
  180.              $parts[] = "PageContent RLIKE '$term_db'"
  181.          } 
  182.          $parts2 = implode(' '.strtoupper($select_joiner).' ', $parts)
  183.          $parts = $parts1 . " OR " . $parts2
  184.   
  185.          $scoreparts = array()
  186.          foreach($terms_db_sql as $term){ 
  187.              $scoreparts[] = "(((SELECT LENGTH(t.PageTitle)-length(replace(LOWER(t.PageTitle),LOWER('$term'),'')))/LENGTH('$term'))*10)"
  188.          } 
  189.          foreach($terms_db_sql as $term){ 
  190.              $scoreparts[] = "((SELECT LENGTH(t.PageContent)-length(replace(LOWER(t.PageContent),LOWER('$term'),'')))/LENGTH('$term'))"
  191.          } 
  192.          if (count($terms)==0) { 
  193.              $terms=array()
  194.              $terms[]=strtolower($sanitized_search)
  195.              $terms_db = search_db_escape_terms($terms)
  196.              $terms_db_sql = bukb_apostrophes($terms)
  197.              $term_db=$terms_db_sql[0]
  198.   
  199.              $scorepartstring = "(((SELECT LENGTH(t.PageTitle)-length(replace(LOWER(t.PageTitle),LOWER('$term'),'')))/LENGTH('$term'))*10)"
  200.              $parts="(PageTitle RLIKE '[[::]]')".$parts."(PageContent RLIKE '[[::]]')"
  201.          } else { 
  202.              $scorepartstring = implode(+ ', $scoreparts)
  203.          } 
  204.   
  205.          // Generate the SQL query for your search engine 
  206.          $filter_sql = " 
  207.          SELECT 
  208.              t.PageID, 
  209.              t.PageTitle, 
  210.              t.PageContent, 
  211.              t.Source, 
  212.              ".$scorepartstring." 
  213.              as Score 
  214.   
  215.          FROM (".$live_sub_query.") as t 
  216.          WHERE ".$parts." 
  217.          ORDER BY Score DESC 
  218.          LIMIT 0, ".$maximum_number_of_results_from_db
  219.   
  220.          $terms_rx = search_rx_escape_terms($terms)
  221.          $results_rows = array()
  222.          $result = mysql_query($filter_sql)
  223.          if ($result) { 
  224.              while($row=mysql_fetch_assoc($result)) { 
  225.                  $row["Score"] = 0
  226.                  foreach($terms_rx as $term_rx){ 
  227.                      $row["Score"] +preg_match_all("/$term_rx/i", $row["PageTitle"], $null)*10
  228.                      $row["Score"] +preg_match_all("/$term_rx/i", $row["PageContent"], $null)
  229.                  } 
  230.                  $results_rows[] = $row
  231.              } 
  232.          } 
  233.   
  234.          uasort($results_rows, 'search_sort_results')
  235.          $terms_rx = search_rx_escape_terms($terms)
  236.          $display_excerpt=search_highlight($result_Content, $terms_rx)."..."
I like to include a "Did-you-mean" section which I also use for any AJAX suggestion feature (be flexible with us as some customers have asked how does it get that word and why is it relevant? (when it isn't, the sound matches and then it's a typo on the sound so it matches words that are vaguely similar instead of spot on).
copyraw
$didyoumean_words_array=array();
foreach ($terms as $term) {
	$term=strtolower($term);

	# get closest word by LEVENSHTEIN
	$shortest = -1;
	foreach ($levenshtein_words as $word) {
	    $lev = levenshtein($term, $word);
	    if ($lev == 0) {
		$closest = $word;
		$shortest = 0;
		break 1;
	    }
	    if ($lev 0)?"Did you mean: ".$didyoumean_string_joined."?":"";
$didyoumean_words_array=array();

// Note: for some systems you can do a levenshtein on two words soundex value and compare these.  
// Be warned that it does go off track though in terms of relevance...
  1.  $didyoumean_words_array=array()
  2.  foreach ($terms as $term) { 
  3.      $term=strtolower($term)
  4.   
  5.      # get closest word by LEVENSHTEIN 
  6.      $shortest = -1
  7.      foreach ($levenshtein_words as $word) { 
  8.          $lev = levenshtein($term, $word)
  9.          if ($lev == 0) { 
  10.          $closest = $word
  11.          $shortest = 0
  12.          break 1
  13.          } 
  14.          if ($lev 0)?"Did you mean: ".$didyoumean_string_joined."?":""
  15.  $didyoumean_words_array=array()
  16.   
  17.  // Note: for some systems you can do a levenshtein on two words soundex value and compare these. 
  18.  // Be warned that it does go off track though in terms of relevance... 
Sources:
  1. iamcal.com
  2. Nadeau Software
Category: Personal Home Page :: Article: 328

Credit where Credit is Due:


Feel free to copy, redistribute and share this information. All that we ask is that you attribute credit and possibly even a link back to this website as it really helps in our search engine rankings.

Disclaimer: Please note that the information provided on this website is intended for informational purposes only and does not represent a warranty. The opinions expressed are those of the author only. We recommend testing any solutions in a development environment before implementing them in production. The articles are based on our good faith efforts and were current at the time of writing, reflecting our practical experience in a commercial setting.

Thank you for visiting and, as always, we hope this website was of some use to you!

Kind Regards,

Joel Lipman
www.joellipman.com

Related Articles

Joes Revolver Map

Joes Word Cloud

parenthesis   didyoumean   single   query   term   content   entities   words   final   closing   select   within   preg   signs   rows   string   engine   array   matches   double   tags   mysql   remove   replace   does   hash   foreach   escape   parts1   result   parts   colors   list   search   sort   highlight   score   html   text   strip   return   clean   terms   pretty   split   note   spaces   space   function   results   JoelLipman.Com

Accreditation

Badge - Certified Zoho Creator Associate
Badge - Certified Zoho Creator Associate

Donate & Support

If you like my content, and would like to support this sharing site, feel free to donate using a method below:

Paypal:
Donate to Joel Lipman via PayPal

Bitcoin:
Donate to Joel Lipman with Bitcoin bc1qf6elrdxc968h0k673l2djc9wrpazhqtxw8qqp4

Ethereum:
Donate to Joel Lipman with Ethereum 0xb038962F3809b425D661EF5D22294Cf45E02FebF
© 2024 Joel Lipman .com. All Rights Reserved.