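DedeCMS collection rarely has to deal with source URLs that include a port, so the few URLs that do carry one simply fail to collect. This post summarizes how to fix DedeCMS being unable to collect from URLs whose port is not 80: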

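  Problem description: the error appears when the URL to be collected carries a port (the real domain has been replaced with xxx to avoid any appearance of promotion):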

  Test collection URL: http://www.xxx.com:89/index.php/main/news/index.html?channel_id=104&page=1

  The list test returns an array in which every URL has lost its port:

  Tested list URL: http://www.xxx.com:89/index.php/main/news/index.html?channel_id=104&page=1

Array
  (
  [0] => Array
  (
  [title] => 讲座回放|施奠东—西湖,世界风景园林的
  [link] => http://www.xxx.com/index.php/main/news/15529.html
  [image] => http://www.xxx.com/uploadfiles/articles/20190528/15529.png
  )
  [1] => Array
  (
  [title] => 喜报|恭贺我院2019年度西湖杯荣获佳绩!
  [link] => http://www.xxx.com/index.php/main/news/15528.html
  [image] => http://www.xxx.com/uploadfiles/articles/20190522/15528.jpg
  )
  [2] => Array
  (
  [title] => 讲座预告|西湖——世界风景园林的杰出范
  [link] => http://www.xxx.com/index.php/main/news/15526.html
  [image] => http://www.xxx.com/uploadfiles/articles/20190516/15526.jpg
  )
  [3] => Array
  (
  [title] => 讲座回放|胡理琛—西湖七十年流变忆胜
  [link] => http://www.xxx.com/index.php/main/news/15524.html
  [image] => http://www.xxx.com/uploadfiles/articles/20190513/15524.png
  )
  [4] => Array
  (
  [title] => 讲座回放|彭嘉恒—“南师、禅及其在西方
  [link] => http://www.xxx.com/index.php/main/news/15518.html
  [image] => http://www.xxx.com/uploadfiles/articles/20190507/15518.png
  )
  [5] => Array
  (
  [title] => 讲座预告|胡理琛—西湖七十年流变忆胜
  [link] => http://www.xxx.com/index.php/main/news/15516.html
  [image] => http://www.xxx.com/uploadfiles/articles/20190430/15516.jpg
  )
  )

  The URLs obtained this way are clearly wrong: without the port they cannot be opened at all, so nothing can be collected.

  After some digging, the cause turned out to be the DedeCMS function that sets the HTML content and the source URL: it never checks for a port.
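
  To see why the port goes missing, it helps to look at what PHP's parse_url() returns for the test URL. The snippet below is only an illustrative sketch, not DedeCMS code; it assumes the unpatched function builds its base URL from the host key alone, which is what the symptom suggests.

```php
<?php
// Test URL from this post (xxx stands in for the real domain).
$url = 'http://www.xxx.com:89/index.php/main/news/index.html?channel_id=104&page=1';

$parts = parse_url($url);

// parse_url() reports host and port under separate keys,
// so a base URL built from 'host' alone silently drops ":89".
$hostOnly     = $parts['host'];
$hostWithPort = $parts['host'] . ':' . $parts['port'];

echo $hostOnly, "\n";     // www.xxx.com
echo $hostWithPort, "\n"; // www.xxx.com:89
```

  Any relative link resolved against the first value ends up pointing at the default port, which matches the port-less links in the array dump.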

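  In include/dedehtml2.class.php, inside the SetSource function (at around line 79), add the port handling that was marked with a red box in the original screenshot. It corresponds to the $port line and the HomeUrl assignment in the code attached at the end of this post.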

  Test again. OK: the URLs now open normally and the content can be collected.

  The code is attached below:

function SetSource(&$html, $url = '', $linktype='')
{
    $this->__construct();
    $this->CAtt = new DedeAttribute2();
    $url = trim($url);
    $this->SourceHtml = $html;
    $this->BaseUrl = $url;
    // Work out the document's path relative to the current URL
    $urls = @parse_url($url);
    // lyy: port 80 can be omitted; any other explicit port is appended.
    // isset() keeps a URL with no port from producing a stray ':' after the host.
    $port = (isset($urls['port']) && $urls['port'] != 80) ? ':'.$urls['port'] : '';
    $this->HomeUrl = $urls['host'].$port;
    $this->BaseUrlPath = $this->HomeUrl.$urls['path'];
    $this->BaseUrlPath = preg_replace("/\/([^\/]*)\.(.*)$/","/",$this->BaseUrlPath);
    $this->BaseUrlPath = preg_replace("/\/$/",'',$this->BaseUrlPath);
    if($linktype!='')
    {
        $this->GetLinkType = $linktype;
    }
    if($html != '')
    {
        $this->Analyser();
    }
}
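
  The port rule can also be tried outside DedeCMS. The following is a minimal sketch that assumes a hypothetical helper named homeUrlWithPort (it is not part of DedeCMS). It treats a missing port as the default one and, as an extra assumption beyond the original fix, also omits 443 for https URLs.

```php
<?php
/**
 * Hypothetical helper, not part of DedeCMS.
 * Returns "host" when the port is absent or is the scheme's default,
 * otherwise "host:port".
 */
function homeUrlWithPort($url)
{
    $parts = parse_url(trim($url));
    if ($parts === false || !isset($parts['host'])) {
        return ''; // malformed URL or no host to work with
    }

    // Assumption: only http and https matter for collection.
    $scheme = isset($parts['scheme']) ? strtolower($parts['scheme']) : 'http';
    $defaultPort = ($scheme === 'https') ? 443 : 80;

    if (isset($parts['port']) && $parts['port'] != $defaultPort) {
        return $parts['host'] . ':' . $parts['port'];
    }
    return $parts['host'];
}

echo homeUrlWithPort('http://www.xxx.com:89/index.php/main/news/index.html'), "\n"; // www.xxx.com:89
echo homeUrlWithPort('http://www.xxx.com/index.php'), "\n";                         // www.xxx.com
echo homeUrlWithPort('http://www.xxx.com:80/index.php'), "\n";                      // www.xxx.com
```

  Checking isset() before reading the port matters because parse_url() leaves the port key out entirely when the URL does not specify one.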
