How to Get Chengdu Rental Listings with Python - 编程学习网

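This article explains in detail how to get Chengdu rental listings with Python. It is shared here as a reference, and after reading it you should have a working understanding of the techniques involved.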

Data collection starts with listings from two sites: Ganji (赶集网) and Ziroom (自如网).

Fetching data from Ganji (赶集网)

I. Fetching the current page

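The page structure is fairly regular: fetch the page and parse it with XPath, and every field is easy to reach. Loop over each listing div block, extract its fields one by one, then store the rows in a list and return it.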

    import re

    import requests
    from lxml import etree


    def get_this_page_gj(url, tmp):
        """Parse one Ganji list page and append one row per listing to tmp."""
        html = etree.HTML(requests.get(url).text)
        divs = html.xpath('//div[@class="f-list-item ershoufang-list"]')
        for div in divs:
            title = div.xpath('./dl/dd[@class="dd-item title"]/a/text()')[0]
            house_url = div.xpath('./dl/dd[@class="dd-item title"]/a/@href')[0]
            size = "、".join(div.xpath('./dl/dd[@class="dd-item size"]/span/text()'))
            # Look up the address relative to the current block (div) so that
            # each row gets its own address rather than the first listing's.
            address = '-'.join([
                data.strip()
                for data in div.xpath('./dl/dd[@class="dd-item address"][1]//a//text()')
                if data.strip() != ''
            ])
            agent_string = div.xpath('./dl/dd[@class="dd-item address"][2]/span/span/text()')[0]
            agent = re.sub(' ', '', agent_string)
            price = div.xpath('./dl/dd[@class="dd-item info"]/div[@class="price"]/span[@class="num"]/text()')[0]
            tmp.append([title, size, price, address, agent, house_url])
        return tmp
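To see why every field is looked up relative to the current block, here is a minimal sketch that runs the same kind of loop over a hand-written snippet. The markup is a simplified, hypothetical stand-in for a Ganji list page, and the sketch uses the standard library's xml.etree.ElementTree (which supports only a small XPath subset) in place of lxml so that it runs offline without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup; the real page carries more fields.
SAMPLE = """
<div>
  <div class="f-list-item ershoufang-list">
    <dl>
      <dd class="dd-item title"><a href="/zufang/1.shtml">Room A</a></dd>
      <dd class="dd-item address"><a>Wuhou</a><a>Yulin</a></dd>
    </dl>
  </div>
  <div class="f-list-item ershoufang-list">
    <dl>
      <dd class="dd-item title"><a href="/zufang/2.shtml">Room B</a></dd>
      <dd class="dd-item address"><a>Jinjiang</a><a>Chunxi Road</a></dd>
    </dl>
  </div>
</div>
"""


def parse_blocks(markup):
    """Return [title, address, url] for every listing block in markup."""
    root = ET.fromstring(markup)
    rows = []
    for div in root.findall(".//div[@class='f-list-item ershoufang-list']"):
        link = div.find("./dl/dd[@class='dd-item title']/a")
        # Paths starting with "./" are resolved against this block only.
        parts = [a.text.strip()
                 for a in div.findall("./dl/dd[@class='dd-item address']/a")]
        rows.append([link.text, "-".join(parts), link.get("href")])
    return rows


rows = parse_blocks(SAMPLE)
for row in rows:
    print(row)
```

Each row carries the address of its own block; querying the first block on every iteration would repeat "Wuhou-Yulin" for both listings.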

II. Building the URLs

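Request the index page to read the total number of pages, build each page URL according to the site's URL pattern, and call the page-level function above on each one. Every URL starts with http://cd.ganji.com/zufang/pn followed by the page number.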

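    def house_gj(headers):
        """Collect every page of Ganji's Chengdu rental listings."""
        index_url = 'http://cd.ganji.com/zufang/'
        # get_html(url, headers) is a helper that this article does not show;
        # it is assumed to return the page's HTML text.
        html = etree.HTML(get_html(index_url, headers))
        # The second-to-last pager link holds the total page count.
        total = html.xpath('//div[@class="pageBox"]/a[position() = last() -1]/span/text()')[0]
        result = []
        for num in range(1, int(total) + 1):
            result += get_this_page_gj('http://cd.ganji.com/zufang/pn{}'.format(num), [])
            print('Finished reading page {} / Ganji'.format(num))
        return result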
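house_gj calls a get_html(url, headers) helper whose body is not given in the article. The following is only a guess at its shape: the default header value, the timeout, and the UTF-8 decoding are all assumptions, and it uses urllib from the standard library rather than requests so that it has no third-party dependency.

```python
import urllib.request

# Hypothetical default; the article does not show the headers it sent.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_request(url, headers=None):
    """Attach the headers to a request object without sending it."""
    return urllib.request.Request(url, headers=headers or DEFAULT_HEADERS)


def get_html(url, headers=None, timeout=10):
    """Fetch url and return the body decoded as UTF-8 (an assumption)."""
    request = build_request(url, headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Splitting out build_request keeps the header handling checkable without touching the network; get_html itself would be called as get_html('http://cd.ganji.com/zufang/', headers).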

Fetching data from Ziroom (自如网)

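Ziroom is similar to Ganji in structure and is scraped the same way: basic fields plus the listing URL. The difference is the price, which is harder to get because it is not shown as text but rendered from an image plus an offset.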

I. Extracting the price

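Each digit of the price is rendered from an image: the digit shown is picked out of a source image by the offset set in the element's style. The source image also differs from page to page, which makes this awkward to handle.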

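Looking closely, the digits in the source image are evenly spaced; you can change the offset value in the browser to see the pattern. Each digit occupies 21.4px and offsets are measured from the left edge of the source image, so the offset determines the digit: index = |offset / 21.4|. Page images and content introduce small deviations, but they are tiny. Convert the result to an integer and take the value at that index in the list of digits read from the source image. tesseract is used here to read the image.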

    import math
    import re

    import pytesseract
    import requests
    from PIL import Image

    # Excerpt from the per-listing loop (the full function is in section II):
    ...
    price_strings = div.xpath('./div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')
    offset_list = []
    for data in price_strings:
        offset_list.append(re.findall('position: (.*?)px', data)[0])
    style_string = html.xpath('//div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')[0]
    pic = "http:" + re.findall(r'background-image: url\((.*?)\);.*?', style_string)[0]
    price = get_price_zr(pic, offset_list)


    def get_price_zr(pic_url, offset_list):
        '''
        index holds the position of every digit; once the image has been
        read, the digit at each position is looked up to form the price.
        '''
        index, price = [], []
        with open('pic.png', 'wb') as f:
            f.write(requests.get(pic_url).content)
        code_list = list(pytesseract.image_to_string(Image.open('pic.png')))
        for data in offset_list:
            # float() rather than eval(): the offset is text scraped from a
            # page and should never be executed as code.
            index.append(int(math.fabs(float(data) / 21.4)))
        for data in index:
            price.append(code_list[data])
        return "".join(price)

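pic_url is the address of the page's source image; it is downloaded and read with pytesseract, and the function returns the price as the string of digits found at each index. offset_list is the list of offsets, one per digit.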
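The offset-to-digit mapping can be tried on its own, without downloading an image or running OCR. The sketch below assumes the digits of the source image are already available as a string; the digit string and the offsets are made up for illustration. It also uses round() instead of the int() truncation in the article's code, a choice made here so that an offset slightly below a multiple of 21.4 still lands on the nearest slot.

```python
DIGIT_WIDTH = 21.4  # px per digit slot, the value given in the article


def offsets_to_price(digits, offsets, width=DIGIT_WIDTH):
    """Map each background offset to the digit in that slot of the image."""
    price = []
    for offset in offsets:
        # Offsets are negative in CSS; only the distance from the left matters.
        slot = round(abs(float(offset)) / width)
        price.append(digits[slot])
    return "".join(price)


# Hypothetical image contents and offsets, for illustration only.
sprite_digits = "8652039147"
price = offsets_to_price(sprite_digits, ["-21.4", "-64.2", "-171.2", "-0"])
print(price)
```

With these inputs the slots are 1, 3, 8 and 0, so the digits picked are 6, 2, 4 and 8.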

II. Fetching the current page

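As with Ganji, we write a function that fetches one page of data and then call it with each page's URL. Two things worth noting here are the XPath contains() function and the regular expression that extracts the source image URL.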

    import re

    import requests
    from lxml import etree


    def get_this_page_zr(url, tmp):
        """Parse one Ziroom list page and append one row per listing to tmp."""
        html = etree.HTML(requests.get(url).text)
        divs = html.xpath('//div[@class="item"]')
        for div in divs:
            # Skip blocks that have no title link.
            if div.xpath('./div[@class="info-box"]/h6/a/text()'):
                title = div.xpath('./div[@class="info-box"]/h6/a/text()')[0]
            else:
                continue
            link = 'http:' + div.xpath('./div[@class="info-box"]/h6/a/@href')[0]
            location = div.xpath('./div[@class="info-box"]/div[@class="desc"]/div[@class="location"]/text()')[0]
            # contains() picks the desc line that mentions the floor area.
            area = div.xpath('./div[@class="info-box"]/div[@class="desc"]/div[contains(text(), "㎡")]/text()')[0]
            # One style attribute per price digit; each carries that digit's offset.
            price_strings = div.xpath('./div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')
            offset_list = []
            for data in price_strings:
                offset_list.append(re.findall('position: (.*?)px', data)[0])
            # The source image is taken from the first price span on the page.
            style_string = html.xpath('//div[@class="info-box"]/div[@class="price"]/span[@class="num"]/@style')[0]
            pic = "http:" + re.findall(r'background-image: url\((.*?)\);.*?', style_string)[0]
            price = get_price_zr(pic, offset_list)
            tag = '、'.join(div.xpath('./div[@class="info-box"]//div[@class="tag"]/span/text()'))
            tmp.append([title, tag, price, area, location, link])
        return tmp
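The two regular expressions can be checked in isolation against a style attribute. The style strings below are hypothetical, written to match what the patterns expect (a protocol-relative image URL followed by a background-position offset); the real Ziroom markup may differ.

```python
import re

OFFSET_RE = re.compile(r"position: (.*?)px")
IMAGE_RE = re.compile(r"background-image: url\((.*?)\);")


def parse_styles(styles):
    """Return (image_url, offsets) from the style attribute of each digit."""
    offsets = [OFFSET_RE.findall(style)[0] for style in styles]
    # The URL in the style is protocol-relative, so a scheme is prepended.
    image_url = "http:" + IMAGE_RE.findall(styles[0])[0]
    return image_url, offsets


# Hypothetical style attributes for a two-digit price.
styles = [
    "background-image: url(//example.com/sprite.png);background-position: -171.2px",
    "background-image: url(//example.com/sprite.png);background-position: -21.4px",
]
image_url, offsets = parse_styles(styles)
print(image_url, offsets)
```

The offsets stay as strings at this stage, which is the form get_price_zr receives in offset_list.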

III. Building the URLs

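This works the same way as for Ganji. The point to note is the XPath predicate position()=last()-1, which selects the second-to-last pager link, the one whose text is used as the total page count.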

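    def house_zr(headers):
        """Collect every page of Ziroom's Chengdu rental listings."""
        index_url = 'http://cd.ziroom.com/z/'
        # get_html(url, headers) is a helper that this article does not show;
        # it is assumed to return the page's HTML text.
        html = etree.HTML(get_html(index_url, headers))
        # The second-to-last pager link holds the total page count.
        total = html.xpath('//div[@class="Z_pages"]/a[position()=last()-1]/text()')[0]
        result = []
        for num in range(1, int(total) + 1):
            result += get_this_page_zr('http://cd.ziroom.com/z/p{}/'.format(num), [])
            print('Finished reading page {} / Ziroom'.format(num))
        return result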
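Reading the total from the pager and expanding it into page URLs can be sketched offline as well. The pager markup below is hypothetical. Because xml.etree.ElementTree only understands a small XPath subset, the predicate is written as [last()-1] here; it selects the same second-to-last link that the article's lxml predicate position()=last()-1 selects.

```python
import xml.etree.ElementTree as ET

# Hypothetical pager: numbered links followed by a "Next" link.
PAGER = """
<div class="Z_pages">
  <a href="/z/p1/">1</a>
  <a href="/z/p2/">2</a>
  <a href="/z/p50/">50</a>
  <a href="/z/p2/">Next</a>
</div>
"""


def page_urls(pager_markup, template):
    """Build one URL per page from the total shown in the pager."""
    root = ET.fromstring(pager_markup)
    # The last link is "Next", so the total is in the one before it.
    total = int(root.find("./a[last()-1]").text)
    return [template.format(num) for num in range(1, total + 1)]


urls = page_urls(PAGER, "http://cd.ziroom.com/z/p{}/")
print(len(urls), urls[0], urls[-1])
```

The same function covers Ganji by passing the template 'http://cd.ganji.com/zufang/pn{}' together with that site's pager markup.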

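That is all on how to get Chengdu rental listings with Python. Hopefully the material above is useful to you; if you found the article helpful, feel free to share it so that more people can see it.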
