CrawlWeiBo

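There are two main ways to collect data from Sina Weibo: through the Sina Weibo API, or by parsing pages fetched with a web crawler. This system takes the crawler-based page-parsing approach, which is not subject to the limits of the open API and can collect data continuously. The crawler takes URLs from a sequential URL queue, downloads the pages they point to, and parses each page into a DOM tree. XPath expressions locate the DOM nodes that hold the key information, and the content of those nodes is then extracted.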
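The queue, download, DOM parse, and XPath extraction steps described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code in this repository: the `crawl` function, the injected `fetch` callable, the sample page, and the class name `WB_text` are all assumptions made for the example. `xml.etree.ElementTree` only accepts well-formed markup and a limited XPath subset, so a real crawler would more likely rely on a tolerant HTML parser.

```python
from collections import deque
import xml.etree.ElementTree as ET


def crawl(seed_urls, fetch, xpath):
    """Visit URLs in queue order and collect the text of nodes matching xpath.

    fetch is any callable that maps a URL to page markup (hypothetical here;
    in practice it would wrap an HTTP client).
    """
    queue = deque(seed_urls)  # sequential URL queue
    seen = set()              # avoid downloading the same URL twice
    results = []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        root = ET.fromstring(fetch(url).strip())  # build the DOM tree
        for node in root.findall(xpath):          # locate the target nodes
            results.append("".join(node.itertext()).strip())
    return results


# Stand-in page; the class name "WB_text" is an assumption for illustration.
SAMPLE_PAGE = """
<html><body>
  <div class="WB_text">First post</div>
  <div class="WB_text">Second <b>post</b></div>
  <div class="other">ignored</div>
</body></html>
"""


def fake_fetch(url):
    """Return the sample page instead of performing a network request."""
    return SAMPLE_PAGE


posts = crawl(["http://example.com/page1"], fake_fetch,
              ".//div[@class='WB_text']")
```

Passing `fetch` as a parameter keeps the queue and parsing logic testable without network access.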

Government Weibo Analysis

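According to the requirements specification, the following data attributes need to be collected:

Post content
Whether the post is original
Reposted content
Publish time
Repost count
Comment count
Like count
Device source
Weibo ID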

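For each fetched page, the source is analyzed to find the tags that correspond to each attribute, and the data is extracted from those tags. The collected data is then saved in CSV format for use in data analysis.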
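As a rough sketch of the final step, the attributes listed above could be written to CSV with the standard `csv` module. The column names in `FIELDS`, the `write_csv` helper, and the sample record are hypothetical and chosen only for illustration; the repository's scripts may use different names and a different layout.

```python
import csv
import io

# Hypothetical column names, one per attribute listed above.
FIELDS = [
    "weibo_id", "content", "is_original", "repost_content", "publish_time",
    "repost_count", "comment_count", "like_count", "device_source",
]


def write_csv(records, out):
    """Write a header row, then one row per record, to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        # Missing attributes become empty cells rather than raising an error.
        writer.writerow({key: record.get(key, "") for key in FIELDS})


# Made-up sample record, for illustration only.
sample = [{
    "weibo_id": "1001",
    "content": "Road closed, use detour",
    "is_original": True,
    "publish_time": "2016-01-15 08:30",
    "repost_count": 3,
    "comment_count": 1,
    "like_count": 7,
    "device_source": "web",
}]

buf = io.StringIO()
write_csv(sample, buf)
csv_text = buf.getvalue()
```

To write to disk instead, the same function can be given a file opened with `open(path, "w", newline="", encoding="utf-8")`.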

Manually Selecting Weibo Accounts

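Posts are crawled according to the time of each event: one month before and one month after it, three months in total. To automate data collection, the PageId is first crawled from the Weibo account and then concatenated into the crawl URL as one of its fields, so that a Weibo account alone is enough to crawl its posts.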
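The time window and URL concatenation might look something like the sketch below. It assumes that "three months in total" means the calendar month of the event plus the months immediately before and after it. The `URL_TEMPLATE`, its host, and the `stat_date` parameter are placeholders rather than Weibo's actual URL scheme, and `month_window` and `build_urls` are hypothetical helper names.

```python
from datetime import date


def month_window(event_day):
    """Return (year, month) pairs for the month before, of, and after the event."""
    months = []
    for offset in (-1, 0, 1):
        # Count months from year 0 so that year boundaries roll over correctly.
        index = event_day.year * 12 + (event_day.month - 1) + offset
        months.append((index // 12, index % 12 + 1))
    return months


# Placeholder template: the host and query parameter are assumptions.
URL_TEMPLATE = "https://weibo.example/p/{page_id}/home?stat_date={year}{month:02d}"


def build_urls(page_id, event_day):
    """Concatenate the PageId into one crawl URL per month in the window."""
    return [
        URL_TEMPLATE.format(page_id=page_id, year=year, month=month)
        for year, month in month_window(event_day)
    ]


urls = build_urls("1234567890", date(2016, 1, 15))
```

For an event in January 2016 this produces URLs for December 2015, January 2016, and February 2016.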

Repository contents:

3000
JingWuWeibo
WeiBoName
事件
.DS_Store
.gitattributes
LICENSE
README.md
account_alldata_lib.xls
crawlWeiboAccountAllData.py
crawlWeiboContent.py
crawlZWWBAccount.py
