1.3 Review Webserver Metafiles for Information Leakage

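Summary

This section describes how to test various metadata files for information leakage about the web application's paths or functionality. The list of directories that spiders, robots, or crawlers are told to avoid can also serve as input when mapping execution paths through the application. Other information may also be collected to identify the attack surface, technology details, or material for use in a social engineering engagement.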

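Test Objectives

Identify hidden or obfuscated paths and functionality through the analysis of metadata files.
Extract and map other information that could lead to a better understanding of the systems at hand.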

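How to Test

Any of the actions performed below with wget can also be done with curl. Many Dynamic Application Security Testing (DAST) tools, such as ZAP and Burp Suite, include checks or parsing for these resources as part of their spider/crawler functionality. The resources can also be identified using various Google Dorks or by leveraging advanced search features such as inurl:.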

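Robots

Web spiders, robots, or crawlers retrieve a web page and then recursively traverse hyperlinks to retrieve further web content. Their accepted behavior is specified by the Robots Exclusion Protocol in the robots.txt file in the web root directory.

As an example, the beginning of the robots.txt file from Google, sampled on May 5, 2020, is quoted below: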

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
...

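The User-Agent directive refers to a specific web spider/robot/crawler. For example, User-Agent: Googlebot refers to the spider from Google, while User-Agent: bingbot refers to a crawler from Microsoft. User-Agent: * in the example above applies to all web spiders/robots/crawlers.

The Disallow directive specifies which resources spiders/robots/crawlers are asked not to retrieve. In the example above, the following are prohibited: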

...
Disallow: /search
...
Disallow: /sdch
...

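Web spiders/robots/crawlers can intentionally ignore the Disallow directives specified in a robots.txt file. Hence, robots.txt should not be considered a mechanism for enforcing restrictions on how web content is accessed, stored, or republished by third parties.

The robots.txt file is retrieved from the web root directory of the web server. For example, to retrieve robots.txt from www.google.com using wget or curl: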

$ curl -O -Ss http://www.google.com/robots.txt && head -n5 robots.txt
User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
...
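
Once the file has been retrieved, the directives can be turned into a list of candidate paths for further review. The following is a minimal Python sketch, not a complete implementation of the Robots Exclusion Protocol: the function name parse_robots and the sample variable are hypothetical, and wildcards, Sitemap lines, and other extensions are ignored.

```python
from collections import defaultdict

def parse_robots(text):
    """Group Allow/Disallow paths by the User-agent they apply to.

    Simplified on purpose: wildcard patterns and non-standard
    fields are not interpreted (assumption of this sketch).
    """
    rules = defaultdict(lambda: {"allow": [], "disallow": []})
    agents = []        # user-agents of the group currently being read
    in_rules = False   # True once the current group has a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # a new group starts here
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if value:                         # empty value: nothing to record
                for agent in agents:
                    rules[agent][field].append(value)
    return dict(rules)

# Sample taken from the robots.txt excerpt quoted in this section.
sample = """\
User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
"""

rules = parse_robots(sample)
for path in rules["*"]["disallow"]:
    # Candidate URLs a tester may want to inspect manually.
    print("http://www.google.com" + path)
```

The disallowed paths are often the most interesting ones for a tester, since they point at content the site owner would prefer crawlers not to visit.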

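Analyze robots.txt Using Google Webmaster Tools

Site owners can use the Google "Analyze robots.txt" function, part of Google Webmaster Tools, to analyze their site. This tool can assist with testing, and the procedure is as follows:

Sign in to Google Webmaster Tools with a Google account.
On the dashboard, enter the URL of the site to be analyzed.
Choose between the available methods and follow the on-screen instructions.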

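META Tags

<META> tags are located within the HEAD section of each HTML document and should be consistent across the site, because a robot/spider/crawler may start from a deep link rather than from the webroot. Robots directives can also be specified through a specific META tag.

Robots META Tag

If there is no <META NAME="ROBOTS" ... > entry, the Robots Exclusion Protocol defaults to INDEX,FOLLOW. The other two valid entries defined by the protocol are prefixed with NO..., i.e. NOINDEX and NOFOLLOW.

Based on the Disallow directive(s) listed in the robots.txt file in the webroot, search each web page with a regular expression for <META NAME="ROBOTS" and compare the result with the robots.txt file in the webroot.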
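
The comparison described above can be sketched in Python. This example uses the standard library HTML parser rather than a raw regular expression, which is a substitution made here for robustness. The names RobotsMetaParser, robots_meta, and mismatches, the sample pages, and the simple prefix matching against Disallow entries are all assumptions of this sketch.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for item in attr.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().upper())

def robots_meta(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    # No robots META tag at all means the defaults apply.
    return parser.directives or {"INDEX", "FOLLOW"}

def mismatches(pages, disallowed):
    """Paths that robots.txt disallows but whose page lacks NOINDEX.

    Assumption: Disallow entries are treated as plain path prefixes.
    """
    result = []
    for path, html in pages.items():
        blocked = any(path.startswith(prefix) for prefix in disallowed)
        if blocked and "NOINDEX" not in robots_meta(html):
            result.append(path)
    return sorted(result)

# Hypothetical pages, keyed by path, standing in for crawled content.
pages = {
    "/admin/": '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>',
    "/private/report": "<html><head><title>Report</title></head></html>",
    "/about": "<html><head><title>About</title></head></html>",
}
print(mismatches(pages, ["/admin", "/private"]))  # ['/private/report']
```

A page that is listed under Disallow but carries no NOINDEX tag is an inconsistency worth noting in the test results.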

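Miscellaneous META Information Tags

Organizations often embed informational META tags in web content to support various technologies such as screen readers, social networking previews, and search engine indexing. Such meta-information can help testers identify the technologies in use and additional paths or functionality to explore and test. The following meta-information was retrieved from www.whitehouse.gov via View Page Source on May 5, 2020: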

...
<meta property="og:locale" content="en_US" />
<meta property="og:type" content="website" />
<meta property="og:title" content="The White House" />
<meta property="og:description" content="We, the citizens of America, are now joined in a great national effort to rebuild our country and to restore its promise for all. – President Donald Trump." />
<meta property="og:url" content="https://www.whitehouse.gov/" />
<meta property="og:site_name" content="The White House" />
<meta property="fb:app_id" content="1790466490985150" />
<meta property="og:image" content="https://www.whitehouse.gov/wp-content/uploads/2017/12/wh.gov-share-img_03-1024x538.png" />
<meta property="og:image:secure_url" content="https://www.whitehouse.gov/wp-content/uploads/2017/12/wh.gov-share-img_03-1024x538.png" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:description" content="We, the citizens of America, are now joined in a great national effort to rebuild our country and to restore its promise for all. – President Donald Trump." />
<meta name="twitter:title" content="The White House" />
<meta name="twitter:site" content="@whitehouse" />
<meta name="twitter:image" content="https://www.whitehouse.gov/wp-content/uploads/2017/12/wh.gov-share-img_03-1024x538.png" />
<meta name="twitter:creator" content="@whitehouse" />
...
<meta name="apple-mobile-web-app-title" content="The White House">
<meta name="application-name" content="The White House">
<meta name="msapplication-TileColor" content="#0c2644">
<meta name="theme-color" content="#f5f5f5">
...
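
Collecting these tags by hand does not scale across many pages. The following Python sketch gathers property/name and content pairs and lists the URL paths they reveal. The names MetaCollector, collect_meta, and url_paths are hypothetical, and the sample is shortened from the excerpt above.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class MetaCollector(HTMLParser):
    """Map each META tag's property/name to its content value."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        if key and attr.get("content") is not None:
            self.meta[key] = attr["content"]

def collect_meta(html):
    collector = MetaCollector()
    collector.feed(html)
    return collector.meta

def url_paths(meta):
    """Return the distinct non-root paths of http(s) URLs in the values."""
    paths = set()
    for value in meta.values():
        parsed = urlparse(value)
        if parsed.scheme in ("http", "https") and parsed.path not in ("", "/"):
            paths.add(parsed.path)
    return sorted(paths)

# Shortened sample based on the tags quoted in this section.
html = """
<meta property="og:url" content="https://www.whitehouse.gov/" />
<meta property="og:image" content="https://www.whitehouse.gov/wp-content/uploads/2017/12/wh.gov-share-img_03-1024x538.png" />
<meta name="twitter:site" content="@whitehouse" />
"""

meta = collect_meta(html)
print(url_paths(meta))
```

In this sample the only path found sits under /wp-content/uploads/, which may hint that the site runs on WordPress and suggests further paths to explore.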

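Sitemaps

A sitemap is a file in which a developer or organization provides information about the pages, videos, and other files offered by the site or application, and the relationships between them. Search engines can use this file to explore the site more intelligently. Testers can use sitemap.xml files to learn more about the site or application and explore it more completely.

The following excerpt is from Google's primary sitemap, retrieved on May 5, 2020.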

$ wget --no-verbose https://www.google.com/sitemap.xml && head -n8 sitemap.xml
2020-05-05 12:23:30 URL:https://www.google.com/sitemap.xml [2049] -> "sitemap.xml" [1]

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
  <sitemap>
    <loc>https://www.google.com/gmail/sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.google.com/forms/sitemaps.xml</loc>
  </sitemap>
...

Exploring from there, a tester may wish to retrieve the Gmail sitemap at https://www.google.com/gmail/sitemap.xml:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://www.google.com/intl/am/gmail/about/</loc>
    <xhtml:link href="https://www.google.com/gmail/about/" hreflang="x-default" rel="alternate"/>
    <xhtml:link href="https://www.google.com/intl/el/gmail/about/" hreflang="el" rel="alternate"/>
    <xhtml:link href="https://www.google.com/intl/it/gmail/about/" hreflang="it" rel="alternate"/>
    <xhtml:link href="https://www.google.com/intl/ar/gmail/about/" hreflang="ar" rel="alternate"/>
...
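
Walking from a sitemap index to its child sitemaps can be automated. The following Python sketch extracts every loc value from a sitemap document that has already been downloaded. The function name sitemap_locs is hypothetical, and the sample is the excerpt above with the closing tag added so that it parses. Namespaces are stripped because, as the two excerpts show, sitemap files may declare different schema URIs.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def sitemap_locs(xml_text):
    """Return (kind, locs): the root element name and all <loc> values.

    kind is 'sitemapindex' for an index of further sitemaps and
    'urlset' for a list of page URLs.
    """
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iter()
            if local_name(el.tag) == "loc" and el.text]
    return local_name(root.tag), locs

# Sample based on the sitemap index excerpt quoted in this section.
index = """\
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
  <sitemap>
    <loc>https://www.google.com/gmail/sitemap.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.google.com/forms/sitemaps.xml</loc>
  </sitemap>
</sitemapindex>
"""

kind, locs = sitemap_locs(index)
if kind == "sitemapindex":
    for loc in locs:
        print("child sitemap to fetch next:", loc)
```

A tester could apply the same function to each child sitemap and collect the resulting urlset entries as input for a crawler.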

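Security TXT

security.txt was ratified by the IETF as RFC 9116, "A File Format to Aid in Security Vulnerability Disclosure", which allows websites to define security policies and contact details. There are multiple reasons this might be of interest in testing scenarios, including but not limited to:

Identifying further paths or resources to include in discovery/analysis.
Open source intelligence gathering.
Finding information on bug bounties, etc.
Social engineering.

The file may be present either in the root of the web server or in the .well-known/ directory. For example: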

https://example.com/security.txt
https://example.com/.well-known/security.txt

Here is a real-world example retrieved from LinkedIn on May 5, 2020:

$ wget --no-verbose https://www.linkedin.com/.well-known/security.txt && cat security.txt
2020-05-07 12:56:51 URL:https://www.linkedin.com/.well-known/security.txt [333/333] -> "security.txt" [1]
# Conforms to IETF `draft-foudil-securitytxt-07`
Contact: mailto:security@linkedin.com
Contact: https://www.linkedin.com/help/linkedin/answer/62924
Encryption: https://www.linkedin.com/help/linkedin/answer/79676
Canonical: https://www.linkedin.com/.well-known/security.txt
Policy: https://www.linkedin.com/help/linkedin/answer/62924
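
To feed these fields into further discovery, the file can be parsed into a simple structure. The following Python sketch is a minimal reading of the "Name: value" layout shown above, not a validating RFC 9116 parser. The function name parse_security_txt is hypothetical, and field names are lowercased for convenience.

```python
def parse_security_txt(text):
    """Return a dict mapping lowercased field names to lists of values.

    Comment lines (starting with '#') and blank lines are skipped.
    Signature blocks and field validation are out of scope
    (assumption of this sketch).
    """
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only; values are often URLs.
        name, value = line.split(":", 1)
        fields.setdefault(name.strip().lower(), []).append(value.strip())
    return fields

# Sample taken from the LinkedIn excerpt quoted in this section.
sample = """\
# Conforms to IETF `draft-foudil-securitytxt-07`
Contact: mailto:security@linkedin.com
Contact: https://www.linkedin.com/help/linkedin/answer/62924
Encryption: https://www.linkedin.com/help/linkedin/answer/79676
Canonical: https://www.linkedin.com/.well-known/security.txt
Policy: https://www.linkedin.com/help/linkedin/answer/62924
"""

fields = parse_security_txt(sample)
for contact in fields.get("contact", []):
    print("contact:", contact)
```

The URLs collected this way can be added to the list of resources to review, and the contact entries may be relevant for open source intelligence gathering.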

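Humans TXT

humans.txt is an initiative for knowing the people behind a site. It takes the form of a text file that contains information about the different people who have contributed to building the site. This file often (though not always) contains information about careers or job sites/paths.

The following example was retrieved from Google on May 5, 2020: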

$ wget --no-verbose  https://www.google.com/humans.txt && cat humans.txt
2020-05-07 12:57:52 URL:https://www.google.com/humans.txt [286/286] -> "humans.txt" [1]
Google is built by a large team of engineers, designers, researchers, robots, and others in many different sites across the globe. It is updated continuously, and built with more tools and technologies than we can shake a stick at. If you'd like to help us out, see careers.google.com.

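Other .well-known Information Sources

There are other RFCs and Internet drafts that suggest standardized uses of files within the .well-known/ directory. Lists of these are available, for example in the IANA Well-Known URIs registry.

It would be fairly simple for a tester to review the RFCs/drafts and create a list to supply to a crawler or fuzzer in order to verify the existence or content of such files.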
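
Such a list can be generated with a few lines of Python. In the sketch below, the function name candidate_urls and both lists are illustrative assumptions: ROOT_FILES repeats the files discussed in this section, and WELL_KNOWN_NAMES is a small hand-picked subset of registered well-known URIs, not a complete registry.

```python
from urllib.parse import urljoin

# Files covered in this section, expected in the web root.
ROOT_FILES = ["robots.txt", "sitemap.xml", "security.txt", "humans.txt"]

# Illustrative subset of well-known URI suffixes (assumption: extend
# this list from the registry and RFCs relevant to the target).
WELL_KNOWN_NAMES = [
    "security.txt",
    "change-password",
    "openid-configuration",
    "host-meta",
    "assetlinks.json",
    "apple-app-site-association",
]

def candidate_urls(base_url):
    """Build the list of metafile URLs to hand to a crawler or fuzzer."""
    root = urljoin(base_url, "/")                 # scheme://host/
    well_known = urljoin(root, ".well-known/")
    urls = [urljoin(root, name) for name in ROOT_FILES]
    urls += [urljoin(well_known, name) for name in WELL_KNOWN_NAMES]
    return urls

for url in candidate_urls("https://example.com/app/index.html"):
    print(url)
```

The output is a plain list of URLs, one per line, which can be saved to a file and passed to whichever crawler or fuzzer is in use.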

Tools

Browser (View Source or Dev Tools functionality)
curl
wget
Burp Suite
ZAP