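Intelligent Document Retrieval System


1. Project background

  • There are many large search engines on the market today,

    such as Google, Bing and Baidu.

    These large search engines have to crawl the pages of the entire web,

    store that data, and build the corresponding back-end index modules.

    They then rank the data against the keywords the user enters to decide the order in which pages are shown,

    and they also have to account for how pages relate to one another.

  • Building one of these large search engines ourselves is therefore not realistic.

    What we can build is a site search.

    A site search covers data that is more vertical and strongly related, and far smaller in volume,

    which means that storing the data and building the index take much less work.

  • By implementing a site search engine we get a glimpse of the whole from a small part,

    and can make an educated guess at how the large search engines work.

    The official Boost website happens to have no site search, so we can implement one ourselves.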

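Before implementing our search engine, we need to decide what its results should look like.

We searched Google, Bing and Baidu with "delicious food" (美食) as the keyword.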

google search delicious food.png

bing search delicious food.png

baidu search delicious food.png

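These searches show that most search engines build each result from three parts:

  1. The page title
  2. A summary of the page content
  3. The URL the result leads to

This settles the form in which our own search engine presents its results.

Clicking the title then takes the user to the target site.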

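2. How a search engine works at a high level

Before starting the project, we need to understand the overall principles behind a search engine,

so that the individual implementation steps raise fewer questions later.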

Macroscopic principle.png

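First, the server runs a web crawler over various websites, fetches page resources and stores them in a dedicated directory on the server's disk.

Our searcher then processes the fetched pages in two steps:

  1. Strip the tags and clean the data, keeping part of each page: in particular its title, its content and its URL
  2. Build the index, whose core purpose is to speed up page lookup (how good a search engine is depends largely on how much data it searches and how well the index is built)

Next, the client sends a search request from the browser over HTTP.

The HTTP request always carries the search keyword.

The request is submitted with the GET method, so the keyword reaches the server through the URL.

The searcher uses the keyword to query the index and obtains the matching HTML documents.

Finally, it assembles the matching pages into a new page and returns it to the client.

The search engine we implement covers the part inside the yellow box.

We do not implement the crawler, for well-known reasons; instead we download the resources we need through legitimate channels.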
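
To make the GET step concrete, here is a minimal sketch of how a keyword could be read out of a request target. The path /s, the parameter name word and the helper QueryParam are assumptions made for illustration; in this project the HTTP handling is left to cpp-httplib, and the sketch does not perform URL decoding.

```cpp
#include <string>

// Hypothetical helper: returns the raw (not URL-decoded) value of `key`
// in a request target such as "/s?word=split", or "" if it is absent.
std::string QueryParam(const std::string& target, const std::string& key) {
    std::size_t pos = target.find('?');
    if (pos == std::string::npos) return "";
    std::string query = target.substr(pos + 1);

    std::size_t start = 0;
    while (start <= query.size()) {
        // Each "name=value" pair ends at the next '&' or at the end of the query.
        std::size_t end = query.find('&', start);
        if (end == std::string::npos) end = query.size();

        std::string pair = query.substr(start, end - start);
        std::size_t eq = pair.find('=');
        if (eq != std::string::npos && pair.substr(0, eq) == key) {
            return pair.substr(eq + 1);
        }
        start = end + 1;
    }
    return "";
}
```

Under these assumptions, QueryParam("/s?word=split", "word") yields the keyword that is then handed to the searcher.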

3. Technology stack and project environment

  • Back end: C/C++, C++11, STL, Boost (a quasi-standard library), Jsoncpp, cppjieba, cpp-httplib
  • Front end: HTML5, CSS, JS, jQuery, Ajax
  • Environment: CentOS 7.6 cloud server, vim, gcc (g++), Makefile, Visual Studio Code

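4. How the search engine works in detail: forward index and inverted index

Suppose we have two documents:

  • Document 1: 妈妈买了六斤小米 ("Mom bought six jin of millet")
  • Document 2: 妈妈买了个小米手机 ("Mom bought a Xiaomi phone")

Forward index: maps a document ID to the document content (the keywords in the document).

Document ID    Document content
1              妈妈买了六斤小米
2              妈妈买了个小米手机

Once we have the content of the target documents, we need to segment it into words.

This makes it easier to build the inverted index and to run lookups.

The segmentation looks roughly like this:

  • Document 1 [妈妈买了六斤小米]: 妈妈/买/六斤/小米/六斤小米
  • Document 2 [妈妈买了个小米手机]: 妈妈/买/小米/手机/小米手机

Some words are left out during segmentation; these are called stop words.

They occur so often that using them as keywords would do little to tell documents apart, while raising the cost of building the index and therefore the cost of searching.

Common stop words: 了, 个, 的, 吗, a, the ...

Inverted index: segment the document content, collect the distinct keywords, and map each one to the IDs of the documents that contain it.

Keyword (unique)    Document IDs
妈妈                Document 1, Document 2
买                  Document 1, Document 2
六斤                Document 1
小米                Document 1, Document 2
六斤小米            Document 1
手机                Document 2
小米手机            Document 2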

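We can simulate one lookup using the forward and inverted indexes:

The user enters 小米 -> we look the keyword up in the inverted index -> this yields document IDs 1 and 2 -> we use the forward index to fetch the content of those documents -> we summarize the content into the form title + description + url -> finally we build the response and send it back.

One question remains: if a keyword matches several documents, in what order should we present them to the user?

We usually define a formula based on the relevance between a document and the keyword, use it to compute the keyword's weight for each document, and show the documents with higher weights first.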
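
The lookup described above can be sketched with the two example documents. This is a toy illustration only: the names kForward, kInverted and Lookup are hypothetical, the inverted index is filled in by hand from the table rather than by a tokenizer, and IDs are 0-based here, so Document 1 in the text has ID 0.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Forward index: the position in the vector plays the role of the document ID.
const std::vector<std::string> kForward = {
    "妈妈买了六斤小米",    // ID 0 (Document 1 in the text)
    "妈妈买了个小米手机",  // ID 1 (Document 2 in the text)
};

// Inverted index: keyword -> IDs of the documents that contain it.
// Stop words such as 了 and 个 are deliberately absent.
const std::unordered_map<std::string, std::vector<int>> kInverted = {
    {"妈妈", {0, 1}},   {"买", {0, 1}},   {"六斤", {0}},     {"小米", {0, 1}},
    {"六斤小米", {0}},  {"手机", {1}},    {"小米手机", {1}},
};

// Keyword -> document IDs (inverted index) -> document content (forward index).
std::vector<std::string> Lookup(const std::string& word) {
    std::vector<std::string> docs;
    auto it = kInverted.find(word);
    if (it == kInverted.end()) return docs;  // unknown word or stop word
    for (int id : it->second) docs.push_back(kForward[id]);
    return docs;
}
```

Looking up 小米 returns both documents, 手机 returns only the second, and a stop word such as 了 returns nothing.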

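5. Writing the parser module for tag stripping and data cleaning

  • We need to strip the tags from our data, so where does the data come from?

    We go to the official Boost website: https://www.boost.org

    image-20230514103424541.png

    Click Download, download the latest version of the Boost library (the paths below are from 1.78.0) and unpack it.

    The data files we need are all the HTML files under /boost_1_78_0/doc/*

  • Why do we need to strip the tags from the raw page data?

    Here is an arbitrary piece of HTML page code: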

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>What's Included in This Document</title>
        <link rel="stylesheet" href="../../doc/src/boostbook.css" type="text/css">
        <meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
        <link rel="home" href="index.html" title="The Boost C++ Libraries BoostBook Documentation Subset">
        <link rel="up" href="index.html" title="The Boost C++ Libraries BoostBook Documentation Subset">
        <link rel="prev" href="index.html" title="The Boost C++ Libraries BoostBook Documentation Subset">
        <link rel="next" href="libraries.html" title="Part I. The Boost C++ Libraries (BoostBook Subset)">
    </head>
    
    <body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
        <table cellpadding="2" width="100%">
            <tr>
                <td valign="top"><img alt="Boost C++ Libraries" width="277" height="86" src="../../boost.png"></td>
                <td align="center"><a href="../../index.html">Home</a></td>
                <td align="center"><a href="../../libs/libraries.htm">Libraries</a></td>
                <td align="center"><a href="http://www.boost.org/users/people.html">People</a></td>
                <td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td>
                <td align="center"><a href="../../more/index.htm">More</a></td>
            </tr>
        </table>
        <hr>
        <div class="spirit-nav">
            <a accesskey="p" href="index.html"><img src="../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u"
                href="index.html"><img src="../../doc/src/images/up.png" alt="Up"></a><a accesskey="h"
                href="index.html"><img src="../../doc/src/images/home.png" alt="Home"></a><a accesskey="n"
                href="libraries.html"><img src="../../doc/src/images/next.png" alt="Next"></a>
        </div>
        <div class="preface">
            <div class="titlepage">
                <div>
                    <div>
                        <h1 class="title">
                            <a name="about"></a>What's Included in This Document
                        </h1>
                    </div>
                </div>
            </div>
            <p>This document represents only a subset of the full Boost
                documentation: that part which is generated from BoostBook or
                QuickBook sources. Eventually all Boost libraries may use these
                formats, but in the meantime, much of Boost's documentation is not
                available here. Please
                see <a href="http://www.boost.org/libs" target="_top">http://www.boost.org/libs</a>
                for complete documentation.
            </p>
            <p>
                Documentation for some of the libraries described in this document is
                available in alternative formats:
            </p>
            <div class="itemizedlist">
                <ul class="itemizedlist" style="list-style-type: disc; ">
                    <li class="listitem"><a class="link" href="index.html"
                            title="The Boost C++ Libraries BoostBook Documentation Subset">HTML</a></li>
                </ul>
            </div>
            <p>
            </p>
            <div class="itemizedlist">
                <ul class="itemizedlist" style="list-style-type: disc; ">
                    <li class="listitem"><a href="http://sourceforge.net/projects/boost/files/boost-docs/"
                            target="_top">PDF</a></li>
                </ul>
            </div>
            <p>
            </p>
        </div>
        <table xmlns:rev="http://www.cs.rpi.edu/~gregod/boost/tools/doc/revision" width="100%">
            <tr>
                <td align="left"></td>
                <td align="right">
                    <div class="copyright-footer"></div>
                </td>
            </tr>
        </table>
        <hr>
        <div class="spirit-nav">
            <a accesskey="p" href="index.html"><img src="../../doc/src/images/prev.png" alt="Prev"></a><a accesskey="u"
                href="index.html"><img src="../../doc/src/images/up.png" alt="Up"></a><a accesskey="h"
                href="index.html"><img src="../../doc/src/images/home.png" alt="Home"></a><a accesskey="n"
                href="libraries.html"><img src="../../doc/src/images/next.png" alt="Next"></a>
        </div>
    </body>
    
    </html>

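    HTML has two kinds of tags:

    1. <></>: tags that appear in pairs
    2. <>: tags that usually appear on their own

    The tags themselves and what is written inside them have no value for searching.

    What remains once the tags are removed is the content we actually need to search.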
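
One common way to remove tags is a two-state scan over the characters. The sketch below is an assumption about how this could be done, not necessarily the project's code; the name StripTags is hypothetical. It keeps only the text outside <...> and turns newlines into spaces so that a document can later be stored on one line.

```cpp
#include <string>

// Keeps only the text outside <...>. Newlines become spaces so that the
// cleaned document contains no '\n'.
std::string StripTags(const std::string& html) {
    enum State { kLabel, kContent };
    State state = kContent;
    std::string out;
    for (char c : html) {
        if (state == kLabel) {
            // Inside a tag: skip everything until the closing '>'.
            if (c == '>') state = kContent;
        } else if (c == '<') {
            state = kLabel;
        } else {
            out.push_back(c == '\n' ? ' ' : c);
        }
    }
    return out;
}
```

Note that this simple scan inserts nothing where a tag used to be, so text from adjacent elements is glued together: `<h1>split</h1><p>boost</p>` becomes splitboost.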

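  • Create the directories for the raw Boost HTML files and for the cleaned, tag-free output

    Raw HTML file path: ~/boost_search_engine/data/input

    Cleaned file path: ~/boost_search_engine/raw_html

    Once the paths exist, we copy all the HTML files under /boost_1_78_0/doc/* into the raw HTML path.

    The following command counts the files under the input directory: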

    ls -Rl | grep -E '*.html' | wc -l
    8141

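    The goal of our parser is to strip the tags from all 8141 HTML files and write the results into a single file.

    The content of each file must not contain any \n, and title/content/url within a document are separated by \3.

    \3 is the character with ASCII code 3, defined as ETX (end of text),

    so it can also serve as the boundary between one document and the next.

    The reason for this choice is that the ASCII table is divided into control characters and printable characters.

    \3 is a control character and is not displayed on screen,

    whereas the text in our HTML files consists of printable characters, which are displayed.

    The separator therefore does not pollute the visible content of our documents.

    The result looks like xxxxxxxxx\3yyyyyyyyy\3zzzzzzzzz\3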

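  • Now that the source of the data and the basic idea of the parser are clear, we can start writing the parser.

    The parser has to work with files, but the C++11 standard library offers little support for file-system operations such as walking a directory tree,

    so we need to install the Boost development libraries manually: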

    sudo yum install -y boost-devel
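  • When writing the parser, we eventually reach the step of parsing an HTML file.

    We need to understand the overall structure of the HTML and which parts of it we want.

    1. First we parse the title. This tag pair appears only once in the whole HTML file.

      We extract the orange part and discard the red part.

      image-20230514151241200.png

    2. Next we parse the content. This step is essentially tag stripping.

      We extract the orange part and discard the red part.

      image-20230514152023116.png

    3. Finally we parse the URL.

      Here is a sample URL on the official Boost website:

      image-20230514215451229.png

      Here is the corresponding file in the downloaded copy (accumulators.html):

      image-20230514215702404.png

      Here is the same file after copying it from the downloaded Boost library into the project (accumulators.html):

      image-20230514215913240.png

      Comparing the three shows how they differ:

      a fixed URL prefix plus a URL suffix is enough to build a link to the official site.

      URL prefix: url_head = "https://www.boost.org/doc/libs/1_78_0/doc/html";

      URL suffix: url_tail = data/input/accumulators.html -> url_tail = /accumulators.html;

      Official link: url = url_head + url_tail;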
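
Steps 1 and 3 can be sketched as two small functions. The names ParseTitle and ParseUrl and the src_dir parameter are hypothetical; only url_head and the prefix-plus-suffix rule come from the text above.

```cpp
#include <string>

// Returns the text between <title> and </title>, or "" if either tag is missing.
std::string ParseTitle(const std::string& html) {
    const std::string open = "<title>";
    const std::string close = "</title>";
    std::size_t begin = html.find(open);
    if (begin == std::string::npos) return "";
    begin += open.size();
    std::size_t end = html.find(close, begin);
    if (end == std::string::npos) return "";
    return html.substr(begin, end - begin);
}

// Builds the official URL by replacing the local source directory (src_dir)
// at the front of file_path with the fixed url_head.
// Returns "" if file_path does not start with src_dir.
std::string ParseUrl(const std::string& file_path, const std::string& src_dir) {
    const std::string url_head = "https://www.boost.org/doc/libs/1_78_0/doc/html";
    if (file_path.compare(0, src_dir.size(), src_dir) != 0) return "";
    std::string url_tail = file_path.substr(src_dir.size());  // e.g. "/accumulators.html"
    return url_head + url_tail;
}
```

With src_dir set to data/input, the path data/input/accumulators.html maps to https://www.boost.org/doc/libs/1_78_0/doc/html/accumulators.html.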

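  • Finally we take the parsed content of every page and write it to a file under ~/boost_search_engine/raw_html

    Earlier we described one way to separate pages: using '\3' directly as the separator.

    But when writing the data to a file we also have to make the next read convenient,

    so we use the following layout instead:

    title\3content\3url \n title\3content\3url \n title\3content\3url \n

    '\3' separates title, content and url, and '\n' separates one page from the next.

    This lets a single getline() call read one complete document.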
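
A minimal sketch of this layout, working on in-memory strings rather than files so that it stays self-contained. ParsedDoc, Serialize and ReadRecords are hypothetical names; the sketch assumes that titles, contents and URLs contain neither '\3' nor '\n'.

```cpp
#include <sstream>
#include <string>
#include <vector>

struct ParsedDoc {
    std::string title;
    std::string content;
    std::string url;
};

const char kSep = '\3';  // separates title, content and url inside one record

// Writes every document as "title\3content\3url\n".
std::string Serialize(const std::vector<ParsedDoc>& docs) {
    std::string out;
    for (const ParsedDoc& d : docs) {
        out += d.title;
        out += kSep;
        out += d.content;
        out += kSep;
        out += d.url;
        out += '\n';  // one document per line
    }
    return out;
}

// Reads the records back: each std::getline call yields one whole document.
std::vector<std::string> ReadRecords(const std::string& data) {
    std::istringstream in(data);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) lines.push_back(line);
    return lines;
}
```

The same loop works on a std::ifstream opened on the output file, since std::getline accepts any input stream.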

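6. Writing the index module

  • Basic structure of the index

    We need to build both a forward index and an inverted index.

    The forward index returns content for a given ID,

    so we represent it with the following structure:

    #include <cstdint>
    #include <string>
    #include <vector>

    struct DocInfo
    {
        std::string title;   // document title
        std::string content; // document content after tag stripping
        std::string url;     // URL of the document on the official site
        uint64_t doc_id;     // document ID, kept here to simplify building the inverted index
    };

    // The forward index is an array: the array subscript is naturally the document ID
    std::vector<DocInfo> forward_index; // forward index

    The inverted index returns IDs for a given keyword,

    so we represent it with the following structure:

    #include <unordered_map>

    struct InvertedElem
    {
        uint64_t doc_id;  // document ID
        std::string word; // keyword
        int weight;       // weight
    };

    // Inverted list (posting list)
    typedef std::vector<InvertedElem> InvertedList;

    // The inverted index maps one keyword to one or more InvertedElem [keyword -> inverted list]
    std::unordered_map<std::string, InvertedList> inverted_index;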
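  • Steps for building the forward index

    We read the content of the raw_html file

    and parse it. The parser has already stripped the tags from the HTML sources

    and written the tag-free data in a fixed format,

    so we split each record on the separator ('\3')

    and fill the pieces into the DocInfo structure defined above.

    We use split from the Boost library for this step.

    Note that doc_id is simply the subscript at which the DocInfo is inserted into forward_index.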
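
The steps above can be sketched as follows. To stay self-contained the sketch uses a small hand-written Split as a stand-in for the Boost split the project uses; Split and BuildForwardIndex are hypothetical names, and DocInfo is repeated from the structure shown earlier.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct DocInfo {
    std::string title;
    std::string content;
    std::string url;
    uint64_t doc_id;
};

// Splits `line` on `sep`. A stdlib stand-in for the Boost split used in the project.
std::vector<std::string> Split(const std::string& line, char sep) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true) {
        std::size_t pos = line.find(sep, start);
        if (pos == std::string::npos) {
            parts.push_back(line.substr(start));
            break;
        }
        parts.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}

// Parses one "title\3content\3url" record and appends it to forward_index.
// Returns false and leaves the index untouched if the record is malformed.
bool BuildForwardIndex(const std::string& line, std::vector<DocInfo>& forward_index) {
    std::vector<std::string> parts = Split(line, '\3');
    if (parts.size() != 3) return false;
    DocInfo doc;
    doc.title = parts[0];
    doc.content = parts[1];
    doc.url = parts[2];
    doc.doc_id = forward_index.size();  // the ID is the subscript the doc is about to occupy
    forward_index.push_back(doc);
    return true;
}
```

Because doc_id is taken from forward_index.size() before the push, forward_index[doc_id] always returns the document with that ID.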

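  • Steps for building the inverted index

    The inverted index is built from the elements of the forward index, that is, from DocInfo.

    It also requires segmenting the title and content of each DocInfo into words,

    so we use the cppjieba segmentation library here.

    Download: https://github.com/yanyiwu/cppjieba.git

    Note that cppjieba depends on one more library, cppjieba/deps/limonp,

    which has to be downloaded separately.

    Download: https://github.com/yanyiwu/limonp.git

    Also note that the limonp library has to be placed under include/cppjieba/

    After segmentation we count how often each word occurs,

    so we use the following structure:

    struct word_cnt
    {
        int title_cnt;   // occurrences in the title
        int content_cnt; // occurrences in the content
    };
    std::unordered_map<std::string, word_cnt> word_map;

    word_cnt records how many times a word occurs in the title and in the content, i.e. its frequency.

    word_map is the temporary table that maps each word to its frequency.

    We then iterate over word_map and create an InvertedElem for each word.

    Each InvertedElem is appended to the inverted_list of its keyword, which forms that keyword's inverted list.

    Later, inverted_index returns the same inverted list whenever it is queried with the same keyword.

    PS: case must be ignored when building the inverted index!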
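
A sketch of these steps for a single document. Several parts are assumptions: Tokenize is a whitespace tokenizer standing in for cppjieba, WordCnt and BuildInvertedIndex are hypothetical names, and the weight formula 10 * title_cnt + content_cnt is a guess chosen to be consistent with the weights discussed in section 7, not a formula stated in this README.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

struct InvertedElem {
    uint64_t doc_id;
    std::string word;
    int weight;
};
typedef std::vector<InvertedElem> InvertedList;

struct WordCnt {
    int title_cnt = 0;
    int content_cnt = 0;
};

// Stand-in tokenizer: splits on whitespace and lowercases every word,
// which is how case is ignored. The project uses cppjieba instead.
std::vector<std::string> Tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) {
        std::transform(w.begin(), w.end(), w.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        words.push_back(w);
    }
    return words;
}

// Counts word frequencies for one document, then appends one InvertedElem
// per distinct word to that word's inverted list.
void BuildInvertedIndex(uint64_t doc_id, const std::string& title, const std::string& content,
                        std::unordered_map<std::string, InvertedList>& inverted_index) {
    std::unordered_map<std::string, WordCnt> word_map;
    for (const std::string& w : Tokenize(title)) word_map[w].title_cnt++;
    for (const std::string& w : Tokenize(content)) word_map[w].content_cnt++;

    for (const auto& kv : word_map) {
        InvertedElem elem;
        elem.doc_id = doc_id;
        elem.word = kv.first;
        // Assumed weighting: a title hit counts ten times as much as a content hit.
        elem.weight = 10 * kv.second.title_cnt + kv.second.content_cnt;
        inverted_index[kv.first].push_back(elem);
    }
}
```

Since every word is lowercased before counting, a query must be lowercased in the same way before it is looked up.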

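7. Writing the search module searcher

  • With the index module written, we can move on to the search module.

In the search module we use the index module to create an index object,

so that every word has its own inverted list.

The job of the search module is to take an input string, segment the whole string with jieba,

and fetch the inverted list of every resulting word.

We merge all the inverted lists and sort their elements by weight in descending order.

Each element then gives us a position in the forward index,

which is how we obtain the documents for the query words.

Finally we build a JSON string from those documents and return it.

This string contains the documents for all of the query words.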
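
The merge-and-sort step can be sketched as below. The function name Search and its signature are hypothetical, the query is assumed to be segmented already, and the final JSON step (done with Jsoncpp in the project) is left out.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct InvertedElem {
    uint64_t doc_id;
    std::string word;
    int weight;
};
typedef std::vector<InvertedElem> InvertedList;

// Collects the inverted lists of all query words into one list and
// sorts it by weight, highest first. Unknown words are skipped.
std::vector<InvertedElem> Search(
    const std::vector<std::string>& words,
    const std::unordered_map<std::string, InvertedList>& inverted_index) {
    std::vector<InvertedElem> all;
    for (const std::string& w : words) {
        auto it = inverted_index.find(w);
        if (it == inverted_index.end()) continue;
        all.insert(all.end(), it->second.begin(), it->second.end());
    }
    std::sort(all.begin(), all.end(),
              [](const InvertedElem& a, const InvertedElem& b) { return a.weight > b.weight; });
    return all;
}
```

Each returned element carries a doc_id, which is the subscript to use in the forward index when building the response.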

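  • After finishing the searcher module, testing revealed a problem.

With the search program running, we entered split; it returned the inverted list for that string as JSON.

image-20230518201938423.png

We picked a page with a weight of 13 and opened it.

image-20230518202031755.png

The word split appears 4 times in the page body; with the 1 occurrence in the title, the weight of this page should in theory be 15. So why is it 13?

PS: tags are stripped from the whole file, title included, so a word that appears in the title is always counted twice: once as a title word and once as a content word.

One occurrence in the title therefore contributes a weight of 11.

image-20230518202235167.png

We used the doc_id in the inverted element to locate this document.

In the words circled in orange, our tag-stripping step has fused several words into one.

The occurrences match what grep finds, but jieba

cannot split a fused word back into its parts, so the count for split goes down.

Note also that the fused word splitboost joins the title word to another word.

So the rule above, that a word in the title is counted once as a title word and once as a content word,

does not hold here: the title word is not counted as a content word.

Leaving out the words circled in orange gives a final weight of 13.

So the mismatch between the weight and the number of occurrences on the page comes from two things:

jieba cannot segment fused words, and our tag stripping is not thorough enough.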

image-20230518205741136.png

image-20230518201319921.png

image-20230518204503233.png

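8. Writing the network service module http_server

  • Bringing in the cpp-httplib library

    The network service module is built on the cpp-httplib library.

    Download: cpp-httplib

    We use version 0.7.15, which is relatively stable; the latest version may place higher demands on the compiler.

    image-20230518220323743.png

    After downloading the archive, uploading it to the server and unpacking it we get the files below; the one we mainly use is httplib.h

    image-20230518220656208.png

    cpp-httplib needs a fairly recent version of gcc.

    With an old compiler it either fails to compile or crashes at run time.

    The default gcc version on CentOS 7 is 4.8.5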

    image-20230518220906063.png

  • Upgrading gcc

    First install scl:

    sudo yum -y install centos-release-scl scl-utils-build

    Then install the newer gcc:

    sudo yum -y install devtoolset-7-gcc devtoolset-7-gcc-c++

    Check that the installation succeeded:

    image-20230518221823851.png

    Enable the new version with:

    scl enable devtoolset-7 bash

    The version is now 7.3.1

    image-20230518221948446.png

    Enabling it from the command line only lasts for the current session; a new session goes back to the original version.

    image-20230518220906063.png

    To avoid typing the command every time,

    add the following line to ~/.bash_profile

    scl enable devtoolset-7 bash

    image-20230518222507044.png

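9. Writing the front-end code

  • Before writing the front-end code we need a rough idea of how the page should be laid out.

    After looking at the pages of several search engines, and within my current abilities,

    I settled on roughly the following design:

    image-20230520125654550.png

  • The parts of the front end that talk to the back end have to be written in JS.

    Writing them in plain JS would cost too much effort, so we choose jQuery here.

    Once the jQuery link is included in the page's HTML file, we can use it directly.

    image-20230520130247858.png

  • After some coding, the page looks like this:

    image-20230520131057125.png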

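10. Follow-up

  • At this point the project is largely complete,

    but a few small problems remain. Suppose the search query is:

    你是一个好人

    jieba segments it into 你/是/一个/好人

    If a single document contains all four words, the results page lists that document more than once.

    To confirm this we run the following test.

    We add a test.html to the folder that holds the project's raw source data

    image-20230520135810178.png

    and change its text to 你是一个好人

    image-20230520140210985.png

    Because this is newly added raw HTML data,

    we have to clean the data again by rerunning the parser module.

    Searching for 你是一个好人 now returns the same link several times.

    image-20230520140832809.png

    image-20230520140959243.png

    We therefore need to deduplicate the search results.

    In the back-end search module we fetch all the inverted lists for the keywords,

    then deduplicate the inverted elements they contain:

    elements with the same ID are merged into one,

    and their weights are added together.

    The final result looks like this:

    image-20230520144333570.png

    image-20230520144357956.png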
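
The deduplication described in this item can be sketched as follows. MergedElem and MergeByDoc are hypothetical names; keeping the list of matched words per document is an addition for illustration that the text does not require.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct InvertedElem {
    uint64_t doc_id;
    std::string word;
    int weight;
};

// One search result per document after merging.
struct MergedElem {
    uint64_t doc_id = 0;
    int weight = 0;                  // sum of the weights of all matched words
    std::vector<std::string> words;  // every query word that hit this document
};

// Merges elements that share a doc_id, adds up their weights and
// returns the merged elements sorted by total weight, highest first.
std::vector<MergedElem> MergeByDoc(const std::vector<InvertedElem>& elems) {
    std::unordered_map<uint64_t, MergedElem> by_doc;
    for (const InvertedElem& e : elems) {
        MergedElem& m = by_doc[e.doc_id];
        m.doc_id = e.doc_id;
        m.weight += e.weight;
        m.words.push_back(e.word);
    }
    std::vector<MergedElem> merged;
    for (const auto& kv : by_doc) merged.push_back(kv.second);
    std::sort(merged.begin(), merged.end(),
              [](const MergedElem& a, const MergedElem& b) { return a.weight > b.weight; });
    return merged;
}
```

With this step a document that matches all four words of 你是一个好人 appears once, with a weight equal to the sum of its four individual weights.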

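  • Adding logging

    We add a log that records abnormal conditions while the program is running.

    The result looks like this:

    image-20230520161753889.png

    image-20230520161808535.png

  • Finally we deploy the service on a Linux machine so that the site can be used for searching anytime, anywhere.

    The following command deploys the service:

    nohup ./http_server > log.txt 2>&1 &

    It redirects all log output to log.txt

    The result looks like this:

    image-20230520163712428.png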

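  • Removing stop words

    At the moment our search engine still returns results when the query is a stop word.

    Take is as an example: stop words like this occur so often that using them as keywords would do little to tell documents apart, while raising the cost of building the index and therefore the cost of searching.

    image-20230520215114937.png

    Searching with is as the keyword matches more than 3000 documents.

    image-20230520215239186.png

    After removing the stop words, we search for is again

    and, as shown below, nothing is found.

    image-20230520220745121.png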
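
One simple way to drop stop words is to filter the segmented words against a set, both when building the index and when handling a query. The sketch below assumes the stop words are already loaded into a set; RemoveStopWords is a hypothetical name.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Removes every word that appears in stop_words, keeping the order of the rest.
void RemoveStopWords(std::vector<std::string>& words,
                     const std::unordered_set<std::string>& stop_words) {
    words.erase(std::remove_if(words.begin(), words.end(),
                               [&stop_words](const std::string& w) {
                                   return stop_words.count(w) > 0;
                               }),
                words.end());
}
```

A query that consists only of stop words, such as is, ends up as an empty word list, so no inverted list is fetched and nothing is returned.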

