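Index a logging dataset locally

In this guide, we will index about 20 million log entries (7 GB decompressed) on a local machine. To run a server with several search nodes on AWS S3, see the distributed search tutorial.

Here is an example of a log entry: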

{
  "timestamp": 1460530013,
  "severity_text": "INFO",
  "body": "PacketResponder: BP-108841162-10.10.34.11-1440074360971:blk_1074072698_331874, type=HAS_DOWNSTREAM_IN_PIPELINE terminating",
  "resource": {
    "service": "datanode/01"
  },
  "attributes": {
    "class": "org.apache.hadoop.hdfs.server.datanode.DataNode"
  },
  "tenant_id": 58
}

Install

Let's download and install Quickwit.

curl -L https://install.quickwit.io | sh
cd quickwit-v*/

Alternatively, pull and run the Quickwit binary in an isolated Docker container.

docker run quickwit/quickwit --version

Start a Quickwit server

CLI

./quickwit run

Docker

docker run --rm -v $(pwd)/qwdata:/quickwit/qwdata -p 127.0.0.1:7280:7280 quickwit/quickwit run

If you are on macOS with Apple silicon, you may need to specify the platform with the --platform linux/amd64 flag. You can also safely ignore the jemalloc warnings.

Create your index

Let's create an index configured to receive these logs.

# First, download the hdfs logs config from the Quickwit repository.
curl -o hdfs_logs_index_config.yaml https://raw.githubusercontent.com/quickwit-oss/quickwit/main/config/tutorials/hdfs-logs/index-config.yaml

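The index config defines five fields: timestamp, tenant_id, severity_text, body, and one JSON field, resource, for the nested value resource.service. We could have used an object field here and maintained a fixed schema, but for convenience we use a JSON field. The config also sets default_search_fields, tag_fields, and timestamp_field. Quickwit uses timestamp_field and tag_fields to prune splits at query time, which speeds up searches. See the index config documentation for more details.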

hdfs-logs-index.yaml
version: 0.7

index_id: hdfs-logs

doc_mapping:
  field_mappings:
    - name: timestamp
      type: datetime
      input_formats:
        - unix_timestamp
      output_format: unix_timestamp_secs
      fast_precision: seconds
      fast: true
    - name: tenant_id
      type: u64
    - name: severity_text
      type: text
      tokenizer: raw
    - name: body
      type: text
      tokenizer: default
      record: position
    - name: resource
      type: json
      tokenizer: raw
  tag_fields: [tenant_id]
  timestamp_field: timestamp

search_settings:
  default_search_fields: [severity_text, body]
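
To illustrate how tag_fields can help at query time, here is a toy sketch of tag-based split pruning. The split ids and tag lists are invented for illustration, and the matching logic is a simplification under those assumptions, not Quickwit's implementation.

```shell
# Hypothetical split metadata: a split id followed by the tenant_id tags it contains.
split_tags='split-a tenant_id:12 tenant_id:58
split-b tenant_id:22
split-c tenant_id:58 tenant_id:71'

# Print the id of every split whose tag list contains the requested tag.
splits_with_tag() {
  awk -v tag="$1" '{ for (i = 2; i <= NF; i++) if ($i == tag) { print $1; break } }'
}

# Only these splits would need to be searched for a query on tenant_id:58.
candidates=$(printf '%s\n' "$split_tags" | splits_with_tag 'tenant_id:58' | tr '\n' ' ' | sed 's/ $//')
echo "splits to search: $candidates"
```

In this sketch, split-b is skipped without being opened because none of its tags match the query.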

Now let's create the index with the create subcommand (assuming you are inside the Quickwit install directory):

CLI

./quickwit index create --index-config hdfs_logs_index_config.yaml

cURL

curl -XPOST http://localhost:7280/api/v1/indexes -H "content-type: application/yaml" --data-binary @hdfs_logs_index_config.yaml

You are now ready to fill the index.

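Index logs

The dataset is a compressed NDJSON file. Instead of downloading it first and indexing it afterwards, we use pipes to send the decompressed stream directly to Quickwit. This can take up to 10 minutes, which is a good moment for a coffee break.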

CLI

curl https://quickwit-datasets-public.s3.amazonaws.com/hdfs-logs-multitenants.json.gz | gunzip | ./quickwit index ingest --index hdfs-logs

Docker

curl https://quickwit-datasets-public.s3.amazonaws.com/hdfs-logs-multitenants.json.gz | gunzip | docker run -v $(pwd)/qwdata:/quickwit/qwdata -i quickwit/quickwit index ingest --index hdfs-logs
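
The streaming pattern in the commands above can be tried offline on a tiny sample before starting the full download. The sketch below uses made-up log lines and a temporary directory, both assumptions for illustration. It only counts documents and does not call Quickwit.

```shell
# Build a tiny gzip-compressed NDJSON sample (hypothetical data, one JSON document per line).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '{"timestamp": 1460530013, "severity_text": "INFO", "body": "first entry", "tenant_id": 58}' \
  '{"timestamp": 1460530014, "severity_text": "WARN", "body": "second entry", "tenant_id": 58}' \
  | gzip -c > "$tmpdir/sample.json.gz"

# Decompress as a stream and count the documents; the uncompressed file is never written to disk.
doc_count=$(gunzip -c "$tmpdir/sample.json.gz" | wc -l | tr -d ' ')
echo "documents in stream: $doc_count"

# Remove the temporary sample.
rm -rf "$tmpdir"
```

Because NDJSON stores one document per line, the line count of the decompressed stream is the number of documents a consumer at the end of the pipe would receive.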

If you are in a hurry, use the sample dataset of 10,000 documents instead. The example queries below use this dataset:

CLI

curl https://quickwit-datasets-public.s3.amazonaws.com/hdfs-logs-multitenants-10000.json | ./quickwit index ingest --index hdfs-logs

Docker

On macOS or Windows:

curl https://quickwit-datasets-public.s3.amazonaws.com/hdfs-logs-multitenants-10000.json | docker run -v $(pwd)/qwdata:/quickwit/qwdata -i quickwit/quickwit index ingest --index hdfs-logs --endpoint http://host.docker.internal:7280

On Linux:

curl https://quickwit-datasets-public.s3.amazonaws.com/hdfs-logs-multitenants-10000.json | docker run --network=host -v $(pwd)/qwdata:/quickwit/qwdata -i quickwit/quickwit index ingest --index hdfs-logs --endpoint http://127.0.0.1:7280

cURL

wget https://quickwit-datasets-public.s3.amazonaws.com/hdfs-logs-multitenants-10000.json
curl -XPOST http://localhost:7280/api/v1/hdfs-logs/ingest -H "content-type: application/json" --data-binary @hdfs-logs-multitenants-10000.json

You can check that indexing is working by searching for INFO in the severity_text field:

CLI

./quickwit index search --index hdfs-logs --query "severity_text:INFO"

Docker

On macOS or Windows:

docker run -v $(pwd)/qwdata:/quickwit/qwdata quickwit/quickwit index search --index hdfs-logs --query "severity_text:INFO" --endpoint http://host.docker.internal:7280

On Linux:

docker run --network=host -v $(pwd)/qwdata:/quickwit/qwdata quickwit/quickwit index search --index hdfs-logs --query "severity_text:INFO" --endpoint http://127.0.0.1:7280

Note

The ingest subcommand generates splits of 5 million documents. Each split represents a small index and holds its index files and metadata files.

The query returns the following JSON result:

{
  "num_hits": 10000,
  "hits": [
    {
      "body": "Receiving BP-108841162-10.10.34.11-1440074360971:blk_1073836032_95208 src: /10.10.34.20:60300 dest: /10.10.34.13:50010",
      "resource": {
        "service": "datanode/03"
      },
      "severity_text": "INFO",
      "tenant_id": 58,
      "timestamp": 1440670490
    }
    ...
  ],
  "elapsed_time_micros": 2490
}

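Because the index config declares a timestamp field, we can pass the start_timestamp and end_timestamp parameters and benefit from time pruning. Behind the scenes, Quickwit only queries the splits that contain logs in this time range.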
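
The idea behind time pruning can be sketched with a few lines of shell. The split ids and their minimum and maximum timestamps below are invented, and treating the end bound as exclusive is an assumption of this sketch. It shows the general technique only, not Quickwit's internal code.

```shell
# Hypothetical split metadata: split id, minimum timestamp, maximum timestamp.
splits='split-a 1440000000 1440999999
split-b 1441000000 1449999999
split-c 1455000000 1459999999'

# Keep the splits whose [min, max] range overlaps the requested [start, end) range.
prune_by_time() {
  awk -v start="$1" -v end="$2" '$3 >= start && $2 < end { print $1 }'
}

selected=$(printf '%s\n' "$splits" | prune_by_time 1440670490 1450670490 | tr '\n' ' ' | sed 's/ $//')
echo "splits to search: $selected"
```

Here split-c starts after the requested range ends, so it is excluded before any of its documents are read.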

Let's use these parameters in the following query:

curl 'http://127.0.0.1:7280/api/v1/hdfs-logs/search?query=severity_text:INFO&start_timestamp=1440670490&end_timestamp=1450670490'
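
When the query text contains spaces or other special characters, hand-writing the URL becomes error-prone. The sketch below is one possible way to let curl encode the parameters, using its -G and --data-urlencode options. The function name search_time_range and the variable names are hypothetical, and the endpoint is assumed to be the local server started earlier in this tutorial.

```shell
# Assumed values: the local endpoint, index id, and time range used in this tutorial.
endpoint="http://127.0.0.1:7280"
index_id="hdfs-logs"
start_ts=1440670490
end_ts=1450670490

# Width of the time window in whole days (integer division).
window_days=$(( (end_ts - start_ts) / 86400 ))
echo "time window: about $window_days days"

# Hypothetical helper. It is defined but not called here, so the block runs without a server.
search_time_range() {
  # -G turns the --data-urlencode pairs into a URL query string on a GET request.
  curl -sS -G "$endpoint/api/v1/$index_id/search" \
    --data-urlencode "query=$1" \
    --data-urlencode "start_timestamp=$start_ts" \
    --data-urlencode "end_timestamp=$end_ts"
}

# Usage, once the Quickwit server is running:
#   search_time_range 'severity_text:INFO'
```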

Clean up

Let's clean up by deleting the index:

CLI

./quickwit index delete --index hdfs-logs

cURL

curl -XDELETE http://127.0.0.1:7280/api/v1/indexes/hdfs-logs

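Congratulations! You have finished this tutorial!

To continue your Quickwit journey, see the distributed search tutorial, or dig into the search REST API or the query language.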
