Distributed indexing with SolrCloud

Contents

Introduction to Solr and SolrCloud
Setting up Solr with SolrCloud
SolrCloud in action
Conclusion

Introduction to Solr and SolrCloud

Solr

Solr is an open source enterprise search platform written in Java and built on Apache Lucene. It is a search engine with many powerful features, such as real-time indexing, faceted search, dynamic clustering with SolrCloud, database integration, NoSQL features, and handling of rich document formats (e.g., Word, PDF).

How is Solr used? Solr provides the search function for your service by syncing data from your database into Solr. The database is the heart of a service, so if we pump too much into that heart (too many queries, that is), the risk of a heart attack goes up and the service becomes unstable. Solr plays the role of a pacemaker: search-style queries against a block of data (think of a table) are taken care of by Solr instead of the service's database.

Besides Solr, another search engine that is becoming more popular is Elasticsearch (both are built on Lucene).

SolrCloud

I mentioned Solr's dynamic clustering feature above. To use it, Solr has to run in cloud mode. When the volume of documents grows beyond what a single node (one server) can handle, we have to scale out into a distributed system to deal with the following:

Data volume: the data is split into shards instead of being stored whole, so a Solr node does not run into data that is too large to store. (Note that the number of shards is fixed, so if a shard grows to the point where it may get out of hand, we have to split it ourselves, manually, through Solr's Collections API.)
Fault tolerance: when a node in the cluster dies or loses its connection, the system keeps running thanks to replication. Each shard is stored on several different nodes, so the data stays available.
High availability: automatic rebalancing.
Central configuration: with SolrCloud, all config for the whole cluster is managed centrally in a ZooKeeper cluster (ZooKeeper is a centralized service for managing configuration, naming, and distributed synchronization). When the config changes, we only need to change it on ZooKeeper and every node in the cluster picks up the same config.
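
The manual shard split mentioned in the first item goes through the Collections API's SPLITSHARD action. As a rough sketch, the request URL could be built like this. The host, port, collection name, and shard name are placeholder assumptions, and the class only builds and prints the URL; sending it (with curl or any HTTP client) is left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SplitShardRequest {

    // Builds a Collections API URL asking Solr to split one shard of a collection.
    // baseUrl, collection and shard are supplied by the caller (placeholders below).
    static String splitShardUrl(String baseUrl, String collection, String shard) {
        return baseUrl + "/admin/collections?action=SPLITSHARD"
                + "&collection=" + encode(collection)
                + "&shard=" + encode(shard);
    }

    // URL-encodes a single query parameter value.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Assumed values: a node on localhost:8983 and a collection named "test".
        System.out.println(splitShardUrl("http://localhost:8983/solr", "test", "shard1"));
    }
}
```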

Some SolrCloud terminology

Node: not a server but one instance (one process) of Solr. A server (machine) typically hosts about 1 to 4 nodes.
Cluster: all the nodes connected together form a cluster.
Collection: a logical index, representing one data set.
Shard: a collection is divided into several shards (logical partitions of the collection).
Replica: a copy of a shard.
Leader: one of a shard's replicas; it distributes update requests to the remaining replicas. Each shard has exactly one leader.
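
To make the collection/shard relationship a bit more concrete, here is a toy sketch of how a document id could be mapped to one shard. This is a deliberate simplification and an assumption on my part: Solr's own router hashes the id and assigns hash ranges to shards, while the modulo below only illustrates that every document lands in exactly one shard.

```java
public class ShardRoutingSketch {

    // Simplified stand-in for a router: maps a document id to a shard index.
    // Not Solr's real algorithm; it only shows that the mapping is deterministic
    // and that each id belongs to exactly one of the numShards partitions.
    static int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 2; // same shard count as the examples in this post
        String[] ids = {"doc-1", "doc-2", "doc-3", "doc-4"}; // hypothetical ids
        for (String id : ids) {
            // Shard names in Solr start at shard1, so add 1 for display.
            System.out.println(id + " -> shard" + (shardFor(id, numShards) + 1));
        }
    }
}
```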

SolrCloud in action

Basic steps to run in cloud mode

Solr ships with a few examples that let users quickly try out configurations and run modes. You can list them with:

$ bin/solr start -help
-e <example>  Name of the example to run; available examples:
      cloud:         SolrCloud example
      techproducts:  Comprehensive example illustrating many of Solr's core capabilities
      dih:           Data Import Handler
      schemaless:    Schema-less example

For cloud mode, run the following command:
bin/solr -e cloud

See the link for details.

When you run an example, or don't specify a ZooKeeper cluster host, Solr uses its embedded ZooKeeper. However, when that Solr process stops, the ZooKeeper process stops with it, and the other nodes can no longer connect.

With the cloud example's defaults, it creates a collection named gettingstarted, which we can see in Solr's web UI.

In the screenshot above (collection item) we can see the following:

The collection item is split into 2 shards, shard1 and shard2.
Each shard has 3 replicas, stored on the nodes shown in the screenshot.

Setting up cloud mode for real

The steps are:

Install Java (Solr 6.5.1 requires Java 8) and download Solr. See the link.
Set up a ZooKeeper cluster (at least 3 nodes): see the reference. The basics are as follows:

① Download ZooKeeper and install Java 8
② Set up the config file <ZK_HOME_PATH>/conf/zoo.cfg
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=tokushop.zk1.jp:2888:3888
server.2=tokushop.zk2.jp:2888:3888
server.3=tokushop.zk3.jp:2888:3888
③ Create the ID file (myid) for each node. The ID must be unique and match the number in that node's server.N line in zoo.cfg
echo "<id>" > /var/lib/zookeeper/myid
④ Start ZooKeeper
$ <ZK_HOME_PATH>/bin/zkServer.sh start
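
Why at least 3 ZooKeeper nodes? ZooKeeper only keeps serving while a strict majority of the ensemble is up. The small sketch below just does that arithmetic; the class and method names are mine, not part of ZooKeeper.

```java
public class ZkQuorum {

    // Smallest number of servers that forms a strict majority of the ensemble.
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Number of servers that can be down while a majority is still reachable.
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 3, 4, 5}) {
            System.out.println(n + " servers: quorum = " + quorumSize(n)
                    + ", tolerated failures = " + toleratedFailures(n));
        }
    }
}
```

With 3 servers the ensemble survives 1 failure, and a 4th server does not raise that number, which is why odd ensemble sizes are the usual choice.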

Upload the solr.xml file to ZooKeeper

bin/solr zk cp <FILE_PATH>/solr.xml zk:/ -z <ZooKeeper_Host>:2181

Contents of the (default) solr.xml file

<solr>
  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <str name="hostContext">${hostContext:solr}</str>
    <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
    <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
    <str name="zkCredentialsProvider">${zkCredentialsProvider:org.apache.solr.common.cloud.DefaultZkCredentialsProvider}</str>
    <str name="zkACLProvider">${zkACLProvider:org.apache.solr.common.cloud.DefaultZkACLProvider}</str>
  </solrcloud>
  <shardHandlerFactory name="shardHandlerFactory"
    class="HttpShardHandlerFactory">
    <int name="socketTimeout">${socketTimeout:600000}</int>
    <int name="connTimeout">${connTimeout:60000}</int>
  </shardHandlerFactory>
</solr>

Create directories for the Solr nodes: the following example creates them for 2 nodes

mkdir -p /opt/solr/node1/solr
mkdir -p /opt/solr/node2/solr
chown -R solr:solr /opt/solr

Note: because solr.xml has been pushed to ZooKeeper, the directories for the Solr nodes can be left empty when they are created.
Start the nodes: the following example starts 2 nodes on ports 8983 and 7475

bin/solr start -cloud -p 8983 -s "/opt/solr/node1/solr" -z <zk_host1>:2181,<zk_host2>:2181,<zk_host3>:2181
bin/solr start -cloud -p 7475 -s "/opt/solr/node2/solr" -z <zk_host1>:2181,<zk_host2>:2181,<zk_host3>:2181

Create a collection: create the collection test with 2 shards and 3 replicas per shard

bin/solr create -c test -shards 2 -replicationFactor 3
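
It helps to count what this command asks for: 2 shards times 3 replicas is 6 cores, spread over however many nodes are running. The sketch below does that arithmetic and also builds the equivalent Collections API CREATE request. It assumes a config set is already stored in ZooKeeper; the config name "test" and the localhost URL are placeholders, and parameter values are assumed to be URL-safe.

```java
public class CreateCollectionPlan {

    // Total number of cores (shard replicas) the collection will consist of.
    static int totalCores(int numShards, int replicationFactor) {
        return numShards * replicationFactor;
    }

    // Cores each node must host if they are spread evenly (rounded up).
    static int coresPerNode(int totalCores, int nodeCount) {
        return (totalCores + nodeCount - 1) / nodeCount;
    }

    // Builds a Collections API CREATE URL; values are assumed to be URL-safe.
    static String createUrl(String baseUrl, String name, String configName,
                            int numShards, int replicationFactor, int maxShardsPerNode) {
        return baseUrl + "/admin/collections?action=CREATE"
                + "&name=" + name
                + "&collection.configName=" + configName
                + "&numShards=" + numShards
                + "&replicationFactor=" + replicationFactor
                + "&maxShardsPerNode=" + maxShardsPerNode;
    }

    public static void main(String[] args) {
        int cores = totalCores(2, 3);           // 2 shards, 3 replicas each
        int perNode = coresPerNode(cores, 2);   // the 2 nodes started above
        System.out.println("cores = " + cores + ", per node = " + perNode);
        System.out.println(createUrl("http://localhost:8983/solr", "test", "test", 2, 3, perNode));
    }
}
```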

In Solr's web UI at http://<solr_ip>:8983/solr/#/~cloud?view=tree we can see that all of the collection's config is stored in ZooKeeper.

Defining the schema and uploading data to Solr

Some concepts:

Document: Solr's unit of information, a set of data. Think of it as one record made up of fields (a user's details, a product's details, ...).
Field: a data field. For example, a user's information has a family name, a given name, and a date of birth, and each of those is a field. A field carries the information that tells Solr how its data is analyzed, indexed, and filtered. Example of defining a field type (the <fieldType> tag), which defines how fields of this type are analyzed and filtered (filtering out stop words, i.e. words that carry no meaning, and so on):

<fieldType name="text_ja" class="solr.TextField" autoGeneratePhraseQueries="false" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
    <filter class="solr.JapaneseBaseFormFilterFactory"/>
    <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_ja.txt" ignoreCase="true"/>
    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
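
The analyzer above is a pipeline: a tokenizer cuts the text into tokens, then each filter transforms or drops tokens in turn. The toy sketch below mimics that idea in plain Java with a whitespace tokenizer, a lower-case step, and a stop-word step. It is not Lucene code, the stop-word list is made up, and the filter order is simplified compared with the XML.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerChainSketch {

    // Hypothetical stop-word list; a real one would be loaded from a file.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "is", "a", "of"));

    // Runs the toy chain: whitespace tokenizer -> lower-case filter -> stop filter.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {      // tokenizer step
            if (raw.isEmpty()) {
                continue;                                    // empty input yields no tokens
            }
            String token = raw.toLowerCase(Locale.ROOT);     // lower-case filter step
            if (!STOP_WORDS.contains(token)) {               // stop filter step
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The PC is not good")); // prints [pc, not, good]
    }
}
```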

Example of defining fields for a document (the <field> tag), which defines which type each field belongs to (string, int, text_ja, ...) and whether it is indexed:

<field name="master_id"        type="int"    indexed="true" stored="true" required="true"/>
<field name="shop_id"          type="string" indexed="true" stored="true" required="true"/>
<field name="service_id"       type="int"    indexed="true" stored="true" required="true"/>
<field name="create_datetime"  type="date"   indexed="true" stored="true" required="true"/>
<field name="modify_datetime"  type="date"   indexed="true" stored="true" required="true"/>
<field name="toku_flag"        type="string" indexed="true" stored="true" required="true"/>
<field name="card_flag"        type="string" indexed="true" stored="true" required="true"/>
<field name="repay_flag"       type="string" indexed="true" stored="true" required="true"/>
<field name="bank_flag"        type="string" indexed="true" stored="true" required="true"/>
<field name="convenience_flag" type="string" indexed="true" stored="true" required="true"/>

Configuring the schema and uploading data to Solr

Schema config: the schema is where we define the structure of a document: which fields it has and how they are analyzed, indexed, and filtered. Think of it as the table definition (DDL) in an RDB. In older versions the schema file is schema.xml; in newer versions it is the managed-schema file.
Uploading data to Solr with curl: Solr supports uploading 3 file types, XML, JSON, and CSV (TSV counts as CSV):
Example of uploading an XML file to the collection test (-v lets us see the response code):

curl -v "http://localhost:8983/solr/test/update?commit=true" -H "Content-Type: text/xml" --data-binary @data.xml

When we upload a document to Solr, Solr uses the collection's schema definition to index it.
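
For reference, an XML file for the update handler wraps each document in <add><doc> with one <field> element per field. Here is a small sketch that generates such a body with plain Java. The field names user_id and name are borrowed from the query output later in this post, and the values are made up; a real file would use the fields of your own schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UpdateXmlBuilder {

    // Escapes the characters that are special in XML text and attribute values.
    static String escape(String value) {
        return value.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Renders one document as an XML update message: <add><doc><field .../></doc></add>.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add>\n  <doc>\n");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("    <field name=\"").append(escape(field.getKey())).append("\">")
               .append(escape(field.getValue()))
               .append("</field>\n");
        }
        return xml.append("  </doc>\n</add>\n").toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>(); // keeps insertion order
        doc.put("user_id", "100");                       // made-up sample values
        doc.put("name", "PeCo & friends");
        System.out.print(toAddXml(doc));                 // could be saved as data.xml
    }
}
```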

Querying with SolrJ

There are 2 ways to query Solr: directly with HTTP requests, or through one of the ready-made client libraries. Solr provides client libraries so that we don't have to build the HTTP requests, parse the responses, or handle the query settings (facet, limit, sort, ...) ourselves. Information about the clients can be found at the link.

Example of a query using an HTTP request

[solr@longkyo solr-6.5.1]$ curl 'http://localhost:8983/solr/test-post/select?indent=on&q=*:*&rows=5&wt=csv'
user_id,name,id
200,PC is not go,52d23302-c884-4a58-8e11-429d2da07f70
300,PeCo is good,346be232-3a9d-4a0e-b956-e3f6840994c0
200,PC is not go,8ff6165d-0bb8-46cb-b15f-2c2effb58fc5
300,PeCo is good,bf76cfd9-65da-4467-a293-e8aaafeb2f27
100,Long Ta hoho aa,1e5359f6-60a9-47a9-b0ff-6a816f4b51a4
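
If you build such request URLs in code rather than by hand, the parameter values need URL encoding (the colon in *:* for instance). A minimal sketch using only the JDK follows; the class name is mine and it only builds the URL, it does not send the request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class SelectUrlBuilder {

    // Builds <baseUrl>/<collection>/select?k1=v1&k2=v2 with URL-encoded parameters.
    static String selectUrl(String baseUrl, String collection, Map<String, String> params) {
        StringBuilder url = new StringBuilder(baseUrl)
                .append('/').append(collection).append("/select");
        char separator = '?';
        for (Map.Entry<String, String> param : params.entrySet()) {
            url.append(separator)
               .append(encode(param.getKey())).append('=').append(encode(param.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    // URL-encodes one parameter name or value.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps parameter order
        params.put("q", "*:*");
        params.put("rows", "5");
        params.put("wt", "csv");
        // prints http://localhost:8983/solr/test-post/select?q=*%3A*&rows=5&wt=csv
        System.out.println(selectUrl("http://localhost:8983/solr", "test-post", params));
    }
}
```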

Example using SolrJ
Create a Gradle project with the following build.gradle config

apply plugin: 'application'
mainClassName = 'com.example.Main'
repositories {
    mavenCentral()
}
dependencies {
    compile 'org.apache.solr:solr-solrj:6.5.1'
}

Java code (file src/main/java/com/example/Main.java): Solr in standalone mode

package com.example;
import java.io.IOException;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
public class Main {
    public static void main(String[] args) {
        // try-with-resources closes the client (and its HTTP connections) when done
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            // "*:*" matches every document in the core
            QueryResponse response = client.query(new SolrQuery("*:*"));
            SolrDocumentList list = response.getResults();
            for (SolrDocument doc : list) {
                System.out.printf("id: %s name: %s\n", doc.get("id"), doc.get("name"));
            }
        } catch (SolrServerException | IOException e) {
            e.printStackTrace();
        }
    }
}

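With SolrCloud, create the client object as follows (CloudSolrClient lives in org.apache.solr.client.solrj.impl). The trailing /solr is a ZooKeeper chroot; leave it off if the nodes were started without one, as with the -z values used earlier.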

String zkHostString = "zkServerA:2181,zkServerB:2181,zkServerC:2181/solr";
SolrClient solr = new CloudSolrClient.Builder().withZkHost(zkHostString).build();
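
The ZooKeeper connection string has a fixed shape: comma-separated host:port pairs, with an optional chroot path that appears once, after the last host. A tiny helper like the sketch below (hypothetical, not part of SolrJ) keeps that shape right when the host list comes from configuration.

```java
import java.util.Arrays;
import java.util.List;

public class ZkHostString {

    // Joins hosts as host:port pairs and appends the optional chroot once at the end.
    static String build(List<String> hosts, int port, String chroot) {
        StringBuilder zkHost = new StringBuilder();
        for (String host : hosts) {
            if (zkHost.length() > 0) {
                zkHost.append(',');
            }
            zkHost.append(host).append(':').append(port);
        }
        if (chroot != null && !chroot.isEmpty()) {
            zkHost.append(chroot); // e.g. "/solr"; omitted when no chroot is used
        }
        return zkHost.toString();
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("zkServerA", "zkServerB", "zkServerC");
        System.out.println(build(hosts, 2181, "/solr")); // with chroot
        System.out.println(build(hosts, 2181, ""));      // without chroot
    }
}
```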

Conclusion

For a service with a lot of data and heavy search traffic (e-commerce sites, news sites, ...), using Solr as the search engine takes a great deal of load off the service.
Solr's indexing and filtering mechanisms also help improve the quality of search results.
SolrCloud, with its fault tolerance, high availability, and flexible scale-out through shards and replicas, helps keep the system robust and flexible.