Elasticsearch DSL for searching and ranking information

In modern information systems, the amount of data grows every hour. Users keep entering new information, which in turn increases the size of backups, logs, duplicate transactions, and so on. Searching this data efficiently requires tools built for the task. Moreover, the volume can be so large that searching, sorting, and matching information has to be spread across parallel computations. Elasticsearch can help with this, and the Elasticsearch DSL library lets you work with it from Python. In this blog, we’ll talk about the basic possibilities for finding information using these tools.

What is Elasticsearch?

Elasticsearch is a free software search server. It provides a distributed, multitenant full-text search engine with an HTTP web interface and support for schema-free JSON documents. The first version of Elasticsearch went live in February 2010. Elasticsearch can be used to index and search any type of document, and it supports near real-time search. An index can be divided into shards, and each shard can have zero or more replicas. Each node contains one or more shards and acts as a coordinator that delegates operations to the correct shard. Balancing and routing are performed automatically.

The following points speak in favor of Elasticsearch:

  • In Elasticsearch, you can perform and combine different kinds of searches, regardless of the data type. Information can include structured, unstructured, geographic, metric, and other data types.
  • Libraries for different programming languages and HTTP API requests are supported.
  • A GET request can quickly retrieve data in the required form.
  • Elasticsearch can efficiently analyze billions of records in a matter of seconds.
  • The system provides aggregates to help you investigate trends and patterns in your data.
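
As a rough illustration of the last point, an aggregation is requested as part of the search body. The sketch below only builds such a body in Python and does not contact a server; the `age` field and the `average_age` label are hypothetical names chosen for the example.

```python
import json


def average_aggregation(field, label):
    """Build a search body that asks only for the average of one numeric field."""
    return {
        "size": 0,  # skip the matching documents, return only the aggregate
        "aggs": {label: {"avg": {"field": field}}},
    }


# "age" and "average_age" are made-up names used for illustration
body = average_aggregation("age", "average_age")
print(json.dumps(body, indent=2))
```

Sending a body like this to an index's `_search` endpoint would return the computed average instead of a list of documents.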

In a nutshell, Elasticsearch provides scale-out search with multithreading support. Search indexes can be divided into shards, and each shard can have multiple replicas. Each node can host multiple shards and acts as a coordinator that delegates operations to the correct shard; rebalancing and routing are automatic. Related data is often stored in the same index, which consists of one or more primary shards and possibly multiple replicas.
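
The routing step can be pictured with a small sketch. Elasticsearch picks a primary shard by hashing a routing value (by default the document ID) and taking the remainder modulo the number of primary shards. The code below assumes `zlib.crc32` as a stand-in hash, since Elasticsearch uses a different hash function internally, so the shard numbers it prints will not match a real cluster. The document IDs are hypothetical.

```python
import zlib


def pick_shard(routing_value, primary_shards):
    """Map a routing value to a shard number in [0, primary_shards)."""
    # crc32 is only a stand-in; Elasticsearch uses a different hash internally
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % primary_shards


doc_ids = ["doc-1", "doc-2", "doc-3", "doc-4"]  # hypothetical document IDs
placement = {doc_id: pick_shard(doc_id, primary_shards=3) for doc_id in doc_ids}
print(placement)
```

Because the mapping is deterministic, any node can work out which shard holds a given document and forward the operation there.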

Installing Elasticsearch

Installing Elasticsearch is not difficult. On macOS, for example, you can use brew. Other systems have corresponding installation tools, which are described in the documentation on the Elasticsearch site.

brew tap elastic/tap

Tapping the repository takes some time and produces output similar to the following:

==> New Formulae
archey4             dory                llvm@11             marcli              organize-tool       stp                 zinit
conftest            gnupg@2.2           lychee              minisat             revive              webhook
csvtk               lefthook            macchina            mr2                 six                 xplr
==> Updated Formulae
Updated 680 formulae.
==> Renamed Formulae
fcct -> butane

==> Tapping elastic/tap
Cloning into '/usr/local/Homebrew/Library/Taps/elastic/homebrew-tap'...
remote: Enumerating objects: 870, done.
remote: Counting objects: 100% (111/111), done.
remote: Compressing objects: 100% (84/84), done.
remote: Total 870 (delta 63), reused 55 (delta 26), pack-reused 759
Receiving objects: 100% (870/870), 202.89 KiB | 490.00 KiB/s, done.
Resolving deltas: 100% (649/649), done.
Tapped 17 formulae (50 files, 319.4KB).

The next command installs the full Elasticsearch distribution:

brew install elastic/tap/elasticsearch-full

Then, add the following lines to your .bash_profile:

ES_HOME=/usr/local/var/homebrew/linked/elasticsearch-full
export ES_HOME

This will help you run Elasticsearch from the installation directory on your computer.

Starting and Testing the Elasticsearch Installation

Start Elasticsearch as a background service that also restarts at login:

brew services start elasticsearch

Or, if you don't need a background service you can just run:

elasticsearch

To test the Elasticsearch installation, type:

curl localhost:9200

This will produce the following output:

{
  "name" : "MacBook-Pro-K-2",
  "cluster_name" : "elasticsearch_konst1970",
  "cluster_uuid" : "O2BwRmCCQ8amY3CiZa7Bpg",
  "version" : {
    "number" : "7.12.0",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "78722783c38caa25a70982b5b042074cde5d3b3a",
    "build_date" : "2021-03-18T06:17:15.410153305Z",
    "build_snapshot" : false,
    "lucene_version" : "8.8.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

This means that Elasticsearch version 7.12.0 is up and running.
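
The same check can be scripted. The sketch below uses only the Python standard library; the helper names `parse_version` and `fetch_version` are made up for this example, and the URL assumes the default local installation shown above. Only the offline parsing step runs when the snippet is executed.

```python
import json
from urllib.request import urlopen


def parse_version(payload):
    """Extract the version number from the JSON text returned by the root endpoint."""
    return json.loads(payload)["version"]["number"]


def fetch_version(url="http://localhost:9200"):
    """Query a running node; raises URLError if nothing is listening."""
    with urlopen(url, timeout=5) as response:
        return parse_version(response.read().decode("utf-8"))


# Offline demonstration with a trimmed copy of the response shown above
sample = '{"version": {"number": "7.12.0"}, "tagline": "You Know, for Search"}'
print(parse_version(sample))
```

With a node running locally, calling `fetch_version()` would perform the same request as the curl command.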

Using Python and Elasticsearch

There are many ways to use Elasticsearch with the Python programming language. For instance, you can send HTTP requests to the Elasticsearch API with your favorite Python network library, or you can use the official low-level Python client library, elasticsearch.

On a higher level, it is possible to use Elasticsearch DSL library for Python to create more compact and effective code. As mentioned on their website: “Elasticsearch DSL is a high-level library whose aim is to help with writing and running queries against Elasticsearch. It is built on top of the official low-level client (elasticsearch-py). It provides a more convenient and idiomatic way to write and manipulate queries. It stays close to the Elasticsearch JSON DSL, mirroring its terminology and structure. It exposes the whole range of the DSL from Python either directly using defined classes or a queryset-like expressions. It also provides an optional wrapper for working with documents as Python objects: defining mappings, retrieving and saving documents, wrapping the document data in user-defined classes.”
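
To make the phrase "stays close to the Elasticsearch JSON DSL" concrete, here is a sketch of the kind of JSON request body that such a library ultimately produces. It is written with plain dictionaries rather than with the library itself, the helper functions are hypothetical, and the `title` and `year` fields are invented for illustration.

```python
import json


def match(field, text):
    """Full-text match clause on one field."""
    return {"match": {field: text}}


def range_gte(field, lower):
    """Range clause: field value greater than or equal to `lower`."""
    return {"range": {field: {"gte": lower}}}


def all_of(*clauses):
    """Combine clauses so that every one of them must match."""
    return {"bool": {"must": list(clauses)}}


# Hypothetical fields, chosen only for illustration
body = {"query": all_of(match("title", "search"), range_gte("year", 2010))}
print(json.dumps(body, indent=2))
```

In Elasticsearch DSL, the corresponding building block is the `Q` object, which is used in the script later in this article.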

To install the elasticsearch and elasticsearch_dsl libraries on your computer, use pip. This assumes that the Elasticsearch engine is already installed on your system.

pip install elasticsearch
pip install elasticsearch_dsl

To try Elasticsearch with Python please run the following code:

from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

print (es)

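Creating the client does not yet contact the server (es.ping() returns True only when the node is reachable), so the script simply prints the configured client object: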

<Elasticsearch([{'host': 'localhost', 'port': 9200}])>

Please note that deleting records from Elasticsearch may require an additional configuration step. If an index has been marked read-only, you can clear the block with the following settings request:

curl -XPUT -H "Content-Type: application/json" http://127.0.0.1:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'

Elasticsearch acknowledges the request when the setting has been applied.

Adding, Searching, and Deleting Information in Elasticsearch

Let’s create a simple Python script to add, search, and delete information in Elasticsearch. This script uses data records generated by JSON Data Generator on this website. The script does the following:

  • First, the script creates a client for the Elasticsearch server on port 9200.
  • Then, it deletes all records from the search index.
  • Then, the script adds records to the search index.
  • Finally, the script finds all records whose company matches UPLINX and whose age is greater than or equal to 20.

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Q
from elasticsearch_dsl import Search

# establish connection with Elasticsearch
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

print (es)

# delete all records from elasticsearch
s1 = Search(using=es, index='my-index').query("range", index={'gte': 0})
response = s1.delete()

data = [
 {
   "index": 0,
   "guid": "e5598a00-b2ed-437c-b13b-1e6f387bf23f",
   "isActive": False,
   "balance": "$2,428.26",
   "picture": "http://placehold.it/32x32",
   "age": 21,
   "eyeColor": "blue",
   "name": "Cherry Baird",
   "gender": "female",
   "company": "UPLINX",
   "email": "cherrybaird@uplinx.com",
   "phone": "+1 (957) 504-3326",
   "address": "585 Ludlam Place, Deseret, Virginia, 5268",
   "about": "Cupidatat reprehenderit mollit et qui pariatur enim est commodo non duis sit. Do mollit esse commodo ad pariatur dolore qui. Deserunt ullamco eiusmod cillum eiusmod pariatur do minim elit minim veniam incididunt ad Lorem est. Quis elit nostrud non sit dolore. Ea nulla velit enim nostrud Lorem.\r\n",
   "registered": "2015-03-18T11:29:59 -02:00",
   "latitude": 11.444065,
   "longitude": -104.466353,
   "tags": [
     "veniam",
     "amet",
     "nostrud",
     "ipsum",
     "pariatur",
     "ad",
     "sunt"
   ],
   "friends": [
     {
       "id": 0,
       "name": "Clements Fletcher"
     },
     {
       "id": 1,
       "name": "Stuart Mcintosh"
     },
     {
       "id": 2,
       "name": "Finch Cleveland"
     }
   ],
   "greeting": "Hello, Cherry Baird! You have 10 unread messages.",
   "favoriteFruit": "strawberry"
 },
 ... # add more records here
]

# add records to elasticsearch
for body in data:
  result = es.index(index='my-index', body=body)
  print(result)

# make the newly indexed records visible to search
es.indices.refresh(index='my-index')

# form query for search: match company and age
query = Q('match', company='UPLINX') & Q('range', age={'gte': 20})
s = Search(using=es, index='my-index').query(query)
response = s.execute()

# print search results
for hit in response:
   print(hit.name)

This script will generate the following output:

<Elasticsearch([{'host': 'localhost', 'port': 9200}])>
{'_index': 'my-index', '_type': '_doc', '_id': 'BTj4CnkB-8-eDhz1rKIb', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 130, '_primary_term': 5}
{'_index': 'my-index', '_type': '_doc', '_id': 'Bjj4CnkB-8-eDhz1rqJ9', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 131, '_primary_term': 5}
...
{'_index': 'my-index', '_type': '_doc', '_id': 'Czj4CnkB-8-eDhz1saI1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 136, '_primary_term': 5}
Cherry Baird
Manning Maddox
Hawkins Wilson

Here, we found 3 records that match the specified company and have an age greater than or equal to 20. This simple example demonstrates how to build search systems with Python libraries on top of a powerful search engine such as Elasticsearch.
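
The script sends one request per record. For larger data sets, Elasticsearch also offers a `_bulk` endpoint, which accepts newline-delimited JSON: an action line followed by the document itself. The sketch below only builds such a payload with the standard library; `to_bulk_payload` is a hypothetical helper, and the two short records stand in for the generated data.

```python
import json


def to_bulk_payload(index_name, records):
    """Render records as newline-delimited JSON for the _bulk endpoint."""
    lines = []
    for record in records:
        # action line: tells Elasticsearch to index the next line into index_name
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # document line: the record itself
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"  # the payload must end with a newline


records = [
    {"name": "Cherry Baird", "age": 21, "company": "UPLINX"},
    {"name": "Example Person", "age": 19, "company": "ACME"},  # invented record
]
payload = to_bulk_payload("my-index", records)
print(payload)
```

Each record becomes two lines, so a payload for N records has 2N lines and can be sent in a single HTTP request.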

Conclusion

Even with a complex system such as Elasticsearch, you can work through a Python library and solve the assigned tasks quite effectively. The main computational load stays on the Elasticsearch side, so Python adds little overhead to the search itself. In terms of functionality and query-building capabilities, the Elasticsearch DSL library also provides ample opportunities and allows you to solve practical tasks for finding the necessary information.

Svitla Systems' qualified software engineers have extensive experience with Elasticsearch and information retrieval systems. You can contact our company for software development, including complex software systems in which it is necessary to implement various methods of information retrieval. Our developers also know how to build a system architecture that works as quickly as possible with large amounts of information, including in cloud environments.