---
inference: false
language:
- en
- zh
license:
- cc-by-sa-3.0
- gfdl
library_name: txtai
tags:
- sentence-similarity
---

# Medical txtai embeddings index

This is a [txtai](https://github.com/neuml/txtai) embeddings index built for medical texts, covering a bilingual corpus in English and Chinese.
It is intended for integration into medical information systems, where it supports fast retrieval of relevant clinical information.

## Data Sources

The index is built from 434,411 entries of a bilingual (English and Chinese) corpus of clinical texts. The sources are:

- `shibing624/medical`, a dataset featuring a variety of medical scenarios and questions in both English and Chinese, suitable for text generation and medical question-answering systems. It's licensed under Apache 2.0.
- `keivalya/MedQuad-MedicalQnADataset`, offering detailed insights into various health conditions and their treatments, covering prevention, diagnosis, treatment, and susceptibility.
- `GBaker/MedQA-USMLE-4-options`, a collection of multiple-choice questions based on the USMLE, focusing on a wide range of medical topics and scenarios.
- `medalpaca/medical_meadow_medqa`, a dataset for question answering in English and Chinese, encompassing clinical scenarios and medical queries with multiple-choice answers.
- `medalpaca/medical_meadow_medical_flashcards`, featuring over 34,000 rows of question and answer pairs derived from medical flashcards, focusing on a wide range of medical subjects.

Each of these datasets contributes to the depth and diversity of the medical knowledge encapsulated in the txtai embeddings model, making it an effective tool for medical information retrieval and analysis.

## Indexing

The txtai embeddings index uses `efederici/multilingual-e5-small-4096`, a transformer-based model with 12 layers and an embedding size of 384 that supports 94 languages.

### Configuration

The embedding model is quantized to 4 bits to reduce its size, and text is encoded in batches of 15.
Indexing uses a plain NumPy backend with cosine similarity, which keeps retrieval straightforward: each query is scored against every stored vector.
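
To make the retrieval step concrete, here is a minimal sketch of exhaustive cosine-similarity search. It is an illustration only, not txtai's implementation: it uses the Python standard library in place of NumPy, the function names `normalize` and `cosine_search` are hypothetical, and the toy 3-dimensional vectors stand in for the 384-dimensional embeddings.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def cosine_search(query, vectors, limit=3):
    """Score every stored vector against the query; return the best (index, score) pairs."""
    q = normalize(query)
    scores = []
    for index, vector in enumerate(vectors):
        v = normalize(vector)
        scores.append((index, sum(a * b for a, b in zip(q, v))))
    # Exhaustive scan: sort all scores, highest similarity first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:limit]


# Toy vectors (assumption: real embeddings would have 384 dimensions)
vectors = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
results = cosine_search([1.0, 0.1, 0.0], vectors, limit=2)
```

Because every stored vector is scored, results are exact rather than approximate, at the cost of query time that grows linearly with the number of entries.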

## Usage

1. Load the dataset using the provided JSON file.
2. Initialize and load the embeddings using txtai:
   ```python
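   from txtai import Embeddings

   # Create an empty Embeddings instance; its configuration is read from the archive
   embeddings = Embeddings()
   # Load the saved index from the compressed archive
   embeddings.load('index.tar.gz')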
   ```
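
Once the index is loaded, it can be queried with `embeddings.search`. The helper below is a hedged sketch rather than part of this model card's official usage: `top_matches` is a hypothetical name, and since this card does not say whether the index stores content, the helper accepts both result shapes txtai can return (dicts with `id`, `text`, and `score` when content is stored, `(id, score)` tuples otherwise).

```python
def top_matches(embeddings, query, limit=3):
    """Return (id, score, text) triples for the best matches to query."""
    matches = []
    for result in embeddings.search(query, limit):
        if isinstance(result, dict):
            # Content-enabled index: rows are dicts with id, text and score
            matches.append((result["id"], result["score"], result.get("text")))
        else:
            # Vector-only index: rows are (id, score) tuples, so no text is available
            uid, score = result
            matches.append((uid, score, None))
    return matches
```

With a loaded `embeddings` object, a call such as `top_matches(embeddings, "What are the symptoms of anemia?")` would return up to three results; the query text here is only an example.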

## Next Steps
1. Document more detailed usage, including using txtai for interoperability between English and Chinese
2. Create a use case with [CrewAI](https://github.com/joaomdmoura/crewAI) and [Dr.Samantha](https://huggingface.co/sethuiyer/Dr_Samantha_7b_mistral)

## License

This model is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License and the GNU Free Documentation License.