
Technological Stack

The project is closely tied to AI/ML, which is why I chose Python to implement the solution.

The main libraries used in the project include: fastapi, pydantic, celery, sentence_transformers, keybert, pytorch, whisperx, pytest, and pymongo.

For storing analysis results I selected MongoDB, a highly scalable and efficient document-based database, and Weaviate for storing the vector embeddings (see the 'Calls Categorization' section).

For communication between the backend and the workers I selected RabbitMQ, since it is the de facto standard for task distribution in Python applications and is well supported by celery.

Solution Architecture

Since the main purpose of the solution is to analyze large volumes of phone call data, the system was designed to be highly scalable and distributed. Currently the system consists of:

a backend that is responsible for retrieving the results and presenting them to the user via the API
workers that are responsible for audio and text analysis tasks

This architecture allows the components to be scaled individually according to hardware requirements, traffic load, etc.

At the component level I used hexagonal architecture. It makes it possible to build modular and highly testable systems that can be easily adjusted to new requirements. There are three layers:

api: this layer is responsible for the REST API, its schema and input validation
domain: this layer contains the adapters for the AI models and the business logic for categories and call analysis
infrastructure: this layer contains the implementation of the DB adapters and repositories, and the workflow setup for call analysis
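
To illustrate how the layers relate, here is a minimal sketch rather than the project's actual code. The names CallAnalysisRepository, InMemoryCallAnalysisRepository and CallAnalysisService are hypothetical. The point is that the domain logic depends only on an abstract port, while the infrastructure layer supplies a concrete adapter for it.

```python
from typing import Dict, Optional, Protocol


class CallAnalysisRepository(Protocol):
    """Port: the only storage interface the domain layer knows about (hypothetical name)."""

    def save(self, call_id: str, result: Dict[str, str]) -> None: ...

    def get(self, call_id: str) -> Optional[Dict[str, str]]: ...


class InMemoryCallAnalysisRepository:
    """Adapter: an in-memory stand-in for a database-backed repository."""

    def __init__(self) -> None:
        self._items: Dict[str, Dict[str, str]] = {}

    def save(self, call_id: str, result: Dict[str, str]) -> None:
        self._items[call_id] = dict(result)

    def get(self, call_id: str) -> Optional[Dict[str, str]]:
        return self._items.get(call_id)


class CallAnalysisService:
    """Domain logic that depends on the port, not on a concrete database."""

    def __init__(self, repository: CallAnalysisRepository) -> None:
        self._repository = repository

    def store_transcription(self, call_id: str, transcription: str) -> None:
        self._repository.save(call_id, {"transcription": transcription})

    def get_transcription(self, call_id: str) -> Optional[str]:
        result = self._repository.get(call_id)
        return result["transcription"] if result else None
```

Because the service only sees the port, a unit test can pass the in-memory adapter while production code passes a database-backed one.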

Text Analysis

The text processing pipeline is designed to be highly distributed and parallel. First, the audio call transcription is obtained. Then three parallel jobs are triggered: emotional tone analysis, entity extraction, and categorization. In celery, the workflow looks as follows:

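from celery import chain, chord, group


def analyze_call(call_id: str, audio_url: str):
    workflow = chain(
        set_status.si(call_id, CallAnalysisStatus.IN_PROGRESS),
        download_and_transcribe_audio.si(call_id, audio_url),
        # The result of download_and_transcribe_audio is prepended to the
        # arguments of each parallel task, since they use mutable signatures (.s).
        chord(
            group(
                identify_call_entities.s(call_id),
                identify_emotional_tone.s(call_id),
                classify_category.s(call_id),
            ),
            # Chord callback: runs once, after all three parallel tasks finish.
            set_status.si(call_id, CallAnalysisStatus.FINISHED),
        ),
    )

    # Mark the call as failed if a task in the workflow raises an error.
    workflow.apply_async(link_error=set_status.si(call_id, CallAnalysisStatus.FAILED))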

Results of the call analysis are stored in MongoDB with the following data model:

{
    "call_id": "string",
    "emotional_tone": "string",
    "transcription": "string",
    "extracted_entities": {
        "ENTITY_TYPE_1": "string",
        "ENTITY_TYPE_2": "string"
    }
}
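
The same document shape can be expressed as a typed structure. This is a sketch only: the class name CallAnalysisDocument and the to_document helper are hypothetical, and a standard-library dataclass is used here to keep the example self-contained.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class CallAnalysisDocument:
    """Mirrors the stored document; the class name is hypothetical."""

    call_id: str
    emotional_tone: str
    transcription: str
    # Maps an entity type to the extracted value, as in the data model above.
    extracted_entities: Dict[str, str] = field(default_factory=dict)

    def to_document(self) -> dict:
        # Plain dict that could be handed to a document-store client.
        return asdict(self)
```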

Phone Call Transcription

For phone call transcription I chose WhisperX, an advanced automatic speech recognition model. To balance quality against resource requirements, I chose the base.en model as the default. WhisperX offers high transcription accuracy, even in challenging conditions such as background noise or varying accents. As an open-source model, WhisperX fits the goal of not sending data to external APIs, which protects data privacy and security.

The model processes the audio file and generates a text transcript of the conversation, which serves as the foundation for further analysis steps.

Emotional Tone Analysis

To analyze the emotional tone of the transcribed calls, I used the j-hartmann/emotion-english-distilroberta-base model. This model classifies text into seven emotional categories: anger, joy, disgust, sadness, fear, neutral, and surprise. The emotion scores returned by the model are then combined into the final emotional tone using the following rules:

combined_scores = {
    CallEmotionalTone.ANGRY: emotion_scores["anger"],
    CallEmotionalTone.POSITIVE: emotion_scores["joy"],
    CallEmotionalTone.NEGATIVE: emotion_scores["disgust"] + emotion_scores["sadness"] + emotion_scores["fear"],
    CallEmotionalTone.NEUTRAL: emotion_scores["neutral"] + emotion_scores["surprise"]
}
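
The snippet above only shows how the scores are combined. A minimal sketch of the final step follows, assuming the tone with the largest combined score is selected. That selection rule, the function name pick_emotional_tone and the string values of the enum members are assumptions, not taken from the project.

```python
from enum import Enum
from typing import Dict


class CallEmotionalTone(str, Enum):
    # Member names follow the snippet above; the string values are assumed.
    ANGRY = "Angry"
    POSITIVE = "Positive"
    NEGATIVE = "Negative"
    NEUTRAL = "Neutral"


def pick_emotional_tone(emotion_scores: Dict[str, float]) -> CallEmotionalTone:
    """Combine the seven model scores and return the highest-scoring tone."""
    combined_scores = {
        CallEmotionalTone.ANGRY: emotion_scores["anger"],
        CallEmotionalTone.POSITIVE: emotion_scores["joy"],
        CallEmotionalTone.NEGATIVE: emotion_scores["disgust"]
        + emotion_scores["sadness"]
        + emotion_scores["fear"],
        CallEmotionalTone.NEUTRAL: emotion_scores["neutral"] + emotion_scores["surprise"],
    }
    # Assumption: the final tone is the one with the largest combined score.
    return max(combined_scores, key=combined_scores.get)
```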

Call Entities Extraction

For call entity extraction, I used the Named Entity Recognition (NER) capabilities of the spaCy library, specifically the en_core_web_sm model. This model identifies various entity types such as person names, organizations, locations, dates, and more within the transcribed text. To determine the caller's identity, I apply a heuristic that treats the first person name in the transcript as the caller, based on the common practice of introducing oneself at the start of a call.
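
The caller heuristic can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the function name guess_caller_name is hypothetical, and the input is assumed to be a list of (text, label) pairs in transcript order, which could be built from spaCy entities via their text and label_ attributes.

```python
from typing import Iterable, Optional, Tuple


def guess_caller_name(entities: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the text of the first PERSON entity, or None if there is none.

    `entities` holds (text, label) pairs in the order they occur in the
    transcript, e.g. [(ent.text, ent.label_) for ent in doc.ents] with spaCy.
    """
    for text, label in entities:
        if label == "PERSON":
            return text
    return None
```

With this rule, a transcript that yields a date, then "John Smith", then another person name resolves the caller to "John Smith".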

Calls Categorization

Call categorization could be treated as a classification task. However, according to the system requirements, categories and their key points can be updated or deleted at run time, so retraining a model after each change would be inefficient. The algorithm should therefore adapt easily to new categories and should not add extra load by re-evaluating calls when categories change. The approach I propose is based on similarity search between the key points of a category and the key points mentioned in a call. It consists of the following steps.

When a category is added, updated, or deleted:

embed the category name and key points using the all-mpnet-base-v2 model
save the embeddings into the vector database

def add_category(self, category: Category):
    properties = self.__create_properties(category)
    vectors = self.__embed_text(properties)
    with self.collection.batch.dynamic() as batch:
        for p, vector in zip(properties, vectors):
            batch.add_object(
                properties=p,
                vector=vector
            )

def __create_properties(self, category):
    properties = []
    properties.extend([self.__create_key_point_document(category.id, category.title, point) for point in category.points])
    properties.append(self.__create_category_document(category.id, category.title))
    return properties

During the call analysis:

extract keywords from the call transcription
embed the keywords using the all-mpnet-base-v2 model
save the keyword embeddings into the vector database

def classify(self, call_id: str, transcription: str):
    keywords = self.keywords_extractor.extract(transcription)
    self.keywords_store.insert(call_id, keywords)

When analysis results are requested:

retrieve the keyword vectors from the vector database
retrieve the categories whose cosine similarity to the keywords is greater than 0.5

def get_result(self, call_id: str) -> List[str]:
    result = set()
    keyword_vectors = self.keywords_store.get(call_id)
    for category in self.category_store.get_similar(keyword_vectors):
        result.add(category)
    # Return a list so the value matches the declared return type.
    return list(result)


def get_similar(self, vectors: List[np.ndarray], threshold: float = 0.5) -> List[str]:
    result = set()
    for vector in vectors:
        response = self.collection.query.near_vector(
            near_vector=vector,
            # threshold is the maximum allowed distance from the query vector
            distance=threshold,
            return_metadata=MetadataQuery(distance=True)
        )
        for o in response.objects:
            result.add(o.properties["category"])
    return list(result)
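
To make the matching rule concrete without a vector database, here is a small pure-Python sketch. It assumes that cosine distance is defined as 1 minus cosine similarity, and that a category matches when any of its vectors lies within the maximum distance of any keyword vector. The function names and the in-memory dictionary of category vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_categories(
    keyword_vectors: List[Sequence[float]],
    category_vectors: Dict[str, List[Sequence[float]]],
    max_distance: float = 0.5,
) -> List[str]:
    """Return the sorted names of categories close to at least one keyword."""
    matched = set()
    for keyword in keyword_vectors:
        for name, vectors in category_vectors.items():
            # Cosine distance is taken as 1 - cosine similarity.
            if any(1.0 - cosine_similarity(keyword, v) <= max_distance for v in vectors):
                matched.add(name)
    return sorted(matched)
```

Since matching is evaluated per keyword, a call can end up with several categories or with none.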

How To Start Application

To start the application, run the following command:

docker compose up --build

Testing

To run the unit tests in the docker container, use the following command:

docker exec devchallenge-xxi-backend bash -c "cd /app/tests && pytest -v"

Corner Cases Covered

large file handling: if the file content does not fit into the model, it is split into chunks and the results are then aggregated
call categorization after category updates and deletes: because key topics of categories and keywords of calls share one latent vector space, categorization results do not need to be recalculated for every call
if an error occurs during the analysis, the user is given the corresponding details
the system works correctly when there is noise in the input audio
the system works correctly for different English accents
no component of the system sends data to external APIs, which is critical for such a solution
the system returns both the first name and the surname of the caller when both are present
each call can have multiple categories or no category at all
concurrent analysis runs and asynchronous processing: the system can handle multiple analysis requests simultaneously, balancing the load between workers
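
The large-input case from the list above can be sketched as follows. This is an illustration only: splitting by word count and averaging per-label scores are assumptions about the chunking and aggregation strategy, and the names split_into_chunks and score_long_text are hypothetical.

```python
from typing import Callable, Dict, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def score_long_text(
    text: str,
    score_chunk: Callable[[str], Dict[str, float]],
    max_words: int = 200,
) -> Dict[str, float]:
    """Score each chunk separately and average the scores per label."""
    chunks = split_into_chunks(text, max_words)
    if not chunks:
        return {}
    totals: Dict[str, float] = {}
    for chunk in chunks:
        for label, score in score_chunk(chunk).items():
            totals[label] = totals.get(label, 0.0) + score
    # Assumption: plain mean over chunks; other aggregations are possible.
    return {label: total / len(chunks) for label, total in totals.items()}
```

Here score_chunk stands for any function that maps a piece of text to label scores, such as a wrapper around the emotion model.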

Further Steps

Deploy application in the cloud
Utilize GPU for faster speech-to-text
Speech diarization in processing pipeline
LLM based embeddings for categories classification
Open-source LLM (such as Llama) to conduct text analysis
