Codestin Search App

Code Test for Software Engineers interested in NLP

This is a code test for Software Engineers at Quantworks interested in NLP. Consider this an assessment rather than a graded exam. This test helps us gauge a candidate's level of familiarity and comfort with technologies and techniques.

Project Description

This exercise provides data from the National Transportation Safety Board's database of aviation accidents. We'll ask you to perform some routine high-level analytic tasks with the data.

Some guidance

Use open source tools, such as Python, R, or Java. Do not use proprietary tools, such as SAS, SPSS, JMP, Tableau, or Stata.
Fork this repository to your personal GitHub account and clone the fork to your computer.
Use NLP processing techniques you have familiarity with to perform analysis. We will discuss your approach and understanding of NLP concepts together.
Save and commit your answers to your fork of the repository, and push them back to your personal GitHub account. You can then provide a link to that fork of the repository if you need to show a code example.
Use the Internet as a resource to help you complete your work. We do it all the time.
Comment your code so that when you look back at it in a year, you'll remember what you were doing.
There are many ways to approach and solve the problems presented in this exercise.
Have fun!

The Data

There are 145 files in this repository:

AviationData.xml: This is a straight export of the database provided by the NTSB. The data has not been altered. It was retrieved by clicking "Download All (XML)" from this page on the NTSB site.

There are 144 files in the following format:

NarrativeData_xxx.json: These files were created by taking the EventIds from AviationData.xml and collecting two additional pieces of data for each event:
- narrative: This is the written narrative of an incident.
- probable_cause: If the full narrative includes a "probable cause" statement, it is in this field.

The Task

Explore the data and prepare a 1/2 to 1 page written overview of your findings. Submit all your work (preprocessing/processing/analysis) in well documented code. Submit at least one visualization that supports your analysis and findings.

Additional Context:

Assume the audience for your write-up is a non-technical stakeholder.
Assume the audience for your code is a colleague who may need to read or modify it in the future.
Please document your code as appropriate

Possible approaches

This data is ripe for exploration. Working with the files provided, here are a few structured tasks that you can attempt. Note: these are suggested data exploration tasks. Feel free to complete none, some, or all of these steps as you explore the data.

This is a sample approach. Please feel free to come up with your own process.

Consume the data from files into datastructure/database you wish to work with
Determine what fields you have available for each incident record in the structured AviationData.xml
Perform initial exploratory analysis of the narrative text. What, if anything, can you learn from analyzing the text that you cannot learn from the structured data fields?
Perform initial exploratory analysis of the narrative text and determine an analytical question you would like to answer with this data.
Perform common preprocessing tasks ( removal of common stopwords, parsing, etc) as required by the analytic question you would like to explore. (i.e. what are the most frequent terms in the narrative text? Do the word frequencies change over time?)
Choose a text representation method for your analysis
Use topic modeling to cluster the incidents based on the narrative text and/or probable cause descriptions. How do the clusters differ from those created with the structured data fields?
Imagine you have a client with a fear of flying. That client wants to know what types of flights present the most risk of an incident. Use your findings from the earlier tasks to answer the client's question in a report no longer than 1 page.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
data		data
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Code Test for Software Engineers interested in NLP

Project Description

Some guidance

The Data

The Task

Possible approaches

About

Uh oh!

Releases

Packages

angrbrd/code-test.nlp.301

Folders and files

Latest commit

History

Repository files navigation

Code Test for Software Engineers interested in NLP

Project Description

Some guidance

The Data

The Task

Possible approaches

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages