Merged
Changes from 1 commit
Commits
53 commits
2933930
ENH add hash based unique
adrinjalali Mar 14, 2024
8da2e72
getting closer
adrinjalali Mar 14, 2024
961ef5b
trying to expose as a module
adrinjalali Mar 15, 2024
a6b1847
trying to create a module
adrinjalali Mar 17, 2024
1f1c36c
fix build
adrinjalali Mar 18, 2024
0bc43c3
...
adrinjalali Mar 18, 2024
f94cf89
Merge remote-tracking branch 'upstream/main' into unique-cpp
adrinjalali Mar 18, 2024
a37b151
segfault fix, imported numpy
adrinjalali Mar 19, 2024
8f42b0b
getting unique back, refcount issues exist
adrinjalali Mar 19, 2024
f56634f
unique works
adrinjalali Mar 19, 2024
bce7534
remove header
adrinjalali Mar 19, 2024
0c6c588
cleanups and comments
adrinjalali Mar 21, 2024
9b7d6f6
change type
adrinjalali Mar 21, 2024
85cf692
fix for initialization issue
adrinjalali Mar 21, 2024
3db3349
trying to move module
adrinjalali Mar 22, 2024
8d4b6be
Merge remote-tracking branch 'upstream/main' into unique-cpp
adrinjalali May 13, 2024
9e7d671
Revert "trying to move module"
adrinjalali May 13, 2024
a4e8a29
use unordered_set and use a finally construct to handle exceptions
adrinjalali Sep 19, 2024
6a8c69c
Merge remote-tracking branch 'upstream/main' into unique-cpp
adrinjalali Sep 19, 2024
0c3b889
handle C++ exceptions, and use explicit types
adrinjalali Sep 20, 2024
cc39a50
make it C importable
adrinjalali Sep 23, 2024
92adb26
add missing header file
adrinjalali Sep 23, 2024
8b7ad2e
fix skip API test
adrinjalali Sep 23, 2024
8f95240
rename _core.unique to _core._unique
adrinjalali Sep 23, 2024
1d0c596
use _unique name
adrinjalali Sep 23, 2024
ed4ea89
Merge remote-tracking branch 'upstream/main' into unique-cpp
adrinjalali Jan 9, 2025
8adbf70
add freethreaded slot
adrinjalali Jan 9, 2025
a8e69ff
apply comments from review
adrinjalali Jan 14, 2025
cdf3af9
Merge remote-tracking branch 'upstream/main' into unique-cpp
adrinjalali Jan 14, 2025
fc1d50e
remove own module, fix segfault
adrinjalali Jan 14, 2025
5dbdf48
raise NotImplementedError instead of returning None
adrinjalali Jan 14, 2025
c8b9d22
release and regrab GIL
adrinjalali Jan 19, 2025
a9df742
fix GIL issues and compile separately
adrinjalali Jan 20, 2025
5333e80
Merge remote-tracking branch 'upstream/main' into unique-cpp
adrinjalali Jan 20, 2025
3bb7c97
add np_core_dep dependency, hoping it fixes the issue
adrinjalali Jan 23, 2025
95a577b
remove include
adrinjalali Jan 23, 2025
724b794
debug ...
adrinjalali Jan 23, 2025
1abc6b5
debug ...
adrinjalali Jan 23, 2025
214cd06
Py_INCREF needs the GIL
adrinjalali Jan 23, 2025
8c184ab
Merge remote-tracking branch 'upstream/main' into unique-cpp
adrinjalali Jan 23, 2025
113e021
revert debug info in CI
adrinjalali Jan 23, 2025
c733f75
Merge highway submodule changes from main
seberg Jan 24, 2025
ae0e936
Merge remote-tracking branch 'upstream/main' into unique-cpp
adrinjalali Feb 1, 2025
b50e7f3
reviews
adrinjalali Feb 20, 2025
712a5cf
add test for ValueError
adrinjalali Feb 20, 2025
8a45f04
changelog
adrinjalali Feb 20, 2025
4999daa
Merge remote-tracking branch 'upstream/main' into unique-cpp
adrinjalali Feb 20, 2025
ab86574
use macro to return notimplemented
adrinjalali Feb 21, 2025
08d7d62
Merge remote-tracking branch 'upstream/main' into unique-cpp
adrinjalali Feb 22, 2025
e1e2ddf
Apply suggestions from code review
seberg Feb 25, 2025
f96411a
MAINT,ENH: Smaller reorgs/maint and use `sorted=False` for `unique_va…
seberg Feb 25, 2025
2319947
Ensure we don't iterate if iterator is empty (also change thread stat…
seberg Feb 25, 2025
e188bf3
DOC: unique_values doc examples may have different order now
seberg Feb 25, 2025
getting closer
adrinjalali committed Mar 14, 2024
commit 8da2e72a0a5a0344e40645678d95f06b816fd12e
143 changes: 35 additions & 108 deletions numpy/_core/src/multiarray/unique.cpp
@@ -1,11 +1,9 @@
#define NPY_NO_DEPRECATED_API NPY_API_VERSION

#include <ctime>
#include <unordered_map>
#include <map>
#include <vector>
#include <random>
#include <iostream>
#include <string>

#define _MULTIARRAYMODULE
#include "numpy/ndarraytypes.h"
@@ -15,99 +13,15 @@

#include "numpy/npy_2_compat.h"


template <typename T>
T *random_data(std::size_t size, std::size_t max, T type)
{
std::random_device dev;
std::mt19937 rng(dev());
std::uniform_int_distribution<std::mt19937::result_type> rnd(0, max);
T *res = new T[size];
for (std::size_t i = 0; i < size; i++)
{
res[i] = rnd(rng);
}
return res;
}

void process_args(int argc, char *argv[], std::string &alg, std::size_t &size, std::size_t &max)
{
if (argc != 4)
{
std::cerr << "Usage: " << argv[0] << " {hash,rbt} <size> <max>" << std::endl;
std::exit(1);
}
alg = argv[1];
size = (std::size_t)std::stoi(argv[2]);
max = (std::size_t)std::stoi(argv[3]);
}

template <typename ContainerType, typename DataType>
std::vector<DataType> _unique(ContainerType &container, DataType *data, std::size_t size)
{
for (std::size_t i = 0; i < size; i++)
container[data[i]] = 0;

std::vector<DataType> res;
res.reserve(container.size());
for (auto it = container.begin(); it != container.end(); it++)
res.emplace_back(it->first);

return res;
}

template <typename T>
std::vector<T> unique(std::string &alg, T *data, std::size_t size)
{
if (alg == "hash")
{
std::unordered_map<T, char> umap;
return _unique(umap, data, size);
}
else if (alg == "rbt")
{
std::map<T, char> map;
return _unique(map, data, size);
}
else
{
std::cerr << "Unknown algorithm: " << alg << std::endl;
std::exit(1);
}
}

NPY_NO_EXPORT npy_intp
PyArray_Unique(PyArrayObject *self)
template<typename T>
npy_intp unique(PyArrayObject *self)
{
/* Nonzero boolean function */
// PyArray_NonzeroFunc* nonzero = PyDataType_GetArrFuncs(PyArray_DESCR(self))->nonzero;

NpyIter* iter;
NpyIter_IterNextFunc *iternext;
char** dataptr;
npy_intp nonzero_count;
npy_intp* strideptr,* innersizeptr;
std::unordered_map<T, char> hashmap;


Just curious, would an unordered_set not be enough?
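
To illustrate the question: the `char` mapped value in the hash map is never read, so a set carries the same information. Below is a minimal standalone sketch, assuming a contiguous input buffer; the helper name `unique_values` is hypothetical and not taken from the PR.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not from the PR): collect the distinct values of a
// contiguous buffer with a hash set. The order of the result is unspecified.
template <typename T>
std::vector<T> unique_values(const T *data, std::size_t size)
{
    std::unordered_set<T> seen;
    for (std::size_t i = 0; i < size; i++) {
        seen.insert(data[i]);  // inserting a duplicate is a no-op
    }
    return std::vector<T>(seen.begin(), seen.end());
}
```

Compared with `std::unordered_map<T, char>`, the set stores no placeholder value per key, and `insert` states the intent directly.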


/* Handle zero-sized arrays specially */
if (PyArray_SIZE(self) == 0) {
return 0;
}

/*
* Create and use an iterator to count the nonzeros.
* flag NPY_ITER_READONLY
* - The array is never written to.
* flag NPY_ITER_EXTERNAL_LOOP
* - Inner loop is done outside the iterator for efficiency.
* flag NPY_ITER_NPY_ITER_REFS_OK
* - Reference types are acceptable.
* order NPY_KEEPORDER
* - Visit elements in memory order, regardless of strides.
* This is good for performance when the specific order
* elements are visited is unimportant.
* casting NPY_NO_CASTING
* - No casting is required for this operation.
*/
iter = NpyIter_New(self, NPY_ITER_READONLY|
NPY_ITER_EXTERNAL_LOOP|
NPY_ITER_REFS_OK,
@@ -133,38 +47,51 @@ PyArray_Unique(PyArrayObject *self)
/* The location of the inner loop size which the iterator may update */
innersizeptr = NpyIter_GetInnerLoopSizePtr(iter);

sum = 0;
std::cout << "printing values: " << std::endl;
do {
/* Get the inner loop data/stride/count values */
char* data = *dataptr;
npy_intp stride = *strideptr;
npy_intp count = *innersizeptr;
npy_intp size = PyArray_ITEMSIZE(self);
/* This is a typical inner loop for NPY_ITER_EXTERNAL_LOOP */

while (count--) {
if (nonzero(data, self)) {
++nonzero_count;
}
std::cout << (T)* data << std::endl;
hashmap[(T)* data] = 0;
data += stride;
}

/* Increment the iterator to the next inner loop */
} while(iternext(iter));

NpyIter_Deallocate(iter);
std::vector<T> res;
std::cout << "unique values :" << std::endl;
res.reserve(hashmap.size());
for (auto it = hashmap.begin(); it != hashmap.end(); it++) {
res.emplace_back(it->first);
std::cout << it->first << std::endl;
}
Member


If we copy it over manually anyway (and don't already build the result, which makes sense), then we should allocate the array first.

That also ensures we use the custom allocator a user may have set.

Contributor Author


Not sure if I understand, isn't the PyArray_NewFromDescr doing that just below?

Member


The point is that you are using T* res = new T[hashset.size()]; and then binding it to the array.

That works, but it is really the wrong way around: We should create the (empty) array with data=NULL and then use T *res = <T *>PyArray_DATA((PyArrayObject *)arr);.

After you swap the order and use NULL for the data, you need to check whether res_obj is NULL and, if so, return NULL early. (Even the current code is technically wrong, because it leaks res.)
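
The ordering described here (allocate the output first, return early on failure, then fill its storage in place) can be sketched without the NumPy C API. Everything below is an assumption made for illustration: `OutputBuffer` is a hypothetical stand-in for the result array, its `data()` plays the role of `PyArray_DATA`, and the `false` return stands in for returning NULL.

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <unordered_set>

// Hypothetical stand-in for the result array: it owns its storage, the way
// an array created with data=NULL would own its own buffer.
template <typename T>
struct OutputBuffer {
    std::unique_ptr<T[]> storage;
    std::size_t size = 0;
    T *data() { return storage.get(); }
};

// Allocate the output first, bail out early on failure, then fill in place.
// Returns false if the allocation failed; nothing is leaked in that case.
template <typename T>
bool fill_unique(const std::unordered_set<T> &hashset, OutputBuffer<T> &out)
{
    out.size = 0;
    out.storage.reset(new (std::nothrow) T[hashset.size()]);
    if (!out.storage) {
        return false;  // analogous to the early "return NULL"
    }
    out.size = hashset.size();
    T *res = out.data();  // write directly into the output's own storage
    std::size_t i = 0;
    for (const T &value : hashset) {
        res[i++] = value;
    }
    return true;
}
```

Because the output owns the buffer from the start, no separately allocated `res` exists that could leak on an error path.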


return nonzero_count;
NpyIter_Deallocate(iter);
return 0;
}

int main(int argc, char *argv[])
NPY_NO_EXPORT npy_intp
PyArray_Unique(PyArrayObject *self)
{
std::size_t size, max;
std::string alg;
process_args(argc, argv, alg, size, max);
double sample = 0;
double *data = random_data(size, max, sample);
const clock_t begin_time = clock();
std::vector<double> unique_values = unique(alg, data, size);
std::cout << float( clock () - begin_time ) / CLOCKS_PER_SEC;
delete data;
npy_intp itemsize;

/* Handle zero-sized arrays specially */
if (PyArray_SIZE(self) == 0) {
return 0;
}

itemsize = PyArray_ITEMSIZE(self);
std::cout << "Item size: " << itemsize << std::endl;

if (sizeof(char) == itemsize) {
unique<char>(self);
} else if (sizeof(int) == itemsize) {
unique<int>(self);
}
return 0;
}
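
A note on the inner loop in this commit: `(T)* data` dereferences the `char *` first and then converts that single byte to `T`, so for `T = int` only one byte of each element is read. Below is a minimal standalone sketch of reading whole elements from a strided byte buffer, assuming native byte order and a stride given in bytes; the helper name `insert_strided` is hypothetical and not part of the PR.

```cpp
#include <cstddef>
#include <cstring>
#include <unordered_set>

// Hypothetical helper (not from the PR): walk `count` elements that start at
// `data` and lie `stride` bytes apart, inserting each one into `seen`.
template <typename T>
void insert_strided(const char *data, std::ptrdiff_t stride, std::size_t count,
                    std::unordered_set<T> &seen)
{
    while (count--) {
        T value;
        // Copy all sizeof(T) bytes; memcpy also avoids unaligned access.
        std::memcpy(&value, data, sizeof(T));
        seen.insert(value);
        data += stride;
    }
}
```

Using `std::memcpy` rather than a pointer cast keeps the read well defined even when the element is not aligned for `T`.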