Spaces:
Runtime error
Runtime error
# Thrust 1.9.10-1 (NVIDIA HPC SDK 20.7, CUDA Toolkit 11.1) | |
## Summary | |
Thrust 1.9.10-1 is the minor release accompanying the NVIDIA HPC SDK 20.7 release | |
and the CUDA Toolkit 11.1 release. | |
## Bug Fixes | |
- #1214, NVBug 200619442: Stop using `std::allocator` APIs deprecated in C++17. | |
- #1216, NVBug 200540293: Make `thrust::optional` work with Clang when used | |
with older libstdc++. | |
- #1207, NVBug 200618218: Don't force C++14 with older compilers that don't | |
support it. | |
- #1218: Wrap includes of `<memory>` and `<algorithm>` to avoid circular | |
inclusion with NVC++. | |
# Thrust 1.9.10 (NVIDIA HPC SDK 20.5) | |
## Summary | |
Thrust 1.9.10 is the release accompanying the NVIDIA HPC SDK 20.5 release. | |
It adds CMake support for compilation with NVC++ and a number of minor bug fixes | |
for NVC++. | |
It also adds CMake `find_package` support, which replaces the broken 3rd-party | |
legacy `FindThrust.cmake` script. | |
C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated. | |
Starting with the upcoming 1.10.0 release, C++03 support will be dropped | |
entirely. | |
## Breaking Changes | |
- #1082: Thrust now checks that it is compatible with the version of CUB found | |
in your include path, generating an error if it is not. | |
If you are using your own version of CUB, it may be too old. | |
It is recommended to simply delete your own version of CUB and use the | |
version of CUB that comes with Thrust. | |
- #1089: C++03 and C++11 are deprecated. | |
Using these dialects will generate a compile-time warning. | |
These warnings can be suppressed by defining | |
`THRUST_IGNORE_DEPRECATED_CPP_DIALECT` (to suppress C++03 and C++11 | |
deprecation warnings) or `THRUST_IGNORE_DEPRECATED_CPP11` (to suppress C++11 | |
deprecation warnings). | |
Suppression is only a short term solution. | |
We will be dropping support for C++03 in the 1.10.0 release and C++11 in the | |
near future. | |
- #1089: GCC < 5, Clang < 6, and MSVC < 2017 are deprecated. | |
Using these compilers will generate a compile-time warning. | |
These warnings can be suppressed by defining | |
`THRUST_IGNORE_DEPRECATED_COMPILER`. | |
Suppression is only a short term solution. | |
We will be dropping support for these compilers in the near future. | |
## New Features | |
- #1130: CMake `find_package` support. | |
This is significant because there is a legacy `FindThrust.cmake` script | |
authored by a third party in widespread use in the community which has a | |
bug in how it parses Thrust version numbers which will cause it to | |
incorrectly parse 1.9.10. | |
This script only handles the first digit of each part of the Thrust version | |
number correctly: for example, Thrust 17.17.17 would be interpreted as | |
Thrust 1.1.1701717. | |
You can find directions for using the new CMake `find_package` support and | |
migrating away from the legacy `FindThrust.cmake` [here](https://github.com/thrust/thrust/blob/master/thrust/cmake/README.md) | |
- #1129: Added `thrust::detail::single_device_tls_caching_allocator`, a | |
convenient way to get an MR caching allocator for device memory, which is | |
used by NVC++. | |
## Other Enhancements | |
- #1129: Refactored RDC handling in CMake to be a global option and not create | |
two targets for each example and test. | |
## Bug Fixes | |
- #1129: Fix the legacy `thrust::return_temporary_buffer` API to support | |
passing a size. | |
This was necessary to enable usage of Thrust caching MR allocators with | |
synchronous Thrust algorithms. | |
This change has allowed NVC++’s C++17 Parallel Algorithms implementation to | |
switch to use Thrust caching MR allocators for device temporary storage, | |
which gives a 2x speedup on large multi-GPU systems such as V100 and A100 | |
DGX where `cudaMalloc` is very slow. | |
- #1128: Respect `CUDA_API_PER_THREAD_DEFAULT_STREAM`. | |
Thanks to Rong Ou for this contribution. | |
- #1131: Fix the one-policy overload of `thrust::async::copy` to not copy the | |
policy, resolving use-afer-move issues. | |
- #1145: When cleaning up type names in `unittest::base_class_name`, only call | |
`std::string::replace` if we found the substring we are looking to replace. | |
- #1139: Don't use `cxx::__demangle` in NVC++. | |
- #1102: Don't use `thrust::detail::normal_distribution_nvcc` for Feta because | |
it uses `erfcinv`, a non-standard function that Feta doesn't have. | |
# Thrust 1.9.9 (CUDA Toolkit 11.0) | |
## Summary | |
Thrust 1.9.9 adds support for NVC++, which uses Thrust to implement | |
GPU-accelerated C++17 Parallel Algorithms. | |
`thrust::zip_function` and `thrust::shuffle` were also added. | |
C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated. | |
Starting with the upcoming 1.10.0 release, C++03 support will be dropped | |
entirely. | |
All other deprecated platforms will be dropped in the near future. | |
## Breaking Changes | |
- #1082: Thrust now checks that it is compatible with the version of CUB found | |
in your include path, generating an error if it is not. | |
If you are using your own version of CUB, it may be too old. | |
It is recommended to simply delete your own version of CUB and use the | |
version of CUB that comes with Thrust. | |
- #1089: C++03 and C++11 are deprecated. | |
Using these dialects will generate a compile-time warning. | |
These warnings can be suppressed by defining | |
`THRUST_IGNORE_DEPRECATED_CPP_DIALECT` (to suppress C++03 and C++11 | |
deprecation warnings) or `THRUST_IGNORE_DEPRECATED_CPP_11` (to suppress C++11 | |
deprecation warnings). | |
Suppression is only a short term solution. | |
We will be dropping support for C++03 in the 1.10.0 release and C++11 in the | |
near future. | |
- #1089: GCC < 5, Clang < 6, and MSVC < 2017 are deprecated. | |
Using these compilers will generate a compile-time warning. | |
These warnings can be suppressed by defining | |
`THRUST_IGNORE_DEPRECATED_COMPILER`. | |
Suppression is only a short term solution. | |
We will be dropping support for these compilers in the near future. | |
## New Features | |
- #1086: Support for NVC++ aka "Feta". | |
The most significant change is in how we use `__CUDA_ARCH__`. | |
Now, there are four macros that must be used: | |
- `THRUST_IS_DEVICE_CODE`, which should be used in an `if` statement around | |
device-only code. | |
- `THRUST_INCLUDE_DEVICE_CODE`, which should be used in an `#if` preprocessor | |
directive inside of the `if` statement mentioned in the prior bullet. | |
- `THRUST_IS_HOST_CODE`, which should be used in an `if` statement around | |
host-only code. | |
- `THRUST_INCLUDE_HOST_CODE`, which should be used in an `#if` preprocessor | |
directive inside of the `if` statement mentioned in the prior bullet. | |
- #1085: `thrust::shuffle`. | |
Thanks to Rory Mitchell for this contribution. | |
- #1029: `thrust::zip_function`, a facility for zipping functions that take N | |
parameters instead of a tuple of N parameters as `thrust::zip_iterator` | |
does. | |
Thanks to Ben Jude for this contribution. | |
- #1068: `thrust::system::cuda::managed_memory_pointer`, a universal memory | |
strongly typed pointer compatible with the ISO C++ Standard Library. | |
## Other Enhancements | |
- #1029: Thrust is now built and tested with NVCC warnings treated as errors. | |
- #1029: MSVC C++11 support. | |
- #1029: `THRUST_DEPRECATED` abstraction for generating compile-time | |
deprecation warning messages. | |
- #1029: `thrust::pointer<T>::pointer_to(reference)`. | |
- #1070: Unit test for `thrust::inclusive_scan` with a user defined types. | |
Thanks to Conor Hoekstra for this contribution. | |
## Bug Fixes | |
- #1088: Allow `thrust::replace` to take functions that have non-`const` | |
`operator()`. | |
- #1094: Add missing `constexpr` to `par_t` constructors. | |
Thanks to Patrick Stotko for this contribution. | |
- #1077: Remove `__device__` from CUDA MR-based device allocators to fix | |
obscure "host function called from host device function" warning that occurs | |
when you use the new Thrust MR-based allocators. | |
- #1029: Remove inconsistently-used `THRUST_BEGIN`/`END_NS` macros. | |
- #1029: Fix C++ dialect detection on newer MSVC. | |
- #1029 Use `_Pragma`/`__pragma` instead of `#pragma` in macros. | |
- #1029: Replace raw `__cplusplus` checks with the appropriate Thrust macros. | |
- #1105: Add a missing `<math.h>` include. | |
- #1103: Fix regression of `thrust::detail::temporary_allocator` with non-CUDA | |
back ends. | |
- #1111: Use Thrust's random number engine instead of `std::`s in device code. | |
- #1108: Get rid of a GCC 9 warning about deprecated generation of copy ctors. | |
# Thrust 1.9.8-1 (NVIDIA HPC SDK 20.3) | |
## Summary | |
Thrust 1.9.8-1 is a variant of 1.9.8 accompanying the NVIDIA HPC SDK 20.3 | |
release. | |
It contains modifications necessary to serve as the implementation of NVC++'s | |
GPU-accelerated C++17 Parallel Algorithms when using the CUDA Toolkit 11.0 | |
release. | |
# Thrust 1.9.8 (CUDA Toolkit 11.0 Early Access) | |
## Summary | |
Thrust 1.9.8, which is included in the CUDA Toolkit 11.0 release, removes | |
Thrust's internal derivative of CUB, upstreams all relevant changes too CUB, | |
and adds CUB as a Git submodule. | |
It will now be necessary to do `git clone --recursive` when checking out | |
Thrust, and to update the CUB submodule when pulling in new Thrust changes. | |
Additionally, CUB is now included as a first class citizen in the CUDA toolkit. | |
Thrust 1.9.8 also fixes bugs preventing most Thrust algorithms from working | |
with more than `2^31-1` elements. | |
Now, `thrust::reduce`, `thrust::*_scan`, and related algorithms (aka most of | |
Thrust) work with large element counts. | |
## Breaking Changes | |
- Thrust will now use the version of CUB in your include path instead of its own | |
internal copy. | |
If you are using your own version of CUB, it may be older and incompatible | |
with Thrust. | |
It is recommended to simply delete your own version of CUB and use the | |
version of CUB that comes with Thrust. | |
## Other Enhancements | |
- Refactor Thrust and CUB to support 64-bit indices in most algorithms. | |
In most cases, Thrust now selects between kernels that use 32-bit indices and | |
64-bit indices at runtime depending on the size of the input. | |
This means large element counts work, but small element counts do not have to | |
pay for the register usage of 64-bit indices if they are not needed. | |
Now, `thrust::reduce`, `thrust::*_scan`, and related algorithms (aka most of | |
Thrust) work with more than `2^31-1` elements. | |
Notably, `thrust::sort` is still limited to less than `2^31-1` elements. | |
- CUB is now a submodule and the internal copy of CUB has been removed. | |
- #1051: Stop specifying the `__launch_bounds__` minimum blocks parameter | |
because it messes up register allocation and increases register pressure, | |
and we don't actually know at compile time how many blocks we will use | |
(aside from single tile kernels). | |
## Bug Fixes | |
- #1020: After making a CUDA API call, always clear the global CUDA error state | |
by calling `cudaGetLastError`. | |
- #1021: Avoid calling destroy in the destructor of a Thrust vector if the | |
vector is empty. | |
- #1046: Actually throw `thrust::bad_alloc` when `thrust::system::cuda::malloc` | |
fails instead of just constructing a temporary and doing nothing with it. | |
- Add missing copy constructor or copy assignment operator to all classes that | |
GCC 9's `-Wdeprecated-copy` complains about | |
- Add missing move operations to `thrust::system::cuda::vector`. | |
- #1015: Check that the backend is CUDA before using CUDA-specifics in | |
`thrust::detail::temporary_allocator`. | |
Thanks to Hugh Winkler for this contribution. | |
- #1055: More correctly detect the presence of aligned/sized `new`/`delete`. | |
- #1043: Fix ill-formed specialization of `thrust::system::is_error_code_enum` | |
for `thrust::event_errc`. | |
Thanks to Toru Niina for this contribution. | |
- #1027: Add tests for `thrust::tuple_for_each` and `thrust::tuple_subset`. | |
Thanks to Ben Jude for this contribution. | |
- #1027: Use correct macro in `thrust::tuple_for_each`. | |
Thanks to Ben Jude for this contribution. | |
- #1026: Use correct MSVC version formatting in CMake. | |
Thanks to Ben Jude for this contribution. | |
- Workaround an NVCC issue with type aliases with template template arguments | |
containing a parameter pack. | |
- Remove unused functions from the CUDA backend which call slow CUDA attribute | |
query APIs. | |
- Replace `CUB_RUNTIME_FUNCTION` with `THRUST_RUNTIME_FUNCTION`. | |
- Correct typo in `thrust::transform` documentation. | |
Thanks to Eden Yefet for this contribution. | |
## Known Issues | |
- `thrust::sort` remains limited to `2^31-1` elements for now. | |
# Thrust 1.9.7-1 (CUDA Toolkit 10.2 for Tegra) | |
## Summary | |
Thrust 1.9.7-1 is a minor release accompanying the CUDA Toolkit 10.2 release | |
for Tegra. | |
It is nearly identical to 1.9.7. | |
## Bug Fixes | |
- Remove support for GCC's broken nodiscard-like attribute. | |
# Thrust 1.9.7 (CUDA Toolkit 10.2) | |
## Summary | |
Thrust 1.9.7 is a minor release accompanying the CUDA Toolkit 10.2 release. | |
Unfortunately, although the version and patch numbers are identical, one bug | |
fix present in Thrust 1.9.7 (NVBug 2646034: Fix incorrect dependency handling | |
for stream acquisition in `thrust::future`) was not included in the CUDA | |
Toolkit 10.2 preview release for AArch64 SBSA. | |
The tag `cuda-10.2aarch64sbsa` contains the exact version of Thrust present | |
in the CUDA Toolkit 10.2 preview release for AArch64 SBSA. | |
## Bug Fixes | |
- #967, NVBug 2448170: Fix the CUDA backend `thrust::for_each` so that it | |
supports large input sizes with 64-bit indices. | |
- NVBug 2646034: Fix incorrect dependency handling for stream acquisition in | |
`thrust::future`. | |
- Not present in the CUDA Toolkit 10.2 preview release for AArch64 SBSA. | |
- #968, NVBug 2612102: Fix the `thrust::mr::polymorphic_adaptor` to actually | |
use its template parameter. | |
# Thrust 1.9.6-1 (NVIDIA HPC SDK 20.3) | |
## Summary | |
Thrust 1.9.6-1 is a variant of 1.9.6 accompanying the NVIDIA HPC SDK 20.3 | |
release. | |
It contains modifications necessary to serve as the implementation of NVC++'s | |
GPU-accelerated C++17 Parallel Algorithms when using the CUDA Toolkit 10.1 | |
Update 2 release. | |
# Thrust 1.9.6 (CUDA Toolkit 10.1 Update 2) | |
## Summary | |
Thrust 1.9.6 is a minor release accompanying the CUDA Toolkit 10.1 Update 2 | |
release. | |
## Bug Fixes | |
- NVBug 2509847: Inconsistent alignment of `thrust::complex` | |
- NVBug 2586774: Compilation failure with Clang + older libstdc++ that doesn't | |
have `std::is_trivially_copyable` | |
- NVBug 200488234: CUDA header files contain Unicode characters which leads | |
compiling errors on Windows | |
- #949, #973, NVBug 2422333, NVBug 2522259, NVBug 2528822: | |
`thrust::detail::aligned_reinterpret_cast` must be annotated with | |
`__host__ __device__`. | |
- NVBug 2599629: Missing include in the OpenMP sort implementation | |
- NVBug 200513211: Truncation warning in test code under VC142 | |
# Thrust 1.9.5 (CUDA Toolkit 10.1 Update 1) | |
## Summary | |
Thrust 1.9.5 is a minor release accompanying the CUDA Toolkit 10.1 Update 1 | |
release. | |
## Bug Fixes | |
- NVBug 2502854: Fixed assignment of | |
`thrust::device_vector<thrust::complex<T>>` between host and device. | |
# Thrust 1.9.4 (CUDA Toolkit 10.1) | |
## Summary | |
Thrust 1.9.4 adds asynchronous interfaces for parallel algorithms, a new | |
allocator system including caching allocators and unified memory support, as | |
well as a variety of other enhancements, mostly related to | |
C++11/C++14/C++17/C++20 support. | |
The new asynchronous algorithms in the `thrust::async` namespace return | |
`thrust::event` or `thrust::future` objects, which can be waited upon to | |
synchronize with the completion of the parallel operation. | |
## Breaking Changes | |
Synchronous Thrust algorithms now block until all of their operations have | |
completed. | |
Use the new asynchronous Thrust algorithms for non-blocking behavior. | |
## New Features | |
- `thrust::event` and `thrust::future<T>`, uniquely-owned asynchronous handles | |
consisting of a state (ready or not ready), content (some value; for | |
`thrust::future` only), and an optional set of objects that should be | |
destroyed only when the future's value is ready and has been consumed. | |
- The design is loosely based on C++11's `std::future`. | |
- They can be `.wait`'d on, and the value of a future can be waited on and | |
retrieved with `.get` or `.extract`. | |
- Multiple `thrust::event`s and `thrust::future`s can be combined with | |
`thrust::when_all`. | |
- `thrust::future`s can be converted to `thrust::event`s. | |
- Currently, these primitives are only implemented for the CUDA backend and | |
are C++11 only. | |
- New asynchronous algorithms that return `thrust::event`/`thrust::future`s, | |
implemented as C++20 range style customization points: | |
- `thrust::async::reduce`. | |
- `thrust::async::reduce_into`, which takes a target location to store the | |
reduction result into. | |
- `thrust::async::copy`, including a two-policy overload that allows | |
explicit cross system copies which execution policy properties can be | |
attached to. | |
- `thrust::async::transform`. | |
- `thrust::async::for_each`. | |
- `thrust::async::stable_sort`. | |
- `thrust::async::sort`. | |
- By default the asynchronous algorithms use the new caching allocators. | |
Deallocation of temporary storage is deferred until the destruction of | |
the returned `thrust::future`. The content of `thrust::future`s is | |
stored in either device or universal memory and transferred to the host | |
only upon request to prevent unnecessary data migration. | |
- Asynchronous algorithms are currently only implemented for the CUDA | |
system and are C++11 only. | |
- `exec.after(f, g, ...)`, a new execution policy method that takes a set of | |
`thrust::event`/`thrust::future`s and returns an execution policy that | |
operations on that execution policy should depend upon. | |
- New logic and mindset for the type requirements for cross-system sequence | |
copies (currently only used by `thrust::async::copy`), based on: | |
- `thrust::is_contiguous_iterator` and `THRUST_PROCLAIM_CONTIGUOUS_ITERATOR` | |
for detecting/indicating that an iterator points to contiguous storage. | |
- `thrust::is_trivially_relocatable` and | |
`THRUST_PROCLAIM_TRIVIALLY_RELOCATABLE` for detecting/indicating that a | |
type is `memcpy`able (based on principles from | |
[P1144](https://wg21.link/P1144)). | |
- The new approach reduces buffering, increases performance, and increases | |
correctness. | |
- The fast path is now enabled when copying CUDA `__half` and vector types with | |
`thrust::async::copy`. | |
- All Thrust synchronous algorithms for the CUDA backend now actually | |
synchronize. Previously, any algorithm that did not allocate temporary | |
storage (counterexample: `thrust::sort`) and did not have a | |
computation-dependent result (counterexample: `thrust::reduce`) would | |
actually be launched asynchronously. Additionally, synchronous algorithms | |
that allocated temporary storage would become asynchronous if a custom | |
allocator was supplied that did not synchronize on allocation/deallocation, | |
unlike `cudaMalloc`/`cudaFree`. So, now `thrust::for_each`, | |
`thrust::transform`, `thrust::sort`, etc are truly synchronous. In some | |
cases this may be a performance regression; if you need asynchrony, use the | |
new asynchronous algorithms. | |
- Thrust's allocator framework has been rewritten. It now uses a memory | |
resource system, similar to C++17's `std::pmr` but supporting static | |
polymorphism. Memory resources are objects that allocate untyped storage and | |
allocators are cheap handles to memory resources in this new model. The new | |
facilities live in `<thrust/mr/*>`. | |
- `thrust::mr::memory_resource<Pointer>`, the memory resource base class, | |
which takes a (possibly tagged) pointer to `void` type as a parameter. | |
- `thrust::mr::allocator<T, MemoryResource>`, an allocator backed by a memory | |
resource object. | |
- `thrust::mr::polymorphic_adaptor_resource<Pointer>`, a type-erased memory | |
resource adaptor. | |
- `thrust::mr::polymorphic_allocator<T>`, a C++17-style polymorphic allocator | |
backed by a type-erased memory resource object. | |
- New tunable C++17-style caching memory resources, | |
`thrust::mr::(disjoint_)?(un)?synchronized_pool_resource`, designed to | |
cache both small object allocations and large repetitive temporary | |
allocations. The disjoint variants use separate storage for management of | |
the pool, which is necessary if the memory being allocated cannot be | |
accessed on the host (e.g. device memory). | |
- System-specific allocators were rewritten to use the new memory resource | |
framework. | |
- New `thrust::device_memory_resource` for allocating device memory. | |
- New `thrust::universal_memory_resource` for allocating memory that can be | |
accessed from both the host and device (e.g. `cudaMallocManaged`). | |
- New `thrust::universal_host_pinned_memory_resource` for allocating memory | |
that can be accessed from the host and the device but always resides in | |
host memory (e.g. `cudaMallocHost`). | |
- `thrust::get_per_device_resource` and `thrust::per_device_allocator`, which | |
lazily create and retrieve a per-device singleton memory resource. | |
- Rebinding mechanisms (`rebind_traits` and `rebind_alloc`) for | |
`thrust::allocator_traits`. | |
- `thrust::device_make_unique`, a factory function for creating a | |
`std::unique_ptr` to a newly allocated object in device memory. | |
- `<thrust/detail/memory_algorithms>`, a C++11 implementation of the C++17 | |
uninitialized memory algorithms. | |
- `thrust::allocate_unique` and friends, based on the proposed C++23 | |
[`std::allocate_unique`](https://wg21.link/P0211). | |
- New type traits and metaprogramming facilities. Type traits are slowly being | |
migrated out of `thrust::detail::` and `<thrust/detail/*>`; their new home | |
will be `thrust::` and `<thrust/type_traits/*>`. | |
- `thrust::is_execution_policy`. | |
- `thrust::is_operator_less_or_greater_function_object`, which detects | |
`thrust::less`, `thrust::greater`, `std::less`, and `std::greater`. | |
- `thrust::is_operator_plus_function_object``, which detects `thrust::plus` | |
and `std::plus`. | |
- `thrust::remove_cvref(_t)?`, a C++11 implementation of C++20's | |
`thrust::remove_cvref(_t)?`. | |
- `thrust::void_t`, and various other new type traits. | |
- `thrust::integer_sequence` and friends, a C++11 implementation of C++20's | |
`std::integer_sequence` | |
- `thrust::conjunction`, `thrust::disjunction`, and `thrust::disjunction`, a | |
C++11 implementation of C++17's logical metafunctions. | |
- Some Thrust type traits (such as `thrust::is_constructible`) have been | |
redefined in terms of C++11's type traits when they are available. | |
- `<thrust/detail/tuple_algorithms.h>`, new `std::tuple` algorithms: | |
- `thrust::tuple_transform`. | |
- `thrust::tuple_for_each`. | |
- `thrust::tuple_subset`. | |
- Miscellaneous new `std::`-like facilities: | |
- `thrust::optional`, a C++11 implementation of C++17's `std::optional`. | |
- `thrust::addressof`, an implementation of C++11's `std::addressof`. | |
- `thrust::next` and `thrust::prev`, an implementation of C++11's `std::next` | |
and `std::prev`. | |
- `thrust::square`, a `<functional>` style unary function object that | |
multiplies its argument by itself. | |
- `<thrust/limits.h>` and `thrust::numeric_limits`, a customized version of | |
`<limits>` and `std::numeric_limits`. | |
- `<thrust/detail/preprocessor.h>`, new general purpose preprocessor facilities: | |
- `THRUST_PP_CAT[2-5]`, concatenates two to five tokens. | |
- `THRUST_PP_EXPAND(_ARGS)?`, performs double expansion. | |
- `THRUST_PP_ARITY` and `THRUST_PP_DISPATCH`, tools for macro overloading. | |
- `THRUST_PP_BOOL`, boolean conversion. | |
- `THRUST_PP_INC` and `THRUST_PP_DEC`, increment/decrement. | |
- `THRUST_PP_HEAD`, a variadic macro that expands to the first argument. | |
- `THRUST_PP_TAIL`, a variadic macro that expands to all its arguments after | |
the first. | |
- `THRUST_PP_IIF`, bitwise conditional. | |
- `THRUST_PP_COMMA_IF`, and `THRUST_PP_HAS_COMMA`, facilities for adding and | |
detecting comma tokens. | |
- `THRUST_PP_IS_VARIADIC_NULLARY`, returns true if called with a nullary | |
`__VA_ARGS__`. | |
- `THRUST_CURRENT_FUNCTION`, expands to the name of the current function. | |
- New C++11 compatibility macros: | |
- `THRUST_NODISCARD`, expands to `[[nodiscard]]` when available and the best | |
equivalent otherwise. | |
- `THRUST_CONSTEXPR`, expands to `constexpr` when available and the best | |
equivalent otherwise. | |
- `THRUST_OVERRIDE`, expands to `override` when available and the best | |
equivalent otherwise. | |
- `THRUST_DEFAULT`, expands to `= default;` when available and the best | |
equivalent otherwise. | |
- `THRUST_NOEXCEPT`, expands to `noexcept` when available and the best | |
equivalent otherwise. | |
- `THRUST_FINAL`, expands to `final` when available and the best equivalent | |
otherwise. | |
- `THRUST_INLINE_CONSTANT`, expands to `inline constexpr` when available and | |
the best equivalent otherwise. | |
- `<thrust/detail/type_deduction.h>`, new C++11-only type deduction helpers: | |
- `THRUST_DECLTYPE_RETURNS*`, expand to function definitions with suitable | |
conditional `noexcept` qualifiers and trailing return types. | |
- `THRUST_FWD(x)`, expands to `::std::forward<decltype(x)>(x)`. | |
- `THRUST_MVCAP`, expands to a lambda move capture. | |
- `THRUST_RETOF`, expands to a decltype computing the return type of an | |
invocable. | |
- New CMake build system. | |
## New Examples | |
- `mr_basic` demonstrates how to use the new memory resource allocator system. | |
## Other Enhancements | |
- Tagged pointer enhancements: | |
- New `thrust::pointer_traits` specialization for `void const*`. | |
- `nullptr` support to Thrust tagged pointers. | |
- New `explicit operator bool` for Thrust tagged pointers when using C++11 | |
for `std::unique_ptr` interoperability. | |
- Added `thrust::reinterpret_pointer_cast` and `thrust::static_pointer_cast` | |
for casting Thrust tagged pointers. | |
- Iterator enhancements: | |
- `thrust::iterator_system` is now SFINAE friendly. | |
- Removed cv qualifiers from iterator types when using | |
`thrust::iterator_system`. | |
- Static assert enhancements: | |
- New `THRUST_STATIC_ASSERT_MSG`, takes an optional string constant to be | |
used as the error message when possible. | |
- Update `THRUST_STATIC_ASSERT(_MSG)` to use C++11's `static_assert` when | |
it's available. | |
- Introduce a way to test for static assertions. | |
- Testing enhancements: | |
- Additional scalar and sequence types, including non-builtin types and | |
vectors with unified memory allocators, have been added to the list of | |
types used by generic unit tests. | |
- The generation of random input data has been improved to increase the range | |
of values used and catch more corner cases. | |
- New `unittest::truncate_to_max_representable` utility for avoiding the | |
generation of ranges that cannot be represented by the underlying element | |
type in generic unit test code. | |
- The test driver now synchronizes with CUDA devices and check for errors | |
after each test, when switching devices, and after each raw kernel launch. | |
- The `warningtester` uber header is now compiled with NVCC to avoid needing | |
to disable CUDA-specific code with the preprocessor. | |
- Fixed the unit test framework's `ASSERT_*` to print `char`s as `int`s. | |
- New `DECLARE_INTEGRAL_VARIABLE_UNITTEST` test declaration macro. | |
- New `DECLARE_VARIABLE_UNITTEST_WITH_TYPES_AND_NAME` test declaration macro. | |
- `thrust::system_error` in the CUDA backend now print out its `cudaError_t` | |
enumerator in addition to the diagnostic message. | |
- Stopped using conditionally signed types like `char`. | |
## Bug Fixes | |
- #897, NVBug 2062242: Fix compilation error when using `__device__` lambdas | |
with `thrust::reduce` on MSVC. | |
- #908, NVBug 2089386: Static assert that `thrust::generate`/`thrust::fill` | |
isn't operating on const iterators. | |
- #919 Fix compilation failure with `thrust::zip_iterator` and | |
`thrust::complex`. | |
- #924, NVBug 2096679, NVBug 2315990: Fix dispatch for the CUDA backend's | |
`thrust::reduce` to use two functions (one with the pragma for disabling | |
exec checks, one with `THRUST_RUNTIME_FUNCTION`) instead of one. This fixes | |
a regression with device compilation that started in CUDA Toolkit 9.2. | |
- #928, NVBug 2341455: Add missing `__host__ __device__` annotations to a | |
`thrust::complex::operator=` to satisfy GoUDA. | |
- NVBug 2094642: Make `thrust::vector_base::clear` not depend on the element | |
type being default constructible. | |
- NVBug 2289115: Remove flaky `simple_cuda_streams` example. | |
- NVBug 2328572: Add missing `thrust::device_vector` constructor that takes an | |
allocator parameter. | |
- NVBug 2455740: Update the `range_view` example to not use device-side launch. | |
- NVBug 2455943: Ensure that sized unit tests that use | |
`thrust::counting_iterator` perform proper truncation. | |
- NVBug 2455952: Refactor questionable `thrust::copy_if` unit tests. | |
# Thrust 1.9.3 (CUDA Toolkit 10.0) | |
## Summary | |
Thrust 1.9.3 unifies and integrates CUDA Thrust and GitHub Thrust. | |
## Bug Fixes | |
- #725, #850, #855, #859, #860: Unify the `thrust::iter_swap` interface and fix | |
`thrust::device_reference` swapping. | |
- NVBug 2004663: Add a `data` method to `thrust::detail::temporary_array` and | |
refactor temporary memory allocation in the CUDA backend to be exception | |
and leak safe. | |
- #886, #894, #914: Various documentation typo fixes. | |
- #724: Provide `NVVMIR_LIBRARY_DIR` environment variable to NVCC. | |
- #878: Optimize `thrust::min/max_element` to only use | |
`thrust::detail::get_iterator_value` for non-numeric types. | |
- #899: Make `thrust::cuda::experimental::pinned_allocator`'s comparison | |
operators `const`. | |
- NVBug 2092152: Remove all includes of `<cuda.h>`. | |
- #911: Fix default comparator element type for `thrust::merge_by_key`. | |
## Acknowledgments | |
- Thanks to Andrew Corrigan for contributing fixes for swapping interfaces. | |
- Thanks to Francisco Facioni for contributing optimizations for | |
`thrust::min/max_element`. | |
# Thrust 1.9.2 (CUDA Toolkit 9.2) | |
## Summary | |
Thrust 1.9.2 brings a variety of performance enhancements, bug fixes and test | |
improvements. | |
CUB 1.7.5 was integrated, enhancing the performance of `thrust::sort` on | |
small data types and `thrust::reduce`. | |
Changes were applied to `complex` to optimize memory access. | |
Thrust now compiles with compiler warnings enabled and treated as errors. | |
Additionally, the unit test suite and framework was enhanced to increase | |
coverage. | |
## Breaking Changes | |
- The `fallback_allocator` example was removed, as it was buggy and difficult | |
to support. | |
## New Features | |
- `<thrust/detail/alignment.h>`, utilities for memory alignment: | |
- `thrust::aligned_reinterpret_cast`. | |
- `thrust::aligned_storage_size`, which computes the amount of storage needed | |
for an object of a particular size and alignment. | |
- `thrust::alignment_of`, a C++03 implementation of C++11's | |
`std::alignment_of`. | |
- `thrust::aligned_storage`, a C++03 implementation of C++11's | |
`std::aligned_storage`. | |
- `thrust::max_align_t`, a C++03 implementation of C++11's | |
`std::max_align_t`. | |
## Bug Fixes | |
- NVBug 200385527, NVBug 200385119, NVBug 200385113, NVBug 200349350, NVBug | |
2058778: Various compiler warning issues. | |
- NVBug 200355591: `thrust::reduce` performance issues. | |
- NVBug 2053727: Fixed an ADL bug that caused user-supplied `allocate` to be | |
overlooked but `deallocate` to be called with GCC <= 4.3. | |
- NVBug 1777043: Fixed `thrust::complex` to work with `thrust::sequence`. | |
# Thrust 1.9.1-2 (CUDA Toolkit 9.1) | |
## Summary | |
Thrust 1.9.1-2 integrates version 1.7.4 of CUB and introduces a new CUDA backend | |
for `thrust::reduce` based on CUB. | |
## Bug Fixes | |
- NVBug 1965743: Remove unnecessary static qualifiers. | |
- NVBug 1940974: Fix regression causing a compilation error when using | |
`thrust::merge_by_key` with `thrust::constant_iterator`s. | |
- NVBug 1904217: Allow callables that take non-const refs to be used with | |
`thrust::reduce` and `thrust::*_scan`. | |
# Thrust 1.9.0-5 (CUDA Toolkit 9.0) | |
## Summary | |
Thrust 1.9.0-5 replaces the original CUDA backend (bulk) with a new one | |
written using CUB, a high performance CUDA collectives library. | |
This brings a substantial performance improvement to the CUDA backend across | |
the board. | |
## Breaking Changes | |
- Any code depending on CUDA backend implementation details will likely be | |
broken. | |
## New Features | |
- New CUDA backend based on CUB which delivers substantially higher performance. | |
- `thrust::transform_output_iterator`, a fancy iterator that applies a function | |
to the output before storing the result. | |
## New Examples | |
- `transform_output_iterator` demonstrates use of the new fancy iterator | |
`thrust::transform_output_iterator`. | |
## Other Enhancements | |
- When C++11 is enabled, functors do not have to inherit from | |
`thrust::(unary|binary)_function` anymore to be used with | |
`thrust::transform_iterator`. | |
- Added C++11 only move constructors and move assignment operators for | |
`thrust::detail::vector_base`-based classes, e.g. `thrust::host_vector`, | |
`thrust::device_vector`, and friends. | |
## Bug Fixes | |
- `sin(thrust::complex<double>)` no longer has precision loss to float. | |
## Acknowledgments | |
- Thanks to Manuel Schiller for contributing a C++11 based enhancement | |
regarding the deduction of functor return types, improving the performance | |
of `thrust::unique` and implementing `thrust::transform_output_iterator`. | |
- Thanks to Thibault Notargiacomo for the implementation of move semantics for | |
the `thrust::vector_base`-based classes. | |
- Thanks to Duane Merrill for developing CUB and helping to integrate it into | |
Thrust's backend. | |
# Thrust 1.8.3 (CUDA Toolkit 8.0) | |
## Summary | |
Thrust 1.8.3 is a small bug fix release. | |
## New Examples | |
- `range_view` demonstrates the use of a view (a non-owning wrapper for an | |
iterator range with a container-like interface). | |
## Bug Fixes | |
- `thrust::(min|max|minmax)_element` can now accept raw device pointers when | |
an explicit device execution policy is used. | |
- `thrust::clear` operations on vector types no longer requires the element | |
type to have a default constructor. | |
# Thrust 1.8.2 (CUDA Toolkit 7.5) | |
## Summary | |
Thrust 1.8.2 is a small bug fix release. | |
## Bug Fixes | |
- Avoid warnings and errors concerning user functions called from | |
`__host__ __device__` functions. | |
- #632: Fix an error in `thrust::set_intersection_by_key` with the CUDA backend. | |
- #651: `thrust::copy` between host and device now accepts execution policies | |
with streams attached, i.e. `thrust::::cuda::par.on(stream)`. | |
- #664: `thrust::for_each` and algorithms based on it no longer ignore streams | |
attached to execution policys. | |
## Known Issues | |
- #628: `thrust::reduce_by_key` for the CUDA backend fails for Compute | |
Capability 5.0 devices. | |
# Thrust 1.8.1 (CUDA Toolkit 7.0) | |
## Summary | |
Thrust 1.8.1 is a small bug fix release. | |
## Bug Fixes | |
- #615, #620: Fixed `thrust::for_each` and `thrust::reduce` to no longer fail on | |
large inputs. | |
## Known Issues | |
- #628: `thrust::reduce_by_key` for the CUDA backend fails for Compute | |
Capability 5.0 devices. | |
# Thrust 1.8.0 | |
## Summary | |
Thrust 1.8.0 introduces support for algorithm invocation from CUDA device | |
code, support for CUDA streams, and algorithm performance improvements. | |
Users may now invoke Thrust algorithms from CUDA device code, providing a | |
parallel algorithms library to CUDA programmers authoring custom kernels, as | |
well as allowing Thrust programmers to nest their algorithm calls within | |
functors. | |
The `thrust::seq` execution policy allows users to require sequential algorithm | |
execution in the calling thread and makes a sequential algorithms library | |
available to individual CUDA threads. | |
The `.on(stream)` syntax allows users to request a CUDA stream for kernels | |
launched during algorithm execution. | |
Finally, new CUDA algorithm implementations provide substantial performance | |
improvements. | |
## New Features | |
- Algorithms in CUDA Device Code: | |
- Thrust algorithms may now be invoked from CUDA `__device__` and | |
`__host__` __device__ functions. | |
Algorithms invoked in this manner must be invoked with an execution | |
policy as the first parameter. | |
The following execution policies are supported in CUDA __device__ code: | |
- `thrust::seq` | |
- `thrust::cuda::par` | |
- `thrust::device`, when THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_CUDA. | |
- Device-side algorithm execution may not be parallelized unless CUDA Dynamic | |
Parallelism is available. | |
- Execution Policies: | |
- CUDA Streams | |
- The `thrust::cuda::par.on(stream)` syntax allows users to request that | |
CUDA kernels launched during algorithm execution should occur on a given | |
stream. | |
- Algorithms executed with a CUDA stream in this manner may still | |
synchronize with other streams when allocating temporary storage or | |
returning results to the CPU. | |
- `thrust::seq`, which allows users to require that an algorithm execute | |
sequentially in the calling thread. | |
- `thrust::complex`, a complex number data type. | |
## New Examples | |
- simple_cuda_streams demonstrates how to request a CUDA stream during | |
algorithm execution. | |
- async_reduce demonstrates ways to achieve algorithm invocations which are | |
asynchronous with the calling thread. | |
## Other Enhancements | |
- CUDA sort performance for user-defined types is 300% faster on Tesla K20c for | |
large problem sizes. | |
- CUDA merge performance is 200% faster on Tesla K20c for large problem sizes. | |
- CUDA sort performance for primitive types is 50% faster on Tesla K20c for | |
large problem sizes. | |
- CUDA reduce_by_key performance is 25% faster on Tesla K20c for large problem | |
sizes. | |
- CUDA scan performance is 15% faster on Tesla K20c for large problem sizes. | |
- fallback_allocator example is simpler. | |
## Bug Fixes | |
- #364: Iterators with unrelated system tags may be used with algorithms invoked | |
with an execution policy | |
- #371: Do not redefine `__CUDA_ARCH__`. | |
- #379: Fix crash when dereferencing transform_iterator on the host. | |
- #391: Avoid use of uppercase variable names. | |
- #392: Fix `thrust::copy` between `cusp::complex` and `std::complex`. | |
- #396: Program compiled with gcc < 4.3 hangs during comparison sort. | |
- #406: `fallback_allocator.cu` example checks device for unified addressing support. | |
- #417: Avoid using `std::less<T>` in binary search algorithms. | |
- #418: Avoid various warnings. | |
- #443: Including version.h no longer configures default systems. | |
- #578: NVCC produces warnings when sequential algorithms are used with CPU systems. | |
## Known Issues | |
- When invoked with primitive data types, thrust::sort, thrust::sort_by_key, | |
thrust::stable_sort, & thrust::stable_sort_by_key may | |
- Sometimes linking fails when compiling with `-rdc=true` with NVCC. | |
- The CUDA implementation of thrust::reduce_by_key incorrectly outputs the last | |
element in a segment of equivalent keys instead of the first. | |
## Acknowledgments | |
- Thanks to Sean Baxter for contributing faster CUDA reduce, merge, and scan | |
implementations. | |
- Thanks to Duane Merrill for contributing a faster CUDA radix sort implementation. | |
- Thanks to Filipe Maia for contributing the implementation of thrust::complex. | |
# Thrust 1.7.2 (CUDA Toolkit 6.5) | |
## Summary | |
Thrust 1.7.2 is a minor bug fix release. | |
## Bug Fixes | |
- Avoid use of `std::min` in generic find implementation. | |
# Thrust 1.7.1 (CUDA Toolkit 6.0) | |
## Summary | |
Thrust 1.7.1 is a minor bug fix release. | |
## Bug Fixes | |
- Eliminate identifiers in `set_operations.cu` example with leading underscore. | |
- Eliminate unused variable warning in CUDA `reduce_by_key` implementation. | |
- Avoid deriving function objects from `std::unary_function` and | |
`std::binary_function`. | |
# Thrust 1.7.0 (CUDA Toolkit 5.5) | |
## Summary | |
Thrust 1.7.0 introduces a new interface for controlling algorithm execution as | |
well as several new algorithms and performance improvements. | |
With this new interface, users may directly control how algorithms execute as | |
well as details such as the allocation of temporary storage. | |
Key/value versions of thrust::merge and the set operation algorithms have been | |
added, as well stencil versions of partitioning algorithms. | |
thrust::tabulate has been introduced to tabulate the values of functions taking | |
integers. | |
For 32b types, new CUDA merge and set operations provide 2-15x faster | |
performance while a new CUDA comparison sort provides 1.3-4x faster | |
performance. | |
Finally, a new TBB reduce_by_key implementation provides 80% faster | |
performance. | |
## Breaking Changes | |
- Dispatch: | |
- Custom user backend systems' tag types must now inherit from the | |
corresponding system's execution_policy template (e.g. | |
thrust::cuda::execution_policy) instead of the tag struct (e.g. | |
thrust::cuda::tag). Otherwise, algorithm specializations will silently go | |
unfound during dispatch. See examples/minimal_custom_backend.cu and | |
examples/cuda/fallback_allocator.cu for usage examples. | |
- thrust::advance and thrust::distance are no longer dispatched based on | |
iterator system type and thus may no longer be customized. | |
- Iterators: | |
- iterator_facade and iterator_adaptor's Pointer template parameters have | |
been eliminated. | |
- iterator_adaptor has been moved into the thrust namespace (previously | |
thrust::experimental::iterator_adaptor). | |
- iterator_facade has been moved into the thrust namespace (previously | |
thrust::experimental::iterator_facade). | |
- iterator_core_access has been moved into the thrust namespace (previously | |
thrust::experimental::iterator_core_access). | |
- All iterators' nested pointer typedef (the type of the result of | |
operator->) is now void instead of a pointer type to indicate that such | |
expressions are currently impossible. | |
- Floating point counting_iterators' nested difference_type typedef is now a | |
signed integral type instead of a floating point type. | |
- Other: | |
- normal_distribution has been moved into the thrust::random namespace | |
(previously thrust::random::experimental::normal_distribution). | |
- Placeholder expressions may no longer include the comma operator. | |
## New Features | |
- Execution Policies: | |
- Users may directly control the dispatch of algorithm invocations with | |
optional execution policy arguments. | |
For example, instead of wrapping raw pointers allocated by cudaMalloc with | |
thrust::device_ptr, the thrust::device execution_policy may be passed as | |
an argument to an algorithm invocation to enable CUDA execution. | |
- The following execution policies are supported in this version: | |
- `thrust::host` | |
- `thrust::device` | |
- `thrust::cpp::par` | |
- `thrust::cuda::par` | |
- `thrust::omp::par` | |
- `thrust::tbb::par` | |
- Algorithms: | |
- `thrust::merge_by_key` | |
- `thrust::partition` with stencil | |
- `thrust::partition_copy` with stencil | |
- `thrust::set_difference_by_key` | |
- `thrust::set_intersection_by_key` | |
- `thrust::set_symmetric_difference_by_key` | |
- `thrust::set_union_by_key` | |
- `thrust::stable_partition with stencil` | |
- `thrust::stable_partition_copy with stencil` | |
- `thrust::tabulate` | |
- Memory Allocation: | |
- `thrust::malloc` | |
- `thrust::free` | |
- `thrust::get_temporary_buffer` | |
- `thrust::return_temporary_buffer` | |
## New Examples | |
- uninitialized_vector demonstrates how to use a custom allocator to avoid the | |
automatic initialization of elements in thrust::device_vector. | |
## Other Enhancements | |
- Authors of custom backend systems may manipulate arbitrary state during | |
algorithm dispatch by incorporating it into their execution_policy parameter. | |
- Users may control the allocation of temporary storage during algorithm | |
execution by passing standard allocators as parameters via execution policies | |
such as thrust::device. | |
- THRUST_DEVICE_SYSTEM_CPP has been added as a compile-time target for the | |
device backend. | |
- CUDA merge performance is 2-15x faster. | |
- CUDA comparison sort performance is 1.3-4x faster. | |
- CUDA set operation performance is 1.5-15x faster. | |
- TBB reduce_by_key performance is 80% faster. | |
- Several algorithms have been parallelized with TBB. | |
- Support for user allocators in vectors has been improved. | |
- The sparse_vector example is now implemented with merge_by_key instead of | |
sort_by_key. | |
- Warnings have been eliminated in various contexts. | |
- Warnings about __host__ or __device__-only functions called from __host__ | |
__device__ functions have been eliminated in various contexts. | |
- Documentation about algorithm requirements have been improved. | |
- Simplified the minimal_custom_backend example. | |
- Simplified the cuda/custom_temporary_allocation example. | |
- Simplified the cuda/fallback_allocator example. | |
## Bug Fixes | |
- #248: Fix broken `thrust::counting_iterator<float>` behavior with OpenMP. | |
- #231, #209: Fix set operation failures with CUDA. | |
- #187: Fix incorrect occupancy calculation with CUDA. | |
- #153: Fix broken multi GPU behavior with CUDA. | |
- #142: Eliminate warning produced by `thrust::random::taus88` and MSVC 2010. | |
- #208: Correctly initialize elements in temporary storage when necessary. | |
- #16: Fix compilation error when sorting bool with CUDA. | |
- #10: Fix ambiguous overloads of `thrust::reinterpret_tag`. | |
## Known Issues | |
- GCC 4.3 and lower may fail to dispatch thrust::get_temporary_buffer correctly | |
causing infinite recursion in examples such as | |
cuda/custom_temporary_allocation. | |
## Acknowledgments | |
- Thanks to Sean Baxter, Bryan Catanzaro, and Manjunath Kudlur for contributing | |
a faster merge implementation for CUDA. | |
- Thanks to Sean Baxter for contributing a faster set operation implementation | |
for CUDA. | |
- Thanks to Cliff Woolley for contributing a correct occupancy calculation | |
algorithm. | |
# Thrust 1.6.0 | |
## Summary | |
Thrust 1.6.0 provides an interface for customization and extension and a new | |
backend system based on the Threading Building Blocks library. | |
With this new interface, programmers may customize the behavior of specific | |
algorithms as well as control the allocation of temporary storage or invent | |
entirely new backends. | |
These enhancements also allow multiple different backend systems | |
such as CUDA and OpenMP to coexist within a single program. | |
Support for TBB allows Thrust programs to integrate more naturally into | |
applications which may already employ the TBB task scheduler. | |
## Breaking Changes | |
- The header <thrust/experimental/cuda/pinned_allocator.h> has been moved to | |
<thrust/system/cuda/experimental/pinned_allocator.h> | |
- thrust::experimental::cuda::pinned_allocator has been moved to | |
thrust::cuda::experimental::pinned_allocator | |
- The macro THRUST_DEVICE_BACKEND has been renamed THRUST_DEVICE_SYSTEM | |
- The macro THRUST_DEVICE_BACKEND_CUDA has been renamed THRUST_DEVICE_SYSTEM_CUDA | |
- The macro THRUST_DEVICE_BACKEND_OMP has been renamed THRUST_DEVICE_SYSTEM_OMP | |
- thrust::host_space_tag has been renamed thrust::host_system_tag | |
- thrust::device_space_tag has been renamed thrust::device_system_tag | |
- thrust::any_space_tag has been renamed thrust::any_system_tag | |
- thrust::iterator_space has been renamed thrust::iterator_system | |
## New Features | |
- Backend Systems | |
- Threading Building Blocks (TBB) is now supported | |
- Algorithms | |
- `thrust::for_each_n` | |
- `thrust::raw_reference_cast` | |
- Types | |
- `thrust::pointer` | |
- `thrust::reference` | |
## New Examples | |
- `cuda/custom_temporary_allocation` | |
- `cuda/fallback_allocator` | |
- `device_ptr` | |
- `expand` | |
- `minimal_custom_backend` | |
- `raw_reference_cast` | |
- `set_operations` | |
## Other Enhancements | |
- thrust::for_each now returns the end of the input range similar to most other algorithms | |
- thrust::pair and thrust::tuple have swap functionality | |
- All CUDA algorithms now support large data types | |
- Iterators may be dereferenced in user __device__ or __global__ functions | |
- The safe use of different backend systems is now possible within a single binary | |
## Bug Fixes | |
- #469 `min_element` and `max_element` algorithms no longer require a const comparison operator | |
## Known Issues | |
- NVCC may crash when parsing TBB headers on Windows. | |
# Thrust 1.5.3 (CUDA Toolkit 5.0) | |
## Summary | |
Thrust 1.5.3 is a minor bug fix release. | |
## Bug Fixes | |
- Avoid warnings about potential race due to `__shared__` non-POD variable | |
# Thrust 1.5.2 (CUDA Toolkit 4.2) | |
## Summary | |
Thrust 1.5.2 is a minor bug fix release. | |
## Bug Fixes | |
- Fixed warning about C-style initialization of structures | |
# Thrust 1.5.1 (CUDA Toolkit 4.1) | |
## Summary | |
Thrust 1.5.1 is a minor bug fix release. | |
## Bug Fixes | |
- Sorting data referenced by permutation_iterators on CUDA produces invalid results | |
# Thrust 1.5.0 | |
## Summary | |
Thrust 1.5.0 provides introduces new programmer productivity and performance | |
enhancements. | |
New functionality for creating anonymous "lambda" functions has been added. | |
A faster host sort provides 2-10x faster performance for sorting arithmetic | |
types on (single-threaded) CPUs. | |
A new OpenMP sort provides 2.5x-3.0x speedup over the host sort using a | |
quad-core CPU. | |
When sorting arithmetic types with the OpenMP backend the combined performance | |
improvement is 5.9x for 32-bit integers and ranges from 3.0x (64-bit types) to | |
14.2x (8-bit types). | |
A new CUDA `reduce_by_key` implementation provides 2-3x faster | |
performance. | |
## Breaking Changes | |
- device_ptr<void> no longer unsafely converts to device_ptr<T> without an | |
explicit cast. | |
Use the expression device_pointer_cast(static_cast<int*>(void_ptr.get())) to | |
convert, for example, device_ptr<void> to device_ptr<int>. | |
## New Features | |
- Algorithms: | |
- Stencil-less `thrust::transform_if`. | |
- Lambda placeholders | |
## New Examples | |
- lambda | |
## Other Enhancements | |
- Host sort is 2-10x faster for arithmetic types | |
- OMP sort provides speedup over host sort | |
- `reduce_by_key` is 2-3x faster | |
- `reduce_by_key` no longer requires O(N) temporary storage | |
- CUDA scan algorithms are 10-40% faster | |
- `host_vector` and `device_vector` are now documented | |
- out-of-memory exceptions now provide detailed information from CUDART | |
- improved histogram example | |
- `device_reference` now has a specialized swap | |
- `reduce_by_key` and scan algorithms are compatible with `discard_iterator` | |
## Bug Fixes | |
- #44: Allow `thrust::host_vector` to compile when `value_type` uses | |
`__align__`. | |
- #198: Allow `thrust::adjacent_difference` to permit safe in-situ operation. | |
- #303: Make thrust thread-safe. | |
- #313: Avoid race conditions in `thrust::device_vector::insert`. | |
- #314: Avoid unintended ADL invocation when dispatching copy. | |
- #365: Fix merge and set operation failures. | |
## Known Issues | |
- None | |
## Acknowledgments | |
- Thanks to Manjunath Kudlur for contributing his Carbon library, from which | |
the lambda functionality is derived. | |
- Thanks to Jean-Francois Bastien for suggesting a fix for #303. | |
# Thrust 1.4.0 (CUDA Toolkit 4.0) | |
## Summary | |
Thrust 1.4.0 is the first release of Thrust to be included in the CUDA Toolkit. | |
Additionally, it brings many feature and performance improvements. | |
New set theoretic algorithms operating on sorted sequences have been added. | |
Additionally, a new fancy iterator allows discarding redundant or otherwise | |
unnecessary output from algorithms, conserving memory storage and bandwidth. | |
## Breaking Changes | |
- Eliminations | |
- `thrust/is_sorted.h` | |
- `thrust/utility.h` | |
- `thrust/set_intersection.h` | |
- `thrust/experimental/cuda/ogl_interop_allocator.h` and the functionality | |
therein | |
- `thrust::deprecated::copy_when` | |
- `thrust::deprecated::absolute_value` | |
- `thrust::deprecated::copy_when` | |
- `thrust::deprecated::absolute_value` | |
- `thrust::deprecated::copy_when` | |
- `thrust::deprecated::absolute_value` | |
- `thrust::gather` and `thrust::scatter` from host to device and vice versa | |
are no longer supported. | |
- Operations which modify the elements of a thrust::device_vector are no longer | |
available from source code compiled without nvcc when the device backend | |
is CUDA. | |
Instead, use the idiom from the cpp_interop example. | |
## New Features | |
- Algorithms: | |
- `thrust::copy_n` | |
- `thrust::merge` | |
- `thrust::set_difference` | |
- `thrust::set_symmetric_difference` | |
- `thrust::set_union` | |
- Types | |
- `thrust::discard_iterator` | |
- Device Support: | |
- Compute Capability 2.1 GPUs. | |
## New Examples | |
- run_length_decoding | |
## Other Enhancements | |
- Compilation warnings are substantially reduced in various contexts. | |
- The compilation time of thrust::sort, thrust::stable_sort, | |
thrust::sort_by_key, and thrust::stable_sort_by_key are substantially | |
reduced. | |
- A fast sort implementation is used when sorting primitive types with | |
thrust::greater. | |
- The performance of thrust::set_intersection is improved. | |
- The performance of thrust::fill is improved on SM 1.x devices. | |
- A code example is now provided in each algorithm's documentation. | |
- thrust::reverse now operates in-place | |
## Bug Fixes | |
- #212: `thrust::set_intersection` works correctly for large input sizes. | |
- #275: `thrust::counting_iterator` and `thrust::constant_iterator` work | |
correctly with OpenMP as the backend when compiling with optimization. | |
- #256: `min` and `max` correctly return their first argument as a tie-breaker | |
- #248: `NDEBUG` is interpreted incorrectly | |
## Known Issues | |
- NVCC may generate code containing warnings when compiling some Thrust | |
algorithms. | |
- When compiling with `-arch=sm_1x`, some Thrust algorithms may cause NVCC to | |
issue benign pointer advisories. | |
- When compiling with `-arch=sm_1x` and -G, some Thrust algorithms may fail to | |
execute correctly. | |
- `thrust::inclusive_scan`, `thrust::exclusive_scan`, | |
`thrust::inclusive_scan_by_key`, and `thrust::exclusive_scan_by_key` are | |
currently incompatible with `thrust::discard_iterator`. | |
## Acknowledgments | |
- Thanks to David Tarjan for improving the performance of set_intersection. | |
- Thanks to Duane Merrill for continued help with sort. | |
- Thanks to Nathan Whitehead for help with CUDA Toolkit integration. | |
# Thrust 1.3.0 | |
## Summary | |
Thrust 1.3.0 provides support for CUDA Toolkit 3.2 in addition to many feature | |
and performance enhancements. | |
Performance of the sort and sort_by_key algorithms is improved by as much as 3x | |
in certain situations. | |
The performance of stream compaction algorithms, such as copy_if, is improved | |
by as much as 2x. | |
CUDA errors are now converted to runtime exceptions using the system_error | |
interface. | |
Combined with a debug mode, also new in 1.3, runtime errors can be located with | |
greater precision. | |
Lastly, a few header files have been consolidated or renamed for clarity. | |
See the deprecations section below for additional details. | |
## Breaking Changes | |
- Promotions | |
- thrust::experimental::inclusive_segmented_scan has been renamed | |
thrust::inclusive_scan_by_key and exposes a different interface | |
- thrust::experimental::exclusive_segmented_scan has been renamed | |
thrust::exclusive_scan_by_key and exposes a different interface | |
- thrust::experimental::partition_copy has been renamed | |
thrust::partition_copy and exposes a different interface | |
- thrust::next::gather has been renamed thrust::gather | |
- thrust::next::gather_if has been renamed thrust::gather_if | |
- thrust::unique_copy_by_key has been renamed thrust::unique_by_key_copy | |
- Deprecations | |
- thrust::copy_when has been renamed thrust::deprecated::copy_when | |
- thrust::absolute_value has been renamed thrust::deprecated::absolute_value | |
- The header thrust/set_intersection.h is now deprecated; use | |
thrust/set_operations.h instead | |
- The header thrust/utility.h is now deprecated; use thrust/swap.h instead | |
- The header thrust/swap_ranges.h is now deprecated; use thrust/swap.h instead | |
- Eliminations | |
- thrust::deprecated::gather | |
- thrust::deprecated::gather_if | |
- thrust/experimental/arch.h and the functions therein | |
- thrust/sorting/merge_sort.h | |
- thrust/sorting/radix_sort.h | |
- NVCC 2.3 is no longer supported | |
## New Features | |
- Algorithms: | |
- `thrust::exclusive_scan_by_key` | |
- `thrust::find` | |
- `thrust::find_if` | |
- `thrust::find_if_not` | |
- `thrust::inclusive_scan_by_key` | |
- `thrust::is_partitioned` | |
- `thrust::is_sorted_until` | |
- `thrust::mismatch` | |
- `thrust::partition_point` | |
- `thrust::reverse` | |
- `thrust::reverse_copy` | |
- `thrust::stable_partition_copy` | |
- Types: | |
- `thrust::system_error` and related types. | |
- `thrust::experimental::cuda::ogl_interop_allocator`. | |
- `thrust::bit_and`, `thrust::bit_or`, and `thrust::bit_xor`. | |
- Device Support: | |
- GF104-based GPUs. | |
## New Examples | |
- opengl_interop.cu | |
- repeated_range.cu | |
- simple_moving_average.cu | |
- sparse_vector.cu | |
- strided_range.cu | |
## Other Enhancements | |
- Performance of thrust::sort and thrust::sort_by_key is substantially improved | |
for primitive key types | |
- Performance of thrust::copy_if is substantially improved | |
- Performance of thrust::reduce and related reductions is improved | |
- THRUST_DEBUG mode added | |
- Callers of Thrust functions may detect error conditions by catching | |
thrust::system_error, which derives from std::runtime_error | |
- The number of compiler warnings generated by Thrust has been substantially | |
reduced | |
- Comparison sort now works correctly for input sizes > 32M | |
- min & max usage no longer collides with <windows.h> definitions | |
- Compiling against the OpenMP backend no longer requires nvcc | |
- Performance of device_vector initialized in .cpp files is substantially | |
improved in common cases | |
- Performance of thrust::sort_by_key on the host is substantially improved | |
## Bug Fixes | |
- Debug device code now compiles correctly | |
- thrust::uninitialized_copy and thrust::uninitialized_fill now dispatch | |
constructors on the device rather than the host | |
## Known Issues | |
- #212 set_intersection is known to fail for large input sizes | |
- partition_point is known to fail for 64b types with nvcc 3.2 | |
Acknowledgments | |
- Thanks to Duane Merrill for contributing a fast CUDA radix sort implementation | |
- Thanks to Erich Elsen for contributing an implementation of find_if | |
- Thanks to Andrew Corrigan for contributing changes which allow the OpenMP | |
backend to compile in the absence of nvcc | |
- Thanks to Andrew Corrigan, Cliff Wooley, David Coeurjolly, Janick Martinez | |
Esturo, John Bowers, Maxim Naumov, Michael Garland, and Ryuta Suzuki for | |
bug reports | |
- Thanks to Cliff Woolley for help with testing | |
# Thrust 1.2.1 | |
## Summary | |
Small fixes for compatibility for the CUDA Toolkit 3.1. | |
## Known Issues | |
- `thrust::inclusive_scan` and `thrust::exclusive_scan` may fail with very | |
large types. | |
- MSVC may fail to compile code using both sort and binary search algorithms. | |
- `thrust::uninitialized_fill` and `thrust::uninitialized_copy` dispatch | |
constructors on the host rather than the device. | |
- #109: Some algorithms may exhibit poor performance with the OpenMP backend | |
with large numbers (>= 6) of CPU threads. | |
- `thrust::default_random_engine::discard` is not accelerated with NVCC 2.3 | |
- NVCC 3.1 may fail to compile code using types derived from | |
`thrust::subtract_with_carry_engine`, such as `thrust::ranlux24` and | |
`thrust::ranlux48`. | |
# Thrust 1.2.0 | |
## Summary | |
Thrust 1.2 introduces support for compilation to multicore CPUs and the Ocelot | |
virtual machine, and several new facilities for pseudo-random number | |
generation. | |
New algorithms such as set intersection and segmented reduction have also been | |
added. | |
Lastly, improvements to the robustness of the CUDA backend ensure correctness | |
across a broad set of (uncommon) use cases. | |
## Breaking Changes | |
- `thrust::gather`'s interface was incorrect and has been removed. | |
The old interface is deprecated but will be preserved for Thrust version 1.2 | |
at `thrust::deprecated::gather` and `thrust::deprecated::gather_if`. | |
The new interface is provided at `thrust::next::gather` and | |
`thrust::next::gather_if`. | |
The new interface will be promoted to `thrust::` in Thrust version 1.3. | |
For more details, please refer to [this thread](http://groups.google.com/group/thrust-users/browse_thread/thread/f5f0583cb97b51fd). | |
- The `thrust::sorting` namespace has been deprecated in favor of the top-level | |
sorting functions, such as `thrust::sort` and `thrust::sort_by_key`. | |
- Removed support for `thrust::equal` between host & device sequences. | |
- Removed support for `thrust::scatter` between host & device sequences. | |
## New Features | |
- Algorithms: | |
- `thrust::reduce_by_key` | |
- `thrust::set_intersection` | |
- `thrust::unique_copy` | |
- `thrust::unique_by_key` | |
- `thrust::unique_copy_by_key` | |
- Types | |
- Random Number Generation: | |
- `thrust::discard_block_engine` | |
- `thrust::default_random_engine` | |
- `thrust::linear_congruential_engine` | |
- `thrust::linear_feedback_shift_engine` | |
- `thrust::subtract_with_carry_engine` | |
- `thrust::xor_combine_engine` | |
- `thrust::minstd_rand` | |
- `thrust::minstd_rand0` | |
- `thrust::ranlux24` | |
- `thrust::ranlux48` | |
- `thrust::ranlux24_base` | |
- `thrust::ranlux48_base` | |
- `thrust::taus88` | |
- `thrust::uniform_int_distribution` | |
- `thrust::uniform_real_distribution` | |
- `thrust::normal_distribution` (experimental) | |
- Function Objects: | |
- `thrust::project1st` | |
- `thrust::project2nd` | |
- `thrust::tie` | |
- Fancy Iterators: | |
- `thrust::permutation_iterator` | |
- `thrust::reverse_iterator` | |
- Vector Functions: | |
- `operator!=` | |
- `rbegin` | |
- `crbegin` | |
- `rend` | |
- `crend` | |
- `data` | |
- `shrink_to_fit` | |
- Device Support: | |
- Multicore CPUs via OpenMP. | |
- Fermi-class GPUs. | |
- Ocelot virtual machines. | |
- Support for NVCC 3.0. | |
## New Examples | |
- `cpp_integration` | |
- `histogram` | |
- `mode` | |
- `monte_carlo` | |
- `monte_carlo_disjoint_sequences` | |
- `padded_grid_reduction` | |
- `permutation_iterator` | |
- `row_sum` | |
- `run_length_encoding` | |
- `segmented_scan` | |
- `stream_compaction` | |
- `summary_statistics` | |
- `transform_iterator` | |
- `word_count` | |
## Other Enhancements | |
- Integer sorting performance is improved when max is large but (max - min) is | |
small and when min is negative | |
- Performance of `thrust::inclusive_scan` and `thrust::exclusive_scan` is | |
improved by 20-25% for primitive types. | |
## Bug Fixes | |
- #8 cause a compiler error if the required compiler is not found rather than a | |
mysterious error at link time | |
- #42 device_ptr & device_reference are classes rather than structs, | |
eliminating warnings on certain platforms | |
- #46 gather & scatter handle any space iterators correctly | |
- #51 thrust::experimental::arch functions gracefully handle unrecognized GPUs | |
- #52 avoid collisions with common user macros such as BLOCK_SIZE | |
- #62 provide better documentation for device_reference | |
- #68 allow built-in CUDA vector types to work with device_vector in pure C++ | |
mode | |
- #102 eliminated a race condition in device_vector::erase | |
- various compilation warnings eliminated | |
## Known Issues | |
- inclusive_scan & exclusive_scan may fail with very large types | |
- MSVC may fail to compile code using both sort and binary search algorithms | |
- uninitialized_fill & uninitialized_copy dispatch constructors on the host | |
rather than the device | |
- #109 some algorithms may exhibit poor performance with the OpenMP backend | |
with large numbers (>= 6) of CPU threads | |
- default_random_engine::discard is not accelerated with nvcc 2.3 | |
## Acknowledgments | |
- Thanks to Gregory Diamos for contributing a CUDA implementation of | |
set_intersection | |
- Thanks to Ryuta Suzuki & Gregory Diamos for rigorously testing Thrust's unit | |
tests and examples against Ocelot | |
- Thanks to Tom Bradley for contributing an implementation of normal_distribution | |
- Thanks to Joseph Rhoads for contributing the example summary_statistics | |
# Thrust 1.1.1 | |
## Summary | |
Small fixes for compatibility with CUDA Toolkit 2.3a and Mac OSX Snow Leopard. | |
# Thrust 1.1.0 | |
## Summary | |
Thrust 1.1.0 introduces fancy iterators, binary search functions, and several | |
specialized reduction functions. | |
Experimental support for segmented scans has also been added. | |
## Breaking Changes | |
- `thrust::counting_iterator` has been moved into the `thrust` namespace | |
(previously `thrust::experimental`). | |
## New Features | |
- Algorithms: | |
- `thrust::copy_if` | |
- `thrust::lower_bound` | |
- `thrust::upper_bound` | |
- `thrust::vectorized lower_bound` | |
- `thrust::vectorized upper_bound` | |
- `thrust::equal_range` | |
- `thrust::binary_search` | |
- `thrust::vectorized binary_search` | |
- `thrust::all_of` | |
- `thrust::any_of` | |
- `thrust::none_of` | |
- `thrust::minmax_element` | |
- `thrust::advance` | |
- `thrust::inclusive_segmented_scan` (experimental) | |
- `thrust::exclusive_segmented_scan` (experimental) | |
- Types: | |
- `thrust::pair` | |
- `thrust::tuple` | |
- `thrust::device_malloc_allocator` | |
- Fancy Iterators: | |
- `thrust::constant_iterator` | |
- `thrust::counting_iterator` | |
- `thrust::transform_iterator` | |
- `thrust::zip_iterator` | |
## New Examples | |
- Computing the maximum absolute difference between vectors. | |
- Computing the bounding box of a two-dimensional point set. | |
- Sorting multiple arrays together (lexicographical sorting). | |
- Constructing a summed area table. | |
- Using `thrust::zip_iterator` to mimic an array of structs. | |
- Using `thrust::constant_iterator` to increment array values. | |
## Other Enhancements | |
- Added pinned memory allocator (experimental). | |
- Added more methods to host_vector & device_vector (issue #4). | |
- Added variant of remove_if with a stencil argument (issue #29). | |
- Scan and reduce use cudaFuncGetAttributes to determine grid size. | |
- Exceptions are reported when temporary device arrays cannot be allocated. | |
## Bug Fixes | |
- #5: Make vector work for larger data types | |
- #9: stable_partition_copy doesn't respect OutputIterator concept semantics | |
- #10: scans should return OutputIterator | |
- #16: make algorithms work for larger data types | |
- #27: Dispatch radix_sort even when comp=less<T> is explicitly provided | |
## Known Issues | |
- Using functors with Thrust entry points may not compile on Mac OSX with gcc | |
4.0.1. | |
- `thrust::uninitialized_copy` and `thrust::uninitialized_fill` dispatch | |
constructors on the host rather than the device. | |
- `thrust::inclusive_scan`, `thrust::inclusive_scan_by_key`, | |
`thrust::exclusive_scan`, and `thrust::exclusive_scan_by_key` may fail when | |
used with large types with the CUDA Toolkit 3.1. | |
# Thrust 1.0.0 | |
## Breaking Changes | |
- Rename top level namespace `komrade` to `thrust`. | |
- Move `thrust::partition_copy` & `thrust::stable_partition_copy` into | |
`thrust::experimental` namespace until we can easily provide the standard | |
interface. | |
- Rename `thrust::range` to `thrust::sequence` to avoid collision with | |
Boost.Range. | |
- Rename `thrust::copy_if` to `thrust::copy_when` due to semantic differences | |
with C++0x `std::copy_if`. | |
## New Features | |
- Add C++0x style `cbegin` & `cend` methods to `thrust::host_vector` and | |
`thrust::device_vector`. | |
- Add `thrust::transform_if` function. | |
- Add stencil versions of `thrust::replace_if` & `thrust::replace_copy_if`. | |
- Allow `counting_iterator` to work with `thrust::for_each`. | |
- Allow types with constructors in comparison `thrust::sort` and | |
`thrust::reduce`. | |
## Other Enhancements | |
- `thrust::merge_sort` and `thrust::stable_merge_sort` are now 2x to 5x faster | |
when executed on the parallel device. | |
## Bug Fixes | |
- Komrade 6: Workaround an issue where an incremented iterator causes NVCC to | |
crash. | |
- Komrade 7: Fix an issue where `const_iterator`s could not be passed to | |
`thrust::transform`. | |