Spaces:

ma-xu
/

LIVE

Runtime error

App Files Files Community

LIVE / thrust /dependencies /cub /CHANGELOG.md

Xu Ma

update

1c3c0d9 over 2 years ago

preview code

raw

history blame

29 kB

	# CUB 1.9.10-1 (NVIDIA HPC SDK 20.7, CUDA Toolkit 11.1)

	## Summary

	CUB 1.9.10-1 is the minor release accompanying the NVIDIA HPC SDK 20.7 release
	and the CUDA Toolkit 11.1 release.

	## Bug Fixes

	- #1217: Move static local in `cub::DeviceCount` to a separate host-only
	function because NVC++ doesn't support static locals in host-device
	functions.

	# CUB 1.9.10 (NVIDIA HPC SDK 20.5)

	## Summary

	Thrust 1.9.10 is the release accompanying the NVIDIA HPC SDK 20.5 release.
	It adds CMake `find_package` support.
	C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated.
	Starting with the upcoming 1.10.0 release, C++03 support will be dropped
	entirely.

	## Breaking Changes

	- Thrust now checks that it is compatible with the version of CUB found
	in your include path, generating an error if it is not.
	If you are using your own version of CUB, it may be too old.
	It is recommended to simply delete your own version of CUB and use the
	version of CUB that comes with Thrust.
	- C++03 and C++11 are deprecated.
	Using these dialects will generate a compile-time warning.
	These warnings can be suppressed by defining
	`CUB_IGNORE_DEPRECATED_CPP_DIALECT` (to suppress C++03 and C++11
	deprecation warnings) or `CUB_IGNORE_DEPRECATED_CPP_11` (to suppress C++11
	deprecation warnings).
	Suppression is only a short term solution.
	We will be dropping support for C++03 in the 1.10.0 release and C++11 in the
	near future.
	- GCC < 5, Clang < 6, and MSVC < 2017 are deprecated.
	Using these compilers will generate a compile-time warning.
	These warnings can be suppressed by defining
	`CUB_IGNORE_DEPRECATED_COMPILER`.
	Suppression is only a short term solution.
	We will be dropping support for these compilers in the near future.

	## New Features

	- CMake `find_package` support.
	Just point CMake at the `cmake` folder in your CUB include directory
	(ex: `cmake -DCUB_DIR=/usr/local/cuda/include/cub/cmake/ .`) and then you
	can add CUB to your CMake project with `find_package(CUB REQUIRED CONFIG)`.

	# CUB 1.9.9 (CUDA 11.0)

	## Summary

	CUB 1.9.9 is the release accompanying the CUDA Toolkit 11.0 release.
	It introduces CMake support, version macros, platform detection machinery,
	and support for NVC++, which uses Thrust (and thus CUB) to implement
	GPU-accelerated C++17 Parallel Algorithms.
	Additionally, the scan dispatch layer was refactored and modernized.
	C++03, C++11, GCC < 5, Clang < 6, and MSVC < 2017 are now deprecated.
	Starting with the upcoming 1.10.0 release, C++03 support will be dropped
	entirely.

	## Breaking Changes

	- Thrust now checks that it is compatible with the version of CUB found
	in your include path, generating an error if it is not.
	If you are using your own version of CUB, it may be too old.
	It is recommended to simply delete your own version of CUB and use the
	version of CUB that comes with Thrust.
	- C++03 and C++11 are deprecated.
	Using these dialects will generate a compile-time warning.
	These warnings can be suppressed by defining
	`CUB_IGNORE_DEPRECATED_CPP_DIALECT` (to suppress C++03 and C++11
	deprecation warnings) or `CUB_IGNORE_DEPRECATED_CPP11` (to suppress C++11
	deprecation warnings).
	Suppression is only a short term solution.
	We will be dropping support for C++03 in the 1.10.0 release and C++11 in the
	near future.
	- GCC < 5, Clang < 6, and MSVC < 2017 are deprecated.
	Using these compilers will generate a compile-time warning.
	These warnings can be suppressed by defining
	`CUB_IGNORE_DEPRECATED_COMPILER`.
	Suppression is only a short term solution.
	We will be dropping support for these compilers in the near future.

	## New Features

	- CMake support.
	Thanks to Francis Lemaire for this contribution.
	- Refactorized and modernized scan dispatch layer.
	Thanks to Francis Lemaire for this contribution.
	- Policy hooks for device-wide reduce, scan, and radix sort facilities
	to simplify tuning and allow users to provide custom policies.
	Thanks to Francis Lemaire for this contribution.
	- `<cub/version.cuh>`: `CUB_VERSION`, `CUB_VERSION_MAJOR`, `CUB_VERSION_MINOR`,
	`CUB_VERSION_SUBMINOR`, and `CUB_PATCH_NUMBER`.
	- Platform detection machinery:
	- `<cub/util_cpp_dialect.cuh>`: Detects the C++ standard dialect.
	- `<cub/util_compiler.cuh>`: host and device compiler detection.
	- `<cub/util_deprecated.cuh>`: `CUB_DEPRECATED`.
	- <cub/config.cuh>`: Includes `<cub/util_arch.cuh>`,
	`<cub/util_compiler.cuh>`, `<cub/util_cpp_dialect.cuh>`,
	`<cub/util_deprecated.cuh>`, `<cub/util_macro.cuh>`,
	`<cub/util_namespace.cuh>`
	- `cub::DeviceCount` and `cub::DeviceCountUncached`, caching abstractions for
	`cudaGetDeviceCount`.

	## Other Enhancements

	- Lazily initialize the per-device CUDAattribute caches, because CUDA context
	creation is expensive and adds up with large CUDA binaries on machines with
	many GPUs.
	Thanks to the NVIDIA PyTorch team for bringing this to our attention.
	- Make `cub::SwitchDevice` avoid setting/resetting the device if the current
	device is the same as the target device.

	## Bug Fixes

	- Add explicit failure parameter to CAS in the CUB attribute cache to workaround
	a GCC 4.8 bug.
	- Revert a change in reductions that changed the signedness of the `lane_id`
	variable to suppress a warning, as this introduces a bug in optimized device
	code.
	- Fix initialization in `cub::ExclusiveSum`.
	Thanks to Conor Hoekstra for this contribution.
	- Fix initialization of the `std::array` in the CUB attribute cache.
	- Fix `-Wsign-compare` warnings.
	Thanks to Elias Stehle for this contribution.
	- Fix `test_block_reduce.cu` to build without parameters.
	Thanks to Francis Lemaire for this contribution.
	- Add missing includes to `grid_even_share.cuh`.
	Thanks to Francis Lemaire for this contribution.
	- Add missing includes to `thread_search.cuh`.
	Thanks to Francis Lemaire for this contribution.
	- Add missing includes to `cub.cuh`.
	Thanks to Felix Kallenborn for this contribution.

	# CUB 1.9.8-1 (NVIDIA HPC SDK 20.3)

	## Summary

	CUB 1.9.8-1 is a variant of 1.9.8 accompanying the NVIDIA HPC SDK 20.3 release.
	It contains modifications necessary to serve as the implementation of NVC++'s
	GPU-accelerated C++17 Parallel Algorithms.

	# CUB 1.9.8 (CUDA 11.0 Early Access)

	## Summary

	CUB 1.9.8 is the first release of CUB to be officially supported and included
	in the CUDA Toolkit.
	When compiling CUB in C++11 mode, CUB now caches calls to CUDA attribute query
	APIs, which improves performance of these queries by 20x to 50x when they
	are called concurrently by multiple host threads.

	## Enhancements

	- (C++11 or later) Cache calls to `cudaFuncGetAttributes` and
	`cudaDeviceGetAttribute` within `cub::PtxVersion` and `cub::SmVersion`.
	These CUDA APIs acquire locks to CUDA driver/runtime mutex and perform
	poorly under contention; with the caching, they are 20 to 50x faster when
	called concurrently.
	Thanks to Bilge Acun for bringing this issue to our attention.
	- `DispatchReduce` now takes an `OutputT` template parameter so that users can
	specify the intermediate type explicitly.
	- Radix sort tuning policies updates to fix performance issues for element
	types smaller than 4 bytes.

	## Bug Fixes

	- Change initialization style from copy initialization to direct initialization
	(which is more permissive) in `AgentReduce` to allow a wider range of types
	to be used with it.
	- Fix bad signed/unsigned comparisons in `WarpReduce`.
	- Fix computation of valid lanes in warp-level reduction primitive to correctly
	handle the case where there are 0 input items per warp.

	# CUB 1.8.0

	## Summary

	CUB 1.8.0 introduces changes to the `cub::Shuffle*` interfaces.

	## Breaking Changes

	- The interfaces of `cub::ShuffleIndex`, `cub::ShuffleUp`, and
	`cub::ShuffleDown` have been changed to allow for better computation of the
	PTX SHFL control constant for logical warps smaller than 32 threads.

	## Bug Fixes

	- #112: Fix `cub::WarpScan`'s broadcast of warp-wide aggregate for logical
	warps smaller than 32 threads.

	# CUB 1.7.5

	## Summary

	CUB 1.7.5 adds support for radix sorting `__half` keys and improved sorting
	performance for 1 byte keys.
	It was incorporated into Thrust 1.9.2.

	## Enhancements

	- Radix sort support for `__half` keys.
	- Radix sort tuning policy updates to improve 1 byte key performance.

	## Bug Fixes

	- Syntax tweaks to mollify Clang.
	- #127: `cub::DeviceRunLengthEncode::Encode` returns incorrect results.
	- #128: 7-bit sorting passes fail for SM61 with large values.

	# CUB 1.7.4

	## Summary

	CUB 1.7.4 is a minor release that was incorporated into Thrust 1.9.1-2.

	## Bug Fixes

	- #114: Can't pair non-trivially-constructible values in radix sort.
	- #115: `cub::WarpReduce` segmented reduction is broken in CUDA 9 for logical
	warp sizes smaller than 32.

	# CUB 1.7.3

	## Summary

	CUB 1.7.3 is a minor release.

	## Bug Fixes

	- #110: `cub::DeviceHistogram` null-pointer exception bug for iterator inputs.

	# CUB 1.7.2

	## Summary

	CUB 1.7.2 is a minor release.

	## Bug Fixes

	- #104: Device-wide reduction is now "run-to-run" deterministic for
	pseudo-associative reduction operators (like floating point addition).

	# CUB 1.7.1

	## Summary

	CUB 1.7.1 delivers improved radix sort performance on SM7x (Volta) GPUs and a
	number of bug fixes.

	## Enhancements

	- Radix sort tuning policies updated for SM7x (Volta).

	## Bug Fixes

	- #104: `uint64_t` `cub::WarpReduce` broken for CUB 1.7.0 on CUDA 8 and older.
	- #103: Can't mix Thrust from CUDA 9.0 and CUB.
	- #102: CUB pulls in `windows.h` which defines `min`/`max` macros that conflict
	with `std::min`/`std::max`.
	- #99: Radix sorting crashes NVCC on Windows 10 for SM52.
	- #98: cuda-memcheck: --tool initcheck failed with lineOfSight.
	- #94: Git clone size.
	- #93: Accept iterators for segment offsets.
	- #87: CUB uses anonymous unions which is not valid C++.
	- #44: Check for C++11 is incorrect for Visual Studio 2013.

	# CUB 1.7.0

	## Summary

	CUB 1.7.0 brings support for CUDA 9.0 and SM7x (Volta) GPUs.
	It is compatible with independent thread scheduling.
	It was incorporated into Thrust 1.9.0-5.

	## Breaking Changes

	- Remove `cub::WarpAll` and `cub::WarpAny`.
	These functions served to emulate `__all` and `__any` functionality for
	SM1x devices, which did not have those operations.
	However, SM1x devices are now deprecated in CUDA, and the interfaces of these
	two functions are now lacking the lane-mask needed for collectives to run on
	SM7x and newer GPUs which have independent thread scheduling.

	## Other Enhancements

	- Remove any assumptions of implicit warp synchronization to be compatible with
	SM7x's (Volta) independent thread scheduling.

	## Bug Fixes

	- #86: Incorrect results with reduce-by-key.

	# CUB 1.6.4

	## Summary

	CUB 1.6.4 improves radix sorting performance for SM5x (Maxwell) and SM6x
	(Pascal) GPUs.

	## Enhancements

	- Radix sort tuning policies updated for SM5x (Maxwell) and SM6x (Pascal) -
	3.5B and 3.4B 32 byte keys/s on TitanX and GTX 1080, respectively.

	## Bug Fixes

	- Restore fence work-around for scan (reduce-by-key, etc.) hangs in CUDA 8.5.
	- #65: `cub::DeviceSegmentedRadixSort` should allow inputs to have
	pointer-to-const type.
	- Mollify Clang device-side warnings.
	- Remove out-dated MSVC project files.

	# CUB 1.6.3

	## Summary

	CUB 1.6.3 improves support for Windows, changes
	`cub::BlockLoad`/`cub::BlockStore` interface to take the local data type,
	and enhances radix sort performance for SM6x (Pascal) GPUs.

	## Breaking Changes

	- `cub::BlockLoad` and `cub::BlockStore` are now templated by the local data
	type, instead of the `Iterator` type.
	This allows for output iterators having `void` as their `value_type` (e.g.
	discard iterators).

	## Other Enhancements

	- Radix sort tuning policies updated for SM6x (Pascal) GPUs - 6.2B 4 byte
	keys/s on GP100.
	- Improved support for Windows (warnings, alignment, etc).

	## Bug Fixes

	- #74: `cub::WarpReduce` executes reduction operator for out-of-bounds items.
	- #72: `cub:InequalityWrapper::operator` should be non-const.
	- #71: `cub::KeyValuePair` won't work if `Key` has non-trivial constructor.
	- #69: cub::BlockStore::Store` doesn't compile if `OutputIteratorT::value_type`
	isn't `T`.
	- #68: `cub::TilePrefixCallbackOp::WarpReduce` doesn't permit PTX arch
	specialization.

	# CUB 1.6.2 (previously 1.5.5)

	## Summary

	CUB 1.6.2 (previously 1.5.5) improves radix sort performance for SM6x (Pascal)
	GPUs.

	## Enhancements

	- Radix sort tuning policies updated for SM6x (Pascal) GPUs.

	## Bug Fixes

	- Fix AArch64 compilation of `cub::CachingDeviceAllocator`.

	# CUB 1.6.1 (previously 1.5.4)

	## Summary

	CUB 1.6.1 (previously 1.5.4) is a minor release.

	## Bug Fixes

	- Fix radix sorting bug introduced by scan refactorization.

	# CUB 1.6.0 (previously 1.5.3)

	## Summary

	CUB 1.6.0 changes the scan and reduce interfaces.
	Exclusive scans now accept an "initial value" instead of an "identity value".
	Scans and reductions now support differing input and output sequence types.
	Additionally, many bugs have been fixed.

	## Breaking Changes

	- Device/block/warp-wide exclusive scans have been revised to now accept an
	"initial value" (instead of an "identity value") for seeding the computation
	with an arbitrary prefix.
	- Device-wide reductions and scans can now have input sequence types that are
	different from output sequence types (as long as they are convertible).

	## Other Enhancements

	- Reduce repository size by moving the doxygen binary to doc repository.
	- Minor reduction in `cub::BlockScan` instruction counts.

	## Bug Fixes

	- Issue #55: Warning in `cub/device/dispatch/dispatch_reduce_by_key.cuh`.
	- Issue #59: `cub::DeviceScan::ExclusiveSum` can't prefix sum of float into
	double.
	- Issue #58: Infinite loop in `cub::CachingDeviceAllocator::NearestPowerOf`.
	- Issue #47: `cub::CachingDeviceAllocator` needs to clean up CUDA global error
	state upon successful retry.
	- Issue #46: Very high amount of needed memory from the
	`cub::DeviceHistogram::HistogramEven`.
	- Issue #45: `cub::CachingDeviceAllocator` fails with debug output enabled

	# CUB 1.5.2

	## Summary

	CUB 1.5.2 enhances `cub::CachingDeviceAllocator` and improves scan performance
	for SM5x (Maxwell).

	## Enhancements

	- Improved medium-size scan performance on SM5x (Maxwell).
	- Refactored `cub::CachingDeviceAllocator`:
	- Now spends less time locked.
	- Uses C++11's `std::mutex` when available.
	- Failure to allocate a block from the runtime will retry once after
	freeing cached allocations.
	- Now respects max-bin, fixing an issue where blocks in excess of max-bin
	were still being retained in the free cache.

	## Bug fixes:

	- Fix for generic-type reduce-by-key `cub::WarpScan` for SM3x and newer GPUs.

	# CUB 1.5.1

	## Summary

	CUB 1.5.1 is a minor release.

	## Bug Fixes

	- Fix for incorrect `cub::DeviceRadixSort` output for some small problems on
	SM52 (Mawell) GPUs.
	- Fix for macro redefinition warnings when compiling `thrust::sort`.

	# CUB 1.5.0

	CUB 1.5.0 introduces segmented sort and reduction primitives.

	## New Features:

	- Segmented device-wide operations for device-wide sort and reduction primitives.

	## Bug Fixes:

	- #36: `cub::ThreadLoad` generates compiler errors when loading from
	pointer-to-const.
	- #29: `cub::DeviceRadixSort::SortKeys<bool>` yields compiler errors.
	- #26: Misaligned address after `cub::DeviceRadixSort::SortKeys`.
	- #25: Fix for incorrect results and crashes when radix sorting 0-length
	problems.
	- Fix CUDA 7.5 issues on SM52 GPUs with SHFL-based warp-scan and
	warp-reduction on non-primitive data types (e.g. user-defined structs).
	- Fix small radix sorting problems where 0 temporary bytes were required and
	users code was invoking `malloc(0)` on some systems where that returns
	`NULL`.
	CUB assumed the user was asking for the size again and not running the sort.

	# CUB 1.4.1

	## Summary

	CUB 1.4.1 is a minor release.

	## Enhancements

	- Allow `cub::DeviceRadixSort` and `cub::BlockRadixSort` on bool types.

	## Bug Fixes

	- Fix minor CUDA 7.0 performance regressions in `cub::DeviceScan` and
	`cub::DeviceReduceByKey`.
	- Remove requirement for callers to define the `CUB_CDP` macro
	when invoking CUB device-wide rountines using CUDA dynamic parallelism.
	- Fix headers not being included in the proper order (or missing includes)
	for some block-wide functions.

	# CUB 1.4.0

	## Summary

	CUB 1.4.0 adds `cub::DeviceSpmv`, `cub::DeviceRunLength::NonTrivialRuns`,
	improves `cub::DeviceHistogram`, and introduces support for SM5x (Maxwell)
	GPUs.

	## New Features:

	- `cub::DeviceSpmv` methods for multiplying sparse matrices by
	dense vectors, load-balanced using a merge-based parallel decomposition.
	- `cub::DeviceRadixSort` sorting entry-points that always return
	the sorted output into the specified buffer, as opposed to the
	`cub::DoubleBuffer` in which it could end up in either buffer.
	- `cub::DeviceRunLengthEncode::NonTrivialRuns` for finding the starting
	offsets and lengths of all non-trivial runs (i.e., length > 1) of keys in
	a given sequence.
	Useful for top-down partitioning algorithms like MSD sorting of very-large
	keys.

	## Other Enhancements

	- Support and performance tuning for SM5x (Maxwell) GPUs.
	- Updated cub::DeviceHistogram implementation that provides the same
	"histogram-even" and "histogram-range" functionality as IPP/NPP.
	Provides extremely fast and, perhaps more importantly, very uniform
	performance response across diverse real-world datasets, including
	pathological (homogeneous) sample distributions.

	# CUB 1.3.2

	## Summary

	CUB 1.3.2 is a minor release.

	## Bug Fixes

	- Fix `cub::DeviceReduce` where reductions of small problems (small enough to
	only dispatch a single thread block) would run in the default stream (stream
	zero) regardless of whether an alternate stream was specified.

	# CUB 1.3.1

	## Summary

	CUB 1.3.1 is a minor release.

	## Bug Fixes

	- Workaround for a benign WAW race warning reported by cuda-memcheck
	in `cub::BlockScan` specialized for `BLOCK_SCAN_WARP_SCANS` algorithm.
	- Fix bug in `cub::DeviceRadixSort` where the algorithm may sort more
	key bits than the caller specified (up to the nearest radix digit).
	- Fix for ~3% `cub::DeviceRadixSort` performance regression on SM2x (Fermi) and
	SM3x (Kepler) GPUs.

	# CUB 1.3.0

	## Summary

	CUB 1.3.0 improves how thread blocks are expressed in block- and warp-wide
	primitives and adds an enhanced version of `cub::WarpScan`.

	## Breaking Changes

	- CUB's collective (block-wide, warp-wide) primitives underwent a minor
	interface refactoring:
	- To provide the appropriate support for multidimensional thread blocks,
	The interfaces for collective classes are now template-parameterized by
	X, Y, and Z block dimensions (with `BLOCK_DIM_Y` and `BLOCK_DIM_Z` being
	optional, and `BLOCK_DIM_X` replacing `BLOCK_THREADS`).
	Furthermore, the constructors that accept remapped linear
	thread-identifiers have been removed: all primitives now assume a
	row-major thread-ranking for multidimensional thread blocks.
	- To allow the host program (compiled by the host-pass) to accurately
	determine the device-specific storage requirements for a given collective
	(compiled for each device-pass), the interfaces for collective classes
	are now (optionally) template-parameterized by the desired PTX compute
	capability.
	This is useful when aliasing collective storage to shared memory that has
	been allocated dynamically by the host at the kernel call site.
	- Most CUB programs having typical 1D usage should not require any
	changes to accomodate these updates.

	## New Features

	- Added "combination" `cub::WarpScan` methods for efficiently computing
	both inclusive and exclusive prefix scans (and sums).

	## Bug Fixes

	- Fix for bug in `cub::WarpScan` (which affected `cub::BlockScan` and
	`cub::DeviceScan`) where incorrect results (e.g., NAN) would often be
	returned when parameterized for floating-point types (fp32, fp64).
	- Workaround for ptxas error when compiling with with -G flag on Linux (for
	debug instrumentation).
	- Fixes for certain scan scenarios using custom scan operators where code
	compiled for SM1x is run on newer GPUs of higher compute-capability: the
	compiler could not tell which memory space was being used collective
	operations and was mistakenly using global ops instead of shared ops.

	# CUB 1.2.3

	## Summary

	CUB 1.2.3 is a minor release.

	## Bug Fixes

	- Fixed access violation bug in `cub::DeviceReduce::ReduceByKey` for
	non-primitive value types.
	- Fixed code-snippet bug in `ArgIndexInputIteratorT` documentation.

	# CUB 1.2.2

	## Summary

	CUB 1.2.2 adds a new variant of `cub::BlockReduce` and MSVC project solections
	for examples.

	## New Features

	- MSVC project solutions for device-wide and block-wide examples
	- New algorithmic variant of cub::BlockReduce for improved performance
	when using commutative operators (e.g., numeric addition).

	## Bug Fixes

	- Inclusion of Thrust headers in a certain order prevented CUB device-wide
	primitives from working properly.

	# CUB 1.2.0

	## Summary

	CUB 1.2.0 adds `cub::DeviceReduce::ReduceByKey` and
	`cub::DeviceReduce::RunLengthEncode` and support for CUDA 6.0.

	## New Features

	- `cub::DeviceReduce::ReduceByKey`.
	- `cub::DeviceReduce::RunLengthEncode`.

	## Other Enhancements

	- Improved `cub::DeviceScan`, `cub::DeviceSelect`, `cub::DevicePartition`
	performance.
	- Documentation and testing:
	- Added performance-portability plots for many device-wide primitives.
	- Explain that iterator (in)compatibilities with CUDA 5.0 (and older) and
	Thrust 1.6 (and older).
	- Revised the operation of temporary tile status bookkeeping for
	`cub::DeviceScan` (and similar) to be safe for current code run on future
	platforms (now uses proper fences).

	## Bug Fixes

	- Fix `cub::DeviceScan` bug where Windows alignment disagreements between host
	and device regarding user-defined data types would corrupt tile status.
	- Fix `cub::BlockScan` bug where certain exclusive scans on custom data types
	for the `BLOCK_SCAN_WARP_SCANS` variant would return incorrect results for
	the first thread in the block.
	- Added workaround to make `cub::TexRefInputIteratorT` work with CUDA 6.0.

	# CUB 1.1.1

	## Summary

	CUB 1.1.1 introduces texture and cache modifier iterators, descending sorting,
	`cub::DeviceSelect`, `cub::DevicePartition`, `cub::Shuffle*`, and
	`cub::MaxSMOccupancy`.
	Additionally, scan and sort performance for older GPUs has been improved and
	many bugs have been fixed.

	## Breaking Changes

	- Refactored block-wide I/O (`cub::BlockLoad` and `cub::BlockStore`), removing
	cache-modifiers from their interfaces.
	`cub::CacheModifiedInputIterator` and `cub::CacheModifiedOutputIterator`
	should now be used with `cub::BlockLoad` and `cub::BlockStore` to effect that
	behavior.

	## New Features

	- `cub::TexObjInputIterator`, `cub::TexRefInputIterator`,
	`cub::CacheModifiedInputIterator`, and `cub::CacheModifiedOutputIterator`
	types for loading & storing arbitrary types through the cache hierarchy.
	They are compatible with Thrust.
	- Descending sorting for `cub::DeviceRadixSort` and `cub::BlockRadixSort`.
	- Min, max, arg-min, and arg-max operators for `cub::DeviceReduce`.
	- `cub::DeviceSelect` (select-unique, select-if, and select-flagged).
	- `cub::DevicePartition` (partition-if, partition-flagged).
	- Generic `cub::ShuffleUp`, `cub::ShuffleDown`, and `cub::ShuffleIndex` for
	warp-wide communication of arbitrary data types (SM3x and up).
	- `cub::MaxSmOccupancy` for accurately determining SM occupancy for any given
	kernel function pointer.

	## Other Enhancements

	- Improved `cub::DeviceScan` and `cub::DeviceRadixSort` performance for older
	GPUs (SM1x to SM3x).
	- Renamed device-wide `stream_synchronous` param to `debug_synchronous` to
	avoid confusion about usage.
	- Documentation improvements:
	- Added simple examples of device-wide methods.
	- Improved doxygen documentation and example snippets.
	- Improved test coverege to include up to 21,000 kernel variants and 851,000
	unit tests (per architecture, per platform).

	## Bug Fixes

	- Fix misc `cub::DeviceScan, BlockScan, DeviceReduce, and BlockReduce bugs when
	operating on non-primitive types for older architectures SM1x.
	- SHFL-based scans and reductions produced incorrect results for multi-word
	types (size > 4B) on Linux.
	- For `cub::WarpScan`-based scans, not all threads in the first warp were
	entering the prefix callback functor.
	- `cub::DeviceRadixSort` had a race condition with key-value pairs for pre-SM35
	architectures.
	- `cub::DeviceRadixSor` bitfield-extract behavior with long keys on 64-bit
	Linux was incorrect.
	- `cub::BlockDiscontinuity` failed to compile for types other than
	`int32_t`/`uint32_t`.
	- CUDA Dynamic Parallelism (CDP, e.g. device-callable) versions of device-wide
	methods now report the same temporary storage allocation size requirement as
	their host-callable counterparts.

	# CUB 1.0.2

	## Summary

	CUB 1.0.2 is a minor release.

	## Bug Fixes

	- Corrections to code snippet examples for `cub::BlockLoad`, `cub::BlockStore`,
	and `cub::BlockDiscontinuity`.
	- Cleaned up unnecessary/missing header includes.
	You can now safely include a specific .cuh (instead of `cub.cuh`).
	- Bug/compilation fixes for `cub::BlockHistogram`.

	# CUB 1.0.1

	## Summary

	CUB 1.0.1 adds `cub::DeviceRadixSort` and `cub::DeviceScan`.
	Numerous other performance and correctness fixes and included.

	## Breaking Changes

	- New collective interface idiom (specialize/construct/invoke).

	## New Features

	- `cub::DeviceRadixSort`.
	Implements short-circuiting for homogenous digit passes.
	- `cub::DeviceScan`.
	Implements single-pass "adaptive-lookback" strategy.

	## Other Enhancements

	- Significantly improved documentation (with example code snippets).
	- More extensive regression test suit for aggressively testing collective
	variants.
	- Allow non-trially-constructed types (previously unions had prevented aliasing
	temporary storage of those types).
	- Improved support for SM3x SHFL (collective ops now use SHFL for types larger
	than 32 bits).
	- Better code generation for 64-bit addressing within
	`cub::BlockLoad`/`cub::BlockStore`.
	- `cub::DeviceHistogram` now supports histograms of arbitrary bins.
	- Updates to accommodate CUDA 5.5 dynamic parallelism.

	## Bug Fixes

	- Workarounds for SM10 codegen issues in uncommonly-used
	`cub::WarpScan`/`cub::WarpReduce` specializations.

	# CUB 0.9.4

	## Summary

	CUB 0.9.3 is a minor release.

	## Enhancements

	- Various documentation updates and corrections.

	## Bug Fixes

	- Fixed compilation errors for SM1x.
	- Fixed compilation errors for some WarpScan entrypoints on SM3x and up.

	# CUB 0.9.3

	## Summary

	CUB 0.9.3 adds histogram algorithms and work management utility descriptors.

	## New Features

	- `cub::DevicHistogram256`.
	- `cub::BlockHistogram256`.
	- `cub::BlockScan` algorithm variant `BLOCK_SCAN_RAKING_MEMOIZE`, which
	trades more register consumption for less shared memory I/O.
	- `cub::GridQueue`, `cub::GridEvenShare`, work management utility descriptors.

	## Other Enhancements

	- Updates to `cub::BlockRadixRank` to use `cub::BlockScan`, which improves
	performance on SM3x by using SHFL.
	- Allow types other than builtin types to be used in `cub::WarpScan::*Sum`
	methods if they only have `operator+` overloaded.
	Previously they also required to support assignment from `int(0)`.
	- Update `cub::BlockReduce`'s `BLOCK_REDUCE_WARP_REDUCTIONS` algorithm to work
	even when block size is not an even multiple of warp size.
	- Refactoring of `cub::DeviceAllocator` interface and
	`cub::CachingDeviceAllocator` implementation.

	# CUB 0.9.2

	## Summary

	CUB 0.9.2 adds `cub::WarpReduce`.

	## New Features

	- `cub::WarpReduce`, which uses the SHFL instruction when applicable.
	`cub::BlockReduce` now uses this `cub::WarpReduce` instead of implementing
	its own.

	## Enhancements

	- Documentation updates and corrections.

	## Bug Fixes

	- Fixes for 64-bit Linux compilation warnings and errors.

	# CUB 0.9.1

	## Summary

	CUB 0.9.1 is a minor release.

	## Bug Fixes

	- Fix for ambiguity in `cub::BlockScan::Reduce` between generic reduction and
	summation.
	Summation entrypoints are now called `::Sum()`, similar to the
	convention in `cub::BlockScan`.
	- Small edits to documentation and download tracking.

	# CUB 0.9.0

	## Summary

	Initial preview release.
	CUB is the first durable, high-performance library of cooperative block-level,
	warp-level, and thread-level primitives for CUDA kernel programming.