Comparing changes

base repository: openzfs/zfs, base: master
head repository: hpc/zfs, compare: zia
Can't automatically merge, but a pull request can still be created.
  • 1 commit
  • 57 files changed
  • 1 contributor

Commits on Jan 17, 2025

  1. ZFS Interface for Accelerators (Z.I.A.)

    The ZIO pipeline has been modified to allow external,
    alternative implementations of existing operations to be
    used. The original ZFS functions remain in the code as a
    fallback in case the external implementation fails.
    
    Definitions:
        Accelerator - an entity (usually hardware) that is
                      intended to accelerate operations
        Offloader   - synonym of accelerator; used interchangeably
        Data Processing Unit Services Module (DPUSM)
                    - https://github.com/hpc/dpusm
                    - defines a "provider API" for accelerator
                      vendors to set up
                    - defines a "user API" for accelerator consumers
                      to call
                    - maintains a list of providers and coordinates
                      interactions between providers and consumers.
        Provider    - a DPUSM wrapper for an accelerator's API
        Offload     - moving data from ZFS/memory to the accelerator
        Onload      - the opposite of offload
    
    In order for Z.I.A. to be extensible, it does not directly
    communicate with a fixed accelerator. Rather, Z.I.A. acquires
    a handle to a DPUSM, which is then used to acquire handles
    to providers.
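The two-level handle acquisition described above can be sketched in C as follows. This is a minimal illustration of the pattern only; the type and function names (`dpusm_t`, `provider_t`, `dpusm_handle`, `provider_handle`) are hypothetical and do not reflect the actual DPUSM API.

```c
/*
 * Hypothetical sketch of Z.I.A.'s two-level handle acquisition:
 * first a handle to the DPUSM, then handles to providers through it.
 * Names below are illustrative, not the real DPUSM interface.
 */
#include <stddef.h>
#include <string.h>

typedef struct provider {
	const char *name;	/* registered provider name */
} provider_t;

/* stand-in for the DPUSM's registry of providers */
typedef struct dpusm {
	provider_t *providers;
	size_t count;
} dpusm_t;

/* step 1: acquire a handle to the DPUSM itself */
static dpusm_t *
dpusm_handle(dpusm_t *registry)
{
	/* in-kernel, this would be a lookup of the DPUSM's symbols */
	return (registry);
}

/* step 2: use the DPUSM handle to look up a provider by name */
static provider_t *
provider_handle(dpusm_t *d, const char *name)
{
	for (size_t i = 0; i < d->count; i++) {
		if (strcmp(d->providers[i].name, name) == 0)
			return (&d->providers[i]);
	}
	return (NULL);	/* no such provider registered */
}
```

Because Z.I.A. only ever talks to the DPUSM, swapping accelerators is a matter of registering a different provider and setting the pool property; no ZFS code changes are needed.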
    
    Using ZFS with Z.I.A.:
        1. Build and start the DPUSM
        2. Implement, build, and register a provider with the DPUSM
        3. Reconfigure ZFS with '--with-zia=<DPUSM root>'
        4. Rebuild and start ZFS
        5. Create a zpool
        6. Select the provider
               zpool set zia_provider=<provider name> <zpool>
        7. Select operations to offload
               zpool set zia_<property>=on <zpool>
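The steps above might look like the following shell session. The DPUSM build commands, device names, and the provider name `example-provider` are placeholders; only the `--with-zia` configure flag and the `zpool set zia_*` properties come from this change.

```shell
# 1-2: build/load the DPUSM and a provider (commands illustrative)
git clone https://github.com/hpc/dpusm
make -C dpusm
sudo insmod dpusm/dpusm.ko
# ... build and register a provider module against the DPUSM ...

# 3-4: reconfigure and rebuild ZFS against the DPUSM root
cd zfs
./configure --with-zia=/path/to/dpusm
make -j"$(nproc)"
sudo ./scripts/zfs.sh          # load the rebuilt modules

# 5-7: create a pool, select the provider, enable offloads
sudo zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd
sudo zpool set zia_provider=example-provider tank
sudo zpool set zia_compress=on tank
sudo zpool set zia_checksum=on tank
```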
    
    The operations that have been modified are:
        - compression
            - non-raw-writes only
        - decompression
        - checksum
            - not handling embedded checksums
            - checksum compute and checksum error call the same function
        - raidz
            - generation
            - reconstruction
        - vdev_file
            - open
            - write
            - close
        - vdev_disk
            - open
            - invalidate
            - write
            - flush
            - close
    
    Successful operations do not bring data back into memory after
    they complete, allowing subsequent offloader operations to
    reuse the data. As a result, each ZIO requires only one data
    movement: the initial transfer from ZFS to the accelerator at
    the beginning of the pipeline.
    
    When errors occur and the offloaded data is still accessible,
    the offloaded data is onloaded (or dropped if it still
    matches the in-memory copy) for that ZIO pipeline stage and
    processed with ZFS. This can cause thrashing if a later
    operation offloads the data again, but that should be rare,
    since constant errors (and the resulting data movement) are
    not expected to be the norm.
    
    Unrecoverable errors such as hardware failures will trigger
    pipeline restarts (if necessary) in order to complete the
    original ZIO using the software path.
    
    The modifications to ZFS can be thought of as two sets of changes:
        - The ZIO write pipeline
            - compression, checksum, RAIDZ generation, and write
            - Each stage starts by offloading data that was not
              previously offloaded
                - This allows for ZIOs to be offloaded at any point
                  in the pipeline
        - Resilver
            - vdev_raidz_io_done (RAIDZ reconstruction, checksum, and
              RAIDZ generation), and write
            - Because the core of resilver is vdev_raidz_io_done, data
              is only offloaded once at the beginning of
              vdev_raidz_io_done
                - Errors cause data to be onloaded, but will not
                  re-offload in subsequent steps within resilver
                - Write is a separate ZIO pipeline stage, so it will
                  attempt to offload data
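The per-stage pattern described above (offload if not already offloaded, run on the accelerator, fall back to the software path and onload on error) can be sketched as follows. All names here are illustrative stand-ins, not the actual Z.I.A. functions.

```c
/*
 * Illustrative sketch of a Z.I.A. write-pipeline stage: offload data
 * that is not already resident on the accelerator, and fall back to
 * the ZFS software path when the offload or the operation fails.
 * Names are hypothetical, not the real zia.c interface.
 */
#include <stdbool.h>

typedef enum { ZIA_OK, ZIA_ERROR } zia_rc_t;

typedef struct zio_sketch {
	bool offloaded;		/* stand-in for abd_zia_handle != NULL */
	bool provider_ok;	/* whether the accelerator is healthy */
	int sw_runs;		/* stages that fell back to software */
} zio_sketch_t;

static zia_rc_t
zia_offload(zio_sketch_t *z)
{
	if (!z->provider_ok)
		return (ZIA_ERROR);
	z->offloaded = true;	/* one data movement; stays resident */
	return (ZIA_OK);
}

static void
software_stage(zio_sketch_t *z)
{
	z->sw_runs++;		/* original ZFS implementation runs */
}

/* each write-pipeline stage begins with this pattern */
static void
zia_stage(zio_sketch_t *z)
{
	if (!z->offloaded && zia_offload(z) != ZIA_OK) {
		software_stage(z);	/* fall back to ZFS code */
		return;
	}
	if (!z->provider_ok) {		/* accelerator failed mid-stage */
		z->offloaded = false;	/* onload (or drop) the data */
		software_stage(z);
		return;
	}
	/* accelerator ran the stage; data stays on the offloader */
}
```

Because every stage can initiate the offload, a ZIO can enter the accelerated path at any point in the pipeline, and a healthy provider moves the data exactly once.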
    
    The zio_decompress function has been modified to allow for
    offloading, but the ZIO read pipeline as a whole has not, so it
    is not included in the list above.
    
    An example provider implementation can be found in
    module/zia-software-provider
        - The provider's "hardware" is actually software - data is
          "offloaded" to memory not owned by ZFS
        - Calls ZFS functions in order to not reimplement operations
        - Has kernel module parameters that can be used to trigger
          ZIA_ACCELERATOR_DOWN states for testing pipeline restarts.
    
    abd_t, raidz_row_t, and vdev_t have each been given an additional
    "void *<prefix>_zia_handle" member. These opaque handles point to
    data that is located on an offloader. abds are still allocated,
    but their payloads are expected to diverge from the offloaded copy
    as operations are run.
    
    Encryption and deduplication are disabled for zpools with Z.I.A.
    operations enabled.
    
    Aggregation is disabled for offloaded abds.
    
    RPMs will build with Z.I.A.
    
    Signed-off-by: Jason Lee <[email protected]>
    calccrypto committed Jan 17, 2025
    commit f42b511
Showing with 5,499 additions and 70 deletions.
  1. +2 −0 Makefile.am
  2. +1 −0 config/Rules.am
  3. +8 −1 config/zfs-build.m4
  4. +45 −0 config/zia.m4
  5. +3 −0 include/Makefile.am
  6. +1 −0 include/sys/abd.h
  7. +13 −0 include/sys/fs/zfs.h
  8. +3 −0 include/sys/spa_impl.h
  9. +8 −0 include/sys/vdev_disk.h
  10. +4 −0 include/sys/vdev_file.h
  11. +2 −0 include/sys/vdev_impl.h
  12. +5 −0 include/sys/vdev_raidz.h
  13. +1 −0 include/sys/vdev_raidz_impl.h
  14. +1 −1 include/sys/zap_impl.h
  15. +225 −0 include/sys/zia.h
  16. +51 −0 include/sys/zia_cddl.h
  17. +75 −0 include/sys/zia_private.h
  18. +5 −0 include/sys/zio.h
  19. +8 −0 include/sys/zio_compress.h
  20. +14 −1 lib/libzfs/libzfs.abi
  21. +2 −0 lib/libzpool/Makefile.am
  22. +36 −0 man/man7/zpoolprops.7
  23. +19 −0 module/Kbuild.in
  24. +3 −3 module/Makefile.in
  25. +84 −13 module/os/linux/zfs/vdev_disk.c
  26. +51 −3 module/os/linux/zfs/vdev_file.c
  27. +2 −0 module/os/linux/zfs/zfs_debug.c
  28. +45 −0 module/zcommon/zpool_prop.c
  29. +42 −0 module/zfs/THIRDPARTYLICENSE.zia
  30. +1 −0 module/zfs/THIRDPARTYLICENSE.zia.descrip
  31. +9 −1 module/zfs/abd.c
  32. +3 −0 module/zfs/dmu.c
  33. +5 −2 module/zfs/lz4_zfs.c
  34. +198 −0 module/zfs/spa.c
  35. +4 −0 module/zfs/vdev.c
  36. +1 −0 module/zfs/vdev_draid.c
  37. +333 −24 module/zfs/vdev_raidz.c
  38. +1,754 −0 module/zfs/zia.c
  39. +208 −0 module/zfs/zia_cddl.c
  40. +188 −14 module/zfs/zio.c
  41. +56 −2 module/zfs/zio_checksum.c
  42. +921 −0 module/zia-software-provider/kernel_offloader.c
  43. +152 −0 module/zia-software-provider/kernel_offloader.h
  44. +453 −0 module/zia-software-provider/provider.c
  45. +9 −1 rpm/generic/zfs-kmod.spec.in
  46. +9 −1 rpm/generic/zfs.spec.in
  47. +8 −1 rpm/redhat/zfs-kmod.spec.in
  48. +4 −0 tests/runfiles/linux.run
  49. +2 −0 tests/zfs-tests/include/commands.cfg
  50. +9 −2 tests/zfs-tests/tests/Makefile.am
  51. +34 −0 tests/zfs-tests/tests/functional/zia/cleanup.ksh
  52. +40 −0 tests/zfs-tests/tests/functional/zia/setup.ksh
  53. +37 −0 tests/zfs-tests/tests/functional/zia/zia.cfg
  54. +136 −0 tests/zfs-tests/tests/functional/zia/zia.kshlib
  55. +54 −0 tests/zfs-tests/tests/functional/zia/zia_props.ksh
  56. +65 −0 tests/zfs-tests/tests/functional/zia/zia_raidz_resilver.ksh
  57. +47 −0 tests/zfs-tests/tests/functional/zia/zia_write_pipeline.ksh
2 changes: 2 additions & 0 deletions Makefile.am
@@ -57,6 +57,8 @@ dist_noinst_DATA += module/os/linux/spl/THIRDPARTYLICENSE.gplv2
dist_noinst_DATA += module/os/linux/spl/THIRDPARTYLICENSE.gplv2.descrip
dist_noinst_DATA += module/zfs/THIRDPARTYLICENSE.cityhash
dist_noinst_DATA += module/zfs/THIRDPARTYLICENSE.cityhash.descrip
dist_noinst_DATA += module/zfs/THIRDPARTYLICENSE.zia
dist_noinst_DATA += module/zfs/THIRDPARTYLICENSE.zia.descrip

@CODE_COVERAGE_RULES@

1 change: 1 addition & 0 deletions config/Rules.am
@@ -44,6 +44,7 @@ AM_CPPFLAGS += -DPKGDATADIR=\"$(pkgdatadir)\"
AM_CPPFLAGS += $(DEBUG_CPPFLAGS)
AM_CPPFLAGS += $(CODE_COVERAGE_CPPFLAGS)
AM_CPPFLAGS += -DTEXT_DOMAIN=\"zfs-@ac_system_l@-user\"
AM_CPPFLAGS += $(ZIA_CPPFLAGS)

if ASAN_ENABLED
AM_CPPFLAGS += -DZFS_ASAN_ENABLED
9 changes: 8 additions & 1 deletion config/zfs-build.m4
@@ -263,6 +263,8 @@ AC_DEFUN([ZFS_AC_CONFIG], [
AC_SUBST(TEST_JOBS)
])
ZFS_AC_ZIA
ZFS_INIT_SYSV=
ZFS_INIT_SYSTEMD=
ZFS_WANT_MODULES_LOAD_D=
@@ -294,7 +296,8 @@ AC_DEFUN([ZFS_AC_CONFIG], [
[test "x$qatsrc" != x ])
AM_CONDITIONAL([WANT_DEVNAME2DEVID], [test "x$user_libudev" = xyes ])
AM_CONDITIONAL([WANT_MMAP_LIBAIO], [test "x$user_libaio" = xyes ])
AM_CONDITIONAL([PAM_ZFS_ENABLED], [test "x$enable_pam" = xyes])
AM_CONDITIONAL([PAM_ZFS_ENABLED], [test "x$enable_pam" = xyes ])
AM_CONDITIONAL([ZIA_ENABLED], [test "x$enable_zia" = xyes ])
])

dnl #
@@ -342,6 +345,10 @@ AC_DEFUN([ZFS_AC_RPM], [
RPM_DEFINE_COMMON=${RPM_DEFINE_COMMON}' --define "__strip /bin/true"'
])
AS_IF([test "x$enable_zia" = xyes], [
RPM_DEFINE_COMMON=${RPM_DEFINE_COMMON}' --define "$(WITH_ZIA) 1" --define "DPUSM_ROOT $(DPUSM_ROOT)"'
])
RPM_DEFINE_UTIL=' --define "_initconfdir $(initconfdir)"'
dnl # Make the next three RPM_DEFINE_UTIL additions conditional, since
45 changes: 45 additions & 0 deletions config/zia.m4
@@ -0,0 +1,45 @@
dnl # Adds --with-zia=PATH to configuration options
dnl # The path provided should point to the DPUSM
dnl # root and contain Module.symvers.
AC_DEFUN([ZFS_AC_ZIA], [
AC_ARG_WITH([zia],
AS_HELP_STRING([--with-zia=PATH],
[Path to Data Processing Services Module]),
[
DPUSM_ROOT="$withval"
AS_IF([test "x$DPUSM_ROOT" != "xno"],
[enable_zia=yes],
[enable_zia=no])
],
[enable_zia=no]
)
AS_IF([test "x$enable_zia" == "xyes"],
AS_IF([! test -d "$DPUSM_ROOT"],
[AC_MSG_ERROR([--with-zia=PATH requires the DPUSM root directory])]
)
DPUSM_SYMBOLS="$DPUSM_ROOT/Module.symvers"
AS_IF([test -r $DPUSM_SYMBOLS],
[
AC_MSG_RESULT([$DPUSM_SYMBOLS])
ZIA_CPPFLAGS="-DZIA=1 -I$DPUSM_ROOT/include"
KERNEL_ZIA_CPPFLAGS="-DZIA=1 -I$DPUSM_ROOT/include"
WITH_ZIA="_with_zia"
AC_SUBST(WITH_ZIA)
AC_SUBST(KERNEL_ZIA_CPPFLAGS)
AC_SUBST(ZIA_CPPFLAGS)
AC_SUBST(DPUSM_SYMBOLS)
AC_SUBST(DPUSM_ROOT)
],
[
AC_MSG_ERROR([
*** Failed to find Module.symvers in:
$DPUSM_SYMBOLS
])
]
)
)
])
3 changes: 3 additions & 0 deletions include/Makefile.am
@@ -143,6 +143,9 @@ COMMON_H = \
sys/zfs_vfsops.h \
sys/zfs_vnops.h \
sys/zfs_znode.h \
sys/zia.h \
sys/zia_cddl.h \
sys/zia_private.h \
sys/zil.h \
sys/zil_impl.h \
sys/zio.h \
1 change: 1 addition & 0 deletions include/sys/abd.h
@@ -65,6 +65,7 @@ typedef struct abd {
list_t abd_gang_chain;
} abd_gang;
} abd_u;
void *abd_zia_handle;
} abd_t;

typedef int abd_iter_func_t(void *buf, size_t len, void *priv);
13 changes: 13 additions & 0 deletions include/sys/fs/zfs.h
@@ -266,6 +266,19 @@ typedef enum {
ZPOOL_PROP_DEDUP_TABLE_QUOTA,
ZPOOL_PROP_DEDUPCACHED,
ZPOOL_PROP_LAST_SCRUBBED_TXG,
ZPOOL_PROP_ZIA_AVAILABLE,
ZPOOL_PROP_ZIA_PROVIDER,
ZPOOL_PROP_ZIA_COMPRESS,
ZPOOL_PROP_ZIA_DECOMPRESS,
ZPOOL_PROP_ZIA_CHECKSUM,
ZPOOL_PROP_ZIA_RAIDZ1_GEN,
ZPOOL_PROP_ZIA_RAIDZ2_GEN,
ZPOOL_PROP_ZIA_RAIDZ3_GEN,
ZPOOL_PROP_ZIA_RAIDZ1_REC,
ZPOOL_PROP_ZIA_RAIDZ2_REC,
ZPOOL_PROP_ZIA_RAIDZ3_REC,
ZPOOL_PROP_ZIA_FILE_WRITE,
ZPOOL_PROP_ZIA_DISK_WRITE,
ZPOOL_NUM_PROPS
} zpool_prop_t;

3 changes: 3 additions & 0 deletions include/sys/spa_impl.h
@@ -52,6 +52,7 @@
#include <sys/zfeature.h>
#include <sys/zthr.h>
#include <sys/dsl_deadlist.h>
#include <sys/zia.h>
#include <zfeature_common.h>

#ifdef __cplusplus
@@ -484,6 +485,8 @@ struct spa {
*/
spa_config_lock_t spa_config_lock[SCL_LOCKS]; /* config changes */
zfs_refcount_t spa_refcount; /* number of opens */

zia_props_t spa_zia_props;
};

extern char *spa_config_path;
8 changes: 8 additions & 0 deletions include/sys/vdev_disk.h
@@ -42,5 +42,13 @@

#ifdef _KERNEL
#include <sys/vdev.h>

#ifdef __linux__
int __vdev_classic_physio(struct block_device *bdev, zio_t *zio,
size_t io_size, uint64_t io_offset, int rw, int flags);
int vdev_disk_io_flush(struct block_device *bdev, zio_t *zio);
void vdev_disk_error(zio_t *zio);
#endif /* __linux__ */

#endif /* _KERNEL */
#endif /* _SYS_VDEV_DISK_H */
4 changes: 4 additions & 0 deletions include/sys/vdev_file.h
@@ -40,6 +40,10 @@ typedef struct vdev_file {
extern void vdev_file_init(void);
extern void vdev_file_fini(void);

#ifdef __linux__
extern mode_t vdev_file_open_mode(spa_mode_t spa_mode);
#endif

#ifdef __cplusplus
}
#endif
2 changes: 2 additions & 0 deletions include/sys/vdev_impl.h
@@ -467,6 +467,8 @@ struct vdev {
uint64_t vdev_io_t;
uint64_t vdev_slow_io_n;
uint64_t vdev_slow_io_t;

void *vdev_zia_handle;
};

#define VDEV_PAD_SIZE (8 << 10)
5 changes: 5 additions & 0 deletions include/sys/vdev_raidz.h
@@ -169,6 +169,11 @@ extern int vdev_raidz_load(vdev_t *);
#define RAIDZ_EXPAND_PAUSE_SCRATCH_POST_REFLOW_1 6
#define RAIDZ_EXPAND_PAUSE_SCRATCH_POST_REFLOW_2 7

void vdev_raidz_generate_parity_p(struct raidz_row *);
void vdev_raidz_generate_parity_pq(struct raidz_row *);
void vdev_raidz_generate_parity_pqr(struct raidz_row *);
void vdev_raidz_reconstruct_general(struct raidz_row *, int *, int);

#ifdef __cplusplus
}
#endif
1 change: 1 addition & 0 deletions include/sys/vdev_raidz_impl.h
@@ -136,6 +136,7 @@ typedef struct raidz_row {
uint64_t rr_offset; /* Logical offset for *_io_verify() */
uint64_t rr_size; /* Physical size for *_io_verify() */
#endif
void *rr_zia_handle;
raidz_col_t rr_col[]; /* Flexible array of I/O columns */
} raidz_row_t;

2 changes: 1 addition & 1 deletion include/sys/zap_impl.h
@@ -61,7 +61,7 @@ typedef struct mzap_phys {
uint64_t mz_salt;
uint64_t mz_normflags;
uint64_t mz_pad[5];
mzap_ent_phys_t mz_chunk[1];
mzap_ent_phys_t mz_chunk[];
/* actually variable size depending on block size */
} mzap_phys_t;
