A Case Study: An Android OOM Blamed on Memory Fragmentation

The problem

I recently ran into an unusual kind of OOM:

[Image: the OOM error message]

In other words, the Java heap is reported as fragmented: 231 MB is supposedly still available, yet a 1.2 MB allocation failed.

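The naive approach

How would you analyze this?

The most direct idea is to assume that Java memory is being misused, most likely by the code in the OOM stack trace. Then, latching on to the keyword "fragmentation", you might build something like an object pool and call it fixed. That raises three questions: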

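1. How does the app tell the SDK that an object can be returned to the pool? This is not technically hard, but doing it without changing the interface is complicated.
2. When does the app tell the SDK that an object can be recycled?
3. How is the pool managed, and how many objects is it allowed to hold?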

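Thinking it through, there is a lot to consider. Phantom references (PhantomReference) let us notice when an object has been reclaimed, which barely covers parts of questions 1 and 2, but truly handing a released object back to a pool is still hard.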
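
To make the phantom-reference idea concrete, here is a minimal sketch of how an SDK could notice that an object was reclaimed. The `ReleaseTracker` class and its method names are hypothetical and not part of any SDK discussed here. Note that a phantom reference only reports that the object is already gone, which is one reason it cannot hand the object back to a pool.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: reports the labels of tracked objects after the GC reclaims them. */
final class ReleaseTracker {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // The PhantomReference objects themselves must stay strongly reachable,
    // otherwise they are collected too and never enqueued.
    private final Map<Reference<?>, String> labels = new HashMap<>();

    void track(Object target, String label) {
        PhantomReference<Object> ref = new PhantomReference<>(target, queue);
        labels.put(ref, label);
    }

    int pendingCount() {
        return labels.size();
    }

    /** Non-blocking: returns the labels of objects reclaimed since the last call. */
    List<String> drainReleased() {
        List<String> released = new ArrayList<>();
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            String label = labels.remove(ref);
            if (label != null) {
                released.add(label);
            }
        }
        return released;
    }

    public static void main(String[] args) throws InterruptedException {
        ReleaseTracker tracker = new ReleaseTracker();
        byte[] buffer = new byte[1024];
        tracker.track(buffer, "buffer-1");
        buffer = null;     // drop the only strong reference
        System.gc();       // only a hint; collection is not guaranteed
        Thread.sleep(100);
        System.out.println("released: " + tracker.drainReleased());
    }
}
```

Whether `buffer-1` shows up in the output depends on the GC actually running, so the result is not deterministic.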

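At this point the approach looks like a dead end, and it is. The most important question also remains open: with 231 MB of memory left, why can the allocation not succeed? Even with fragmentation, carving 1.2 MB out of 231 MB should be easy.

So it is time for a proper analysis.

The systematic approach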

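Analyzing a problem is like a diagnosis in traditional Chinese medicine, which follows four steps: look, listen, ask, and feel the pulse (望闻问切). First comes looking: examine the symptoms, in this case the crash scene that Bugly recorded when the OOM occurred.

From that information we can conclude the following: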

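The DirectBuffer allocation failed.
The VM actually still had plenty of free space.
The device is 32-bit. Checking the other devices that hit this OOM shows that all of them are 32-bit.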

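Next comes listening: look at the circumstances of the OOM, such as memory usage, user actions, and whether the app was in the foreground or background. Memory usage matters most here, and it leads to these conclusions: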

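The OOM occurs after the app has been running for a while, and the longer it runs, the more likely it becomes.
The app's memory usage climbs steadily, so there is certainly a memory leak.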

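Next comes asking: we asked the customer whether they could reproduce the problem. They could not reproduce it in testing, so this step produced no useful information. Finally comes feeling the pulse, that is, finding the root cause. Starting from the failed DirectBuffer allocation, the most effective method is to read the source: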

 public static ByteBuffer allocateDirect(int capacity) {
        // Android-changed: Android's DirectByteBuffers carry a MemoryRef.
        // return new DirectByteBuffer(capacity);
        DirectByteBuffer.MemoryRef memoryRef = new DirectByteBuffer.MemoryRef(capacity);
        return new DirectByteBuffer(capacity, memoryRef);
    }

 MemoryRef(int capacity) {
            VMRuntime runtime = VMRuntime.getRuntime();
            buffer = (byte[]) runtime.newNonMovableArray(byte.class, capacity + 7);
            allocatedAddress = runtime.addressOf(buffer);
            // Offset is set to handle the alignment: http://b/16449607
            offset = (int) (((allocatedAddress + 7) & ~(long) 7) - allocatedAddress);
            isAccessible = true;
            isFreed = false;
            originalBufferObject = null;
  }

As you can see, the memory is allocated in the non-moving space.
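
As a side note on the `capacity + 7` and `offset` lines in `MemoryRef`, the alignment arithmetic can be replayed in isolation. This is a small sketch; the class name and the sample address are made up for illustration. The offset rounds the array's data address up to the next multiple of 8, and since the offset is at most 7, over-allocating by 7 bytes guarantees that `capacity` usable bytes remain after the aligned start.

```java
final class DirectBufferAlignment {
    /** Bytes to skip so that (address + offset) is a multiple of 8; same formula as MemoryRef. */
    static int alignmentOffset(long allocatedAddress) {
        return (int) (((allocatedAddress + 7) & ~7L) - allocatedAddress);
    }

    public static void main(String[] args) {
        long address = 0x70001003L; // hypothetical address of the array data
        int offset = alignmentOffset(address);
        System.out.println("offset = " + offset
                + ", aligned = 0x" + Long.toHexString(address + offset));
    }
}
```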

static jobject VMRuntime_newNonMovableArray(JNIEnv* env, jobject, jclass javaElementClass,
                                            jint length) {
  ScopedFastNativeObjectAccess soa(env);
  if (UNLIKELY(length < 0)) {
    ThrowNegativeArraySizeException(length);
    return nullptr;
  }
  ObjPtr<mirror::Class> element_class = soa.Decode<mirror::Class>(javaElementClass);
  if (UNLIKELY(element_class == nullptr)) {
    ThrowNullPointerException("element class == null");
    return nullptr;
  }
  Runtime* runtime = Runtime::Current();
  ObjPtr<mirror::Class> array_class =
      runtime->GetClassLinker()->FindArrayClass(soa.Self(), element_class);
  if (UNLIKELY(array_class == nullptr)) {
    return nullptr;
  }
  gc::AllocatorType allocator = runtime->GetHeap()->GetCurrentNonMovingAllocator();
  ObjPtr<mirror::Array> result = mirror::Array::Alloc(soa.Self(),
                                                      array_class,
                                                      length,
                                                      array_class->GetComponentSizeShift(),
                                                      allocator);
  return soa.AddLocalReference<jobject>(result);
}

The allocator here is kAllocatorTypeNonMoving. Next comes the VM's allocation path:

template <bool kIsInstrumented, bool kFillUsable>
inline ObjPtr<Array> Array::Alloc(Thread* self,
                                  ObjPtr<Class> array_class,
                                  int32_t component_count,
                                  size_t component_size_shift,
                                  gc::AllocatorType allocator_type) {
  DCHECK(allocator_type != gc::kAllocatorTypeLOS);
  DCHECK(array_class != nullptr);
  DCHECK(array_class->IsArrayClass());
  DCHECK_EQ(array_class->GetComponentSizeShift(), component_size_shift);
  DCHECK_EQ(array_class->GetComponentSize(), (1U << component_size_shift));
  size_t size = ComputeArraySize(component_count, component_size_shift);
#ifdef __LP64__
  // 64-bit. No size_t overflow.
  DCHECK_NE(size, 0U);
#else
  // 32-bit.
  if (UNLIKELY(size == 0)) {
    self->ThrowOutOfMemoryError(android::base::StringPrintf("%s of length %d would overflow",
                                                            array_class->PrettyDescriptor().c_str(),
                                                            component_count).c_str());
    return nullptr;
  }
#endif
  gc::Heap* heap = Runtime::Current()->GetHeap();
  ObjPtr<Array> result;
  if (!kFillUsable) {
    SetLengthVisitor visitor(component_count);
    result = ObjPtr<Array>::DownCast(
        heap->AllocObjectWithAllocator<kIsInstrumented>(
            self, array_class, size, allocator_type, visitor));
  } else {
    SetLengthToUsableSizeVisitor visitor(component_count,
                                         DataOffset(1U << component_size_shift).SizeValue(),
                                         component_size_shift);
    result = ObjPtr<Array>::DownCast(
        heap->AllocObjectWithAllocator<kIsInstrumented>(
            self, array_class, size, allocator_type, visitor));
  }
  if (kIsDebugBuild && result != nullptr && Runtime::Current()->IsStarted()) {
    array_class = result->GetClass();  // In case the array class moved.
    CHECK_EQ(array_class->GetComponentSize(), 1U << component_size_shift);
    if (!kFillUsable) {
      CHECK_EQ(result->SizeOf(), size);
    } else {
      CHECK_GE(result->SizeOf(), size);
    }
  }
  return result;
}

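The call eventually reaches AllocObjectWithAllocator in the heap, where the real allocation begins. The flow is: try to allocate; on failure, run a GC and retry; run another GC, allow the heap to grow, and retry; if that still fails and compaction is applicable, compact the heap to deal with fragmentation. For the non-moving allocator no compaction is done, and the exception is thrown directly. This is where the "fragmentation" in the OOM message comes from. Here is the key flow: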

mirror::Object* Heap::AllocateInternalWithGc(Thread* self,
                                             AllocatorType allocator,
                                             bool instrumented,
                                             size_t alloc_size,
                                             size_t* bytes_allocated,
                                             size_t* usable_size,
                                             size_t* bytes_tl_bulk_allocated,
                                             ObjPtr<mirror::Class>* klass) {
 
  bool was_default_allocator = allocator == GetCurrentAllocator();
  // Make sure there is no pending exception since we may need to throw an OOME.
  self->AssertNoPendingException();
  DCHECK(klass != nullptr);

  StackHandleScope<1> hs(self);
  HandleWrapperObjPtr<mirror::Class> h_klass(hs.NewHandleWrapper(klass));

  auto send_object_pre_alloc =
      [&]() REQUIRES_SHARED(Locks::mutator_lock_) REQUIRES(!Roles::uninterruptible_) {
        if (UNLIKELY(instrumented)) {
          AllocationListener* l = alloc_listener_.load(std::memory_order_seq_cst);
          if (UNLIKELY(l != nullptr) && UNLIKELY(l->HasPreAlloc())) {
            l->PreObjectAllocated(self, h_klass, &alloc_size);
          }
        }
      };
#define PERFORM_SUSPENDING_OPERATION(op)                                          \
  [&]() REQUIRES(Roles::uninterruptible_) REQUIRES_SHARED(Locks::mutator_lock_) { \
    ScopedAllowThreadSuspension ats;                                              \
    auto res = (op);                                                              \
    send_object_pre_alloc();                                                      \
    return res;                                                                   \
  }()

  // The allocation failed. If the GC is running, block until it completes, and then retry the
  // allocation.
  collector::GcType last_gc =
      PERFORM_SUSPENDING_OPERATION(WaitForGcToComplete(kGcCauseForAlloc, self));
  // If we were the default allocator but the allocator changed while we were suspended,
  // abort the allocation.
  if ((was_default_allocator && allocator != GetCurrentAllocator()) ||
      (!instrumented && EntrypointsInstrumented())) {
    return nullptr;
  }
  uint32_t starting_gc_num = GetCurrentGcNum();
  if (last_gc != collector::kGcTypeNone) {
    // A GC was in progress and we blocked, retry allocation now that memory has been freed.
    mirror::Object* ptr = TryToAllocate<true, false>(self, allocator, alloc_size, bytes_allocated,
                                                     usable_size, bytes_tl_bulk_allocated);
    if (ptr != nullptr) {
      return ptr;
    }
  }
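  // Checks whether enough memory has been reclaimed. If enough is free, a failed
  // allocation is retried with heap growth allowed. This check is the key to
  // understanding all of the questions above.
  auto have_reclaimed_enough = [&]() {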
    size_t curr_bytes_allocated = GetBytesAllocated();
    double curr_free_heap =
        static_cast<double>(growth_limit_ - curr_bytes_allocated) / growth_limit_;
    return curr_free_heap >= kMinFreeHeapAfterGcForAlloc;
  };
  // We perform one GC as per the next_gc_type_ (chosen in GrowForUtilization),
  // if it's not already tried. If that doesn't succeed then go for the most
  // exhaustive option. Perform a full-heap collection including clearing
  // SoftReferences. In case of ConcurrentCopying, it will also ensure that
  // all regions are evacuated. If allocation doesn't succeed even after that
  // then there is no hope, so we throw OOME.
  collector::GcType tried_type = next_gc_type_;
  if (last_gc < tried_type) {
    const bool gc_ran = PERFORM_SUSPENDING_OPERATION(
        CollectGarbageInternal(tried_type, kGcCauseForAlloc, false, starting_gc_num + 1)
        != collector::kGcTypeNone);

    if ((was_default_allocator && allocator != GetCurrentAllocator()) ||
        (!instrumented && EntrypointsInstrumented())) {
      return nullptr;
    }
    if (gc_ran && have_reclaimed_enough()) {
      mirror::Object* ptr = TryToAllocate<true, false>(self, allocator,
                                                       alloc_size, bytes_allocated,
                                                       usable_size, bytes_tl_bulk_allocated);
      if (ptr != nullptr) {
        return ptr;
      }
    }
  }
  // Most allocations should have succeeded by now, so the heap is really full, really fragmented,
  // or the requested size is really big. Do another GC, collecting SoftReferences this time. The
  // VM spec requires that all SoftReferences have been collected and cleared before throwing
  // OOME.
  VLOG(gc) << "Forcing collection of SoftReferences for " << PrettySize(alloc_size)
           << " allocation";
  // TODO: Run finalization, but this may cause more allocations to occur.
  // We don't need a WaitForGcToComplete here either.
  // TODO: Should check whether another thread already just ran a GC with soft
  // references.
  DCHECK(!gc_plan_.empty());
  pre_oome_gc_count_.fetch_add(1, std::memory_order_relaxed);
  PERFORM_SUSPENDING_OPERATION(
      CollectGarbageInternal(gc_plan_.back(), kGcCauseForAlloc, true, GC_NUM_ANY));
  if ((was_default_allocator && allocator != GetCurrentAllocator()) ||
      (!instrumented && EntrypointsInstrumented())) {
    return nullptr;
  }
  mirror::Object* ptr = nullptr;
  // Growth is certainly allowed here, since more than 200 MB was still free when our
  // OOM happened. So the ptr returned here must have been null; only then do we reach
  // ThrowOutOfMemoryError below.
  if (have_reclaimed_enough()) {
    ptr = TryToAllocate<true, true>(self, allocator, alloc_size, bytes_allocated,
                                    usable_size, bytes_tl_bulk_allocated);
  }

  if (ptr == nullptr) {
    const uint64_t current_time = NanoTime();
    switch (allocator) {
      case kAllocatorTypeRosAlloc:
        // Fall-through.
      case kAllocatorTypeDlMalloc: {
        if (use_homogeneous_space_compaction_for_oom_ &&
            current_time - last_time_homogeneous_space_compaction_by_oom_ >
            min_interval_homogeneous_space_compaction_by_oom_) {
          last_time_homogeneous_space_compaction_by_oom_ = current_time;
          HomogeneousSpaceCompactResult result =
              PERFORM_SUSPENDING_OPERATION(PerformHomogeneousSpaceCompact());
          // Thread suspension could have occurred.
          if ((was_default_allocator && allocator != GetCurrentAllocator()) ||
              (!instrumented && EntrypointsInstrumented())) {
            return nullptr;
          }
          switch (result) {
            case HomogeneousSpaceCompactResult::kSuccess:
              // If the allocation succeeded, we delayed an oom.
              ptr = TryToAllocate<true, true>(self, allocator, alloc_size, bytes_allocated,
                                              usable_size, bytes_tl_bulk_allocated);
              if (ptr != nullptr) {
                count_delayed_oom_++;
              }
              break;
            case HomogeneousSpaceCompactResult::kErrorReject:
              // Reject due to disabled moving GC.
              break;
            case HomogeneousSpaceCompactResult::kErrorVMShuttingDown:
              // Throw OOM by default.
              break;
            default: {
              UNIMPLEMENTED(FATAL) << "homogeneous space compaction result: "
                  << static_cast<size_t>(result);
              UNREACHABLE();
            }
          }
          // Always print that we ran homogeneous space compation since this can cause jank.
          VLOG(heap) << "Ran heap homogeneous space compaction, "
                    << " requested defragmentation "
                    << count_requested_homogeneous_space_compaction_.load()
                    << " performed defragmentation "
                    << count_performed_homogeneous_space_compaction_.load()
                    << " ignored homogeneous space compaction "
                    << count_ignored_homogeneous_space_compaction_.load()
                    << " delayed count = "
                    << count_delayed_oom_.load();
        }
        break;
      }
      default: {
        // Do nothing for others allocators.
      }
    }
  }
#undef PERFORM_SUSPENDING_OPERATION
  // If the allocation hasn't succeeded by this point, throw an OOM error.
  if (ptr == nullptr) {
    ScopedAllowThreadSuspension ats;
    ThrowOutOfMemoryError(self, alloc_size, allocator);
  }
  return ptr;
}
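
The `have_reclaimed_enough` check above can be replayed with the numbers from this crash. This is a minimal sketch under two assumptions: the threshold `kMinFreeHeapAfterGcForAlloc` is taken to be 0.01 (1%), which should be verified against the ART version in use, and the 512 MB growth limit is a made-up value. With 231 MB free the check passes easily, so the final `TryToAllocate` is attempted, and that attempt is the call that must have returned null.

```java
final class ReclaimCheck {
    // Assumption: ART's kMinFreeHeapAfterGcForAlloc is 0.01; verify for your ART version.
    static final double MIN_FREE_HEAP_AFTER_GC_FOR_ALLOC = 0.01;

    /** Same arithmetic as the have_reclaimed_enough lambda in AllocateInternalWithGc. */
    static boolean haveReclaimedEnough(long growthLimit, long bytesAllocated) {
        double freeFraction = (double) (growthLimit - bytesAllocated) / growthLimit;
        return freeFraction >= MIN_FREE_HEAP_AFTER_GC_FOR_ALLOC;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long growthLimit = 512 * mb;                  // hypothetical growth limit
        long bytesAllocated = growthLimit - 231 * mb; // 231 MB free, as in the crash report
        System.out.println(haveReclaimedEnough(growthLimit, bytesAllocated));
    }
}
```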

TryToAllocate dispatches to the Alloc of the matching allocator. We only need to look at the non-moving case:

case kAllocatorTypeNonMoving: {
      ret = non_moving_space_->Alloc(self,
                                     alloc_size,
                                     bytes_allocated,
                                     usable_size,
                                     bytes_tl_bulk_allocated);

The non-moving space is actually a DlMallocSpace. After several layers of calls, the allocation ends up here:

inline mirror::Object* DlMallocSpace::AllocWithoutGrowthLocked(
    Thread* /*self*/, size_t num_bytes,
    size_t* bytes_allocated,
    size_t* usable_size,
    size_t* bytes_tl_bulk_allocated) {
  mirror::Object* result = reinterpret_cast<mirror::Object*>(mspace_malloc(mspace_, num_bytes));
  if (LIKELY(result != nullptr)) {
    if (kDebugSpaces) {
      CHECK(Contains(result)) << "Allocation (" << reinterpret_cast<void*>(result)
            << ") not in bounds of allocation space " << *this;
    }
    size_t allocation_size = AllocationSizeNonvirtual(result, usable_size);
    DCHECK(bytes_allocated != nullptr);
    *bytes_allocated = allocation_size;
    *bytes_tl_bulk_allocated = allocation_size;
  }
  return result;
}

So mspace_malloc is what failed. You can think of mspace_malloc as a malloc on a specific space. Next, let's see how the OOM is reported once it fails:

void Heap::ThrowOutOfMemoryError(Thread* self, size_t byte_count, AllocatorType allocator_type) {
  // If we're in a stack overflow, do not create a new exception. It would require running the
  // constructor, which will of course still be in a stack overflow.
  if (self->IsHandlingStackOverflow()) {
    self->SetException(
        Runtime::Current()->GetPreAllocatedOutOfMemoryErrorWhenHandlingStackOverflow());
    return;
  }

  std::ostringstream oss;
  size_t total_bytes_free = GetFreeMemory();
  // This is the first half of the OOM log we saw. The second half is what matters.
  oss << "Failed to allocate a " << byte_count << " byte allocation with " << total_bytes_free
      << " free bytes and " << PrettySize(GetFreeMemoryUntilOOME()) << " until OOM,"
      << " target footprint " << target_footprint_.load(std::memory_order_relaxed)
      << ", growth limit "
      << growth_limit_;
  // If the allocation failed due to fragmentation, print out the largest continuous allocation.
  // As long as the free space is at least the requested size, the corresponding
  // space is checked for fragmentation.
  if (total_bytes_free >= byte_count) {
    space::AllocSpace* space = nullptr;
    if (allocator_type == kAllocatorTypeNonMoving) {
      space = non_moving_space_;
    } else if (allocator_type == kAllocatorTypeRosAlloc ||
               allocator_type == kAllocatorTypeDlMalloc) {
      space = main_space_;
    } else if (allocator_type == kAllocatorTypeBumpPointer ||
               allocator_type == kAllocatorTypeTLAB) {
      space = bump_pointer_space_;
    } else if (allocator_type == kAllocatorTypeRegion ||
               allocator_type == kAllocatorTypeRegionTLAB) {
      space = region_space_;
    }

    // There is no fragmentation info to log for large-object space.
    if (allocator_type != kAllocatorTypeLOS) {
      CHECK(space != nullptr) << "allocator_type:" << allocator_type
                              << " byte_count:" << byte_count
                              << " total_bytes_free:" << total_bytes_free;
      // LogFragmentationAllocFailure returns true if byte_count is greater than
      // the largest free contiguous chunk in the space. Return value false
      // means that we are throwing OOME because the amount of free heap after
      // GC is less than kMinFreeHeapAfterGcForAlloc in proportion of the heap-size.
      // Log an appropriate message in that case.
      if (!space->LogFragmentationAllocFailure(oss, byte_count)) {
        oss << "; giving up on allocation because <"
            << kMinFreeHeapAfterGcForAlloc * 100
            << "% of heap free after GC.";
      }
    }
  }
  self->ThrowOutOfMemoryError(oss.str().c_str());
}

Our allocator is non-moving, and the non-moving space is a DlMallocSpace, so the logic that runs is:

bool DlMallocSpace::LogFragmentationAllocFailure(std::ostream& os,
                                                 size_t failed_alloc_bytes) {
  Thread* const self = Thread::Current();
  size_t max_contiguous_allocation = 0;
  // To allow the Walk/InspectAll() to exclusively-lock the mutator
  // lock, temporarily release the shared access to the mutator
  // lock here by transitioning to the suspended state.
  Locks::mutator_lock_->AssertSharedHeld(self);
  ScopedThreadSuspension sts(self, ThreadState::kSuspended);
  Walk(MSpaceChunkCallback, &max_contiguous_allocation);
  if (failed_alloc_bytes > max_contiguous_allocation) {
    os << "; failed due to fragmentation (largest possible contiguous allocation "
       <<  max_contiguous_allocation << " bytes)";
    return true;
  }
  return false;
}

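The largest contiguous free chunk must be smaller than the requested size, otherwise the allocation would have succeeded. That is why fragmentation is reported. At this point every open question can be answered: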

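Why the OOM? Because mspace_malloc failed. A malloc failure comes down to having no usable virtual address space left to allocate from, which means native memory has leaked.
Why is it reported as fragmentation? Because the VM believes there is still plenty of free space, while the largest contiguous chunk is smaller than the request, so it concludes that the heap is fragmented.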

Next, an experiment to verify this conclusion:

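#include <jni.h>
#include <cstdlib>
#include <string>

extern "C" JNIEXPORT jstring JNICALL
Java_com_example_memleakdemo_MainActivity_stringFromJNI(
        JNIEnv* env,
        jobject /* this */) {
    // Deliberately leak native memory: request 2000 blocks of 1 MB and never free
    // them, to use up the virtual address space of a 32-bit process.
    for (int i = 0; i < 2000; i++) {
        void* p = malloc(1 * 1024 * 1024);
        (void) p;  // intentionally never freed
    }
    std::string hello = "Hello from C++";
    return env->NewStringUTF(hello.c_str());
}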

private List<ByteBuffer> list = new ArrayList<>();

@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);

    binding = ActivityMainBinding.inflate(getLayoutInflater());
    setContentView(binding.getRoot());

    // Example of a call to a native method
    TextView tv = binding.sampleText;
    new Thread(new Runnable() {
        @Override
        public void run() {
            stringFromJNI();
            allocateBuffer();
        }
    }).start();
}

private void allocateBuffer() {
    int i = 0;
    while (true) {
        ByteBuffer bb = ByteBuffer.allocateDirect(10 * 1024 * 1024);
        list.add(bb);
        i++;
        Log.i("lhr", "allocate " + i * 10 + " MB");
    }
}

Run it on a 32-bit phone. The result:

[Image: screenshot of the run result]

This confirms the analysis.
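
When triaging similar reports, it can help to pull the numbers out of the OOM message automatically. The sketch below parses the message format produced by the ThrowOutOfMemoryError code quoted earlier. The `ArtOomMessage` class is hypothetical, and the sample message in `main` uses made-up numbers rather than the values from the original crash. Other Android versions may word the message differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ArtOomMessage {
    private static final Pattern HEAD = Pattern.compile(
            "Failed to allocate a (\\d+) byte allocation with (\\d+) free bytes");
    private static final Pattern FRAGMENTATION = Pattern.compile(
            "largest possible contiguous allocation (\\d+) bytes");

    final long requestedBytes;
    final long freeBytes;
    final long largestContiguousBytes; // -1 if the fragmentation suffix is absent

    private ArtOomMessage(long requestedBytes, long freeBytes, long largestContiguousBytes) {
        this.requestedBytes = requestedBytes;
        this.freeBytes = freeBytes;
        this.largestContiguousBytes = largestContiguousBytes;
    }

    /** Returns null when the text does not look like an ART allocation-failure message. */
    static ArtOomMessage parse(String message) {
        Matcher head = HEAD.matcher(message);
        if (!head.find()) {
            return null;
        }
        Matcher frag = FRAGMENTATION.matcher(message);
        long largest = frag.find() ? Long.parseLong(frag.group(1)) : -1;
        return new ArtOomMessage(
                Long.parseLong(head.group(1)), Long.parseLong(head.group(2)), largest);
    }

    /** True when the VM blamed fragmentation: plenty free overall, but no chunk big enough. */
    boolean reportsFragmentation() {
        return largestContiguousBytes >= 0 && largestContiguousBytes < requestedBytes;
    }

    public static void main(String[] args) {
        // Made-up sample in the format built by Heap::ThrowOutOfMemoryError.
        String sample = "Failed to allocate a 1258291 byte allocation with 242221056 free bytes"
                + " and 231MB until OOM, target footprint 294649856, growth limit 536870912"
                + "; failed due to fragmentation (largest possible contiguous allocation"
                + " 524288 bytes)";
        ArtOomMessage m = parse(sample);
        System.out.println(m.requestedBytes + " requested, " + m.freeBytes + " free, "
                + "fragmentation reported: " + m.reportsFragmentation());
    }
}
```

A report that shows fragmentation on a 32-bit device with a large free-bytes figure is a hint to look at native memory and address-space usage rather than at the Java heap.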
